3 Quality and Comparability of State Tests of English Language Proficiency

The No Child Left Behind (NCLB) Act of 2001 requires states to annually assess the English language proficiency of their students who are classified as limited English proficient (LEP), also referred to as English language learner (ELL) students. The law (Title III) requires states to establish English language proficiency (ELP) content standards and to use a single ELP test to assess students' progress in and mastery of these standards in four domains: reading, writing, speaking, and listening. Results from the annual administration of ELP tests are used to report on students' progress in and attainment of English language proficiency. The tests may also be used to identify ELL students and to determine when they should end ELL status, often in conjunction with other criteria.

In this chapter, we discuss the ELP tests that states use and compare and contrast their features. We examine the technical quality of the tests, not with the intent of doing a full-scale evaluation of each of them, but rather to consider their use in classifying ELL students and measuring students' progress in learning English. We reviewed the tests by examining the information reported in their technical manuals and supplementary reports with regard to how the tests were developed, the skills that they measure, how the test scores are derived and reported, the reliability of those scores, and the validity of the decisions based on the scores. We consider these aspects of the tests in relation to established technical standards for developing tests, such as those published in the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999).
We focus primarily on the extent to which the tests are likely to support valid decisions about students’ English proficiency and the comparability of those decisions across states, given the available data. More detailed information on the tests is in Appendix A.
NCLB REQUIREMENTS FOR ENGLISH LANGUAGE PROFICIENCY TESTS

ELP tests have long been used by the states to classify ELL students by language proficiency level for instructional program placement and decision-making purposes. Many were developed in response to legislation and litigation of the 1970s (e.g., the Lau v. Nichols Supreme Court decision and the Equal Educational Opportunities Act of 1974), a time when very few instruments were available to assess ELP (Bauman et al., 2007). For the most part, these tests reflected the then-predominant structural linguistic approach to assessing ELP (Abedi, 2007; Francis and Rivera, 2007). They were designed to assist local educators with placement and exit decisions for English as a second language and bilingual education programs, and they typically focused on the oral (listening and speaking) domains, measuring discrete phonological skills and basic interpersonal communication skills rather than academic language skills. As a result, students may have scored well on them without having mastered the English language skills needed for learning subject matter in an English-only classroom (Lara et al., 2007).

Before NCLB, there was no attempt to bring uniformity to ELP assessments with regard to what they measured, their technical measurement properties, or how they were used. Moreover, states typically allowed local school districts to choose among a variety of commercial ELP assessments that varied widely in their characteristics, emphases, and technical properties.
Reviews of the pre-NCLB ELP tests have revealed that they differed from each other in their theoretical foundations, the type of language assessed, the types of skills assessed (i.e., receptive or expressive skills), the content assessed, the types of assessment tasks, structural characteristics (i.e., administration procedures, grade-level ranges, assessment time required), and technical qualities (e.g., reliability and validity) (Del Vecchio and Guerrero, 1995; Zehler et al., 1994). Many of these tests were not based on an operationally defined concept of ELP, had few questions that measured academic language proficiency, were not based on explicitly articulated ELP content standards, and had psychometric flaws and other shortcomings (Abedi, 2007, 2008; Bauman et al., 2007; Del Vecchio and Guerrero, 1995; Lara et al., 2007; Zehler et al., 1994).

Under Titles I and III of NCLB, the U.S. Department of Education (DoEd) required states to make improvements to ELP assessments, specifically (adapted from Abedi, 2008, p. 5):

- Develop and implement ELP standards suitable for ELL students learning English as a second language.
- Implement a single reliable and valid ELP assessment that is aligned to ELP standards and that annually measures listening, speaking, reading, writing, and comprehension skills.
- Align the ELP test with the state's challenging academic content and student academic achievement standards described in section 1111(b)(1) (PL 107-110; available: http://www2.ed.gov/policy/elsec/leg/esea02/index.html [April 2011]).
- Establish two annual measurable achievement objectives for ELL students that explicitly define, measure, and report on students' expected progress toward and attainment of ELP.

These requirements brought about significant changes in the states' ELP tests, and the tests currently used therefore differ in a number of important ways from the pre-NCLB tests (Abedi, 2007; Bauman et al., 2007; Francis and Rivera, 2007; Lara et al., 2007; Rebarber et al., 2007).

First, the new ELP tests are standards based. This means that the first step in the assessment development process is to identify and adopt a set of ELP content standards. Test specifications are then developed to guide test item development in each of the four major language domains (reading, writing, listening, and speaking), and test items are designed to measure a representative sample of the standards. Although the new ELP tests are not tests of academic content, they are intended to assess the types of language skills required for students to access the core academic content.1

In line with NCLB, the new ELP tests measure both receptive (listening, reading) and expressive (speaking, writing) language proficiency skills, as well as comprehension.2 They also more explicitly link and assess skills related to English as a second language and the academic language skills required to be successful in school (for details on the academic language construct, see Anstrom et al., 2009; Bailey and Heritage, 2008; or Scarcella, 2008).

The new ELP assessments offer different forms of the test for each cluster of grades (e.g., early elementary, later elementary, middle school, high school), which are designed to measure growth in ways that reflect the increasing complexity of given language proficiency levels at different age/grade levels. For example, what constitutes intermediate-level academic oral language skill for a 3rd-grade student may be quite different from that for an 8th-grade student.
Pre-NCLB assessments generally clustered large numbers of grade levels together. A last major difference is that, unlike the pre-NCLB tests, the new tests are designed for high-stakes decision making and are treated as secure assessments. These changes in the tests have been judged to represent a significant departure from prior practices (Bauman et al., 2007; Lara et al., 2007; Mathews, 2007; Rebarber et al., 2007).

STATE ENGLISH LANGUAGE PROFICIENCY TESTS

Development

To develop the tests required by NCLB, the DoEd provided grants under Title VI (Section 6112) of the act. The grants allowed for the development, validation, and implementation of ELP assessments and encouraged states to work together in consortia. In a second round of funding, the DoEd provided additional support for some of the consortia to field test and validate the assessments. Under the grant competition, four different consortia of states were formed, and most of the states initially joined one of these groups.

One consortium was led by the Council of Chief State School Officers with states in the Limited English Proficient State Collaborative on Assessment and Student Standards (LEP-SCASS), which developed the English Language Development Assessment (ELDA).3 Initially, 18 states were members of the LEP-SCASS, and 14 states participated in the process of developing, field testing, validating, and implementing ELDA as an operational assessment (Saez, program director, Council of Chief State School Officers, personal communication, August 4, 2010).4

Another consortium funded by the DoEd initially included three states (Alabama, Delaware, and Wisconsin) and was led by the Wisconsin Department of Public Instruction.5 Shortly after being funded, seven additional states joined the consortium (Alabama, District of Columbia, Illinois, Maine, New Hampshire, Rhode Island, and Vermont). Now known as the World-Class Instructional Design and Assessment (WIDA) Consortium, this effort produced the assessment called Assessing Comprehension and Communication in English State to State for English Language Learners (or more simply, the ACCESS). Both the LEP-SCASS and the WIDA consortia continue to work actively with state constituents in administering and refining the assessments. As of March 2010, the WIDA Consortium included 23 states, and the LEP-SCASS included 7 states. Membership in these two consortia is dynamic, with new states joining on an ongoing basis.

Two other state consortia initially funded by the DoEd are no longer active, although they made considerable progress in developing test items.

1 In other words, the assessment should evaluate the language skills (i.e., vocabulary, structure, grammar) needed to access the content of the core academic content standards.
2 We note that there were proficiency assessments in the 1980s that measured skills in these domains, but they were not standards based.
The Mountain West Assessment Consortium (MWAC) included 11 states and was led by the Utah State Office of Education.6 The MWAC's assessment was not fully operational when the grant expired, and the consortium's test item bank was subsequently made available to the member states. Three states (Idaho, Montana, and Utah) used the item bank and incorporated the consortium's test questions into their state proficiency assessments.

3 This consortium worked in collaboration with the American Institutes for Research and with Measurement Incorporated, with external advice from the Center for the Study of Assessment Validity and Evaluation at the University of Maryland (see Lara et al., 2007).
4 Nevada led the collaboration, with Georgia, Indiana, Iowa, Kentucky, Louisiana, Nebraska, Nevada, New Jersey, Ohio, Oklahoma, South Carolina, Virginia, and West Virginia. See Lara et al. (2007) for a more complete history of this consortium's development efforts.
5 The Wisconsin department worked in collaboration with the Center for Applied Linguistics, the University of Wisconsin system, and the University of Illinois. See Bauman et al. (2007) for a more complete history of this consortium's development efforts.
6 The other states were Alaska, Colorado, Idaho, Michigan, Montana, Nevada, New Mexico, North Dakota, Oregon, Utah, and Wyoming. The Utah State Office of Education collaborated with Measured Progress as the test developer. See Mathews (2007) for a more complete history of this consortium's development efforts.
The other consortium originally funded by the DoEd was English Proficiency for All Students, which included five states and was led by Accountability Works.7 This consortium produced the Comprehensive English Language Learning Assessment (CELLA), which is now used only by Florida.

In addition to the state consortia funded through the DoEd, commercial test publishers also developed ELP assessments that met the requirements of NCLB. For instance, CTB/McGraw Hill, which had previously developed an assessment called the Language Assessment Scales (LAS), created the Language Assessment Scales Links K-12 (LAS Links), and concordance tables were produced so that scores on the LAS could be converted to the score scale used for LAS Links. Harcourt, Inc. (now Pearson) developed the new Stanford English Language Proficiency Test (SELP). Some states decided to use one of these commercially developed tests. Typically, the test publisher worked with the state to customize ("augment") the assessment so that it was better aligned with the state's ELP content standards and met the state's needs. In a similar vein, some states created customized versions of consortia-developed tests. For instance, Ohio created its own test (the OTELA), derived from the test item bank and scales of the ELDA (American Institutes for Research).8 Other states—including some states with the largest ELL enrollments—developed their own unique ELP tests: examples include the California English Language Development Test, the New York State English as a Second Language Achievement Test, Oregon's web-based English Language Proficiency Assessment, and the Texas English Language Proficiency Assessment System.

English Language Proficiency Assessments Used by the States

NCLB initially required states to establish ELP standards and implement an ELP assessment aligned to these standards by the 2002-2003 school year (U.S. Department of Education, 2010b, p. 8).9 This presented a considerable challenge to states, and many participated in one of the four consortia as they worked to develop their standards or assessments and meet the federal deadlines (U.S. Department of Education, 2010b, p. 9). In the end, some adopted the consortium-based assessment, some adopted the consortium's standards, and some adapted consortium standards for their own needs.10

7 The consortium partners were Florida, Maryland, Michigan, Pennsylvania, and Tennessee, working in collaboration with the Educational Testing Service. See Rebarber et al. (2007) for a more complete history of this consortium's development efforts.
8 OTELA is a shortened version of ELDA. It was developed to reduce the administration time required for ELDA and to reduce the emphasis on entry-level skills while maintaining acceptable levels of reliability and validity.
9 On July 1, 2005, the deadline was extended to the spring of the 2005-2006 school year (U.S. Department of Education, 2005a, p. 23 of Title III Policy: State of the States; http://wvconnections.k12.wv.us/documents/TimelineforELPAssessment.doc).
10 For instance, in 2004-2005, 38 state Title III directors indicated that they were participating in one of the four consortia to develop standards or assessments (U.S. Department of Education, 2010b, p. 9).
Table 3-1 shows the test used by each state for the 2009-2010 school year, as reported by Title III officials in each state: 23 of the states use ACCESS (an increase from 15 states in 2005-2006); 7 use ELDA; and 4 use LAS Links, with augmentation as needed to address the state's standards. Two states use augmented versions of the SELP assessment, published by Pearson. The remaining 15 states use a unique test (including California, New York, Oregon, and Texas) or a test derived from a consortium test (e.g., Ohio). Thus, in the 2009-2010 school year, the states used approximately 19 different proficiency assessments. However, a simple count of the number of different tests (based on their names) overstates their differences, given the specificity of the federal requirements and the extent of collaboration among states, consortia, and private developers to meet those requirements.

Tests Selected for Panel Review

An in-depth review of all of the state tests was beyond the scope of time and resources available for our study. We therefore identified a subset of the tests to review. For efficiency, we first identified the tests used by more than one state: the ACCESS, the ELDA, LAS Links, and SELP. We also wanted to include the tests used in states with large numbers of ELL students, so we next rank ordered the states by number of ELL students, identifying the 10 states that reported the highest numbers over the past 5 years (in order by volume):11 California, Texas, Florida, New York, Illinois, Arizona, North Carolina, Colorado, Virginia, and Washington. Together, these 10 states account for approximately 75 percent of the ELL students in the country—roughly 3.4 million students. California, Florida, New York, and Texas—the states with the highest numbers of ELL students—each use their own state-developed tests.
The other six states are either members of the WIDA Consortium, which uses the ACCESS test, or use an augmented version of the SELP or LAS Links. Thus, in this chapter, we review the overall technical characteristics and comparability of eight tests: four used by multiple states (ACCESS, ELDA, LAS Links, and SELP) and four used by a single state (CELDT in California, TELPAS in Texas, CELLA in Florida, and NYSESLAT in New York). Because many other states use one of the tests in the first group, our review covers the tests used by 40 states. Table 3-2 lists the tests that we reviewed and shows the states that use them.

In reviewing these assessments, we gathered general information about each test (e.g., number of subtests, types of questions, scores derived, and proficiency standards). We examined information reported in their technical manuals and supplementary reports with regard to the ways that the tests were developed, the skills that they measure, the ways that the scores are derived and reported, the reliability of those scores, and the validity of the decisions based on the scores. In conducting the review, we examined the materials for evidence that the information was pro-

11 See Chapters 4 and 5 for further details about numbers of ELL students per state.
TABLE 3-1 English Language Proficiency Assessments, by State, 2009-2010 School Year

Alabama: Assessing Comprehension and Communication in English State to State (ACCESS)
Alaska: IDEA Proficiency Test (IPT)
Arizona: Arizona English Language Learner Assessment (AZELLA) (customized version of the SELP)
Arkansas: English Language Development Assessment (ELDA)
California: California English Language Development Test (CELDT)
Colorado: Colorado English Language Assessment (CELA) (customized version of LAS Links)
Connecticut: Language Assessment Scales Links (LAS Links)
Delaware: ACCESS
District of Columbia: ACCESS
Florida: Comprehensive English Language Learning Assessment (CELLA)
Georgia: ACCESS
Hawaii: ACCESS
Idaho: Idaho English Language Assessment (IELA) (items drawn from MWAC item bank)
Illinois: ACCESS
Indiana: LAS Links
Iowa: ELDA
Kansas: Kansas English Language Proficiency Assessment (KELPA)
Kentucky: ACCESS
Louisiana: ELDA
Maine: ACCESS
Maryland: LAS Links
Massachusetts: Massachusetts English Proficiency Assessment-Reading and Writing (MEPA-R/W) and Massachusetts English Language Assessment-Oral (MELA-O)
Michigan: Michigan English Language Proficiency Assessment (MI-ELPA) (items initially drawn from MWAC and SELP item banks)
Minnesota: K-2 Reading and Writing Checklist; Test of Emerging Academic English (TEAE) (grades 3-12); Minnesota Modified Student Oral Language Observation Matrix (MN-SOLOM) (grades K-12)
Mississippi: ACCESS
Missouri: ACCESS
Montana: MontCAS English Language Proficiency Assessment (MontCAS ELP) (adapted items from MWAC)
Nebraska: ELDA
Nevada: Nevada State English Language Proficiency Assessment (NV-ELPA)
New Hampshire: ACCESS
New Jersey: ACCESS
New Mexico: ACCESS
New York: New York State English as a Second Language Achievement Test (NYSESLAT) (items initially drawn from SELP item bank)
N. Carolina: ACCESS
N. Dakota: ACCESS
Ohio: Ohio Test of Language Acquisition (OTELA) (modified version of ELDA)
Oklahoma: ACCESS
Oregon: Oregon English Language Proficiency Assessment (OR-ELPA)
Pennsylvania: ACCESS
Rhode Island: ACCESS
S. Carolina: ELDA
S. Dakota: ACCESS
Tennessee: Tennessee English Language Placement Assessment (TELPA)
Texas: Texas English Language Proficiency Assessment System (TELPAS)
Utah: Utah Academic Language Proficiency Assessment (UALPA) (adapted items from MWAC)
Vermont: ACCESS
Virginia: ACCESS
Washington: Washington Language Proficiency Test II (WLPT-II) (customized version of SELP)
W. Virginia: ELDA, renamed the West Virginia Test for English Language Learners (WESTELL) for use in the state
Wisconsin: ACCESS
Wyoming: ACCESS

SOURCE: http://www.ncela.org; data confirmed by the state Title III directors.

TABLE 3-2 Tests Reviewed by the Panel

ACCESS: Alabama, Delaware, DC, Georgia, Hawaii, Illinois, Kentucky, Maine, Mississippi, Missouri, New Hampshire, New Jersey, New Mexico, North Carolina, North Dakota, Oklahoma, Pennsylvania, Rhode Island, South Dakota, Vermont, Virginia, Wisconsin, Wyoming
CELDT: California
CELLA: Florida
ELDA: Arkansas, Iowa, Louisiana, Nebraska, South Carolina, Tennessee, West Virginia
LAS Links*: Colorado, Connecticut, Indiana, Maryland
NYSESLAT: New York
SELP*: Arizona, Washington
TELPAS: Texas

*Test is customized for each state so that it measures the state's English language proficiency content standards.
vided, and we did a cursory review of the procedures that were used, but we did not conduct a full-scale evaluation of each test. For example, we examined the technical manuals to confirm that reliability and validity information was reported, but we did not evaluate the procedures used to obtain that information or the quality of the information reported. Doing the latter would have required that we first agree on criteria for evaluating the tests and then thoroughly review the processes each used and the data each reported. Time and resources for this project were too limited to perform this type of review. The information we report about the tests is primarily descriptive and is intended to support our charge of evaluating the extent to which the test results yield valid and comparable decisions across the states.

GENERAL SIMILARITIES AND DIFFERENCES AMONG THE STATE TESTS

In the most general sense, the new ELP tests have much in common, which is understandable since all of them were designed to meet the new requirements of NCLB. All assess ELP in the four broad domains specified by the legislation: listening, speaking, reading, and writing. All assess academic language as conceptualized and defined in the ELP content standards, are standards based (i.e., designed to evaluate the ELP standards set by the state), and are aligned with the language demands in the state's core academic content standards (discussed above). In this section we discuss the similarities and differences among the tests with regard to their content standards, the grade bands (i.e., clusters of grades) covered, the item types, the scores reported, the criteria used to determine ELP, the methods used to set cut scores, and the reliability and validity of the tests.
English Language Proficiency Content Standards

When NCLB was enacted, one requirement was that states develop and/or adopt a set of ELP content standards to define the knowledge and skills that ELL students would be expected to master. Some states adopted the standards developed by an organization called Teachers of English to Speakers of Other Languages (TESOL), an association whose mission is to develop and maintain professional expertise in English language teaching and learning.12 Other states created their own standards or made adjustments to the TESOL standards to meet their own needs.

Articulation of the set of knowledge and skills that students should know and be able to do is the first step in designing a test, and it has a major impact on the nature of the test. Thus, while all of the tests measure ELP, they measure the skills of listening, speaking, reading, and writing in different ways. Three of the tests that we reviewed (CELDT, NYSESLAT, and TELPAS) were developed specifically for a given state and thus are designed to measure that state's proficiency standards. Three tests (ACCESS, CELLA, and ELDA) were developed through one of the state consortia and so had to derive a strategy for dealing with differing state standards.

The strategy used for developing the ELDA standards provides an example of this process (American Institutes for Research, 2005). For this test, the ELP standards were defined through a synthesis of the standards used by the original states in the consortium.13 The standards were initially merged by the test developer (American Institutes for Research) and were then refined by a consortium steering committee. The group agreed to common standards for each of the four domains. Some member states used these ELDA standards to guide the adoption of their own standards. Other member states reviewed their existing standards for alignment with the ELDA standards and made adjustments as needed. The result was that all the states using the ELDA adopted similar ELP standards.

The WIDA Consortium used procedures similar to those used for ELDA in identifying the test standards, as did the developers of SELP and LAS Links. For instance, for SELP, the test framework was originally based on an analysis of ELP standards for six states (California, Delaware, Hawaii, Georgia, Missouri, and Texas) in conjunction with a review of the TESOL standards. Alignment studies were used to evaluate the correspondence of a particular state's standards with the test itself, and adjustments were made as needed (Pearson Education, 2009). Most states that administer the SELP or LAS Links use an augmented version, meaning that items are added to ensure that the test measures the state's standards and meets its specific needs.

12 For information, see www.tesol.org [December 2010].

Grade Bands

NCLB requires that ELP tests be available for students at all levels, from kindergarten through 12th grade, and so the assessments have different versions of the test for specific clusters of grades.
As noted above, there are usually versions for the early elementary grades, later elementary grades, middle school, and high school, although the specific span of grades varies across tests:

- ELDA and ACCESS have versions intended for five grade bands: pre-K to kindergarten, grades 1 and 2, grades 3 through 5, grades 6 through 8, and grades 9 through 12.
- Texas has versions of the TELPAS for seven grade bands (K-1, 2, 3, 4-5, 6-7, 8-9, and 10-12).
- Washington and Arizona, which both use customized versions of the SELP, have versions for upper elementary (grades 3-5), middle school (grades 6-8), and high school (grades 9-12); in the earliest band, Washington uses a single version for grades K-2, while Arizona has two versions, one for kindergarten and one for grades 1-2.

All of the test programs have implemented vertical linking procedures to enable comparisons of performance across adjacent grade bands.

13 Initially, 18 states participated in the consortium, and 6 had ELP standards in place.

Item Types

Some tests use strictly multiple-choice questions (e.g., the CELLA); others use a combination of item types. For instance, the ACCESS uses multiple-choice questions for reading and listening and constructed-response questions for writing and speaking. The SELP and CELDT use a combination of multiple-choice and constructed-response (both short-answer and extended-answer) items for each of the domains. The TELPAS uses classroom-based performance evaluation for all domains except reading.

Research has shown that performance on constructed-response and performance-based items is not entirely equivalent to performance on multiple-choice items. That is, students with the same level of writing skills might perform somewhat differently on the multiple-choice questions used by the CELLA than on the constructed-response questions used by the ACCESS, which primarily require expressive skills, or on the classroom performance-based items of the TELPAS. However, the different item formats generally measure related constructs and can usually be combined into a single scale (Ercikan et al., 1998).

Scores

Nearly all of the tests we reviewed report scores for each of the domains (listening, speaking, reading, and writing), an overall composite score summarizing performance in all four domains, and a comprehension score that is a composite of performance on the listening and reading tests. The NYSESLAT is an exception in that it reports two composite scores, one for listening and speaking and one for reading and writing. Some tests (ACCESS, CELLA, LAS Links, and Arizona's version of the SELP) also report an oral language score, which is derived from performance on the listening and speaking tests.
In addition, the ACCESS test provides a score for literacy, based on combined performance on the reading and writing tests.

Although the tests all report some type of composite score, these composites do not consistently use the same weighting: some are based on equally weighted subscale scores and others are not. For instance, the CELDT and CELLA assign equal weights to the domain scores in determining the overall score. Other tests weight the domain scores differentially. The overall score on the TELPAS accords the most weight to the reading test (75 percent); the writing score is weighted at 15 percent, and the listening and speaking scores are weighted at 5 percent each. For ACCESS, the overall score weights reading and writing at 35 percent each and listening and speaking at 15 percent each. For the consortium-based tests (ACCESS, ELDA), the scores that are reported and any weights that are used are the same for all states using the test. For LAS Links and SELP, the test publisher offers a number of options for the states that use the tests. Thus, the overall score for Washington's version of the SELP may reflect a different weighting of the composite scores than the overall score for Arizona's version.

This differential weighting reflects states' priorities with regard to which aspects of English proficiency in the four domains are acquired first and which domains are critical to succeeding in school. For instance, the technical guide to the TELPAS notes that listening and speaking are intentionally accorded less weight than reading and writing to ensure that students do not obtain a high overall score without acquiring the necessary skills in reading and writing. Young children usually acquire listening and speaking skills first (DeÁvila, 1997; Hakuta et al., 2000). Older students who have been schooled in academic subject matter in their native language can learn to read English text fairly quickly once they have studied the subject and learned basic English vocabulary, grammar, and structure. If a student is not literate in the native language and has had minimal or interrupted schooling or has not been taught the subject matter, learning to read and write will take more time, and these skills will be more difficult to master (Hakuta et al., 2000; Parker et al., 2009).

The weighting of language proficiency domains in ELP tests is important because it means that the skills represented by the overall scores differ from test to test. We note that all of these weighting schemes refer simply to the weights applied in combining raw scores or scale scores for the four domains.
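As a rough illustration, the weighting arithmetic described above can be sketched in a few lines of code. The weights are those reported for TELPAS and ACCESS; the domain scale scores and standard deviations are hypothetical values chosen only for illustration, not numbers from any actual reporting scale.

```python
# Nominal composite weights reported in the text for two of the tests.
NOMINAL_WEIGHTS = {
    "TELPAS": {"reading": 0.75, "writing": 0.15, "listening": 0.05, "speaking": 0.05},
    "ACCESS": {"reading": 0.35, "writing": 0.35, "listening": 0.15, "speaking": 0.15},
}

def composite(domain_scores, weights):
    """Overall score as a weighted sum of the four domain scale scores."""
    return sum(weights[d] * s for d, s in domain_scores.items())

# Hypothetical domain scale scores for one student.
scores = {"reading": 520, "writing": 480, "listening": 560, "speaking": 540}
for test, w in NOMINAL_WEIGHTS.items():
    print(test, round(composite(scores, w), 1))

# The caveat noted in the text: nominal weights are not the whole story.
# A domain's effective pull on the composite also depends on the spread of
# its subtest scores. As a crude first-order proxy (ignoring correlations
# among domains), each domain's effective share is proportional to
# weight x standard deviation.
subtest_sd = {"reading": 40, "writing": 35, "listening": 25, "speaking": 30}  # hypothetical

def effective_shares(weights, sds):
    raw = {d: weights[d] * sds[d] for d in weights}
    total = sum(raw.values())
    return {d: round(v / total, 2) for d, v in raw.items()}

for test, w in NOMINAL_WEIGHTS.items():
    print(test, effective_shares(w, subtest_sd))
```

Running the sketch shows how the same four domain scores yield different overall scores under the two weighting schemes, and how the reading domain's effective share can exceed its nominal weight when its subtest scores are more spread out than the others.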
Even when these nominal weights agree across different tests, the relative influence of different domains on the tests’ composite scores may differ, because the relative influences are also affected by the variances of the subtests for each domain.

English Language Proficiency Levels

Using the results from an ELP assessment, states are required to report the number of students who made progress in learning English (the first annual measurable achievement objective, or AMAO1) and the number who attained ELP (AMAO2) each year. To accomplish this, each of the test publishers has developed a number of categories of performance, referred to as “performance levels” (also called “proficiency levels” or “achievement levels”). To determine the scores on the test that define the boundaries for each of the performance levels, a standard-setting procedure must be used. Standard setting is a process for determining the minimum score (or “cut score”) that a student must obtain on the test to be considered as having attained a given proficiency level. It is typically accomplished by a panel of trained participants who make judgments about how scores on a test relate to performance descriptors for each proficiency level. These judgments are used to set the cut score for each of the performance levels. All ELP tests now being used have established performance levels and have used
formal standard-setting methods to determine the cut scores for each performance level. They provide narrative descriptions of the knowledge and skills each performance level represents. These “performance level descriptions” characterize stages of language learning that can be used to determine test takers’ instructional needs.

For accountability purposes, each state is required to determine a level of performance on its ELP test that is considered to be “English proficient” and to annually report to the DoEd the number of students who achieved this level (AMAO2). On this point it is important to distinguish between the performance level on the test that is designated as English proficient and the process that a state uses to classify a student as English proficient. The classification of a given student as English proficient may include criteria other than the student’s score on the English proficiency test. The rules for this classification are discussed in greater detail in the next chapter. Here we are concerned only with how states determine the proficient level on the test; not surprisingly, this process varies from state to state.

The definition of proficiency is determined differently in each state, using varying types of information, such as the judgments of the standard setters; information external to the test, including empirical analyses (e.g., analyses of decision consistency between ELP and achievement tests or regression analyses of ELP test scores and academic assessment results); and the judgments of policy makers or administrators. As a result, the states have adopted different operational definitions of “English proficient” performance on their tests. In some states, the definition is based on “conjunctive” rules, whereby students must meet all of a series of conditions. For example, California uses five performance levels to report performance on the CELDT: beginning, early intermediate, intermediate, early advanced, and advanced.
To meet the standard of English proficient on the test, a student must have an overall score at the early advanced level, and all domain scores must be at the intermediate level or higher. In other states, the definition of English proficiency is based on compensatory rules; that is, high performance in one area or domain can compensate for lower performance in another. For example, California has an alternate definition that is based on compensatory rules: California students may also be judged to be English proficient if their overall score is at the high end of the intermediate level and there is other evidence of proficiency, such as scores on other tests, report card grades, and teacher evaluations.

The ELP tests differ in three important ways with regard to performance levels. First, the performance levels themselves vary across the tests: the tests have different numbers of performance levels, different labels for them, and different descriptions of the skills they represent. This variation in performance levels is evident even among states that essentially use the same test (i.e., the states that use customized or augmented versions of SELP). Although there may be more similarities among the performance levels than is apparent at a surface level (for example, some use the same terminology to describe the skills they represent), there have been no qualitative or quantitative studies to evaluate the similarities and differences among the levels. Second, the tests vary in the level of performance that is judged to be “English proficient” for meeting the accountability and reporting requirements of NCLB.
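The conjunctive rule illustrated by California’s CELDT can be sketched as follows. This is a minimal illustration: the level numbers 1–5 stand for beginning through advanced, and the function and variable names are invented, not drawn from any state’s actual scoring procedures.

```python
# Conjunctive "English proficient" rule modeled on the CELDT example:
# overall score at early advanced (level 4) or above AND every domain
# score at intermediate (level 3) or above.  Level codes illustrative.
LEVEL = {"beginning": 1, "early intermediate": 2, "intermediate": 3,
         "early advanced": 4, "advanced": 5}

def conjunctive_proficient(overall, domains):
    """True only if overall >= early advanced and all domains >= intermediate."""
    return (overall >= LEVEL["early advanced"]
            and all(d >= LEVEL["intermediate"] for d in domains.values()))

# A student with one domain below intermediate fails the conjunctive
# rule even with a qualifying overall score.
scores = {"listening": 4, "speaking": 5, "reading": 3, "writing": 2}
print(conjunctive_proficient(4, scores))   # writing is below intermediate
```

A compensatory rule would instead be written as an alternative branch, e.g., accepting a high-intermediate overall score when other evidence of proficiency is present, so strong speaking could offset weak writing.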
Here again, there may be variation even among states that use the same test. For instance, although the WIDA Consortium has adopted performance levels for the ACCESS, it is up to each state to determine the level that defines when a student is considered “English proficient.” The same is true for the ELDA test developed by the LEP-SCASS. Third, although all of the tests use a formal standard-setting procedure to set the cut scores, the standard-setting procedures differ across the tests. For instance, ACCESS, CELDT, CELLA, and LAS Links used the bookmark method;14 ELDA used the bookmark method for all subtests except writing; NYSESLAT used an item mapping approach, similar to the bookmark approach; and SELP used the modified Angoff approach.15 These different approaches can yield different results, as can the same approach used at different times. Research has also shown that the standard-setting results for the same test can vary depending on the particular set of judges who participate and the particular approach used (Impara and Plake, 1997; Jaeger, 1989; Kiplinger, 1996; Loomis, 2001; Musick, 2000; National Research Council, 2005; Texas Education Agency, 2002). No studies have been done of the extent of differences among state performance levels. When tests use the same label for a performance level, such as “intermediate” or “proficient,” one cannot simply assume that the same set of skills is represented or that a student who scores “proficient” on one test will also score “proficient” on another. Studies that do a crosswalk comparison of the performance levels used by different testing programs would help determine the extent of comparability. We describe these approaches later in the chapter.
Methods to Set Cut Scores Empirically

During our review, we learned that some states have conducted studies to empirically derive the level that they define as “English proficient.” These states have explored methods for using performance on the content area achievement tests required by Title I of NCLB (the English language arts and mathematics achievement tests) as a criterion for helping them determine the “English proficient” level. The method considers how ELL students perform on both the ELP and content area assessments, classifying them as proficient or not proficient on each. It then seeks to identify the “English proficient” level on the ELP assessment that most consistently classifies students as proficient or not proficient on the English language arts and mathematics tests. The goal is to determine a cut score on the ELP test that maximizes the proportion of correct classifications on the ELP test in relation to both content tests. The cut score derived from this empirical analysis is then taken into consideration by a panel of judges when they set the cut score that defines proficient performance on the ELP test. This empirically based method has been used to set ELP cut scores for the 12 states that use the ACCESS, as well as for several non-WIDA states (Cook et al., 2009).

14 In the bookmark method, standard-setting panelists are asked to go through a specially constructed test booklet (arranged in order by the estimated difficulty of the items) and mark the most difficult item that a minimally proficient (or advanced) student would be likely to answer correctly; for details, see Mitzel et al. (2001).

15 In the modified Angoff method, standard-setting panelists are asked to estimate the percentage of minimally proficient (or advanced) students who would be expected to answer each item correctly; for details, see Angoff (1971).

Reliability and Validity

Detailed information about the technical qualities of the tests is available in their technical reports. For our review, we examined the reports to determine the type of information that was available, but we did not evaluate the quality of that information. For instance, we reviewed the technical information available about each test to determine whether it included the appropriate kinds of analyses to examine score reliability, but we did not evaluate the methods used to estimate reliability or the adequacy of the reliability estimates.

The testing programs generally report measures of internal consistency for the tests that are based on multiple-choice questions, and they provide measures of interrater agreement for tests that use constructed-response questions and are scored by humans. Most of the tests also provide an analysis of classification consistency, which examines the extent to which students are accurately classified into the various performance levels. Most of the testing programs have conducted studies to evaluate the fairness of their items and identify any items that are potentially biased. These studies usually entail reviews of the items by expert panels, although a few of the programs have conducted analyses of differential item functioning.

The programs do not report an extensive amount of validity evidence. Content-related validity evidence consists primarily of alignment studies. This work involves comparison of the items (or test blueprint) and the ELP content standards to evaluate the extent to which the items measure the intended content and skills.
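The empirical cut-score search described at the start of this section can be sketched as follows. Everything here is simulated: the ELP score distribution, its relationship to content-test proficiency, and the candidate cut range are assumptions for illustration only, whereas the actual studies (e.g., Cook et al., 2009) use operational assessment results.

```python
# Sketch of the empirical cut-score method: search for the ELP cut
# score that maximizes agreement between ELP-based classifications and
# proficient/not-proficient status on the content tests.
import random

random.seed(0)
students = []
for _ in range(1000):
    elp = random.gauss(500, 50)      # hypothetical ELP scale score
    # Higher ELP scores make content-test proficiency more likely
    # (an assumed, simplified relationship).
    p_proficient = min(max((elp - 400) / 200, 0.0), 1.0)
    students.append((elp, random.random() < p_proficient))

def agreement(cut):
    """Proportion of students classified consistently at this cut."""
    hits = sum((elp >= cut) == prof for elp, prof in students)
    return hits / len(students)

# Try candidate cut scores and keep the one with maximal agreement.
best_cut = max(range(400, 601, 5), key=agreement)
print(best_cut, round(agreement(best_cut), 3))
```

In the actual procedure, a cut score found this way is not adopted mechanically; as noted above, it is presented to a standard-setting panel as one input to its judgment.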
Construct-related validity evidence typically consists of correlations between each item and the total test score (i.e., point-biserial correlations) and intercorrelations among the subtest scores. A few of the programs have conducted factor analyses to verify the factor structure of the assessment. Only two testing programs provided evidence of criterion-related validity. Analysts carried out a study of ACCESS in which performance on the test was compared with a priori proficiency categorizations of students who participated in the field tests (MacGregor et al., 2009). In another study, analysts compared students’ performance on ELDA with teachers’ ratings of students’ English language proficiency (Lara et al., 2007). Wolf et al. (2008) reported that a cut score validation study was conducted for the CELDT.

For several of the tests, there is also evidence of the extent of correspondence between the scores and performance on another ELP test. For example, data are available on the correspondence between ACCESS scores and the New IDEA Proficiency Test (New-IPT) (Kenyon, 2006a), the Language Assessment Scales (LAS) (Kenyon, 2006b), the Maculaitis Assessment of Competencies Test of English Language Proficiency (MAC II) (Kenyon, 2006c), and the Language Proficiency Test Series (LPTS) (Kenyon, 2006d). Data are also available on the relationships between scores on the ELDA and scores on the New-IPT and the LAS, as well as on the correspondence between scores on LAS Links and the LAS. For some of the tests (NYSESLAT, TELPAS, ACCESS), data are available on the relationships between performance on the proficiency test and the state’s English language arts test.

Technical Quality and Comparability of the Tests

Our review of eight ELP tests covered the information available in their technical manuals and supplementary materials with regard to test development, setting standards, deriving and reporting scores, and determining the reliability and validity of the scores. For this set of tests, we found evidence that the assessments have been developed according to accepted measurement practices. Each of the testing programs documented its efforts to evaluate the extent to which the test scores are valid for the purpose of measuring students’ language proficiency in English.

NCLB set requirements for the tests and, as a result, pushed forward efforts to standardize certain aspects of these assessments. To meet the legislated requirements, the new tests must have a number of common features, and we found evidence of these features in all eight tests that we reviewed. The tests are all standards-based. They all measure some operationalized conceptualization of academic language, in addition to social/conversational language, in four broad domains and report scores for each of these domains, as well as a comprehension score and one or more composite scores. They all summarize performance using proficiency or performance levels, and states have established methods of looking at overall and domain scores in order to determine their respective definitions of English language proficiency.
The tests also have versions available for students in kindergarten through 12th grade, with linkages to enable measurement of growth across adjacent grade bands. These common features provide the foundation for a certain degree of comparability across the tests.

Nevertheless, there are a number of ways in which tests can differ even though they meet the requirements set by NCLB, and we found evidence of such differences in the eight tests that we reviewed; these differences are likely to affect the comparability of the results that are reported. For instance, the tests we examined differed in content coverage, the types of questions used, test length, and timing of administration. Other aspects of the tests, such as the theory of academic language that underlies the questions, the difficulty of the questions, and measurement accuracy at each score point, can also affect their equivalence. These differences mean that we cannot simply assume that a student who scores at the intermediate or proficient level on one state’s ELP test will score at the intermediate or proficient level on another. Evaluating the extent to which the tests are comparable requires empirical analyses that may involve quantitative or qualitative approaches.
A quantitative approach for evaluating the equivalency of different assessments and putting the results on the same scale is referred to as “linking” (or “scaling”). Linking is a statistical procedure that allows one to determine the score on one test that is essentially equivalent to a score on another test (see Holland and Dorans, 2006; Johnson and Owen, 1998; Linn, 1993; Mislevy, 1992; National Research Council, 1999a, 1999b). These analyses are not easy to conduct, in part because of the data that must be collected. There are three types of linking procedures: equating, scale aligning, and predicting. The procedures range from strong to weak in terms of the assumptions they require and the inferences they permit.

The strongest type of linking is equating. Equating is possible when the two assessments are designed according to exactly the same specifications. That is, the assessments are matched in terms of content coverage, difficulty, type of questions used, test length, and measurement accuracy at each score point (Haertel and Linn, 1996; Holland and Dorans, 2006; Linn, 1993; Mislevy, 1992; National Research Council, 1999a, 1999b). To enable equating, the two tests must be given in a way that allows one to establish the linking function between them, such as by randomly assigning students to take one test or the other (a randomized groups equating design), by having the same students take both tests, or by including a set of items that are common to both tests (a common-items equating design). Other equating designs are also possible, but the randomized groups and common-items equating designs provide the strongest basis for equating. When the scores on two tests are equated, they are considered interchangeable.
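As an illustration of the kind of computation a linking function involves, here is a minimal equipercentile linking sketch, assuming a design in which the same examinees took both tests. The score values are invented; operational linking studies use presmoothing, interpolation, and far larger samples.

```python
# Minimal equipercentile linking sketch: map a score on test X to the
# score on test Y with the same percentile rank.  Assumes the same
# group of examinees took both tests; all score values are invented.

def percentile_rank(score, scores):
    """Midpoint percentile rank of a score within a distribution."""
    below = sum(s < score for s in scores)
    ties = sum(s == score for s in scores)
    return (below + 0.5 * ties) / len(scores)

def equipercentile_link(x_score, x_scores, y_scores):
    """Y score whose percentile rank is closest to that of x_score in X."""
    target = percentile_rank(x_score, x_scores)
    return min(set(y_scores),
               key=lambda y: abs(percentile_rank(y, y_scores) - target))

# Hypothetical score distributions from one group of ten examinees.
x = [45, 50, 55, 60, 65, 70, 75, 80, 85, 90]
y = [300, 320, 340, 360, 380, 400, 420, 440, 460, 480]

print(equipercentile_link(65, x, y))   # -> 380
```

When the two tests are built to identical specifications and a strong data collection design is available, machinery of this kind supports equating; when the specifications differ, as with the states’ ELP tests, the same computation yields at most scale alignment, and the linked scores are not interchangeable.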
For the states’ ELP tests, equating is not possible because the basic requirements for this kind of linkage have not been met (i.e., the assessments are not matched in terms of content coverage, difficulty, type of questions used, test length, and measurement accuracy at each score point). Thus, we cannot say that the test results are comparable from state to state under the strictest definition of comparability.

It may be possible to link the results of different ELP tests using the two procedures with less stringent assumptions, scale aligning and predicting. Scale aligning is conducted when the tests being linked measure different constructs, or when they measure similar constructs but with different test specifications (Holland and Dorans, 2006, p. 190). The goal of the predicting procedure is to predict an examinee’s score on one test from some other information about that examinee (i.e., a score on another test, scores from several other tests, and possibly demographic or other information) (Holland and Dorans, 2006, p. 188). Both procedures require a linking design like those used for equating: randomized groups, the same tests taken by the same group, or common items. Less rigorous, nonequivalent groups designs are also possible, but just as with equating, they provide a weaker basis for developing the linking function, and the inferences they permit are limited.

It may also be possible to compare the different ELP tests using a more qualitative approach, often referred to as a “crosswalk review.” A crosswalk evaluation is a systematic judgment-based comparison of key aspects of tests, including the content standards each test is intended to measure, how it measures those standards, how item responses
are aggregated to summary scores, and other key elements such as the performance levels.16 In the present context, the analysis would need to focus on the levels set by the states to define when a student is “English proficient.” This approach would compare the performance levels in terms of what students are expected to know and be able to do in order to be considered “English proficient,” to evaluate the extent to which the states require similar skills. The approach might compare the performance levels with one another across the states. Or it might involve determining a priori a definition of English proficiency and evaluating each state’s performance levels in relation to this definition. The a priori definition might be determined by the DoEd or through the use of an independent expert panel.

To date, no qualitative crosswalk studies or statistical linking studies have been conducted for any of the ELP assessments we reviewed. “Bridging” studies have been done that predicted performance on the ACCESS from performance on other ELP tests (the studies by Kenyon, 2006a-2006d, mentioned earlier), but these studies were restricted to the kinds of assessments in place prior to NCLB (e.g., IPT, LAS, LPTS, and MAC II).

It is important to point out that this situation is not unique to the ELP tests. The content standards, tests, and performance standards that states use for other aspects of NCLB (e.g., the reading and mathematics achievement tests) also vary from state to state, and scores are not comparable across states. Furthermore, the ELP tests were not designed from the outset to yield comparable results across states. The development effort would likely have taken a much different focus had cross-state comparability been the original intent. It is always difficult to attach a new use to test results when the test has not been designed from the outset for that purpose.
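The predicting procedure described earlier, the weakest form of linking, reduces in its simplest form to regression. The sketch below fits an ordinary least-squares line from one test’s scores to another’s for a group assumed to have taken both tests; all score values are invented for illustration.

```python
# Minimal sketch of the "predicting" linking procedure: regress test-Y
# scores on test-X scores for examinees who took both, then predict Y
# from X.  Data are invented; operational studies add more predictors
# and examine prediction error carefully.

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = ((sum(x * y for x, y in zip(xs, ys)) - n * mx * my)
             / (sum(x * x for x in xs) - n * mx * mx))
    return slope, my - slope * mx

x = [40, 50, 60, 70, 80]          # scores on test X
y = [310, 335, 355, 380, 400]     # same examinees' scores on test Y

slope, intercept = fit_line(x, y)

def predict(new_x):
    """Predicted test-Y score for a new test-X score."""
    return slope * new_x + intercept

print(predict(65))
```

As the chapter notes, a prediction of this kind permits only limited inferences: it estimates a likely score on the other test, but it does not make the two scores interchangeable.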
CONCLUSION 3-1 Although the English language proficiency assessments that we reviewed share common features and many states use the same test, the level of performance that defines when a student is considered to be “English proficient” is set by each state. No empirical evidence has been collected to evaluate the comparability of these levels across the states.

In closing, however, we point out that results from the ELP test are not the sole basis for decisions to classify ELL students. Even if the ELP tests were linked and their scores placed on the same scale, there would still be differences among the states in their procedures and criteria for classifying students. We take up these issues in the next chapter.

16 Crosswalk analyses are sometimes used for alignment studies to evaluate the extent to which test items are aligned with content standards (see, e.g., http://www.adultedcontentstandards.ed.gov/docs/fieldResources/writing/Using%20Crosswalks%20for%20Alignment%20Notes.doc [December 2010]). Crosswalk analyses have also been conducted in a variety of other settings (see, e.g., http://www.calpro-online.org/eric/webliog.asp?tbl=webliog&ID=24 [December 2010]).