
6—
Program Evaluation

Of the types of programs for English-language learners reviewed in Chapter 1, the most commonly studied are those that use the native language for some period of time for core academics (i.e., transitional bilingual education programs) and those that do not use the native language in any regular or systematic way (i.e., English as a second language [ESL] and its variants, such as structured immersion and content-based ESL, as well as "submersion programs"). During the 1970s and 1980s, the federal government and advocates were keenly interested in determining which of these two models is more effective. Program evaluations were intended to provide a definitive answer to this question. This chapter examines what we know from program evaluations conducted to date and identifies research needs in this area. Note that this chapter focuses on evaluation of various models for educating English-language learners; Chapter 7 reviews studies of school and classroom effectiveness and makes recommendations regarding research to improve educational programming for English-language learners and studies of the processes related to program development, implementation, and dissemination.

State of Knowledge

This section begins by reviewing national-level evaluations of programs for English-language learners. It then examines reviews of smaller-scale program evaluations. This is followed by a discussion of the politicization of program evaluation. The final subsection addresses the future course of program evaluation,








presenting five lessons learned that can lead to better, more useful evaluations.

National Evaluations

There have been three large-scale national evaluations of programs for English-language learners. Although these studies provided some information about the education of English-language learners, they were of limited utility for the evaluation of programs. Nonetheless, it is instructive to review them so we can avoid the mistakes of the past, as well as benefit from what was learned. This section also summarizes the findings of a National Research Council report (Meyer and Fienberg, 1992) that reviews two of these studies.

American Institutes for Research (AIR) Study

The American Institutes for Research (Dannoff, 1978) conducted the first large-scale national evaluation of programs for English-language learners, commonly referred to as the AIR study. The study compared students enrolled in Title VII Spanish/English bilingual programs and comparable students (in terms of ethnicity, socioeconomic status, and grade level) not enrolled in such programs. In the AIR study, 8,200 children were measured twice during the school year on English oral comprehension and reading, math, and Spanish oral comprehension and reading. Generally, the results from this study showed that students in bilingual education programs did not gain more than students not in such programs. The study was the subject of a great deal of criticism, the major criticism addressing the strength of the treatment control group comparison (Crawford, 1995). Nearly three-quarters of the experimental group had been in bilingual programs for 2 or more years, and the study measured their gains in the last few months. Additionally, about two-thirds of the children in the control group had previously been in a bilingual program; these children did not represent a control group in the usual sense of the term. Thus, the AIR study did not compare bilingual education with no bilingual education.

In part because of the ambiguity of the conclusions from the AIR study, two major longitudinal studies were commissioned by the U.S. Department of Education in 1983 to look at program effectiveness. The hope was that these studies would provide definitive evidence, one way or the other, about the effectiveness of bilingual education. These two studies are reviewed below.

Longitudinal Study

The National Longitudinal Evaluation of the Effectiveness of Services for Language Minority Limited English Proficient Students (Longitudinal Study) (Development Associates, 1984; Burkheimer et al., 1989) was conducted by Development Associates, with Research Triangle Institute as a prime subcontractor. The study had two distinct phases. The descriptive phase, which examined the variety of services provided to English-language learners, was designed as a four-stage stratified probability sample. First-stage units were states; second-stage units were school districts, counties, or clusters of neighboring districts or counties; third-stage units were schools; and fourth-stage units were teachers and students.¹ A total of 342 schools were identified, but only 86 agreed to participate in the study. Because of this low participation rate, the sample cannot be considered nationally representative. Nonetheless, the descriptive phase of the Longitudinal Study provided one of the most comprehensive reports on the services received by English-language learners. Some of the major findings were as follows:

• The vast majority of English-language learners came from economically disadvantaged families.
• English-language learners were at risk educationally, with most performing below grade level.
• Most instruction of the children was in English or in some combination of English and the native language.

¹ The target population of students consisted of elementary-age language minority students/English-language learners receiving special language-related services from any source of funding. The sample included the 10 states with at least 2 percent of the national estimated target population, and an additional 10 states were selected as a stratified random sample of the remaining states, with selection probability proportional to the estimated size of the elementary-grade target population in the state.

The goal of the second phase of the study was a longitudinal follow-up to determine the relative effectiveness of programs. The focus was on 25 schools from the first phase, with students from kindergarten to fifth grade being followed over 3 years (Burkheimer et al., 1989). Students in the sample consisted of six distinct groups: three types of students sampled in two cohorts. The English-language learners' group consisted of virtually all students in the respective cohort who were classified as having limited English proficiency by local criteria. The English-proficient group consisted of students who were receiving special services because they were placed in classes with English-language learners. The comparison group consisted of children considered English proficient who had never received such special services. Because the schools were initially selected for the follow-up phase on the basis of having representative populations of interest and not on the basis of program differences, there was not much variation in the programs represented in this phase. Moreover, some of the schools had no children in the control group.

In an attempt to compensate for the weakness of the study design, elaborate statistical techniques were used. Given these limitations, one should be cautious about interpreting the study findings. The major findings can be summarized as follows (Burkheimer et al., 1989):

• Assignment of English-language learners to specific instructional services reflects principally local policy determinations. To a much more limited extent (mainly in the lower grades), assignment may be based on "deliberate" (and apparently criterion-driven) placement of students in the specific services for which they are ready.
• Too heavy a concentration in any one aspect of the English-language learner's education can detract from achievement in other areas.
• The yearly achievement of English-language learners in math and English-language arts is facilitated by different approaches, depending on student background factors. For example, students who are relatively English proficient are better able to benefit from English-language arts instruction given in English, whereas students who are weak in English and/or strong in their native language show better yearly English-language arts achievement when instructed in their native language.
• In the later grades, proficiency in mathematics when tested in English seems to require proficiency in English. This is not the case in the lower grades.
• Like assignment to specific services for English-language learners, exit from those services reflects both local policy determination and specific criterion-driven exit rules related to reaching certain levels of proficiency/achievement in English.
• Children are more likely to be exited from English-language learner services if those services are similar to those that would be expected for English-proficient students.

Immersion Study

The Longitudinal Study of Immersion and Dual Language Instructional Programs for Language Minority Children (Immersion Study) (Ramirez et al., 1991), conducted by Aguirre International, was a much more focused study of program alternatives than the Longitudinal Study. It attempted a quasi-experimental longitudinal comparison of three types of programs: English-only immersion, early-exit bilingual (also known as transitional bilingual), and late-exit bilingual (also known as maintenance bilingual). The study took place at nine sites, but five of these had only one of the three types of programs (Ramirez et al., 1991). In fact, the late-exit bilingual program was completely confounded with site. Despite sophisticated statistical models of growth, the conclusions from the study are seriously compromised by the noncomparability of sites.

The major findings of the comparison of program types were summarized by the U.S. Department of Education (1991). After 4 years in their respective programs, English-language learners in immersion strategy and early-exit programs demonstrated comparable skills in mathematics, language, and reading when tested in English. There were differences among the three late-exit sites in achievement level in the three subjects: students at the sites with the most use of Spanish and the most use of English ended sixth grade with the same skills in English language and reading; students at the two late-exit sites that used the most Spanish showed higher growth in mathematics skills than those at the site that abruptly transitioned into almost all English instruction. Students in all three programs realized a growth in English-language and reading skills that was as rapid as or more rapid than the growth that would be expected for these children had they not received any intervention.

National Research Council Report

Both the Longitudinal and Immersion studies were reviewed by a National Research Council panel of the Committee on National Statistics (Meyer and Fienberg, 1992). The primary focus of the panel's report is on determining whether the statistical methods used in those studies were appropriate. The report draws important lessons learned from these major efforts, including the following:

• The formal designs of the Longitudinal and Immersion studies were ill suited to answering the important policy questions that appear to have motivated them.
• Execution and interpretation of these studies, especially the Longitudinal Study, were hampered by a lack of documentation of the study objectives, the operationalizing of conceptual details, actual procedures followed, and changes in all of the above.
• Because of the poor articulation of study goals and the lack of fit between discernible goals and research design, it is unlikely that additional statistical analyses of the data would yield results central to the policy questions these studies were originally intended to address.
• Both the Longitudinal and Immersion studies suffered from excessive attention to the use of elaborate statistical methods intended to overcome the shortcomings in the research designs.
• Although the samples in the Immersion Study were followed longitudinally, later comparisons are not considered valid because of sample attrition.

Quite clearly the Longitudinal and Immersion studies did not provide decisive evidence about the effectiveness of bilingual programs. However, according to the National Research Council report, findings from the comparisons that were most sound with respect to study design and sample characteristics indicate that kindergarten and first grade students who received academic instruction in Spanish had higher achievement in reading than comparable students who received academic instruction in English. In general, more has been learned from smaller-scale program evaluations, to which we now turn, than from these multi-million-dollar studies.

Reviews of Smaller-Scale Evaluations

This section examines five key reviews of smaller-scale program evaluations: Baker and de Kanter (1981), Rossell and Ross (1986) and Rossell and Baker (1996), Willig (1985), and the U.S. General Accounting Office (U.S. GAO) (1987).

Baker and de Kanter (1981)

Still by far the most influential review of studies in bilingual education, despite its age, is that of Baker and de Kanter (1981). This review, which was based on earlier reviews by Engle (1975) and Zappert and Cruz (1977), was in turn used as a basis for other reviews that are both critical (e.g., Rossell and Ross, 1986; Rossell and Baker, 1996) and supportive (e.g., Willig, 1985) of bilingual education. Since the study was conducted, it has become increasingly difficult to find comparison groups because of judicial decisions (see Appendix A); thus, the Baker and de Kanter review is more current than might be presumed. Baker and de Kanter located about 150 evaluations of programs designed for English-language learners. To be included in the review, a study essentially had to either employ random assignment of children to treatment conditions or take measures to ensure that children in the treatment groups were equivalent. Baker and de Kanter rightly rejected studies that had no comparison group. Of the studies initially located, only 28 satisfied their criteria. Other reviews of the literature have also found a disappointing percentage of studies (only about 10 percent [Lam, 1992]) to be methodologically adequate. We return to the issue of the poor quality of program evaluation later in this chapter.

Baker and de Kanter (1981:1) drew the following conclusion from their review: "The case for the effectiveness of transitional bilingual education is so weak that exclusive reliance on this instruction method is clearly not justified." It is important to realize that Baker and de Kanter were not evaluating whether bilingual education was effective, but rather whether there was sufficient research basis to justify transitional bilingual education over alternative forms of instruction. They adopted the conservative strategy of defining a "sufficient research basis" as showing that transitional bilingual education was significantly better statistically than a comparison program.

Rossell and Ross (1986) and Rossell and Baker (1996)

Rossell and Ross (1986) and Rossell and Baker (1996) used the Baker and de Kanter (1981) review as the basis for their review of the research literature. Working from Baker and de Kanter (1981), as well as Baker and Pelavin (1984), they considered studies that evaluated alternative second-language programs. Their review included only studies of "good quality," which was defined as having random assignment to programs or statistical control for pretreatment differences between groups when random assignment was not possible. Rossell and Ross (1986:413) drew the following conclusion: "The research…does not support transitional bilingual education as a superior instruction for increasing the English language achievement of limited-English-proficient children." Rossell and Baker (1996:7) came to essentially the same conclusion: "Thus the research evidence does not support transitional bilingual education as a superior form of instruction for limited English proficient children." Despite these conclusions, both reviews state that a small number of studies support the view that transitional bilingual education is beneficial. Both reviews suggest that structured immersion is a more promising program. In such a program, instruction is in the language being learned (i.e., English), but the teacher is fluent in the students' native language.²

² The second language used in these programs is always geared to the children's language proficiency level at each stage so that it is comprehensible. The native tongue is used only in the rare instances when the students cannot complete a task without it. The student thus learns the second language and subject matter content simultaneously. Immersion programs in which the second language is not the dominant language of the country typically include at least 30 to 60 minutes of native-language arts, and in some cases become bilingual programs early on.

Willig (1985)

Willig (1985) conducted a meta-analysis of the studies reviewed by Baker and de Kanter (1981). Meta-analysis provides a quantitative estimate of the effect of an intervention. Willig made several significant improvements to the Baker and de Kanter review (Secada, 1987). First, she eliminated five studies conducted outside of the United States (three in Canada, one in the Philippines, and one in South Africa) because of the significant differences in the students, the programs, and the context in those studies. She also excluded one study in which instruction took place outside the classroom. Second, as is required in meta-analysis, and in contrast with previous reviews, she quantitatively measured the program effect in each study, even if it was not statistically significant. To make these measurements, she had to eliminate seven more of the studies reviewed by Baker and de Kanter because they did not provide sufficient data to perform the necessary calculations.³ Thus, her major conclusions are based on 16 studies, from which she computed about 500 effect size measures. (An effect size is a measure of program benefit relative to another program.) Her overall conclusion is quite different from that of Baker and de Kanter (1981): "positive effects for bilingual programs…for all major academic areas" (p. 297). However, it should be noted that Willig was asking a fundamentally different question than Baker. While Baker was asking whether bilingual education should be mandated, Willig was asking a more modest question: Does bilingual education work? As she notes, she did not compare transitional bilingual education programs with other programs, such as ESL and sheltered instruction, in part because neither she nor Baker could find many evaluations that made such a comparison. She only contrasted program versus no-program studies. Against that standard, Willig concluded that bilingual education does work (better than not having anything in place). In addition, Willig's results indicate that the better the technical quality of the study (for example, if a study used random assignment as opposed to creating post hoc comparison groups), the larger the effects. These results raise an interesting possibility: that the "effectiveness" debate may really be a debate carried on at the relatively superficial level of a study's technical quality.

³ Recent work by Bushman and Wang (1996) does permit the combining of quantitative and qualitative results.

The Willig (1985) meta-analysis does, however, have drawbacks. Most problematic, it employs the questionable practice of including the same study more than once in the analysis. Willig used a complicated weighting procedure to compensate for this problem, but she may not have been entirely successful in this effort. While the practice of using the same study more than once is quite common in meta-analysis, it does seriously compromise the validity of the inferential statistical analysis.
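To make the notion of an effect size concrete, the following is a minimal sketch in Python. It assumes the common standardized mean difference (Cohen's d with a pooled standard deviation); the text does not say which formula Willig used, so this should not be read as her procedure. The function names, study labels, and numbers are all hypothetical. The second function shows one simple way to keep a study that contributes several effect sizes from being counted more than once, namely averaging within each study before averaging across studies; it is an illustration only, not a reconstruction of Willig's weighting procedure.

```python
from math import sqrt
from statistics import mean


def cohens_d(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardized mean difference (treatment minus comparison),
    scaled by the pooled standard deviation. Assumed formula."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / sqrt(pooled_var)


def per_study_means(effects):
    """Average the effect sizes within each study so that every study
    contributes a single value to the overall summary."""
    by_study = {}
    for study, d in effects:
        by_study.setdefault(study, []).append(d)
    return {study: mean(ds) for study, ds in by_study.items()}


# Hypothetical summary statistics for one program-versus-comparison contrast.
d_example = cohens_d(mean_t=52.0, sd_t=10.0, n_t=30,
                     mean_c=48.0, sd_c=10.0, n_c=30)

# Hypothetical effect sizes: study "A" contributes two, study "B" one.
effects = [("A", 0.4), ("A", 0.2), ("B", 0.1)]
study_means = per_study_means(effects)
overall = mean(list(study_means.values()))

print(d_example, study_means, overall)
```

Averaging within studies is only one of several ways to handle dependent effect sizes, and it discards information about how precisely each effect was estimated.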
Willig (1985) has also been criticized for controlling for variables "which partly eliminated the actual treatment effect" (Rossell and Baker, 1996:25). It may have been necessary to control for those variables.⁴ However, because there were small numbers of studies for some levels of these control variables, Willig's estimate of effect became quite unstable after adjustment for these variables. It should be noted, however, that her unadjusted effects parallel her adjusted effects.

⁴ Variables included the following: research design (random assignment versus correlational); type of measure (e.g., language versus mathematics, raw scores versus percentiles); language of measurement (English versus not English); and type of score (e.g., d versus derived d).

U.S. General Accounting Office (1987)

In the mid-1980s, Augustus Hawkins, Chairman of the Education and Labor Committee of the House of Representatives, asked GAO to evaluate the effectiveness of bilingual education programs. Perhaps to complete the report promptly, GAO surveyed 10 experts in the field and presented conclusions in the form of "7 of the 10 believe that.…" Most of the experts surveyed looked quite favorably upon educational policy that encourages the use of native-language instruction and were relatively critical of structured immersion. Moreover, most questioned the value of "aggregate program labels" (e.g., immersion or transitional bilingual education).

The U.S. GAO (1987) report is problematic because it attempts to find a consensus. While it is likely that the report captures the dominant view of the experts surveyed, there is not really a strong consensus about what is best for the education of English-language learners. If different experts had been chosen, different conclusions would have been drawn.

Summary of the Reviews

The beneficial effects of native-language instruction are clearly evident in programs that have been labeled "bilingual education," but they also appear in some programs that are labeled "immersion" (Gersten and Woodward, 1995). There appear to be benefits of programs that are labeled "structured immersion" (Baker and de Kanter, 1981; Rossell and Ross, 1986); however, a quantitative analysis of such programs is not yet available.

Based primarily on the Willig (1985) meta-analysis, the committee accepts the conclusion of the previous National Research Council panel (Meyer and Fienberg, 1992:105), noted earlier: "the panel still sees the elements of positive relationships that are consistent with empirical results from other studies and that support the theory underlying native language instruction." However, for numerous reasons, we see little value in conducting evaluations to determine which type of program is best. First, the key issue is not finding a program that works for all children and all localities, but rather finding a set of program components that works for the children in the community of interest, given the goals, demographics, and resources of that community. The focus needs to be on the proper contexts in which a program component is most effective and conversely, the contexts in which it may even be harmful. Second, many large-scale evaluations would likely suffer from the problem encountered in some previous national evaluations: the programs would be so loosely implemented that the evaluation would have no clear focus. Third, programs are not unitary, but a complex series of components. Thus we think it better to focus on components than on programs. As we argue later, successful bilingual and immersion programs may contain many common elements. Our view is shared by Ginsburg (1992:36): "…focusing evaluations on determining a single best method of language instruction for non-English-speaking children was probably the wrong approach to take."

Politicization of Program Evaluation

It is difficult to synthesize the program evaluations of bilingual education because of the extreme politicization of the process. Research always involves compromises, and because no study is perfect, every study has weaknesses. What has happened in this area of research is that most consumers of the research are not researchers who want to know the truth, but advocates who are convinced of the absolute correctness of their positions. Advocates care mainly about the results of a study. If its conclusions support their position, they note the study's strong points; if not, they note its weak points. Because there are studies that support a wide range of positions, advocates on both sides end up with plenty of "evidence" to support their position. Policymakers are justifiably troubled by the inability of the research evidence to resolve the debate.

A related issue is that very quickly a new study gets labeled as pro- or anti-bilingual education. What is emphasized in the debate is not the quality of the research or insights about school and classroom attributes that contribute to or hinder positive student outcomes, but whether the study is consistent with the advocate's position. Because advocacy is the goal, very poor studies that support an advocated position are touted as definitive. Consider two different examples.

In general, we are quite positive about the California Case Studies (discussed below, as well as in the next chapter). The project was never designed to be an evaluation, and funding that might have been used for evaluation was cut. Nonetheless, there have been attempts to document the effectiveness of this bilingual project; see especially the studies presented in Krashen and Biber (1988). Yet results of these studies are presented with very little documentation. For example, sample sizes are frequently not presented for the means given, and often there are no controls for socioeconomic status. As presented, the studies would not stand the test of evidence if they were submitted to a peer-reviewed journal.⁵ To their credit, Krashen and Biber (1988) admit the data do not "rigorously test" (p. 31) the effectiveness of bilingual education, but others have ignored these qualifications.

⁵ Rossell and Baker (1996) arrive at a similar appraisal.

Advocates of bilingual education are not alone in presenting less than scientifically acceptable evidence. Recently, extensive publicity has been given to an evaluation of the New York City bilingual program (Board of Education of the City of New York, 1994). That study compared the exit rates (how long children stayed in the program) and achievement of students in ESL and bilingual programs. The comparisons made between the two programs were seriously confounded with native language: most of the students in the bilingual program had Spanish as a native language, while the students in ESL had other language backgrounds. Ironically, the New York City study carefully documented the native-language confound, but made no attempt to control for this variable or for other confounds (e.g., socioeconomic status). In the preface, the authors describe the results of the study as "preliminary" and "ongoing." Yet advocates and the media accepted the conclusions of the report. The New York City evaluation has been heralded by advocates as providing "hard evidence" (Mujica, 1995) because it makes bilingual education look ineffective. Again study quality is ignored if the results support the advocate's position.

The politicization of this research area has a further harmful consequence. The hope is that evaluations will enter the policy debate as "principled arguments" (Abelson, 1995). However, because investigators very quickly become labeled as advocates of one position or another (even if they do not see themselves that way), the basis of their arguments is not perceived as principle but as politics. Scientists are seen as advocates.

The Future Course of Evaluation

It is easy to criticize previous program evaluations, but we need to realize that program evaluation was in its infancy when many of these studies were initially undertaken. During the past 25 years, the model of program evaluation has evolved considerably. There are several key elements in the current model (see Fetterman et al. [1995] for one such formulation). First, the initial focus is not on comparing programs, but on determining whether a given program is properly implemented and fine-tuning it so it becomes more responsive to the needs of children, the school, and the community. Once the program has been established, a summative evaluation with control groups is recommended. Second, instead of being a top-down process, the evaluation is more participatory, guided by students, staff, and the community (Cousins and Earl, 1992). Third, qualitative as well as quantitative methods are used (see Chapter 4).

Although program evaluation to date has yielded disappointing results, it would be a serious mistake to say we have learned nothing from the enterprise. We see that five general lessons have been learned from the past 25 years of program evaluation:

1. Higher-quality program evaluations are needed.
2. Local evaluations need to be made more informative.
3. Theory-based interventions need to be created and evaluated.
4. We need to think in terms of components, not politically motivated labels.
5. A developmental model needs to be created for use in predicting the effects of program components on children in different environments.

The following subsections expand on these five lessons.

of the AIR study). A long interval from pretest to post-test will increase the amount of missing data in the sample (see below).

Unit of Analysis

Even when there is random assignment, the child is generally not the unit assigned to the intervention, but rather the classroom, the school, or sometimes the district. A related issue is that children affect each other's learning in the classroom, and indeed, several recent educational innovations (e.g., cooperative learning) attempt to capitalize on this fact. Consideration then needs to be given to whether the child or some other entity is the proper unit of analysis.

Power

This factor concerns the probability of detecting a difference between treatment and control groups if there actually is one. Program evaluations must be designed so that there is sufficient power. In many instances, there may be insufficient resources to achieve an acceptable level of power. For instance, there may be only 50 children eligible for a study, but to have a reasonable chance of getting a significant result may require more subjects. Even if there are sufficient resources, the study may be too large to manage.

Missing Data

Typically, evaluations are longitudinal, and in longitudinal research, missing data are always a serious concern. Given the high mobility of English-language learners, attrition is an especially critical issue in these types of evaluations (Lam, 1992). A plan for minimizing and estimating the effect of missing data should be attempted. To some extent, the use of growth-curve modeling (Bryk and Raudenbush, 1992) and the computation of individual change trajectories can alleviate this problem.

Summary

Clearly, program evaluations are difficult. The above discussion indicates that there are often tradeoffs: to maximize one aspect of a study, another must be reduced.
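
The tradeoff involving power can be made concrete with a rough calculation. The sketch below is illustrative only and is not drawn from any of the evaluations reviewed here. It assumes a simple comparison of two group means, a two-sided test at the .05 level, a normal approximation, and a hypothetical standardized effect size of 0.3. The function names are also hypothetical. The last function shows the standard design-effect adjustment that applies when intact classrooms rather than individual children are assigned, using an assumed intraclass correlation of 0.10.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def power_two_groups(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a two-sided test comparing two group means.

    Uses the normal approximation; effect_size is the standardized
    mean difference (an assumed value, not an estimate from any study).
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def n_per_group_needed(effect_size, power=0.80, alpha=0.05):
    """Approximate number of children per group needed for the target power."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / effect_size) ** 2)


def design_effect(cluster_size, icc):
    """Variance inflation when intact classrooms, not children, are assigned."""
    return 1 + (cluster_size - 1) * icc


if __name__ == "__main__":
    # 50 eligible children split evenly, hypothetical effect size of 0.3
    print(round(power_two_groups(25, 0.3), 2))
    print(n_per_group_needed(0.3))
    # Classrooms of 25 with an assumed intraclass correlation of 0.10
    print(round(design_effect(25, 0.10), 2))
```

Under these assumptions, 50 children give power of roughly 0.19, whereas about 175 children per group would be needed to reach power of 0.80. A design effect of about 3.4 would shrink the effective sample size further if whole classrooms were the unit of assignment. Different assumed effect sizes or correlations would change these figures considerably.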
Although research always involves compromises and limitations, there must still be some minimum degree of quality for the research to be informative. Therefore, sometimes the most prudent choice is not to conduct a program evaluation, but to devote research efforts to determining whether a program is successfully implemented in the classroom and identifying the process by which the program leads to desirable outcomes (see the discussion of lesson 2). At the same time, researchers and policymakers still need to be creative in recognizing opportunities for evaluation.

Lesson 2: More Informative Local Evaluations

American education has always been characterized by local control, and the current trend is for this to intensify. Evaluations are increasingly emphasizing local involvement as well (Fetterman et al., 1995). Contemporary evaluation methods can assist local educators in planning, implementing, and assessing programs. Evaluation needs to be viewed as a tool for program improvement, not as a bureaucratic obligation. Local evaluation efforts need to focus on methods for improving program design and implementation (Ginsburg, 1992; Meyer and Fienberg, 1992). Lam (1992:193) makes the following recommendation: "It seems reasonable to urge local educators and administrators to use the majority of the evaluation budgets for formative purposes—that is, to document and guide full implementation of the program design, including the analysis of problems arising when the school's capacity to actually implement the proposed program is being developed." Title VII legislation explicitly encourages this type of evaluation (Section 7123b).

Federal and state governments might monitor local evaluations more closely. School districts that present evidence for successfully implemented models should receive grants for outcome evaluation. While we do not believe in enforcing standardization across sites in these evaluations, there should be attempts to encourage collaboration that would allow pooling of results. There are successful examples in other areas of human resource evaluation in which there is local control, but comparable measures and designs are used to allow for data aggregation.

It should also be noted that both large-scale and local evaluations have their limitations. With smaller-scale evaluations, it is easier to monitor the effort, keep track of implementation, and institute procedures to minimize missing data. However, small evaluations are plagued with insufficient sample sizes and sometimes insufficient program variation. Moreover, results in one community may not be generalizable to other communities. Just as national evaluations were oversold in the 1970s and 1980s, we do not wish now to oversell local evaluations.
We expect program effects to interact with site and community characteristics.[8] Although some site effects will be random and inexplicable, others will be systematic. If enough sites can be studied, an understanding of the necessary conditions for successful programs can be developed. One statistical technique that is ideally suited for the analysis of within-site effects is hierarchical linear modeling (Bryk and Raudenbush, 1992), which can be used to test whether there are site interactions and what factors can explain them. Basically, any technique that permits conditioning on content variables will help.

[8] A good example of effects varying by site is presented by Samaniego and Eubank (1991). They tested basic theory using the California Case Studies (see below) in four school districts, and results varied considerably across sites.

Lesson 3: Creation and Evaluation of Theory-Based Interventions

Programs should be designed so they are consistent with what is known about basic learning processes. The studies and programs described in this section are based on a theory of second-language learning and its relationship to student achievement and successful educational practice. The theory is tested through implementation in a classroom setting. While none of the examples described here is perfect, each has aspects that are exemplary.

California Case Studies

In 1980, the California State Department of Education applied a theory-based model for bilingual education (Gold and Tempes, 1987). The program, which came to be known as the California Case Studies, was a collaborative effort among researchers, local educators, and the California State Department of Education. The program began with a declaration of principles (see Chapter 7), many of which are based on research results reviewed in Chapters 2 through 4. Five elementary schools serving large numbers of Spanish-speaking students were selected for participation in the program. In the program, students are provided with substantial amounts of instruction in and through the native language; comprehensible second-language input is provided through both ESL classes and sheltered classes in academic content areas; and teachers attempt to equalize the status of second-language learners by treating English-language learners equitably, using cooperative learning strategies, enrolling language majority students in second-language classes, and using minority languages for noninstructional purposes.

While this program is exemplary in its application of principles based on well-established basic research and its collaborative effort between educators and researchers, very few of its results have been published in peer-reviewed journals.[9] An exception to this is the work of Gersten and colleagues (Gersten, 1985; Gersten et al., 1984; Gersten and Woodward, 1995).

Immersion Programs

In a series of studies, Gersten and colleagues (Gersten, 1985; Gersten et al., 1984; Gersten and Woodward, 1995) have tested the effectiveness of immersion programs.
The program they developed is a blend of ideas from bilingual and immersion programs, hence their use of the term "bilingual immersion."[10] Gersten and Woodward (1995:226) define this program as follows: "This approach retains the predominant focus on English-language instruction from the immersion model but tempers it with a substantive 4-year Spanish language program so that students maintain their facility with their native language." In a major 4-year study (Gersten and Woodward, 1995), 228 children in El Paso, Texas, were placed in either bilingual immersion (as defined above) or transitional bilingual programs. Children were followed from fourth through seventh grades. While differences found in language and reading ability in the early years favored the bilingual immersion approach, those differences seemed to vanish in the later years. However, almost all of the bilingual immersion children had been mainstreamed by the end of the program, while nearly one-third of the transitional bilingual children had not.[11]

[9] The program was never intended as an evaluation, and funding was cut at the end of the project, which made evaluation more difficult.

[10] Although Gersten and Woodward label this program "bilingual immersion," it should not be confused with two-way bilingual programs (also called "bilingual immersion programs," in which students are provided with subject matter instruction in their native language[s]).

Two-Way Programs

Two-way bilingual or two-way immersion programs integrate language-minority and language-majority students for instruction in and through two languages: the native language of the language-minority students and English. The rationale for two-way programs as an effective model for all students comes from principles underlying bilingual education for language-minority students and foreign-language immersion for English-speaking students. The blending of these two approaches is predicted to result not only in high achievement for both groups of students, but also in improved cross-cultural understanding as a benefit of positive interactions in the classroom (see Chapter 4).

Despite the fairly elaborate theoretical justification for two-way programs, there has not been uniformity in the programs that have been implemented. The California State Department of Education developed a model in the mid-1980s, which was implemented in several schools and became the basis for one of the primary variants of two-way programs (referred to as "90-10" because 90 percent of instruction is provided in the non-English language in the first years of the program). Although the approach may be defined generally as the integration of students from two different language backgrounds in a classroom where both languages are used for instruction, there is currently tremendous variability in program implementation (Christian, 1994).
While flexibility is clearly needed to adapt any program to local conditions, there has been little research directed at understanding the consequences of programmatic decisions. Moreover, there are likely to be requirements that must be satisfied before a two-way program can be instituted. Several key parameters of variation include proportion of students from the two language backgrounds in the classroom, amount of instructional time provided in the two languages, practices related to screening students and admitting newcomers to a cohort after the first year, language choice for initial literacy instruction, language use in the classroom, and whether students enter the program on a voluntary basis or are assigned by school personnel.

In a review of evaluation studies of two-way programs, Mahrer and Christian (1993) found a great deal of variation in both the assessments used and the outcomes. Many programs showed evidence of positive language proficiency and academic achievement outcomes for both native and non-native English speakers, but most of these studies used designs in which there was no comparison group. When comparison groups are available, evaluations typically show that English-language learners in two-way programs outperform those in other programs.

[11] The statistical analysis of the data from this program by Gersten and Woodward (1995) has been less than optimal. Generally, analysis of variance is not appropriate for longitudinal studies. Moreover, growth curve analysis (Bryk and Raudenbush, 1992) can often provide a much more detailed picture of process. However, correcting these statistical problems would probably not result in major changes in the study's conclusions.

Summary

Very often there is a fine line between theory- and advocacy-based program evaluations. Some of the examples in this section might be considered advocacy based. We see an important difference between theory- and advocacy-based programs. First, the former type of program is grounded in a theory about the learning of a second language and its relationship to student achievement, not solely in a social or political philosophy. Second, the educational curriculum is designed to implement the theory in a school setting. Third, the educational outcomes of children are used to test the theory; the program evaluation tests both the basic theory and the educational intervention.

Lesson 4: Thinking in Terms of Components, Not Political Labels

Historically, programs are described as unitary; a student is either in a program or not. The current debate on the relative efficacy of immersion and bilingual education has been cast in this light. However, as noted above, we need to move away from thinking about programs in such broad terms and instead see them as containing multiple components, that is, features that are available to meet the differing needs of particular students. Thus two students in the same program could receive different elements of the program. Moreover, programs that are nominally very different (especially the most successful ones) may have very similar characteristics (see Chapter 7).
These common characteristics include the following:

• Some native-language instruction, especially initially
• For most students, a relatively early phasing in of English instruction
• Teachers specially trained in instructing English-language learners

Lesson 5: Creation of a Developmental Model

A general formal model is needed to predict children's development of linguistic, social, and cognitive skills. The foundation of this model would be derived from basic research reviewed in Chapters 2 through 4, including theories of linguistic, cognitive, and social development. The model would predict exact nonlinear growth trajectories for the major abilities: not only the mean or typical trajectories, but also their variability. It would be flexible enough to allow for the introduction of a second language and would explicitly address possible transfer and interference for different first languages, as well as educational issues for new immigrants.

Learning takes place in specific environments, and these would be explicitly considered in the model as well. The environment would serve as a moderator that accelerates or decelerates the child's development. Among the school environment variables would be classroom composition, teachers, and school climate. Family and community variables would also serve as moderators. From a policy perspective, the most important moderators would be program inputs, for example, bilingual education and English immersion, understood not in an idealized sense but in terms of instructional practices, such as percentage of first-language instruction. Moreover, the model would predict interactions between the effectiveness of program features and student and environmental characteristics.

The creation of such a model would require collaboration among basic researchers, statisticians, and educators. It would likely occur in stages. The model would be so complex that it would have to be computer simulated. It should be able not only to explain results that are currently well established, but also to make predictions about results that have not yet been obtained and those that are unexpected. The model would be much too large to be testable in its entirety, but should be specific enough to be testable in narrow contexts.

A research effort geared toward developing such a model is that of Thomas and Collier (1995).[12] Using data from the immersion studies discussed above and data collected since then from other school districts, Thomas and Collier sketch approximate growth curves for different programs. The model we envision would be much more extensive in that it would predict individual (as opposed to program) growth, as well as interactions between programs and child characteristics.
Moreover, we would hope that programs in the model would be replaced by program features (see lesson 4).

[12] Findings from this study are not discussed in this report because the study had not been completed or published prior to the report's publication.

Research Needs

The research needs identified in this section are different in nature from those presented in Chapters 2 through 5 and 7 through 9. The reason for this is that research is not needed on evaluation per se; rather, program evaluations need to be conducted differently if we are to learn from the programs and practices we implement. Our recommendations for improving the conduct of program evaluations correspond to the lessons presented above.

Lesson One

6-1. To ensure high-quality program evaluations, attention is needed to the following factors: program design, program implementation, creation of a control group, group equivalence, measurement, unit of analysis, power, and missing data.

Lesson Two

6-2. Local evaluation efforts should focus initially on formative purposes, that is, documenting and guiding full implementation of the program design. Once successful implementation has been demonstrated, outcome evaluations can be conducted. Finally, if enough sites can be studied, an understanding of necessary conditions for successful programs can be developed.

6-3. Local evaluations need to focus first on determining whether programs are properly implemented and on fine-tuning those programs so they become more responsive to the needs of children, the school, and the community. Only once a program has been documented as established should a summative evaluation with control groups be undertaken. Both quantitative and qualitative methods are needed to accomplish a total program evaluation.

6-4. Mandated Title VII evaluations should be monitored more closely. School districts that present evidence for successfully implemented models could receive grants for outcome evaluation. Alternatively, the Department of Education could provide competitive grants or contracts for technically skilled outsiders to conduct local evaluations of their efforts across various sites. Results might be pooled for sites where the characteristics of the programs are similar, these programs have been successfully implemented, and outcomes are similar, to provide more generalizable evidence of effectiveness.

6-5. Broadly conceived assessment should be built into the educational process to assist in the monitoring of the progress of students, teachers, programs, classrooms, and schools.

Lesson Three

6-6. Educational practice in general and program evaluation in particular require the specification of explicit goals, a set of instructional practices and curriculum, and methods for implementing those practices in the schools and classrooms.

6-7. Basic researchers need to develop curriculum and instructional techniques following from theoretically grounded research and theory. Collaborating with school systems, the researchers would then implement and test these models in the schools.

Lesson Four

6-8. We recommend evaluations that are locally conceived, based, and conducted. Evaluations of the relative efficacy of broad programs that are loosely implemented in a wide range of settings are likely to yield little information about what interventions are effective.

Lesson Five

6-9. An interdisciplinary formal model of development should be formulated to predict the growth of individual children's skills and the moderation of that growth as a function of program features.

References

Abelson, R.P. 1995 Statistics as Principled Argument. Hillsdale, NJ: Erlbaum.
Baker, K.A., and A.A. de Kanter 1981 Effectiveness of Bilingual Education: A Review of the Literature. Washington, DC: U.S. Department of Education.
Baker, K.A., and S. Pelavin 1984 Problems in Bilingual Education. Paper presented at annual meeting of American Educational Research Association, New Orleans, LA. American Institutes for Research, Washington, DC.
Board of Education of the City of New York 1994 Educational Progress of Students in Bilingual and ESL Programs: A Longitudinal Study, 1990-1994. New York.
Bryk, A.S., and S.W. Raudenbush 1992 Hierarchical Linear Models: Applications and Data Analysis Methods. Newbury Park, CA: Sage.
Burkheimer, Jr., G.J., A.J. Conger, G.H. Dunteman, B.G. Elliott, and K.A. Mowbray 1989 Effectiveness of Services for Language-minority Limited-English-proficient Students. 2 vols. Technical Report. Research Triangle Park, NC: Research Triangle Institute.
Bushman, B.J., and M.C. Wang 1996 A procedure for combining sample standardized mean differences and vote counts to estimate the population standardized mean difference in fixed effects models. Psychological Methods 1:66-80.
Campbell, D.T., and J.C. Stanley 1963 Experimental and quasi-experimental designs for research in teaching. In N.L. Gage, ed., Handbook of Research on Teaching. Chicago: Rand-McNally.
Christian, D. 1994 Two-Way Bilingual Education: Students Learning Through Two Languages. Education Practice Report No. 12. Santa Cruz, CA, and Washington, DC: National Center for Research on Cultural Diversity and Second Language Learning.
Cousins, J.B., and L.M. Earl 1992 The case for participatory evaluation. Educational Evaluation and Policy Analysis 14:397-418.

Crawford, J. 1995 Bilingual Education: History, Politics, Theory and Practice. Los Angeles: Bilingual Educational Services.
Dannoff, M.N. 1978 Evaluation of the Impact of ESEA Title VII Spanish-English Bilingual Education Programs. Technical Report. Washington, DC: American Institutes for Research.
Development Associates 1984 Overview of the Research Design Plans for the National Longitudinal Study of the Effectiveness of Services for Language Minority Students. Arlington, VA: Development Associates.
Engle, P. 1975 The use of the vernacular language in education. Bilingual Education Series No. 2. Washington, DC: Center for Applied Linguistics.
Fetterman, D.M., S.J. Kaftarian, and A. Wandersman, eds. 1995 Empowerment Evaluation: Knowledge and Tools for Self-Assessment and Accountability. Thousand Oaks, CA: Sage.
Gersten, Russell 1985 Structured immersion for language minority students: Results of a longitudinal evaluation. Educational Evaluation and Policy Analysis 7(3):187-196.
Gersten, Russell, and John Woodward 1995 A longitudinal study of transitional and immersion bilingual education programs in one district. Elementary School Journal 95(3):223-239.
Gersten, Russell, R. Taylor, J. Woodward, and W.A.T. White 1984 Structured English Immersion for Hispanic Students in the U.S.: Findings from the Fourteen-year Evaluation of the Uvalde, Texas, Program. Technical Report 84-1, Follow Through Project. Eugene: University of Oregon.
Ginsburg, A.L. 1992 Improving bilingual education programs through evaluation. Pp. 31-42 in Proceedings of the Second National Research Symposium on Limited English Proficient Student Issues: Focus on Evaluation and Measurement. Vol. 1. OBEMLA. Washington, DC: U.S. Department of Education.
Gold, T., and F. Tempes 1987 A State Agency Partnership with Schools to Improve Bilingual Education. Paper presented at the annual meeting of the American Educational Research Association, Washington, DC. California State Department of Education.
Krashen, S., and D. Biber 1988 On Course: Bilingual Education's Success in California. Sacramento: California Association for Bilingual Education.
Lam, Tony C.M. 1992 Review of practices and problems in the evaluation of bilingual education. Review of Educational Research 62(2):181-203.
Lord, F.M. 1967 A paradox in the interpretation of group comparisons. Psychological Bulletin 68:304-305.
Mahrer, C., and D. Christian 1993 A Review of Findings from Two-Way Bilingual Education Evaluation Reports. Santa Cruz, CA, and Washington, DC: National Center for Research on Cultural Diversity and Second Language Learning.

Meyer, M.M., and S.E. Fienberg, eds. 1992 Assessing Evaluation Studies: The Case of Bilingual Education Strategies. Panel to Review Evaluation Studies of Bilingual Education, Committee on National Statistics, National Research Council. Washington, DC: National Academy Press.
Mujica, B. 1995 Findings of the New York City longitudinal study: Hard evidence on bilingual and ESL programs. READ Perspectives 2:7-34.
Ramirez, D.J., S.D. Yuen, D.R. Ramey, and D.J. Pasta 1991 Final Report: National Longitudinal Study of Structured-English Immersion Strategy, Early-Exit and Late-Exit Transitional Bilingual Education Programs for Language-Minority Children, Vols. I and II, Technical Report. San Mateo, CA: Aguirre International.
Rossell, Christine H., and Keith Baker 1996 The educational effectiveness of bilingual education. Research in the Teaching of English 30(1):7-74.
Rossell, Christine H., and J. Michael Ross 1986 The social science evidence on bilingual education. Journal of Law and Education 15(4):385-419.
Samaniego, F., and L. Eubank 1991 A Statistical Analysis of California's Case Study Project in Bilingual Education TR #208. Intercollegiate Division of Statistics. Davis: University of California.
Secada, Walter G. 1987 This is 1987, not 1980: A comment on a comment. Review of Educational Research 57(3):377-384.
Thomas, W., and V. Collier 1995 Language Minority Student Achievement and Program Effectiveness. Washington, DC: National Clearinghouse for Bilingual Education.
U.S. Department of Education 1991 The Condition of Bilingual Education in the Nation: A Report to the Congress and the President. Office of the Secretary. Washington, DC: U.S. Department of Education.
U.S. Government Accounting Office 1987 Bilingual Education: A New Look at the Research Evidence. Briefing report to the Chairman, Committee on Education, Labor, House of Representatives, GAO/PEMD-87-12BR. Washington, DC.
Willig, A.C. 1985 A meta-analysis of selected studies on the effectiveness of bilingual education. Review of Educational Research 55(3):269-317.
Zappert, L.T., and B.R. Cruz 1977 Bilingual Education: An Appraisal of Empirical Research. Berkeley, CA: BAHIA Press.

STUDIES OF SCHOOL AND CLASSROOM EFFECTIVENESS: SUMMARY OF THE STATE OF KNOWLEDGE

The literature on school and classroom effectiveness provides the following key findings:

• The studies reviewed here provide some evidence to support the "effective schools" attributes identified nearly 20 years ago, with at least two important qualifications:
  - The studies challenge the conceptualization of some of those attributes, for example, the idea that implementing characteristics of effective schools and classrooms makes schools and classrooms more effective.
  - The studies suggest that factors not identified in the effective schools literature may be important as well if we are to create schools where English-language learners, indeed all students, will be successful and productive. Examples of such factors are a focus on more than just basics, ongoing staff development, and home-school connections.
• The following attributes are identified as being associated with effective schools and classrooms: a supportive school-wide climate, school leadership, a customized learning environment, articulation and coordination within and between schools, some use of native language and culture in the instruction of language-minority students, a balanced curriculum that incorporates both basic and higher-order skills, explicit skills instruction, opportunities for student-directed activities, use of instructional strategies that enhance understanding, opportunities for practice, systematic student assessment, staff development, and home and parent involvement.
• Although suggestive of key attributes that are important for creating effective schools and classrooms, most studies reviewed here cannot give firm answers about any particular attribute and its relationship to student outcomes. For example, the nominated schools designs do not report data on student outcomes and are thus inconclusive. Prospective case studies lack comparison groups, so that changes in student outcomes may be due to extraneous factors. And while quasi-experimental studies that focus on an entire program provide the strongest basis for claims about program or school effects, they make direct claims only about the program or school effect overall. Claims about the effects of specific components must, in general, rest on other studies that examine those components explicitly.