6
Program Evaluation

Of the types of programs for English-language learners reviewed in Chapter 1, the most commonly studied are those that use the native language for some limited period of time for core academics (e.g., transitional bilingual education programs) and those that do not use the native language in any regular or systematic way (i.e., English as a second language [ESL] and its variants, such as structured immersion and content-based ESL, as well as "submersion programs"). During the 1970s and 1980s, the federal government and advocates were keenly interested in determining which of these two models is more effective. Program evaluations were intended to provide a definitive answer to this question. This chapter examines what we know from program evaluations conducted to date. Chapter 7 reviews studies of school and classroom effectiveness.

FINDINGS

This section begins by reviewing national-level evaluations of programs for English-language learners and then examines reviews of smaller-scale program evaluations. This is followed by a discussion of the politicization of program evaluation. The final subsection addresses the future course of program evaluation, presenting five lessons learned that can lead to better, more useful evaluations.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




National Evaluations

There have been three large-scale national evaluations of programs for English-language learners: the American Institutes for Research evaluation of programs, referred to as the AIR study (Dannoff, 1978); the National Longitudinal Evaluation of the Effectiveness of Services for Language Minority Limited English Proficient Students, referred to as the Longitudinal Study (Development Associates, 1984; Burkheimer et al., 1989); and the Longitudinal Study of Immersion and Dual Language Instructional Programs for Language Minority Children, referred to as the Immersion Study (Ramirez et al., 1991). This section summarizes these three studies, as well as the findings of a National Research Council report (Meyer and Fienberg, 1992) that reviews two of the three.

The AIR Study

The AIR study compared students enrolled in Title VII Spanish/English bilingual programs and comparable students (in terms of ethnicity, socioeconomic status, and grade level) not enrolled in such programs. For the study, 8,200 students were measured twice during the school year on English oral comprehension and reading, math, and Spanish oral comprehension and reading. Generally, the results from this study showed that students in bilingual education programs did not gain more in academic achievement than students not in such programs. However, the study was the subject of a great deal of criticism, the major criticism questioning the strength of the treatment control group comparison (Crawford, 1995). The study did not compare bilingual education with no bilingual education because two-thirds of the children in the control group had previously been in bilingual programs. In part because of the ambiguity of the conclusions from the AIR study, the other two major longitudinal studies discussed in this section were commissioned by the U.S. Department of Education in 1983 to look at program effectiveness.

The Longitudinal Study

The Longitudinal Study had two distinct phases. The descriptive phase examined the variety of services provided to English-language learners in 86 schools. Some of its major findings were as follows:

The vast majority of English-language learners came from economically disadvantaged families.

English-language learners were at risk educationally, with most performing below grade level.

Most instruction of the children was in English or in some combination of English and the native language.

The goal of the second phase of the study was a longitudinal follow-up to determine the relative effectiveness of programs. The focus was on 25 schools from the first phase, with both English-language learners and English-proficient students from kindergarten to fifth grade being followed over 3 years (Burkheimer et al., 1989). Because schools for the follow-up phase were initially selected on the basis of having representative populations of interest and not on the basis of program differences, there was not much variation in the programs represented in this phase. Moreover, some of the schools had no children in the control group. In an attempt to compensate for the weakness of the study design, elaborate statistical techniques were used. Given these limitations, one should be cautious about interpreting the study findings. The major findings can be summarized as follows (Burkheimer et al., 1989):

Assignment of English-language learners to specific instructional services reflects principally local policy determinations. To a much more limited extent (mainly in the lower grades), assignment may be based on "deliberate" (and apparently criterion-driven) placement of students in the specific services for which they are ready.

Too heavy a concentration in any one aspect of the English-language learner's education (e.g., reading) can detract from achievement in other areas (e.g., math or science).

The yearly achievement of English-language learners in math and English-language arts is facilitated by different approaches, depending on student background factors. For example, students who are relatively English proficient are better able to benefit from English language arts instruction given in English, whereas students who are weak in English and/or strong in their native language eventually show better English language arts achievement when instructed in their native language.

In the later grades, proficiency in mathematics when tested in English seems to require proficiency in English. This is not the case in the lower grades.

As in the assignment of students to specific services for English-language learners, exit from those services reflects both local policy and specific criterion-driven exit rules related to reaching certain levels of proficiency/achievement in English. Children are more likely to be exited from English-language learner services if these services are similar to those for English-proficient students.

The Immersion Study

The Immersion Study, conducted by Aguirre International, was a much more focused study of program alternatives than the Longitudinal Study. It attempted a quasi-experimental longitudinal comparison of three types of programs: English-only immersion, early-exit bilingual (also known as transitional bilingual), and late-exit bilingual (also known as maintenance bilingual). The study took place at nine sites, but five of these had only one of the three types of programs (Ramirez et al., 1991). In fact, the late-exit bilingual program was completely confounded with site, being the only program implemented at three sites. Despite sophisticated statistical models of growth, the conclusions from the study are seriously compromised by the noncomparability of sites. The major findings of the comparison of program types were summarized by the U.S. Department of Education (1991).

After 4 years in their respective programs, English-language learners in immersion strategy and early-exit programs demonstrated comparable skills in mathematics, language, and reading when tested in English.

There were differences among the three late-exit sites in achievement level in the three subjects: students at the sites with the most use of Spanish and the most use of English ended sixth grade with the same skills in English language and reading; students at the two late-exit sites that used the most Spanish showed higher growth in mathematics skills than those at the site that abruptly transitioned into almost all-English instruction.

Students in all three programs realized a growth in English-language and reading skills that was as rapid or more so than the growth that would have been expected for these children had they not received any intervention.

National Research Council Report

Both the Longitudinal and Immersion studies were reviewed by a National Research Council panel of the Committee on National Statistics
(Meyer and Fienberg, 1992). The primary focus of the panel's report is on determining whether the statistical methods used in those studies were appropriate. The report identified important flaws in these major efforts, including the following:

The formal designs of the Longitudinal and Immersion studies were ill suited to answering the important policy questions that appear to have motivated them.

Execution and interpretation of these studies, especially the Longitudinal Study, were hampered by a lack of documentation of the study objectives, the operationalizing of conceptual details, actual procedures followed, and changes in all of the above.

Because of the poor articulation of study goals and the lack of fit between discernible goals and research design, it is unlikely that additional statistical analyses of the data would yield results central to the policy questions these studies were originally intended to address.

Both the Longitudinal and Immersion studies suffered from excessive attention to the use of elaborate statistical methods intended to overcome the shortcomings in the research designs.

Although the samples in the Immersion study were followed longitudinally, later comparisons are not considered valid because of sample attrition.

Quite clearly the Longitudinal and Immersion studies did not provide decisive evidence about the effectiveness of bilingual education programs. However, according to the National Research Council report, findings from the comparisons that were most sound with respect to study design and sample characteristics indicate that kindergarten and first grade students who received academic instruction in Spanish had higher achievement in reading in English (at kindergarten and grade 1) than comparable students who received academic instruction in English.

Reviews of Smaller-Scale Evaluations

Five key reviews of smaller-scale program evaluations were examined:

Baker and de Kanter (1981) reviewed 28 studies of programs designed for English-language learners that reported evaluations considered
to be methodologically sound.1 Based on their review, they concluded that "the case for the effectiveness of transitional bilingual education is so weak that exclusive reliance on this instruction method is clearly not justified" (p. 1).

1    To be considered methodologically sound, studies had to employ random assignment of children to treatment conditions or take measures to ensure that children in the treatment groups were equivalent. Studies with no comparison group were rejected.

Rossell and Ross (1986) and Rossell and Baker (1996) considered studies that evaluated alternative second-language programs. Their review included only studies that had random assignment to programs or statistical control for pretreatment differences between groups when random assignment was not possible. They concluded that the evidence from these studies did not support the superiority of transitional bilingual education for English-language learners.

Willig (1985) conducted a meta-analysis of studies reviewed by Baker and de Kanter. In contrast with previous reviews, her analysis quantitatively measured the program effect in each study, even if it was not statistically significant. Her overall conclusion is quite different from that of the previous reviews: "positive effects for bilingual programs … for all major academic areas" (p. 297). However, as she notes, she did not compare bilingual education programs with other programs, but only contrasted program versus no-program studies.

The U.S. General Accounting Office (1987) surveyed 10 experts in the field to gauge the effectiveness of bilingual education programs. Most of the experts surveyed looked quite favorably upon educational policy that encourages the use of the native language and were critical of structured immersion. Moreover, most questioned the value of "aggregate program labels" (e.g., immersion or transitional bilingual education) because such labels fail to capture fully the instructional activities and context at each site.

In summary, the beneficial effects of native-language instruction are clearly evident in programs that have been labeled "bilingual education," but they also appear in some programs that are labeled "bilingual immersion" (Gersten and Woodward, 1995). There appear to be benefits of programs that are labeled "structured immersion" (Baker and de Kanter, 1981; Rossell and Ross, 1986); however, a quantitative analysis of such programs is not yet available. Based primarily on the Willig (1985) meta-analysis, this report supports the conclusion of the previous National Research Council panel discussed earlier: "The panel still sees the elements of positive relationships that are consistent with empirical results from other studies and that support the theory underlying native language instruction" (Meyer and Fienberg, 1992:105).

However, for numerous reasons, we see little value in continuing to focus evaluations on the question of which type of program is best. First, the key to program improvement is not in finding a program that works for all children and all localities, but rather finding a set of program components that works for the children in the community of interest, given the goals, demographics, and resources of that community. The focus needs to be on the proper contexts in which a program component is most effective and conversely, the contexts in which it may even be harmful. Second, many large-scale evaluations would likely suffer from the problem encountered in some previous national evaluations: the programs would be so loosely implemented that the evaluation would have no clear focus. Third, programs are not unitary, but a complex series of components. Thus we think it better to focus on components than programs and on the needs of the local setting. As we argue later, successful bilingual and immersion programs may contain many common elements.

Politicization of Program Evaluation

It is difficult to synthesize the program evaluations of bilingual education because of the extreme politicization of the process. Research always involves compromises, and because no study is perfect, every study has weaknesses. What has happened in this area of research is that most consumers of the research are not researchers who want to know the truth, but advocates who are convinced of the absolute correctness of their positions. Advocates care mainly about the results of a study. If its conclusions support their position, they note the study's strong points; if not, they note its weak points. Because there are studies that support a wide range of positions, advocates on both sides end up with plenty of "evidence" to support their position. Policymakers are justifiably troubled by the inability of the research evidence to resolve the debate.

A related issue is that very quickly a new study gets labeled as pro- or anti-bilingual education. What is emphasized in the debate is not the quality of the research or insights about school and classroom attributes that contribute to or hinder positive student outcomes, but whether the study is consistent with the advocate's position. Because advocacy is the goal, very poor studies that support an advocated position are touted as definitive.

Future Course of Program Evaluation

It is easy to criticize previous program evaluations, but we need to realize that program evaluation was in its infancy when many of these studies were initially undertaken. During the past 25 years, the model of program evaluation has evolved considerably. There are several key elements in the current model (see Fetterman et al. [1995] for one such formulation). First, the initial focus is not on comparing programs, but on determining whether a given program is properly implemented and fine tuning it so it becomes more responsive to the needs of children, the school, and the community. Once the program has been established, a summative evaluation with control groups is recommended. Second, instead of being a top-down process, the evaluation is more participatory, guided by students, staff, and the community (Cousins and Earl, 1992). Third, qualitative as well as quantitative methods are used (see Chapter 4).

Although program evaluation to date has yielded disappointing results, it would be a serious mistake to say we have learned nothing from the enterprise. Five general lessons learned from the past 25 years of program evaluation follow.

Lesson 1: Higher-Quality Program Evaluations

The following factors are critical to high-quality program evaluations: program design, program implementation, creation of a control group, group equivalence, measurement, unit of analysis, power, and missing data.

Program Design

A program should have clearly articulated goals. Although scientific research can play a role in determining intermediate goals, program goals are generally determined by the school community. For instance, some communities may place a premium on students maintaining a native language, whereas others may prefer to encourage only the speaking of English. Once the program goals have been set, curriculum must be found or created, staffing requirements determined, and staff development procedures implemented. The program should be designed using basic theory (see the discussion of lesson 3 below), but should also be practical enough to be implemented in the schools.

Program Implementation

Many programs created for English-language learners by government, schools, researchers, and courts have not been
fully implemented. An evaluation without evidence of successful implementation is an evaluation of an unknown quantity. Demonstration of program implementation requires more than examining the educational background of teachers and the completion of forms filled out by administrators. While interviews with teachers and students can provide an approximate fix on what is actually being delivered, the best approach is to observe what teachers and students do in the classroom (see Chapter 7).

Examining program implementation offers several advantages. First, it encourages thinking of the program not as unitary (e.g., bilingual education), but as a series of components; one can then determine whether each of these components has been implemented (see the discussion of lesson 4 below). Second, it allows for the measurement of processes that would otherwise not be measured, such as opportunity to learn. Third, if the implementation data are measured for the same children for whom outcome data are measured, it is possible to analyze the process by which program features are translated into outcomes.

Creation of a Control Group

Even when a program has clearly articulated goals, is based on sound theory, and is adequately implemented, a program evaluation is of little value if one does not know what experiences the children in the control group have had. Identifying control groups may be difficult. Because of legislative, judicial, and educational constraints, an untreated group may be difficult to find. Moreover, the researcher should not assume that just because children are not currently receiving an intervention, they never have. The current and past experiences of children in the control group need to be carefully documented.

One might suppose that an emphasis on standards precludes the need for a control group. While it may be important to examine whether students meet performance standards, we still need to know whether a program improves performance over what was achieved under a previous program. Moreover, given the economic background of most English-language learners (see Chapter 1) and the likely heavy English load in most testing (see Chapter 5), the use of standards could create an unduly pessimistic appraisal of these children. While high standards are the ultimate goal, they will likely have to be reached gradually.

Group Equivalence

Program evaluation involves comparison of an experimental and a control group. These two groups should be demographically and educationally equivalent. Equivalent groups are guaranteed by random assignment. Because of legislative, judicial, and administrative
constraints, random assignment of students to conditions may not generally be feasible; nonetheless, we urge vigilance in attempting to find opportunities for random assignment. When random assignment is not feasible, other ways must be found to ensure that the groups are similar. As recommended by Meyer and Fienberg (1992), there is a greater likelihood of equivalence if the control group students can be selected from the same school. If another classroom, school, or school district must be chosen, it should be as similar as possible to the treated units. Researchers need to ascertain whether the groups are equivalent before the intervention begins. The best way to do this is to measure the children in both groups to obtain a baseline measure. Ideally, there should be little or no difference at baseline.2 If there are differences, statistical analysis can be used to make the groups more similar,3 but it cannot be expected to make them truly equivalent.

2    Ensuring equivalence by matching individual scores at the pretest only appears to create equivalence (Campbell and Stanley, 1963).

3    There is considerable controversy about how to adjust statistically for baseline differences (e.g., Lord, 1967). This controversy reinforces the point that the presence of baseline differences seriously compromises the persuasiveness of the evaluation.

Measurement

The difficult issues of student assessment have already been discussed in Chapter 5. However, we note that longitudinal assessment of English-language learners virtually guarantees that different tests will be taken by different groups of children, making it necessary to equate the tests. The timing of measurement is important. The baseline or pretest measure should occur before the program begins, and the post-test should occur after the program has been completed and its potential effects are evident. A long interval from pretest to post-test will increase the amount of missing data in the sample (see below).

Unit of Analysis

Even when there is random assignment, the child is generally not the unit assigned to the intervention, but rather the classroom, the school, or sometimes the district. A related issue is that children affect each other's learning in the classroom, and indeed, several recent educational innovations (e.g., cooperative learning) attempt to capitalize on this fact. Consideration then needs to be given to whether the child or some other entity is the proper unit of analysis.

Power

This factor concerns the probability of detecting a difference between treatment and control groups if there actually is one. Program
evaluations must be designed so that there is sufficient power. In many instances, there may be insufficient resources to achieve an acceptable level of power. For instance, there may be only 50 children eligible for a study, but to have a reasonable chance of getting a statistically reliable result may require more children in the sample. Even if there are sufficient resources, the study may be too large to manage.

Missing Data

Typically, evaluations are longitudinal, and in longitudinal research, missing data are always a serious concern. Given the high mobility of English-language learners, attrition is an especially critical issue in these types of evaluations (Lam, 1992). Evaluators should plan both to minimize missing data and to estimate their effect. To some extent, the use of growth-curve modeling (Bryk and Raudenbush, 1992) or the computation of individual change trajectories can alleviate this problem.

Summary

Clearly, program evaluations are difficult. The above discussion indicates that there are often tradeoffs: to maximize one aspect of a study, another must be reduced. Although research always involves compromises and limitations, there must still be some minimum degree of quality for the research to be informative. Therefore, sometimes the most prudent choice is not to conduct a program evaluation, but to devote research efforts to determining whether a program is successfully implemented in the classroom and identifying the process by which the program leads to desirable outcomes (see the discussion of lesson 2). At the same time, researchers and policymakers still need to be creative in recognizing opportunities for evaluation.

Lesson 2: More Informative Local Evaluations

Evaluation needs to be viewed as a tool for program improvement, not as a bureaucratic obligation. Local evaluation efforts need to focus on methods for improving program design and implementation (Ginsburg, 1992; Meyer and Fienberg, 1992).
Lam (1992:193) makes the following recommendation: "It seems reasonable to urge local educators and administrators to use the majority of the evaluation budgets for formative purposes—that is, to document and guide full implementation of the program design, including the analysis of problems arising when the school's capacity to actually implement the proposed program is being developed."
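
One reason a small local budget may be better spent on formative work is the power problem raised under Lesson 1, where a study might have only 50 eligible children. The sketch below is not drawn from any of the studies reviewed here; it is a rough illustration that assumes a simple two-group comparison of means, random assignment of individual children, a two-sided test at the 0.05 level, and a hypothetical standardized effect of 0.5 standard deviations. It uses a normal approximation, and the function names are illustrative only.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate number of children needed in EACH group.

    Normal approximation for a two-sided, two-group comparison of means;
    effect_size is the assumed difference in standard-deviation units.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def approx_power(n_each, effect_size, alpha=0.05):
    """Approximate power with n_each children per group.

    Ignores the negligible chance of a significant result in the
    wrong direction.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(effect_size * sqrt(n_each / 2) - z_alpha)


if __name__ == "__main__":
    d = 0.5  # hypothetical effect size, chosen only for illustration
    print("children needed per group:", n_per_group(d))
    print("power with 25 per group:", round(approx_power(25, d), 2))
```

Under these assumptions, 50 children split evenly between two groups fall well short of what is needed for 80 percent power. When classrooms or schools rather than children are assigned, as discussed under unit of analysis, the required sample would generally be larger still.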

Title VII legislation explicitly encourages this type of evaluation (Section 7123b). Federal and state governments might monitor local evaluations more closely. School districts that present evidence for successfully implemented models should receive grants for outcome evaluation. While we do not believe in enforcing standardization across sites in these evaluations, there should be attempts to encourage collaboration that would allow pooling of results. There are successful examples in other areas of human resource evaluation in which there is local control, but comparable measures and designs are used to allow for data aggregation.

It should also be noted that both large-scale and local evaluations have their limitations. With smaller-scale evaluations, it is easier to monitor the effort, keep track of implementation, and institute procedures to minimize missing data. However, small evaluations are plagued with insufficient sample sizes and sometimes insufficient program variation. Moreover, results in one community may not be generalizable to other communities. Just as national evaluations were oversold in the 1970s and 1980s, we do not wish now to oversell local evaluations.

We expect program effects to interact with site and community characteristics.4 Although some site effects will be random and inexplicable, others will be systematic. If enough sites can be studied, an understanding of the necessary conditions for successful programs can be developed. One statistical technique that is ideally suited for the analysis of within-site effects is hierarchical linear modeling (Bryk and Raudenbush, 1992), which can be used to test whether there are site interactions and determine what factors can explain them.

4    A good example of effects varying by site is presented by Samaniego and Eubank (1991). They tested basic theory using the California Case Studies in four school districts, and results varied considerably across sites.

Lesson 3: Creation and Evaluation of Theory-Based Interventions

Programs should be designed so they are consistent with what is known about basic learning processes. The studies and programs described in this section are based on a theory of second-language learning and its relationship to student achievement and successful educational practice. The theory is tested through implementation in a classroom
setting. While none of the examples described here is perfect, each has aspects that are exemplary.

California Case Studies

In 1980, the California State Department of Education, in a collaboration with researchers and local educators, applied a theory-based model for bilingual education (Gold and Tempes, 1987). The program, which came to be known as the California Case Studies, began with a declaration of principles (see Chapter 7), many of which were based on research results reviewed in Chapters 2 through 4. Five elementary schools serving large numbers of Spanish-speaking students were selected for participation in the program. Students were provided substantial amounts of instruction in and through the native language; comprehensible second-language input was provided through both ESL classes and sheltered classes in academic content areas; and teachers attempted to equalize the status of second-language learners by treating English-language learners equitably, using cooperative learning strategies, providing second-language classes, and using minority languages for noninstructional purposes. While the California Case Studies is exemplary in its application of principles based on well-established basic research and collaboration between educators and researchers, very few of its results have been published in peer-reviewed journals.5

5    The program was never intended as an evaluation, and funding was cut at the end of the project, which made evaluation more difficult.

Gersten's Bilingual Immersion Programs

In a series of studies published in peer-reviewed journals, Gersten and colleagues (Gersten, 1985; Gersten et al., 1984; Gersten and Woodward, 1995) tested the effectiveness of "bilingual immersion programs" for English-language learners. Gersten and Woodward (1995:226) define such a program as follows: "This approach retains the predominant focus on English-language instruction from the immersion model but tempers it with a substantive 4-year Spanish language program so that students maintain their facility with their native language." The program they developed is a blend of ideas from bilingual and immersion programs, hence their use of the term "bilingual immersion."6

6    Although Gersten and Woodward label this program "bilingual immersion," it should not be confused with two-way bilingual programs (also called "bilingual immersion programs"), in which native speakers of English and English-language learners are provided with subject matter instruction in their respective native language(s).

In a major 4-year study (Gersten and Woodward, 1995), 228 children in El Paso, Texas, were placed in either bilingual immersion (as defined above) or transitional bilingual programs. Children were followed from fourth through seventh grades. While differences found in language and reading ability in the early years favored the bilingual immersion approach, those differences seemed to vanish in the later years. However, almost all of the bilingual immersion children had been mainstreamed by the end of the program, while nearly one-third of the transitional bilingual children had not.[7]

[7] The statistical analysis of the data from this program by Gersten and Woodward (1995) has been less than optimal. Generally, analysis of variance is not appropriate for longitudinal studies. Moreover, growth curve analysis (Bryk and Raudenbush, 1992) can often provide a much more detailed picture of process. However, correcting these statistical problems would probably not result in major changes in the study's conclusions.

Summary

Some of the examples in this section might be considered advocacy based. While there is very often a fine line between theory- and advocacy-based program evaluations, we see an important difference between the two. First, the former type of program is grounded in a theory about the learning of a second language and its relationship to student achievement, not solely in a social or political philosophy. Second, the educational curriculum is designed to implement the theory in a school setting. Third, the educational outcomes of children are used to test the theory; the program evaluation tests both the basic theory and the educational intervention.

Lesson 4: Thinking in Terms of Components, Not Political Labels

Historically, programs have been described as unitary; a student is either in a program or not. The current debate on the relative efficacy of English immersion and bilingual education has been cast in this light. However, as noted above, we need to move away from thinking about programs in such broad terms and instead see them as containing multiple components, that is, features that are available to meet the differing needs of particular students. Thus two students in the same program could receive different elements of the program. Moreover, programs that are nominally very different, especially the most successful ones, may have very similar characteristics (see Chapter 7). These common characteristics include the following:

• Some native-language instruction, especially initially
• For most students, a relatively early phasing in of English instruction
• Teachers specially trained in instructing English-language learners

Lesson 5: Creation of a Developmental Model

A general formal model is needed to predict children's development of linguistic, social, and cognitive skills. The foundation of this model would be derived from basic research reviewed in Chapters 2 through 4, including theories of linguistic, cognitive, and social development. The model would predict nonlinear growth trajectories for the major abilities: not only the mean or typical trajectories, but also their variability. It would be flexible enough to allow for the introduction of a second language and would explicitly address possible transfer and interference for different first languages, as well as educational issues for new immigrants. Learning takes place in specific environments, and these would be explicitly considered in the model as well. The environment would serve as a moderator that accelerates or decelerates the child's development. Among the school environment variables would be classroom composition, teachers, and school climate. Family and community variables would also serve as moderators. From a policy perspective, the most important moderators would be program inputs, for example, bilingual education and English immersion, considered not in an idealized sense but in terms of instructional practices, such as the percentage of first-language instruction. Moreover, the model would predict interactions between the effectiveness of program features and student and environmental characteristics. The creation of such a model would require collaboration among researchers, statisticians, and educators. It would likely occur in stages. The model would be so complex that it would have to be computer simulated.
It should be able not only to explain results that are currently well established, but also to make predictions about results that have not yet been obtained and those that are unexpected. The model would be much too large to be testable in its entirety, but should be specific enough to be testable in narrow contexts.

A research effort geared toward developing such a model is that of Thomas and Collier (1995).[8] Using data from the immersion studies discussed above and data collected since then from other school districts, Thomas and Collier sketch approximate growth curves for different programs. The model we envision would be much more extensive in that it would predict individual (as opposed to program) growth, as well as interactions between programs and child characteristics. Moreover, we would hope that programs in the model would be replaced by program features (see Lesson 4).

[8] Findings from this study are not discussed in this report because the study had not been completed or published prior to the report's publication.

IMPLICATIONS

The educational implications of the findings presented in this chapter correspond to the five lessons presented in the previous section.[9]

[9] This section does not present research implications because research is not needed on evaluation per se; rather, program evaluations need to be conducted differently if we are to learn from the programs and practices we implement.

First, higher-quality program evaluations are needed. Factors critical to high-quality program evaluations include sound program design, full program implementation, creation of a control group, group equivalence, adequate measurement, proper unit of analysis, enough power, and methods for dealing with missing data.

Second, local evaluations need to be made more informative. Further, they need to focus on methods for improving program design and implementation. Evaluation needs to be viewed as a tool for program improvement, not as a bureaucratic obligation.

Third, theory-based interventions need to be created and evaluated. Programs should be designed so they are consistent with what is known about basic learning processes. The studies and programs described in this chapter are based on a theory of second-language learning and its relationship to student achievement and successful educational practice. The theory is tested through implementation in a classroom setting.

Fourth, we need to move away from thinking about programs in broad terms and instead see them as containing multiple components, that is, features that are available to meet the differing needs of particular students.

Finally, a developmental model needs to be created for use in predicting the effects of program components on children in different environments. The foundation of this model would be derived from basic research reviewed in Chapters 2 through 4, including theories of linguistic, cognitive, and social development.
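
To make the idea of a simulated developmental model more concrete, the following is a minimal sketch in Python. It is a toy illustration only, not the model envisioned in this chapter and not the approach of Thomas and Collier (1995). It assumes a logistic growth curve for a single ability, a child-specific growth rate that varies across children, and a single environmental "moderator" that multiplies that rate. All names and numerical values (the starting score, ceiling, mean rate, and spread) are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def trajectory(base_rate, moderator, years=6, start=10.0, ceiling=100.0):
    """Return yearly scores on a logistic (nonlinear) growth curve.

    `moderator` multiplies the child's own growth rate: values above 1
    accelerate development, values below 1 decelerate it. All numbers
    here are illustrative assumptions, not estimates from any study.
    """
    rate = base_rate * moderator
    odds = (ceiling - start) / start
    return [ceiling / (1.0 + odds * math.exp(-rate * t)) for t in range(years + 1)]


def simulate_group(n_children, moderator, seed=0):
    """Simulate children whose base growth rates vary around a common mean."""
    rng = random.Random(seed)
    return [
        trajectory(max(0.05, rng.gauss(0.6, 0.15)), moderator)
        for _ in range(n_children)
    ]


def summarize(group):
    """Per-year mean and standard deviation of scores across children."""
    by_year = list(zip(*group))
    return [(statistics.mean(year), statistics.pstdev(year)) for year in by_year]


if __name__ == "__main__":
    # Compare a hypothetical decelerating and accelerating environment.
    for label, moderator in [("decelerating", 0.7), ("accelerating", 1.3)]:
        summary = summarize(simulate_group(200, moderator))
        print(label, [f"{mean:.1f} (sd {sd:.1f})" for mean, sd in summary])
```

The sketch reports both the typical trajectory and its variability for each simulated environment. A model of the kind called for here would go much further: it would replace the single moderator with measured program features, family and community variables, and their interactions with child characteristics.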

STUDIES OF SCHOOL AND CLASSROOM EFFECTIVENESS: KEY FINDINGS

The literature on school and classroom effectiveness provides the following key findings:

• The studies reviewed here provide some evidence to support the "effective schools" attributes identified nearly 20 years ago (strong leadership, high expectations for students, clear school-wide focus on basic skills, safe and orderly environment, and frequent assessment of student progress), with at least two important qualifications:

    • The studies challenge the conceptualization of some of those attributes, for example, the idea that implementing characteristics of effective schools and classrooms makes schools and classrooms more effective.

    • The studies suggest that factors not identified in the effective schools literature may be important as well if we are to create schools where English-language learners, indeed all students, will be successful and productive. Examples of such factors are a focus on more than just basics, ongoing staff development, and home-school connections.

• The following attributes are identified as being associated with effective schools and classrooms: a supportive school-wide climate, school leadership, a learning environment tailored to local goals and resources, articulation and coordination within and between schools, some use of native language and culture in the instruction of English-language learners, a balanced curriculum that incorporates both basic and higher-order skills, explicit skills instruction, opportunities for student-directed activities, use of instructional strategies that enhance understanding, opportunities for practice, systematic student assessment, staff development, and home and parent involvement.

• Although suggestive of key attributes that are important for creating effective schools and classrooms, most studies reviewed here cannot give firm answers about any particular attribute and its relationship to student outcomes. For example, the nominated schools designs do not report data on student outcomes and are thus inconclusive. Prospective case studies lack comparison groups, so that changes in student outcomes may be due to extraneous factors. And while quasi-experimental studies that focus on an entire program provide the strongest basis for claims about program or school effects, they make direct claims only about the program or school effect overall. Claims about the effects of specific components must, in general, rest on other studies that examine those components explicitly.