Assessing Evaluation Studies: The Case of Bilingual Education Strategies

5. Lessons for the Future

This chapter addresses the issue of theory and its interplay with the statistical design of studies involving educational interventions. Building on experience in related areas of study and its review of the program in the Longitudinal and Immersion Studies, the panel has concluded that an explicit theory of bilingual education, including explicit objectives, is required to both motivate and structure the sensible statistical design of a study to evaluate alternative forms of bilingual education. Moreover, the development of a plan to evaluate an explicit theory must conform to the realities of the political and social setting of U.S. public education, as well as to the principles of sound experimental design and appropriate statistical analyses.

The panel believes strongly that without proper statistical designs most intervention studies are likely to fail. That is, they will be such that either the educational innovations will not achieve the desired objectives or the study will be insufficient to distinguish amongst innovations. The panel also believes that it is possible to overcome some shortcomings in design by relying on empirically supported theory. It is for this reason that the panel has focused on issues of underlying substantive theories and on related issues of statistical design of the two studies under review. The shortcomings of the Longitudinal and Immersion Studies are many, but they can be used to motivate the type of study that has a high probability of successfully demonstrating the value of particular forms of bilingual education.
SPECIFYING OBJECTIVES FOR BILINGUAL EDUCATION

The primary objectives of bilingual education for public schools in the United States remain controversial. In the absence of a well-defined set of program objectives, any research effort to assess “success” of programs will encounter problems and difficulties from the start.

For many, the primary objective of bilingual education is the development of English-language proficiency at the earliest possible age, to expedite the transition of language-minority limited-English-proficient (LM-LEP) students to classes for which English is the sole medium of communication. To meet this objective, instructional goals during early years of schooling would concentrate on the learning of English. Students, however, would be disadvantaged in subject-matter knowledge to the extent that they fall behind cohorts for whom English is their native language and who had already gained knowledge from classes taught in English.

An alternate, or perhaps concurrent, objective is to maintain equity of opportunity to learn subject-matter skills and knowledge, regardless of the students' facility in English at a given stage of development. This objective would seem to require subject-matter teaching in each child's native language until such time as sufficient proficiency in English allows the child to be an active learner in classes conducted in English. Still another objective may be the preservation of linguistic and cultural pluralism in U.S. society, and there may well be others.

From the perspective of students and their parents, the objectives may be somewhat different, and different objectives may be differentially important for children from different language groups. For example, some immigrant groups might encourage the early development of English-language facility for their children, while others might place greater importance on mastering their native language.
Within a language community, different objectives may be favored for different subgroups of children, depending on the length of time since their arrival in this country, the values espoused by adults in the local community, or a variety of other factors. It is imperative that objectives be agreed upon before assigning children to particular programs. Establishing program objectives is not a function of educational research, but, rather, must be accomplished as part of a political process. It is desirable that teachers, parents, and children be willing to accept program goals and subgoals, and that an instructional program then be oriented toward fulfilling those goals. Once objectives are set and programs to meet the objectives are in place, then criteria consistent with those objectives can be specified, and research designs to assess the relative success of alternative programs can be developed. It is also important to have operational definitions of limited-English proficient (LEP) as well as fully English proficient (FEP). Without operational definitions, it is very difficult to quantify what the objectives of a bilingual education program are and just as difficult to decide if the objectives have been achieved. In the Immersion Study, the Foreign Service Institute oral proficiency test was used to measure the proficiency of teachers, but no measure of students' initial language
proficiency was offered. The Foreign Service Institute oral proficiency test is inappropriate for young students, but other measures are available. Rather than leave the classification of LEP to local discretion, it is important that there be a uniform definition, at least within studies. In addition to descriptions of levels of English-language proficiency, future studies might also be enriched by descriptions of changes in language proficiency in the students' first language (in most cases, Spanish).

In addition to English-language proficiency, there are many other possible outcomes from bilingual education, such as the long-term impact of a program and its effect on students' attitude and motivation. The program that results in the “best” short-term effect may not prove to be the best in the long run. It is important to consider how one might measure long-term effects and the impact of a program on the subsequent motivation and attitude of students.

DEVELOPING INTERVENTIONS FROM THEORY

One of the fundamental difficulties that the panel encountered with its review of the Longitudinal and Immersion Studies was the absence of a careful description of an explicit theory for learning, either of language or school subject matter. This absence inevitably leaves a lean platform upon which to build data collection, analysis, and interpretation. As a result, the studies used vague and inconsistent methods to operationalize treatments, outcomes, and measurements. The problem was especially conspicuous in the Longitudinal Study, in which even the explanatory variables were inductively formulated without any explicit theory of learning or instruction.
In the case of the Immersion Study, there is an explicit theory of treatment, but it is really a characterization of instructional strategies in terms of the amount of native-language involvement, and not tied, for example, to school and community contexts that might influence the choice and implementation of the programs. In both studies, the key outcomes of interest were the students' English proficiency and their basic skills in reading and mathematics. What theory of education regards these outcomes as more important, to take an extreme example, than students' self-esteem? Is there a theory that might explain how English proficiency is related to other desirable outcomes, such as access to instruction in general, access to instruction in a specific subject matter (for example, reading or mathematics), or motivation to learn? The panel had difficulty answering such questions on the basis of the material associated with the two studies it reviewed.

Whatever structure a theory of bilingual education takes, it must address how the English and Spanish proficiency of students might introduce additional factors in the measurement of basic skills in reading and mathematics as a function of the language of the instrument (in the case of the two studies we reviewed, only English, since the Spanish achievement scores were not analyzed in the Immersion Study). More generally, in such a theory, treatment, outcome, and measurement need much greater development and articulation. Otherwise, future studies of alternatives for bilingual education will not profit from the painful experiences of the Longitudinal and Immersion Studies. The rest of this section offers some guides to an array of theoretical components that might be of value for future studies.

The panel has attempted to construct an implicit theory around which the two studies were built; to do so we have read all of the available background information and planning documents, including the legislative language surrounding bilingual education, as well as the actual study reports. What comes through repeatedly is an implicit theory in which the treatment is almost exclusively concerned with the amount and duration of native-language instruction. One might ask, however, whether this dimension is really separable from community, demographic, historical, and other factors that form the context of a school.

For example, is a program that fully utilizes native-language instruction (such as a Late-exit Program in the Immersion Study), implemented in a rural town on the U.S. border with Mexico, really comparable to one implemented in an urban immigrant community in Northern California? What are the sociological and educational theories that can explain the similarities and differences in the process by which a given treatment is implemented in different communities? The idea that treatments are confounded with communities has become a major concern of the panel, especially for the Immersion Study. The rationale that Late-exit Programs could not be compared directly with other programs in the Immersion Study was justified in the face of glaring differences in the community contexts in which the programs existed. The panel questions the theoretical basis of that rationale.
For example, there are several contextual community variables of possible interest that can be specified: the political base of language groups as reflected in bilingual education programs; rates of literacy or school involvement on the part of parents; and linguistic proficiency and ethnicity of the teachers in the programs. There is a potentially infinite list of potential correlates of the treatment variables, but, in the absence of good theory, they are unavailable for empirical inquiry. Equally critical is a theory of how the process or treatment affects the outcome. What is the theory of language proficiency (for example, how many dimensions does it have)? What is the theory of its acquisition (for example, rate, critical input, constraints, sources of group and individual variation)? And what is the relationship between the two languages that a student must master? Given natural variation in the amount and type of exposure to English depending on programs, what is the theory of language learning and the relationship between the two languages? For example, from the perspective of a theory of “time on task” as applied to second-language learning, it is interesting to note that the comparison of Immersion and Early-exit Programs provides rather disappointing evidence; those students who were exposed only to English performed significantly worse on English reading than those who were exposed
to varying amounts of Spanish (see the discussion in Chapter 4, “Adjusting for Overt Biases”).

If the effectiveness of instructional delivery might vary with the amount and duration of native-language instruction, what is the theory that relates language of instruction to learning of content? For example, from the viewpoint of student learning, what characterizes the relationship between a student's language proficiency and the student's ability to profit from instruction in English or Spanish? How involved are social and motivational factors in learning? Perhaps more basic: What is the role of the teacher, other school personnel, and community in student learning? What defines language proficiency? Is it a purely cognitive construct devoid of the social circumstances of language use, or is it inseparable from sociolinguistic reality? Theories about student learning and expected outcomes offer the only way in which questions such as these might be systematically addressed, and answered, by research.

Issues surrounding the ways in which outcomes might be measured share parallels with the assessment of outcomes in most educational programs, such as content, authenticity, level, and manner of assessment. In addition, there needs to be an articulation of how language proficiency interacts with measurement. Measurement involves the nature of language proficiency, for which a distinction is needed, at a minimum, between language for conversational purposes and language for academic purposes. For example, superficial aspects of language proficiency, such as a non-native accent and clarity of speaking, may be more influential in some performance-based assessment systems than in more traditional forms of assessment (for example, multiple-choice tests).
In contrast, variance in academic proficiency in English may take on increasing importance for measurements that require greater cognitive engagement of language, such as tests that require large amounts of reading or writing.

One clear lesson from the two studies the panel reviewed is the extent of variation across school sites and the fact that these variations are not randomly distributed. Thus, for example, in the Immersion Study, the safest inferences that could be drawn came from comparisons of programs in the same school. The differential effects of programs based on the same bilingual education theory are perhaps best demonstrated in a study conducted under the auspices of the California State Department of Education, in which a theory of bilingual education was implemented in six different school sites. Samaniego and Eubank (1991, p. 13) concluded that:

    the effect of a given bilingual education initiative will vary with the environment in which it is implemented. In particular, a treatment which works well in one setting may fail in another, or may require nontrivial modifications if it is to be effective elsewhere.

The same sorts of warnings about generalizability can be made about the native languages of the students and about whether their social setting is conducive to additive or subtractive bilingualism. In general it is very important to find out if
a program works at all before becoming too concerned about the generalizability of the program results (see Cochran, 1965).

DESIGNING AND IMPLEMENTING EXPERIMENTS

The Applicability of Controlled Randomized Trials

In Chapter 2 the panel outlines a hierarchy of statistical designs for confirmatory evaluation studies, and it stresses the value of randomized controlled experiments. The panel knows of few models of successful studies of bilingual educational approaches that have used a truly randomized experimental design. There are many reasons for this, including the practical difficulties investigators encounter when they try to adapt this evaluation strategy to the kinds of interventions at issue and the natural setting in which they must be implemented. In other subject areas, such as evaluating the efficiency of alternative drug therapies in medicine, a controlled randomized trial is an effective evaluation strategy when three conditions are met:

1. the intervention is acute and, a priori, is expected to yield an acute outcome (produce a large effect);
2. the intervention acts rapidly, and the evaluation can be carried out over a short period of time; and
3. the imposition of controls for the trial does not create an environment for the study that is substantially different from the environment in which the proposed therapy would be used on a routine basis.

Violations of 1. and 2. very frequently lead to lack of adherence to the study protocol; in medicine, much of the switching of patients among treatment regimens is driven by clinical patient management decisions that are both necessary (as judged by physicians) and in conflict with the original study design. This makes the evaluation of a therapy very problematic and reduces the chances of being able to provide conclusive results at the end of the study. Violation of 3.
reflects strongly on generalizability of conclusions: in many instances it means that an artificial environment, having very little resemblance to the ultimate (clinical) setting and (patient) characteristics, is being used to evaluate a given intervention. It is the panel's judgment that virtually all bilingual education interventions violate, at least to some extent, all three conditions. For example, a given teaching regimen in and of itself may have only a modest effect on improving test scores, presuming that test scores are the outcome to be measured. Thus, the experimental manipulations would violate 1. It is perfectly plausible, however, that a more complex treatment—for example, (Spanish literate parents) and (parents help children with homework) and (a given teaching regimen)—could have large effects, but only the third component can be manipulated in a randomized study. If an investigator designs a teaching regimen to operate over several years, this would represent a violation of 2. When 2. is violated, it is likely that uncontrolled and uncoordinated adaptive improvements will occur in the middle of a study
when such improvements are judged by teachers and principals to be desirable. This situation makes it extremely difficult to generalize study results because of a lack of precision about what the treatments are and the consequent inability to duplicate them. Moreover, different adaptations are likely to occur at different sites, creating unplanned variations in treatments. Finally, if school rules, class composition, and a variety of other structural changes at the school level are imposed for purposes of an evaluation study without any intent of permanence of the arrangements, condition 3. would be violated.

For the evaluation of bilingual education treatments, the panel believes that multiple highly focused studies are likely to be much more informative than a single large study. Since discovery of candidates for a potentially successful treatment is an integral part of the bilingual education research agenda, multiple studies that vary across intervention and environment are more likely to both uncover potential winners and afford insight into how interventions and setting interact.

A problem with such a series of studies, whether the sampling of sites is random or not, is the introduction of at least three new components of variation: (1) variations on the intervention over space and time, (2) the interaction of intervention with environment, and (3) variation in the magnitude of measurement error across the studies. Yates and Cochran (1938) note these difficulties in agricultural experiments, and we discuss them below.

Formal statistical methods for combining information across experiments have a long history in the physical and agricultural sciences. More recently, meta-analyses have been carried out in education, psychology, medicine, and other social science settings.
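As a rough illustration of what combining information across sites can involve, the sketch below pools per-site effect estimates by inverse-variance weighting and then adds a between-site variance term using the DerSimonian-Laird moment estimator. The chapter does not prescribe this particular method; the function name, the input format, and the numbers in the usage note are assumptions made purely for illustration.

```python
import math


def combine_site_effects(effects, variances):
    """Pool per-site effect estimates (hypothetical helper, not from the chapter).

    effects   -- estimated treatment effect at each site
    variances -- sampling variance of each site's estimate (must be positive)

    Returns a dict with a fixed-effect summary and a random-effects summary
    based on the DerSimonian-Laird estimate of between-site variance.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need matching lists covering at least two sites")
    if any(v <= 0 for v in variances):
        raise ValueError("sampling variances must be positive")

    k = len(effects)

    # Fixed-effect summary: weight each site by the inverse of its variance.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw

    # Cochran's Q measures how much the site estimates disagree
    # beyond what their sampling variances would explain.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))

    # DerSimonian-Laird moment estimate of between-site variance,
    # truncated at zero.
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects summary: between-site variance is added to each
    # site's sampling variance before weighting.
    w_re = [1.0 / (v + tau2) for v in variances]
    sw_re = sum(w_re)
    random_mean = sum(wi * yi for wi, yi in zip(w_re, effects)) / sw_re

    return {
        "fixed": fixed,
        "fixed_se": math.sqrt(1.0 / sw),
        "q": q,
        "tau2": tau2,
        "random": random_mean,
        "random_se": math.sqrt(1.0 / sw_re),
    }
```

With the made-up inputs combine_site_effects([0.0, 4.0], [1.0, 1.0]), the estimated between-site variance is 7.0 and the standard error of the pooled effect grows from about 0.71 (fixed-effect) to 2.0 (random-effects). This is one way that variation in an intervention's effect across environments can surface in a combined analysis.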
More elaborate statistical modeling for combining evidence across sites or studies was used by DuMouchel and Harris (1983) for combining evidence from cancer experiments and by Mason, Wong, and Entwisle (1984) for a comparative analysis of fertility patterns across 30 developing countries. Mosteller and Tukey (1982, 1984) treat combining evidence from multiple sources more generally; see also Wachter and Straf (1990).

Meta-analytic techniques were originally planned for use in the analysis of the Longitudinal Study. The research design overview (Development Associates, 1984a) states:

    [the] mere presence of common … instruments does not guarantee that they are suitable for directly combining across several sites into a single analysis. [Objectively derived measures such as] test scores … normally are combinable across sites … However, subjectively determined ratings … normally are not suitable for across-site combination [because of between-site variability, and because raters differ between sites and] rater-specific biases are almost certain to distort the picture … But meta-analytic techniques will enable us to combine conclusions drawn from ratings at different sites.

We believe this conclusion is warranted in bilingual research and we recommend that multiple studies be combined in this way. Such studies are, however, best carried out independently, perhaps with some coordination to ensure that similar
measures are taken across sites. A single large study such as the Longitudinal Study is likely to fail to account adequately for the special factors affecting each of the sites.

Evolutionary Study Design

In addition to combining information across different studies as described above, it is useful to plan for sequential accumulation of knowledge. One such experimental program is “evolutionary operation” (EVOP); see Box and Draper (1969, 1987). EVOP uses the results of a subexperiment both to check model assumptions and to suggest the most efficient way to improve the model for the follow-on subexperiment. Similar evolutionary processes have been used in educational research under the label of “formative evaluation.” For bilingual education research, Tharp and Gallimore (1979) provide a model for program development and evaluation that has great potential. They describe systematic strategies for adaptive study design, which should be the basis for sequential discovery of causal mechanisms (treatments) and then evaluation in bilingual education interventions.

Tharp and Gallimore (1979) applied their sequential approach, which they call succession evaluation, to the study of educational alternatives for underachieving Hawaiian native children. They used a much smaller scale than that typically used in a large EVOP industrial experiment, and they used less formal and less complicated statistical methods. Letting data from one phase of a study inform choices of succeeding phases has, however, led to the development of educational programs that appear to work. Table 5-1, taken from Tharp and Gallimore (1979), outlines the succession evaluation model.
Their first steps in developing a program are likely to be more qualitative and subjective, involving the development of theoretical questions, the consideration of qualitative data to understand the phenomenon or program under study, analysis of previous research, and clarification of values. At this stage they ask such questions as, “Does this idea make sense?” and “Is it generalizable?” Moving from planning to experimentation occurs at step 3, which is still at the qualitative and subjective level. Step 4 tests the program in an experimental setting, possibly with successive iterations to improve the program and test new elements. The final step of one complete cycle of program development is testing the full-scale program. The evaluation at this step may lead to a recycling to step 1 to develop new or improve existing program elements. This paradigm has also been suggested for research in other areas, such as the rehabilitation of criminal offenders (National Research Council, 1981) and, in the panel's view, shows great potential for bilingual education research.

There is an inevitable tension between possible contamination among units and the desire to maintain comparability between treatment and control groups. For example, consider an experiment that used “advertising” as a treatment. The treatment might be applied to a specific neighborhood of a city with the control neighborhood (chosen for comparability) being in close geographic proximity.
TABLE 5-1 An Evaluation Succession for Program Research and Development

Stages in Program Element Development | Ways of Knowing | Validity Concerns | Selection Process
Step 1. Initial selection: values, theories, goals, and objectives | Qualitative/personal | Construct and external | Does the idea have potential? 1
Step 2. Treatment, independent variable formation | Experimentation: true or quasi | External-internal or internal-external | Are the relationships of enough magnitude? 2
Step 3. Decision point: review, evaluate, translate; proceed, recycle | Qualitative/personal | Construct and external | Can treatments/independent variables be translated into stable program element(s)? 3
Step 4. Program element formation | Experimentation: true or quasi | External-internal | Does it work in the setting employed?
Step 5. Decision point: review, evaluate, proceed, recycle | Qualitative/personal | Construct and external | Is it worth further work?
Step 6. Program element implementation | Data guidance | External-internal | Can it be improved with tinkering?
Step 7. Decision point: review, evaluate, proceed | Qualitative/personal | Construct and external | When and if in or out of the program?
Step 8. Full-scale program element operation | Program evaluation | Internal and conclusion, statistical | Does the program element, with other elements in association, bring benefit?

SOURCE: Taken from Tharp and Gallimore (1979)
Knowledge of the advertising campaign in the treatment neighborhood could readily be transmitted to the control neighborhood, thereby converting it into a treatment neighborhood. For a classic example of this phenomenon, see Freedman and Takeshita (1969), which gives a superb discussion of the design and analysis of a randomized advertising experiment on behalf of birth control strategies.

In many evaluation studies, the responses to treatments are often found to vary dramatically with what is referred to as “a change from one environment to another.” This finding is frequently viewed as a negative finding; however, it can also be thought of as a clue to the definition of a more complex cause in which environmental conditions (that is, a profile of environmental attributes) define part of the cause. Taking this point of view changes the way one thinks about the definition of “intervention”: indeed this interaction is usually viewed as a treatment-by-environment interaction. The panel believes, however, that it may be productive to think of an evaluation study in which “treatment,” as usually constituted, corresponds to the part of a complex cause that can be manipulated, while the “environment” component of the complex cause corresponds to the part that can only be studied observationally. If this point of view is adopted, then program evaluation consists of a mixture of experimental and observational studies.

As an example of this view for bilingual education evaluation, a candidate cause for successful performance on tests is (Spanish-literate parents) and (parents help children with homework) and (late-exit language program). One can manipulate the language program but must find natural settings (communities) to see variation in (Spanish-literate parents) and (parents help children with homework).

REFERENCES

Box, G. E. P., and Draper, N. R. (1969) Evolutionary Operation: A Statistical Method for Process Improvement. New York: John Wiley.
Box, G. E. P., and Draper, N. R. (1987) Empirical Model-building and Response Surfaces. New York: John Wiley.
Cochran, W. G. (1965) The planning of observational studies of human populations (with discussion). Journal of the Royal Statistical Society, Series A, 128, 124–135.
Development Associates (1984) The descriptive phase report of the national longitudinal study of the effectiveness of services for LMLEP students. Technical report, Development Associates Inc., Arlington, Va.
DuMouchel, W. H., and Harris, J. E. (1983) Bayes methods for combining the results of cancer studies in humans and other species (with discussion). Journal of the American Statistical Association, 78, 293–308.
Freedman, R., and Takeshita, J. (1969) Family Planning in Taiwan: An Experiment in Social Change. Princeton, N.J.: Princeton University Press.
Mason, W. M., Wong, G. Y., and Entwisle, B. (1984) Contextual analysis through the multilevel linear model. In S. Leinhardt, ed., Sociological Methodology 1984, pp. 72–103. Washington, D.C.: American Sociological Association.
Mosteller, F., and Tukey, J. W. (1982) Combination of results of stated precision: I. The optimistic case. Utilitas Mathematica, 21, 155–178.
Mosteller, F., and Tukey, J. W. (1984) Combination of results of stated precision: II. A more realistic case. In P. S. R. S. Rao and J. Sedransk, eds., W. G. Cochran's Impact on Statistics, pp. 223–252. New York: John Wiley.
National Research Council (1981) New Directions in the Rehabilitation of Criminal Offenders. Panel on Research on Rehabilitative Techniques, Committee on Research on Law Enforcement and the Administration of Justice, Commission on Behavioral and Social Sciences and Education, National Research Council. Washington, D.C.: National Academy Press.
Samaniego, F. J., and Eubank, L. A. (1991) A statistical analysis of California's case study project in bilingual education. Technical Report 208, Division of Statistics, University of California, Davis.
Tharp, R., and Gallimore, R. (1979) The ecology of program research and evaluation: A model of succession evaluation. In L. Sechrest, M. Philips, R. Redner, and S. West, eds., Evaluation Studies Review Annual: 4. Beverly Hills, Calif.: Sage Publications.
Wachter, K. W., and Straf, M. L., eds. (1990) The Future of Meta-Analysis. Committee on National Statistics, Commission on Behavioral and Social Sciences and Education, National Research Council. New York: Russell Sage Foundation.
Yates, F., and Cochran, W. G. (1938) The analysis of groups of experiments. The Journal of Agricultural Science, XXVIII(IV), 556–580.