Research Methodology and Bilingual Education
Statistics is often thought of as a body of methods for learning from experience. The standard repertoire of statistical methods includes techniques for data collection, data analysis and inference, and the reporting of statistical results. Just as some experiences are more informative than others, some statistical techniques for data collection are more valuable than others in assisting researchers and policy makers in drawing inferences from data. The two studies reviewed by the panel used quite different statistical methodologies, although the methodologies fit within a common framework. An appreciation of this framework is essential to an understanding of the panel's evaluation of the two studies.
This chapter reviews some of the basic techniques for the collection of statistical data, the definition of outcomes and treatments, the units of analysis, and the kinds of inferences that are appropriate. In essence, there is a hierarchy of approaches to data collection: the more rigorous the design for data collection and the control of the setting under study, the stronger the inferences that can be made from the data that result about the universe of interest. Users of statistical methods often wish to draw causal conclusions, for example, from programs to achievement outcomes. This is especially true in a policy setting. If one concludes that when a school follows approach X to bilingual education, the performance and achievement of the students will be Y, one is claiming, at least in a loose sense, that X “causes” Y. The notion of which designs allow conclusions about the causal effects of treatments is critical to an appreciation of the evaluation of
alternative bilingual education programs. Chapter 4 on the Immersion Study provides a focused discussion of a design-based approach to causal inference; here we present a general introduction to the topic.
There are no general sufficient conditions that can be used to declare and defend a claim that X “causes” Y. The evidence used to support such claims varies substantially with the subject matter under investigation and the technology available for measurement. Statistical methodology alone is of limited value in the process of inferring causation. Furthermore, consensus on causality criteria evolves over time among practitioners in different scientific lines of inquiry. A clear statement of the evidential base that should support claims that a particular treatment will be “effective” for English-language training in a bilingual education setting has not been put forth. For a historical discussion of the evolution of causality criteria in the setting of infectious diseases, see Evans (1976). For a discussion along similar lines in the social sciences, see Marini and Singer (1988). Nonetheless, throughout this chapter and the rest of this report the panel uses the word “causal” in its more narrow statistical or technical sense, fully realizing that in some cases such usage may violate the broader scientific meaning.
As part of the development of the evidential base leading to a claim that X causes Y, it is useful to distinguish between confirmation studies, whose purpose is to confirm a prior hypothesis, and discovery studies, whose purpose is to discover candidate mechanisms. For confirmation studies one introduces candidate causes in an intervention study (possibly randomized) whose objective is to evaluate the effects. For discovery studies, one starts from effects and carries out investigations whose objective is the determination of candidate causes. Both forms of studies are necessary in the context of bilingual education.
In a typical discovery study one analyzes what are regarded as successful intervention studies and asks what the underlying mechanism(s) was (were) that led to success. This is usually an analysis of multiple case studies in which one is looking for common features that led to success (effectiveness). In the process of discovering candidate causes that could be responsible for observed effects, one frequently infers that a complex cause—consisting of multiple interventions acting together—is really driving the system. It may be exceedingly difficult to determine the relative contribution of each intervention acting alone; however, in many instances whether one can isolate components may not even be a useful question to address. The key to devising a real-world intervention is simply the identification of the cluster of interventions that need to be simultaneously applied to produce the desired effects. A confirmatory intervention study should then use the full cluster of interventions as a treatment. This important scientific strategy must confront a political environment in which different parties have vested interests in demonstrating the effectiveness of their intervention acting by itself. There are a variety of methods for gathering information about bilingual education, including case studies and anecdotes, sample surveys, observational studies, and experiments or field trials. Another source of information is expert opinion, but we exclude that from our definition of data.
SOURCES OF DATA
Case Studies and Anecdotes
The careful study of a single case often allows for a rich contextual picture of circumstances surrounding some activity or event of interest. The event may be the legislative history of a bilingual education act in California or it may be the immersion experiences of students in a particular sixth grade classroom in Taos, New Mexico, in 1985–1986. The technical term for such a study or observation is a case study. Anecdotal evidence is usually less comprehensive than a formal case study: it may consist of informal comments by observers or of responses solicited through a “Suggestion Box” or through open-ended questions at the end of an opinion survey. In medical research, carefully documented case studies of small groups of patients provide preliminary evidence that may contribute to understanding how successful a treatment is. Experts often use the case-study method to consider the special features of events and subsequently build a body of knowledge by the accumulation of cases.
Every field has its own pattern for case studies, based on experience and theory, and the field of bilingual education is no exception. For example, Samaniego and Eubank (1991) carried out several case studies examining how an ambitious California experiment in bilingual education affected English proficiency. They used linear and logistic regression analyses to develop school-specific profiles of students who successfully transferred skills developed in their first language into skills in the second language. Two profiles illustrate the diversity. One school consisted entirely of Spanish-speaking students who had never attended school in the United States. These students studied at this school (in Spanish) for just 1 year before transferring to other elementary schools in the district. A second site was a rural school whose student body was mostly children of agricultural workers. This school had early “mainstreaming” in subjects such as art, music, and physical education and more gradual mainstreaming in other subjects. These case studies suggested the presence of strong contextual effects: that is, outcomes were strongly affected by factors other than when and how mainstreaming occurred, such as the quality of an individual teacher.
Case studies often provide the starting point for research investigations by helping to define treatments and outcomes, as well as assisting in determining how to measure them in the discovery mode described above. This information, however, usually comes from a variety of sources of unknown reliability and with unknown biases. Consequently, it is difficult to generalize from a case study, since the data are collected in a manner not necessarily grounded in any formal rules of inference.
When the purpose of a study is to provide a systematic description of a large number of programs, institutions, or individuals, a case-study approach will simply not do. Although the collection of many cases provides more reliable information than a single case, a major issue is often how to generalize to other circumstances. Individual cases require some form of systematic selection in order to allow for the possibility of generalization. One way to gather information systematically in a manner that allows for generalization is through sample surveys.
Through surveys, investigators are able to ask questions about facts and quantities as they currently exist, recollections and records about past circumstances, and relationships among them. Thus, one might do a survey of schools and ask the principals questions about the bilingual and other language education programs in place and about the students enrolled in those programs. One might even record information about individual students. By reinterviewing the principals or the students several months or years later, one might add a longitudinal dimension to the survey, thus allowing for the study of how quantities and relationships have changed over time. In such longitudinal surveys the investigator does not seek to change the values of variables (for example, to influence the kinds of language programs at the school or the teacher training and support for those programs, in order to see what effects such changes might make). Rather, the investigator seeks to record information on a sample of schools or a sample of students in order to learn how things are changing.
The sampling aspect of a survey provides the mechanism for generalizing from the units at hand: for example, from the students in six school districts in Cook County, Illinois, to some larger population of interest, such as all elementary students in the state of Illinois. The latter is usually referred to as a target population. In a well-designed probability sample, the investigators can make inferences about occurrences in the population from the sample information. The term representative sample is often used in reports and accounts of sample surveys in the public press. As Kruskal and Mosteller (1988) note, however, the term has no precise meaning and thus the panel avoids the use of that term in this report.
Many large-scale surveys, such as the one described in Chapter 3 on the Longitudinal Study, use a complex form of random selection that involves the stratification and clustering of the units under investigation. Each unit in the population typically has a nonzero probability of being included in the sample. This probability can be determined in advance. In this sense, sample surveys allow for external validity, that is, for the generalization from sample to population. Several issues affect the ability to make inferences from a sample to a population of interest:
the nonresponse rate in the sample (that is, what proportion of the originally designated sample units actually participated in the survey);
the extent of missing data;
in a longitudinal survey, the attrition rate of sample respondents over time; and
the factual accuracy of responses.
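The role of the known, nonzero inclusion probabilities mentioned above can be illustrated with a small sketch (the strata, scores, and probabilities here are invented for illustration). Each sampled unit is weighted by the inverse of its inclusion probability, the Horvitz-Thompson idea, so that the sample can be projected to the population:

```python
import random

random.seed(1)

# Hypothetical population of 1,000 students in two strata.
# (Strata, score distributions, and probabilities are invented.)
population = ([("urban", random.gauss(60, 10)) for _ in range(800)]
              + [("rural", random.gauss(70, 10)) for _ in range(200)])

# Known, nonzero inclusion probability for every unit, fixed by the design.
incl_prob = {"urban": 0.05, "rural": 0.20}

# Poisson sampling: each unit enters the sample independently with its
# design probability.
sample = [(s, y) for (s, y) in population if random.random() < incl_prob[s]]

# Horvitz-Thompson estimate: weight each sampled unit by the inverse of its
# inclusion probability, then divide by population size to estimate the mean.
ht_total = sum(y / incl_prob[s] for (s, y) in sample)
ht_mean = ht_total / len(population)

true_mean = sum(y for _, y in population) / len(population)
print(f"estimated mean: {ht_mean:.1f}; true population mean: {true_mean:.1f}")
```

Rural units, sampled at four times the rate of urban units, receive one-quarter the weight; ignoring the weights would overrepresent the rural stratum in the estimate.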
External validity in data collection and analysis is often not sufficient for policy purposes. When an investigator wants to compare the effects of different
treatments or programs, a sample survey has major weaknesses, even if the treatments or programs are in use in the population under study. The major difficulty has to do with how the individuals were assigned to the treatments or programs in the first place. If assignment is perfectly related to (or totally confounded with) the outcome of interest, then even the very best sampling scheme will not allow the investigator to sort out what treatment causes what outcome. For example, if students with greater academic or language-learning ability are placed in certain bilingual immersion programs, then the outcome that these students perform better in both languages may be attributable to the assignment mechanism and not the program itself. As a specific example, the Late-exit Program of the Immersion Study (see Chapter 4) used teachers with better training and more enthusiasm for the program. As those teachers differed systematically from teachers in other programs, the effects of the program itself cannot be disentangled from the effects of the teachers. To understand the effect of treatments on outcomes, one needs a stronger design, one that allows for the control of the assignment mechanism. (Sometimes statisticians speak of assigning treatments to subjects and other times of assigning subjects to treatments. The two are essentially equivalent concepts, and we use them interchangeably, depending on the context.)
Sample surveys are a form of observational study involving a collection of cases. As noted above, in observational studies the investigator does not change the values of variables such as the treatments applied to individuals, but rather compares groups of individuals for which the values already differ.
The information from the case studies in the Samaniego and Eubank (1991) bilingual study, referred to above, could be regarded as an observational study with a small sample size (four or five schools, depending on definitions). The investigators fit separate regression models at each of the schools to adjust for differences across students. Their principal finding from the regressions was that there were quite different learning dynamics in different settings.
Observational studies are most naturally suited to drawing descriptive conclusions or statements about how groups differ. However, investigators and policy makers are often interested in why the differences occur. That is, they wish to draw inferences about the causal mechanisms responsible for observed differences and make policy recommendations based on the causal conclusions. Borrowing terminology from the literature on experiments, one often refers to treatment groups (groups that have received an intervention of policy interest) and control groups (groups that have received no treatment or the standard treatment). Descriptive inference draws conclusions about how the treatment and control groups differ. Causal inference attempts to attribute these differences to the treatment itself.
In an observational study the investigator does not control how the treatments are applied, but must simply observe units that have received the treatments as they occur naturally. This lack of control over assignment to treatments makes it difficult to draw unambiguous causal inferences from observational studies. The problem is that there is no way to ensure that the difference in treatments is the only relevant difference between treatment and control groups. Selection bias is the term used to describe the situation in which the variables that affect the response also affect whether or not the intervention is received. Selection bias must be a major concern of any investigation that seeks to draw causal conclusions from observational data. For example, suppose that a study seeks to discern differences between public and private school students. One question of interest is how much better or worse public school students would do if they were assigned to a private school. An outcome (measured response) is a test score, and the conceptual intervention is assignment to a private school rather than a public school. The study, however, is observational because the students are not assigned to private or public schools, but choose one or the other, presumably on the basis of individual characteristics that may or may not be known to, and observed by, the investigator. A simple statistical model describes the response in this study as the sum of three parts: (1) the effect of assignment to a private school (intervention effect); (2) an innate ability; and (3) a randomly distributed statistical error. The average response for private school students is the intervention effect plus average innate ability. The average response for public school students is simply average innate ability. Thus, at first blush, it seems that the difference between average responses for private and public school students is the effect of being assigned to a private school.
The problem is that the average innate ability of private school students may not be the same as that of public school students. Innate ability may affect whether a student goes to a private school. If it were possible to predict innate ability through some surrogate, such as family income, without error, then one could discern the intervention effect. Clearly, this is not possible. Alternatively, one could determine the intervention effect through random assignment to the intervention: see the role of random assignment in experiments below.
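The bias can be illustrated with a small simulation (a hypothetical sketch; the effect size, ability distribution, and selection rule are all invented). When higher-ability students select into private schools, the naive difference in group means greatly overstates the true intervention effect, while random assignment recovers it:

```python
import random

random.seed(0)

TRUE_EFFECT = 5.0  # hypothetical effect of private-school assignment

def response(private, ability):
    # response = intervention effect + innate ability + random error
    return (TRUE_EFFECT if private else 0.0) + ability + random.gauss(0, 2)

abilities = [random.gauss(50, 10) for _ in range(10_000)]

# Observational regime: higher-ability students choose private schools.
obs = [(a > 55, response(a > 55, a)) for a in abilities]

# Experimental regime: assignment is randomized, independent of ability.
exp = []
for a in abilities:
    private = random.random() < 0.5
    exp.append((private, response(private, a)))

def diff_in_means(data):
    treated = [y for p, y in data if p]
    control = [y for p, y in data if not p]
    return sum(treated) / len(treated) - sum(control) / len(control)

print(f"observational estimate: {diff_in_means(obs):.1f}")  # inflated by selection
print(f"randomized estimate:    {diff_in_means(exp):.1f}")  # near the true effect
```

In the observational regime the estimated "effect" includes the difference in average innate ability between the self-selected groups; randomization breaks that link by making assignment independent of ability.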
Analyses that take explicit account of the process of self-selection into public versus private schools, or more general selection processes, are an essential ingredient of assessments (in observational studies) of alternative treatments. For a discussion of the basic ideas in modeling processes of selection, see Cochran (1965).
Experiments and Quasi-Experiments
In a study of the effect of an intervention, the intervention may or may not be controlled by the investigator. In a study with a controlled intervention, there are usually secondary, intervening, or confounding variables that may affect the results but are not themselves of primary interest. These confounding effects must be allowed for, either in the design or in the analysis of the study. Inference is necessarily based on a model, whether implicit or explicit, that describes how the secondary variables affect the results. Either the model is known or is not known but is assumed. This is a fuzzy distinction. A better distinction would be whether
or not the model is widely accepted in scientific communities. When a model is established by the design of the study, the distinction is clear, and there is common acceptance that the model is known. Two common ways in which a model is established by design are when levels of secondary variables are predetermined and controlled for the desired inference and when the intervention is randomly assigned, thereby forming a group of those selected to receive the intervention to be compared with those who have not been selected. In both of these situations, the secondary variables do not have to be measured or adjusted for. Moreover, with randomization, the secondary variables do not even need to be identified.
If the intervention in a study is controlled by the investigator and the model that describes the effects of secondary variables is known, as is the case with randomization, the study is an experiment. The term quasi-experiment pertains to studies in which the model to describe effects of secondary variables is not known but is assumed. It applies to either observational studies or studies with controlled interventions but without randomization.
One widely used example of a quasi-experiment involves an interrupted time series, with observations before and after the “interruption.” In that context, “before” serves as a control for “after,” and the difference is used to measure the effect of the treatment that is applied during the interruption. When change is at issue, one needs multiple time series, with an interruption in one or more of the series, but not in the others. Quasi-experiments allow for stronger causal inferences than do uncontrolled observational studies. They fall short of randomized controlled experiments as the study of choice for causal purposes. In the interrupted time-series example, a serious concern is the possibility that a change in prior conditions, rather than the treatment, produced the observed interruption.
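A minimal sketch of the multiple time-series idea, with invented numbers: the before-after change in the interrupted series is compared with the change in an uninterrupted comparison series, so that a shared trend is not mistaken for a treatment effect.

```python
# Hypothetical monthly average test scores (all numbers invented).
# The treated series is "interrupted" by a new program between periods.
treated_before, treated_after = [70, 71, 69, 70], [75, 76, 74, 75]
control_before, control_after = [68, 69, 68, 69], [69, 68, 69, 70]

def mean(xs):
    return sum(xs) / len(xs)

# Change in the interrupted series...
treated_change = mean(treated_after) - mean(treated_before)
# ...minus the change in the uninterrupted comparison series.
control_change = mean(control_after) - mean(control_before)

estimated_effect = treated_change - control_change
print(f"estimated effect: {estimated_effect:.2f}")
```

Without the comparison series, any background change over the same period (a new test form, a district-wide curriculum shift) would be attributed to the treatment.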
In controlled field trials or experiments, the investigator explores what happens when treatments or programs are deliberately implemented or changed. An experiment is a carefully controlled study designed to discover what happens when variables are changed. In experiments, the process of randomization—that is, the random assignment of treatments to units or of units to treatments—is used to remove doubts about the link of the assignment to the outcomes of interest. There is a tension between randomizing treatment assignment and deliberately balancing the treatment and control groups with respect to other variables that may be correlated with the outcome. Randomization provides for internal validity of the results of the experiment: that is, if one applied the treatments with a different randomization one would still expect to be able to make similar inferences to the collection of units in the experiment. External validity then allows for generalizations to some larger population. The issues of internal and external validity are treated in more detail in Campbell (1978).
Control in an experiment may be exercised not only over the implementation of the treatments, but also over other variables that might otherwise impede the ability of linking differences in outcome to differences in treatment. Control in the sense of coordination of a multisite experiment is of special relevance to experimental studies for educational policy.
As valuable as controlled experiments are, it is sometimes impossible to implement them, for practical or ethical reasons, and they are sometimes undesirable. For example, the treatments may simply not be assignable to all subjects. Offering an English/Spanish immersion program to all students may be of little value to those whose native tongue is Vietnamese. Yet having such students in a classroom may be unavoidable. Furthermore, many government funding bodies require that investigators gain the informed consent of participants in an experiment (or in the case of students, the informed consent of their parents). Understandable though this is, students' refusal to participate can undermine the integrity of the experiment. In the case of educational experiments, the cooperation of school boards, administrators, teachers, and parents is often required. Finally, resources may simply be inadequate to implement a carefully designed randomized experiment. Controlled experiments may, under certain circumstances, also be undesirable. A controlled experiment in a classroom may lead to significant and interesting results, but if the classroom intervention cannot be generally applied, then the outcome may be useless.
One often settles for an observational study, without randomization, and typically without the various levels of control that are the hallmark of the well-designed randomized, controlled field trial. One must then attempt to identify confounding variables, measure them in the study, and then adjust for them to the extent possible in the analysis. This is a perilous enterprise, and it is often difficult to defend such a study against concerns about alternative causes or variables not measured or poorly measured. Confounding can also occur in randomized studies, although it is much less likely to. For descriptions of some large-scale national social experiments in which these issues and concerns are discussed, see Fienberg, Singer, and Tanur (1985).
The Role of Randomization
Deliberate randomization provides an unambiguous probability model on which to base statistical inferences. Surveys typically use random selection of sample units from some defined population. Each population unit has a known, nonzero probability of being selected into the sample. These selection probabilities are used as the basis for generalizing the sample results to the population of interest. For example, a study might conclude that students in Program A scored on average 15 points higher than students in Program B. Often, this result is accompanied by a statement of confidence in the result; for example, that the difference in scores is accurate to within plus or minus 1 point. The score difference in the sample is a matter of verifiable fact. Of more interest is the inference that scores in the population differ on average between 14 and 16 points. Randomization provides the theoretical basis for extending results from students whose scores have been measured to conclusions about students whose scores have not been measured. That is, in this context, randomization or random selection from a well-defined population provides external validity.
Experiments use random assignment to treatments in drawing inferences about
the effect of treatments. Again, the observed difference between the treatment and control groups is a matter of verifiable fact, but one is really interested in what would have happened had the treatments been assigned differently (or, in what would happen if the treatments were assigned in similar circumstances at a future time). Randomization provides a theoretical basis for extending the sample results to conclusions about what would have happened to the units in the experiment if a different realization of the randomization had occurred. That is, randomization provides for internal validity.
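The way randomization itself supplies a basis for inference can be sketched with a small randomization test (outcomes, units, and group sizes are invented for illustration). The observed difference is compared with the distribution of differences over every assignment the randomization could have produced:

```python
import itertools

# Hypothetical outcomes for 8 classrooms, 4 randomly assigned to each program.
outcomes = [82, 75, 88, 79, 71, 68, 74, 70]
treated_idx = {0, 1, 2, 3}  # the assignment actually realized

def diff(assignment):
    # Difference in mean outcome between the assigned group and the rest.
    treat = [outcomes[i] for i in assignment]
    ctrl = [outcomes[i] for i in range(8) if i not in assignment]
    return sum(treat) / 4 - sum(ctrl) / 4

observed = diff(treated_idx)

# Randomization distribution: the difference under every possible
# assignment of 4 classrooms out of 8 to the treatment.
all_diffs = [diff(set(c)) for c in itertools.combinations(range(8), 4)]
p_value = sum(d >= observed for d in all_diffs) / len(all_diffs)

print(f"observed difference: {observed:.2f}")
print(f"randomization p-value: {p_value:.3f}")
```

The only probability statement needed is the one created by the randomization itself: each of the 70 possible assignments was equally likely, so the rarity of the observed difference under reassignment is a matter of direct calculation rather than of modeling assumptions.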
Some statisticians are willing to generalize from samples in which selection was not random, as long as a plausible argument can be made that the results would be similar had selection been random. Thus, they would accept external validity in a population similar in important respects to the one studied, even if the sample units were not selected at random from that population. They might also accept internal validity for a study that could be argued to resemble a randomized experiment or for which they could build a defensible statistical model of the differences from a randomized experiment and use it to adjust for those differences.
Those involved with evaluation studies are legitimately concerned with both internal and external validity. Without external validity, one cannot tell whether a proposed policy would be effective beyond the specific units that have been studied. Without internal validity, one cannot even tell whether the proposed policy is effective in the group studied. Unfortunately, there is a tension between achieving internal and external validity. Generalizing to large populations (such as all school children in the United States) requires large studies, preferably with random selection of participating schools and districts. In such large studies, it is extremely difficult to exercise the controls required for internal validity. In smaller studies, it is much easier to achieve internal validity, but one generally cannot include all the subpopulations and sets of conditions of policy interest.
TREATMENTS AND OUTCOMES
Since the results of an evaluation study will depend critically on which outcomes are studied, it is important to define clearly the outcome of interest. In bilingual education there are a variety of possible outcomes, of which two are of particular interest: one is proficiency in English as soon as possible, and the other is proficiency in academic subjects—mathematics, English, reading, etc. The preferred bilingual education treatment may vary depending on the choice of outcome.
Treatment Definitions and Treatment Integrity
No matter how hard an investigator works to define the treatment to be applied in a given study, when the treatment is actually applied, changes tend to creep in. Over the course of a multiple-year immersion study, teachers may treat students differently than the program protocol requires. For example, if teachers
expect a new program to be superior to the standard one, they may begin to change what they do in classrooms that are supposed to be implementing the standard program. This would change the treatment being applied to the control group and make it more difficult to find differences in outcomes among the groups under study. One of the lessons from medical experimentation is that a medicine prescribed is not necessarily a medicine actually used. Thus, for experiments in bilingual education, investigators must specify a priori exactly how the treatment or program is to be implemented and how they plan to monitor that implementation.
Distinguishing the Unit of Experimentation from the Unit of Analysis
There is an intimate link between the unit of randomization in an experiment and the unit of analysis. Suppose one assigns programs to schools but one measures outcomes on students. What does the randomization justify as the level of analysis for internal variability? The usual social science and educational research textbooks are often silent on this issue.
If one carries out the assignment of treatments at the level of schools, then that is the level that can be justified for causal analysis. To analyze the results at the student level is to introduce a new, nonrandomized level into the study, and it raises the same issues as does the nonrandomized observational study. This means that if one does an experiment with 10 schools organized into 2 districts of 5 each and if one randomly assigns Program X to District 1 and Program Y to District 2, then there are just 2 observations at the level of assignment even though there were thousands of students participating in the study.
The implications of these remarks are twofold. First, it is advisable to use randomization at the level at which units are most naturally manipulated. Second, when the unit of observation is “lower” than the unit of randomization or assignment of treatment, then for many purposes the data need to be aggregated in some appropriate fashion to provide a measure that can be analyzed at the level of assignment. Such aggregation may be as simple as a summary statistic or as complex as a context-specific model for association among lower-level observations. There are concerns, however, other than the validity of the randomization in designing and carrying out an analysis. Even when treatments have been randomized, it is sometimes desirable both for reasons of improved accuracy or because of the research questions being addressed in the study to statistically adjust for attributes of the units or to carry out the analysis at a lower level of aggregation than the units being randomized. An example of the former is in the RAND Health Insurance Experiment (Marquis et al., 1987) and of the latter in a randomized school-based drug prevention study (Ellickson and Bell, 1992). As this brief discussion suggests, there are tradeoffs in designing and carrying out valid evaluation studies.
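The aggregation step described above can be sketched as follows (school names, program labels, and scores are invented for illustration). Student scores are first summarized at the school level, the unit of assignment, and the comparison is then carried out on those summaries:

```python
# Hypothetical student scores nested within schools; the program (X or Y)
# was assigned at the school level, not the student level.
schools = {
    "A": ("X", [72, 75, 71]),
    "B": ("X", [80, 78, 82]),
    "C": ("Y", [70, 69, 73]),
    "D": ("Y", [66, 68, 67]),
}

# Aggregate student scores to one summary per school, the unit of assignment.
school_means = {name: sum(scores) / len(scores)
                for name, (_, scores) in schools.items()}

# Analysis at the level of assignment: compare school means by program.
x_means = [school_means[n] for n, (p, _) in schools.items() if p == "X"]
y_means = [school_means[n] for n, (p, _) in schools.items() if p == "Y"]

print(f"Program X mean of school means: {sum(x_means) / len(x_means):.2f}")
print(f"Program Y mean of school means: {sum(y_means) / len(y_means):.2f}")
```

The effective sample size for the program comparison is the number of schools (four), not the number of students (twelve); treating each student as an independent observation would overstate the precision of the comparison.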
DEVELOPING A RESEARCH AND EVALUATION PROGRAM
In undertaking a program of research and evaluation, it is important to distinguish three broad classes of objectives, labeled here as description, research, and evaluation. In general these classes of objectives require different types of study design, and there is an order among them: general descriptive and research studies must precede evaluation studies if there is to be a likelihood of success in conducting an evaluation.
A descriptive study is one for which the intention is to characterize the population and its subgroups. The descriptive component of the Longitudinal Study and, to a lesser extent, the descriptive phase of the longitudinal component of that study belong in this mode. They attempt to characterize the types of bilingual education programs available and the students and teachers who participate in them. One tries to design a descriptive study so that it is possible to generalize to the population of interest through the use of statistical methods. Before embarking on an evaluation study of a series of program types, it is essential to know about the programs actually in place and the extent of their implementation.
Research studies attempt to determine explicitly the relative effectiveness of specifically defined programs. Investigators often conduct studies in restricted populations and attempt to use experimental design features, such as randomization and matching. It is common practice to conduct research studies under relatively controlled conditions. These features enhance the ability of the studies to determine that the programs studied have had a real effect, and they can give an indication of the magnitude of such effects.
Evaluation studies attempt to ascertain the general effectiveness of broad classes of programs for the purposes of informing public policy. As with research studies, the use of controlled conditions and even randomization adds to their value. A given program or intervention is likely to vary from site to site with regard to the details of implementation, and an evaluation study often covers a variety of subpopulations. Thus, design of an evaluation study can be extremely difficult.
In an orderly world, one would expect a natural progression from description to research to evaluation. Yet in the real world, studies to evaluate the impact of policy decisions often proceed in the absence of careful research studies to inform policy.
Hoaglin et al. (1982) and Mosteller, Fienberg, and Rourke (1983) provide elementary introductions to the types of statistical studies described in this chapter. Yin (1989) gives some formal ideas on the design and methodology of case studies. Kish (1965) is a standard text for the design and analysis of complex sample surveys. Cochran (1965) and Rubin (1984) discuss the planning and analysis of observational studies and the devices that can be used to strengthen them for the purposes of causal inference. Cook and Campbell (1979) give a description of specific designs for quasi-experiments, and Cox (1958b) provides a detailed introduction to randomized controlled experiments. For a detailed introduction to assorted topics in statistical studies and methods of analysis, see the encyclopedias edited by Kruskal and Tanur (1978) and Johnson and Kotz (1982–1989).
Campbell, D. T. (1978) Experimental design: Quasi-experimental design. In W. H. Kruskal and J. M. Tanur, eds., International Encyclopedia of Statistics, pp. 299–304. New York: The Free Press.
Cochran, W. G. (1965) The planning of observational studies of human populations (with discussion). Journal of the Royal Statistical Society, Series A, 128, 124–135.
Cook, T. D., and Campbell, D. T. (1979) Quasi-experimentation. Chicago, Ill.: Rand McNally.
Cox, D. R. (1958b) The Planning of Experiments. New York: John Wiley.
Ellickson, P. L., and Bell, R. M. (1992) Challenges to social experiments: A drug prevention example. Journal of Research in Crime and Delinquency, 29, 79–101.
Evans, A. S. (1976) Causation and disease: The Henle-Koch postulates revisited. Yale Journal of Biology and Medicine, 49, 175–195.
Fienberg, S. E., Singer, B., and Tanur, J. (1985) Large-scale social experimentation in the United States. In A. C. Atkinson and S. E. Fienberg, eds., A Celebration of Statistics: The ISI Centenary Volume, pp. 287–326. New York: Springer-Verlag.
Hoaglin, D. C., Light, R., McPeek, B., Mosteller, F., and Stoto, M. (1982) Data for Decisions. Cambridge, Mass.: Abt Associates.
Johnson, N. L., and Kotz, S., eds. (1982–1989) The Encyclopedia of Statistical Sciences (10 volumes). New York: John Wiley.
Kish, L. (1965) Survey Sampling. New York: John Wiley.
Kruskal, W. H., and Mosteller, F. (1988) Representative sampling. In S. Kotz and N. L. Johnson, eds., Encyclopedia of Statistical Sciences, volume 8, pp. 77–81. New York: John Wiley.
Kruskal, W. H., and Tanur, J. M., eds. (1978) The International Encyclopedia of Statistics (2 volumes). New York: Macmillan and the Free Press.
Marini, M. M., and Singer, B. (1988) Causality in the social sciences. In C. C. Clogg, ed., Sociological Methodology 1988, chapter 11, pp. 347–409. Washington, D.C.: American Sociological Association.
Marquis, W. G., Newhouse, J. P., Duan, N., Keeler, E. B., Leibowitz, A., and Marquis, M. S. (1987) Health insurance and the demand for medical care: Evidence from a randomized experiment. American Economic Review, 77, 252–277.
Mosteller, F., Fienberg, S. E., and Rourke, R. E. K. (1983) Beginning Statistics with Data Analysis. Reading, Mass.: Addison-Wesley.
Rubin, D. B. (1984) William G. Cochran's contributions to the design, analysis, and evaluation of observational studies. In P. S. R. S. Rao and J. Sedransk, eds., W. G. Cochran's Impact on Statistics, pp. 37–69. New York: Wiley.
Samaniego, F. J., and Eubank, L. A. (1991) A statistical analysis of California's case study project in bilingual education. Technical Report 208, Division of Statistics, University of California, Davis.
Yin, R. K. (1989) Case Study Research: Design and Methods (revised ed.). Newbury Park, Calif.: Sage Publications.