Designs for the Conduct of Scientific Research in Education
The salient features of education delineated in Chapter 4 and the guiding principles of scientific research laid out in Chapter 3 set boundaries for the design and conduct of scientific education research. Thus, the design of a study (e.g., randomized experiment, ethnography, multiwave survey) does not itself make it scientific. However, if the design directly addresses a question that can be addressed empirically, is linked to prior research and relevant theory, is competently implemented in context, logically links the findings to interpretation ruling out counterinterpretations, and is made accessible to scientific scrutiny, it could then be considered scientific. That is: Is there a clear set of questions underlying the design? Are the methods appropriate to answer the questions and rule out competing answers? Does the study take previous research into account? Is there a conceptual basis? Are data collected in light of local conditions and analyzed systematically? Is the study clearly described and made available for criticism? The more closely aligned it is with these principles, the higher the quality of the scientific study. And the particular features of education require that the research process be explicitly designed to anticipate the implications of these features and to model and plan accordingly.
Our scientific principles include research design—the subject of this chapter—as but one aspect of a larger process of rigorous inquiry. However, research design (and corresponding scientific methods) is a crucial aspect of science. It is also the subject of much debate in many fields, including education. In this chapter, we describe some of the most frequently used and trusted designs for scientifically addressing broad classes of research questions in education.
In doing so, we develop three related themes. First, as we posit earlier, a variety of legitimate scientific approaches exist in education research. Therefore, the description of methods discussed in this chapter is illustrative of a range of trusted approaches; it should not be taken as an authoritative list of tools to the exclusion of any others.1 As we stress in earlier chapters, the history of science has shown that research designs evolve, as do the questions they address, the theories they inform, and the overall state of knowledge.
Second, we extend the argument we make in Chapter 3 that designs and methods must be carefully selected and implemented to best address the question at hand. Some methods are better than others for particular purposes, and scientific inferences are constrained by the type of design employed. Methods that may be appropriate for estimating the effect of an educational intervention, for example, would rarely be appropriate for use in estimating dropout rates. While researchers—in education or any other field—may overstate the conclusions from an inquiry, the strength of scientific inference must be judged in terms of the design used to address the question under investigation. A comprehensive explication of a hierarchy of appropriate designs and analytic approaches under various conditions would require a depth of treatment found in research methods textbooks. This is not our objective. Rather, our goal is to illustrate that among available techniques, certain designs are better suited to address particular kinds of questions under particular conditions than others.
Third, in order to generate a rich source of scientific knowledge in education that is refined and revised over time, different types of inquiries and methods are required. At any time, the types of questions and methods depend in large part on an accurate assessment of the overall state of knowledge and professional judgment about how a particular line of inquiry could advance understanding. In areas with little prior knowledge, for example, research will generally need to involve careful description to formulate initial ideas. In such situations, descriptive studies might be undertaken to help bring education problems or trends into sharper relief or to generate plausible theories about the underlying structure of behavior or learning. If the effects of education programs that have been implemented on a large scale are to be understood, however, investigations must be designed to test a set of causal hypotheses. Thus, while we treat the topic of design in this chapter as applying to individual studies, research design has a broader quality as it relates to lines of inquiry that develop over time.
While a full development of these notions goes considerably beyond our charge, we offer this brief overview to place the discussion of methods that follows into perspective. Also, in the concluding section of this chapter, we make a few targeted suggestions for the kinds of work we believe are most needed in education research to make further progress toward robust knowledge.
TYPES OF RESEARCH QUESTIONS
In discussing design, we have to be true to our admonition that the research question drives the design, not vice versa. To simplify matters, the committee recognized that a great number of education research questions fall into three (interrelated) types: description—What is happening? cause—Is there a systematic effect? and process or mechanism—Why or how is it happening?
The first question—What is happening?—invites description of various kinds, so as to properly characterize a population of students, understand the scope and severity of a problem, develop a theory or conjecture, or identify changes over time among different educational indicators—for example, achievement, spending, or teacher qualifications. Description also can include associations among variables, such as the characteristics of schools (e.g., size, location, economic base) that are related to (say) the provision of music and art instruction. The second question is focused on establishing causal effects: Does x cause y? The search for cause, for example, can include seeking to understand the effect of teaching strategies on student learning or state policy changes on district resource decisions. The third question confronts the need to understand the mechanism or process by which x causes y. Studies that seek to model how various parts of a complex system—like U.S. education—fit together help explain the conditions that facilitate or impede change in teaching, learning, and schooling. Within each type of question, we separate the discussion into subsections that show the use of different methods given more fine-grained goals and conditions of an inquiry.
Although for ease of discussion we treat these types of questions separately, in practice they are closely related. As our examples show, within particular studies, several kinds of queries can be addressed. Furthermore, various genres of scientific education research often address more than one of these types of questions. Evaluation research—the rigorous and systematic evaluation of an education program or policy—exemplifies the use of multiple questions and corresponding designs. As applied in education, this type of scientific research is distinguished from other scientific research by its purpose: to contribute to program improvement (Weiss, 1998a). Evaluation often entails an assessment of whether the program caused improvements in the outcome or outcomes of interest (Is there a systematic effect?). It also can involve detailed descriptions of the way the program is implemented in practice and in what contexts (What is happening?) and the ways that program services influence outcomes (How is it happening?).
Throughout the discussion, we provide several examples of scientific education research, connecting them to scientific principles (Chapter 3) and the features of education (Chapter 4). We have chosen these studies because they align closely with several of the scientific principles. These examples include studies that generate hypotheses or conjectures as well as those that test them. Both tasks are essential to science, but as a general rule they cannot be accomplished simultaneously.
Moreover, just as we argue that the design of a study does not itself make it scientific, an investigation that seeks to address one of these questions is not necessarily scientific either. For example, many descriptive studies—however useful they may be—bear little resemblance to careful scientific study. They might record observations without any clear conceptual viewpoint, without reproducible protocols for recording data, and so forth. Again, studies may be considered scientific by assessing the rigor with which they meet scientific principles and are designed to account for the context of the study.
Finally, we have tended to speak of research in terms of a simple dichotomy—scientific or not scientific—but the reality is more complicated. Individual research projects may adhere to each of the principles in varying degrees, and the extent to which they meet these goals goes a long way toward defining the scientific quality of a study. For example, while all scientific studies must pose clear questions that can be investigated empirically and be grounded in existing knowledge, more rigorous studies will begin with more precise statements of the underlying theory driving the inquiry and will generally have a well-specified hypothesis before the data collection and testing phase is begun. Studies that do not start with clear conceptual frameworks and hypotheses may still be scientific, although they are obviously at a more rudimentary level and will generally require follow-on study to contribute significantly to scientific knowledge.
Similarly, lines of research encompassing collections of studies may be more or less productive and useful in advancing knowledge. An area of research that, for example, does not advance beyond the descriptive phase toward more precise scientific investigation of causal effects and mechanisms for a long period of time is clearly not contributing as much to knowledge as one that builds on prior work and moves toward more complete understanding of the causal structure. This is not to say that descriptive work cannot generate important breakthroughs. However, the rate of progress should—as we discuss at the end of this chapter—enter into consideration of the support for advanced lines of inquiry. The three classes of questions we discuss in the remainder of this chapter are ordered in a way that reflects the sequence that research studies tend to follow as well as their interconnected nature.
WHAT IS HAPPENING?
Answers to “What is happening?” questions can be found by following Yogi Berra’s counsel in a systematic way: if you want to know what’s going on, you have to go out and look at what is going on. Such inquiries are descriptive. They are intended to provide a range of information from documenting trends and issues in a range of geopolitical jurisdictions, populations, and institutions to rich descriptions of the complexities of educational practice in a particular locality, to relationships among such elements as socioeconomic status, teacher qualifications, and achievement.
Estimates of Population Characteristics
Descriptive scientific research in education can make generalizable statements about the national scope of a problem, student achievement levels across the states, or the demographics of children, teachers, or schools. Methods that enable the collection of data from a randomly selected sample of the population provide the best way of addressing such questions. Questionnaires and telephone interviews are common survey instruments developed to gather information from a representative sample of some population of interest. Policy makers at the national, state, and sometimes district levels depend on this method to paint a picture of the educational landscape. Aggregate estimates of the academic achievement level of children at the national level (e.g., National Center for Education Statistics [NCES], National Assessment of Educational Progress [NAEP]), the supply, demand, and turnover of teachers (e.g., NCES Schools and Staffing Survey), the nation’s dropout rates (e.g., NCES Common Core of Data), how U.S. children fare on tests of mathematics and science achievement relative to children in other nations (e.g., Third International Mathematics and Science Study) and the distribution of doctorate degrees across the nation (e.g., National Science Foundation’s Science and Engineering Indicators) are all based on surveys from populations of school children, teachers, and schools.
To yield credible results, such data collection usually depends on a random sample (alternatively called a probability sample) of the target population. If every observation (e.g., person, school) has a known chance of being selected into the study, researchers can make estimates of the larger population of interest based on statistical technology and theory. The validity of inferences about population characteristics based on sample data depends heavily on response rates, that is, the percentage of those randomly selected for whom data are collected. The measures used must have known reliability—that is, the extent to which they reproduce results. Finally, the value of a data collection instrument hinges not only on the sampling method, participation rate, and reliability, but also on its validity: that the questionnaire or survey items measure what they are supposed to measure.
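The logic of estimating a population characteristic from a probability sample can be sketched in a few lines. The example below is only an illustration under stated assumptions: the population of scores is simulated (the mean of 270 and spread of 35 are invented, not NAEP values), the function name `srs_estimate` is hypothetical, and the design is a simple random sample with full response, which is far simpler than the multistage designs used in national surveys.

```python
import math
import random
import statistics

def srs_estimate(population, n, seed=0):
    """Estimate a population mean from a simple random sample of size n.

    Hypothetical helper: assumes every unit has the same known chance
    (n / N) of selection and that every sampled unit responds.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, n)  # sampling without replacement
    mean = statistics.fmean(sample)
    # Standard error of the sample mean, with the finite population correction.
    fpc = math.sqrt(1 - n / len(population))
    se = fpc * statistics.stdev(sample) / math.sqrt(n)
    # Approximate 95 percent confidence interval (normal approximation).
    return mean, se, (mean - 1.96 * se, mean + 1.96 * se)

# Simulated population of test scores (invented numbers, for illustration only).
rng = random.Random(42)
population = [rng.gauss(270, 35) for _ in range(50_000)]

mean, se, ci = srs_estimate(population, n=1_000, seed=1)
print(f"estimate = {mean:.1f}, standard error = {se:.2f}")
print(f"approximate 95% interval = ({ci[0]:.1f}, {ci[1]:.1f})")
```

The interval is trustworthy only to the extent that the assumptions hold. If nonresponse is related to the outcome, the sampled units that actually supply data no longer represent the population, and the formula above understates the real uncertainty.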
The NAEP survey tracks national trends in student achievement across several subject domains and collects a range of data on school, student, and teacher characteristics (see Box 5-1). This rich source of information enables several kinds of descriptive work. For example, researchers can estimate the average score of eighth graders on the mathematics assessment (i.e., measures of central tendency) and compare that performance to prior years. Part of the study we feature (see below) about college women’s career choices featured a similar estimation of population characteristics. In that study, the researchers developed a survey to collect data from a representative sample of women at the two universities to aid them in assessing the generalizability of their findings from the in-depth studies of the 23 women.
The NAEP survey also illustrates how researchers can describe patterns of relationships between variables. For example, NCES reports that in 2000, eighth graders whose teachers majored in mathematics or mathematics education scored higher, on average, than did students whose teachers did not major in these fields (U.S. Department of Education, 2000). This finding is the result of descriptive work that explores the correlation between variables: in this case, the relationship between student mathematics performance and their teachers’ undergraduate major.
BOX 5-1
Simply collecting data is not in and of itself scientific. It is the rigorous organization and analysis of data to answer clearly specified questions that form the basis of scientific description, not the data themselves. Quantitative data appear in many ways in education research; their most common form of organization is as a “units-by-variables” array. The National Assessment of Educational Progress (NAEP) is an instructive example. This large survey (implemented and maintained by the National Center for Education Statistics) of 4th, 8th, and 12th graders in the United States collects information on a variety of academic subject areas, including mathematics and literacy, from samples drawn from these grades on a regular schedule.
There are several types of units*, for example, students and teachers. Information is systematically collected from both students and teachers in areas that are appropriate to each type of unit. For students, NAEP collects data on academic performance as well as background information. Teachers are surveyed about their training and experience and their methods of instruction. The units-by-variables organization of data is important because each row corresponds to all the data for each unit and the columns correspond to the information represented by a single variable across all the units in the study. Modern psychometric methods are available to summarize this complex set of information into reports on student achievement and its relation to other factors. This combination of rigorous data collection, analysis, and reporting is what distinguishes scientific description from casual observation.
Such associations cannot be used to infer cause. However, there is a common tendency to make unsubstantiated jumps from establishing a relationship to concluding cause. As committee member Paul Holland quipped during the committee’s deliberations, “Casual comparisons inevitably invite careless causal conclusions.” To illustrate the problem with drawing causal inferences from simple correlations, we use an example from work that compares Catholic schools to public schools. We feature this study later in the chapter as one that competently examines causal mechanisms. Before addressing questions of mechanism, foundational work involved simple correlational results that compared the performance of Catholic high school students on standardized mathematics tests with their counterparts in public schools. These simple correlations revealed that average mathematics achievement was considerably higher for Catholic school students than for public school students (Bryk, Lee, and Holland, 1993). However, the researchers were careful not to conclude from this analysis that attending a Catholic school causes better student outcomes, because there are a host of potential explanations (other than attending a Catholic school) for this relationship between school type and achievement. For example, since Catholic schools can screen children for aptitude, they may have a more able student population than public schools at the outset. (This is an example of the classic selectivity bias that commonly threatens the validity of causal claims in nonrandomized studies; we return to this issue in the next section.) In short, there are other hypotheses that could explain the observed differences in achievement between students in different sectors that must be considered systematically in assessing the potential causal relationship between Catholic schooling and student outcomes.
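A small simulation can make the selectivity problem concrete. The sketch below is a hypothetical illustration, not a reanalysis of any study cited here: the function `simulate_selection`, the screening rule, and every number in it are assumptions chosen for clarity. Students are sorted into a selective sector partly on prior aptitude, and the sector itself is given no effect on scores.

```python
import random
import statistics

def simulate_selection(n=20_000, true_effect=0.0, seed=0):
    """Return the raw score gap between a selective sector and all other schools.

    Hypothetical model: scores depend on prior aptitude plus noise, and the
    selective sector adds `true_effect` points (zero by default).
    """
    rng = random.Random(seed)
    selective, other = [], []
    for _ in range(n):
        aptitude = rng.gauss(0, 1)
        # Screening: students with higher aptitude are more likely to enroll.
        enrolls = aptitude + rng.gauss(0, 1) > 1.0
        score = 50 + 10 * aptitude + rng.gauss(0, 5)
        if enrolls:
            score += true_effect
        (selective if enrolls else other).append(score)
    return statistics.fmean(selective) - statistics.fmean(other)

naive_gap = simulate_selection()  # the sector has no effect by construction
print(f"raw gap in mean scores: {naive_gap:.1f} points")
```

Even though the sector contributes nothing in this model, the raw comparison shows a sizable gap, because the two groups differed in aptitude before schooling began. A correlation of this kind is therefore compatible with both a real sector effect and no effect at all.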
Descriptions of Localized Educational Settings
In some cases, scientists are interested in the fine details (rather than the distribution or central tendency) of what is happening in a particular organization, group of people, or setting. This type of work is especially important when good information about the group or setting is non-existent or scant. In this type of research, then, it is important to obtain first-hand, in-depth information from the particular focal group or site. For such purposes, selecting a random sample from the population of interest may not be the proper method of choice; rather, samples may be purposively selected to illuminate phenomena in depth.2 For example, to better understand a high-achieving school in an urban setting with children of predominantly low socioeconomic status, a researcher might conduct a detailed case study or an ethnographic study (a case study with a focus on culture) of such a school (Yin and White, 1986; Miles and Huberman, 1994). This type of scientific description can provide rich depictions of the policies, procedures, and contexts in which the school operates and generate plausible hypotheses about what might account for its success. Researchers often spend long periods of time in the setting or group in order to understand what decisions are made, what beliefs and attitudes are formed, what relationships are developed, and what forms of success are celebrated. These descriptions, when used in conjunction with causal methods, are often critical to understand such educational outcomes as student achievement because they illuminate key contextual factors.
Box 5-2 provides an example of a study that described in detail (and also modeled several possible mechanisms; see later discussion) a small group of women, half who began their college careers in science and half in what were considered more traditional majors for women. This descriptive part of the inquiry involved an ethnographic study of the lives of 23 first-year women enrolled in two large universities.
Scientific description of this type can generate systematic observations about the focal group or site, and patterns in results may be generalizable to other similar groups or sites or for the future. As with any other method, a scientifically rigorous case study has to be designed around the research question it is meant to answer. That is, the investigator has to choose sites, occasions, respondents, and times with a clear research purpose in mind and be sensitive to his or her own expectations and biases (Maxwell, 1996; Silverman, 1993). Data should typically be collected from varied sources, by varied methods, and corroborated by other investigators. Furthermore, the account of the case needs to draw on original evidence and provide enough detail so that the reader can make judgments about the validity of the conclusions (Yin, 2000).
BOX 5-2
In the late 1970s cultural anthropologists Dorothy Holland and Margaret Eisenhart set out to learn more about why so few women who began their college careers in nontraditional majors (e.g., science, mathematics, computer science) ended up working in those fields. At the time, several different explanations were being proposed: Women were not well prepared before coming to college; women were discriminated against in college; women did not want to compete with men for jobs. Holland and Eisenhart (1990) first designed ethnographic case studies of a small group of freshman women at two public, residential universities—one historically black, one historically white. From volunteers on each campus, matched groups were selected—based on a survey of their high school grades, college majors, college activities, and college peers. All of the 23 women who participated had at least a B+ average in high school. Half from each campus were planning traditional majors for women; half were planning nontraditional majors.
Based on analysis of the ethnographic data obtained from a year of participant observation and open-ended interviews with the women, models were developed to describe how the 23 women participated in college life. The models depicted three different kinds of commitment to schoolwork in college. Each model included: (1) the women’s views about the value of schoolwork; (2) their reasons for doing schoolwork; and (3) the perceived costs (both financial and social) of doing schoolwork. Extrapolating from the models, the researchers predicted what each woman would do after college—continue in school, get a job in her field, get a job outside of her field, get married, etc. At the end of 4 years and again after 3 more years, the researchers followed up with telephone interviews with each woman. In all 23 cases, the predictions based on the models of commitment to schoolwork were confirmed. Also, in all cases, the models of commitment were better predictors of the future than precollege preparation (grades, courses taken), discrimination against women, or feelings about competing with men.
Results may also be used as the basis for new theoretical developments, new experiments, or improved measures on surveys that indicate the extent of generalizability. In the work done by Holland and Eisenhart (1990), for example (see Box 5-2), a number of theoretical models were developed and tested to explain how women decide to pursue or abandon nontraditional careers in the fields they had studied in college. Their finding that commitment to college life—not fear of competing with men or other hypotheses that had previously been set forth—best explained these decisions was new knowledge. It has been shown in subsequent studies to generalize somewhat to similar schools, though additional models seem to exist at some schools (Seymour and Hewitt, 1997).
Although such purposively selected samples may not be scientifically generalizable to other locations or people, these vivid descriptions often appeal to practitioners. Scientifically rigorous case studies have strengths and weaknesses for such use. They can, for example, help local decision makers by providing them with ideas and strategies that have promise in their educational setting. They cannot (unless combined with other methods) provide estimates of the likelihood that an educational approach might work under other conditions or that they have identified the right underlying causes. As we argue throughout this volume, research designs can often be strengthened considerably by using multiple methods—integrating the use of both quantitative estimates of population characteristics and qualitative studies of localized context.
Other descriptive designs may involve interviews with respondents or document reviews in a fairly large number of cases, such as 30 school districts or 60 colleges. Cases are often selected to represent a variety of conditions (e.g., urban/rural; east/west; affluent/poor). Such descriptive studies can be longitudinal, returning to the same cases over several years to see how conditions change.
These examples of descriptive work meet the principles of science, and have clearly contributed important insights to the base of scientific knowledge. If research is to be used to answer questions about “what works,” however, it must advance to other levels of scientific investigation such as those considered next.
IS THERE A SYSTEMATIC EFFECT?
Research designs that attempt to identify systematic effects have at their root an intent to establish a cause-and-effect relationship. Causal work is built on both theory and descriptive studies. In other words, the search for causal effects cannot be conducted in a vacuum: ideally, a strong theoretical base as well as extensive descriptive information are in place to provide the intellectual foundation for understanding causal relationships.
The simple question of “does x cause y?” typically involves several different kinds of studies undertaken sequentially (Holland, 1993). In basic terms, several conditions must be met to establish cause. Usually, a relationship or correlation between the variables is first identified.3 Researchers also confirm that x preceded y in time (temporal sequence) and, crucially, that all presently conceivable rival explanations for the observed relationship have been “ruled out.” As alternative explanations are eliminated, confidence increases that it was indeed x that caused y. “Ruling out” competing explanations is a central metaphor in medical research, diagnosis, and other fields, including education, and it is the key element of causal queries (Campbell and Stanley, 1963; Cook and Campbell, 1979, 1986).
The use of multiple qualitative methods, especially in conjunction with a comparative study of the kind we describe in this section, can be particularly helpful in ruling out alternative explanations for the results observed (Yin, 2000; Weiss, in press). Such investigative tools can enable stronger causal inferences by enhancing the analysis of whether competing explanations can account for patterns in the data (e.g., unreliable measures or contamination of the comparison group). Similarly, qualitative methods can examine possible explanations for observed effects that arise outside of the purview of the study. For example, while an intervention was in progress, another program or policy may have offered participants opportunities similar to, and reinforcing of, those that the intervention provided. Thus, the “effects” that the study observed may have been due to the other program (“history” as the counterinterpretation; see Chapter 3). When all plausible rival explanations are identified and various forms of data can be used as evidence to rule them out, the causal claim that the intervention caused the observed effects is strengthened. In education, research that explores students’ and teachers’ in-depth experiences, observes their actions, and documents the constraints that affect their day-to-day activities provides a key source of generating plausible causal hypotheses.
We have organized the remainder of this section into two parts. The first treats randomized field trials, an ideal method when entities being examined can be randomly assigned to groups. Experiments are especially well-suited to situations in which the causal hypothesis is relatively simple. The second describes situations in which randomized field trials are not feasible or desirable, and showcases a study that employed causal modeling techniques to address a complex causal question. We have distinguished randomized studies from others primarily to signal the difference in the strength with which causal claims can typically be made from them. The key difference between randomized field trials and other methods with respect to making causal claims is the extent to which the assumptions that underlie them are testable. By this simple criterion, nonrandomized studies are weaker in their ability to establish causation than randomized field trials, in large part because the role of other factors in influencing the outcome of interest is more difficult to gauge in nonrandomized studies. Other conditions that affect the choice of method are discussed in the course of the section.
Causal Relationships When Randomization Is Feasible
A fundamental scientific concept in making causal claims—that is, inferring that x caused y—is comparison. Comparing outcomes (e.g., student achievement) between two groups that are similar except for the causal variable (e.g., the educational intervention) helps to isolate the effect of that causal agent on the outcome of interest.4 As we discuss in Chapter 4, it is sometimes difficult to retain the sharpness of a comparison in education due to proximity (e.g., a design that features students in one classroom assigned to different interventions is subject to “spillover” effects) or human volition (e.g., teacher, parent, or student decisions to switch to another condition threaten the integrity of the randomly formed groups). Yet, from a scientific perspective, randomized trials (we also use the term “experiment” to refer to causal studies that feature random assignment) are the ideal for establishing whether one or more factors caused change in an outcome because of their strong ability to enable fair comparisons (Campbell and Stanley, 1963; Boruch, 1997; Cook and Payne, in press). Random allocation of students, classrooms, schools—whatever the unit of comparison may be—to different treatment groups assures that these comparison groups are, roughly speaking, equivalent at the time an intervention is introduced (that is, they do not differ systematically on account of hidden influences) and chance differences between the groups can be taken into account statistically. As a result, the independent effect of the intervention on the outcome of interest can be isolated. In addition, these studies enable legitimate statistical statements of confidence in the results.
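The two properties described above, rough equivalence of the groups at the outset and a statistical statement of confidence in the estimated effect, can be illustrated with a short simulation. This is a sketch under invented assumptions: the function `randomized_trial`, the unobserved "aptitude" variable, and the effect of 3 points are hypothetical, and it ignores the complications noted above such as spillover, crossover between conditions, and attrition.

```python
import math
import random
import statistics

def randomized_trial(n=2_000, true_effect=3.0, seed=0):
    """Simulate random allocation of n units to two equal-sized groups.

    Returns (baseline imbalance on a hidden trait, estimated effect,
    standard error of the estimate). All quantities are hypothetical.
    """
    rng = random.Random(seed)
    aptitude = [rng.gauss(0, 1) for _ in range(n)]  # hidden influence on outcomes
    rng.shuffle(aptitude)                           # random allocation of units
    treated, control = aptitude[: n // 2], aptitude[n // 2 :]

    # Outcomes depend on the hidden trait, the intervention, and noise.
    y_t = [50 + 10 * a + true_effect + rng.gauss(0, 5) for a in treated]
    y_c = [50 + 10 * a + rng.gauss(0, 5) for a in control]

    imbalance = statistics.fmean(treated) - statistics.fmean(control)
    effect = statistics.fmean(y_t) - statistics.fmean(y_c)
    se = math.sqrt(statistics.variance(y_t) / len(y_t)
                   + statistics.variance(y_c) / len(y_c))
    return imbalance, effect, se

imbalance, effect, se = randomized_trial()
print(f"baseline difference on hidden trait: {imbalance:+.3f}")
print(f"estimated effect: {effect:.2f} (standard error {se:.2f})")
```

Because allocation ignores the hidden trait, the groups differ on it only by chance, and the size of that chance difference is what the standard error quantifies. The same comparison made between self-selected groups would carry no such guarantee.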
The Tennessee STAR experiment (see Chapter 3) on class-size reduction is a good example of the use of randomization to assess cause in an education study; in particular, this tool was used to gauge the effectiveness of an intervention. Some policy makers and scientists were unwilling to accept earlier, largely nonexperimental studies on class-size reduction as a basis for major policy decisions in the state. Those studies could not guarantee a fair comparison of children in small versus large classes because the comparisons relied on statistical adjustment rather than on actual construction of statistically equivalent groups. In Tennessee, statistical equivalence was achieved by randomly assigning eligible children and teachers to classrooms of different size. If the trial was properly carried out,5 this randomization would lead to an unbiased estimate of the relative effect of class-size reduction and a statistical statement of confidence in the results.
Randomized trials are used frequently in the medical sciences and certain areas of the behavioral and social sciences, including prevention studies of mental health disorders (e.g., Beardslee, Wright, Salt, and Drezner, 1997), behavioral approaches to smoking cessation (e.g., Pieterse, Seydel, DeVries, Mudde, and Kok, 2001), and drug abuse prevention (e.g., Cook, Lawrence, Morse, and Roehl, 1984). It would not be ethical to assign individuals randomly to smoke and drink, and thus much of the evidence regarding the harmful effects of nicotine and alcohol comes from descriptive and correlational studies. However, randomized trials that show reductions in health detriments and improved social and behavioral functioning strengthen the causal links that have been established between drug use and adverse health and behavioral outcomes (Moses, 1995; Mosteller, Gilbert, and McPeek, 1980). In medical research, the relative effectiveness of the Salk vaccine (see Lambert and Markel, 2000) and streptomycin (Medical Research Council, 1948) was demonstrated through such trials. We have also learned about which drugs and surgical treatments are useless by depending on randomized controlled experiments (e.g., Schulte et al., 2001; Gorman et al., 2001; Paradise et al., 1999). Randomized controlled trials are also used in industrial, market, and agricultural research.
Such trials are not frequently conducted in education research (Boruch, De Moya, and Snyder, in press). Nonetheless, it is not difficult to identify good examples in a variety of education areas that demonstrate their feasibility (see Boruch, 1997; Orr, 1999; and Cook and Payne, in press). For example, among the education programs whose effectiveness has been evaluated in randomized trials are the Sesame Street television series (Bogatz and Ball, 1972), peer-assisted learning and tutoring for young children with reading problems (Fuchs, Fuchs, and Kazdan, 1999), and Upward Bound (Myers and Schirm, 1999). Many of these trials have been successfully implemented on a large scale, randomizing entire classrooms or schools to intervention conditions. For numerous examples of trials in which schools, work places, and other entities are the units of random allocation and analysis, see Murray (1998), Donner and Klar (2000), Boruch and Foley (2000), and the Campbell Collaboration register of trials at http://campbell.gse.upenn.edu.
Causal Relationships When Randomization Is Not Feasible
In this section we discuss the conditions under which randomization is not feasible or desirable, highlight alternative methods for addressing causal questions, and provide an illustrative example. Many nonexperimental methods and analytic approaches are commonly classified under the blanket rubric “quasi-experiment” because they attempt to approximate the underlying logic of the experiment without random assignment (Campbell and Stanley, 1963; Caporaso and Roos, 1973). These designs were developed because social science researchers recognized that in some social contexts (e.g., schools), researchers do not have the control afforded in laboratory settings and thus cannot always randomly assign units (e.g., classrooms).
Quasi-experiments (alternatively called observational studies),6 for example, sometimes compare groups of interest that exist naturally (e.g., existing classes varying in size) rather than assigning them randomly to different conditions (e.g., assigning students to small, medium, or large class size). These studies must attempt to ensure fair comparisons through means other than randomization, such as by using statistical techniques to adjust for background variables that may account for differences in the outcome of interest. For example, researchers might come across schools that vary in the size of their classes and compare the achievement of students in large and small classes, adjusting for other differences among schools and children. If the class size conjecture holds after this adjustment is made, the researchers would expect students in smaller classes to have higher achievement scores than students in larger size classes. If indeed this difference is observed, the causal effect is more plausible.
The plausibility of the researchers’ causal interpretation, however, depends on some strong assumptions. They must assume that their attempts to equate schools and children were, indeed, successful. Yet, there is always the possibility that some unmeasured, prior existing difference among schools and children caused the effect, not the reduced class size. Or, there is the possibility that teachers with reduced classes were actively involved in school reform and that their increased effort and motivation (which might wane over time) caused the effect, not the smaller classes themselves. In short, these designs cannot eliminate competing plausible hypotheses with the same authority as a true experiment.
The major weakness of nonrandomized designs is selectivity bias—the counter-interpretation that the treatment did not cause the difference in outcomes but, rather, unmeasured prior existing differences (differential selectivity) between the groups did.7 For example, a comparison of early literacy skills among low-income children who participated in a local preschool program and those who did not may be confounded by selectivity bias. That is, the parents of the children who were enrolled in preschool may be more motivated than other parents to provide reading experiences to their children at home, thus making it difficult to disentangle the several potential causes (e.g., preschool program or home reading experiences) for early reading success.
It is critical in such studies, then, to be aware of potential sources of bias and to measure them so their influence can be accounted for in relation to the outcome of interest.8 It is when these biases are not known that quasi-experiments may yield misleading results. Thus, the scientific principle of making assumptions explicit and carefully attending to ruling out competing hypotheses about what caused a difference takes on heightened importance.
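A small simulation can make the preschool example concrete. This is a sketch under stated assumptions, not a model of any actual study: the variable names, probabilities, and effect sizes are hypothetical. Parental motivation raises both the chance of enrollment and the literacy score, so a naive comparison overstates the program effect. Comparing children within levels of motivation removes the bias, but only because the simulation treats motivation as measured.

```python
import random
import statistics


def compare_estimates(n_children=6000, true_effect=2.0, seed=1):
    """Contrast a naive group comparison with a stratified (adjusted) one.

    All numbers are hypothetical. Returns (naive estimate, adjusted estimate).
    """
    rng = random.Random(seed)
    rows = []  # each row: (motivated, enrolled, literacy score)
    for _ in range(n_children):
        motivated = rng.random() < 0.5
        # Motivated parents are more likely to enroll their child.
        enrolled = rng.random() < (0.8 if motivated else 0.2)
        literacy = 100.0 + rng.gauss(0.0, 5.0)
        if motivated:
            literacy += 8.0          # home reading experiences
        if enrolled:
            literacy += true_effect  # effect of the preschool program
        rows.append((motivated, enrolled, literacy))

    def mean_difference(subset):
        enrolled_scores = [y for _, e, y in subset if e]
        other_scores = [y for _, e, y in subset if not e]
        return statistics.mean(enrolled_scores) - statistics.mean(other_scores)

    # Naive comparison: ignores who chose to enroll.
    naive = mean_difference(rows)

    # Adjusted comparison: compare within each motivation level, then
    # average the within-stratum differences weighted by stratum size.
    strata = [[r for r in rows if r[0] == level] for level in (True, False)]
    adjusted = sum(mean_difference(s) * len(s) for s in strata) / len(rows)
    return naive, adjusted


if __name__ == "__main__":
    naive, adjusted = compare_estimates()
    print(f"naive estimate:    {naive:.2f}")
    print(f"adjusted estimate: {adjusted:.2f}")
```

If motivation were dropped from the data, only the naive estimate could be computed, which is the situation in which a quasi-experiment may yield misleading results.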
In some settings, well-controlled quasi-experiments may have greater “external validity”—generalizability to other people, times, and settings— than experiments with completely random assignment (Cronbach et al., 1980; Weiss, 1998a). It may be useful to take advantage of the experience and investment of a school with a particular program and try to design a quasi-experiment that compares the school that has a good implementation of the program to a similar school without the program (or with a different program). In such cases, there is less risk of poor implementation, more investment of the implementers in the program, and potentially greater impact. The findings may be more generalizable than in a randomized experiment because the latter may be externally mandated (i.e., by the researcher) and thus may not be feasible to implement in the “real-life” practice of education settings. The results may also have stronger external validity because if a school or district uses a single program, the possible contamination of different programs because teachers or administrators talk and interact will be reduced. Random assignment within a school at the level of the classroom or child often carries the risk of dilution or blending the programs. If assignment is truly random, such threats to internal validity will not bias the comparison of programs—just the estimation of the strength of the effects.
In the section above (What Is Happening?), we note that some kinds of correlational work make important contributions to understanding broad patterns of relationships among educational phenomena; here, we highlight a correlational design that allows causal inferences about the relationship between two or more variables. When correlational methods use what are called “model-fitting” techniques based on a theoretically generated system of variables, they permit stronger, albeit still tentative, causal inferences.
Recent methodological advances, instrumental variables in particular, attempt to address the problem of selection in nonrandomized causal studies. The study described in Box 5-3 utilized this technique.
In Chapter 3, we offer an example that illustrates the use of model-fitting techniques from the geophysical sciences that tested alternative hypotheses about the causes of glaciation. In Box 5-3, we provide an example of causal modeling that shows the value of such techniques in education. This work examined the potential causal connection between teacher compensation and student dropout rates. Exploring this relationship is quite relevant to education policy, but it cannot be studied through a randomized field trial: teacher salaries, of course, cannot be randomly assigned, nor can students be randomly assigned to those teachers. Because important questions like these often cannot be examined experimentally, statisticians have developed sophisticated model-fitting techniques to statistically rule out potential alternative explanations and deal with the problem of selection bias.
The key difference between simple correlational work and model-fitting is that the latter enhances causal attribution. In the study examining teacher compensation and dropout rates, for example, researchers introduced a conceptual model for the relationship between student outcomes and teacher salary, set forth an explicit hypothesis to test about the nature of that relationship, and assessed competing models of interpretation. Empirically rejecting competing theoretical models increases confidence in the explanatory power of the remaining model(s) (although other alternative models may also exist that provide a comparable fit to the data).
The study highlighted in Box 5-3 tested different models in this way. Loeb and Page (2000) took a fresh look at a question that had a good bit of history, addressing what appeared to be converging evidence that there was no causal relationship between teacher salaries and student outcomes. They reasoned that one possible explanation for these results was that the usual “production-function” model for the effects of salary on student outcomes was inadequately specified. Specifically, they hypothesized that nonpecuniary job characteristics and alternative wage opportunities that previous models had not accounted for may be relevant in understanding the relationship between teacher compensation and student outcomes. After incorporating these opportunity costs in their model and finding a sophisticated way to control for the fact that wealthier parents are likely to send their children to schools that pay teachers more, Loeb and Page found that raising teacher wages by 10 percent reduced high school dropout rates by 3 to 4 percent.
Box 5-3
In several comprehensive reviews of research on the effects of educational expenditures on student outcomes, Hanushek (1986, 1997) found that student outcomes were not consistently related either to per-pupil outlays or to teacher salaries. Grogger (1996), Betts (1995), and Altonji (1988), using national longitudinal data sets, produced similar results.
However, Loeb and Page (2000) noted a discrepancy between these findings and studies that found school and non-salary teacher effects (e.g., Altonji, 1988; Ehrenberg and Brewer, 1994; Ferguson, 1991). Indeed, Hanushek, Kain, and Rivkin (1998) found a reliable relationship between teacher quality and students’ achievement. For Loeb and Page, these findings add a new dimension to the puzzle. “If teacher quality affects student achievement, then why do studies that predict student outcomes from teacher wages produce weak results?” (2000, p. 393).
Loeb and Page pointed out that the previous education expenditure studies failed to account for nonmonetary job characteristics and opportunities that might be open to would-be teachers in the local job market (“opportunity costs”). Both might affect a qualified teacher’s decision to teach. Consequently, they tested two competing models: the commonly used “production function” model, which predicted outcomes from expenditures and had formed the theoretical basis of most prior work on the topic, and a modified production-function model that incorporated opportunity costs. They replicated prior findings using traditional production-function procedures from previous studies. However, once they statistically adjusted for opportunity costs, they found that raising teacher wages by 10 percent reduced high school dropout rates by 3 to 4 percent. They suggested that previous research on the effects of teacher wages on student outcomes failed to show effects because it lacked adequate controls for nonwage aspects of teaching and market differences in alternative occupational opportunities.
WHY OR HOW IS IT HAPPENING?
In many situations, finding that a causal agent (x) leads to the outcome (y) is not sufficient. Important questions remain about how x causes y. Questions about how things work demand attention to the processes and mechanisms by which the causes produce their effects. However, scientific research can also legitimately proceed in the opposite direction: that is, the search for mechanism can come before an effect has been established. For example, if the process by which an intervention influences student outcomes is established, researchers can often predict its effectiveness with known probability. In either case, the processes and mechanisms should be linked to theories so as to form an explanation for the phenomena of interest.
The search for causal mechanisms, especially once a causal effect has garnered strong empirical support, can use all of the designs we have discussed. In Chapter 2, we trace a sequence of investigations in molecular biology that investigated how genes are turned on and off. Very different techniques, but ones that share the same basic intellectual approach to causal analysis reflected in these genetic studies, have yielded understandings in education. Consider, for example, the Tennessee class-size experiment (see discussion in Chapter 3). In addition to examining whether reduced class size produced achievement benefits, especially for minority students, a research team and others in the field asked (see, e.g., Grissmer, 1999) what might explain the Tennessee and other class-size effects. That is, what was the causal mechanism through which reduced class size affected achievement? To this end, researchers (Bohrnstedt and Stecher, 1999) used classroom observations and interviews to compare teaching in different class sizes. They conducted ethnographic studies in search of mechanism. They correlated measures of teaching behavior with student achievement scores. These questions are important because they enhance understanding of the foundational processes at work when class size is reduced and thus improve the capacity to implement these reforms effectively in different times, places, and contexts.
Exploring Mechanism When Theory Is Fairly Well Established
A well-known study of Catholic schools provides another example of a rigorous attempt to understand mechanism (see Box 5-4). Previous and highly controversial work on Catholic schools (e.g., Coleman, Hoffer, and Kilgore, 1982) had examined the relative benefits to students of Catholic and public schools. Drawing on these studies, as well as a fairly substantial literature related to effective schools, Bryk and his colleagues (Bryk, Lee, and Holland, 1993) focused on the mechanism by which Catholic schools seemed to achieve success relative to public schools. A series of models were developed (sector effects only, compositional effects, and school effects) and tested to explain the mechanism by which Catholic schools successfully achieve an equitable social distribution of academic achievement. The researchers’ analyses suggested that aspects of school life that enhance a sense of community within Catholic schools most effectively explained the differences in student outcomes between Catholic and public schools.
Box 5-4
In the early 1980s two influential books (Coleman, Hoffer, and Kilgore, 1982; Greeley, 1982) set off years of controversy and debate in academic and policy circles about the relative effectiveness of Catholic schools and public schools. In a synthesis of several lines of inquiry over a 10-year period, Bryk and colleagues (Bryk, Lee, and Holland, 1993) focused attention on how Catholic schools functioned to better understand this prior work and to offer insights about improving schools more generally. This longitudinal study is an excellent example of the use of multiple methods, both quantitative and qualitative, to generate converging evidence about such a complex topic. It featured in-depth case studies of seven particularly successful Catholic schools, descriptive profiles of Catholic schools nationally, and sophisticated statistical modeling techniques to assess causal mechanism.
One line of inquiry within this multilayered study featured a quasi-experiment that compared the mathematics achievement of Catholic high school students and public high school students. Using simple correlational techniques, the researchers showed that the social distribution of academic achievement was more equalized in Catholic than non-Catholic schools: for example, the achievement gap between minority and non-minority students was smaller in Catholic schools than in public schools. To better understand the possible causes behind these “sector” differences, Bryk and his colleagues used data from a rich, longitudinal data set to test whether certain features of school organization explained these differences and predicted success. Because students in this data set were not randomly assigned to attend Catholic or public schools, the researchers attempted to ensure fair comparisons by statistically holding constant other variables (such as student background) that could also explain the finding about the social distribution of achievement. Three potential explanatory models were developed and tested with respect to explaining the relative effectiveness of Catholic schools: sector effects only (the private and spiritual nature of Catholic schools); compositional effects (the composition of the student body in Catholic schools); and school effects (various features of school operations that contribute to school life). In combination, analyzing data with respect to these three potential theoretical mechanisms suggested that it is the coherence of school life in Catholic schools that most clearly accounts for their relative success in this area. Nonetheless, controversy still exists about the circumstances when Catholic schools are superior, about how to control for family differences in the choice of schools, and about the policy implications of these findings.
Exploring Mechanism When Theory Is Weak
When the theoretical basis for addressing questions related to mechanism is weak, contested, or poorly understood, other types of methods may be more appropriate. These queries often have strong descriptive components and derive their strength from in-depth study that can illuminate unforeseen relationships and generate new insights. We provide two examples in this section of such approaches: the first is the ethnographic study of college women (see Box 5-2) and the second is a “design study” that resulted in a theoretical model for how young children learn the mathematical concepts of ratio and proportion.
After generating a rich description of women’s lives in their universities based on extensive analysis of ethnographic and survey data, the researchers turned to the question of why women who majored in nontraditional majors typically did not pursue those fields as careers (see Box 5-2). Was it because women were not well prepared before college? Were they discriminated against? Did they not want to compete with men? To address these questions, the researchers developed several theoretical models depicting commitment to schoolwork to describe how the women participated in college life. Extrapolating from the models, the researchers predicted what each woman would do after completing college, and in all cases, the models’ predictions were confirmed.
A second example highlights another analytic approach for examining mechanism that begins with theoretical ideas that are tested through the design, implementation, and systematic study of educational tools (curriculum, teaching methods, computer applets) that embody the initial conjectured mechanism. The studies go by different names; perhaps the two most popular names are “design studies” (Brown, 1992) and “teaching experiments” (Lesh and Kelly, 2000; Schoenfeld, in press).
Box 5-5 illustrates a design study whose aim was to develop and elaborate the theoretical mechanism by which ratio reasoning develops in young children and to build and modify appropriate tasks and assessments that incorporate the models of learning developed through observation and interaction in the classroom. The work was linked to substantial existing literature in the field about the theoretical nature of ratio and proportion as mathematical ideas and teaching approaches to convey them (e.g., Behr, Lesh, Post, and Silver, 1983; Harel and Confrey, 1994; Mack, 1990, 1995). The initial model was tested and refined as careful distinctions and extensions were noted, explained, and considered as alternative explanations as the work progressed over a 3-year period, studying one classroom intensively. The design experiment methodology was selected because, unlike laboratory or other highly controlled approaches, it involved research within the complex interactions of teachers and students and allowed the everyday demands and opportunities of schooling to affect the investigation.
Box 5-5
In a project on student reasoning on ratio and proportion, Confrey and Lachance (2000) and colleagues examined a group of 20 students over a 3-year period in one classroom. Beginning with a conjecture about the relative independence of rational number structures (multiplication, division, ratio and proportion) from additive structures (addition and subtraction), the investigators sought the roots of ratio reasoning in a meaning of equivalence unfamiliar to the children. Consider how a 9-year-old might come to understand that 4:6 is equivalent to 6:9. Using a series of projects, tasks and challenges (such as designing a wheelchair access ramp or tourist guide to a foreign currency) researchers documented how students moved from believing that equivalence can be preserved through doubling (4:6 = 8:12) and halving (4:6 = 2:3), to the identification of a ratio unit (the smallest ratio to describe the equivalence in a set of proportions), to the ability to add and subtract ratio units (8:12 = 8+2:12+3), to the ability to solve any ratio and proportion challenge in the familiar form a:b :: c:x.
This operational description of the mechanism behind ratio reasoning was used to develop instructional tasks, like calculating the slopes of the handicapped access ramps they had designed, and to observe students engaged in them. Classroom videotaping permitted researchers to review, both during the experiment and after its completion, the actual words, actions, and representations of students and teachers to build and elaborate the underlying conjectures about ratio reasoning.
At the same time, students’ performance on mathematics assessments was compared with that of students in other classes and schools and to large-scale measures of performance on items designed to measure common misconceptions in ratio and proportion reasoning. The primary scientific product of the study was a theoretical model for ratio and proportion learning refined and enriched by years of in-depth study.
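The arithmetic behind the progression described in Box 5-5 (doubling and halving, finding a ratio unit, adding ratio units, and solving a:b :: c:x) can be written out in a few lines. The sketch below is illustrative only: the function names are hypothetical and are not taken from Confrey and Lachance (2000), and it assumes that the "ratio unit" corresponds to the ratio reduced by the greatest common divisor.

```python
from fractions import Fraction
from math import gcd


def ratio_unit(a, b):
    """Return the smallest whole-number ratio equivalent to a:b."""
    g = gcd(a, b)
    return a // g, b // g


def equivalent(first, second):
    """Two ratios are equivalent when they share the same ratio unit."""
    return ratio_unit(*first) == ratio_unit(*second)


def add_ratio_unit(ratio):
    """Add one ratio unit to a ratio, which preserves equivalence."""
    unit_a, unit_b = ratio_unit(*ratio)
    return ratio[0] + unit_a, ratio[1] + unit_b


def solve_proportion(a, b, c):
    """Solve a:b :: c:x for x, returned as an exact fraction."""
    return Fraction(b * c, a)


if __name__ == "__main__":
    print(ratio_unit(4, 6))              # the ratio unit of 4:6
    print(equivalent((4, 6), (6, 9)))    # 4:6 and 6:9 share a ratio unit
    print(add_ratio_unit((8, 12)))       # 8+2 : 12+3
    print(solve_proportion(4, 6, 6))     # x in 4:6 :: 6:x
```

Doubling and halving reach only some equivalent ratios from 4:6; going through the ratio unit 2:3 reaches 6:9 as well, which is the step the box describes as unfamiliar to the children.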
Like many such design studies, there were two main products of this work. First, through a theory-driven process of designing—and a data-driven process of refining—instructional strategies for teaching ratio and proportion, researchers produced an elaborated explanatory model of how young children come to understand these core mathematical concepts. Second, the instructional strategies developed in the course of the work itself hold promise because they were crafted based on a number of relevant research literatures. Through comparisons of achievement outcomes between children who received the new instruction and students in other classrooms and schools, the researchers provided preliminary evidence that the intervention designed to embody this theoretical mechanism is effective. The intervention would require further development, testing, and comparisons of the kind we describe in the previous section before it could be reasonably scaled up for widespread curriculum use.
Steffe and Thompson (2000) are careful to point out that design studies and teaching experiments must be conducted scientifically. In their words:
We use experiment in “teaching experiment” in a scientific sense…. What is important is that the teaching experiments are done to test hypotheses as well as to generate them. One does not embark on the intensive work of a teaching experiment without having major research hypotheses to test (p. 277).
This genre of method and approach is a relative newcomer to the field of education research and is not nearly as accepted as many of the other methods described in this chapter. We highlight it here as an illustrative example of the creative development of new methods to embed the complex instructional settings that typify U.S. education in the research process. We echo Steffe and Thompson’s (2000) call to ensure a careful application of the scientific principles we describe in this report in the conduct of such research.9
This chapter, building on the scientific principles outlined in Chapter 3 and the features of education that influence their application in education presented in Chapter 4, illustrates that a wide range of methods can legitimately be employed in scientific education research and that some methods are better than others for particular purposes. As John Dewey put it:
We know that some methods of inquiry are better than others in just the same way in which we know that some methods of surgery, arming, road-making, navigating, or what-not are better than others. It does not follow in any of these cases that the “better” methods are ideally perfect…We ascertain how and why certain means and agencies have provided warrantably assertible conclusions, while others have not and cannot do so (Dewey, 1938, p. 104, italics in original).
The chapter also makes clear that knowledge is generated through a sequence of interrelated descriptive and causal studies, through a constant process of refining theory and knowledge. These lines of inquiry typically require a range of methods and approaches to subject theories and conjectures to scrutiny from several perspectives.
We conclude this chapter with several observations and suggestions about the current state of education research that we believe warrant attention if scientific understanding is to advance beyond its current state. We do not provide a comprehensive agenda for the nation. Rather, we wish to offer constructive guidance by pointing to issues we have identified throughout our deliberations as key to future improvements.
First, there are a number of areas in education practice and policy in which basic theoretical understanding is weak. For example, very little is known about how young children learn ratio and proportion—mathematical concepts that play a key role in developing mathematical proficiency. The study we highlight in this chapter generated an initial theoretical model that must undergo sustained development and testing. In such areas, we believe priority should be given to descriptive and theory-building studies of the sort we highlight in this chapter. Scientific description is an essential part of any scientific endeavor, and education is no different. These studies are often extremely valuable in themselves, and they also provide the critical theoretical grounding needed to conduct causal studies. We believe that attention to the development and systematic testing of theories and conjectures across multiple studies and using multiple methods—a key scientific principle that threads throughout all of the questions and designs we have discussed—is currently undervalued in education relative to other scientific fields. The physical sciences have made progress by continuously developing and testing theories; something of that nature has not been done systematically in education. And while it is not clear that grand, unifying theories exist in the social world, conceptual understanding forms the foundation for scientific understanding and progresses—as we showed in Chapter 2—through the systematic assessment and refinement of theory.
Second, while large-scale education policies and programs are constantly undertaken, we reiterate our belief that they are typically launched without an adequate evidentiary base to inform their development, implementation, or refinement over time (Campbell, 1969; President’s Committee of Advisors on Science and Technology, 1997). The “demand” for education research in general, and education program evaluation in particular, is very difficult to quantify, but we believe it tends to be low from educators, policy makers, and the public. There are encouraging signs that public attitudes toward the use of objective evidence to guide decisions are improving (e.g., statutory requirements to set aside a percentage of annual appropriations to conduct evaluations of federal programs, the Government Performance and Results Act, and common rhetoric about “evidence-based” and “research-based” policy and practice). However, we believe stronger scientific knowledge is needed about educational interventions to promote its use in decision making.
In order to generate a rich store of scientific evidence that could enhance effective decision making about education programs, it will be necessary to strengthen a few related strands of work. First, systematic study is needed about the ways that programs are implemented in diverse educational settings. We view implementation research—the genre of research that examines the ways that the structural elements of school settings interact with efforts to improve instruction—as a critical, underfunded, and underappreciated form of education research. We also believe that understanding how to “scale up” (Elmore, 1996) educational interventions that have promise in a small number of cases will depend critically on a deep understanding of how policies and practices are adopted and sustained (Rogers, 1995) in the complex U.S. education system.10
In all of this work, more knowledge is needed about causal relationships. In estimating the effects of programs, we urge the expanded use of random assignment. Randomized experiments are not perfect. Indeed, the merits of their use in education have been seriously questioned (Cronbach et al., 1980; Cronbach, 1982; Guba and Lincoln, 1981). For instance, they typically cannot test complex causal hypotheses, they may lack generalizability to other settings, and they can be expensive. However, we believe that these and other issues do not generate a compelling rationale against their use in education research and that issues related to ethical concerns, political obstacles, and other potential barriers often can be resolved. We believe that the credible objections to their use that have been raised have clarified the purposes, strengths, limitations, and uses of randomized experiments as well as other research methods in education. Establishing cause is often exceedingly important—for example, in the large-scale deployment of interventions—and the ambiguity of correlational studies or quasi-experiments can be undesirable for practical purposes.
In keeping with our arguments throughout this report, we also urge that randomized field trials be supplemented with other methods, including in-depth qualitative approaches that can illuminate important nuances, identify potential counterhypotheses, and provide additional sources of evidence for supporting causal claims in complex educational settings.
In sum, theory building and rigorous studies of implementations and interventions are two broad-based areas that we believe deserve attention. Within the framework of a comprehensive research agenda, targeting these aspects of research will build on the successes of the enterprise we highlight throughout this report.