Accumulation of Scientific Knowledge
The charge to the committee reflects the widespread perception that research in education has not produced the kind of cumulative knowledge garnered from other scientific endeavors. Perhaps even more unflattering, a related indictment leveled at the education research enterprise is that it does not generate knowledge that can inform education practice and policy. The prevailing view is that findings from education research studies are of low quality and are endlessly contested—the result of which is that no consensus emerges about anything.
We argue in Chapter 1 that this skepticism is not new. Most recently, these criticisms were expressed in proposed reauthorization legislation for the Office of Educational Research and Improvement (OERI) (H.R. 4875) and related congressional testimony and debate (Fuhrman, 2001); in the committee’s workshop with educators, researchers, and federal staff (National Research Council, 2001d); and at the 2001 annual meeting of the American Educational Research Association (Shavelson, Feuer, and Towne, 2001). U.S. Representative Michael Castle (R-DE), in a press release reporting on the subcommittee’s action on H.R. 4875, said:
Education research is broken in our country . . . and Congress must work to make it more useful. . . . Research needs to be conducted on a more scientific basis. Educators and policy makers need objective, reliable research. . . .
Is this assessment accurate? Is there any evidence that scientific research in education accumulates to provide objective, reliable results? Does knowledge from scientific education research progress as it does in the physical, life, or social sciences? To shed light on these questions, we consider how knowledge accumulates in science and provide examples of the state of scientific knowledge in several fields. In doing so, we make two central arguments in this chapter.
First, research findings in education have progressed over time and provided important insights in policy and practice. We trace the history of three productive lines of inquiry related to education as “existence proofs” to support this assertion and to convey the promise for future investments in scientific education research. What is needed is more and better scientific research of this kind on education.
Our second and related argument is that in research across the scientific disciplines and in education, the path to scientific understanding shares several common characteristics. Its advancement is choppy, pushing the boundaries of what is known by moving forward in fits and starts as methods, theories, and empirical findings evolve. The path to scientific knowledge wanders through contested terrain as researchers, as well as the policy, practice, and citizen communities, critically examine, interpret, and debate new findings, and it requires substantial investments of time and money. Through examples from inside and outside education, we show that this characterization of scientific advancement is shared across the range of scientific endeavors.
We chose the examples that appear in this chapter to illustrate these core ideas. We do not suggest that these lines of inquiry have provided definitive answers to the underlying questions they have addressed over time. As we argue in Chapter 1, science is never “finished.” Science provides a valuable source of knowledge for understanding and improving the world, but its conclusions always remain conjectural and subject to revision based on new inquiry and knowledge. As Thomas Henry Huxley once said: “The great tragedy of Science—the slaying of a beautiful hypothesis by an ugly fact” (cited in Cohn, 1989, p. 12).
Thus, the examples we highlight in this chapter show that sustained inquiry can significantly improve the certainty with which one can claim to understand something. Our descriptions necessarily convey the state of knowledge as it is understood today; to be sure, subsequent work is already under way in each area that will refine, and may overturn, current understanding. It is always difficult to assess the progress of a line of research at a given point in time; as Imre Lakatos once wrote: “. . . rationality works much slower than most people tend to think, and even then, fallibly” (1970, p. 174).
A final point of clarification is warranted. In this chapter we rely on the metaphor of “accumulating” knowledge. This imagery conveys two important notions. First, it suggests that scientific understanding coalesces, as it progresses, to make sense of systems, experiences, and phenomena. The imagery also connotes the idea that scientific inquiry builds on the work that has preceded it. The use of the word “accumulation” is not, however, intended to suggest that research proceeds along a linear path to ultimately culminate in a complete and clear picture of the focus of inquiry (e.g., education). Again, as we show through several examples, science advances understanding of various phenomena through sustained inquiry and debate among investigators in a field.
ILLUSTRATIONS OF KNOWLEDGE ACCUMULATION
In this section we provide examples of how scientific knowledge has accumulated in four areas. First, we describe the progression of scientific insight in differential gene activation, a line of inquiry in molecular biology that began 50 years ago and laid the foundation for today’s groundbreaking human genome project. Next, we trace advances in understanding how to measure and assess human performance, including educational achievement, that have evolved over more than a century. We then describe two controversial but productive lines of research in education: phonological awareness and early reading skill development, and whether and how schools and resources matter to children’s achievement.
These examples are provided to illustrate that lines of scientific inquiry in education research can generate cumulative knowledge with a degree of certainty and that they do so in ways similar to other scientific endeavors. To be sure, the nature of the work varies considerably across the examples. We address broad similarities and differences among disciplines and fields in Chapters 3 and 4. The lines of inquiry in this chapter demonstrate how knowledge is acquired through systematic scientific study.
Differential Gene Activation
The rise of molecular biology and the modern concept of the gene provides an especially clear illustration of the progression of scientific understanding in the life sciences. The earliest model of the gene was derived from Mendel’s pea plant experiments in the 1860s. Mendel concluded that these plants exhibited dominant and recessive traits that were inherited. The key concept at this stage was the trait itself, with no attempt to conceptualize the physical mechanism by which the trait was passed on from generation to generation (Derry, 1999). By the time Mendel’s work became known to the scientific world, cell biologists with newly improved microscopes had identified the threadlike structures in the nuclei of cells called chromosomes, which soon became known through experiments as the carriers of hereditary information. It was quickly recognized that some traits, eventually to be called genes, were inherited together (linked), and that the linkage was due to those genes being located on the same chromosome. Using breeding experiments with strains of various organisms, some having altered (mutated) genes, geneticists began to map various genes to their chromosomes. But there was still no conceptualization of the nature or structure of the genes themselves.
The next refinement of the model was to identify the gene as a molecular structure, which required the development of biochemical and physical techniques for working with large, complex molecules. Although other experiments at nearly the same time pointed to deoxyribonucleic acid (DNA) as carrying genetic information, the structure of DNA was not yet known. Scientists of the day were reluctant to accept the conclusion that DNA is the primary hereditary material because a molecule composed of only four base units, it was thought, could hardly store all the information about an organism’s features. Moreover, there was no mechanism known for passing such information on from one generation to the next.
It was these developments that led to the watershed discovery by Watson and Crick (1953) (and related work of a host of other scientists in the emerging field of molecular biology) of the DNA double helix and the subsequent evidence that genes are lengths of DNA composed of specific sequences of its four basic elements. The double helix structure that Watson and Crick discovered from analyzing DNA X-ray diffraction data also was crucial because it suggested a major revision to the extant model of how the molecule can replicate itself.
Genetic analysis by Francois Jacob and Jacques Monod, also in the 1950s, showed that in addition to providing the templates for constructing important proteins, some genes code regulatory proteins that can turn specific sets of genes on or off (see Alberts et al., 1997). Early work on gene regulation had suggested that when the sugar lactose is present in the nutrient medium of the common bacterium E. coli, the bacteria produce a set of enzymatic proteins that are responsible for metabolizing that sugar. If the lactose is removed from the medium, those enzymes disappear. The first evidence that led to an understanding of gene regulation was the discovery that there were mutant strains of E. coli in which those enzymes never disappeared because the bacteria were unable to shut off specific sets of genes. Previous work had shown that mutations—changes in one or more nucleotides in the gene sequence—could alter the activity of enzymatic proteins by changing the protein structure. Thus, it was first hypothesized that in these mutant E. coli strains, mutations resulted in some enzymes being changed to an “always-on” state. Again, this model was later shown to be invalid when Jacob and Monod demonstrated experimentally that these mutant bacteria were instead deficient in the proteins that served as regulators that specifically repressed (or “turned off”) those sets of genes.
Because most regulatory proteins are present in cells in minute quantities, it required more than a decade for advances in cell fractionation and purification to isolate these repressor proteins by chromatography. But once isolated, the proteins were shown to bind to specific DNA sequences, usually adjacent to the genes that they regulate. The order of nucleotide bases in these DNA sequences could then be determined by a combination of classical genetics and molecular sequencing techniques.
This is just the beginning of the story. This work has led to new knowledge in molecular biology that affects our understanding of both cell development and genetic disease. The work of countless molecular biologists in the past 50 years has resulted in the recent publication of the linear “map” of the entire human genome, from which, one day (perhaps many years in the future), all genetically influenced diseases and all developmental steps may be deduced.
Now, after half a century of publications describing these fundamental discoveries, the theoretical model of the gene can be tested in DNA microarrays the size of a postage stamp that run up to 60 million DNA/RNA (ribonucleic acid) reactions simultaneously (Gibbs, 2001). Over 1,100 disease-related genes have been discovered that are associated with almost 1,500 serious clinical disorders (Peltonen and McKusick, 2001). More than 1,000 mutations have been linked to cystic fibrosis alone, for example. Uncertainties that must be resolved by future research revolve around which of those genes or gene complexes are critical for the onset of the disease and how to correct for the errors.
Testing and Assessment
The recorded history of testing is over four millennia old (Du Bois, 1970); by the middle of the nineteenth century, written examinations were used in Europe and in the United States for such high-stakes purposes as awarding degrees, government posts, and licenses in law, teaching, and medicine. Today, the nation relies heavily on tests to assess students’ achievement, reading comprehension, motivation, self-concept, political attitudes, career aspirations, and the like. The evolution of the science of educational testing, similar in many ways to the progress in genetics, follows a long line of work in educational psychology, psychometrics, and related fields dating back to the late 1800s. We take up the evolution of testing over the past 150 years, beginning when the scientific study of tests and assessments was still in its infancy. Steady but contested and nonlinear progress has been made since the early days, often from the melding of mathematics and psychology: “Criticizing test theory . . . becomes a matter of comparing what the mathematician assumes with what the psychologist can reasonably believe about people’s responses” (Cronbach, 1989, p. 82).
This evolution can be seen in the development of three related strands of work in this field: reliability, validity, and mathematical modeling.
The notion of test reliability—the consistency of scores produced by a test—grew out of the recognition that test scores could differ from one occasion to another even when the individual being tested had not changed. The original notion of reliability was based on the simplifying assumption that a single underlying trait accounted for widely observed consistency in test performance and that variations in test scores for the same person at different times were due to an undifferentiated, constant measurement error. While the mathematics for dealing with reliability under these assumptions was straightforward (a correlation coefficient; see below), Thorndike (1949), Guttman (1953), and Cronbach (1951, 1971), among others, recognized that the assumptions did not align with what could be reasonably believed about human behavior. For example, in practice different methods of calculating a reliability coefficient defined “true score” (the consistent part of a respondent’s performance) and measurement error (the inconsistent part) somewhat differently. For instance, remembering an answer to a particular question when the same test was administered twice meant that “memory” contributed to a respondent’s consistency or true score, but not so upon taking parallel forms of the test. Moreover, Cronbach, Guttman, Thorndike, and others recognized that test performance is more complex than what a single trait could predict, and that there can be many sources of measurement error, including inconsistency due to different occasions, different test forms, different test administrations, and the like.
In the late 1800s, Edgeworth (1888) applied the theory of probability to model the uncertainty in the scores that graders assigned to essays. He estimated how many examinees who had failed to get college honors would have slipped over the “honors line” had there been different, but equally competent, graders. Krueger and Spearman (1907) introduced the term “reliability coefficient.” They used a measure similar to the correlation coefficient (a measure of the strength of the relationship between two variables) that extended Edgeworth’s ideas and provided a measure of the difference in the rankings of individuals that would occur had the assessment consisted of different but comparable test questions. The Spearman-Brown (Spearman, 1910; Brown, 1910) formula gave researchers a way to estimate the reliability of a test of a certain length without having to give both it and a “comparable version” of it to the same examinees. Kelley (1923) in an early text gave a detailed treatment of various “reliability coefficients.” Kuder and Richardson (1937) produced a more streamlined technique that also did not require obtaining the performance of the same individuals on two tests. However, in working with the Kuder-Richardson formulas, Cronbach (1989) found that at times they produced numbers that were not believable; for example, the estimated reliability was sometimes negative. In response, he (Cronbach, 1951) extended this work by providing a general formula that fit a very wide class of situations, not just dichotomously scored test questions.
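The two formulas named above can be illustrated with a minimal sketch. It uses the standard textbook forms of the Spearman-Brown formula and of Cronbach's general coefficient (commonly called alpha); the function names and the small score matrices are hypothetical and chosen only for illustration, not taken from the studies cited.

```python
from statistics import pvariance


def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor.

    length_factor = 2 means doubling the number of comparable items.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def cronbach_alpha(scores):
    """Cronbach's coefficient for a score matrix.

    scores: one row per examinee, one column per item (hypothetical data).
    With 0/1 item scores this reduces to the Kuder-Richardson formula 20.
    """
    k = len(scores[0])
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Two items that agree perfectly across examinees.
consistent = [[1, 1], [0, 0], [1, 1], [0, 0]]
# Two items that mostly disagree; the estimate falls below zero.
inconsistent = [[1, 0], [0, 1], [1, 1], [0, 1]]

print(spearman_brown(0.5, 2))
print(cronbach_alpha(consistent))
print(cronbach_alpha(inconsistent))
```

The second data set shows how an estimated reliability can come out negative when items covary negatively, the kind of result described above as not believable.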
Once easily usable formulas were available for computing measures of a test’s reliability, these measures could be used to study the factors that affect reliability. This led to improved test development and to the gradual recognition that different test uses required different measures of test reliability. In the 1960s, Cronbach, Rajaratnam, and Gleser (1963), drawing on advances in statistical theory (especially Fisher’s variance partitioning and random components of variance theory), incorporated this understanding into a framework that accounted, simultaneously, for multiple sources of measurement error. Generalizability theory (Cronbach, Gleser, Nanda and Rajaratnam, 1972) now provides a systematic analysis of the many facets that affect test score consistency and measurement error.
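As a hedged illustration of what accounting for multiple sources of error means, consider the simplest textbook case: a design in which every person $p$ answers every item $i$. The notation below follows common presentations of generalizability theory and is an assumption of this sketch, not a quotation from the works cited.

```latex
% Observed score decomposed into a grand mean and random effects
X_{pi} = \mu + \nu_p + \nu_i + \nu_{pi,e}

% Variance of observed scores partitioned into components
\sigma^2(X_{pi}) = \sigma^2_p + \sigma^2_i + \sigma^2_{pi,e}

% Generalizability coefficient for relative decisions based on n_i items
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pi,e} / n_i}
```

Adding further facets, such as occasions or raters, adds further variance components to the decomposition, which is how the framework separates sources of error that the single error term of the earlier theory lumped together.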
In a similar manner, the concept of test validity—initially conceived as the relation between test scores and later performance—has evolved as straightforward mathematical equations have given way to a growing understanding of human behavior. At first, validity was viewed as a characteristic of the test. It was then recognized that a test might be put to multiple uses and that a given test might be valid for some uses but not for others. That is, validity came to be understood as a characteristic of the interpretation and use of test scores, and not of the test itself, because the very same test (e.g., reading test) could be used to predict academic performance, estimate the level of an individual’s proficiency, and diagnose problems. Today, validity theory incorporates both test interpretation and use (e.g., intended and unintended social consequences).
While the problem of relating test results to later performance is quite old, Wissler (1901) was the first to make extensive use of the correlation coefficient, developed a decade earlier, to measure the strength of this relationship. He showed that the relationship between various physical and mental laboratory measures and grades in college was too small to have any practical predictive value. Spearman (1904b) discussed factors that distort measured correlation coefficients: these included ignoring variation in the ages of the children tested as well as other factors that affect both quantities being correlated and the correlation of quantities subject to substantial measurement error.
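The effect of measurement error on a correlation can be sketched with the standard correction for attenuation, which is usually attributed to Spearman. The function name and the numbers are hypothetical; the sketch assumes the reliabilities of both measures are known.

```python
import math


def correct_for_attenuation(r_xy, reliability_x, reliability_y):
    """Estimate the correlation between true scores.

    r_xy: observed correlation between two error-prone measures.
    reliability_x, reliability_y: reliability coefficients in (0, 1].
    """
    return r_xy / math.sqrt(reliability_x * reliability_y)


# Hypothetical values: an observed correlation of 0.40 between two
# measures with reliabilities 0.80 and 0.50.
print(correct_for_attenuation(0.40, 0.80, 0.50))
```

Because the divisor is at most 1, the corrected value is never smaller in magnitude than the observed one, which is why unreliable measures understate the underlying relationship.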
The “Army Alpha” test was developed in 1917 for use in classification and assignment during World War I. It forced the rapid development of group testing and with it the increased need for test validation that was interpreted primarily as the correlation of the test with other “outside” criteria. Gulliksen’s (1950a) work during World War II with tests used by the Navy to select engineers led him to emphasize a test’s “intrinsic content validity” as well as its correlations with other criteria in test validation studies. By 1954, the American Psychological Association recognized three forms of test validity—content, criterion-related, and construct validity.
In 1971, Cronbach put forth, and in 1993 Messick reaffirmed, the current view that validity is a property of the uses and inferences made on the basis of test scores rather than a property of a test. In this view, the establishment of the validity of an inference is a complex process that uses a variety of systematic evidence from many sources including test content, correlations with other quantities, and the consequences of the intended use of the test scores. Because of the variety of test uses and of the evidence that can be brought to bear on the validity of each use, claims for and against test validity are potentially the most contested aspects of testing in both the scientific and public policy arenas.
The mathematical models and theories underlying the analysis of tests have also evolved from modest beginnings. These models were first introduced in the beginning of the twentieth century as single-factor models to explain correlations among mental ability tests. Spearman (1904a) introduced his “one factor” model to explain the positive intercorrelations between various tests of mental abilities. (This led directly to his original definition of test reliability.) This unidimensional view of general ability (intelligence) or “g” immediately raised controversy about the nature of human abilities. Thurstone (1931), assuming a multidimensional structure of intelligence, developed more complicated multifactor analysis models, and Guilford (1967) posited no fewer than 120 factors based on three fundamental dimensions. In perhaps the most comprehensive analysis to date, Carroll (1993; see also Gustafsson and Undheim, 1996) found strong empirical support for a hierarchical model.
These mathematical models then evolved to classical reliability theory with a single underlying trait. The mathematical models developed in close conjunction with the increasingly complicated uses of tests and the more complex demands made on the inferences based on them.
Kelley (1923) gave an exposition of “true score theory” that provides precise definitions of various quantities, all called “test reliability,” and introduced his formula relating observed scores, true scores, and test reliability. This led to classical test theory, which was codified in Gulliksen (1950b). However, this theory was limited in its simple, unidimensional conception of behavior and its undifferentiated notion of measurement error (noted above). Moreover, as Lord (1952) pointed out, this test theory ignored information about the nature of the test items (e.g., difficulty) that individuals were responding to. With the advent of high-speed computing, the integration of the trait focus of test theory with information about test-item characteristics led to a major advance in scaling test scores: item response theory.
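For readers who want the relationships in symbols, the following is a sketch of the true score model and of the formula usually presented as Kelley's, written in standard modern notation (an assumption of this sketch, not Kelley's own notation). Here $X$ is an observed score, $T$ the true score, $E$ the measurement error, $\mu_X$ the group mean, and $\rho_{XX'}$ the reliability.

```latex
% Observed score as true score plus error
X = T + E

% Reliability as the share of observed-score variance due to true scores
\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X}

% Kelley's estimate of a true score: the observed score is pulled
% toward the group mean in proportion to the test's unreliability
\hat{T} = \rho_{XX'} X + (1 - \rho_{XX'}) \mu_X
```

With perfect reliability the estimate equals the observed score; with zero reliability it collapses to the group mean.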
Item response theory (IRT) (Lord, 1952), a detailed mathematical model of test performance at the test question level, developed over the next few decades, with major publications by Rasch (1960) and Lord and Novick (1968). IRT expanded quickly and is now a very active area of research. There are several important applications of IRT: development of item and test information curves for use in test development; detailed analyses of item-level data; pooling data from related assessments given to different examinees; linking scores on different tests; reporting scores on a common scale for tests, such as the National Assessment of Educational Progress, for which each examinee takes only a small portion of the whole assessment; and the creation of “adaptive tests” given on computers.
Current developments include using IRT to model the cognitive and evidentiary reasoning processes involved in answering test questions so as to improve the use of tests in diagnosis and learning (National Research Council, 2001b). The next evolution of models most likely will incorporate what are called “Bayesian inference nets” to construct appropriate theory-driven interpretations of complex test performance (Mislevy, 1996; National Research Council, 2001b).
Phonological Awareness and Early Reading Skills
A third example traces the history of inquiry into the role of phonological awareness, alphabetic knowledge, and other beginning reading skills. This research has generated converging evidence that phonological awareness is a necessary, but not sufficient, competency for understanding the meaning embedded in print, which is the ultimate goal of learning to read.
Research on the role of phonological awareness and alphabetic knowledge in beginning reading began at the Haskins Laboratories in the 1960s under the leadership of Isabelle Liberman, a psychologist and educator, and her husband, Alvin Liberman, a speech scientist. At the time, Alvin Liberman and his colleagues were interested in constructing a reading machine for the blind. They made important observations about the production and perception of speech that they hypothesized might be related to the development of reading. Most pertinent was the observation that speech is segmented phonologically, although the user of speech may not consciously recognize this segmented nature because phonological segments are merged together during speech production (A.M. Liberman, Cooper, Shankweiler, and Studdert-Kennedy, 1967). So a word like “bag,” which actually has three segments represented at a phonemic level, is heard as one sound as phonological segments are merged together in speech.
Isabelle Liberman subsequently applied these observations to reading, hypothesizing that the phonetic segments of speech that are more or less represented in print might not be readily apparent to a young child learning to read (I. Liberman, 1971). It had long been recognized that teaching the relationship of sounds and letters helped children develop word recognition capacities (Chall, 1967). What was unique about the Haskins research was the clear recognition that written language is scaffolded, or built, on oral language and that literacy is a product of long-established human capabilities for speech (A.M. Liberman, 1997). But speech is usually learned naturally without explicit instruction. In order to learn to read (and write), the relationship of speech and print (i.e., the alphabetic principle) typically must be taught since children do not naturally recognize the relationship. This principle helps explain the role of phonics—instructional practices that emphasize how spellings are related to speech sounds—in beginning reading instruction.
In a series of studies, the Libermans and their colleagues systematically evaluated these hypotheses. They demonstrated that young children were not aware of the segmented nature of speech, that this awareness developed over time, and that its development was closely linked with the development of word recognition skills (Shankweiler, 1991). They emphasized that phonological awareness is an oral language skill and is not the same as phonics. However, the research demonstrated that these capabilities are necessary, though not sufficient, for learning to read (Blachman, 2000); proficient reading comprehension requires additional linguistic and cognitive capabilities. Thus, it was necessary to integrate research on word recognition with the broader field of reading research. Children vary considerably in how easily they develop phonological awareness and grasp the alphabetic principle, which has led to controversy about how explicitly it should be taught (Stanovich, 1991; 2000).
From these origins in the 1960s and early 1970s, research on phonological awareness and reading expanded (Stanovich, 1991; 2000). In the latter part of the 1970s, struggling against a background of older theories that were behavioristic or focused on the role of perceptual factors in reading (e.g., Gibson and Levin, 1975), the view of written language as scaffolded on oral language gradually took hold—despite criticisms that the research was simplistic and reductionistic (e.g., Satz and Fletcher, 1980). In the 1980s, research expanded into areas that involved the development of phonological awareness and reading capabilities, ultimately leading to large-scale longitudinal studies showing that phonological awareness could be measured reliably in young children and that its development preceded the onset of word recognition skills (Wagner, Torgesen, and Rashotte, 1994). Other research strengthened findings concerning the critical relationship of phonological awareness skills and word recognition deficits in children, adolescents, and adults who had reading difficulties. This led directly to reconceptualizations of disorders such as dyslexia as word-level reading disabilities caused by problems developing phonological awareness and the ensuing development of another program of research to evaluate this hypothesis (Vellutino, 1979; Shaywitz, 1996).
These later findings were of great interest to people studying learning disabilities, who expressed concern about whether the findings were being applied to children in the classroom and whether they were being used to understand reading failure. In 1985, at the request of Congress, the National Institute of Child Health and Human Development (NICHD) was asked to initiate a research program on learning disabilities. This program led to research on multiple factors underlying reading disability, including research on cognitive factors, the brain, genetics, and instruction.1 Many studies varying in research questions and methods have built on and emerged from these initiatives. For example, epidemiological studies of the prevalence of reading disabilities in North America, the United Kingdom, and New Zealand showed that reading skills were normally distributed in the population. This finding was a major breakthrough because it meant that children who were poor readers were essentially in the lower part of the continuum of all readers, rather than qualitatively different from good readers (Shaywitz et al., 1992). These studies overturned prevailing notions about reading disability that reported non-normality and implied qualitative differences between good and poor readers that had led to theories specific to the poor reader; rather, these findings indicated that the same theory could be used to explain good and poor reading. The prevalence studies also showed that most poor readers had word recognition difficulties and that the prevalence of reading failure was shockingly high (Fletcher and Lyon, 1998).
These studies were pivotal for other areas of inquiry, and convergence has slowly emerged across different domains of inquiry: cognitive, genetic, brain, and ultimately, instruction. Cognitive studies explored the limits of phonological processing and word recognition in poor readers using a variety of models stemming from laboratory-based research on reading. Many methods and paradigms were used: developmental studies, information processing studies focusing on connectionistic models of the reading process, eye movement studies, psychometric studies oriented to measurement, and observational studies of teachers in classrooms—a broad approach. Genetic studies (Olson, Forsberg, Gayan and DeFries, 1999; Pennington, 1999; Olson, 1999; Grigorenko, 1999) showed that reading skills were heritable, but that heritability accounted for only 50 percent of the variability in reading skills: the remainder reflects environmental factors, including instruction. Functional brain imaging studies—possible only over the past few years—have identified neural networks that support phonological processing and word recognition. These findings have been replicated in several laboratories using different neuroimaging methods and reflect more than 20 years of research to identify reliable neural correlates of reading disability (Eden and Zeffiro, 1998).
Current and future work in reading skill development is sure to build on and refine this base. Indeed, under the leadership of several federal agencies—NICHD, Department of Education, and National Science Foundation (NSF)—instruction research has now come to the forefront of how to “scale up” education research for reading (as well as mathematics and science) for pre-kindergarten through high school (preK-12). This intervention and implementation research itself has a long history and is closely linked with other lines of inquiry. The research takes place in schools, which need to be seen as complex social organizations embedded in a larger context of communities, universities, and government. However, the origins are still in basic research, still connected with the “big idea” of the 1960s and the accumulation of knowledge since then.
This line of research evolved over 30 years, and accelerated, albeit along a jagged and contested course, when significant federal leadership and funding became available. The National Research Council (1998) and the National Reading Panel (2000), as well as Adams (1990), have summarized this body of research.
Education Resources and Student Achievement
Perhaps the most contentious area in current education research is the role of schools and resources in education outcomes. For much of the twentieth century, most policy makers and members of the public believed that increases in education resources (e.g., money, curricula, and facilities) led to improved education outcomes, such as student achievement (Cohen, Raudenbush, and Ball, in press).2 However, over the past few decades, research has shown that these direct relationships between resources and outcomes are either very weak or elusive, perhaps products of wishful or somewhat simplistic thinking.
Beginning with Coleman et al.’s (1966) Equality of Educational Opportunity (see also Jencks et al., 1972), social science research began to document the relative absence of direct schooling effects on student achievement in comparison with the effects of students’ background characteristics. It became clear that resources such as money, libraries, and curricula had, at best, a very weak effect on students’ achievement, a counterintuitive finding. Rather, students’ home background (parents’ educational and social backgrounds) had the biggest effect on their achievement.
Needless to say, the Coleman finding was controversial because it seemed to say that schools don’t make a difference in student learning. This is not exactly what Coleman and others had found (see Coleman, Hoffer, and Kilgore, 1982), but rather how it has been (mis)interpreted over time. The key finding was that school-to-school differences were not nearly as large relative to student-to-student differences as had been supposed. Moreover, most economic studies of the direct relationship between educational resources (especially money) and student outcomes have reached conclusions similar to those of Coleman et al. (1966) and Jencks et al. (1972) (see, especially, Hanushek, 1981, 1986; Hedges, Laine and Greenwald, 1994; Loeb and Page, 2000; Mosteller, 1995). As Cohen et al. (in press) explained, this was “an idea which many conservatives embraced to support arguments against liberal social policy, and which liberals rejected in an effort to retain such policies” (p. 3).
Coleman’s work spawned a great deal of research attempting to find out if “schools do matter.” An argument was made that Coleman’s notion of how schools worked (e.g., resources represented as library holdings) was too simple (e.g., Rutter, Maughan, Mortimore, Ouston, and Smith, 1979). That is, Coleman had not adequately captured either how school and classroom processes transform educational resources such as money into education outcomes or how contextual factors (e.g., local job markets in competition for college graduates) affect the direct effects of (say) teachers’ pay on student outcomes (Loeb and Page, 2000).
Cohen et al. (in press) traced several lines of inquiry that have, over time, begun to establish links between resources, transformational educational processes, and student outcomes. One line of work begun in the 1970s and 1980s compared more and less effective teachers as measured by students’ gains in achievement. Brophy and Good (1986) found, perhaps not surprisingly, that in contrast to less effective teachers, unusually effective teachers were more likely to have “planned lessons carefully, selected appropriate materials, made their goals clear to students, maintained a brisk pace in lessons, checked student work regularly, and taught material again when students had trouble learning” (Cohen et al., in press, p. 4). Another line of inquiry examined teacher-student interactions around specific content learning. These studies found that overall, time on task (time being the resource) was unrelated to students’ achievement. “Only when the nature of academic tasks was taken into account were effects on learning observed” (Cohen et al., in press, p. 5; see also Palincsar and Brown, 1984; Brown, 1992). Still another line of inquiry focused on school processes, attempting to find what made the difference between more and less effective schools (e.g., Edmonds, 1984; Stedman, 1985). The more effective schools could be distinguished from their less effective counterparts by how they translated resources into education practices. High-performing schools had faculty and staff who shared a vision of instructional purpose, who believed that all students could learn, who believed that they were responsible for helping students learn, and who committed themselves to improving students’ academic performance.
This line of teaching and schooling research, continuing today, has provided evidence that the “theory” of direct effects of educational resources on student outcomes (e.g., achievement) may be too simple. Suppose, following Cohen et al. (in press), that resources were viewed as a necessary but not sufficient condition for productive education, and educational experiences were viewed as the mechanism through which resources are transformed into student outcomes. It may be that resources do matter when translated into productive learning experiences for students. Some policy research now is opening up the “black box” of education production and examining just how resources are used to create educational learning experiences that may lead, in turn, to improved student achievement. This focus on educational experiences as a medium through which resources get translated is leading to (microlevel) work on classroom instruction.
A very different line of (macrolevel) work is focused on incentives and organizational structures of schools. This work is premised on the notion that adequately describing the complexity of classrooms and of alternative ways of stimulating student learning is beyond the current capacity of research methods. Therefore, an alternative approach is concentrating research efforts on understanding how different incentive structures affect student outcomes (Hanushek et al., 1994).
Both of these avenues of research build on existing evidence. Their divergent foci, however, illustrate how sophisticated scientific inquiry, addressing the same basic questions, can legitimately pursue different directions when the underlying phenomena are not well understood.
CONDITIONS FOR AND CHARACTERISTICS OF SCIENTIFIC KNOWLEDGE ACCUMULATION
This walk through the history of several lines of inquiry in educational research, alongside the stories of how knowledge has been integrated and synthesized in other areas, serves to highlight common conditions for, and characteristics of, the accumulation of knowledge in science-based research. The examples also show that educational research, like other sciences, often involves technical and theoretical matters of some complexity.
Certain enabling conditions must be in place for scientific knowledge to grow. The clearest condition among them is time. In each of the diverse examples we provided—in molecular biology, psychological testing, early reading skills, and school resources—the accumulation of knowledge has taken decades, and in some cases centuries, to evolve to its current state. And while we chose these examples to highlight productive lines of inquiry, the findings that we highlight in this report may be revised or even proven wrong 50 years from now.
A second condition for knowledge accumulation is fiscal support. As our example of the role of phonological awareness and early reading proficiencies in particular suggests, building the education research knowledge base and moving towards scientific consensus may take significant federal leadership and investment. The many compelling reasons for increased federal leadership and investment will be explored more fully in Chapter 6.
A final condition that facilitates this accumulation is public support for sustained scientific study. For example, the public posture toward medical research, including the mapping of the human genome and related molecular study, is fundamentally different from the posture toward education research. Citizens and their elected leaders acknowledge that investing in medical science is needed, and the funding pattern of federal agencies reflects this attitude (National Research Council, 2001c). Although difficult to measure precisely, it seems clear that by and large, the public trusts scientists to develop useful knowledge about the foundations of disease and its prevention and treatment. In contrast, in education research technical achievements are often ignored, and research findings tend to be dismissed as irrelevant or (sometimes vehemently) discredited through public advocacy campaigns when they do not comport with conventional wisdom or ideological views. Further, amid disputes about scientific quality, findings from poorly conducted studies are often used to contradict the conclusions of higher quality studies. In the social realm, people and policy makers do not tend to distinguish between scientific and political debate as they do in medical and other “hard” sciences, seriously weakening the case for such research- and evidence-based decision making. The difficulties associated with conducting randomized experiments in education are particularly problematic (Cook, 2001; Burtless, in press). The early reading example we provide is an exception in this regard: the significant and sustained congressional support beginning in the 1980s was a crucial factor in the progress of this line of work.
The nature of the progression of scientific insight across these examples also exhibits common characteristics. In all cases, the accumulation of knowledge was accomplished through fits and starts; that is, it did not move directly from naïveté to insight. Rather, the path to scientific understanding wandered over time, buffeted by research findings, methodological advances, new ideas or theories, and the political and ideological ramifications of the results. As scientists follow new paths, blind alleys are not uncommon; indeed, trying things out in the face of uncertainty is a natural and fundamental part of the scientific process (Schum, 1994; National Research Council, 2001d). Nevertheless, scientific research has a “hidden hand” (Lindblom and Cohen, 1979) that seems to lead to self-correction as debates and resolutions occur and new methods, empirical findings, or theories emerge to shed light on and change fundamental perceptions about an issue (e.g., Shavelson, 1988; Weiss, 1980).
A second characteristic of knowledge accumulation is that it is contested. Scientists are trained and employed to be skeptical observers, to ask critical questions, and to challenge knowledge claims in constructive dialogue with their peers. Indeed, we argue in subsequent chapters that it is essentially these norms, under which members of the scientific community engage in professional critique of each other’s work, that enable scientific consensus and extend the boundaries of what is known. As analytic methods for synthesizing knowledge across several studies (e.g., meta-analysis) have advanced rapidly in recent decades (Cooper and Hedges, 1994), they have enhanced the ability to make summary statements about the state-of-the-art knowledge in particular areas. These techniques are particularly useful in fields like education in which findings tend to contradict one another across studies, and thus are an increasingly important tool for discovering, testing, and explaining the diversity of these findings. The Cochrane Collaboration in medical research and the new Campbell Collaboration in social, behavioral, and educational arenas (see Box 2-1) use such methods to develop reviews that synthesize findings across studies.
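To make the idea of synthesizing findings across studies concrete, the following is a minimal sketch of one common meta-analytic calculation, a fixed-effect summary that weights each study by the inverse of its variance. It is offered only as an illustration: the function name is our own, the effect sizes and standard errors are hypothetical numbers invented for the example, and the sketch does not represent the specific procedures used by any review group named in this chapter.

```python
import math


def fixed_effect_summary(effects, std_errors):
    """Pool study effect sizes using inverse-variance weights (fixed-effect model)."""
    # More precise studies (smaller standard errors) receive larger weights.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    # Standard error of the pooled estimate under the fixed-effect assumption.
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical data: four studies whose individual findings disagree in sign and size.
effects = [0.10, 0.30, -0.05, 0.20]
std_errors = [0.10, 0.20, 0.10, 0.05]

pooled, pooled_se = fixed_effect_summary(effects, std_errors)
print(f"pooled effect = {pooled:.3f}, standard error = {pooled_se:.3f}")
```

The sketch shows how apparently contradictory findings can be combined into a single summary statement whose precision reflects all of the studies. A fixed-effect model assumes the studies estimate one common effect; when that assumption is doubtful, as it often may be in education, a random-effects model would likely be the more appropriate choice.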
Box 2-1
The international Cochrane Collaboration in health care was created in 1993 to produce systematic reviews of studies of effectiveness (http://www.cochrane.org). Since then, an electronic library of randomized trials on health interventions has been produced and made accessible; it contains over 250,000 entries. In addition, more than 1,000 systematic reviews of sets of trials have been produced to synthesize the accumulation of knowledge from different studies. These reviews cover the effect of new approaches to handling a variety of illnesses and providing health services. The main benefit of such a system in health care is that it operationalizes the idea of systematic accumulation of knowledge internationally.
In the social, behavioral, and educational fields, the international Campbell Collaboration (http://campbell.gse.upenn.edu/) was formed in 2000 to produce systematic reviews in the future. The object again is to create a mechanism for preparing, maintaining, and making accessible systematic reviews and electronic libraries of randomized and nonrandomized trials that are useful to policy makers, practitioners, and the public.
In each example, the substantial progress we feature does not imply that the research community is of one mind. In all fields, scientists debate the merits of scientific findings as they attempt to integrate individual findings with existing knowledge. In education and related social sciences, this debate is intensified because of the range of legitimate disciplinary perspectives that bear on it (see Chapter 4). This issue is aptly characterized by the proverbial description of an elephant being studied by seven blind scientists, each touching a different part of the animal. Indeed, the social sciences have been characterized by increasing specialization of method and theory, with each method-theory pair describing a different aspect of a social phenomenon (Smelser, 2001).
Another source of controversy among scientists arises out of differing views about what is possible in policy and practice. For example, some policy studies (e.g., Hanushek, 1986, 1997; also see Burtless, 1996) conclude that the indirect effects of resources on student outcomes are both small and, as policy instruments, difficult to manage. For them, establishing the direct effect of a resource (such as selecting teachers for their subject and cognitive ability) is a more manageable policy instrument that is more likely to affect student achievement than, say, difficult-to-control indirect mechanisms such as teaching practices.
The examples in this chapter demonstrate a third characteristic of scientific knowledge generation and accumulation: the interdependent and cyclic nature of empirical findings, methodological developments, and theory building. Theory and method build on one another both as a contributor to and a consequence of empirical observations and assertions about knowledge. New knowledge gained from increased precision in measurement (say) increases the accuracy of theory. An increasingly accurate theory suggests the possibility of new measurement techniques. Application of these new measurement techniques, in turn, produces new empirical evidence, and so the cycle continues. This cycle is characteristic of the natural sciences, as illustrated in our example of differential gene activation, and also evident in social science in the measurement of economic and social indicators (deNeufville, 1975; Sheldon, 1975) and education measurement (National Research Council, 2001b).
A fourth and final characteristic that emerges from these examples is a comparative one: studying humans is inherently complex. Humans are complex beings, and modeling their behavior, belief systems, actions, character traits, location in culture, and volition is intrinsically complicated. Research challenges arise in large part because social scientists lack the high degree of control over their subjects that is typical in the “hard” sciences—for example, gaggles of molecules are better behaved than a classroom of third-graders. This observation is not intended to suggest that science is incompatible with the study of the human world. Nor do we mean to say that scientific work is fundamentally different in these domains (indeed, the main message of Chapter 3 is that the core principles of science apply across all fields). Rather, scientific inquiry involving humans is qualitatively more complex than inquiry in the natural sciences, and thus scientific understanding often requires an integration of knowledge across a rich array of paradigms, schools of thought, and approaches (Smelser, 2001).
Science is an important source of knowledge for addressing social problems, but it does not stand in isolation. If we had continued our story about school resources and entered the current debate around education reform, we could readily show that the generation of scientific knowledge—particularly in social realms—does not guarantee its public adoption. Rather, scientific findings interact with differing views in practical and political arenas (Lindblom and Wodehouse, 1993; Feldman and March, 1981; Weiss, 1998b, 1999; Bane, 2001; Reimers and McGinn, 1997). The scientist discovers the basis for what is possible. The practitioner, parent, or policy maker, in turn, has to consider what is practical, affordable, desirable, and credible. While we argue that a failure to differentiate between scientific and political debate has hindered scientific progress and use, scientific work in the social realm—to a much greater extent than in physics or biology—will always take place in the context of, and be influenced by, social trends, beliefs, and norms.
Finally, we acknowledge that the degree to which knowledge has accumulated in the physical and life sciences exceeds that accumulation in the social sciences (e.g., Smelser, 2001) and far exceeds it in education. And there is clearly very hard work to be done to bring the kind of science-based research we highlight in this chapter to bear on education practice and policy. Indeed, scholars have long recognized that some aspects of human knowledge are not easily articulated (Polanyi, 1958). Some have argued that knowledge in education in particular is often tacit and less precise than in other fields (Murnane and Nelson, 1984), rendering its use in practice more difficult than in other fields (Nelson, 2000). But, above all, the examples we provide in this chapter suggest what is possible; the goal should be to build on their successes to forge additional ones.