10
Drawing Inferences for National Policy from Large-Scale Cross-National Education Surveys1

Marshall S. Smith*

This chapter focuses on three questions:

  1. What kinds of inferences may be validly drawn from large-scale, cross-national surveys about the effects on student achievement of differences among nations in the structure, content, and practice of education?

  2. What inferences from past surveys have been used to inform national policy and were they methodologically and substantively reasonable?

  3. Are there lessons from the past that might be used to improve the quality of policy considerations stimulated by results of TIMSS-R, the 1999 repeat survey of the 1995 Third International Mathematics and Science Study (TIMSS)?

Before addressing these questions, two clarifications are in order. First is the observation that “national policy” in the United States consists of two distinct parts: a formal part that is usually called “federal” policy and an “informal” part that is made up of consensus or near-consensus views of national organizations involved in education, informed media, policy analysts, state policy makers, and so forth. We are interested in both

* Marshall S. Smith is a professor of education in the School of Education at Stanford University and program director of education at the William and Flora Hewlett Foundation.







Methodological Advances in Cross-National Surveys of Educational Achievement

parts. Second, to simplify the discussion, I consider only implications for U.S. national policy and concentrate primarily on TIMSS, the most recent, ambitious, and methodologically sophisticated of the large-scale, cross-national surveys.2 At various times in the discussion, I use examples from other cross-national surveys.

The extant literature on the cross-national educational surveys is daunting. The National Academies Web site alone has more than 100 papers on TIMSS. I recommend two papers previously prepared for the Academies on topics directly related to this chapter. Haertel’s (1997) paper on what might be learned from the TIMSS data and Elmore’s (1997) on the political and policy implications of TIMSS are thoughtful and powerful. I refer to both at various points in this discussion.

CONTEXT FOR THINKING ABOUT INFERENCES FROM TIMSS FOR NATIONAL POLICY

Some Recent History on the Use of Data from Cross-National Studies for Developing Policy

Educators and policy makers in the United States have long made comparisons between U.S. schools and students and those in other nations. At times, depending on the larger social and economic context, the comparisons have had considerable influence on education in America. In the late 1800s, for example, as schools and cities grew larger, the nation adopted the Prussian approach of sorting students into grades by age. After the Russians launched Sputnik in the 1950s, the federal government hurried to design and fund the development and implementation of the “new” mathematics and hands-on science curricula.

More formalized cross-national surveys in education began with the first International Association for the Evaluation of Educational Achievement (IEA) mathematics survey in 1964. That survey, and later ones in the 1960s and early 1970s in other subject areas, carried two general messages. First, although young U.S.
students appeared to score well above the international average, U.S. students in grades beyond elementary school scored less well on the tests than did students in many other countries. Second, the surveys underscored the importance of the content of the curriculum. Carroll (1963), for example, the author of a report on the IEA Foreign Language study, used the results to argue for his “model of school learning,” which allocated a central role to curricular content. That curriculum is important—or as Carroll argued in conversation, that it was difficult to learn French in U.S. schools unless you were taught it—seems like an obvious idea in 2002. Carroll’s work legitimately might be viewed

as the precursor of the concept of opportunity to learn, a concept that has been embedded in three generations of IEA surveys and, in the late 1980s and into the 1990s, played an important role in discussions of U.S. national policy.

However, in the late 1960s and early 1970s, the effects of social and political processes on schools were taking clear precedence over curriculum content in education policy and reform discussions. With the notable exceptions of early childhood studies and scattered local efforts, the nation’s diminished thirst for changes in curriculum might be attributed both to the influence of broad social movements taking place at this time and to the political and educational backlash against the post-Sputnik reforms of the late 1950s and early 1960s.

Similarly, the findings of the early cross-national studies that U.S. students in the later grades did not score as well as students in many other nations appear to have had little independent effect on policy deliberations during the late 1960s and into the early 1970s. Again, matters that addressed equality of opportunity (including community control, the growth and quality of Title I and Head Start, bilingual education, and desegregation) were primary concerns of education policy makers during this period. In addition, however, the public and policy makers often discounted results of early cross-national studies as unconvincing. After all, “everyone knew” that other nations had a much smaller percentage of eligible students in their secondary schools and that this explained the relatively low scores of the older U.S. students. Moreover, it was not until 1974 that the public discovered that Scholastic Aptitude Test (SAT) scores had peaked almost a decade earlier, a fact that might have provided support for the cross-national findings.
During the late 1970s, cross-national surveys continued to go unnoticed in policy debates at the national level.3 National policy focused on how to improve the effectiveness of schools for poor and minority students even as SAT scores continued to drop and the progressive policies initiated in the 1960s were challenged by opposition to northern school desegregation and negative evaluations of Head Start and Title I.

The early years of the Reagan administration changed the nation’s agenda. As SAT scores continued their fall and the nation struggled through severe inflation and fears of a decline in economic prosperity, national policy concerns turned away from a focus on lower achieving students toward the conclusion that our nation’s human capital reservoir was being drained. Drawing on cross-national surveys carried out in the 1960s and early 1970s, a 1983 report released by the president found that “what was unimaginable a generation ago has begun to occur—others are matching and surpassing our educational attainments” (National Commission on Excellence in Education, 1983, p. 5). The policy recommendations of A Nation at Risk (more rigorous academic courses, longer school days and years, more homework, and overall a tougher program of study for U.S. students) were amplified by a variety of other reports put out by prestigious national organizations at roughly the same time.4 A consensus of elite analysts and business and political groups emerged, a consensus that spawned a new “national policy.” By and large the implementation of this new policy direction was initiated and carried out at the state level as governors and chief state school officers became more active in education reform. Federal-level policies changed little, and for a variety of reasons the federal government exerted little leadership until the end of the 1980s.

There is little question about the important role that cross-national survey data played in the arguments that were so badly presented in A Nation at Risk and the other reports. The reports, of course, did not rely exclusively on the international studies. Other test score and course-taking data also indicated a need to increase the academic rigor of U.S. schooling. Nonetheless, the cross-national survey results were prominently cited as showing the comparative inadequacy of U.S. education, to which the reports in turn linked the relative weakness of the U.S. economy. The now familiar argument was simple: The economic future of the United States was jeopardized because the students in countries that were our economic competitors had higher scores than U.S. students.5

When the Second International Mathematics Study (SIMS) was released in 1984, the political and educational climate was ripe for it to receive substantial attention. The two principal messages of SIMS to the U.S. public and policy makers were similar to those of earlier studies. The first, which reinforced the urgency of the emerging reforms, was that U.S.
students achieved less well than did students of many other developed nations. The second message underscored the importance of curriculum and, in middle and secondary schools, tougher courses. Both messages found fertile soil in which to flourish.

The U.S. Educational Environment

“National education policy” in the United States, whether the emphasis was on equal opportunity in the 1960s and 1970s or the push for academic rigor in the 1980s, generally has a weaker influence on school practice than does national educational policy in many other countries. One reason for this is that the United States has a uniquely organized, managed, and governed education system. Most of the other nations that participated in TIMSS, for example, have a centralized system for control over curriculum specifications, the nature of instruction, and the treatment of low-achieving students.

In the United States, political control over most formal educational policy decisions resides in 50 states and a small number of very big cities, and reforms are implemented in 14,000 districts and 95,000 schools. Elmore (1997, p. 3) labeled the “two imperatives of educational governance in the United States as ‘dispersed control’ and ‘political pluralism’” in the paper he did for a TIMSS seminar at the National Academies in 1997.

Understanding the nature and complexity of the U.S. educational system is important as we think about where and to whom policy implications from TIMSS might usefully be addressed. One simple conclusion might be that political and policy leadership can emanate from Washington, D.C., but policy making that influences schooling is typically the province of the states and some large cities. Perhaps then comparisons more profitably might be made between other nations and individual states in the United States.

Though this might be a general rule, things are a little more complex. For example, during the 1960s and 1970s, the federal government became increasingly aggressive at protecting rights and providing services for students who lived in high-poverty areas, were disabled, or otherwise needed special help. On these issues the federal government both led and contributed to making policy.

Other conditions further demonstrate nationalizing influences in U.S. education while complicating the picture more. The past half-century has seen organized national interest groups in education expand dramatically in size and authority, and the forces of communication, transportation, and mobility have helped create schools that look, and are, extraordinarily alike from state to state across the nation. This is true even though states differ in the balance of state and local control, in their methods of financing schools, and in many of their specific policies.
However, great differences remain in quality and performance among schools. Interacting with “dispersed control,” the powerful national homogenizing forces have produced a system where there is far more variation in the quality and nature of schooling among schools within districts, and among districts within states, than there is among states.

As we enter the 21st century, the homogeneity among states has extended to the intentions and strategies of their reforms. Stimulated in the early 1990s by the National Governors Association, a wide variety of national education groups, and vigorous federal leadership and money, a standards-based reform movement has swept the country. Each of the 50 states, in policy talk and action, is attempting to implement its own brand of standards-based reform.

Viewed from one perspective, the national nature of the reforms and of the problems facing the reformers provides a simpler context for drawing policy implications from cross-national surveys than has existed in the past. In effect, federal and state policies are aligned with one another and with a national consensus about a desirable policy direction. At the same time, the variation among states in the nature of their politics and their finance and governance systems and the dispersed nature of control within states mean that no one can expect a uniform national response across or even within states to any particular policy.

What Levels of Government in the United States Should Be Interested in Policy Inferences from TIMSS?

The previous discussion suggests four points about what levels of government should be interested in policy inferences based on findings from TIMSS. First, regardless of whether a policy is initiated at the federal, state, or local level, it will be interpreted and potentially implemented within the context of state standards-based reforms in each of the 50 different states. Second, even though new policies are interpreted within a common framework for reform, their implementation may vary from state to state and community to community and school to school. This suggests that the potential robustness of policies across state and local contexts should be examined. Third, some policies, such as those affecting the content and structure of the curriculum and the nature of pedagogy, are more appropriately targeted for state and local governments and schools, even though the research, dissemination, and funding for these policies may come from the federal government, foundations, or national interest groups. Regardless of who initiates a policy or new program, it ultimately will be implemented in states, districts, and schools. Finally, the United States is considerably larger and politically more complex than any other nation studied in TIMSS.
This makes direct application of lessons or policies from those other nations to the United States a questionable proposition. Consider, for example, a parallel situation within a single state in the United States. In this scenario, the superintendent of schools in Palo Alto, California, based on his experience in that affluent suburb, proposes a school reform to his fellow superintendent from Los Angeles. Likely the conversation would be polite, but the Los Angeles superintendent probably would not race home and initiate the reform. The differences between Palo Alto and Los Angeles are huge on a very large number of the important dimensions. The differences are equally large between small nations, such as Singapore or Iceland, and larger, more complex nations, such as Germany or the United States. The closer the form of educational governance and the size and structure of

the political entities, the more likely there will be easy transfer of ideas. All of the points lead us again to the proposition that policy inferences from most cross-national studies are generally more appropriate for states than for other entities in the United States. On the basis of size, complexity, and governance and fiscal responsibility, states are closer in character to many TIMSS nations than is the United States itself.

DIFFERENT KINDS OF INFERENCES

To “infer” means to reach a conclusion based on evidence. Most interesting inferences would require going beyond a simple descriptive conclusion such as “students in small classes achieve to a higher level than students in large classes.” A next step would be to use such descriptive data and perhaps other data to argue that class size has some causal relationship to student achievement. To be convincing, this latter step generally requires well-developed theory and/or a rigorous experimental design.

Assumptions Underlying Inferences

In most discussions of TIMSS, we assume a variety of things before we begin to make substantive inferences. We assume that the samples of students in the various countries are representative of the populations of students in those countries. We also assume that the TIMSS assessments are valid measures of mathematics and science content, as set out in the specifications for the assessments. Using these assumptions, we infer that the TIMSS assessments give us valid measurements of how much students in the various countries know of the specified mathematics and science. This allows us to say, for example, that compared to students in the other TIMSS nations, U.S. students achieve in math and science, as defined by the TIMSS assessments, relatively well in the fourth grade and relatively badly in the eighth grade.
However, unless we are convinced that the TIMSS mathematics and science assessments substantially measure what is taught in U.S. schools, we cannot say very much about what factors, other than curricular content, affect U.S. student achievement as measured by the TIMSS tests. For example, if teachers in U.S. schools do not teach science, or teach substantially different science than that measured by the TIMSS assessments, we do not have, from the TIMSS test data, sufficient evidence to say very much about the effects and quality of their teaching. A great deal rests on the alignment of the assessment with the curriculum. Ironically, if the assessment is not fully aligned with the curriculum in a certain country and students in that country do badly in content areas of the assessment that are “underrepresented” in their curriculum, there

is a plausible case to be made for the validity of the assessment, at least for making inferences about the effects of the curriculum on performance on the assessment. Indeed, the country might be motivated to alter its curriculum to place more emphasis on the “left-out” content area, if the country valued it.

I am not suggesting that the TIMSS assessment data are badly aligned with the curricula in U.S. schools; I know a great deal of effort was expended to ensure that the alignment was as close as possible. I also know that a variety of studies have been carried out on subdomains of science and mathematics that are represented in the TIMSS assessments and that have varying levels of alignment with typical U.S. curricula. I am simply pointing out that understanding how well the assessment is aligned with the curricula in various countries is a very important building block for making valid inferences from TIMSS data.

In addition to assumptions about the student assessment, we often have to make assumptions about the quality of other data gathered in TIMSS. I don’t know enough about these data to suggest specific concerns, but I wonder, for example, whether there are independent measures of reliability or, even better, validity of the responses to the items for any of the questionnaires. I could imagine, for example, a number of legitimate ways to answer questions about teaching and the curriculum and school resources in the United States.

We already know that the assumption of a representative sample is violated for substantial numbers of countries, particularly at the Population 3 level (the school-leaving grade, grade 12 in the United States). For example, 16 of the 21 countries that participated in the Population 3 Upper Grade Mathematics Literacy assessment were in violation of the international sampling guidelines; one of the 16 was the United States.
Even though there were substantial numbers of countries in violation in the other Populations (nine of 26 in Population 1 Upper Grade Science and 17 of 41 in Population 2 Upper Grade Science), the proportion of violators in Population 3 stands out. This fact, along with my own intuition about the lack of motivation of 12th-grade students on this assessment, makes me skeptical of the results of the Population 3 assessments. In the following discussion, I will focus only on Populations 1 and 2 (fourth and eighth grades).

Types of Inferences

Before we look at some examples of inferences from TIMSS, let’s distinguish among three types of inferences: causal, “weak,” and “synthetic.”

Causal Inferences

For all sorts of reasons, the data from TIMSS and other cross-national surveys, combined with our lack of strong theoretical models, will not support causal inferences. For example, using only the TIMSS data, when analyzed with the most powerful statistical techniques, we cannot validly assert that differences in the pedagogy used by teachers in different nations result in differences in student achievement. We do not have adequate measures of differences in pedagogy nor a theory that is robust enough to account for the variation in the contexts of different nations. In economists’ terms, we do not know enough about the production function. A very strong correlation does not overcome this weakness. Haertel said it well in 1997, shortly after TIMSS was released:

These (TIMSS) descriptive data may suggest causal connections, but it is important to remember that TIMSS data alone cannot support any claims about causal relationships among variables. The rhetoric of “natural experiments” is seductive. It conjures up an image of the world as a great laboratory, with different countries trying alternative educational approaches and TIMSS as the common examination to see which approach worked best. But TIMSS is a comparative observational study, not an experiment. It involved the collection of data from scientific samples chosen to represent pre-existing populations, but those populations, the students of different nations, are not interchangeable. No group can be randomly assigned to any other nation’s methods of child rearing or education. Students’ experiences both in school and out of school vary in countless ways that have not been captured by the TIMSS study, and no statistical method can be relied upon to disentangle those innumerable influences unambiguously, nor to accurately quantify the effect on achievement of any one variable in isolation from all the others. (p.
2)

“Weak” Inferences

We are on sturdier ground when talking about TIMSS findings if we a) use qualifying language such as “supports the hypothesis that”; b) are able to support our inferences with similar findings from other research; and c) ground the findings and hypotheses in well-tested theory. A “weak” inference suggests a direction, indicates a hypothesis, or supports the elimination of a hypothesis. Admittedly, there are different grades of “weakness” in inferences from cross-national survey data. We might argue, for example, that a strong case for a causal inference would be presented by a cross-national panel study where a priori hypotheses about the effects of certain interventions were tested rigorously with appropriate methodological models and found to have strong effects. Unfortunately, our data, theory, and analyses of cross-national surveys, to date, are far from this optimal situation. Nonetheless, the findings would still cry out for further investigation and replication. The differences among the educational systems and cultures of nations are far too great to control away statistically. Every “inference” that may be made legitimately from analyses of the TIMSS data is a “weak” inference.

“Synthetic” Inferences

This kind of inference is a special case of a weak inference. A synthetic inference is the most powerful for policy purposes and the most speculative. It involves piecing together a story line (a “script” in Elmore’s language) that integrates a variety of findings into a compelling and coherent picture. It typically draws on a variety of sources of information within a large study. It also may include other supporting data. Unlike a formal model, it often exists without precise parameters. Moreover, there is often no easy way to test the validity of the “story.”

The “story line” from the presidential campaign in 2000 involved a complex set of inferences from a wide body of research that had convinced the candidates of the importance of testing, and of student and institutional accountability for improving educational achievement. An alternative “story line” often told by Jonathan Kozol, among others, rests on studies that support the position that providing more resources to needy schools is a viable strategy for improving achievement. A third story line grew out of the TIMSS data and will be considered in the next section.

FOUR EXAMPLES OF INFERENCES FROM THE TIMSS STUDY

Now, let’s take a look at four sets of findings or nonfindings from TIMSS and consider for each whether inferences about national policy legitimately might be made and whether such inferences, in fact, were made.
The possible inferences in the first three sets of findings all fit the “weak” inference label; the last example fits in the category of “synthetic” inference.

• A nonfinding: no single-variable “magic bullets.” In the original TIMSS reports, there are no outstanding examples of single variables found to have special power to explain differences among countries in level of student achievement.6 Differences in class size, the kind of governance system, and per-pupil expenditure, for example, are all viewed as operating within specific national contexts. As far as I know,

these (non) findings received little attention at the national level. As a result, we were able to avoid the loose and inaccurate inference that because such factors did not have explanatory power among nations, they were not important within countries.

Suppose, however, that the TIMSS reports or reanalyses reported single-variable relationships at the national level between school resources and student achievement. Imagine, then, that class size had statistically differentiated among high- and low-scoring nations in TIMSS. In Washington in 1997-98, Democrats in the Clinton administration and in Congress were attempting to pass a bill to reduce class size in the early grades. They were basing their position on the generally positive findings of the Tennessee experiment on class size. In that environment, if class size had been a powerful predictor of student achievement across nations, it would have been very difficult for Democrats to resist making an inference about national policy from the TIMSS data.

I have two points to make here. First, it is to their credit that the original TIMSS analysts did not attempt to create single-variable relationships and make something of them. It would have been easy for them to search for and erroneously find “statistically significant” relationships; after all, they were dealing with only 40+ degrees of freedom when they were using country analyses, and they could have tried out literally hundreds of different variables. Second, single-variable and other seemingly simple relationships derived from survey data are relatively easy for policy makers to understand and discuss.
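The first of these two points, about searching many variables with only 40+ degrees of freedom, can be illustrated with a small simulation. The sketch below is illustrative only: the 41 countries, the 200 candidate variables, the 5 percent significance level, the approximate critical correlation of 0.308, and all function names are assumptions introduced here, not details of any TIMSS analysis. Every predictor is pure random noise, so any “significant” correlation it reports is spurious by construction.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(n_countries=41, n_variables=200, r_crit=0.308, seed=0):
    """Count noise predictors whose |r| with country scores exceeds r_crit.

    r_crit = 0.308 is an assumed lookup value: roughly the two-sided
    5 percent critical value of r for 39 degrees of freedom (41 countries).
    """
    rng = random.Random(seed)
    # Hypothetical country-level achievement scores (standardized noise).
    scores = [rng.gauss(0, 1) for _ in range(n_countries)]
    hits = 0
    for _ in range(n_variables):
        # A candidate "explanatory" variable with no real relation to scores.
        predictor = [rng.gauss(0, 1) for _ in range(n_countries)]
        if abs(pearson_r(predictor, scores)) > r_crit:
            hits += 1
    return hits


def expected_false_positives(n_variables, alpha=0.05):
    """Expected number of spurious 'significant' results under the null."""
    return n_variables * alpha


def prob_at_least_one(n_variables, alpha=0.05):
    """Chance of at least one spurious result among independent tests."""
    return 1 - (1 - alpha) ** n_variables


if __name__ == "__main__":
    print("spurious hits in one simulated search:", count_spurious())
    print("expected spurious hits:", expected_false_positives(200))
    print("P(at least one spurious hit):", prob_at_least_one(200))
```

Under these assumptions, about 10 of the 200 noise variables would be expected to clear the significance threshold in a typical search, and the chance that at least one does so is above 99.99 percent.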
Even if informed analysts had cautioned against interpreting any cross-national relationship involving class size because of the tremendous differences among nations in the reasons why classes are relatively large or small, the impulse to use the finding to support the class size legislation would have been very powerful.

• Benchmarking and “existence proofs.” The fact that students in the “First in the World Consortium”7 achieve at a level that is competitive with the highest scoring nations supported many policy makers in making the inference that U.S. students, in general, could succeed at far more challenging levels. Just as Jaime Escalante and District 2 in New York City are “existence proofs,” so is the Consortium. Even though Consortium students, on average, are more advantaged than most U.S. students, the Consortium results have been seen to demonstrate that U.S. schools can be as effective as schools in the highest scoring nations. Given the right opportunities, the argument goes, all of our students can attain far higher standards of academic achievement. The belief supported by this inference became national policy in 1983 in A Nation at Risk, a policy that was reiterated by the governors and

This finding was not a surprise to aficionados of cross-national surveys of achievement. Earlier international comparative studies in mathematics, science, and reading showed similar patterns. For example, the Second International Mathematics Study (SIMS) received national attention because of a similar conclusion. Analyses of SIMS data indicated that the U.S. curriculum in the middle grades, and especially in eighth grade, was not as challenging as the curricula in other nations. Recall that, in SIMS, pre- and post-eighth-grade achievement data were collected, so it was possible to explore differences in achievement gains among and within countries during the eighth grade. The cross-national comparison in SIMS showed that, on average, U.S. eighth graders gained less during the year than did eighth graders in other countries.

The more interesting analyses, however, used only data from the U.S. sample. SIMS collected sufficient curriculum information within the United States to be able to categorize different eighth-grade mathematics classes into four groups (remedial, general, advanced, and algebra). Moreover, there was considerable variation in the scores of entering algebra students. This allowed a within-country comparison of the gains of students within the four different types of eighth-grade classes. It turned out that after controlling for prior (pretest) achievement, students gained very little from the “remedial” experience, a little more from the general curriculum, and a substantial amount from the advanced and algebra courses.

This particular analysis of eighth-grade curricular differences was an entirely within-country study; it need not have been carried out under the IEA umbrella. However, in conjunction with the cross-national comparative finding that U.S.
eighth-grade students were lagging behind students in other nations, the within-country finding gained credibility and lent support to the policy argument that U.S. students should take more challenging courses in general, and eighth-grade algebra in particular. During the middle and late 1980s and into the 1990s, this policy argument gained backing from a wide variety of state and local educators and policy makers as well as organizations such as the College Board. Governors and federal policy makers urged local districts to make algebra available to larger numbers of students in the eighth and ninth grades. Although this erstwhile national policy was commendable, it also had its critics, some of whom were scholars of international studies. It turns out that many of the countries whose students do well in mathematics in general and algebra in particular in eighth-grade international assessments actually prepare students in algebra over an extended number of years. Algebraic concepts are introduced as early as third or fourth grade and extended upon over the next few years. Instead of the content of algebra being introduced and taught largely within a specific grade, it is

OCR for page 295
Methodological Advances in Cross-National Surveys of Educational Achievement carefully integrated with other mathematics, as the child grows more able to learn and understand it. In other words, the inference that students should take an algebra course in eighth grade did not take into account the more extensive differences in mathematics curricula between nations. Part of the reason that the U.S. policy developed in the way that it did is that many of the earlier grade teachers are not prepared to teach algebra content. Thus, to policy makers, there really was no alternative but to introduce it as a single-grade subject. Besides, I can hear them arguing, there are no studies that indicate that integration works better and algebra has always been a single-grade offering in the United States—this new policy just extends the opportunity to more students. The upshot of this is that a policy at least partly based on international findings was altered to fit the U.S. circumstances. • U.S. students in eighth and 12th grades score well below our nation’s competitors in mathematics and science achievement. A plausible reason for this is that compared to other nations, U.S. curricula in mathematics and science lack focus, rigor, and coherence, and U.S. teaching emphasizes fragmented and disconnected bits of knowledge, rather than deep mastery of ideas.8 It is not hard to leap from this “plausible reason” to at least suggest the inference that the United States needs a policy that leads to improving the focus, rigor, and coherence of mathematics and science curricula and that supports teaching that emphasizes mastery of ideas, rather than fragmented facts. The data supporting this “synthetic” inference are drawn from a number of separate places throughout TIMSS. 
The achievement data are taken from the assessments, the curriculum data from the achievement survey questionnaire and a separate data bank and analyses of the curricula in many nations, and the teaching data from questionnaires and videotapes. Over the past few years, the inferential leap that this is what TIMSS shows has been publicly made or implied hundreds of times by many dozens of thoughtful people. In my former role as Under-Secretary and Acting Deputy Secretary of the U.S. Department of Education for most of the Clinton administration, I am one of those people; it was easy to do because it made a good story. The temptation to simplify complex results and to frame their implications into a story or capture them with a single example is very powerful, especially when talking to a lay audience or the press, which, of course, is where we should be most careful. Even when told with responsible caveats about cross-national survey data not supporting causal inferences, a story of this sort often leaves the audience with a clear impression that such inferences are justified.

This story has had a very successful run on the policy stage. There are at least three key reasons for this success that have to do with the nature and content of the story. In addition, the TIMSS rollout and the sustained effort at disseminating information also appeared to have an effect. First, the three key reasons:

• The political stars were aligned. The story was easily seen as strongly supporting the goals of state standards-based reform, a policy advocated by the Clinton administration and governors across the nation. During the first few years of the Clinton administration, education policy focused specifically on higher academic standards, challenging curricula aligned with the standards, and better training for teachers that would enable them to effectively teach the curricula. The key words were coherence and rigor, words that directly resonated with the story of the TIMSS results. The administration saw this and deliberately set out to use these results to support its policies.

• The story is simple and plausible. It was supported with pictures communicated through easy-to-understand anecdotes and powerful videotapes of teachers from the United States, Japan, and Germany. The tapes showed striking differences between U.S. and Japanese teachers, differences that could be seen as helping to explain why Japanese students scored much higher on TIMSS assessments than U.S. students. Moreover, a widely available book, A Splintered Vision, authored by Schmidt, the principal U.S. TIMSS researcher, along with two colleagues (Schmidt, McKnight, & Raizen, 1997), presents a heavily documented version of the story.

• The TIMSS data collection and analyses were conducted in a highly professional manner, and data from sources outside of TIMSS tended to corroborate at least parts of the story.

These three reasons created a rich environment for promoting the story. The alignment with current policy induced President Clinton and Secretary of Education Riley to be involved in releasing the data. Both saw the basic TIMSS data releases as well as the release of the data for the First in the World Consortium as powerful opportunities to support administration policies. The exposure for the administration was enhanced by the fact that different parts of the study were released over a six-month period, offering a number of opportunities for administration officials to comment to the press on the studies and their relationship to current policy.

Political and national interest group leaders amplified comments by the president and the secretary. Many education and business groups disseminated summaries of TIMSS findings and “implications” to their members. The telling and dissemination of the story were enhanced by the videos of teachers from three nations, toolkits for teacher professional development, Web sites dedicated to TIMSS, extensive efforts by professional education groups, a public relations firm, and a tireless campaign by Schmidt. Finally, the implicit and explicit imprimaturs of the National Center for Education Statistics (NCES) and BICSE at the National Research Council attested to the quality of the TIMSS work, and corroborating data from earlier cross-national studies and other research helped to validate the story.

TENTATIVE CONCLUSIONS ABOUT THE IMPACTS OF EARLY CROSS-NATIONAL SURVEYS AND TIMSS ON POLICY

What can we conclude from all of this? I have addressed two of the questions posed at the beginning of this chapter. The first had to do with the kinds of inferences that may legitimately be drawn from cross-national survey data. I argued against the possibility of drawing causal inferences and for a careful examination of context and corroborating data from other sources in drawing “weak,” noncausal policy inferences. I also argued that the politically most powerful “weak” inference is a “synthetic” inference made up of a number of parts that together tell a plausible story. The TIMSS data provide a powerful example of a “synthetic” inference.

The second question had to do with past cross-national surveys and whether the findings from these surveys have been used to inform national policy. I argued that the early survey findings initially had little effect beyond raising issues such as the fact that U.S. secondary school students had lower test scores than students in many other nations and the importance of content and students’ opportunity to learn.
However, I also argued that the early cross-national studies were quite influential 10 to 20 years after they were released in helping to establish the policy directions for the rash of reports that came out in the early 1980s, including A Nation at Risk. I believe it is safe to say that the earlier reports stimulated a great deal of “policy talk” at the national level, which in time supported policy changes at state and local levels.

In this section I continue to consider the impact of TIMSS. One thing we are certain of is that the findings of TIMSS were widely disseminated, discussed, and used in policy circles. The results of TIMSS were a topic of conversation and a source of information for officials at all levels of the Clinton administration, in Congress, in numerous state houses and state departments of education, and throughout the extended national policy community. Experience with other international assessments suggests that timing is critical: findings typically will be used when the political and policy climate is ready. In this case the climate was ready.

A further question is whether TIMSS has only supported existing policy or has had a substantial independent impact on policy. That is, would state and/or federal policy have been different if there had been no Third International Mathematics and Science Study? We do know that 14 districts or clusters of districts, including the First in the World Consortium, the Miami-Dade County Public Schools, and the Smart Consortium of districts in Ohio, as well as 13 states, opted to be part of TIMSS-R and took various administrative and policy steps to help students prepare for it.9 We might infer from this that TIMSS and the advent of TIMSS-R have influenced policy formulation and development in a substantial number of states and communities around the nation. In these places TIMSS seems to have had more than a supportive and marginal effect. I hope studies are conducted of whether, how, and why these districts and states have taken the step of using TIMSS-R as a lever for reform.

It may be too early to tell whether TIMSS has had a true independent effect at the federal level. At this point I think the weight of the evidence is that there were no independent effects. As far as I can tell, both from my own experience and from the literature, federal policies were not directly affected by the TIMSS results. In other words, the relevant education policies of the federal government (including the legislative and administrative branches) were not changed because of the results of TIMSS. However, without question the TIMSS results reinforced the existing policies.
There is an irony here, for often a study will be widely discussed and disseminated if it broadly supports current policy, in which case it is unlikely to have a powerful independent influence on policy. On the other hand, a study that does not support existing policy may not be as widely disseminated. In this instance, we may need to wait some time to see the effects of the study, as in the cases of the international comparative studies carried out in the 1960s and 1970s.

ISSUES AND QUESTIONS THAT BICSE MIGHT ADDRESS THAT DRAW FROM OUR EXPERIENCES WITH TIMSS AND/OR TIMSS-R

This section addresses the last question I posed at the beginning of this chapter. Are there lessons from the past that might be used to improve the quality of inferences that might be made in the future from TIMSS, TIMSS-R, and their successors? As I thought about this issue, I broadened the section to include issues and ideas that might be part of BICSE’s agenda as it works to make cross-national studies more robust, valid, and useful. These thoughts are intended to be provocative.

1. There is no way to fully prevent either policy makers from making “magic bullet” or other erroneous causal inferences from international data or researchers from promoting them. But BICSE might do a few things to reduce the chance that mischief results from either a sloppy original report or a reanalysis of the cross-national data. One might be to put out some voluntary standards for the use and interpretation of cross-national survey data. The standards might cover topics such as the usefulness of a priori theory and approaches for adjusting degrees of freedom when many equations or comparisons are explored and not reported.

2. This chapter suggests that states have been important consumers of information about TIMSS, perhaps the most important. A substantial number of states administered TIMSS-R and the results were reported in the spring of 2001. The fact that a state deliberately signed up for TIMSS-R indicates that it might pay more attention to the results than it would if the results were based on a national sample. Could BICSE suggest guidelines and develop useful benchmarks for state policy makers as they interpret the results of the national data and of the data on their own states? Are there ways to think about the appropriateness of various policy inferences suggested by TIMSS-R for different state environments?

3. What kinds of national inferences will be drawn when the nation shows only modest gains at best when the TIMSS-R (1999) and TIMSS (1995) scores are compared?
Perhaps BICSE could help policy makers, the press, and the public better understand what magnitude of student achievement gains is possible or likely over certain time periods. How fast can a system (teachers, curriculum, etc.) change to the point where it is having a substantially increased effect? Can we benchmark against average gains of other countries? What are expected gains and extraordinary gains?

These questions were crafted before TIMSS-R was released. After the release, we know a little more about how large a gain countries might expect to achieve over four years. The average gain for a country participating in both the TIMSS and TIMSS-R eighth-grade assessments in mathematics was two points. The U.S. score for 1999 was nine points higher than the U.S. score for 1995. Unfortunately, this difference was not statistically significant, so it shows up in the report as “no gain.” Of interest is that another nation also had a nine-point gain that was found to be statistically significant. This suggests that the United States was close to showing a statistically significant result.

Also of interest is that a subgroup of U.S. students, African-Americans, did show a statistically significant gain in eighth-grade mathematics. U.S. students showed no statistical or even suggestive gain in science achievement.10

With respect to what effect TIMSS-R has had, I must admit I know little. However, my sense is that it had little impact in Washington, D.C. I wonder, for example, how many readers of this chapter knew that African-American students made a statistically significant gain on the TIMSS test of mathematics between 1995 and 1999. I suspect that state and local press were far more attuned to the TIMSS-R state results when they were released in late spring than the Washington, D.C. press was to the release of the national TIMSS-R results.

4. I understand that in TIMSS-R we will have videos of teaching from seven or so countries. Suppose that one of the countries has both very high eighth-grade mathematics scores and a way of teaching substantially different from that of the Japanese. What will be a reasonable inference about differences in the effectiveness of various pedagogies? Can BICSE help with this? My own suspicion is that we will find that coherent, challenging instruction carried out well by teachers who understand the content will explain differences in achievement far better than differences in styles of instruction.

5. State standards-based reform in the United States has reached early adolescence. We are facing difficult implementation issues and possible mid-course corrections to the reforms in various states. Are there lessons that might be drawn from TIMSS that would inform state or national policy as we work through this period? Are there other TIMSS-R countries going through major reforms? How are they doing? Cross-national studies offer insights into major differences in the policies of different nations; can they offer insights into implementation problems? States might contrast themselves with smaller countries that participate in TIMSS-R. It also might be worth focusing on the experiences at the federal and state levels of other nations that have a federal form of government. Australia, Canada, and Germany, for example, have states or provinces that operate as semiautonomous entities in much the same way that our states operate.

6. Is it possible for BICSE to consider some side studies that might help to inform the interpretation of possible policy inferences? For example, what are the effects of jukus on the test taking of eighth graders in Japan and Korea? Does the curriculum of a juku in a particular country reinforce the curriculum in the country’s schools? Or does the juku experience provide students with knowledge and skills that they do not receive in school? Perhaps when we compare the schooling experiences of U.S. students with students in countries where there is a high use of jukus, we should include the juku experience in the analysis.

Another issue for BICSE to examine is what we know about context. What makes a policy robust enough to travel, to be appropriate across a wide variety of contexts? What makes a context amenable to change? Can we imagine a theory that would inform us about what kinds of ideas travel to and from countries that are very different in culture and experience?

7. Finally, a big question. The U.S. eighth and 12th graders who did so badly in SIMS in the early 1980s are now fueling the dot-com revolution. The eighth and 12th graders who did badly in the international mathematics and science studies in the 1960s and 1970s led the United States out of the traditional ways of doing business in the late 1980s and early 1990s. I was recently in Japan and China. Both countries are examining their educational systems to try to understand how to stimulate creativity in their students. Is it possible that the cross-national studies are leading the United States astray by measuring the wrong things? Is measuring success in student learning of the academic content of schools the same thing as measuring potential human capital? The Secretary’s Commission on Achieving Necessary Skills (SCANS), established in the early 1990s, proposed five competencies, in addition to basic and advanced academic skills, as necessary for people to have when they enter the modern workplace.11 The five competencies include interpersonal skills, knowledge of systems, and the use of technology and information. Considerable evidence exists that motivation and the skill of working in groups have powerful effects on the quality of a person’s work.
Perhaps the SCANS commission and the Japanese and Chinese are on the right track in worrying about creating learning environments that foster skills and characteristics such as creativity and interpersonal skills as well as academic learning. These are not new questions but they continue to deserve attention.

NOTES

1. I gratefully acknowledge the thoughtful comments of Jennifer A. O’Day on drafts of this chapter and of Catherine Sousa and Tricia Tupano for their help in pulling together information that helped me to formulate the issues contained in the chapter. I also appreciate the time and energy that the Board on International Comparative Studies in Education (BICSE) members and staff spent on reviewing drafts of this chapter.

2. For a review of some of the “effects” of TIMSS on policies in other countries, see Macnab (2000). See also Lew and Kim (2000). Lew’s e-mail address is hclew@cc.knuje.ac.kr. For other studies, contact the American Institutes for Research, in Palo Alto, CA and Washington, D.C., for its extensive bibliography; visit the National Academies Web site (www.nationalacademies.org); and visit the Consortium for Policy Research in Education Web site (www.cpre.org) or write the Consortium at the University of Pennsylvania, School of Education, Philadelphia, PA. Also visit the TIMSS Web site at Michigan State University, E. Lansing, Michigan (http://ustimss.msu.edu).

3. I served in the Carter administration for four years in two formal education policy development roles as a political appointee. I can recall no instance where the cross-national studies were important sources of policy information.

4. A Nation at Risk op cit.; Education Commission of the States, Task Force on Education for Economic Growth, Action for Excellence, Denver: 1983; Committee for Economic Development: Research and Policy Committee, Investing in Our Children: Business and the Public Schools; Committee for Economic Development, 1985; National Science Board, Commission on Precollege Education in Mathematics, Science and Technology, Educating Americans for the 21st Century, Washington, D.C.: National Science Foundation, 1983.

5. It should come as little surprise that there were substantial weaknesses in the inferences and policy conclusions drawn from analyses of test scores (including scores from the cross-cultural assessments). The stories of two sets of inferences are instructive. Only the second study refers to international comparative data. I quote from a report of a talk about the national commission reports that I gave in 1984 in the U.S. Capitol Building in Washington to a seminar organized by the Federation of Behavioral, Social and Cognitive Scientists.

a. “Test Scores of College-Bound Youth: Several reports mention the decline in the number of students who scored over a certain level on the Scholastic Aptitude Test (SAT). . . . There has been a fairly dramatic decline in the number and percentage of people that score above that level (650 and 700), particularly on the verbal test. This . . . may represent a decline in the capacity of our most able students.” One response to this in the reports was to argue for increased rigor in high school courses, particularly in mathematics and science. “However, . . . since 1975 the College Board test scores for advanced mathematics, chemistry and physics have gone up slightly, while at the same time more students are taking the tests.” These latter data were not mentioned in the reports.

b. Comparisons with other countries: “The IEA mathematics test cited in A Nation at Risk was administered in 1964. The other IEA tests used as examples by the commissions were administered between 1967 and 1972. An interesting irony is associated in the commissions’ use of data collected in 1964. There is a continual theme through the reports that suggests that the nation’s schools ought to return to the way they were in the 1950s and early 1960s when there was more discipline and control, before students were coddled and before lots of electives were offered. Of course, the 1964 IEA tests of 13- and 17-year-olds were administered to students who went through a major part of their schooling in the 1950s.”

6. However, there may be a tendency in some of the “production function”-like reanalyses of the TIMSS data to highlight single variables, a tendency that should be watched.

7. See Dunson (2000) for a description of the First in the World Consortium.

8. See Elmore (1997) and Dunson (2000).

9. See Dunson (2000).

10. See http://nces.ed.gov/timss/timss-r/highlights.asp.

11. See http://wdr.doleta.gov/SCANS/whatwork/whatwork.html.

REFERENCES

Carroll, J. (1963). A model for school learning. Teachers College Record, 64, 723-733.

Committee for Economic Development. (1985). Investing in our children: Business and the public schools. Research and Policy Committee. NY: Author.

Dunson, M. A. (2000). From research to practice and back again: TIMSS as a tool for educational improvement. Consortium for Policy Research in Education Policy Brief RB-30, Graduate School of Education, University of Pennsylvania.

Education Commission of the States. (1983). Action for excellence. Task Force on Education for Economic Growth. Denver: Author.

Elmore, R. F. (1997, February). Education policy and practice in the aftermath of TIMSS. Paper prepared for Learning from TIMSS: An NRC Symposium on the Results of the Third International Mathematics and Science Study, convened by the Board on International Comparative Studies in Education, National Research Council, National Academy of Sciences, Washington, D.C.

Haertel, E. H. (1997, February). Exploring and explaining U.S. TIMSS performance. Paper prepared for Learning from TIMSS: An NRC Symposium on the Results of the Third International Mathematics and Science Study, convened by the Board on International Comparative Studies in Education, National Research Council, National Academy of Sciences, Washington, D.C.

Lew, H.-C., & Kim, O.-K. (2000, April). What is happening in Korea after TIMSS? A rough road for remodeling math classes. Paper presented at the 78th Annual Meeting of the National Council of Teachers of Mathematics, Chicago.

Macnab, D. S. (2000). Forces for change in mathematics education: The case of TIMSS. Educational Policy Analysis Archives, 8(15).

National Commission on Excellence in Education. (1983). A nation at risk: The imperative for educational reform. Washington, D.C.: U.S. Government Printing Office.

National Science Board. (1983). Educating Americans for the 21st century. Commission on Precollege Education in Mathematics, Science and Technology. Washington, D.C.: National Science Foundation.

Schmidt, W., McKnight, C. C., & Raizen, S. (1997). A splintered vision of U.S. science and mathematics education. Dordrecht, Netherlands: Kluwer Academic.
