Progress and Challenges for Large-Scale Studies
Andrew C. Porter and Adam Gamoran*
“Poor Scores by U.S. Students Lead to 10-State Math Effort,” “U.S. Seniors Near Bottom in World Test,” “A World-Class Education Eludes Many in the U.S.” Headlines like these have splashed across the pages of major newspapers in the United States with increasing frequency in recent years. Although international studies of student achievement have attracted positive attention, they have also drawn critics. How useful are these international comparisons? A crucial issue in judging the value of international studies is the quality of their methodologies. A symposium of leading experts on the methodology of large-scale international education surveys, organized by the Board on International Comparative Studies in Education (BICSE) in November 2000, addressed the following questions:
What is the methodological quality of the most recent international surveys of student achievement? How authoritative are their results?
Has the methodological quality of international achievement studies improved over the past 40 years?
What are promising opportunities for further improvement?
The chapters in this volume are products of that symposium, and they answer these questions. Readers will learn that, overall, the quality of international achievement studies is high, and according to these experts, the results can be taken as authoritative. Four decades of experience with large-scale cross-national surveys have led to substantial improvements in methodology, including better tests, better samples, better documentation, and better statistical analyses. At the same time, substantial challenges remain, including the need for a better appreciation of the differing social and cultural contexts in which education takes place in different nations, and of how those contextual differences may be reflected in the results of achievement tests.

Andrew C. Porter is the former chair of BICSE. Adam Gamoran is a current member of BICSE. Biographical sketches of both can be found in Appendix B.
The symposium had a considerable body of experience from which to draw. Since 1960, the United States has participated in 15 large-scale cross-national education surveys: 13 conducted by the International Association for the Evaluation of Educational Achievement (IEA) and two by the International Assessment of Educational Progress (IAEP). In more recent years, the Organization for Economic Cooperation and Development (OECD) also has become involved, creating a Program for International Student Assessment (PISA) that will survey achievement of 15-year-olds. The most frequently assessed subjects have been science and mathematics, though reading comprehension, geography, nonverbal reasoning, literature, French, English as a foreign language, civic education, history, computers in education, primary education, and second-language acquisition all have been assessed. Several of the international studies have included survey questionnaires that supplemented the achievement tests. Most commonly, the surveys were addressed to students, but teachers and principals also have been queried. A few studies have had case study components, and most recently, one study (the Third International Mathematics and Science Study [TIMSS] and its repeat, TIMSS-R) had a large-scale classroom video component. The frequency, number, and complexity of these international studies of student achievement have increased in recent years.
PURPOSES OF INTERNATIONAL COMPARATIVE STUDIES OF ACHIEVEMENT
Before assessing the methodology of international comparative studies, it is important to clarify their purposes. Previous writers have offered a variety of reasons why international comparative studies of student achievement are useful (Beaton, Postlethwaite, Ross, Spearritt, & Wolf, 1999; National Research Council, 1990, 1993; Postlethwaite, 1999). The most powerful and widely agreed upon of these is that education in one country can be better understood in comparison to education in other countries. One piece of this argument could be called benchmarking: For example, how does the achievement of U.S. students compare to the
achievement of students in other countries? Do some countries stand as existence proofs for the possibility of higher levels of achievement? Another piece of this argument concerns hypothesis generation: By studying education in other countries, alternative approaches to teaching and learning may be discovered. When alternative practices occur in unusually high-achieving countries, they may suggest hypotheses for how education in low-achieving countries might be improved. Of course, countries differ in so many ways that one cannot simply interpret associations between alternative practices and high student achievement as matters of cause and effect. Furthermore, because cultures differ across countries, sometimes quite sharply, it may be that practices in one country cannot be replicated in another. Still, hypotheses about potentially more effective educational practices have been generated by international studies of student achievement, and they can be tested for their feasibility and effects. For example, an analysis of TIMSS (Schmidt, McKnight, & Raizen, 1996) concluded that U.S. mathematics and science education, in comparison with that of higher achieving countries, is characterized by a “splintered vision” and that the United States must strive to create a curriculum with greater focus and less redundancy across grades. Standards-based reform could be seen as testing that hypothesis.
Another reason why international comparative studies of student achievement are useful is that, at least in the United States, policy makers often view them as more authoritative than within-country research. For example, the highly visible and influential report A Nation at Risk (National Commission on Excellence in Education, 1983) used results from international surveys of student achievement to argue, “The educational foundations of our society are presently being eroded by a rising tide of mediocrity that threatens our very future as a nation and a people” (p. 5). Unquestionably, the inflammatory language of the report had a great deal to do with its influence. Still, it was largely the international comparative data on student achievement, showing the United States ranking low among other countries, on which the report built its case. There was plenty of within-country data on which the report might have built its case, including the declines in National Assessment of Educational Progress (NAEP) and Scholastic Aptitude Test (SAT) scores during the 1970s.
There is at least one more purpose that international comparative studies of student achievement can serve—contributing to the advance of methodology. As is clearly documented in this volume, international comparative studies are complicated and difficult to do well. Over the past 40 years, many methodological advances have been made in the context of this international comparative work, and these advances have strengthened the quality of education research within the United States. The development of video methodologies for TIMSS is one example. The evolution of the concept of opportunity to learn and its shift from use as a control variable to a policy output variable is another.
The BICSE Symposium
Nine papers were commissioned by BICSE for the November 2000 symposium. Revised versions of those papers constitute the main body of this volume, and they are grouped into three areas: Study Design, Culture and Context, and Making Inferences.
Study design. Robert Linn wrote on the measurement of cognitive achievement in the design, conduct, and analysis of international studies in education. Ronald Hambleton addressed the translation of achievement tests and other instruments involved in international studies. James Chromy discussed the statistical issues of sampling, excluded populations, and age-versus-grade cohorts.
Culture and context. Janine Bempechat, Norma Jimenez, and Beth Boulay looked at cultural-cognitive issues in academic achievement and the assessment of student achievement beliefs, including attributions for success or failure. Claudia Buchmann addressed the measurement and use of family background variables in international studies of education. Gerald LeTendre identified methodological issues that arise in international comparative studies on education stemming from the varying cultural contexts of schooling in different nations.
Making inferences. Robert Floden addressed the measurement and use of opportunity to learn and other explanatory variables in the design, conduct, and analysis of international studies in education. Stephen Raudenbush and Ji-Soo Kim wrote on the statistical issues surrounding the comparisons of data across countries and between countries over time. Marshall Smith, former U.S. Undersecretary of Education, addressed the use of international comparative studies of education for drawing inferences for national policy.
Each author was asked (1) to describe, with reference to the particular topic, how, if at all, international surveys of student achievement have improved over time; (2) to assess the quality of the most recent work in the area; and (3) to identify ways in which future work might be strengthened.
Criteria for High-Quality International Studies of Student Achievement
How might one assess the quality of international comparative studies of student achievement? In 1990, BICSE identified the following criteria for a quality study (National Research Council, 1990):
The study has value for better understanding and/or improving U.S. education.
The study is tied to previous work for purposes of comparison.
The study takes into account cultural differences between countries.
The study is characterized by research neutrality (e.g., not just a Western perspective).
There is adequate capacity to conduct a study, both internationally and within each participating country.
The study has technical validity, including representative samples; precise estimates of parameters; appropriate achievement tests, with standardized administration; good translations; appropriate background questionnaires; an adequate analysis plan; an adequate reporting plan; dissemination to both technical and lay audiences; and adequate data audit.
These criteria informed the plans for the November 2000 symposium and the feedback to authors for revisions of their papers.
Other Types of International Comparative Work
Although the focus of this volume is on large-scale quantitative studies of student achievement, BICSE recognizes the importance of other forms of international comparative work. Small, focused studies can provide much greater depth in addressing questions of the purposes of schooling, the nature of teaching, and the attitudes and beliefs of students and parents. Reiterating a BICSE statement in A Collaborative Agenda for Improving International Comparative Studies in Education (National Research Council, 1993, p. 22):
In addition to large-scale surveys, there is a need for a wide range of other cross-national research, such as ethnographic studies, case studies, small-scale focused, quantitative and qualitative studies, and historical studies, that would allow us to understand what it means to be educated in diverse settings around the world.
KEY FINDINGS: METHODOLOGICAL ADVANCES AND LIMITATIONS
The nine research syntheses included in this volume provide a sense of progress over time, indicate current methodological quality, and identify areas in which improvements are needed. The remainder of this chapter highlights the major findings and conclusions of the syntheses. A concluding chapter by Brian Rowan, following the nine syntheses, offers further reflections on what the major findings mean for the future of large-scale comparative international studies of achievement.
Study Design: Achievement Tests, Translation, and Sampling
The chapters by Linn, Hambleton, and Chromy report substantial progress over 40 years of effort in improving the quality of the design and execution of comparative international studies of achievement. Recent and ongoing studies such as TIMSS and PISA exhibit high levels of methodological quality; confidence in the results they provide is justified. Important exceptions to the general high level of quality exist, and these will be noted here (and are spelled out in greater detail in subsequent chapters). However, the overriding conclusion from this volume is that the level of methodological quality is high and therefore the findings of large-scale studies are worth taking seriously.
Progress in test design, translation, and sampling. Tests of student achievement used in international comparative studies have improved markedly over time. The frameworks used to guide test construction have improved, delineating not only finer arrays of topics, but also—crossed with topics—cognitive demand (expectations for student achievement). Using the elaborated frameworks, matrix sampling has given achievement tests greater breadth of coverage. The result has been better attention to testing higher order skills than was previously possible. At the same time, increasingly high-quality country reviews of item pools have helped to determine test alignment with each country’s curriculum. This alignment, in turn, has made possible analyses of how well a country does on just those items that it deems appropriate. (Interestingly, the results have not greatly changed between-country rankings.)
Despite having students take different samples of items, more sophisticated uses of item response theory have allowed the creation of common scales. At the same time, there also has been an increase in the use of subscales, an important development because countries can differ in their achievement on subscales. Test designers now use differential item functioning (DIF) to identify items that might be culturally biased. Field testing of items and clearer standards for item statistics represent another improvement. Rigorous translation procedures are used in the most recent international studies, and translation errors involve only a small fraction of the test items. In short, important psychometric advances have made it feasible to accomplish many of the improvements in student achievement testing that early researchers recognized as needed.
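The psychometric linking described above can be illustrated, in deliberately simplified form, with an item response model. The two-parameter logistic model below is a sketch only (the operational scalings in TIMSS and PISA are more elaborate, involving multiple booklets and plausible values); it shows how responses to different samples of items can be placed on one proficiency scale:

```latex
% Two-parameter logistic IRT model: probability that student j,
% with proficiency \theta_j, answers item i correctly, where
% a_i is the item's discrimination and b_i its difficulty.
P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp\!\left[-a_i\,(\theta_j - b_i)\right]}
```

Because the item parameters are estimated jointly across booklets that share items, students who answered different item samples still receive proficiency estimates on a common scale; items whose estimated parameters differ across countries after conditioning on proficiency are the ones flagged in DIF analyses.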
Early cross-national analyses of educational achievement often had major shortcomings in both the design and the execution of sampling. Response rates were weak and poorly documented. That is no longer the case. Recent studies have improved dramatically and now meet reasonable standards that justify confidence in statistical inferences. These improvements include much better monitoring of quality and substantial improvement in documentation, supported by increased technological capacity to gauge response rates and sampling error. For example, sampling designs, response rates, and documentation for TIMSS were much improved over the Second International Mathematics Study (SIMS).1 PISA is at least as strong. Indeed, the execution of sampling has improved to the point where, in some cases, the sampling plans themselves may be more problematic than the samples drawn from them.
Important challenges of study design. The chapters on study design provide details about several problems that require attention. In BICSE’s view, three problems are paramount. First, the design of international standardized achievement tests reflects an inherent tension between depth and coverage of topic areas, given necessary constraints on test burden. Tests reflect mainly the intersection of curricular topics from different countries, rather than their union. Testing the content that is common among participating countries is challenging enough. Testing the shared content plus content specific to subsets of countries has not proven feasible. Still, it would be desirable to know not only which countries are most effective in the achievement of common content, but also how achievement is affected by the unique content emphases of specific countries. Moreover, the need to cover a wide range of topics and the emphasis on testing efficiency have led to a reliance on multiple-choice questions and thus have limited the assessment of higher order skills (e.g., through problems whose solutions require multiple steps). We conclude that the tension between depth and coverage probably has no resolution, but it is important to be aware of its source and consequences. Thus far, large-scale international studies of achievement have tilted toward broad coverage of content common across countries.
Second, whereas excellent work has been done to create common scales across studies within grades, more work needs to be done to create common scales across grades within studies. TIMSS-R could not estimate cohort gains from fourth to eighth grade because the fourth- and eighth-grade tests were not constructed on a common scale, even though TIMSS-R was promoted largely on its ability to support exactly such analyses.
Third, sampling designs for the oldest secondary cohort are unsatisfactory. In TIMSS, the most recent effort to attempt a study of this cohort (known in TIMSS as Population 3), students in the last year of compulsory schooling were surveyed. Because of differential dropout rates across nations, the last year of school is attended by very different fractions of the population in different countries. Moreover, students’ exposure to schooling varies dramatically across countries, as some have school-leaving ages as young as 16 and others as old as 21. Consequently, it is not clear what comparison this cohort study really offers. The symposium papers in this volume and other related work lead BICSE to conclude that the type of end-of-school cohort design used in previous studies should not be repeated. A different approach is offered by PISA, which is sampling 15-year-olds regardless of their grade level. This approach offers the best way to obtain a representative sample of an entire age cohort. Of course, it does not address the goal that TIMSS’s Population 3 study attempted to meet—namely, an estimate of achievement in the final year of compulsory schooling. Because of differences within and across countries in what compulsory schooling means, we conclude that goal is probably impossible to achieve. Still, a sample of 25-year-olds might provide a better estimate of achievement differences after completion of schooling.
Culture and Context
In the early surveys of international achievement, the focus was mainly on describing between-country differences in attainment. Before long, however, interest spread to understanding differences in approach and context and to explaining the achievement differences observed. The November 2000 BICSE symposium leads the Board to conclude that there are at least three reasons to study culture and context in large-scale studies: (1) to help in applying findings to our own country; (2) to interpret findings about cross-national achievement differences appropriately; and (3) to learn more about designing indicators of context.
The chapters in this volume on culture and context bring together three different views on the different contexts in which large-scale international studies take place. Bempechat, Jimenez, and Boulay write on culture and cognition from a psychological perspective; Buchmann focuses on the concept and measurement of family background from the standpoint of sociology; and LeTendre explores the cultural contexts of cross-national analyses from the viewpoint of anthropology. The different perspectives add richness to our understanding of methodological advances and challenges in the study of the contexts of international achievement studies.
Progress in understanding the contexts of international studies. Each of the three perspectives can identify progress from its own standpoint, but each also sees a need for deeper appreciation of contextual differences. Bempechat, Jimenez, and Boulay’s statement (this volume, p. 127) no doubt holds for the other authors as well: “Much of the literature on student achievement across nations has been devoid of the cultural contexts in which learning takes place.”
From the psychological perspective, because cognitive processes and sociocultural conditions are interdependent, understanding cognitive performance requires appreciation of cultural conditions. For this reason, no matter how tempting, it may not be effective to simply import the pedagogy of one nation into another, whatever the achievement standing of the first nation. An example is Japanese lesson study, an approach to professional development in which teachers work together over time and through many iterations to produce highly polished instructional units and to master the teaching of those units. Will lesson study work in the United States, despite our very different norms for the teaching profession? Probably not without at least some modifications and special support (Fernandez, Chokshi, Cannon, & Yoshida, in press). Still, the principles on which Japanese lesson study is based may be useful in designing and implementing effective professional development in other countries. Furthermore, researchers are just beginning to attend to cross-national variation in the meaning of school achievement. For the most part, however, this approach has not been integrated into large-scale studies.
The sociological perspective offers more evidence of progress. As the conceptualization of family background has become broader and more complex in sociological research, the measurement of family background in international education surveys has developed correspondingly. Family background now refers not only to socioeconomic status (i.e., parents’ education, occupation, and income), but also to family structure, parent involvement in schooling, and social and cultural resources. On the whole, recent comparative international studies are doing a better job of taking these conditions into account than did the early studies, although important exceptions to the general trend of progress exist. Moreover, reports of the results of comparative international analyses of achievement are increasingly likely to attend to the association of achievement with students’ family background conditions. Nonetheless, international studies
fall well short of appreciating differences in the meaning of various family background measures in different national contexts.
An increasingly rich understanding of the cultural contexts of teaching and learning, which has emerged quite apart from large-scale studies, has helped us place findings from large-scale studies in cultural perspective. For example, insights from previous case studies allowed Stevenson and Baker (1992) and Akiba and LeTendre (1999) to understand the importance of extra tutoring, to conceptualize it as a sort of “shadow education,” and to test their conception with large-scale data. More recently, in TIMSS and the IEA Civic Education Study, case studies have augmented the large-scale data collections. Although the case study reports have been insightful, they have not been analytically linked to the large-scale surveys, so their full potential for providing richer understanding of the context of achievement differences has not been effectively realized.
Important challenges of culture and context. Overall, progress in grappling with issues of cultural context has been modest, and the need for deeper understanding remains great. The three perspectives on context expressed in this volume are intertwined, as each recognizes that differences in cultural meanings across context affect the interpretation of cross-national achievement differences. For example, to understand achievement differences it is essential to understand differences in wealth and poverty, among as well as within nations. But what represents wealth in different contexts? Although much progress has been made in measuring family circumstances, such “background” conditions cannot be fully standardized across countries. Consequently, although large-scale international studies need to do a better job of applying comparable indicators across contexts, they also need to expand the range of indicators that are specific to particular contexts. These indicators will not necessarily be comparable across contexts, but they are essential for enabling deeper analysis of differences within countries.
Although TIMSS had many strengths, it took a step backward in measuring family background: Its survey question about family structure was too crude to be fully useful, and it omitted parental occupation entirely from student surveys. By contrast, PISA contains a fuller and more sophisticated array of background questions, including parents’ occupation and a question about parents’ education that lends itself to standardized categories more readily than those in the IEA surveys. Despite problems with missing data on questions about parent characteristics, we can find no compelling reason for omitting such questions entirely. Both TIMSS and PISA included questions about family cultural and economic resources,
and questions of this sort offer promise for richer indicators of social background in future studies.
Among the most powerful features of TIMSS were its case study and video components, which accompanied the achievement surveys. Yet the different components remain largely separate, partly by design and partly, according to LeTendre, because TIMSS has so much information to offer that integration has been difficult to manage: “[The] simple fact of ‘data overload’ meant that the potential insight to be gained by comparing the [qualitative and quantitative] databases has not yet been achieved. . . .” (this volume, p. 211). LeTendre offers a vision of an iterative, multimethod process in which different methodological approaches and conceptual lenses would be placed in dialogue with one another over time.
Making Inferences from Large-Scale Studies
One might expect that substantial progress in designing studies would bring corresponding developments in the ability to draw inferences from large-scale international achievement studies. On the whole, this is true, especially because this progress was accompanied by developments in the assessment of opportunity to learn and in statistical methods for analyzing data.
Progress in drawing inferences. International comparative studies of student achievement have popularized the concept of opportunity to learn; over time, opportunity to learn has become an increasingly large part of these international comparative studies. While there is no single universally accepted definition, in the international comparative research context, opportunity to learn refers to whether students have received instruction on particular content in an academic subject area. Similarly, international comparative studies have distinguished between the intended curriculum, the implemented curriculum, and the achieved curriculum; these distinctions, in turn, have had a considerable impact on education research in the United States. In this volume, Floden notes that measures of opportunity to learn have changed over time, but, with one exception, not with clear improvement. Initially, questions about opportunity to learn were asked in the context of whether students had received sufficient instruction to answer a particular achievement item correctly. TIMSS shifted the focus away from specific items to topics represented by multiple items. This change in focus should improve the quality of information by clarifying that a particular topic is of interest, not features of a particular achievement item (including, for example, format). However, because TIMSS contains no measure of achievement prior to the reported opportunity to learn (unlike SIMS, which included pretest and posttest
achievement data for the U.S. sample), it is not possible to link the TIMSS measures of opportunity to learn to achievement growth or to see whether the new measures perform better than those of previous studies in predicting gains in achievement. Moreover, advances in measuring opportunity to learn outside the large-scale studies, such as work by Porter (1998) and Mayer (1999), have not been incorporated into comparative international research. Instead, large-scale international studies seem to have made more progress in developing new ways to measure the intended rather than the implemented curriculum.
The importance of progress in statistical analyses during the history of comparative international studies cannot be overstated. As Raudenbush and Kim explain (this volume, p. 292), “Over the past several decades, statistical methods have greatly enhanced the capacity of researchers to summarize evidence from large-scale, multilevel surveys such as TIMSS and IALS [the International Adult Literacy Survey].” Developments such as item response models, estimation procedures for multilevel data, and new approaches for handling missing data have greatly enhanced the quality with which international comparative data can be analyzed. Better statistical analyses allow researchers to learn more from the data, and concomitant improvements in other aspects of study design—such as achievement tests, translation, sampling, and measurement of family background and opportunity to learn—make statistical inferences more reliable and meaningful. Despite these advances, great caution is needed in making statistical inferences from cross-national studies.
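As a simplified sketch of the multilevel approach Raudenbush and Kim discuss (an illustration, not their operational model), students can be treated as nested within countries, separating within-country from between-country variation:

```latex
% Level 1: achievement of student i in country j, as a function of
% a background measure x (e.g., parents' education).
y_{ij} = \beta_{0j} + \beta_{1j}\,x_{ij} + \varepsilon_{ij},
  \qquad \varepsilon_{ij} \sim N(0, \sigma^2)
% Level 2: country-specific intercepts and slopes vary around
% cross-national averages \gamma_{00} and \gamma_{10}.
\beta_{0j} = \gamma_{00} + u_{0j}, \qquad \beta_{1j} = \gamma_{10} + u_{1j}
```

In this formulation, questions about mean achievement differences between countries concern the country-level deviations u_{0j}, while questions about how strongly achievement is tied to family background within each country concern the slopes \beta_{1j}; keeping the two levels distinct is precisely the kind of inferential discipline such models enforce.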
Since the 1980s, results from international studies have drawn considerable attention from policy makers and their advisors. Yet these results tend to become politicized, used as ammunition to support a position rather than as data to help select among competing alternatives. The decentralized governance structure of U.S. education makes it difficult to identify appropriate inferences from comparative international studies at the national level. At the same time, the increasing salience of state-level decision making in U.S. education, coupled with state- and district-level participation in international studies (TIMSS and TIMSS-R), may mean that the results from international studies have more bearing on decisions about education policy at the state level than at the federal level.
Important challenges in making inferences. Despite improvements in opportunity-to-learn measures and statistical modeling techniques, important challenges remain. In the case of measuring opportunity to learn, it is ironic that international studies, which pioneered and popularized the study of opportunities for learning, have not kept pace with research developments. Moreover, few reports on the results of international studies take advantage of the opportunity-to-learn data that are available. In
the future, more sophisticated and fine-grained assessment of opportunities for learning should be considered.
The fact that TIMSS lacks a pretest makes it quite clear that the TIMSS data cannot be used for drawing causal inferences. Interestingly, however, Raudenbush and Kim point out that even with pretest data other sources of bias may be present, and causal inference is problematic. Indeed, Raudenbush and Kim as well as Smith question whether large-scale comparative international studies are the proper arena for assessing causal explanations for achievement differences. In light of the important contextual differences between countries, the value of international studies may lie more in their potential for generating hypotheses about causal explanation than in their use as platforms for testing hypotheses. After all, it makes little difference for U.S. education policy if, for example, extra tutoring “accounts” for the achievement advantage of Japanese over U.S. students, if tutoring does not enhance achievement in an internal assessment of variation among U.S. students. As the international comparative education community has learned from its consideration of context and culture, a policy that works in Japan may not work in the United States; consequently, what can be expected from a comparative study are provocative new hypotheses about what may account for differences in student achievement. Studies of internal U.S. variation are indispensable if one wishes to determine whether a policy change in the United States would make a difference in this context under our cultural conditions.
In some cases, sufficient variation may not exist for an internal assessment of variation. For example, very few U.S. students may receive rigorous academic tutoring, so it may not be possible to assess whether tutoring would make a difference here. But a comparative international study cannot help with this quandary; the solution, as Raudenbush and Kim explain, lies in manipulating the U.S. context in order to create the variation necessary for a test of the hypothesis within the United States.
In short, the symposium papers in this volume lead BICSE to conclude that it is most productive to use international comparative studies to develop hypotheses that are then tested in experimental and quasi-experimental studies within the United States. It is not clear that the hypothesis-testing studies need to be part of the comparative international framework. If they are, however, then it is absolutely essential that pretest data be gathered from study participants.
Although statistical analyses in current international studies are competent, they could do much more to describe national systems of education. In particular, indicators of central tendency (means) garner more attention than they deserve, and more attention should be paid to two additional elements: the overall dispersion of achievement around the central tendency, and the relationship of achievement to important social
categories such as family background. Raudenbush and Kim commend the use of confidence intervals to mark the degree of uncertainty around particular means. Finally, Raudenbush and Kim as well as Smith recognize that more could be done to enhance readers’ understanding of the possibilities and limitations of statistical analyses of large-scale comparative databases. A sort of “consumer’s guide” might be helpful to policy makers, journalists, educators, and the general public—in short, the entire potential audience of consumers of international comparative education studies.
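The kind of reporting these authors call for can be sketched in a few lines of code. The sketch below uses simulated, invented scores (not data from any actual survey) to show a country mean accompanied by its overall dispersion and a 95% confidence interval, rather than the mean alone:

```python
import random
import statistics

# Simulated scale scores for one hypothetical country (all numbers invented).
random.seed(0)
scores = [random.gauss(500, 100) for _ in range(400)]

mean = statistics.fmean(scores)
sd = statistics.stdev(scores)          # overall dispersion around the mean
se = sd / len(scores) ** 0.5           # standard error of the mean
ci_low, ci_high = mean - 1.96 * se, mean + 1.96 * se

print(f"mean = {mean:.1f}, sd = {sd:.1f}")
print(f"95% CI for the mean: [{ci_low:.1f}, {ci_high:.1f}]")
```

Reporting the standard deviation alongside the interval makes visible both how uncertain the estimated mean is and how widely achievement is spread within the country, which a bare ranking of means conceals.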
The chapters in this volume and the discussions at the symposium where they were originally presented lead the Board to conclude that the methodology of large-scale international comparative studies of student achievement has improved markedly over the past 40 years. Methodological work in these studies is sounder today than ever, and in absolute terms very good. Although progress has been made on virtually all fronts—achievement testing, translations, sampling, sensitivity to culture and context, and statistical analysis—more progress has been made in some areas (e.g., sampling and testing) than in others (e.g., accounting for culture and context).
Benefits of Large-Scale International Studies
These increasingly rigorous studies of international student achievement have produced basic knowledge, generated ideas for improved practice in the United States, and contributed to methodological advances. This body of work has made clear not only that U.S. student achievement in mathematics and science at the eighth grade is not ideal, but also that this deficiency is not a new development. The international standing of the United States in eighth-grade mathematics and science achievement has remained relatively stable over all of the international assessments addressing those subjects. Large-scale international surveys of student achievement also have shown that U.S. achievement is among the best in the world in mathematics and science at the fourth-grade level and in reading achievement in the early grades. Such international benchmarking of U.S. achievement not only helps U.S. citizens, policy makers, and educators better understand our system’s productivity, but also helps to set a context for interpreting achievement standards recently adopted in the United States. For example, the finding that progressively fewer U.S. students achieve at high levels on NAEP standards as one goes from elementary to middle to high school is
consistent with the finding that U.S. achievement falls progressively further behind that of other countries at increasing grade levels. In contrast, the high performance of the United States relative to other countries in reading and science in the elementary grades is not consistent with U.S. students’ performance against NAEP standards.
Large-scale international assessments of student achievement also have generated important new hypotheses about how U.S. education might be strengthened. TIMSS curriculum analyses, as well as earlier results from SIMS, revealed that the U.S. curriculum puts a premium on breadth of content coverage, whereas curricula in some other countries are much more focused, emphasizing depth over breadth. This finding has led researchers to the hypothesis that a more highly focused U.S. curriculum might result in improved student achievement. Although the current U.S. standards reform movement was not stimulated by this hypothesis, it represents a massive attempt to bring greater focus and depth to the U.S. curriculum. This reform may provide an opportunity to test the importance of depth over breadth. The Japanese lesson study is yet another example of an approach to education from another country that might prove useful in the United States. Still, whether the lesson study can actually be implemented in our culture and, if so, what its effects might be on student achievement are questions that remain to be answered. Fortunately, work is under way to test the hypothesis that lesson studies could improve the quality of U.S. education (Fernandez et al., in press).
The TIMSS video studies provided powerful new insights into how U.S. teachers share certain tendencies in their pedagogical practices and style, and how those practices and style stand in sharp contrast to those of teachers in Japan. Stigler and Hiebert (1999) conclude that Japanese teachers teach in ways more consistent with the vision of today’s U.S. education reformers than do our own teachers.
Methodologically, large-scale international studies of student achievement have been influential as well. The powerful concept of opportunity to learn, conceptualized originally as a control variable but more recently as an explanatory variable in studies of student achievement, has had an enormous impact on education research in the United States. Increasingly, studies attempting to explain differences in student achievement in the United States are including opportunity-to-learn variables as well as pedagogical strategy variables. In some cases, opportunity-to-learn variables are being used as education system output measures in their own right. Opportunity to learn has even established itself as a legal requirement for the use of high-stakes testing (National Research Council, 1999). Another example of a methodological advance is the use of video as a research tool. Although video has been used in U.S. education research
for some time, its use was relatively limited prior to TIMSS. The surge in enthusiasm for video as an education research tool, although undoubtedly influenced by the TIMSS results, also surely has been fueled by rapid advances in video technology, especially digital video.
In addition to improvements in methodology and in understanding of U.S. education, large-scale international studies of student achievement have led to a third accomplishment—the building of an international infrastructure for conducting comparative research in education. Thirty-five years ago, most countries lacked the capacity to participate in an international study of student achievement. Today, perhaps as many as 60 countries have that capacity. This increased capacity makes it likely that future studies will be of higher quality. However, it also represents an accomplishment in and of itself because the increased capacity of those 60 countries undoubtedly will be used not only for international studies of student achievement, but also for within-country education research.
Thoughts on the Future
The chapters in this volume lead BICSE to conclude that the United States should continue to participate in and encourage the regular conduct of large-scale international studies of student achievement. The Board has not reached consensus on how frequently such studies are needed or how comprehensive each study needs to be in collecting data beyond student achievement (e.g., video; case studies; surveys of principals, students, and teachers). Rowan, in his chapter, suggests that once a decade may be enough for each subject, but that, as with NAEP, there should be a cycle of subjects, so that an international comparative study of student achievement is conducted every two or three years in one subject or another. He points out that this approach should help to maintain the within-country capacity to do such work.
BICSE is also unclear about what age or grade span should be surveyed, though we are inclined to believe that some preschool studies should be attempted, generating baselines for education system productivity, and that some post-school-age cohorts should be studied to get at system yield. On these specifics, the chapters in this volume raise more questions than they answer.
A number of issues are central in thinking about the design of future work and interpretations of past work. The chapters that follow identify these issues; we outline them here as a partial road map to thinking about that work.
Adjustments for between-country differences in background conditions. One issue concerns whether to adjust for background conditions when
comparing the achievement of one nation to that of another. On the one hand, we know that the student populations served by one country at a particular age or grade level are not comparable to those served by another. They differ on dimensions such as socioeconomic status (SES), age, ethnicity, and urbanicity. As Raudenbush and Kim point out, countries might differ dramatically in average student achievement, yet if comparisons were made between countries by subgroups, such as within levels of SES, no differences might be found. On the other hand, the chapters in this volume lead to the conclusion that drawing causal inferences from between-country comparisons is not possible. In short, analyses cannot include sufficient controls to support the conclusion that between-country differences in student achievement are due to differences in educational practices between those countries. When adjustments should be made, and what kinds of adjustments are appropriate, when creating indicators of between-country differences in student achievement remain unclear. Nevertheless, continuing to rely solely on unadjusted differences in student achievement seems certain to be misleading.
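Raudenbush and Kim's compositional point can be made concrete with a small numerical sketch. The two "countries," scores, and SES mix below are entirely invented for illustration: the countries differ in unadjusted mean achievement only because they serve different mixes of students, and the gap vanishes within each SES stratum.

```python
from statistics import fmean

# Invented (score, ses_stratum) pairs. Country A serves more high-SES
# students than country B; within each stratum, achievement is identical.
country_a = [(520, "high")] * 60 + [(460, "low")] * 40
country_b = [(520, "high")] * 30 + [(460, "low")] * 70

def mean_score(students, stratum=None):
    """Mean score overall, or within one SES stratum if given."""
    return fmean(s for s, ses in students if stratum is None or ses == stratum)

# Unadjusted means differ: A = 496, B = 478 -- purely a composition effect.
print(f"unadjusted: A = {mean_score(country_a):.0f}, B = {mean_score(country_b):.0f}")
for stratum in ("high", "low"):
    print(f"within {stratum}-SES: A = {mean_score(country_a, stratum):.0f}, "
          f"B = {mean_score(country_b, stratum):.0f}")
```

The unadjusted 18-point gap says nothing here about educational practice; it reflects who is being served. This is why unadjusted country rankings, taken alone, can mislead.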
Age cohorts versus grade cohorts. Some international studies of student achievement use age cohorts, and others use grade cohorts. Which approach is more useful remains an issue. There are advantages and disadvantages to each. Age-based samples make studying education effects more difficult because students are spread across a number of grades, making a curriculum-based achievement test problematic. Age-based samples are also more expensive to survey because cluster sampling is more difficult to achieve. Still, when a grade-based sample is taken, countries can differ dramatically in the ages of the students included. An age-based cohort controls for such confounding. Furthermore, an age-based cohort, using household sampling to draw a probability sample of all people of a specific age in a specific country, controls for differential dropout rates between countries. Grade-based cohorts have been the dominant mode in international studies of student achievement, but that pattern may not continue. For example, PISA will sample 15-year-olds. An interesting argument taken from chapters in this volume by Raudenbush and Kim and Rowan suggests that important new insights might result from studies of age-based cohorts at (1) an age prior to entry into schooling for most countries and (2) an age after the completion of most schooling (say, age 25). The preschool age cohort would establish a baseline for theories about the effects of schooling. The 25-year-old cohort, as a household survey, would provide information on whether U.S. students catch up with their international comparison groups during the postsecondary school years.
Assessment of common versus unique content. A third issue concerns
the construction of student achievement tests used in international comparative studies. Linn makes a distinction between (1) testing only content common to the curricula of all participating countries (the intersection) and (2) testing not only that common content, but also all of the unique content across all participating countries (the union). The intersection is clearly the more feasible and does represent common ground for comparison. Nevertheless, although testing the union of content across countries may not be feasible, moving in that direction might produce a more ambitious test of student achievement and in that sense might provide a different kind of international benchmarking than we currently have. Until now, achievement tests used in international comparative studies of student achievement have been said to have a U.S. bias. That is, the tests have been similar to tests commonly used in the United States, including a multiple-choice format, and presumably fairly well aligned with current U.S. practice. But today’s education reforms in the United States call for a much more ambitious curriculum and much more ambitious student achievement. Are other countries teaching the type of curriculum we seek, and if so, how are their students achieving relative to our own? Perhaps some component of the achievement test used in international comparative studies should test the content we seek, rather than the content we are providing.
Analysis of within-country variance versus central tendency. A theme across many of the chapters in this volume is the need to look at variance in student achievement within countries, as well as central tendency. Analyzing within-country variance as well as central tendency would represent a change in the practice of large-scale international studies of student achievement, and so we list it here as an issue. In the TIMSS-R benchmarking study in the United States, more than 30 states and large districts participated in the student achievement study as though they were nations (Martin et al., 2001; Mullis et al., 2001). The results from that work reveal—as did the First in the World Consortium from TIMSS (http://www.1stintheworld.org/)—that there exists enormous variance in student achievement within the United States, and that some states and districts achieve at levels comparable to the highest achieving countries in the world. The emphasis on state and district participation in international studies of student achievement ensures that within-country variance, at least in the United States, will be addressed in the future. Designs and analyses that produce valid estimates of between-school, between-class, and within-class variance also are needed. Other approaches also should be utilized. LeTendre points out in his chapter that qualitative studies of achievement contexts also need to focus more on within-country differences than they have in the past. For the United States, where variance in student achievement is enormous, an examination of distributions will represent a huge improvement in the information yield of international studies of student achievement.
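The kind of variance estimate such designs should support can be sketched with a crude sum-of-squares decomposition. All numbers below are invented simulation parameters, and the two-level split (between vs. within schools) stands in for the fuller between-school, between-class, and within-class partition:

```python
import random
from statistics import fmean

# Simulate 50 schools of 30 students each (all parameters invented).
random.seed(1)
schools = []
for _ in range(50):
    school_mean = random.gauss(500, 40)     # between-school spread
    schools.append([random.gauss(school_mean, 80) for _ in range(30)])

grand = fmean(score for school in schools for score in school)

# Partition total variation into between- and within-school sums of squares.
ss_between = sum(len(s) * (fmean(s) - grand) ** 2 for s in schools)
ss_within = sum((score - fmean(s)) ** 2 for s in schools for score in s)

# Between-school share of the total sum of squares: a rough analogue of the
# intraclass correlation, i.e., how much achievement variance lies between
# rather than within schools.
share_between = ss_between / (ss_between + ss_within)
print(f"between-school share of total variance: {share_between:.2f}")
```

A production analysis would fit a multilevel model rather than raw sums of squares, but even this sketch shows the information a single national mean discards: two countries with identical means can differ sharply in how achievement is distributed across schools.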
Integration of qualitative and quantitative data. Over the last 35 years, quantitative results on student achievement from surveys of teachers and students have been augmented by qualitative data from case studies and video. The chapters on culture and context in this volume argue forcefully that such qualitative data are needed to understand the effects of context and culture on student achievement. Nevertheless, those chapters make clear that progress thus far has been limited, especially in integrating qualitative data with quantitative data. New designs and new analysis strategies will need to be created if the desired integration of quantitative and qualitative data is to be achieved. Qualitative data from small-scale focused studies, both within countries and comparatively across them, hold promise for informing the direction and character of the large-scale survey work. Qualitative data might profitably be included as an integral component of large-scale comparative studies as well, although not all efforts to date have resulted in integrated analyses and interpretations.
Use and impact of results. The international comparative education field needs to think more about the uses of results from international studies of student achievement. Do the results have a positive influence on education policy and practice? Do they contribute to our understanding of the quality of education and how it might be improved in the United States? Are the results used at the local level by districts and states, as well as by the federal government? What are the effects on the international infrastructure for conducting comparative research? Recently, BICSE attempted to determine what was being learned about the uses of TIMSS data by states and districts. Although many anecdotes were offered about important uses, no systematic studies could be found. Smith speculates about how, more generally, the international comparative results have been used, suggesting that they are most influential when used to support reforms already under way (although he acknowledges that results such as those from the early surveys also have been used to generate new education reform). The chapters in this volume lead the Board to conclude that it is worth considering how the impact of international surveys of student achievement on U.S. policy and practice may best be documented and studied. Perhaps a greater investment should be made in documenting this impact.
This collection of methodological analyses is intended to guide three audiences: governmental agencies in the United States and elsewhere that support international studies; nongovernmental agencies such as the IEA and OECD that carry out international studies; and the many researchers who both benefit from and contribute to the findings and methods of
large-scale international studies of student achievement. Taken as a whole, the chapters also show how BICSE can contribute to informed decisions about the nature of and participation in these studies. In the future, BICSE may take up some of the key challenges identified in this volume, such as integrating culture and context more effectively in future surveys and encouraging studies about the uses and impact of such work.
See, for example, the technical standards developed for the International Association for the Evaluation of Educational Achievement (IEA) studies (Martin, Rust, & Adams, 1999).
Akiba, M., & LeTendre, G. (1999). Remedial, mixed, and enhancement systems: An analysis of math achievement and extra-lessons. Paper presented at the meeting of the Comparative International Education Society, Toronto, Canada.
Beaton, A. E., Postlethwaite, T. N., Ross, K. N., Spearritt, D., & Wolf, R. M. (on behalf of the International Academy of Education). (1999). The benefits and limitations of international educational achievement studies. Paris: International Institute for Educational Planning/UNESCO.
Fernandez, C., Chokshi, S., Cannon, J., & Yoshida, M. (in press). Learning about lesson study in the United States. In E. Beauchamp (Ed.), New and old voices on Japanese education. Armonk, NY: M. E. Sharpe.
Martin, M. O., Mullis, I. V. S., Gonzalez, E. J., O’Connor, K. M., Chrostowski, S. J., Gregory, K. D., Smith, T. A., & Garden, R. A. (2001). Science benchmarking report: TIMSS 1999—Eighth grade: Achievement for U.S. states and districts in an international context. Chestnut Hill, MA: Boston College, Lynch School of Education, International Study Center.
Martin, M. O., Rust, K., & Adams, R. J. (Eds.). (1999). Technical standards for IEA studies. Amsterdam: IEA.
Mayer, D. P. (1999). Measuring instructional practice: Can policymakers trust survey data? Educational Evaluation and Policy Analysis, 21, 29-45.
Mullis, I. V. S., Martin, M. O., Gonzalez, E. J., O’Connor, K. M., Chrostowski, S. J., Gregory, K. D., Garden, R. A., & Smith, T. A. (2001). Mathematics benchmarking report: TIMSS 1999—Eighth grade: Achievement for U.S. states and districts in an international context. Chestnut Hill, MA: Boston College, Lynch School of Education, International Study Center.
National Commission on Excellence in Education. (1983). A nation at risk: The imperative for educational reform. Washington, DC: U.S. Department of Education.
National Research Council. (1990). A framework and principles for international comparative studies in education. Board on International Comparative Studies in Education, N. M. Bradburn and D. M. Gilford, Eds. Commission on Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.
National Research Council. (1993). A collaborative agenda for improving international comparative studies in education. Board on International Comparative Studies in Education, D. M. Gilford, Ed. Commission on Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.
National Research Council. (1999). High stakes testing for tracking, promotion, and graduation. Board on Testing and Assessment, J. P. Heubert and R. M. Hauser, Eds. Commission on Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.
Porter, A. C. (1998). The effects of upgrading policies on high school mathematics and science. In D. Ravitch (Ed.), Brookings papers on education policy 1998 (pp. 123-167). Washington, DC: Brookings Institution Press.
Postlethwaite, T. N. (1999). International studies of educational achievement: Methodological issues. Hong Kong: University of Hong Kong, Comparative Education Research Center.
Schmidt, W. H., McKnight, C. C., & Raizen, S. A. (1996). A splintered vision: An investigation of U.S. science and mathematics education. Dordrecht, Netherlands: Kluwer Academic.
Stevenson, D. L., & Baker, D. P. (1992). Shadow education and allocation in formal schooling: Transition to university in Japan. American Journal of Sociology, 97, 1639-1657.
Stigler, J. W., & Hiebert, J. (1999). The teaching gap: Best ideas from the world’s teachers for improving education in the classroom. New York: Free Press.