Conclusions and Recommendations
Investigating curricular design and implementation is a complex undertaking, and so is reviewing the evaluations of curricula. The committee has conducted its work within a climate of controversy over whether U.S. children are being well served by their mathematical fare. We worked in a period during which proponents of changes to curriculum and pedagogy were struggling to gain acceptance of those changes and were being subjected to intense scrutiny as they did so. If these approaches are fundamentally wrongheaded, criticizing them at this time of precariousness is entirely appropriate. If these approaches are potentially worthy, spurious critiques themselves may cause the experimentation to fail. Between these two extremes lies a host of other possibilities. The fundamental question of this study was, What is the quality of the evaluations that were designed to judge the effectiveness of these 19 mathematics curricula? An answer to this question should help us learn how to respond to these debates.
Curricular implementation involves the coordination of a variety of factors at differing levels of a system. Evaluations of curricular implementation should acknowledge this complexity, and yet produce reasonably concise, reliable, valid, and cost-efficient evidence of their effectiveness. Education is not simply a bottom-line phenomenon. Thus the effectiveness of curricula depends not only on a simple average or accumulation of effects across test takers, but on a careful assessment of the distribution of effects across grades and topics, across subgroups over time, and across the myriad of unique regional variations of our nation. Implementation, for its part, is not achieved by a blind execution of procedures, but rather by the develop-
ment of a community of practitioners competently prepared to make appropriate use of materials and exercise judgment in their use. Furthermore, curriculum design is not a rigid scripting of a scope and sequence, but the presentation of sets of tasks and instructional materials linked to relevant standards that can engage students, build on their previous knowledge, and assist them in gaining the mental discipline and proficiency required of knowledgeable citizens and world-class scholars. Effectiveness should consider all these factors, in terms of both potential impact and associated opportunities and risks, and transform them into a judgment concerning a curricular program. In an age of instantaneous recipes and 10-second sound bites, evaluators should provide potential and actual clients with theory-driven, methodologically astute and sound, and practitioner-informed evaluations on which to base curricular decisions.
The committee held fast to a single commitment, namely, that our greatest contribution would be to clarify the proper elements of an array of evaluation studies designed to judge the effectiveness of mathematics curricula, and to clarify what standards of evidence would need to be met to draw conclusions on effectiveness. The committee does not believe any single study determines effectiveness; however, drawing from what could be learned from previously conducted evaluations, we sought to uncover and present practical, sound, and rigorous evaluation designs that could produce the evidence necessary to resolve the debates, bring to the surface variations in values, and propel us toward better serving the nation’s youth. We do not claim that the evaluation framework presented herein is a perfect solution to the problems of assessing curricular effectiveness, but rather view it as a way to take stock of our current knowledge base and to stimulate the field to modify or refine it.
In building the framework, we drew heavily from the National Research Council (NRC) publication entitled Scientific Research in Education (NRC, 2002), in which six qualities of scientific research were identified as crucial:
Posing significant questions that can be investigated empirically;
Linking research to relevant theory;
Using methods that permit direct investigation of the question;
Providing a coherent and explicit chain of reasoning;
Replicating and generalizing across studies; and
Disclosing research to encourage professional scrutiny and critique.
Evaluation, no less than research, should answer to this set of principles, except that replication and generalization are more closely subject to the constraints and conditions specified in the program’s design. As with scientific research, evaluators need to ensure that when there are
competing theories concerning a phenomenon, they work to rule out alternative hypotheses. Scientific Research in Education further advocated the value of the use of multiple methodologies to improve one’s chances of understanding the complexity in the phenomena under investigation. To complement this work, the committee argued for approaches that draw from multiple methodologies, involve multidisciplinary roots, recognize the importance of ethics and volition, and acknowledge the dependency of the work on building and maintaining mutually respectful relationships with practitioners. Furthermore, evaluation, like research, benefits from the careful accumulation and synthesis of such work. The best that can be asked for and expected is that such experiments in curricular reform be conducted with great care and sensitivity to the values of the constituencies, that they be monitored and reviewed with careful and thorough evaluations, and that the evidence be examined rigorously with periodic review to ensure continuous improvement. As with any scientific enterprise, the specifics of the approaches will evolve with the understanding of the problems.
In addition, there needs to be a commitment by all evaluators and investigators, including the committee, to a generous, thoughtful, and critical consideration of various possible interpretations of the data and a profound intellectual respect for others also undertaking these studies. It does not serve the public well to dismiss the considerable work represented by both the development of the 19 curricula under examination and the efforts of evaluators to document and study their effects. In fact, given the preponderance of studies regarding the curricula supported by the National Science Foundation (NSF), one should acknowledge and value the role of the NSF in requiring the production of many summative evaluations and related research. These were a byproduct of NSF’s role in stimulating the development of significant numbers of alternative curricular approaches to address the relatively weak performance of American students in mathematics and the inequities in mathematics learning. Multiple publishers testified that they followed NSF’s lead in undertaking their own development efforts. The NSF’s activity has been crucial in making this evaluation of evaluations possible and thereby in propelling the nation toward new insights and standards with regard to the conduct of curricular development and accompanying evaluation.
The history of science concerns not only the accumulation of facts and theories, but also the development of method. Developing method combines technical prowess with theoretical clarification and negotiated agreement on what terms mean and how to gather evidence on issues. To date, there has been too little focus on how to resolve disputes and how to interpret evidence, and too much fractious commentary dismissing others’ perspectives on the basis of anecdote and thin doses of empirical data. The committee saw our charge as a means to stabilize
method around feasible, valid, and reliable ways to evaluate the quality of evidence on effectiveness.
The committee proceeded in a systematic way to accumulate the array of studies on these curricula, categorizing them into four major methods that could shed light on the determination of effectiveness: content analyses, comparative studies, case studies, and synthesis studies. Other studies and submitted reports provided valuable information on the background or the emerging constructs for curricular implementation, but were not sufficiently relevant to our charge. Within these four categories, subcommittees scrutinized the evaluations and identified the studies that met adequate standards for that methodology. This task required committee members to articulate those standards in the context of mathematics curricula. Each subcommittee compiled its findings, based on a careful review of the evaluation studies, and submitted them to the whole committee for review. Finally, the committee as a whole drew relationships among those findings, connected the reviews to the framework, and crafted the conclusions and recommendations.
These 19 curricular projects essentially have been experiments. We owe them a careful reading of their effectiveness. Demands for evaluation may be cast as a sign of failure, but we would rather stress that this examination is a sign of the success of these programs in engaging the country in a scholarly debate on the question of curricular effectiveness and the essential underlying question, What is most important for our youth to learn in their studies in mathematics? To summarily blame national decline on a set of curricula with a limited market share lacks credibility. At the same time, finding out whether a major investment in an approach is successful and worthwhile is a prime example of responsible policy. In experimentation, success and worthiness are two different measures of experimental value. An experiment can fail and yet be worthy. The experiments that probably should not be run are those in which it is either impossible to determine whether the experiment has failed or it is ensured at the start, by design, that the experiment will succeed. The contribution of the committee is intended to help us ascertain these distinctive outcomes.
THE QUALITY OF THE EVALUATIONS
The charge to the committee was “to assess the quality of studies about the effectiveness of 13 sets of mathematics curriculum materials developed through NSF support and six sets of commercially generated curriculum materials.” Based on our activities, the final product of our work was to present “the criteria and framework for reviewing the evidence, and indicating whether the currently available data are sufficient for evaluating the efficacy of these materials.” Finally, if these data were not sufficiently
robust, then the committee was also asked to “develop recommendations about the design of a subsequent project that could result in the generation of more reliable and valid data for evaluating these materials.”
In response to our charge, the committee finds that:
The corpus of evaluation studies as a whole across the 19 programs studied does not permit one to determine the effectiveness of individual programs with a high degree of certainty, due to the restricted number of studies for any particular curriculum, limitations in the array of methods used, and the uneven quality of the studies.
Therefore, according to our charge, we recommend that:
No second phase of this evaluation review should be conducted to determine the effectiveness of any particular program or set of curricular programs on the basis of the current database.
The committee emphasizes that we did not directly evaluate the materials. We present no analysis of results aggregated across studies by naming individual curricular programs because we did not consider the magnitude or rigor of the database for individual programs substantial enough to do so. Nevertheless, there are studies that provide compelling data concerning a program’s effectiveness in a particular context. Furthermore, we do report on individual studies and their results to highlight issues of approach and methodology. To remain within our primary charge, which was to evaluate the evaluations, we do not summarize results on the individual programs.
The second part of our charge was to present the criteria and framework for reviewing the evidence. To do so, we have developed a set of definitions of key terms which draw on a framework for evaluating the effectiveness of mathematics curricula. Using these definitions and the framework, we were able to undertake a review of the major categories of evaluation studies. We briefly review the definitions and the framework.
FRAMEWORK AND KEY DEFINITIONS
To guide our review of evaluations of mathematics curricula, the committee developed a “Framework for Evaluating Curricular Effectiveness” (see Figure 3-2). This framework emerged from deliberations of the committee following the testimony of experts in the field at two workshops held during 2003, motivated by the need to find common ways to examine different types of evaluations. It permitted the committee to compare evaluations and consider how to identify and distinguish among the variety of methodologies they employed. The committee recommends that individuals
or teams charged with curriculum evaluations conduct studies that make use of the following framework:
Effectiveness of curriculum materials should be determined through evaluation studies that specify the program under investigation in relation to three major components and their interactions:
The program materials and author’s design principles;
The quality, extent, and means of curricular implementation components; and
The effects on the quality, breadth, type, and distribution of outcomes of student learning over time.
Evaluation studies should further articulate the research design, measurement, and documentation of the above components, and the analysis of results. Secondary components of systemic factors, intervention strategies, and unanticipated influences should also be considered.
The quality of an evaluation depends on how well it connects these components into a chain of reasoning, evidence, and argument to show the effects of curricular use, and to demonstrate their connection to the treatment under investigation. Studies could also include systematic variation to explore which features of curricula are context dependent and which are context independent.
In applying the framework, one needs to distinguish two different aspects of determining curricular effectiveness. First, a single study should demonstrate that it has obtained a level of scientific validity. Then, for a curricular program to be established as effective, a set of scientifically valid studies should be aggregated and synthesized to yield a judgment of effectiveness. We address each of these aspects in turn.
Based on the framework, the committee identified a set of methodological categories of evaluations. For each category, the committee developed a set of methodological expectations for conducting that type of study. This permitted us to define a scientifically valid study as follows:
For a single curricular evaluation to be scientifically valid, it should address the components identified in the “Framework for Evaluating Curricular Effectiveness.” In addition, it should conform to the methodological expectations of the appropriate category of evaluation as discussed in the report (content analysis, comparative study, or case study). Other designs are possible but would have to address both the theoretical and methodological considerations specified in the framework.
SCIENTIFICALLY ESTABLISHING CURRICULAR EFFECTIVENESS
Defining scientific validity for individual studies is an essential element of understanding curricular effectiveness. However, curricular effectiveness cannot be established by a single scientifically valid study; instead, a body of studies is needed.
Curricular effectiveness is an integrated judgment based on the interpretation of a number of scientifically valid evaluations that combine social values, empirical evidence, and theoretical rationales. As with assessing test validity (Messick, 1989, 1995), determining effectiveness is a continuing and evolving process. As the body of studies about a curriculum grows larger, findings about its effectiveness can be enhanced or contravened by new findings, new approaches, new research, and changing social conditions.
Furthermore, a single methodology, even replications and variations of a study, is inadequate to establish curricular effectiveness, because some types of critical information will be lacking. For example, a content analysis is important because, through expert review of the curriculum content, it provides evidence about such things as the quality of the learning goals or topics that might be missing from a particular curriculum. But content analysis cannot determine whether that curriculum, when actually implemented in classrooms, achieves better outcomes for students. In contrast, a comparative study can provide evidence of improvement in student learning in real classrooms across different curricula. Yet without the kind of complementary evidence provided in a content analysis, nothing will be known about the quality or comprehensiveness of the content in the curriculum that produced higher scores. Furthermore, neither content analyses nor comparative studies typically provide enough detailed information about the quality of the implementation of a particular curriculum. A case study provides deep insight into issues of implementation; however, by itself, it cannot establish representativeness or causality.
Therefore, the committee concluded that:
No single methodology by itself is sufficient to establish a curricular program’s effectiveness. The use of multiple methodologies of evaluation strengthens the determination of effectiveness, provided that each is a scientifically valid study.
This conclusion led the committee to propose the following overarching recommendation:
A curricular program’s effectiveness should be ascertained through the use of multiple methods of evaluation, each of which should be a scientifically valid study. Periodic synthesis of the results across evaluation studies should also be conducted.
This is a general principle for the conduct of evaluations, in recognition that curricular effectiveness is an integrated judgment, continually evolving, and based on scientifically valid evaluations. The committee further recognized that agencies, curriculum developers, and evaluators need a more explicit standard by which to decide whether federally funded curricula (or curricula from other sources whose adoption and use may be supported by federal monies) can be considered effective enough to adopt. The committee decided to recommend a rigorous standard that programs must meet in order to be scientifically established as effective. The standard consists of two parts: (1) specification of the array of methodologies required, along with their key characteristics, and (2) criteria to determine when the standard has been achieved. The standard relies on the primary methodologies identified in our review, but we acknowledge the possibility of other configurations, provided they draw on the framework and the definition of scientifically valid studies, and include careful review and synthesis of existing evaluations. We view this as an optimal goal toward which the field should strive in its attempt to make confident decisions about the effectiveness of any particular curriculum.
The committee recommends that the following standard be used by agencies, curriculum developers, and evaluators:
For a curricular program to be designated scientifically established as effective, a collection of scientifically valid evaluation studies addressing its effectiveness should (1) establish that a curricular program and its implementation produce positive and curricularly valid outcomes for students, and (2) convincingly demonstrate that the positive outcomes are due to the curricular intervention. The collection of studies should employ a combination of the following methodologies, and meet the stated criteria:
(required) Content analyses by at least two qualified experts (a Ph.D.-level mathematical scientist and a Ph.D.-level mathematics educator), with identified credentials and statements of preference and bias, with due consideration of the systemic fit of the curricula under examination, explicitly addressing the dimensions identified in the content analysis chapter (Chapter 4). The findings from the content analyses should lead to conclusions of overall approval by the content analysts and include explanations by the curriculum authors concerning exceptions they take to the analysts’ reports.
(required) Comparative studies using experimental or quasi-experimental designs, identifying the comparative curriculum, and addressing the 10 criteria listed in the comparative studies chapter (Chapter 5). Each comparative study should produce findings that the experimental program produces results that meet or exceed those of a comparative program already designated scientifically established as effective, or document significant positive impact on curricularly valid outcomes and indicators of future student success, or exceed the results of a widely used program at a statistically and educationally significant level. Each comparative study should specify the level and type of generalization that can be drawn from it.
(highly desirable) One or more case studies to investigate the relationships among the implementation of the curricular program and the program components, as described in the case study chapter (Chapter 6). The case studies should provide documentation that the implementation and outcomes of the program are closely aligned and consistent with the curricular program components and add to the trustworthiness of implementation and the comprehensiveness and validity of the outcome measures.
(required) The final report of a program that is scientifically established as effective should link the analyses, specify what they convey about the effectiveness of the curriculum, and stipulate the extent to which the program’s effectiveness can be generalized, based on the sample populations studied and any other relevant contextual factors and conditions that limit the claims. This report should be made available to the public.
To ensure the independence and impartiality of summative evaluations, which are necessary to scientifically establish a program as effective, the committee makes the following overarching recommendation:
Summative evaluations should be conducted by independent evaluation teams with no membership by authors of the curriculum materials or persons under their supervision.
Consistent with the evaluation standards established by the Joint Committee on Standards for Educational Evaluation, we recognize that in addition to standards for scientific accuracy, evaluation designs need to take into account the needs and resources of stakeholders. This means that evaluation designs should also respond to the demands for utility, feasibility, and propriety. Evaluators must balance the need for scientific rigor with the need for attention to local contextual variations and stakeholders’ issues of utility, feasibility, timeliness, and propriety.
RECOMMENDED PRACTICES FOR THE CONDUCT OF EVALUATION STUDIES
In addition to the recommendations above, the committee identified a number of more specific concerns about the evaluation studies it reviewed.
To address these concerns, the committee recommends that individuals or teams charged with conducting curriculum evaluations should strive toward the following recommended practices.
In relation to implementation, we expressed concern that across all the studies there was disproportionately high representation of students and classrooms from suburban areas, with weaker representation from urban and rural schools. To address this concern, the committee recommends that:
Evaluations of curricular effectiveness should be conducted with students who represent the intended audience.
In addition, we noted that it is important that judgments of effectiveness be based on clear knowledge and documentation that the program under investigation was adequately and faithfully implemented. To this end, the committee recommends that:
Evaluations should present evidence that provides reliable and valid indicators of the extent, quality, and type of the implementation of the materials. At a minimum, there should be documentation of the extent of coverage of curricular material (what some investigators referred to as “opportunity to learn”) and of the extent and type of professional development provided.
The committee recognized the importance of even more specific information and encourages evaluators to seek methods to gather data on additional implementation components. Because of the expense and difficulty of such documentation, we encourage evaluators to at least address these issues through the use of carefully selected case studies. To this end, we recommend that:
Evaluators are advised to provide reports on other implementation factors. These additional factors could include reports on the assignment of students and differential impacts, instructional quality and type, the beliefs and understandings of teachers and students, documentation of formative or embedded assessments, time and resource allocations, and the influence of parents and interest groups.
In reviewing the evaluation studies, the committee concluded that across all the studies, there are some major problems with the measurement of
student outcomes. These problems make the determination of comparative curricular effectiveness very difficult and introduce potential confusion into the conclusions. These problems include:
A large and quite varied set of tests for the measurement of achievement, without a sensible and methodologically sound means to compare them;
Too many tests that rely exclusively on multiple-choice format, limiting the assessment of the cognitive levels of performance and neglecting the long-term development of student knowledge;
When tests are administered independent of the regular assessment activities, few means to gauge the level of student motivation to perform;
Lack of clear delineation of whether the measures of prior performance assess different content and skills, prerequisite skills, or the extent to which the current curricular material is already known, or other nonspecific factors of less obvious relevance to curricular effectiveness;
Reliance on a total test score of mathematics to make judgments, when such tests tend to be less sensitive to curriculum effects than subtest scores focused around very specific content such as fractions;
For longitudinal studies, lack of methodology to determine if variation in performance by subtopics, across school years, can be validly compared in relation to the psychometrics of the whole test-equating process; and
Lack of methodology on how to draw conclusions concerning the distribution of results across student groups, including by prior performance levels, to examine not only gains between subgroups or between comparative curricula, but to examine gains within subgroups using a particular curriculum.
The committee could not resolve this myriad of problems concerning the outcome measures used to assess curricular effectiveness. However, we did identify two issues that should be clearly distinguished and addressed in all studies. These were labeled “curricular validity of measures” and “curricular alignment to systemic factors.”
To determine effectiveness, outcome measures should be demonstrated to be sensitive to curricular changes. In addition, those measures should comprehensively sample the curricular objectives in the course, measure the content within those objectives validly, and ensure that teaching to the test (rather than the curriculum) is not feasible or likely. The committee used the term “curricular validity of measures” to refer to these requirements.
To address this concern, the committee recommends that:
At a minimum, one of the outcome measures used to determine curricular effectiveness should have demonstrated curricular validity.
Ensuring that curricular validity of measures is taken into account becomes complex in evaluations involving the comparison of multiple curricula. In such situations the committee decided that each curriculum examined should use, at a minimum, a set of items (which may be a subset of a test) that has curricular validity of measures. This implies that if a state test is not aligned to a curriculum, it cannot be used to determine that curriculum’s effectiveness.
In the context of No Child Left Behind, it should be clear that in order for programs to establish their credentials as effective or as “scientifically based,” they will need to show that they have selected outcome measures that demonstrate curricular validity. Accountability without curricular validity is hollow because it is possible to raise scores by teaching to the test, and thus deny students the opportunities to learn the breadth and depth of the entire curriculum. In addition, if measures only sample from the lower levels of the content, particularly at the high school level, the K-12 sector will not have adequate information on students’ preparation for advanced study. In our review of the evaluations of the curricula, our deliberations were hampered by the absence of adequately demonstrated curricular validity in outcome measures.
The committee also recognized the importance of the demonstration of evidence of curricular alignment necessary for school decision makers, and that these may be dependent on local contexts. Reports on outcome measures should identify how they connect to the national, state, and local contexts. We labeled consideration of these issues as those of “curricular alignment with systemic factors.” To this end, the committee recommends that:
Evaluations should, when possible and relevant, report on a curricular program’s alignment to systemic factors, particularly in relation to the local, state, or national mandatory tests or widely used tests having an impact on student opportunities and future activities.
Finally, the committee acknowledges the limitations in basing an evaluation of a complex, multifaceted curriculum on a single outcome measure. Thus, the committee recommends that:
Whenever feasible, multiple forms of student outcomes should be used to assess the effectiveness of a curricular program. Measures should consider persistence in course taking, drop-out or failure rates, as well as multiple measures of a variety of the cognitive skills and concepts associated with mathematics learning.
In Chapters 4 through 6, the committee summarizes the results of the review of the subsets of relevant studies. Our focus was on the methodologies of content analyses (Chapter 4), comparative studies (Chapter 5), and case and synthesis studies (Chapter 6). In this chapter, we synthesize our conclusions across these three areas and make recommendations about the most critical issues that need to be addressed to adequately position evaluations to determine curricular effectiveness.
The committee recognizes that there is little agreement about what the conduct of content analyses should include. Nevertheless, there were areas of agreement among evaluators across the content analyses, including the importance of ensuring that the materials were carefully sequenced, comprehensive, and correct. Most authors situated their analyses in the context of an identified set of standards, either at the state level or in reference to Principles and Standards for School Mathematics (NCTM, 2000). Content errors were reported, particularly in first editions, but all participants in the debates showed willingness and commitment to see these identified and fixed quickly and accurately.
In other areas, the committee found distinct differences in preferences and perspectives on content analyses, and was able to find a set of dimensions that seemed to capture those differences. For instance, the committee acknowledged that content analyses conducted a priori are useful and necessary. In addition, we recognized the value of content analysis studies conducted in situ to assess the feasibility of novel approaches prior to formal pilot studies or field testing. On paper, a curriculum may look comprehensive, correct, and orderly, but study of its practical consequences is necessary to ensure its feasibility, its incorporation of adequate levels of challenge and engagement, and its fit with typical or local resources. In order to assist the field in stabilizing this methodology, we outlined dimensions of content analysis and identified some of the key sources of debate.
In relation to content analyses, the committee recommends that:
Content analyses should be recognized as a form of connoisseurial assessment, and thus should be conducted by a variety of scholars, including mathematical scientists, mathematics educators, mathematics teachers, and other well-qualified individuals, who should identify their qualifications, their values concerning mathematical priorities, and potential sources of bias in their execution of content analyses.
Furthermore, the committee recommends that:
A content analysis should clearly indicate the extent to which it addresses the following three dimensions:
Clarity, comprehensiveness, accuracy, depth of mathematical inquiry and mathematical reasoning, organization, and balance (disciplinary perspectives).
Engagement, timeliness and support for diversity, and assessment (learner-oriented perspectives).
Pedagogy, resources, and professional development (teacher- and resource-oriented perspectives).
The committee examined 95 comparative studies. Nationally, there is a difference of opinion as to whether anything can be learned from a corpus of studies that collectively exhibit a variety of methodological flaws. The committee took the position that much can be learned through a careful and rigorous examination of the current database, provided those studies meet criteria for studies identified as “at least minimally methodologically adequate.” These criteria required that studies:
Include quantifiably measurable outcomes such as test scores, responses to specified cognitive tasks of mathematical reasoning, performance evaluations, grades, and subsequent course taking; and
Provide adequate information to judge the comparability of samples.
In addition, a study must have included at least one of the following additional design elements:
A report of implementation fidelity or professional development activity;
Results disaggregated by content strands or by performance by student subgroups; and/or
Multiple outcome measures or precise theoretical analysis of a measured construct, such as number sense, proof, or proportional reasoning.
A set of 63 studies met these criteria and were closely examined for the lessons they could offer on the conduct of future comparative studies of curricular effectiveness. We conducted this review by studying this set in relation to the seven “critical decision points” identified in our framework (Chapter 5). We then examined the pattern of results among these 63 studies by program category (NSF-supported, University of Chicago School Mathematics Project [UCSMP], and commercially generated) and subjected
these results to a process of filtering to see what standards of rigor seemed to influence them most. Finally, we conducted a more thorough review of the studies in relation to what they revealed about analysis by content strand, by equity, and by the interactions among content strand, equity, and grade band. We concluded that comparative studies need to attend most closely to the following three factors:
More rigorous design;
More precise measures of content-strand outcomes, especially in relation to curricular validity of measures;
Careful sampling of representative groups and examination of outcomes by student subgroups.
The committee recommends that comparative study design should attend specifically to at least the following 10 critical decision points and document how they are addressed in individual studies:
More pure experimental studies should be conducted, thus ensuring a better balance of experimental and quasi-experimental studies.
In quasi-experimental designs, it is necessary to establish comparability by matching samples or making statistical adjustments, using factors such as prior achievement measures, teacher effects, ethnicity, gender, and socioeconomic status (SES). Other factors in need of such consideration are implementation components, as recommended previously.
The selection of the correct unit of analysis is critical to the design of comparative studies, in order to establish independence of observations for tests of significance. Increasingly sophisticated means of conducting studies should be employed to take into account the level of the educational system at which experimentation occurs.
Gathering data on implementation fidelity is essential for evaluators to gauge the adequacy of implementation. Studies could also include nested designs to support analysis of variation by implementation components.
Outcome data should include a variety of measures of the highest quality. These measures should vary by question type (open ended, multiple choice), by type of test (international, national, and local), and by relation of testing to everyday practice (formative, summative, high stakes); they should also have demonstrated curricular validity and alignment with systemic factors. Tests should also be structured to permit disaggregation of results at the level of major content strands.
In planning data analyses, careful consideration should be given to the choice of appropriate statistical tests and their interpretation, including the use of increasingly available sophisticated methods for examining complex data, such as hierarchical linear modeling.
Reports should include clear statements of the limitations to generalization of the study. These should include indications of limitations in populations sampled, sample size, unique population inclusions or exclusions, and levels of use or attrition. Data should also be disaggregated by gender, race/ethnicity, SES, and performance levels to permit readers to see comparative gains across subgroups, both between and within studies.
Effect sizes should be reported.
Careful attention should also be given to the selection of samples of populations for participation. These samples should be representative of the populations to which one wants to generalize the results.
The control group should use an identified comparative curriculum or curricula to avoid comparisons to unstructured instruction.
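The recommendation above that effect sizes be reported can be made concrete with a short sketch. The helper below computes Cohen's d using a pooled standard deviation, one common standardized-mean-difference estimator; the report does not prescribe a particular estimator, and the test scores shown are hypothetical.

```python
import math

def cohens_d(treatment, comparison):
    """Cohen's d: standardized mean difference using the pooled SD."""
    n1, n2 = len(treatment), len(comparison)
    m1 = sum(treatment) / n1
    m2 = sum(comparison) / n2
    # Unbiased sample variances (denominator n - 1).
    v1 = sum((x - m1) ** 2 for x in treatment) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in comparison) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical end-of-year test scores for two curriculum groups.
new_curriculum = [72, 78, 81, 69, 75, 80]
comparison_group = [70, 74, 68, 72, 71, 73]
print(round(cohens_d(new_curriculum, comparison_group), 2))
```

Reporting the effect size alongside significance tests lets readers judge the practical magnitude of a difference, not merely whether it is statistically detectable.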
For the purpose of examining the effect of different methodological variations on the results of the evaluations, the committee coded all outcomes of the comparative studies, by program type, as positive (statistically significantly stronger than the comparison program), negative (statistically significantly weaker than the comparison program), or showing no significant difference between the two groups (see Table 5-8). We then subjected these results to filter analysis using the seven critical decision points.
Overall, the filtering results suggest that increased rigor generally leads to less strong outcomes, but never to reports of completely contrary results. These results also suggest that, in recommending design considerations to evaluators, careful attention should be paid to including measures of treatment fidelity; considering the impact on all students as well as on particular subgroups; using the correct unit of analysis; and using multiple tests that are disaggregated by content strand.
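The coding-and-filtering procedure described above can be sketched in a few lines. The study records below are invented for illustration only (they are not the committee's data from Table 5-8), and implementation fidelity stands in for just one of the seven filters; the sketch shows only the mechanics of comparing outcome distributions before and after a rigor filter is applied.

```python
from collections import Counter

# Each record: program type, coded outcome (+, 0, -), and whether the
# study documented implementation fidelity (one example rigor filter).
studies = [
    {"program": "NSF", "outcome": "+", "fidelity": True},
    {"program": "NSF", "outcome": "+", "fidelity": False},
    {"program": "NSF", "outcome": "0", "fidelity": True},
    {"program": "commercial", "outcome": "0", "fidelity": True},
    {"program": "commercial", "outcome": "-", "fidelity": False},
    {"program": "UCSMP", "outcome": "+", "fidelity": True},
]

def tally(records):
    """Count coded outcomes (+ / 0 / -) for a set of study records."""
    return Counter(r["outcome"] for r in records)

all_outcomes = tally(studies)
filtered = tally(r for r in studies if r["fidelity"])

print(dict(all_outcomes))  # distribution before filtering
print(dict(filtered))      # distribution after the fidelity filter
```

Comparing the two tallies shows whether the pattern of results holds up when only the more rigorous studies are retained, which is the essence of the filter analysis.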
The committee recognizes the value of diverse curricular options and finds continuing experimentation in curriculum development to be essential, especially in light of changes in the conduct and use of mathematics and technology. However, it should be accompanied by rigorous efforts to improve our conduct of comparative studies, strengthening the results by learning from previous efforts.
The committee reviewed a set of 45 case studies and selected 32 of these, based on a set of criteria, for intensive review. We saw an important
role for case studies, particularly in articulating the mechanisms behind effects. In particular, the committee recommends that:
Case studies should stipulate clearly what they are cases of, how claims are produced and backed by evidence, and what events are related or left out and why, and should identify explicit underlying mechanisms to explain a rich variety of research evidence.
It is worth noting that case studies often reveal aspects of program components, implementation components, and interactions among these two that behave differently than intended by program designers, and therefore provide essential insights into curricular effectiveness. This is one reason why case studies are a valuable tool in an evaluator’s methodological toolkit. The committee emphasizes that a case study should be conducted as rigorously as any other form of study.
Moreover, the committee believes that if program evaluations systematically included explanatory variables in their study of curriculum effectiveness, the gap between research and evaluation would be largely erased. Thus evaluation studies would become far more valuable to the educational field. In addition, the inclusion of explanatory variables would give program adopters more precise information about whether the conditions for effectiveness demanded by a particular curriculum coincide with their own local conditions, commitments, and resources. Evaluation studies would thus be a valuable resource for stakeholders and researchers.
RECOMMENDATIONS TO FEDERAL AND STATE AGENCIES AND PUBLISHERS
Evaluation studies should be undertaken by a variety of scholars with expertise in the following fields: mathematics, mathematics education research, curriculum development, evaluation, statistics, and measurement. These scholars should design and implement the many facets of the evaluation review, working together as a team with regular consultation from stakeholders, including designers, publishers, teachers, administrators, students, and community members. It is preferable that none of these scholars be closely affiliated with the mathematics curriculum materials under review.
The committee recommends that:
Major efforts should be made by federal agencies to improve the nation’s capacity in mathematics curriculum evaluation. Individuals or teams charged with curriculum evaluation should show evidence of understanding the interdisciplinary nature of the task, and involve mathematics educators, mathematicians, measurement specialists, evaluators, and practitioners.
The committee was asked to review the 13 NSF-supported curricula and 6 sets of commercially generated curricula. We note that there was considerable variation in the type and extent of evaluation material provided across these 19 curricula. The database of evaluations for the NSF-supported curricula and for UCSMP greatly exceeded the database for the commercially generated materials in quantity and quality. In establishing a stronger knowledge base for evaluation, it is essential that responsibility for curricular evaluation be shared among three primary bodies: federal agencies developing curricula, publishers, and state and local districts and schools implementing curricula. The committee believes that the typically modest role of districts and schools in evaluation should become more rigorous and significant if we are to require that curricular excellence become the norm in our decentralized system of education. Our review of district-level data was limited by lack of access to such information and minimal means of quality assurance. Furthermore, district and school personnel could benefit from improved data to help determine how and where to focus professional development, respond to local resources and needs, and inform parents of both professional choices and the reasons for those choices. In some instances, an effort to provide assistance in building local capacity to use and interpret evaluation results may be advised. For each of these bodies (publishers, federal, state, and local), the committee made recommendations regarding the conduct of future evaluations. At the federal level, the committee recommends that:
Calls for proposals by federal agencies should include more explicit expectations for evaluation of curricular initiatives and increasing sophistication in methodological choices and quality. No federal agency should provide continued funding for major curricular programs that fail to present evaluation data from well-designed, scientifically valid studies.
Furthermore, the committee recommends that:
A federal agency, such as the National Center for Education Statistics, should develop a program for district- and school-level data collection and maintenance on issues of curricular implementation.
The committee solicited testimony from publishers, who expressed clear willingness to receive guidance on the conduct of evaluation. Some led the way in articulating evaluation methods that could guide comparative review as well as formative evaluations guiding new editions. Those publishers with innovative or unique approaches tended to argue most vigorously for innovative methodologies and were more likely to offer insights into practices needing overhaul, such as methods of adoption that succumb to financial interest more than they respond to reasoned inquiry
and sound knowledge bases. Some publishers failed to submit materials for review or any evaluations, and as reported previously, there were many more reviews available for NSF-supported materials. However, some publishers showed weak understanding of the distinctions between market research and scientific research on effectiveness, reporting surveys of teacher preference as methods of curricular evaluation. As a result, the committee recommends that:
Publishers should (1) differentiate market research from scientifically valid evaluation studies and (2) make such evaluation data available to potential clients who use federal funds to purchase curriculum materials.
Districts and schools are the most likely sources of accurate longitudinal data, a critical element in tracking student performance. Local districts and schools should improve their methods of documenting curricular use and linking it to student outcomes. Districts and schools should keep more careful records of teachers’ professional development activities related to curricula and content learning, and should systematically ensure that all students have fair opportunities to learn, especially under conditions of mobility. Finally, districts and schools can contribute a great deal to the discussion and debate on the impact of accountability systems and their relationship to curricular validity and implementation. To this end, the committee recommends that:
The federal Department of Education, in concert with state education agencies, should undertake an initiative to provide local and district decision makers with training on how to conduct and interpret valid studies of curricular effectiveness.
In addition, the committee recognized that in order to conduct more secure and reliable evaluations, additional basic research is needed in a number of emerging areas pertinent to curricular effectiveness, as discussed in the framework. For example, during the review of content analyses a number of targets of controversy surfaced, including:
The breadth of topics across years, and the extent of integrated, multidisciplinary, and/or sequential treatment of subfields such as numeration, geometry, algebraic reasoning, probability and statistics, and discrete mathematics;
The relative emphasis on numeration, symbol manipulation skills, and computation, and on related conceptual development;
The value, purpose, and use of contextual problems, modeling approaches, and quantitative literacy activities;
The emphasis on analytic/symbolic, visual, or numeric approaches;
The reliance on technology as a tool, and the place of manipulatives;
The importance and effectiveness of student methods and group work;
The role of practice and item sequencing;
The role of the teacher in relation to exposition and coaching; and
The role of different forms of assessment in student learning and achievement.
Thus the committee recommends that:
The federal government and/or publishers should conduct multidisciplinary basic empirical research studies on, but not limited to:
The interplay among curricular implementation, professional development, and the forms of support and professional interaction among teachers and administrators at the school level;
Methods of observing and documenting the type and quality of instruction;
Teacher learning from curriculum materials and implementation;
The development of outcome measures at the upper level of secondary education and at the elementary level in non-numeration topics that are valid and precise at the topic level;
Methods of parent and community education and involvement; and
Targets of curricular content controversy such as the appropriate uses of technology; the relative use of analytic, visual, and numeric approaches; or the integration or segregation of the treatment of subfields, such as algebra, geometry, statistics, and others.
Although the committee chose not to recommend the proposed second phase of the review of evaluations, it proposes instead that the nation become much more serious and realistic about what is needed to strengthen our knowledge base on curricular effectiveness in mathematics. We have amassed the relevant studies and classified and summarized them. From a subset of these, we have drawn inferences about the conduct of future evaluations. We have proposed a framework for the subsequent conduct of that work, and argued for the need to engage in intensive and inclusive discussions about how to proceed in directions most likely to succeed. We have called for increased attention to program theory and implementation measures in program evaluation, for improvement in the curricular validity of outcome measures, for improved use of experimental and quasi-experimental research design and coordination of multiple methodologies, and for a coalition of the federal government, districts and schools, and the
commercial sector to build capacity to undertake these studies of mathematics curricular effectiveness.
The committee recognizes the complexity and urgency of the challenge, and argues that we should avoid seemingly attractive but oversimplified solutions. Although the corpus of evaluations is not sufficient to directly resolve the debates on curricular effectiveness, we believe that the controversy surrounding mathematics curriculum evaluation presents an opportunity: to forge solutions through negotiation of perspective, to base our arguments on empirical data informed by theoretical clarity, and to build in a critical degree of coherence that is often missing from curricular choice, namely, feedback from careful, valid, and rigorous study. Our intention in presenting this report is to help the nation take advantage of this opportunity.