4 Some International Examples

The primary focus of the steering committee's efforts was to find examples of the many forms that an assessment program built around improving learning can take. The committee looked at programs in seven states, several international examples, and three programs developed by researchers: the Berkeley Evaluation and Assessment Research assessment model, Facet-Based Assessment, and Model-Based Assessment. Presenters for each of these programs were asked to discuss not only the goals and characteristics of their programs, but also the ways in which the programs exemplify the criteria the committee had identified. They were also asked to talk about problems and obstacles they had encountered, as well as successes they believed they had achieved and methods of securing evidence of their results.

In this chapter, the examples from abroad are discussed. The notion of gaps between different elements and goals of the educational system may not have been as much on the minds of education officials in other countries, but the assessment systems in several countries nevertheless seem to have much to offer the discussion in the United States. Two different Australian systems, for example, offer interesting ways of thinking about alignment and coherence. Studies from Great Britain demonstrate a way teachers can use assessments to help students make progress in their learning, while the International Baccalaureate program shows the role that teachers can play in a widely dispersed system. The Organisation for Economic Co-operation and Development's (OECD) Programme for International Student Assessment (PISA) demonstrates one way in which diverse constituents can focus on the material that is most important to assess.
ASSESSMENT IN SUPPORT OF INSTRUCTION AND LEARNING

AUSTRALIA

At the national level, Australia has built a large-scale assessment on the basis of a preexisting framework, or "map of progress," that outlined the knowledge and skills students should develop. Geoff Masters, chief executive officer at the independent, nonprofit Australian Council for Educational Research (ACER), explained to workshop participants that the resulting system developed almost by happenstance, yet has many interconnected and mutually supporting parts.

The original framework took the form of a detailed matrix showing levels of competence in different aspects of each subject area. In English, for example, the first subject for which a framework was developed, descriptions of competence in reading, writing, listening, speaking, and viewing were developed. The framework describes eight different levels of competency for each skill and is designed to cover the years of compulsory schooling.

ACER recognized that teachers needed some guidance in monitoring student progress along the framework. On its own initiative, ACER developed an assessment resource for teachers that they could use in making their own assessments of how children were progressing in terms of the framework. The resource kits, which were sold to schools around the country, included activities and materials and a range of assessment methods to be used individually and with groups of students.

When the national government later decided to conduct a national survey of primary children's literacy skills, to obtain data similar to that provided in the United States by the National Assessment of Educational Progress, ACER submitted a proposal to develop an assessment based on the model they had already devised for the teachers' assessment kits. Government officials agreed to adopt the ACER model, thus establishing a national assessment system that relied on teachers to conduct and score the assessments.
A number of means of ensuring consistency and fairness were built into the system. First, as with the original resource kits, the assessment supplied guidelines and scoring rubrics. A group of experienced external assessors trained and monitored teachers in the use of the assessment methods. These assessors also visited schools and monitored a subset of the assessments as they were conducted. Second, all the student work generated for assessment purposes was collected for further monitoring at a central office in Melbourne. The work was sampled and, where discrepancies were found, rescored.

The initial assessment was successful, yielding results for nearly 9,000 students in the third and fifth grades. Student performance was shown in terms of their progress along the matrix; the relative performance of socioeconomic subgroups was also shown.

Unfortunately, as Masters explained, the national government was surprised in the end to find no indication of how many students had "passed." Since the assessment was designed only to show how far groups of students had progressed through the stages identified in the matrix, no cutpoints had been identified for either grade. However, ACER was able to go back and conduct a standard-setting exercise to determine what minimum level of competency in reading and writing should be expected at each grade. Pass rates could then be determined retroactively, and although the results turned out to be controversial, the exercise demonstrated the adaptability of the assessment system for the accountability purposes that are particularly important to policy makers and politicians.

QUEENSLAND

Richard Shavelson, professor of education and psychology at Stanford University, described for the audience the somewhat different situation in the Australian state of Queensland, whose system he has studied. There the state had for many years relied on a set of "A-level" examinations prepared by the University of Queensland, similar to those used in Great Britain, both to determine how well students were prepared for college study in different subjects and as an element in the college selection process. In 1970-1971, concern began to mount that the exams were too difficult and were the cause of an undesirable narrowing of the curriculum. Queensland decided to replace the A levels with formative assessments that would more directly address students' needs, and then to build on those to obtain summative information about student performance that would be of value beyond the classroom.

In essence, as Shavelson explained, Queensland officials decided to develop "a system for auditing the local implementation of curriculum and assessment and accountability." Teachers and local schools are responsible for both curriculum and assessment and their work is monitored to ensure that it is consistent across the state and meets standards for quality.
An infrastructure was set up to accomplish the monitoring. It includes a Board of Senior Secondary Studies, which sets the syllabi (the essential goals for content, cognitive skills, and domain-specific skills for each subject) and the general methods for conducting assessments. The board is also responsible for moderation of scores, a process by which teachers' scores are calibrated with one another to achieve consistency across classes and schools. Below this board, a series of district-level content panels in each of the A-level subjects provides more direct support to schools and teachers. Each school is then free to develop its own two-year, A-level curriculum in each subject, as well as a culminating exam. The exams are scored according to a Queensland-wide, five-point, domain-referenced scale, and moderated.

Thus, schools and teachers are given a considerable amount of both direction and latitude. Throughout the two A-level years they use both formative and summative assessments, based on guidelines provided by Queensland (and students are always aware of the purpose of a particular assessment), to help students understand in detail the expectations they are striving to meet. To Shavelson, the key to the system's apparent success over thirty years is the very close link made between the curriculum and the content of the assessments.

To American eyes, one striking aspect of both Australia's national assessment system and the Queensland model is the degree to which each, in its way, accords significant value to the judgments of teachers about their students. In these systems, teachers have many different opportunities for training and development to improve the knowledge and skills they need to play a key role in the assessment program. They can become involved in development and scoring of assessments (as are many of their counterparts in the United States), and receive the trust necessary to develop evaluative assessments of students on their own.

GREAT BRITAIN: ENHANCED FORMATIVE ASSESSMENT

Dylan Wiliam of King's College, London, described efforts in Great Britain to focus closely on the ways teachers can use assessments to help students make progress in their learning. He began by describing an overview of approximately 250 studies that explored the effectiveness of formative classroom assessment (also sometimes called assessment for learning) in which clear evidence of a positive effect on learning was found. Specifically, Wiliam explained, when teachers provide students with clear feedback that gives them guidance on the steps they need to take to improve, students progress at a greater rate than they do in response to other kinds of feedback.
Wiliam also described a study in which a group of twenty-four mathematics and science teachers were asked to develop their use of formative assessment with one class in several specific ways: by making greater use of higher-order questioning, providing task-involving rather than ego-involving feedback, developing the use of peer- and self-assessment strategies, and exploring the use of summative tests for formative purposes (Black, Harrison, Lee, Marshall, and Wiliam, 2002). For each class, the local class that could best be used as a control was identified so that any improvements in learning could potentially be measured, and in this study as well evidence of a positive effect was found.

While the methods sound simple (allowing a longer wait time while students consider how to answer a question, for example), Wiliam stressed the importance not of the methods themselves, but of the insights into how students learn that led to them. The idea, he explained, is to initiate students into a culture of learning in which they not only take responsibility for their learning but are supported in the steps they need to take to progress. At the same time, teachers' capacity to make useful inferences about their students is enhanced, just as their opportunities to use these inferences are increased (Black et al., 2002).
THE INTERNATIONAL BACCALAUREATE (IB) DIPLOMA PROGRAMME

The International Baccalaureate (IB) Diploma Programme offered workshop participants an additional way to think about the role of teachers in assessment. George Pook, head of assessment for the International Baccalaureate Organisation, explained that the IB was developed to provide a common curriculum for students around the world, as well as a grading system that would be recognized and understood by colleges and universities around the world. Thus, consistency is very important to the success of the program, but at the same time there is a need to entrust considerable responsibility to widely dispersed schools and teachers.

The IB uses a variety of assessment strategies for summative purposes. For example, students must complete an extended essay on a topic of their own choosing at the end of the program, which is scored centrally. Examinations may include tasks ranging from multiple-choice questions to full-length essays, as appropriate for each subject. Oral presentations are also required in language subjects, and these are scored by teachers using criteria supplied by the IB program. All of the results are reported in terms of a seven-point scale that is linked to defined levels of performance that program administrators try to keep consistent from year to year as well as across participating schools around the world, who of course work in different languages. The points on the scales describe content and skills, and the scoring is intended only to indicate how well students have mastered them, not to spread students out for comparative purposes.

Internal, teacher-generated assessments play a significant role in the program for both formative and summative purposes. Teacher-generated assessments address a different range of subject matter and skills than the IB-generated assessments do.
The two types are intended to complement one another in creating an overall measure of a student's achievement.

Teachers' ongoing formative assessments are viewed as opportunities for students to see how they are progressing along the criteria defined in the seven-point scale. Released test questions, rubrics, and student work are all used to provide this feedback. Many IB teachers serve as external assessors for other schools, and also have opportunities to review and revise the curricula in their disciplines. All IB teachers receive support in the form of resource materials, workshops, and an online curriculum center. Moderators are available to give teachers feedback on their internal assessment methods, as well as their assignments and their grading.

PROGRAMME FOR INTERNATIONAL STUDENT ASSESSMENT (PISA)

The OECD, which was formed as part of the Marshall Plan after World War II, is composed of thirty nations, all of which are democratic market economies. As Barry McGaw, director for education at OECD, explained at the workshop, a primary function of the OECD is to collect data in a number of policy areas, and in the late 1980s the organization began a process of upgrading its statistical work in education, with the particular goal of ensuring that the data used to represent national systems become more comparable. While the OECD had been using data regarding educational outcomes supplied by the International Association for the Evaluation of Educational Achievement for a number of years, it began to gather data of its own in the mid-1990s through PISA. The focus is on summative data that can be used to make useful comparisons among the member nations.

The primary initial goal for PISA was, as McGaw explained, "to estimate the yield of national education systems," and he acknowledged that this is a grand ambition. Yield is an economic concept not generally used in the study of education, but it led the developers of PISA to focus on what students can do with what they have learned, and thus avoid the difficulty of identifying the material that had been covered in common across many countries. Thus PISA assesses the "literacy" of fifteen-year-olds in reading, mathematics, and science. The assessments use a variety of measurement approaches (multiple-choice questions as well as open-ended short questions and written pieces), but they are not intended to be used for formative, classroom purposes.

McGaw provided some examples of the kinds of questions that can be considered using PISA data, using tables and graphs, for example, to show how the member countries vary in terms of the balance they achieve between equity and quality. He also showed graphically that countries vary considerably in terms both of how much spread they have between their lowest and highest performing students, and also in terms of how much of that spread occurs within schools and how much occurs across schools.
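The split between within-school and across-school spread described above can be made concrete with a simple variance decomposition. The Python sketch below is illustrative only: the function name `decompose_variance`, the data layout, and the sample scores are assumptions introduced here, and it ignores the sampling weights and more elaborate methods that a real international survey would require.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def decompose_variance(scores_by_school):
    """Split total score variance into across-school and within-school parts.

    scores_by_school maps a school label to a list of student scores.
    Returns (between, within, total) as population variances, where
    total == between + within.
    """
    all_scores = [s for scores in scores_by_school.values() for s in scores]
    n = len(all_scores)
    grand_mean = mean(all_scores)

    between = 0.0  # spread of school means around the overall mean
    within = 0.0   # spread of students around their own school's mean
    for scores in scores_by_school.values():
        school_mean = mean(scores)
        between += len(scores) * (school_mean - grand_mean) ** 2
        within += sum((s - school_mean) ** 2 for s in scores)

    between /= n
    within /= n
    return between, within, between + within


# Made-up scores for two hypothetical schools.
example = {"School A": [400, 500], "School B": [500, 600]}
between, within, total = decompose_variance(example)
print(f"share of spread across schools: {between / total:.0%}")
```

A system in which most of the spread lies across schools would show a large `between` share, while a system in which schools resemble one another would show most of its spread in `within`.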
Probing that question even deeper, he presented a table that broke down the variation that occurs across schools according to whether it was intended (that is, the result of deliberate tracking of students into academic or vocational programs, for instance) or unintended.

Data such as these, McGaw explained, are very useful for helping countries see that there are alternatives to the way they are structuring their education systems. South Korea, for example, has been remarkably successful at achieving both high quality and high equity; it has the lowest degree of spread between high- and low-performing students, while overall performance is high.

Although PISA does not fit particularly well with the criteria laid out by the committee, McGaw noted that it does offer formative possibilities in a system context. Denmark, he noted, has found that though it spends among the largest amounts per student, its average student performance figures are quite low. As a consequence, the ministry of education is working to make the system operate more efficiently and improve student performance. Doing so, of course, implies that it is confident that the constructs measured by PISA are genuinely important, even though they are not directly linked to the curriculum taught in Denmark or any other country.
It is in this sense that PISA's experience might be most useful to educators looking for ways to bridge the gaps. The process of developing PISA was an extensive effort to build a framework that defined a reasonable set of expectations for fifteen-year-olds in each of the domains. International groups drew on assessments from around the world and worked through cultural and language differences to come up with two versions of the test, one in English and one in French, that represented their best effort to assess what is really important for fifteen-year-olds to be able to do. McGaw suggested that any concepts that got past the double translations and other reviews, field tests, differential item functioning (DIF) analyses,¹ and other screens were likely to be truly key concepts. He does not believe that PISA focuses mostly on what is easy to assess, rather than what is important, and does believe that it assesses understanding and reasoning, not factual recall.

¹ DIF analyses flag test questions that perform differently for a particular subgroup of test takers than for the group as a whole. Thus, for example, if students in one country, or those who are native speakers of a particular language, have difficulty with a question for cultural reasons rather than because of their skill with its content, it can be identified so that it need not count against them.
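The footnote's description of DIF analysis can be sketched with one widely used technique, the Mantel-Haenszel common odds ratio, which compares two groups' odds of answering a question correctly after matching test takers on overall score. The text does not say which DIF method PISA used, so the choice of technique, the function names, the group labels "ref" and "focal", the flagging threshold, and the sample data below are all illustrative assumptions.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Return the Mantel-Haenszel common odds ratio for one test question.

    records is an iterable of (group, total_score, correct) tuples, where
    group is "ref" or "focal" (hypothetical labels), total_score is the
    matching variable, and correct is True or False for the question.
    """
    # One 2x2 table per score stratum:
    # [ref correct, ref wrong, focal correct, focal wrong]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, total_score, correct in records:
        index = (0 if group == "ref" else 2) + (0 if correct else 1)
        tables[total_score][index] += 1

    numerator = denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    # A ratio near 1.0 means the matched groups have similar odds of success.
    return numerator / denominator if denominator else float("inf")


def flag_item(records, threshold=2.0):
    """Flag the question when the odds ratio is far from 1 in either direction.

    The threshold is an arbitrary illustrative value, not an official cutoff.
    """
    ratio = mantel_haenszel_odds_ratio(records)
    return ratio > threshold or ratio < 1 / threshold


def expand(group, total_score, n_correct, n_wrong):
    """Build made-up records for the example."""
    return ([(group, total_score, True)] * n_correct
            + [(group, total_score, False)] * n_wrong)


# Made-up data: at each overall score level, the focal group answers this
# question correctly less often than the reference group does.
sample = (expand("ref", 10, 8, 2) + expand("focal", 10, 4, 6)
          + expand("ref", 20, 9, 1) + expand("focal", 20, 6, 4))
print(mantel_haenszel_odds_ratio(sample), flag_item(sample))
```

Matching on overall score is what separates DIF from a plain difference in group averages: a question is flagged only when test takers of comparable overall skill perform differently on it.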