3
Adapting Achievement Tests into Multiple Languages for International Assessments

Ronald K. Hambleton*

International comparative studies of school achievement can provide (1) valuable data for educational policy makers about the quality of education in their countries, (2) possible explanations for the findings, and (3) suggestions for how improvements in achievement might be accomplished. But, as with any research study, valid conclusions and recommendations from a comparative study of educational achievement can only follow when the research methodology for the study is sound, and the data collection design has been implemented correctly.

International studies of achievement such as the Third International Mathematics and Science Study (TIMSS) and the Organization for Economic Cooperation and Development’s Program for International Student Assessment (OECD/PISA) are research studies that are particularly difficult to implement well because of special methodological problems. Three of the problem areas are (1) reaching agreement on the variables to measure and the definitions of constructs, (2) choosing nationally representative samples, and (3) standardizing test administration conditions (including matching motivational levels of the students taking the test in participating countries). A fourth methodological problem, which is often given far less attention than it deserves, is the translation and adaptation of test instruments, scoring protocols, and related questionnaires. Unless

* Ronald Hambleton is a Distinguished University Professor and Chair of the Laboratory of Psychometric and Evaluative Research at the University of Massachusetts at Amherst.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




Methodological Advances in Cross-National Surveys of Educational Achievement

the translations (or, more correctly, test adaptations) are carried out well, which usually means a combination of careful translation and review, field testing of the adapted tests, and the compilation of validity evidence, the results of an international comparative study such as TIMSS or OECD/PISA may be confounded by the consequences of poorly translated assessment materials. To the extent that a test adaptation changes the psychological meaning and/or test difficulty in the target languages or cultures, comparisons of student performance across language and cultural groups may have limited validity.

Poorly translated assessment materials can have many consequences. Awkward or improper translations may make the test instruments easier or harder for students in some countries. In one recent international assessment, it was learned through self-report that test translators in one country had simplified the language in the mathematics assessment by one grade level to make it more understandable to students. The reading difficulty had been removed from the mathematics items to place the focus of those items on the assessment of mathematics skills only. The consequence was that the test items were easier in this country than they would have been had the reading difficulty not been removed, and cross-national comparisons of mathematics achievement for the country involved were no longer meaningful.

Just plain bad translations, too, may make the test instrument totally invalid. Literal translations are usually problematic. Apparently “out of sight, out of mind” was literally translated as “invisible, insane” in one translation between English and French. This humorous example is cited often among test translators, though the original source is unknown.

Poor translations go beyond simply the language aspects of the test and test directions.
For example, the multiple-choice format may be less familiar to students in some parts of the world; Africa is one location, and China is another. Sometimes the problem of differential familiarity of item formats across countries participating in an international comparative study is handled by using multiple item formats. The idea seems to be to balance item format familiarity (or unfamiliarity) across participating countries.

Test length also may be a problem. In some countries, tests may be relatively short, so a longer test used in an international comparative study may produce fatigue that can affect test performance. This problem arose in a U.S.–China comparison of mathematics achievement (Hambleton, Yu, & Slater, 1999) and was a potential source of invalidity in the comparison of results.

Clearly, then, for international assessments where great importance is given to the results, considerable care must be given to the translation of assessment materials. The goals of this chapter are (1) to describe some of the major myths about test adaptations, (2) to describe nine steps for adapting tests that
follow from the International Test Commission Guidelines for Translating and Adapting Tests, and (3) to provide several examples of good and bad test adaptation practices from the recent TIMSS and OECD/PISA projects. Recommendations for adapting assessments will be offered later in this chapter.

First, two points need to be made. Test adaptation research is not limited to international studies of educational achievement. Popular intelligence, aptitude, and personality tests have been adapted for years, some of the most popular into more than 50 languages. Quality-of-life measures used in medical research are being widely adapted and used around the world (Hambleton & Patsula, 1998). Projects like TIMSS and OECD/PISA are two of many international studies of achievement. Many more studies are underway, including a major study of aging in Europe that is assessing many cognitive variables and involves thousands of participants and more than 10 languages (Schroots, Fernandez-Ballesteros, & Rudinger, 1999). Even in the United States, there are Spanish versions of the Scholastic Aptitude Test (SAT), several state assessments, the GED, and many school-district-level achievement tests. Credentialing exams delivered by Microsoft, Novell, and other companies in the information technology field are being given to more than two million candidates a year in over 20 languages, and the numbers of candidates and languages have been increasing exponentially. Clearly the amount of testing in multiple languages is substantial and growing.

Second, the term “test adaptation” is preferred today to the term “test translation” by researchers working in this field (see Hambleton, Merenda, & Spielberger, in press). The former term is more indicative of the process that actually takes place in making a test produced in one language available and validated for use in another.
Test translation is only part of the process. Decisions must be made about how to preserve the psychological equivalence of a test in two or more language and cultural groups. Format modifications may be necessary, and directions may need to be revised. For example, in one recent study, it was necessary to ask Chinese students to “check their answers” rather than “fill in the bubbles” that appeared on the American version of the test, and to change the placement of the artwork in the Chinese version of the test (Hambleton, Yu, & Slater, 1999).

Radical changes may be needed to make the item formats suitable. For example, the incomplete-sentence format (with or without answer choices) causes a major problem in countries such as Turkey, where the object of a sentence often appears at the beginning. In this situation, the blank (or the answer choices) to be completed (or selected) by candidates appears prior to the portion of the sentence that defines the question for the candidates. One can certainly wonder about the impact of this shift in the order of presentation of test material on the difficulty of
the question. Finally, analogy questions almost never work in an adapted version of a test because it is nearly impossible to find words that have exactly the same meaning in two languages and with the same level of familiarity (Allalouf, Hambleton, & Sireci, 1999).

Another example concerns verb forms. When the passive voice appears in passages, it must be changed in translations because it does not exist in all languages. Here is another example of a translation problem: the word “you” in English is both singular and plural. In some languages, such as Turkish, two possible words can be used in place of “you.” The first is singular and informal. The second is formal and polite, and can have either a singular or plural meaning.

Michal Beller from Israel (personal communication) talked about the richness of language. In Hebrew, for example, different words are used for the English word “picking” in expressions such as “picking grapes,” “picking olives,” and so on. Also, there is only one word for “camel” in English and in Hebrew; in Arabic, there are numerous words for camel to distinguish different types. Giray Berberoglu from Turkey (personal communication) talked about his difficulty in finding equivalent meanings for expressions like “cold fish” and “bleeding heart” and translating expressions such as “every cloud has a silver lining.” The list of changes required to make a test valid in multiple languages and cultures often goes well beyond the already difficult task of translating a test.

FIVE MYTHS ABOUT TEST ADAPTATIONS

Hambleton and Patsula (1999) described five myths about the test adaptation process; they are summarized in the following paragraphs.

Myth 1: The preferable strategy is always to adapt an existing test rather than develop a new test for a second language group.

There are good reasons for adapting a test, but there are also reasons for not proceeding with a test adaptation.
Especially when cross-cultural comparisons are not of interest, it may be substantially easier and more relevant to construct a new test for a second language group. This avoids any complications with copyright, ensures that an item format can be chosen that will be suitable for the intended population, and ensures that any desired modifications in the definition of the construct of interest can be made at the outset of the test development process.

Sometimes, too, it may be desirable not to adapt a test, but rather to require all examinees to take a test in a single language. For example, in the United States, there has been interest in some states in making high school graduation tests available in both English and Spanish. Technically this is possible, but the question of whether to make two language versions of a test available depends on many factors, including the definition of the construct being measured. Is the language in which performance is to be demonstrated a part of the construct definition or not? In the case of a reading test, reading in English is nearly always part of the construct of interest. Producing a Spanish-equivalent version of a reading test in English makes very little sense because inferences of English reading proficiency cannot be made from a test administered in Spanish. The situation with a mathematics test may be different. The construct of interest may be focused on computation skills, concepts, and problem-solving skills. Here, the purpose of the test is to look for a demonstration of the skills, and the language in which the performance is assessed and demonstrated may be of little or no interest. Of course, if the desired inference is the mastery of mathematics skills when the test questions are presented in English, then a Spanish version of the test would be inappropriate in this situation, too.

Myth 2: Anyone who knows two languages can produce an acceptable translation of a test.

This is one of the most troublesome myths because it results in unqualified persons adapting tests. There is considerable evidence suggesting that test translators need to be (1) familiar with both the source and target languages and cultures, (2) generally familiar with the construct being assessed, and (3) familiar with the principles of good test development practices. With the 1995 TIMSS, countries reported that finding qualified translators was one of their biggest problems (Hambleton & Berberoglu, 1997). How, for example, can a mathematics or science test be translated from English to Spanish without some technical knowledge?
Would a translator with little knowledge of test development principles know to keep answer choices of approximately the same length, so that length of answer choice does not become a clue to the correct answer? All too often in the cross-cultural literature, there is evidence of unqualified persons being involved in the test adaptation process. Professor Emeritus Ype Poortinga from the University of Tilburg in the Netherlands, who is a past editor of the Journal of Cross-Cultural Psychology and an internationally known cross-cultural psychologist, commented (personal communication) that he believed 75 percent of the research in cross-cultural psychology before 1990 was flawed because of the poor quality of test adaptations.
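The answer-choice-length principle mentioned above is easy to screen for mechanically during a translation review. The following sketch is not from the chapter: the function name, the 1.5 threshold, and the item texts are all invented for illustration, and a real review would still require human judgment.

```python
def flag_long_keys(items, ratio=1.5):
    """Flag items whose correct answer is more than `ratio` times longer
    than the average distractor, a potential clue for test-wise examinees.
    Each item is (correct_choice, [distractors]); the threshold is arbitrary."""
    flagged = []
    for number, (key, distractors) in enumerate(items, start=1):
        avg_len = sum(len(d) for d in distractors) / len(distractors)
        if len(key) > ratio * avg_len:
            flagged.append(number)
    return flagged

# Invented adapted items: answer-choice texts after translation.
items = [
    ("7", ["5", "6", "8"]),
    ("a rectangle with four equal sides and four right angles",
     ["a circle", "a line", "a point"]),
]
print(flag_long_keys(items))  # [2]
```

A check like this catches only one clue among many; grammatical cues and implausible distractors still need a reviewer's eye.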
Myth 3: A well-translated test guarantees that the test scores will be valid in a second language or culture for cross-language comparative purposes.

Van de Vijver and Poortinga (1997) make the point that not only should the meaning of a test be consistent across persons within a language group and culture, but that meaning, whatever it is, must also be consistent across language groups and cultures. For example, if a test is more speeded in a second-language version because of the nature of that language, then the two language versions of the test are not equally valid. We have encountered just such a problem in some German test translations: German words tend to be longer than English words and take correspondingly longer to read. The result is a slightly more speeded German version of the test. In this instance, the test may be equally valid in each language group and culture, but it still will not be suitable for cross-cultural comparisons, because the German version, given the same time limit as the English version, would be administered under slightly more speeded conditions.

Myth 4: Constructs are universal, and therefore all tests can be translated into other languages and cultures.

An excellent example of this myth is associated with intelligence tests. The construct of intelligence is known to exist in nearly all cultures. The Western notion of intelligence places emphasis on speed of response. In some non-Western cultures, on the other hand, speed of response is of minor importance as an operating principle, and members of these cultural groups often score lower on Western intelligence tests because of a failure to perform quickly (Lonner, 1990). But it is only by one of the Western definitions of the construct of intelligence that these cultural groups appear to be less intelligent.
Using a definition that does not place emphasis on speed of response, the results from a cross-cultural comparative study may be very different. See Gardner (1983) and Sternberg (1989) for well-known work on broadening and changing the definition of intelligence. Poortinga and van de Vijver (1991) describe numerous additional examples in which cross-cultural comparisons are flawed because of the nongeneralizability of construct definitions across cultures.

Another example would be the definition, say, of the content to cover on the OECD/PISA 15-year-old assessment of mathematics achievement. The American idea of the relevant content domain is likely to be different from that of other countries. Ultimately, a decision must be made about the breadth and depth of the content domain that is relevant for the assessment; this may place some countries at an advantage, and others at a
disadvantage. Content domains for achievement areas such as mathematics, science, and reading are hardly universal. They may not even be equally suitable across states or provinces within the same country.

Myth 5: Translators are capable of finding flaws in a test adaptation; field testing is not usually necessary.

The cross-national testing literature includes thousands of examples of poorly adapted test items. The fact is that translators are not able to anticipate all of the problems encountered by examinees taking a test in a second language. Field testing is as important for adapted tests in the target language as it is for tests produced in any language, and it should be an integral part of the test adaptation process.

This myth comes from the mistaken belief that a backward-translation design is sufficient to justify the use of a test in a second language. In this design, a test is forward translated into the target language, then back translated into the original or source language. The original and back-translated versions of the test are compared, and if they are found comparable, the assumption is made that the target-language version of the test is acceptable. But many concepts can be translated into another language and back again without being understood in the target language. For example, passages about snow, ice, and cold weather may not be meaningful in warm-weather countries, and this fact would not be identified in a back-translation design. The material itself may be translated and back translated easily, but the psychological meaning of the material may be very different in the two language versions of the test.

Jeanrie and Bertrand (1999) describe another example of a poor translation that was not caught by the translators; this item has appeared in a French translation of a well-known English personality test for many years.
In the English version, the expression was “Generally, I prefer to be by myself.” In the French version, the sentence was translated as “Generally, I prefer to be myself.” The meaning is quite different, yet candidate responses were scored in exactly the same way with the two very different versions of the statement. This difference may be called a critical error in translation, and it affects the validity of scores on the scale of the personality test where the item appears.

In summary, all of the myths can seriously compromise the validity of a test in a second language or cultural group, or negatively influence the validity of adapted tests for use in cross-language comparison studies. Fortunately, each myth is straightforward to address in practice. What follows are steps for adapting tests that can eliminate these myths and other shortcomings in test adaptation methodology.
NINE STEPS TO MAXIMIZE SUCCESS IN TEST ADAPTATIONS

The nine steps are based on the International Test Commission guidelines for translating and adapting educational and psychological tests. In 1992 an international committee was formed under the direction of the International Test Commission to prepare guidelines for adapting tests. The committee had 13 representatives from eight countries, with financial support from a number of countries and organizations, including the U.S. Department of Education. The 22 guidelines went through numerous reviews and field tests and have been published (Hambleton, 1994; Hambleton et al., in press; van de Vijver & Hambleton, 1996). They are being adopted by many organizations (see, e.g., Muniz & Hambleton, 1997) and were used in both the TIMSS and OECD/PISA projects. The guidelines appear in the Annex to this chapter. For each of the 22 guidelines, the committee offered a rationale or explanation for including the guideline, described steps that might be taken to meet it, presented several common errors found in practice, and provided numerous references for additional study (see Hambleton et al., in press).

Step 1: Review construct equivalence in the languages and cultures of interest.

In international comparative studies, it is important to establish whether construct equivalence exists among participating countries; if it does not, the options are either to consider “decentering” (i.e., revising the definition of the construct to be equivalent in each language and cultural group) or to discontinue the project. The publication by Harkness (1998a) is especially helpful in the study of construct equivalence because she reviews numerous definitions of construct equivalence and approaches for studying it.
In the 1995 TIMSS study, for example, the mathematics domain that initially could be agreed on by participating countries was so narrow that the study was nearly discontinued. Later, a decision was made to use decentering to redefine the mathematics content domain. Each country was required to be less rigid so that a construct could be defined that would be worthy of an international comparative study of mathematics achievement.

Step 2: Decide whether test adaptation is the best strategy.

Some tests will be more amenable to translation into certain languages than others. The more similar the target language and/or culture are to the source language and/or culture, the easier the adaptation will be (thus, English to Spanish adaptations may make more sense than English
to Arabic or English to Chinese adaptations). With tests intended for cross-cultural comparisons, test adaptation (possibly with some decentering) may be the only option. But when cross-cultural comparisons are not of interest, it may be easier to produce a new test that meets the cultural parameters of the second-language group than to adapt an existing test that may have a number of shortcomings (e.g., a less than satisfactory definition of the construct, inappropriate item formats, an overly long test, or use of some culturally specific content).

Step 3: Choose well-qualified translators.

Lack of well-qualified translators is often one of the major shortcomings of a test adaptation project. Two points can be made. First, in selecting translators, the search should be for persons who are fluent in both languages, who are very familiar with the cultures under study, and who have some knowledge of test construction and the construct being measured. Because knowledge of test construction practices is not common among translators, this may be addressed with some training prior to initiating the test adaptation process. Adding a psychometrician to the team may be desirable, too.

Second, researchers have found that the double-translation procedure (i.e., two independent translations followed by reconciliation of the two versions by a third party) offers advantages over a back-translation procedure or a single forward translation. In the double-translation procedure, multiple individuals judge the equivalence of the source- and target-language versions of the test. In the back-translation design, a single translator may judge the target-language version of the test by comparing the source and back-translated source versions of the test.
Another advantage of a double-translation procedure is that any discrepancies in the translation are noted on the all-important target-language version of the test. See, for example, recent work on the OECD/PISA project by Grisay (1998, 1999) for extensive evaluative comments on the double-translation-with-reconciliation design.

Idiosyncrasies and misunderstandings of individual translators can be reduced with the use of multiple translators. An unfortunate idiosyncrasy of a translator might be to always make the correct answers in multiple-choice items a bit longer than the distractors; a misunderstanding might be a translator mixing up the meanings of terms and concepts. The use of multiple translators increases the chances that these problems, and many others, will be identified prior to finalizing a test adaptation.
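The mechanics of locating material for the reconciler in a double-translation design can be sketched as follows. This is an illustration only, not part of the chapter: the function, the data shapes, and the Spanish item texts are invented, and a surface-string comparison merely shows where the two versions diverge at all. A real reconciler compares meaning, not characters.

```python
def flag_discrepancies(version_a, version_b):
    """Return the item numbers on which two independent forward
    translations differ and so must go to a third-party reconciler."""
    return [item for item, text in version_a.items()
            if text.strip() != version_b.get(item, "").strip()]

# Hypothetical translations of two items by two independent translators.
translator_a = {1: "¿Cuál es el área del rectángulo?",
                2: "Calcula el producto de 98 y 11."}
translator_b = {1: "¿Cuál es el área del rectángulo?",
                2: "Estima el producto de 98 y 11."}

print(flag_discrepancies(translator_a, translator_b))  # [2]
```

Here the two translators rendered item 2 with different verbs (“calculate” versus “estimate”), exactly the kind of substantive divergence the reconciliation step exists to resolve.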
Step 4: Translate and adapt the test.

One approach to increasing the likelihood of a valid test adaptation is to adopt one (or both) of the two standard designs: forward translation and back translation. Forward-translation designs are the more technically sound because the focus of the review is on both the source- and target-language versions of the test. Backward-translation designs can also reveal poor translations, but without a focus on the target-language version of the test, problems in the adaptation can be missed. For example, concepts like “sales tax” and “hamburger” are hard to translate into Chinese, so these English words may be used in the adapted version. They are very easy to back translate, but they may be quite meaningless in the target-language version of the test (for more examples, see Hambleton, Yu, & Slater, 1999). In practice, both designs could be used to strengthen the methodology of the test adaptation process.

Step 5: Review the adapted version of the test and make any necessary changes.

In a forward-translation design, one set of translators performs the original source-to-target-language translation, while another set examines the adapted version of the test for any errors that may lead to differences in meaning between the two language versions. The focus of the second group of translators is on the quality of the translation or adaptation. As Geisinger (1994) suggests, this review can be accomplished in a group meeting, individually, or by some combination of individual and group work. Geisinger believes the most effective strategy is first to have the translators review the items and react in writing, then to have the individuals share their comments with one another, reconcile any differences in opinion, and make any changes in the original and/or adapted-language versions as necessary.
The National Institute for Testing and Evaluation in Israel adapts its college admissions tests into five languages (Arabic, English, Russian, French, and Spanish) from the original Hebrew-language version. One special feature of its process is that the translators work from the translated version first and attempt to determine the validity of the questions: For example, is the item stem clear? Is there a single correct or best answer? Are there grammatical clues that may lead the test-wise candidate to the correct answer? After the test items are judged to be technically sound, the adapted version is compared with the original Hebrew version for equivalence. Translators look at several features of the adapted items: accuracy of the translation, as well as clarity of the sentences, level of difficulty of the words, and fluency of the translation.
With a backward-translation design, translators take the adapted version of the test and back translate it to the source language; judgments are then made about the equivalence of the original and back-translated versions of the test. Where nonequivalence is identified, changes in the adapted version of the test are considered. The idea is that if the adaptation has been effective, the back-adapted version of the test should look very much like the original. Of course, when the adaptation involves format changes, changes in tense, changes in concepts, and other changes, the target-language version of the test may be fine, but a back-translated test may not look at all like the original. In general, a back-translation design seems like an excellent supplement to the forward-translation design, but it is not likely to be able to stand on its own; the information it provides about the validity of the adapted test is limited.

Step 6: Conduct a small tryout of the adapted version of the test.

Many studies seem to go wrong at this point. Too often test developers believe a judgmental review is sufficient evidence to establish the validity of a test in a second language. But validating the use of a test in a second language depends on stronger evidence than that the test looks acceptable to translators and/or reviewers. Empirical evidence is needed to support the validity of inferences from an adapted version of a test, and perhaps multiple empirical studies will be needed. A good example of what researchers might learn from a tryout of test items in a second language and culture is highlighted in the papers by Allalouf and Sireci (1998) and Allalouf et al. (1999). There, it was learned, for example, that verbal analogy items were nearly impossible to translate well. The situation really is no different from validating the scores from any test.
Empirical evidence is needed to support the validity of inferences from scores on a test. That a test may function well and produce valid scores in one country is not suitable evidence that similar results will be obtained in a second country or culture with the adapted version of the test.

A pilot test might consist of administering the test and interviewing the examinees to obtain their criticisms of the test itself, the instructions, the time limits, and other factors. These findings form the basis for revising the test. One good suggestion from Ellis and Mead (1998) might be carried out: when there are disagreements about the best adaptation of a test item, the variations might be field tested and the results used to make the final decision about which adaptation is the most suitable.
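Tryout and field-test results of this kind are typically summarized first with simple item statistics. As a rough sketch (the response matrix below is invented), the classical difficulty index, the proportion of examinees answering each item correctly, can be computed in a few lines:

```python
# Classical item analysis on a small tryout sample.
# Rows are examinees, columns are items; 1 = correct, 0 = incorrect.
# The response matrix is invented for illustration.
responses = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
]

def item_difficulty(matrix, item):
    """Proportion correct (the classical p-value) for one item."""
    return sum(row[item] for row in matrix) / len(matrix)

# An item that is far easier or harder in the adapted version than in
# the source version is a candidate for a translation problem.
for i in range(len(responses[0])):
    print(f"item {i + 1}: p = {item_difficulty(responses, i):.2f}")
```

Comparing these p-values against those from the source-language version gives a first, crude signal of the adaptation problems the judgmental review missed.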

Step 7: Conduct a validation investigation.

Good translators are often capable of identifying and fixing many shortcomings in adapted tests, but many problems go unidentified until test items are field tested. For example, in a recent study by Hambleton, Yu, and Slater (1999) in which National Assessment of Educational Progress (NAEP) mathematics items were adapted into Chinese, one problem with a NAEP test item went unidentified by the translators. A field test revealed a major problem with the item that the translators could not have caught because it was a curriculum issue: Chinese students at the eighth grade were unfamiliar with the mathematical concept of estimation, because the ability to round off numbers to arrive at approximate or estimated answers was not taught in the Chinese curriculum. As a result, on the estimation items (for example, "find the estimate of the product of 98 and 11"), the eighth-grade Chinese students performed disproportionately more poorly than American students. The validity issue was whether the estimation items should be retained when reporting the findings of the comparative study.

The adapted test should be field tested using, whenever possible, a large sample of individuals representative of the eventual target population, and preliminary statistical analyses should be carried out, such as a reliability analysis and a classical item analysis. Checking for construct equivalence using factor analysis is desirable if sample sizes are large enough to produce stable factorial structures. One important analysis is to check that the items function similarly in the adapted- and source-language versions of the test. This can be accomplished through an item bias study, often called a "differential item functioning" or DIF study (Holland & Wainer, 1993).
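To make the DIF idea concrete, here is a minimal sketch of the Mantel-Haenszel procedure, one standard DIF method treated in Holland and Wainer (1993). Examinees in the source-language (reference) and target-language (focal) groups are matched on total test score, and the item's odds of success are compared across groups within each score level; the common odds ratio is conventionally reported on the delta scale as MH D-DIF = -2.35 ln(alpha_MH). The function and the 0/1 response vectors below are illustrative, not operational psychometric software.

```python
import math
from collections import defaultdict

def mantel_haenszel_dif(responses_ref, responses_focal, item):
    """Mantel-Haenszel DIF statistic (delta scale) for one item.

    responses_ref / responses_focal: lists of 0/1 response vectors for the
    reference and focal groups.  Examinees are matched on total score;
    each score level contributes a 2x2 table (group x correct/incorrect)
    to the common odds ratio alpha_MH.  Negative values of
    -2.35 * ln(alpha_MH) indicate the item is relatively harder for the
    focal group at matched ability.
    """
    strata = defaultdict(lambda: [0, 0, 0, 0])  # [A, B, C, D] per score level
    for vec in responses_ref:
        strata[sum(vec)][0 if vec[item] else 1] += 1  # A: correct, B: not
    for vec in responses_focal:
        strata[sum(vec)][2 if vec[item] else 3] += 1  # C: correct, D: not
    num = den = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    alpha_mh = num / den if den else float("nan")
    return -2.35 * math.log(alpha_mh)

# With identical response data in both groups there is no DIF:
same = [[1, 0], [0, 1], [1, 1], [0, 0]]
print(mantel_haenszel_dif(same, same, 0))  # ~0: alpha_MH = 1, no DIF
```

In practice the studied item is flagged when |MH D-DIF| exceeds a policy threshold (ETS, for example, classifies items into negligible, moderate, and large DIF categories), and flagged items are then inspected for translation problems.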
If there are items that function differently for the two groups when the groups are matched on ability, those items can be eliminated from the test, or they can be retranslated, readministered, and reanalyzed to determine whether they function the same in all adapted versions. This type of analysis has become routine in TIMSS and OECD/PISA. The Muniz, Hambleton, and Xing (2001) study highlights the fact that even small samples (i.e., 50 persons per group) can be useful in detecting flaws in the translation/adaptation process, because the problems caused by poor translations are often large and therefore easy to detect, even with small samples.

Step 8: Choose and implement a design for placing scores from the source- and target-language versions of the test on a common reporting scale.

This step is necessary when cross-national or cross-cultural comparisons are of interest, or when the test score norms or performance standards established with the source-language version of the test are of interest with the target-language version. At this step, a linking design is needed to place the test scores from the different versions of the test on a common scale. Three linking designs are used in practice: (1) the bilingual group design, (2) the matched monolingual group design, and (3) the monolingual group design. All three designs are popular, though the third may be the easiest to implement in practice (see, e.g., Angoff & Cook, 1988). For examples based on item response modeling of the data, the studies by Angoff and Cook (1988) and Woodcock and Munoz-Sandoval (1993) are of special interest.

Step 9: Document the process and prepare a manual for the users of the adapted test.

Documenting the results from steps 1 to 8 and preparing a manual for the users of the adapted test are important activities. The manual might include specifics regarding the administration of the test as well as how to interpret the test scores. This is a very important step, yet it is often overlooked. The OECD/PISA project has done an especially good job of preparing detailed steps for adapting tests and documenting the processes that take place in participating countries (see, e.g., Grisay, 1998, 1999).

TEST ADAPTATION PROCEDURES FOR TIMSS AND THE OECD/PISA PROJECTS

For the 1995 TIMSS, and at the request of TIMSS project staff, Hambleton and Berberoglu (1997) conducted a survey of the 45 participating countries (with 27 surveys, or 60 percent, returned). The five main findings of the survey were as follows:

1. The amount of review and revision involved in adapting the tests exceeded the time allocated to do the work. In addition, competent translators were hard to locate in many of the countries, and their high cost was also a problem.

2. The manual laying out the test adaptation process (the operations manual) needed to be shorter and more focused. Without specific guidelines, countries devised their own guidelines for translators, which compromised the standardization of translation procedures across participating countries. For example, one country emphasized the importance of simplicity in translation, while another emphasized detailed rules to be followed by translators. All too often, standardization among translators within a country, rather than across countries, was the focus of attention.

3. Better directives to international writing committees were needed to reduce the number of problems detected during the test adaptation process. For example, one problem arose with long sentences: they were difficult to translate well, and when the sentences were shortened for ease of readability in some countries, language difficulty was no longer consistent across language versions of the test.

4. Multiple-choice items in the incomplete-stem format were difficult to translate because the organization of subject, verb, and object in sentences is not consistent across languages. In countries such as Turkey, use of the incomplete-stem format meant placing the blanks at the beginning of sentences rather than at the end and revising the answer choices to match the format change, and these changes could have influenced the difficulty of the test items.

5. The passive voice in passages was a problem because this construction does not exist in all languages. Many countries found themselves recasting passive sentences in the active voice. Such changes probably altered the structure of the language in the adapted tests and may have affected the difficulty of the test items.

Hambleton and Berberoglu also asked national coordinators to comment on the operations manual used in TIMSS. Suggestions for improvement included (1) spelling out the qualifications of translators more clearly, (2) offering a process for resolving differences between translators, and (3) emphasizing "decentering" in the item-writing process. Decentering is the technique of choosing passage topics, expressions, concepts, and so on that are most likely to be understandable or acceptable across languages and cultures. For example, passages about gun control, drugs, or snow may be unacceptable in an international assessment.
Even a passage about high school students "getting a job" was rejected in one recent study because several countries worried that the passage might send the wrong message to young persons.

The OECD/PISA 2000 study of school achievement among 15-year-olds benefited considerably from the methodology used in TIMSS to adapt the tests. For one, the documentation for carrying out the test adaptation process was more directive and concisely written than in TIMSS, thus providing more guidance to participating countries. Also, many countries more familiar with French than with English as a second language preferred to prepare their adaptations from a French rather than an English version of the test. Therefore, the starting point for the adaptation was the preparation by project staff of equivalent English and French versions of the test. This was a unique feature of the OECD/PISA project, and one that was much appreciated by many participating countries: they had the option of starting their test adaptation work from either the English or the French version of the test. In fact, countries were encouraged to prepare double translations of the tests if possible, one adaptation from the French and one from the English version. This double-translation, double-source-language design gave participating countries two standards for evaluating their translations and two sources to draw on (especially helpful when problems arose with a translation), and a comparison of the English and French versions offered an indication of the latitude that would be allowed in preparing their translations. In addition, the actual production of the French translation before any of the country translations were carried out identified difficult-to-translate material early, so that it could be revised or eliminated in the original English- and French-source-language versions of the test. At the same time, a considerable investment of time and resources was needed to produce formally equivalent English and French versions of the test. But to do otherwise would have made it more difficult for equivalent tests to be produced in participating countries.

In the OECD/PISA project, many important features of an effective test adaptation were learned from the TIMSS experience, but an old problem surfaced again: too little time was allowed to carry out the test adaptation reviews. In part, this problem was created by the ambitious schedule, but it was also due to some inexperience regarding the time needed for committees to carry out careful reviews.

Not only do the tests need to be adapted, but so do the demographic surveys. Finding cultural and contextual equivalents for questions in the school, teacher, and student questionnaires was sometimes a problem.
Terms like "advanced," "special enrichment," and "courses" were not always understood. Entrance exams in one country might be called "oral exams" in another (e.g., Russia). Questions about school structures and organizations were not always meaningfully adapted because the concepts had no equivalents.

Harkness (1998b) has written about the problems of translating surveys and questionnaires. She notes that an extensive amount of research is reported in the survey construction methodology literature, but little of it relates to the use of surveys in multiple languages and cultures. And, as she notes, rarely is there an isomorphism of words across languages. With a rating scale, for example, a word-for-word translation may create smaller or larger psychological gaps between points on the scale. For example, with a rating scale anchored by the extremes "allow" and "not allow," in one of the language translations the extremes became "allow" and "forbid." But the word "forbid" turned out to be considerably more negative than the extreme "not allow," and this choice appeared to significantly influence the use of the rating scale in the second-language version of the survey. Clearly, translating rating scales involves more than a word-by-word translation. Also, in languages such as Hebrew, respondents read from right to left rather than left to right, so the scales need to be reversed. How much might this influence the meaningfulness of the rating scale? In Japan, rating scales may be presented vertically rather than horizontally. Will this shift in format influence ratings? Harkness notes that not just the words need to be translated; often the directions differ as well (in some languages respondents "tick boxes"; in others they "select boxes" or "choose boxes"), and it is not known whether these changes influence responses. She adds that standardizing the administration may also be critical: it may make a big difference in some countries whether the survey is self-administered, administered face to face, or administered in a group.

Clearly, the topic of survey/questionnaire adaptation is in its infancy. Future OECD/PISA projects and studies like TIMSS and TIMSS-Repeat (TIMSS-R) will need to focus additional attention on the adaptation of surveys and questionnaires or risk misinterpreting the data from these instruments.

A number of important points appear to have been learned from the OECD/PISA 2000 project for future test adaptations:

1. Improved methods are being used to locate and train test translators.

2. The test adaptation process is being fully documented and includes important features such as forward and backward translations, double-translation designs from single- and double-source-language versions of the test, national verification, and even international verification. All of these features enhance the quality of test adaptations for international comparative studies.

3. Translators are being given excellent advice.
For example, as many as 45 rules are given to translators in the training documents. These include (a) avoid simplifying the language or changing the level of abstraction of the testing material, and (b) avoid providing unintentional clues to the correct answers, for instance by making the correct answers longer than the distractors or by leaving grammatical clues.

CONCLUSIONS

An increasing number of educational, credentialing, and psychological tests are being adapted for use in multiple languages and cultures. For example, Spielberger's state-trait anxiety measure is now available in more than 50 languages; major individually administered intelligence tests such as the Wechsler Intelligence Scale for Children are available in over 50 languages; TIMSS in 1995 was administered to students in 32 different languages; and Microsoft is delivering credentialing exams in more than 15 languages. These are but a few of the hundreds of tests now available in multiple languages. At the same time, these adapted tests will have limited value unless they are adapted with a high degree of concern for usability, reliability, and validity in the participating countries. There is a rapidly emerging psychometric literature on test adaptation methodology, and more advances can be expected in the coming years as researchers respond to the expanding need for adapted tests of high technical quality (see, for example, Hambleton, Merenda, & Spielberger, in press). Avoiding the five myths and following the nine steps introduced in this chapter for the test adaptation process should go a long way toward improving current practices. In addition, the nine steps provide a framework for incorporating new methodology into the process as it is developed.

Three conclusions follow from the research carried out in completing this paper. First, test adaptation methodology has advanced considerably in the past 20 years. It has moved from the use of a single, possibly unqualified, translator and/or limited empirical work with bilinguals to considerably more sophisticated methodologies focused on establishing construct, method, and item-level equivalence (Hambleton, Merenda, & Spielberger, in press; van de Vijver & Tanzer, 1997).
There is an ever-increasing number of papers published on the topic each year, and there is now a new journal, the International Journal of Testing, published by Lawrence Erlbaum Associates under the direction of the International Test Commission, that is expected to publish many of the methodological and substantive advances. Test adaptation guidelines have emerged, and more time is being allocated to the process of test adaptation than ever before. For example, a comparison of the methodology used for test adaptation in the 1988 and 1991 NCES-ETS studies (see Lapointe, Mead, & Askew, 1992; Lapointe, Mead, & Phillips, 1989) with that used in the TIMSS and OECD/PISA projects in 1995 and 2000, respectively, shows major advances in sophistication and effort.

Second, the future of test adaptation seems very positive. The methodology is very much in place, and advances are still being made; what is needed now are commitments of resources and time to ensure that test adaptation work is carried out well.

Finally, the most important areas for improvement in the coming years for international comparative studies of achievement are the following: choosing multiple translators well and training them, aggressively applying current judgmental and statistical designs and methods, and building on the experience and knowledge gained to continually improve the process.

REFERENCES

Allalouf, A., Hambleton, R. K., & Sireci, S. G. (1999). Identifying the causes of DIF in translated verbal items. Journal of Educational Measurement, 36(3), 185-198.

Allalouf, A., & Sireci, S. G. (1998, April). Detecting sources of DIF in translated verbal items. Paper presented at the meeting of the American Educational Research Association, San Diego.

Angoff, W. H., & Cook, L. L. (1988). Equating the scores of the Prueba de Aptitud Academica and the Scholastic Aptitude Test (Rep. No. 88-2). New York: College Entrance Examination Board.

Ellis, B., & Mead, A. (1998, August). Measurement equivalence of a 16PF Spanish translation: An IRT differential item and test functioning analysis. Paper presented at the 24th meeting of the International Association of Applied Psychology, San Francisco.

Gardner, H. (1983). Frames of mind: The theory of multiple intelligences. New York: Basic Books.

Geisinger, K. F. (1994). Cross-cultural normative assessment: Translation and adaptation issues influencing the normative interpretation of assessment instruments. Psychological Assessment, 6, 304-312.

Grisay, A. (1998). Instructions for the translation of the PISA material (OECD/PISA Rep.). Melbourne: Australian Council for Educational Research.

Grisay, A. (1999). Report on the development of the French source version of the PISA test material (OECD/PISA Rep.). Melbourne: Australian Council for Educational Research.

Hambleton, R. K. (1994). Guidelines for adapting educational and psychological tests: A progress report. European Journal of Psychological Assessment, 10, 229-244.

Hambleton, R. K., & Berberoglu, G. (1997). TIMSS instrument adaptation process: A formative evaluation (Laboratory of Psychometric and Evaluative Research Rep. No. 290). Amherst: University of Massachusetts, School of Education.

Hambleton, R. K., Merenda, P., & Spielberger, C. (Eds.). (in press). Adapting educational and psychological tests for cross-cultural assessment. Hillsdale, NJ: Lawrence Erlbaum Associates.

Hambleton, R. K., & Patsula, L. (1998). Adapting tests for use in multiple languages and cultures. Social Indicators Research, 45, 153-171.

Hambleton, R. K., & Patsula, L. (1999). Increasing the validity of adapted tests: Myths to be avoided and guidelines for improving test adaptation practices. Journal of Applied Testing Technology, 1, 1-12.

Hambleton, R. K., Yu, J., & Slater, S. C. (1999). Field-test of the ITC guidelines for adapting psychological tests. European Journal of Psychological Assessment, 15(3), 270-276.

Harkness, J. (Ed.). (1998a). Cross-cultural equivalence. Mannheim, Germany: ZUMA.

Harkness, J. (1998b, August). Response scales in cross-national survey research. Paper presented at the meeting of the American Psychological Association, Toronto, Canada.

Holland, P. W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.

Jeanrie, C., & Bertrand, R. (1999). Translating tests with the International Test Commission Guidelines: Keeping validity in mind. European Journal of Psychological Assessment, 15(3), 277-283.

Lapointe, A. E., Mead, N. A., & Askew, J. M. (1992). Learning mathematics (Rep. No. 22-CAEP-01). Princeton, NJ: Educational Testing Service.

Lapointe, A. E., Mead, N. A., & Phillips, G. W. (1989). A world of differences: An international assessment of mathematics and science (Rep. No. 19-CAEP-01). Princeton, NJ: Educational Testing Service.

Lonner, W. J. (1990). An overview of cross-cultural testing and assessment. In R. W. Brislin (Ed.), Applied cross-cultural psychology (pp. 56-76). Newbury Park, CA: Sage.

Muniz, J., & Hambleton, R. K. (1997). Directions for the translation and adaptation of tests. Papeles del Psicologo, August, 63-70.

Muniz, J., Hambleton, R. K., & Xing, D. (2001). Small sample studies to detect flaws in item translations. International Journal of Testing, 1(2), 115-135.

Poortinga, Y. H., & van de Vijver, F. J. R. (1991). Testing across cultures. In R. K. Hambleton & J. Zaal (Eds.), Advances in educational and psychological testing (pp. 277-308). Boston: Kluwer Academic.

Schroots, J. J. F., Fernandez-Ballesteros, R., & Rudinger, G. (1999). Aging in Europe. Amsterdam, Netherlands: IOS Press.

Sternberg, R. (1989). The triarchic mind: A new theory of human intelligence. New York: Viking.

van de Vijver, F. J. R., & Hambleton, R. K. (1996). Translating tests: Some practical guidelines. European Psychologist, 1, 89-99.

van de Vijver, F. J. R., & Poortinga, Y. H. (1997). Towards an integrated analysis of bias in cross-cultural assessment. European Journal of Psychological Assessment, 13, 29-37.

van de Vijver, F. J. R., & Tanzer, N. (1997). Bias and equivalence in cross-cultural assessment: An overview. European Review of Applied Psychology, 47(4), 263-279.

Woodcock, R. W., & Munoz-Sandoval, A. F. (1993). An IRT approach to cross-language test equating and interpretation. European Journal of Psychological Assessment, 9, 233-241.

ANNEX TO CHAPTER 3: INTERNATIONAL TEST COMMISSION TEST ADAPTATION GUIDELINES

Context

C.1 Effects of cultural differences which are not relevant or important to the main purposes of the study should be minimized to the extent possible.

C.2 The amount of overlap in the constructs in the populations of interest should be assessed.

Test Development and Adaptation

D.1 Test developers/publishers should ensure that the adaptation process takes full account of linguistic and cultural differences among the populations for whom adapted versions of the test are intended.

D.2 Test developers/publishers should provide evidence that the language used in the directions, rubrics, and items themselves, as well as in the handbook, is appropriate for all cultural and language populations for whom the test is intended.

D.3 Test developers/publishers should provide evidence that the choice of testing techniques, item formats, test conventions, and procedures is familiar to all intended populations.

D.4 Test developers/publishers should provide evidence that item content and stimulus materials are familiar to all intended populations.

D.5 Test developers/publishers should implement systematic judgmental evidence, both linguistic and psychological, to improve the accuracy of the adaptation process and compile evidence on the equivalence of all language versions.

D.6 Test developers/publishers should ensure that the data collection design permits the use of appropriate statistical techniques to establish item equivalence between the different language versions of the test.

D.7 Test developers/publishers should apply appropriate statistical techniques to (1) establish the equivalence of the different versions of the test, and (2) identify problematic components or aspects of the test which may be inadequate to one or more of the intended populations.

D.8 Test developers/publishers should provide information on the evaluation of validity in all target populations for whom the adapted versions are intended.

D.9 Test developers/publishers should provide statistical evidence of the equivalence of questions for all intended populations.

D.10 Nonequivalent questions between versions intended for different populations should not be used in preparing a common scale or in comparing these populations. However, they may be useful in enhancing content validity of scores reported for each population separately.

Administration

A.1 Test developers and administrators should try to anticipate the types of problems that can be expected, and take appropriate actions to remedy these problems through the preparation of appropriate materials and instructions.

A.2 Test administrators should be sensitive to a number of factors related to the stimulus materials, administration procedures, and response modes that can moderate the validity of the inferences drawn from the scores.

A.3 Those aspects of the environment that influence the administration of a test should be made as similar as possible across populations for whom the test is intended.

A.4 Test administration instructions should be in the source and target languages to minimize the influence of unwanted sources of variation across populations.

A.5 The test manual should specify all aspects of the test and its administration that require scrutiny in the application of the test in a new cultural context.

A.6 The administrator should be unobtrusive, and the administrator-examinee interaction should be minimized. Explicit rules that are described in the manual for the test should be followed.

Documentation/Score Interpretations

I.1 When a test is adapted for use in another population, documentation of the changes should be provided, along with evidence of the equivalence.

I.2 Score differences among samples of populations administered the test should not be taken at face value. The researcher has the responsibility to substantiate the differences with other empirical evidence.

I.3 Comparisons across populations can be made only at the level of invariance that has been established for the scale on which scores are reported.

I.4 The test developer should provide specific information on the ways in which the sociocultural and ecological contexts of the populations might affect performance on the test, and should suggest procedures to account for these effects in the interpretation of results.