Read "Language and Machines: Computers in Translation and Linguistics" at NAP.edu

« Previous: Appendix 9. Cost Estimates of Various Types of Translation

Page 67 Cite

Suggested Citation:"Appendix 10. An Experiment in Evaluating the Quality of Translations." National Research Council. 1966. Language and Machines: Computers in Translation and Linguistics. Washington, DC: The National Academies Press. doi: 10.17226/9547.

Page 68 Cite

Page 69 Cite

Page 70 Cite

Page 71 Cite

Page 72 Cite

Page 73 Cite

Page 74 Cite

Page 75 Cite

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Appendix 10 An Experiment in Evaluating the Quality of Translations This experiment* was designed to lay the foundations for a stan- dard procedure for measuring the quality of scientific transla- tions, whether human or mechanical. There have been other ex- periments on this problem te.g., G. A. Miller and J. G. Beebe- Center, Mechan. Transl. 3, 73 (1958~; S. M. Pfafflin, Mechan. - Transl. 8, 2 (1965~], but their methods for evaluating translations have been too laborious, too subject to arbitrariness in standards, or too lacking in reliability and/or validity to become generally accepted. The measurement procedure developed here gives \ promise of being amenable to refinement to the point where it will meet the requirements of relative simplicity and feasibility, fixed standards of evaluation, and high validity and reliability. A detailed report of this experiment will be submitted for pub- lication elsewhere; the present brief report will serve to indicate the general nature of the measurement procedure and some of the chief results. THE MEASUREMENT PROCEDURE It was reasoned that the two major characteristics of a translation are (a) its intelligibility, and (b) its fidelity to the sense of the original text. Conceptually, these characteristics are independent; that is, a translation could tee highly intelligible and yet lacking in fidelity or accuracy. Conversely, a translation could be highly accurate and yet lacking in intelligibility; this would be likely to occur, however, only in cases where the original had low intel- ligibility. Essentially, the method for evaluating translations employed in this experiment involved obtaining subjective ratings for these two characteristicsintelligibility and fidelityof sentences selected * Conducted by John B. Carroll with funds provided by the Automatic Language Processing Advisory Committee. 67

randomly from a translation and interspersed in random order among other sentences from the same translation and also among sentences selected at random from other translations of varying quality. When a translation sentence was being rated for intelligi- bility, it was rated without reference to the original. "Fidelity" was measured indirectly: the rater was asked to gather whatever meaning he could from the translation sentence and then evaluate the original sentence for its "informativeness" in relation to what he had understood from the translation sentence. Thus, a rating of the original sentence as "highly informative" relative to the translation sentence would imply that the latter was lacking in fidelity. All ratings were made by persons who were specially selected and trained for this purpose. There were two sets of raters. The first set of raters (called here "monolinguals" for convenience) consisted of 18 native speakers of English who had no knowledge of the language of the original (Russian, in this case). They were all Harvard undergraduates with high tested verbal intelligence and with good backgrounds in science. In rating "informativeness" these raters were provided with carefully prepared English trans- lations of the original sentences, so that in effect they were com- paring two sentences in Englishone the sentence from the trans- lation being evaluated, and the other the carefully prepared trans- lation of the original. The second set of raters ("bilinguals") consisted of 18 native speakers of English who had a high degree of competence in the comprehension of scientific Russian. Their ratings of the intel- ligibility of the translation sentences may well have been influenced by their knowledge of the vocabulary and syntax of Russian; at any rate, no attempt was made to prevent them from using such know- ledge. To rate "informativeness," they made a direct comparison between the translation sentence (in English) and the original ver- sion. All ratings were made on nine-point scales that had been estab- lished by the writer prior to the experiment by an adaptation of a psychometric technique known as the method of equal-appearing intervals. Thus, points on these scales could be assumed to be equally spaced in terms of subjectively observed differences. In the case of the intelligibility scale, each of the nine points on the scale had a verbal description (see Table 4~. The same was true of the "informativeness" scale except that verbal descriptions were omitted for a few of the points (see Table 5~. In this way each degree on the scales could be characterized in a meaningful way. For example, point 9 on the intelligibility scale was described 68

TABLE 4. Scale of ~tellig~bility 9Perfectly clear and intelligible. Reads like ordinary text; has no stylistic infelicities. 8Perfectly or almost clear and intelligible, but contains minor grammati- cal or stylistic infelicities, and/or midly unusual word usage that could, nevertheless, be easily "corrected." Generally clear and intelligible, but style and word choice and/or syntactical arrangement are somewhat poorer than in category 8. 6The general idea is almost immediately intelligible, but full comprehen- sion is distinctly interfered with by poor style, poor word choice, alter- native expressions, untranslated words, and incorrect grammatical arrangements. Postedit~ng could leave this in nearly acceptable form. 5The general idea is intelligible only after considerable study, but after this study one is fairly confident that he understands. Poor word choice, grotesque syntactic arrangement, untranslated words, and similar phenomena are present, but constitute mainly "noise" through which the main idea is still pi erceptible. 4Masquerades as an intelligible sentence, but actually it is more unintel- ligible than intelligible. Nevertheless, the idea can still be vaguely apprehended. Word choice, syntactic arrangement, aDd/or alternative expressions are generally bizarre, and there may be critical words un- translated. 3Generally unintelligible; it tends to read like nonsense but, with a con- siderable amount of reflection and study, one can at least hypothesize the idea intended by the sentence. 2Almost hopelessly unintelligible even after reflection and study. Never- theless, it does not seem completely nonsensical. 1Hopelessly unintelligible. It appears that no amount of study and reflec- tion would reveal the thought of the sentence. as follows: "Perfectly clear and intelligible. Reads like ordinary text; has no stylistic infelicities.t' Point 5 (the midpoint of the scale): "The general idea is intelligible only after considerable study, but after this study one is fairly confident that he under- stands. Poor word choice, grotesque syntactic arrangement, un- translated words, and similar phenomena are present, but con- stitute mainly 'noise' through which the main idea is still percep- tible .'t PREPARATION OF TEST MATERIALS AND COLLECTION OF DATA The measurement procedure was tested by applying it to six varied English translationsthree human and three mechanical 69

TABLE 5. Scale of Informativeness (This pertains to how informative the original version is perceived to be after the translation has been seen and studied. ~ the translation already conveys a great deal of information, it may be that the original can be said to be low In informativeness relative to the translation being evaluated. But if the translation conveys only a certain amount of information, it may be that the original conveys a great deal more, In which case the original is Oh in informativeness relative to the translation being evaluated.) 9Extremely informative. Makes "all the difference in the world" In com- prehending the meaning intended. (A rating of 9 should always be as- signed when the original completely changes or reverses the meaning conveyed by the translation.) 8Very informative. Contributes a great deal to the clarification of the meaning intended. By correcting sentence structure, words, and phrases, it makes a great change in the reader's impression of the meaning intended, although not so much as to change or reverse the meaning completely. 7(Between 6 and 8.) 6 Clearly informative. Adds considerable information about the sentence structure and individual words, putting the reader "on the right track" as to the meaning intended. 5(Between 4 and 6. ) 4~ contrast to 3, adds a certain amount of information about the sentence structure and syntactical relationships; it may also correct minor misapprehensions about the general meaning of the sentence or the meaning of individual words. 3By correcting one or two possibly critical meanings, chiefly on the word level, it gives a slightly different "twist" to the meaning conveyed by the translation. It adds no new information about sentence structure, however. 2No really new meaning is added by the original, either at the word level or the grammatical level, but the reader is somewhat more confident that he apprehends the meaning intended. 1Not informative at all; no new meaning is added, nor is the reader's confidence in his understanding increased or enhanced. OThe original contains, if anything, less information than the translation. The translator has added certain meanings, apparently to make the passage more understandable. of a Russian work entitled Mashina i Mysl' (Machine and Thought) by Z. Rovenskii, A. Uemov, and E. Uemova (Moscow, 1960~. These translations were of five passages varying considerably in type of content. (All the passages selected for this experiment, with the original Russian versions, have now been published by the Office of Technical Services, U.S. Department of Commerce, Technical 70

Translation TT 65-60307.) The materials associated with one of these passages were used for pilot studies and rater practice sessions; the experiment proper used the remaining four passages. In preparing materials for the rating task, 36 sentences were selected at random from each of the four passages under study. Since six different translations were being evaluated, six different sets of materials were prepared (in two forms, one for the monolinguals and one for the bilinguals) in such a way that each set contained a different translation of a given sentence. In this way no rater evaluated more than one translation of a given sentence. Each set of materials was given to three monolinguals and to three bilinguals; thus, there were 18 monolinguals and 18 bilinguals. Each rater had 144 sentences to evaluate first for in- telligibility and then for the informativeness of the original (or the standard translation of it) after the translation had been seen. The raters required three 90-min sessions to complete this task, deal- ing with 48 sentences in each session. The raters were not informed as to the source of the translations they were rating, although they were told that some had been made by machine. Before undertaking this task, the raters attended a 1-fur ses- sion in which they were given instruction in the rating procedures and required to work through a 30-sentence practice set. During the rendering of ratings for intelligibility, the raters held stopwatches on themselves to record the number of seconds it took them to read and rate each sentence. RESULTS The results of the experiment can be considered under two head- ings: (a) the average scores of the various translations, and (b) the variation in the scores as a function of differences in sentences, passages, and raters. Table 6 gives the over-all mean ratings and time scores for the six translations, arranged in order of general excellence ac- cording to our data. Consider first the mean ratings for intelligibility by the mono- linguals. Translation 1, a published human translation that had presumably been carefully done, received the highest mean rating, 8.30, on the scale established in Table 4. But 8.30 is still appreci- ably different from the maximum possible mean rating of 9.00, and it is evident that not even this "careful" human translation was as good as one might have expected. Furthermore, the mean rating of Translation 1 is not significantly different from that of Trans- lation 4 (8.21), a "quick" human translation made by rapid dictation 71

procedures. The mean ratings of Translations ~ and 4 do, however, differ significantly from the mean rating (7.36) of Translation 2, another "quick" human translation. It may be concluded that the measurement procedure studied here is sensitive enough to dif- ferentiate among human translations. A similar remark may be made about the sensitivity of this procedure to differences in the intelligibility of machine trans- lations. Translations 7 and 5 were shown to be significantly more intelligible, on the average, than Translation 9. Of most current interest, however, are the results having to do with the comparison of the human and the machine translations. Machine translations 7, 5, and 9 received mean ratings, respec- tively, of 5.72, 5.50, and 4.73. A scale value of 5 refers to a trans- lation in which "the general idea is intelligible only after consider- able study, but after this study one is fairly confident that he understands . . . " All these machine translations are significantly less intelligible, on the average, than any of the three human trans- lations. As machine translations improve, it should be possible to scale them by the present rating procedure to determine how nearly they approach human translations in intelligibility. The monolinguals' mean ratings on "informativeness" (reflec- ting the lack of fidelity of the translations) show an almost perfect inverse relationship to the mean ratings on intelligibility, and they differentiate the various translations in the same way and to the same extent. This result means that in practice, when ratings are averaged over sentences, passages, and raters, "intelligibility" and "fidelity" are very highly correlated. The detailed results of this study show that only in the case of a few particular sentences do the mean ratings of intelligibility and informativeness convey different information. Furthermore, the mean reading times per sentence show almost precisely the same pattern of results as the ratings. In fact, the mean reading times are linearly related to the mean ratings, a result that supports the conclusion that the points on the rating scales are evenly spaced. The results from the ratings by bilinguals contribute nothing more to the differentiation of the translations than is obtainable with the monolinguals' ratings. Bilinguals' intelligibility ratings of the translations are slightly (and significantly) higher, on the average, than those of the monolinguals, and correspondingly, their informa- tiveness ratings are slightly lower. Yet, they took significantly longer to read and rate the sentences. Apparently their knowledge of Russian caused them to work harder on trying to understand the translations. One is inclined to give more credence to the results 72

from the monolinguals because monolinguals are more represen- tative of potential users of translations and are not influenced by knowledge of the source language. It is also to be noted that the data from the monolinguals differentiate the translations to a somewhat greater extent than do the data from the bilinguals. The results concerning the differences in ratings due to differ- ences in sentences, passages, and raters can now be considered. (The detailed tables of these results are omitted here to save space.) The more important results may be summarized as fol- lows: 1. The results do not differ significantly from passage to pas- sage; that is, on the average the various passages from a given translation receive highly similar ratings. For intelligibility ratings, however, there is a small but significant interaction between translation and passage, indicating that translations are to some extent differentially effective for different types of content. (This interaction effect is present both for human and for machine translations.) 2. There is a marked variation among the sentences. In fact, as may be seen from Figure I, there is some overlap between sentences from human translations and from mechanical transla- tions; or, in other words, there are some sentences translated by machine that have higher ratings than some other sentences trans- lated by human translators, even though, on the average, the human- translated sentences are better than the machine-translated ones. These results imply that in order to obtain reliable mean ratings for translations, a fairly large sample of sentences must be rated. 3. Variation among raters is relatively small, but it is large enough to suggest that ratings should always be obtained from several raterssay at least three or four. CONCLUSION This experiment has established the fact that highly reliable assessments can be made of the quality of human and machine translations. In the case of the six particular translations investi- gated in the study, all the human translations were clearly superior to the machine translations; further, some human translations were significantly superior to other human translations, and some machine translations were significantly superior to other machine translations. On the whole, the machine translations were found to fall about at the midpoint of a scale ranging from the best possible to the poorest possible translation. 73

What is still needed, however, is a system whereby any trans- lation can be easily and reliably assessed. The present experiment has determined the necessary parameters of such a system. 40n 30- 20- 10- TRANSLATION 1 5 6 7 ~ 9 0- 1 2 3 4 4O 30- 20- 10- TRANSLATION 4 O- 1 2 30- 20- z LL LL CY lo 20, 10- nl 1 20- 3 4 TRANSLATION 2 10- 0 1 2 3 4 8 ~ ~ ~ 7 8 ~ TRANS LAT ION 7 0 1 2 , ~ ~ ~ , ~ 9 20- 10- TRANSLATION 5 1 2 4 RANSLATION 9 ~ 4 ~ 6 7 MEAN INTELLIGIBILITY RATING 8 9 FIGURE 1. Frequency distribution of monolingu- als' mean intelligibility ratings of the 144 sen- tences in each of six translations. Translations 1, 4, and 2 are human translations; Translations 7, 5, and 9 are machine translations. 74

Next: Appendix 11. Types of Errors Common in Machine Translation »

Language and Machines: Computers in Translation and Linguistics (1966)

Chapter: Appendix 10. An Experiment in Evaluating the Quality of Translations

Welcome to OpenBook!

Get Email Updates