New Trends in Natural Language Processing: Statistical Natural Language Processing

Mitchell Marcus

SUMMARY

The field of natural language processing (NLP) has seen a dramatic shift in both research direction and methodology in the past several years. In the past, most work in computational linguistics tended to focus on purely symbolic methods. Recently, more and more work is shifting toward hybrid methods that combine new empirical corpus-based methods, including the use of probabilistic and information-theoretic techniques, with traditional symbolic methods. This work is made possible by the recent availability of linguistic databases that add rich linguistic annotation to corpora of natural language text. Already, these methods have led to a dramatic improvement in the performance of a variety of NLP systems with similar improvement likely in the coming years. This paper focuses on these trends, surveying in particular three areas of recent progress: part-of-speech tagging, stochastic parsing, and lexical semantics.

SOME LIMITATIONS OF RULE-BASED NLP

Until about 3 or 4 years ago, all natural language processing (NLP) systems were entirely hand constructed, with grammars and semantic components made up of many carefully handcrafted rules. Often the target coverage of such systems was based on a small set of exemplar sentences; many such systems were originally developed on fewer than several hundred examples. While these systems were able to provide adequate performance in interactive tasks with typed input, their success was heavily dependent on the almost magical ability of users to quickly adapt to the limitations of the interface. The situation is quite different, however, when these rule sets are applied open loop to naturally occurring language sources such as newspaper texts, maintenance manuals, or even transcribed naturally occurring speech.

It now appears unlikely that hand-coded linguistic grammars capable of accurately parsing unconstrained texts can be developed in the near term. In an informal study conducted during 1990 (Black et al., 1992b), short sentences of 13 words or less taken from the Associated Press (AP) newswire were submitted to a range of the very best parsers in the United States, parsers expressly developed to handle text from natural sources. None of these parsers did very well; the majority failed on more than 60 percent of the test sentences, where the task was to find the one correct parse for each sentence in the test set. Another well-known system was tested by its developer using the same materials in 1992, with a failure rate of 70 percent.

This failure rate actually conflates two different, and almost contradictory, problems of this generation of parsers. The first is that the very large handcrafted grammars used by parsers that aim at broad coverage often generate very, very large numbers of possible parses for a given input sentence. These parsers usually fail to incorporate some source of knowledge that will accurately rank the syntactic and semantic plausibility of parses that are syntactically possible, particularly if the parser is intended to be domain independent.
The second problem, somewhat paradoxically, is that these parsers often fail to provide the correct analysis of a given sentence at all; the grammar of a natural language like English appears to be quite vast and quite complex. Why can't traditional approaches to building large software systems, using techniques like divide and conquer, solve this last problem? The problem is not that the grammar developers are not competent or that there is a lack of effort; a number of superb computational linguists have spent years trying to write grammars with broad enough coverage to parse unconstrained text. One hypothesis is that the development of a large grammar for a natural language runs into a complexity barrier similar to that faced in the development of very large software systems. While the human grammatical system appears to be largely modular, the interaction of the subcomponents is still sufficient to cause the entire system to be unmanageably complex. The net result is that the grammatical system does not appear to decompose easily into units that a team can develop and then join together. In support of this view is the fact that almost all of the large grammars extant are the result of a single grammar developer working over a long period of time. If this conclusion is correct, an approach to developing NLP systems must be found other than careful handcrafting.

STATISTICAL TECHNIQUES: FIRST APPEARANCE

One of the first demonstrations that stochastic modeling techniques, well known in the speech-processing community, might provide a way to cut through this impasse in NLP was the effective application of a simple letter trigram model to the problem of determining the national origin of proper names for use in text-to-speech systems (Church, 1985). Determining the etymology of names is crucial in this application because the pronunciation of identical letter strings differs greatly from language to language; the string GH, for example, is pronounced as a hard G in Italian, as in Aldrighetti, while most often pronounced as F or simply silent in English, as in laugh or sigh. This system estimates the probability that a name W comes from language L as the product of the probabilities, estimated across a set of known names from L, of all the contiguous three-letter sequences in W. It then assigns W to the language L that maximizes this probability.

The success of this program came as a surprise to most of the NLP community, at the time completely wedded to the symbolic techniques of traditional artificial intelligence (AI). Many people in NLP thought this application was a fluke, that the task solved by the program was somehow special. In fact, this technique led the way toward the application of statistical techniques to problems that one would have thought required an "AI-complete" solution, a full solution to the problem of modeling human understanding.
In another sense this work was an application of an approach to linguistic analysis called distributional analysis (Harris, 1951), which reached its zenith in the 1950s. This work suggested that the structure of language could be discovered by looking at distributional patterns of linguistic entities. While the work of Chomsky in the late 1950s showed that distributional analysis could not be the whole story, most linguists assumed that Chomsky's work implied that distributional techniques should be abandoned entirely. This application showed that simple distributional techniques were useful for solving hard engineering problems that looked resistant to the application of a priori knowledge.
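Church's trigram scoring is simple enough to sketch in a few lines. In the following Python sketch, the training name lists and the add-one smoothing are invented stand-ins for illustration, not Church's actual data or estimation method.

```python
# Sketch of a letter-trigram name-origin model in the spirit of
# Church (1985). Training names below are illustrative stand-ins.
from collections import defaultdict
from math import log

def trigrams(name):
    padded = "##" + name.lower() + "#"   # pad so word edges form trigrams
    return [padded[i:i+3] for i in range(len(padded) - 2)]

def train(names):
    """Estimate P(trigram) for one language from a list of known names."""
    counts, total = defaultdict(int), 0
    for name in names:
        for tri in trigrams(name):
            counts[tri] += 1
            total += 1
    vocab = 26 ** 3  # crude add-one smoothing over possible trigrams
    return lambda tri: (counts[tri] + 1) / (total + vocab)

def score(name, model):
    """log P(name | language) as a sum of log trigram probabilities."""
    return sum(log(model(tri)) for tri in trigrams(name))

def classify(name, models):
    """Assign the name to the language that maximizes the score."""
    return max(models, key=lambda lang: score(name, models[lang]))

models = {
    "italian": train(["aldrighetti", "rossi", "bianchi", "ricci", "moretti"]),
    "english": train(["smith", "laughton", "wright", "brown", "johnson"]),
}
print(classify("martinetti", models))  # prints "italian" under these toy lists
```

Even with a handful of training names, shared suffixes like -etti are enough to separate the two toy "languages."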

THE ARCHITECTURE OF AN NLU SYSTEM

Figure 1a gives an overview of a few of the crucial steps in the process of decoding a sentence in a conventional NLU system, given that the words that make up the sentence have been determined either by a speech recognition system or by tokenization of an ASCII source. When a new sentence comes in, it is analyzed by a parser that both determines what part of speech to assign to each of the words and combines these part-of-speech-tagged words into larger and larger grammatical fragments, using some kind of grammar that tells what combinations are possible and/or likely. The output of this grammatical analysis, either a single-rooted tree or a string of tree fragments, then goes through semantic analysis, which determines the literal meaning of a sentence in isolation. This phase of analysis decides both what the individual words mean and how to combine the individual word meanings into larger semantic structures. Often, this last step is done using some form of compositional semantics, where the meaning of each larger unit is constrained to be a relatively simple function of the meaning of its parts. This meaning representation is then further analyzed by pragmatic and discourse components to determine what the sentence means given its particular context of use and to place this representation into a multisentence representation useful for such tasks as determining the referent of pronouns and other noun phrases.

FIGURE 1 Two decompositions of the architecture of an NLU system. (a) Standard decomposition. (b) An alternate decomposition.

For the purposes of the rest of this chapter, the problem will be subdivided into somewhat smaller functional units than given by the conventional model. This subdivision, given in Figure 1b, reflects the development of statistical NLP techniques over the past several years, with rate of progress roughly proportional to height in the figure. The first success of statistical modeling techniques for NLU was in the area of part-of-speech determination: deciding, for example, whether the word saw functioned in a given linguistic context as a singular noun or a past tense verb. A variety of techniques now tag previously unseen material with 95 to 97 percent accuracy. Recently, purely context-free probabilistic parsing methods have been supplanted by parsing algorithms that utilize probabilities of context-free rules conditioned on aspects of surrounding linguistic structure.
Such parsers provide the correct parse as the first parse output between 60 and 80 percent of the time, by sentence, on naturally occurring texts from rich, but not entirely unconstrained, domains such as the Wall Street Journal. They have performed with up to 91 percent accuracy on spoken language tasks from limited domains like the Advanced Research Projects Agency's (ARPA) Air Travel Information Service (ATIS) domain. In the area of lexical semantics, a range of promising techniques for performing word-sense disambiguation have emerged in just the last year, as well as some preliminary work in automatically determining the selectional restrictions of verbs, that is, what kind of objects can serve as the subject or object of a given verb. Finally, all of these methods depend crucially on the availability of training materials annotated with the appropriate linguistic structure. These advances were made possible by the development of corpora appropriately annotated with part-of-speech and syntactic structure. This paper will also touch on the development of such corpora.

PART-OF-SPEECH TAGGING

The task of a part-of-speech tagger is to map an input stream of word tokens into the correct part of speech for each word token in context. To do this, it must disambiguate words that have the potential to be many different parts of speech. A part-of-speech tagger, for example, should map the sentence Can we can the can? into the string of parts of speech shown in Figure 2. This problem of lexical disambiguation is a central problem in building any NLP system; given a realistically large lexicon of English, many common words are used in multiple parts of speech. Determining what function each word plays in context is a crucial part of either assigning correct grammatical structure for purposes of later semantic analysis or of providing a partial heuristic chunking of the input into phrases for purposes of assigning intonation in a text-to-speech synthesizer.

The problem of lexical disambiguation was thought to be completely intractable 10 years ago by most of the NLP community, and yet now a wide range of very different techniques can solve this problem with 95 to 97.5 percent word accuracy, for part-of-speech tag sets of between 40 and 80 tags, depending on the task and how accuracy is measured (see, e.g., Black et al., 1992a; Brill, 1992; Church, 1988; Cutting et al., 1992; Hindle, 1989; Merialdo, 1991; Weischedel et al., 1993; and others). It is worth noting that many of these errors are not very harmful; a significant fraction of the errors consist of cases where one kind of verb is confused with another kind of verb or one kind of noun with another. Many of the parsers for which these taggers serve as preprocessors are forgiving enough that the errors do not actually throw the parser off track.

Most part-of-speech taggers are implemented as hidden Markov models (HMMs).
For an input sentence S = w1, w2, . . . wn, these taggers predict a tag ti for each word wi given two sets of probabilities. The first is P(w | t), the probability for each possible word w and each part-of-speech tag t that, if a given word is tagged with t, it is in fact the word w. The second is P(ti+1 | ti), the transition probability that the next tag is ti+1, given that the current tag is ti.

Words in: Can we can the can?
Part-of-speech stream out: modal pronoun verb det noun

FIGURE 2 Part-of-speech taggers assign tags in context.

These taggers use a linear-time search based on a dynamic programming algorithm, often called the Viterbi algorithm, to find the string of tags T = t1, t2, . . . tn that maximizes the product over i of P(wi | ti) P(ti+1 | ti). The question here is how to estimate the values of the parameters of the HMM. The standard approach for HMMs is to use the forward/backward algorithm to estimate the parameters automatically, as described by Jelinek elsewhere in this volume. However, systems that use the forward/backward algorithm do not perform quite as well as those that estimate parameters, at least initially, by simple counting, using a corpus of text that has been pretagged with part-of-speech information (Merialdo, 1991). In practice, such systems must use some technique to smooth[1] very small counts. One could then use the forward/backward algorithm to smooth these direct estimates, but there is little evidence that this helps. Currently, then, the best way to estimate the parameters of an HMM for part-of-speech tagging is to hand tag a corpus and simply count.

The theme that emerges here is true of most statistical NLP applications and will be a leitmotif in what follows below. What works best, both for part-of-speech tagging using HMMs and for the entire range of statistical NLP applications considered in this paper, is some appropriate combination of stochastic techniques and linguistic knowledge. While earlier work provides evidence that handcrafted symbolic representations of linguistic knowledge are insufficient to provide industrial-strength NLP, it also appears that the use of statistical methods without some incorporation of linguistic knowledge is insufficient as well.
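A minimal count-based tagger of the sort just described can be sketched as follows. The tiny pretagged corpus, the tag names, and the add-one smoothing are invented stand-ins; a real system would train on a large corpus with a proper smoothing scheme.

```python
# A count-based HMM tagger: P(w | t) and P(t_next | t) are estimated by
# simply counting a pretagged corpus (with add-one smoothing standing in
# for a real smoothing scheme); decoding uses the Viterbi algorithm.
from collections import defaultdict
from math import log

tagged_corpus = [
    [("can", "modal"), ("we", "pron"), ("can", "verb"), ("the", "det"), ("can", "noun")],
    [("we", "pron"), ("can", "modal"), ("see", "verb"), ("the", "det"), ("saw", "noun")],
    [("the", "det"), ("saw", "noun"), ("can", "verb"), ("the", "det"), ("can", "noun")],
]

emit = defaultdict(int)       # (tag, word) counts, for P(w | t)
trans = defaultdict(int)      # (prev_tag, tag) counts, for P(t_next | t)
tag_count = defaultdict(int)
ctx = defaultdict(int)        # how often each tag occurs as a left context
for sent in tagged_corpus:
    prev = "<s>"
    for word, tag in sent:
        emit[(tag, word)] += 1
        trans[(prev, tag)] += 1
        tag_count[tag] += 1
        ctx[prev] += 1
        prev = tag

tags = sorted(tag_count)
vocab = {w for sent in tagged_corpus for w, _ in sent}

def p_emit(tag, word):
    return (emit[(tag, word)] + 1) / (tag_count[tag] + len(vocab))

def p_trans(prev, tag):
    return (trans[(prev, tag)] + 1) / (ctx[prev] + len(tags))

def viterbi(words):
    """Find the tag string maximizing the product of P(w|t) P(t|t_prev)."""
    delta = {t: log(p_trans("<s>", t)) + log(p_emit(t, words[0])) for t in tags}
    backptrs = []
    for word in words[1:]:
        new_delta, ptr = {}, {}
        for t in tags:
            best = max(tags, key=lambda p: delta[p] + log(p_trans(p, t)))
            new_delta[t] = delta[best] + log(p_trans(best, t)) + log(p_emit(t, word))
            ptr[t] = best
        backptrs.append(ptr)
        delta = new_delta
    t = max(tags, key=lambda x: delta[x])   # best final tag; follow pointers back
    seq = [t]
    for ptr in reversed(backptrs):
        t = ptr[t]
        seq.append(t)
    return seq[::-1]

print(viterbi("can we can the can".split()))
# ['modal', 'pron', 'verb', 'det', 'noun']
```

Even this toy model resolves the three occurrences of can differently, because the transition counts make modal-pron-verb-det-noun the highest-probability tag path.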
This linguistic knowledge may either be represented in implicit form, as in the use of a pretagged corpus here, or encoded explicitly in the form of a grammar.[2] In the next few years, I believe we are going to see stochastic techniques and linguistic knowledge more and more deeply interleaved.

[1] Since sentence probabilities are estimated by multiplying together many estimates of local probabilities, any probability estimate of zero leads to a zero probability for the entire string. Since any direct estimate is based on only finite data, it is important to assume that any event not observed at all has some very small, but nonzero, probability. How best to perform this smoothing of probability estimates is a central technical issue in applying any of the methods discussed in this chapter.

[2] For readers familiar with logic, this is the distinction between knowledge represented extensionally and knowledge represented intensionally.

The Problem of Unknown Words

In conjunction with this observation it is important to realize that if one simply implemented an HMM for part-of-speech tagging as discussed above, the performance of the resulting system on new material could well be no better than 70 or 80 percent correct. Without exception, input is preprocessed before parts of speech are assigned by an HMM; this preprocessing is often only partially discussed in technical descriptions of part-of-speech taggers. The preprocessing copes with "unseen words," words that were never seen in the training data and for which the system therefore has no prior knowledge of possible parts of speech. It turns out that about half of the word types in the Brown corpus (Francis, 1964; Francis and Kucera, 1982), a carefully balanced representative corpus of American English, appear exactly once (about 32,000 out of 67,000 word types). This is consistent with Zipf's law, the empirical law that the frequency of a word type is inversely proportional to its rank. Nor can this problem be circumvented by some appropriately huge lexicon; a very large number of proper names appear on any newswire for the first time each day.

How can this unseen-word problem be handled? One simple but quite effective technique is to tag each unknown word with the most likely tag given its last three letters, an empirical approximation to simple morphological analysis (Brill, 1992). A useful heuristic for proper nouns in most English text is to use capitalization, often combined with some other heuristics to correct for unknown words used at the beginning of sentences (Weischedel et al., 1993). The key point here is that these techniques for unseen words go beyond purely stochastic techniques, using implicit and explicit linguistic knowledge, although in a trivial way, to get the job done.
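These two unseen-word heuristics can be sketched in a few lines; the handful of training pairs and the tag names below are invented stand-ins for a pretagged corpus.

```python
# Sketch of the unseen-word heuristics described above: guess a tag from
# the word's last three letters (learned from tagged training data),
# with a capitalization check for likely proper nouns. The training
# pairs are invented stand-ins for a real pretagged corpus.
from collections import Counter, defaultdict

training = [
    ("running", "verb"), ("eating", "verb"), ("parsing", "verb"),
    ("quickly", "adverb"), ("slowly", "adverb"), ("nicely", "adverb"),
    ("nation", "noun"), ("station", "noun"), ("relation", "noun"),
]

suffix_tags = defaultdict(Counter)
for word, tag in training:
    suffix_tags[word[-3:].lower()][tag] += 1

def guess_tag(word, sentence_initial=False):
    # Capitalized tokens not at the start of a sentence are likely
    # proper nouns in most English text.
    if word[0].isupper() and not sentence_initial:
        return "proper-noun"
    counts = suffix_tags.get(word[-3:].lower())
    if counts:
        return counts.most_common(1)[0][0]   # most likely tag for this suffix
    return "noun"   # a common default for wholly unknown English words

print(guess_tag("jogging"))      # suffix "ing": tagged as a verb
print(guess_tag("Baybank"))      # capitalized mid-sentence: proper noun
print(guess_tag("gravitation"))  # suffix "ion": tagged as a noun
```

In a full tagger the suffix table would feed smoothed tag probabilities into the HMM rather than a single hard guess, but the linguistic intuition is the same.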
STOCHASTIC PARSING

All work on stochastic parsing begins with the development of the inside/outside algorithm (Baker, 1979), which generalizes the Baum-Welch algorithm for estimating HMMs to the estimation of the parameters of stochastic context-free grammars.[3] Just as each iteration of the Baum-Welch algorithm over some training corpus improves estimates of the parameters of the underlying HMM, as judged by the criterion of maximum likelihood, so the inside/outside algorithm improves parameter estimates of an underlying probabilistic context-free grammar, judged by this same criterion. However, straightforward application of the inside/outside algorithm does not appear to produce effective parsers; the best results to date have yielded parsers with about 35 percent correct parses on fully reserved test material in simple parsing tasks (Fujisaki et al., 1989; Sharman et al., 1990).

[3] For a tutorial introduction to probabilistic context-free grammars and the inside/outside algorithm, see Jelinek et al. (1991) and Lari and Young (1990).

Two problems appear to lie behind this failure. First, for realistic probabilistic context-free grammars (PCFGs) the number of parameters that must be estimated is very large; unless some a priori constraint is provided, n^3 parameters must be estimated for a grammar with n nonterminal categories, categories that label not words but structures, like noun phrase, verb phrase, or sentence. But a worse problem is that the objective function that the inside/outside procedure maximizes, namely the probability of the training corpus given the grammar, is not the objective function that one wants to maximize to train effective parsers. For parsing, the goal is to maximize assignment of the correct grammatical structure: to recursively subdivide the sentence correctly into its constituent grammatical parts, determined, say, by examining similar sentences in a treebank of hand-parsed sentences. Unfortunately, there is no reason to expect that a PCFG whose parameters are estimated by the inside/outside algorithm will assign structures that have the desired constituent structure.

In recent years a range of new grammatical formalisms have been proposed that some suggest have the potential to solve a major part of this problem. These formalisms, called lexicalized grammar formalisms, express grammars in which the entire grammar consists of complex structures associated with individual words, plus some very simple general rules for combining these structures. Such grammar formalisms include combinatory categorial grammars (CCGs), lexicalized tree-adjoining grammars (LTAGs), and link grammars (Joshi and Schabes, 1992; Steedman, 1993).
In these lexicalized formalisms each word can be thought of as a tree fragment; the full grammatical analysis of a sentence is formed by specifying how and in what order the fragments associated with each word in the sentence combine. Words may themselves be ambiguous between different "parts of speech," here differing tree fragments. In these grammar formalisms the bulk of parsing a sentence is just deciding which part of speech to assign to each word. Given this property, perhaps some way can be found to extend the inside/outside algorithm appropriately so that its objective function maximizes the probabilities of strings of part-of-speech-tagged words. If so, it is just a matter of extending the search space to handle the large number of complex part-of-speech structures of lexicalized grammars.[4]

[4] I thank Aravind Joshi for the above observation.

Constraining the Inside/Outside Algorithm

Recently, a number of experiments have been performed that combine the inside/outside algorithm with some form of linguistic knowledge. In a recent experiment by Pereira and Schabes (1992), a modified version of the inside/outside algorithm was applied to a corpus that had been manually annotated with a skeletal syntactic bracketing by the Penn Treebank Project (Brill et al., 1990; Marcus et al., 1993). In this experiment the inside/outside algorithm was modified to consider only PCFG rules that did not violate the skeletal bracketing of the corpus, zeroing out many of the n^3 parameters a priori. The algorithm was then trained on a corpus of only 770 sentences collected in the Air Travel Information System (ATIS) domain (Hemphill et al., 1990). The evaluation was based on the "crossing brackets" parser evaluation metric of Black et al. (1991); this measure counts the number of brackets inserted during parsing that are consistent with the correct bracketing of the sentence.[5] Without constraint, the algorithm achieved 35 percent bracketing accuracy on reserved test materials but achieved 90 percent bracketing accuracy when constrained by the annotated corpus.

Conditioning PCFG Rules on Linguistic Context

One new class of models uses linguistic knowledge to condition the probabilities of standard probabilistic context-free grammars. These new models, which in essence augment PCFG grammar rules with probabilistic applicability constraints, are based on the hypothesis that the inability of PCFGs to parse with high accuracy is due to their failure to model crucial aspects of linguistic structure relevant to the appropriate selection of the next grammar rule at each point within a context-free derivation.
Probabilities in the standard stochastic context-free model are conditioned only on the type of nonterminal that the grammar is about to expand; the key idea of these new models is that this provides insufficient linguistic context to adequately model the probabilities of rule expansion. One such parser, that of Magerman and Marcus (1991a, 1991b), assumes that the expansion of any nonterminal is conditioned on the type of nonterminal, the most likely part-of-speech assignments for the next several words in the parser's input stream, and the rule that has generated the particular nonterminal that the parser is trying to expand. For example, the rule "NP -> pronoun" might have a different probability when it expands the NP in the rule "S -> NP VP" than when it expands the NP in the rule "VP -> NP NP". Tested on a corpus of sentences from the Massachusetts Institute of Technology's Voyager domain (Zue et al., 1990), this parser correctly parsed 89 percent of a reserved test set. A sample list of sentences from this corpus, with a length distribution typical of the corpus as a whole, is given in Figure 3.

[5] Notice that this is a much easier measure than the percentage of sentences parsed correctly; if one of, say, 33 brackets is inconsistent in a given sentence, the sentence is 97 percent correct by the bracket measure and simply wrong by the sentences-correct measure.

I'm currently at MIT
Forty-five Pearl Street
What kind of food does LaGroceria serve
Is there a Baybank in Central Square
Where is the closest library to MIT
What's the address of the Baybank near Hong Kong
What's the closest ice cream parlor to Harvard University
How far is Bel Canto's from Cambridge Street in Cambridge
Is there a subway stop by the Mount Auburn Hospital
Can I have the phone number of the Cambridge City Hall
Can you show me the intersection of Cambridge Street and Hampshire Street
How do I get to the closest post office from Harvard University
Which subway stop is closest to the library at forty-five Pearl Street

FIGURE 3 Sample sentences from the Massachusetts Institute of Technology's Voyager corpus.

Although the performance of this algorithm is quite impressive in isolation, the sentences in this corpus are somewhat simpler in structure than those in other spoken language domains and are certainly much simpler than sentences from newswire services that were the target of the parser evaluation discussed in the introduction to this chapter. On the other hand, a simple PCFG for this corpus parses a reserved test set with only about 35 percent accuracy, comparable to PCFG performance in other domains.
If the probability of each rule is conditioned on both the current nonterminal and the rule immediately above that gave rise to the current nonterminal, then performance improves to about 50 percent accuracy. Conditioning each rule in addition on the expected part of speech of the next several words increases performance to 87.5 percent accuracy. The key point here again is that combining a very simple stochastic framework with a little bit of linguistic knowledge greatly increases performance over each alone.

Many parsing techniques are now emerging that combine stochastic techniques with linguistic knowledge in a number of different ways. Again, as discussed briefly above, linguistic knowledge can be

[...]
rich set of linguistically relevant questions in combination with decision tree techniques. These questions examine not only syntactic properties but lexical and class-based information as well, thus combining a much richer set of linguistic knowledge sources than any other model to date. The decision tree uses this set of questions to search for the grammar implicit in a very large hand-annotated corpus. Published reports of early stages of this work indicate that this technique is 70 percent correct on computer manual sentences of length 7 to 17 words, where, to count as correct, each parse must exactly match the prior hand analysis of that sentence in the test corpus, a more stringent test criterion than any other result mentioned here.

While this last experiment uses one uniform statistical technique, decision trees, to make all parsing decisions, some recent work suggests that effective parsing might be done by a suite of interacting parsing experts, each handling a particular grammatical phenomenon. Perhaps the clearest example of this is a recent technique to resolve the ambiguous attachment of prepositional phrases. Consider the sentence I saw the man with the telescope; here the prepositional phrase with the telescope might modify the man, meaning I saw the man who had a telescope, or it might modify the main verb saw, meaning I used the telescope to see the man. If the sentence were instead I saw the planet with the telescope, the prepositional phrase would certainly modify the main verb, but if it were I saw the man with the hat the prepositional phrase would clearly modify the man. Here, as in many other cases, it becomes clear that a decision about grammatical structure depends crucially on the properties of the lexical items themselves.
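One way to make this intuition concrete is to compare, for a given preposition, how strongly it associates with the main verb and with the preceding noun, and attach the phrase to the stronger associate. The sketch below is a simplified version of the likelihood-ratio comparison described in the text; all counts are invented toy numbers standing in for statistics gathered from a parsed corpus.

```python
# Toy prepositional-phrase attachment by lexical association. The
# counts are invented stand-ins for real parsed-corpus statistics.
from math import log2

# cooc[(head, prep)]: times the head was seen with a PP headed by prep;
# occ[head]: total occurrences of the head word.
cooc = {("saw", "with"): 40, ("man", "with"): 40, ("planet", "with"): 1}
occ = {"saw": 1000, "man": 700, "planet": 300}

def assoc(head, prep):
    """log2 P(prep | head), with add-one smoothing for unseen pairs."""
    return log2((cooc.get((head, prep), 0) + 1) / (occ[head] + 1))

def attach(verb, noun, prep):
    """Attach the PP to whichever head is more strongly associated."""
    score = assoc(verb, prep) - assoc(noun, prep)
    return "verbal" if score > 0 else "nominal"

print(attach("saw", "planet", "with"))  # planets rarely take "with" PPs
print(attach("saw", "man", "with"))     # "man with ..." is common here
```

Under these counts the planet sentence attaches the PP to the verb, while the man sentence attaches it to the noun; the published technique additionally uses a proper likelihood-ratio test over much larger counts.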
A technique that uses likelihood ratios to compare the strength of association between the preposition and the main verb with the strength of association between the preposition and the preceding noun correctly assigns about 80 percent of prepositional phrases in sentences from the AP newswire with structure identical to the examples here (Hindle and Rooth, 1993). It is interesting to note that human judges, given the same information, do this task with about 85 to 87 percent accuracy. This experiment also points out the key role of lexical properties in deciding grammatical structure. Its success suggests that the crucial role of grammar is just to mediate the properties of the lexical items themselves. This would suggest, as does the recent work on lexicalized grammars discussed above, that the words themselves are primary.

Annotated Corpora

Before leaving the question of syntax, I would like to say a word about the production of annotated corpora themselves. There are, at

the moment, two large grammatically annotated corpora for English: the IBM/Lancaster Treebank (Garside et al., 1987) and the Penn Treebank (Marcus et al., 1993). As of this time, only materials from the second are generally available; they are distributed through the Linguistic Data Consortium. Because of this, and because of my familiarity with this corpus, the Penn Treebank is the focus here. It now consists of 4.5 million words of text tagged for part of speech, with about two-thirds of this material also annotated with a skeletal syntactic bracketing. All of this material has been hand corrected after processing by automatic tools. The two largest components of the corpus are over 1.6 million words of material from the Dow Jones News Service, hand parsed, with an additional 1 million words tagged for part of speech, and a skeletally parsed version of the Brown corpus (Francis, 1964; Francis and Kucera, 1982), the classic 1-million-word balanced corpus of American English. This material has already been used for purposes ranging from serving as a gold standard for parser testing to serving as a basis for the induction of stochastic grammars to serving as a basis for quick lexicon induction.

There is now much interest in very large corpora that have quite detailed annotation, assuming that such corpora can be efficiently produced. The group at Penn is now working toward providing a 3-million-word bank of predicate-argument structures. This will be done by first producing a corpus annotated with an appropriately rich syntactic structure and then automatically extracting predicate-argument structure, at the level of distinguishing logical subjects and objects and distinguishing a small range of particular adjunct classes. This corpus will be annotated by automatically transforming the current treebank into a level of structure close to the intended target and then completing the conversion by hand.
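Skeletal bracketings of the kind used in these treebanks are conventionally written as nested parentheses; a small reader for that format is easy to sketch. The example tree and tag names below are illustrative, not drawn from the Treebank files themselves.

```python
# A tiny reader for nested-parenthesis bracketings of the kind used in
# treebank annotation. The example tree is invented for illustration.

def read_tree(text):
    """Parse "(S (NP ...) ...)" into nested [label, children...] lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0
    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = [tokens[pos]]          # constituent label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())  # nested constituent
            else:
                node.append(tokens[pos])  # a word token
                pos += 1
        pos += 1                      # consume the closing ")"
        return node
    return parse()

tree = read_tree("(S (NP (PRP We)) (VP (VBD saw) (NP (DT the) (NN man))))")
print(tree[0])      # the root label, S
print(tree[2][0])   # the second child's label, VP
```

Once trees are in this nested-list form, tasks such as counting rule expansions for a PCFG or computing bracket-consistency scores against another tree reduce to simple recursive traversals.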
LEXICAL SEMANTICS AND BEYOND

We now turn to an area of very recent progress: lexical semantics. At first inspection, it would appear most unlikely that statistical techniques would be of much use for either the discovery or the representation of the meanings of words. Surprisingly, some preliminary work over the past several years indicates that many aspects of lexical semantics can be derived from existing resources using statistical techniques.

Several years ago it was discovered that methods from statistics and information theory could be used to "tease out" distinctions between words, as an aid to lexicographers developing new dictionaries (Church et al., 1991). As an example, how could one distinguish the meanings of food and water? Figure 4 shows the mutual information score,6 an information-theoretic measure, between various verbs and the nouns food and water in an automatically parsed corpus, where either food or water is the object of the verb or, more precisely, where one or the other is the head of the noun phrase that is the object of the verb.

FIGURE 4 What do you typically do with food and water? Computed over parsed AP corpus (N = 24.7 million SVO triples). (Adapted from Church et al., 1991.)

The corpus used in this experiment consists of 25 million subject-verb-object triples automatically extracted from the AP newswire by a parser for unrestricted text. The mutual information score is high if the verb and noun tend to occur together and tends toward 0 if the verb and noun occur together no more often than expected by chance. Because this measure is the log of a ratio, scores such as those shown in the figure are quite high. What kinds of things can you do with food? Well, according to the AP newswire, you can hoard it, go without it, eat it, consume it, etc. With water, you can conserve it, boil it, ration it, pollute it, etc.7

6 The mutual information statistic is a measure of the interdependence of two signals. It is defined as MI(x,y) = log [P(x,y) / (P(x) P(y))].

7 Because English contains a large number of so-called phrasal verbs, whenever the parser encounters a verb followed immediately by a prepositional phrase, as in go without food, the parser creates a potential phrasal verb, for example, go_without, and a triple where the object of the preposition (here food) is taken as the object of the putative phrasal verb (here go_without).

This indeed begins to reveal something about the meaning of

these verbs, based on the distributional properties of these words, that is, what other words they co-occur with.

FIGURE 5 What do you do more with food than with water? Computed over parsed AP corpus (N = 24.7 million SVO triples). (Adapted from Church et al., 1991.)

To differentiate these two words somewhat more sharply, one might ask what might be done more or less to one than the other. Here, an appropriate metric is the t-score, which contrasts the conditional probability of seeing food as object given a particular verb with the conditional probability of seeing water as object given that verb.8 Figure 5 shows that one eats or buys food far more than water and that one pumps or drinks water far more than food. Perhaps surprisingly, these descriptions get quite close to the heart of the difference between food and water.

8 The formula computed is t = [P(food|verb) - P(water|verb)] / {σ²[P(food|verb)] + σ²[P(water|verb)]}^1/2.

These experiments show that statistical techniques can be used to tease out aspects of lexical semantics in such a way that a human lexicographer could easily take advantage of this information. However, for computers to utilize such information, some kind of representation must be found to encode semantic information in the machine. What kinds of representations might be appropriate for automatic discovery procedures?

Much research on automatic machine translation is now being done using the parallel French-English transcripts of the proceedings of the Canadian parliament. This corpus has been used to test a statistical technique that finds the most reliable local clue in context to tease apart different senses of the same word in the

source language, representing that meaning as the translation in the target language (Brown et al., 1991). An example of this technique is shown in Figure 6. Here, the most useful single clue for the translation of an instance of the French word prendre is the particular word that occurs as the first noun to its right. As shown in Figure 6(a), the identity of this word provides, on average, 0.381 bits of information as to how to distinguish between two different senses of prendre. Figure 6(b) shows some of the French words that distinguish between one sense of prendre (part, mesure, note, exemple, etc.) and another (decision, parole, connaissance, etc.). As shown in Figure 6(c), the most likely translation into English for sense 1 is the verb take; for sense 2, the most likely translation is the English verb make.

FIGURE 6 The two senses of prendre translate as take or make. (Adapted from Brown et al., 1991.)
(a) For prendre the noun to the right is maximally informative.
    Word: prendre   Informant: right noun   Information: 0.381 bits
(b) Some French words that are informant values for each sense.
    Sense 1: part, mesure, note, exemple, temps, initiative
    Sense 2: decision, parole, connaissance, engagement, fin, retraite
(c) Sense 1 translates as take; sense 2 as make.
    Pr(English | Sense 1): to_take .433, to_make .061, to_do .051, to_be .045
    Pr(English | Sense 2): to_make .186, to_speak .105, to_rise .066, to_take .066, to_be .058, decision .036, to_get .025, to_have .021

This technique has shown itself to be quite useful in current work in machine translation. As a technique for word sense disambiguation, it

has the clear drawback that often a word with multiple senses in language 1 has many of the same senses in language 2, so it can be used only when the target language does split the senses of interest.

Other researchers have found a richer and more robust representation for words in the internal representation used within WordNet, a large hand-built computer-based lexicon (Beckwith et al., 1991). Because WordNet is fundamentally computer based, it can be organized as a large graph of different kinds of relations between words. These relations include not only relatively standard ones, such as X is a synonym of Y or X is an antonym of Y, but also many other relations, such as X is a kind of Y and X is a part of Y. Concepts within WordNet are represented by "synonym sets," sets of words that all share a core meaning. One simple representation of a meaning of a word, then, is just the synonym set, or synset, of the words that share that meaning.

This hierarchy has been used to investigate the automatic classification of verbs by the kinds of objects that they take, a first step toward determining the selectional restrictions of verbs automatically (Resnik, 1992). In this work, synonym sets are used to represent classes of objects in both the input and the output of a program that computes a variant of the mutual information statistic discussed above. Using synonym sets for the output provides the general classification one seeks, of course. Using synonym sets for the input as well has an important added advantage: it provides a solution to the sparse data problem that plagues work in lexical statistics. Many of the counts of verb-object pairs that make up the input to this program are very small and therefore unreliable, particularly given a corpus as small as the million-word Brown corpus (Francis, 1964; Francis and Kucera, 1982) used in this experiment.
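A toy computation in the style of Resnik's class-based statistic shows how pooling over synonym sets works. The mini "synsets" and verb-object counts below are invented stand-ins for WordNet classes and corpus data, not his actual inputs.

```python
from collections import Counter
from math import log2

# Invented mini "synsets" standing in for WordNet noun classes.
synset_of = {"door": "entrance", "gate": "entrance",
             "bag": "container", "box": "container", "can": "container"}

# Sparse hypothetical verb-object observations: each pair is rare on its own.
pairs = [("open", "door"), ("open", "gate"), ("open", "box"), ("open", "bag"),
         ("fill", "box"), ("fill", "can"), ("fill", "bag"), ("fill", "can")]

N = len(pairs)
joint = Counter((v, synset_of[n]) for v, n in pairs)  # pooled verb-class counts
verbs = Counter(v for v, _ in pairs)
classes = Counter(synset_of[n] for _, n in pairs)

def assoc(verb, cls):
    """Mutual-information-style score over classes:
    log2( P(verb, class) / (P(verb) P(class)) )."""
    return log2((joint[(verb, cls)] / N) /
                ((verbs[verb] / N) * (classes[cls] / N)))

print(round(assoc("open", "entrance"), 2))
```

Pooling door and gate into a single entrance class yields a usable association score where the individual pair counts would be too small to trust.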
By pooling data for particular nouns into the synonym sets they fall into, much of this sparse data problem can be solved. Figure 7 gives one example of the performance of Resnik's statistic. These are the highest-ranking synonym sets for objects of the verb open. The names of the synonym sets were hand assigned within WordNet. Figure 8 gives the single highest-ranking synonym set for each of a list of common verbs. These two experiments show that a statistical approach can do surprisingly well in extracting major aspects of the meaning of verbs, given the hand encoding of noun meanings within WordNet.

These experiments suggest that it might be possible to combine the explicit linguistic knowledge in large hand-built computational lexicons, the implicit knowledge in a skeletally parsed corpus, and some novel statistical and information-theoretic methods to automatically determine a wide variety of aspects of lexical semantics.

The work described above is typical of much recent work in the

area of lexical discovery. Other recent work has focused on, for example, the use of distributional techniques for the discovery of collocations in text (Smadja, 1993) and of subcategorization frames for verbs (Brent, 1993) and to uncover lexical semantic properties (Pustejovsky et al., 1993). Much other work has been done in this area; the references given here are typical rather than exhaustive.

FIGURE 7 Classes of things that are opened. (Adapted from Resnik, 1992.)
    SynSet Name          Typical Members
    entrance             door
    mouth                mouth
    repository           store, closet, locker, trunk
    container            bag, trunk, locker, can, box, hamper
    time_period          tour, round, season, spring, session, week, evening, morning, saturday
    oral_communication   discourse, engagement, relation, reply, mouth, program, conference, session
    writing              scene, book, program, statement, bible, paragraph, chapter

FIGURE 8 "Prototypical" classes of objects for common verbs. (Adapted from Resnik, 1992.)
    Verb    Most Highly Associated Object SynSet
    ask     question
    call    someone
    climb   stair
    cook    repast
    draw    cord
    drink   beverage
    eat     nutrient
    lose    sensory_faculty
    play    part
    pour    liquid
    pull    cover
    push    button
    read    written_material
    sing    music

A QUESTION FOR TOMORROW

While the focus above has been on the effectiveness of recent stochastic and statistical techniques, there is some evidence that this

effectiveness is due in large measure to the empirical corpus-based nature of these techniques rather than to the power of stochastic modeling. Surprisingly, symbolic learning techniques have performed as well as stochastic methods on two of the tasks considered above, despite the fact that they learn only simple symbolic rules, with only simple counting used during training, and then only to choose one potential rule over another. This raises the question of whether the effectiveness of the stochastic techniques above is essentially due to the fact that they extract linguistic structure from a large collection of natural data or is the result of their statistical nature. This issue, I believe, will be resolved in the next several years.

In one experiment a very simple symbolic learner integrated with a parser for free text produced a set of symbolic lexical disambiguation rules for that parser. The parser, running with the newly augmented grammar, if viewed only as a part-of-speech tagger, operates at about 95 percent word accuracy (Hindle, 1989). What makes this result all the more surprising is that this parser works strictly left to right in a fully deterministic fashion.

Recently, a very simple symbolic learning technique called error-based transformation learning was applied to the tagging problem (Brill, 1992). The resultant tagger operates with a set of only 160 simple rules, plus a table of the most likely tag in isolation for each word. The tagger begins by tagging each word with its most likely tag in isolation and then applies each of the 160 rules, in order, to the entire corpus. These rules are of the form "In a context X, change tag A to tag B." This tagger operates at about 96 percent word accuracy. This learning algorithm has also been applied to the problem of bracketing English text, with very surprising results (Brill, 1993).
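The error-based transformation tagger can be sketched in miniature. The lexicon and the single contextual rule below are invented examples of the rule format, not Brill's actual rules.

```python
# Most-likely-tag-in-isolation table (hypothetical miniature lexicon).
most_likely = {"the": "DET", "can": "NOUN", "rust": "NOUN"}

# Learned rules, applied in order to the whole corpus. Each one reads:
# "In a context X (here: the tag to the left), change tag A to tag B."
rules = [("NOUN", "NOUN", "VERB")]   # after a NOUN, retag a NOUN as a VERB

def tag(words):
    # Initial-state annotator: the most likely tag for each word in isolation.
    tags = [most_likely.get(w, "NOUN") for w in words]
    # Apply each transformation, in order, across the whole sequence.
    for left, a, b in rules:
        for i in range(1, len(tags)):
            if tags[i] == a and tags[i - 1] == left:
                tags[i] = b
    return tags

print(tag(["the", "can", "rust"]))
```

Because later rules see the output of earlier ones, the ordering of the rule list matters; here the single rule corrects "rust" from its most likely tag (noun) to a verb.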
The learner begins by assuming that English is strictly right branching and then learns a set of rules using exactly the same learning technique as used in the tagger discussed above, except that here the potential environments for rule application are very simple configurations in the bracketed string, for example, "if some category X is preceded by a right paren, then . . ." or "if a left paren falls between category X and category Y, then . . . ." There are only two possible rule operations, which simply transform the binary branched trees with bracketings ((A B) C) and (A (B C)) into each other. The parser was trained on a small 500-sentence bracketed subset of material from the Wall Street Journal, of sentences less than 15 words in length, and acquired about 160 rules. Tested on a reserved test set of sentences of the same length, the parser bracketed 54 percent of the sentences completely consistently with the original bracketing; 72 percent of the sentences were bracketed with one bracketing error or less. Trained on only 250 sentences of length n, n < 25, the parser again acquired about 160 rules and parsed a similar reserved test set at 30 percent of sentences bracketed correctly. In recent unpublished experiments this same technique was applied to the problem of labeling these bracketed but unlabeled structures, achieving about 95 percent correct labeling, by node.

Perhaps this learning technique will lead to an even more powerful stochastic method of some kind. What is unique about this learner is that each rule applies to the output of the previous rule. But perhaps it will turn out that the power of these methods comes from the use of a corpus itself. Time will tell.

ACKNOWLEDGMENTS

This work was partially supported by Defense Advanced Research Projects Agency (DARPA) Grant No. N0014-85-K0018, by DARPA and the Air Force Office of Scientific Research (AFOSR) jointly under Grant No. AFOSR-90-0066, and by Grant No. DAAL 03-89-C0031 PRI. Thanks to Eric Brill, Ken Church, Aravind Joshi, Mark Liberman, David Magerman, Yves Schabes, and David Yarowsky for helpful discussions. Thanks also to two anonymous reviewers for many excellent comments and suggestions.

REFERENCES

Baker, J. K. 1979. Trainable grammars for speech recognition. Proceedings of the Spring Conference of the Acoustical Society of America.

Beckwith, R., C. Fellbaum, G. Gross, and G. Miller. 1991. WordNet: A lexical database organized on psycholinguistic principles. Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon, U. Zernik (ed.), Lawrence Erlbaum, Hillsdale, N.J., pp. 211-232.

Black, E., S. Abney, F. Flickenger, R. Grishman, P. Harrison, D. Hindle, R. Ingria, F. Jelinek, J. Klavans, M. Liberman, M. Marcus, S. Roukos, B. Santorini, and T. Strzalkowski. 1991. A procedure for quantitatively comparing the syntactic coverage of English grammars. Proceedings of the Fourth DARPA Speech and Natural Language Workshop, February.

Black, E., F. Jelinek, J. Lafferty, R. Mercer, and S. Roukos. 1992a.
Decision tree models applied to the labeling of text with parts-of-speech. Proceedings of the DARPA Speech and Natural Language Workshop, February, pp. 117-121.

Black, E., F. Jelinek, J. Lafferty, D. M. Magerman, R. Mercer, and S. Roukos. 1992b. Towards history-based grammars: Using richer models for probabilistic parsing. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics.

Black, E., J. Lafferty, and S. Roukos. 1993. Development and evaluation of a broad-coverage probabilistic grammar of English-language computer manuals. Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics.

Brent, M. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19:243-262.

Brill, E. 1992. A simple rule-based part of speech tagger. Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy.

Brill, E. 1993. Automatic grammar induction and parsing free text: A transformation-based approach. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics.

Brill, E., D. Magerman, M. Marcus, and B. Santorini. 1990. Deducing linguistic structure from the statistics of large corpora. Proceedings of the DARPA Speech and Natural Language Workshop, June, pp. 275-282.

Brown, P., S. Della Pietra, V. Della Pietra, and R. Mercer. 1991. A statistical approach to sense disambiguation in machine translation. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, Calif.

Church, K. 1985. Stress assignment in letter to sound rules for speech synthesis. Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics, pp. 246-253.

Church, K. 1988. A stochastic parts program and noun phrase parser for unrestricted text. Proceedings of the Second Conference on Applied Natural Language Processing, 26th Annual Meeting of the Association for Computational Linguistics, pp. 136-143.

Church, K., W. Gale, P. Hanks, and D. Hindle. 1991. Using Statistics in Lexical Analysis. AT&T Bell Laboratories Technical Memorandum.

Cutting, D., J. Kupiec, J. Pederson, and P. Sibun. 1992. A practical part-of-speech tagger. Proceedings of the Third Conference on Applied Natural Language Processing, ACL.

Francis, W. N. 1964. A Standard Sample of Present-Day English for Use with Digital Computers. Report to the U.S. Office of Education on Cooperative Research Project No. E-007. Brown University, Providence, R.I.

Francis, W. N., and H. Kucera. 1982. Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin, Boston.

Fujisaki, T., F. Jelinek, J. Cocke, E. Black, and T. Nishino. 1989. A probabilistic method for sentence disambiguation.
Proceedings of the 1st International Workshop on Parsing Technologies, Carnegie-Mellon University, Pittsburgh, Pa.

Garside, R., G. Leech, and G. Sampson. 1987. The Computational Analysis of English: A Corpus-Based Approach. Longman, London.

Harris, Z. 1951. Methods in Structural Linguistics. University of Chicago Press, Chicago.

Hemphill, C., J. Godfrey, and G. Doddington. 1990. The ATIS spoken language systems pilot corpus. Proceedings of the Third DARPA Speech and Natural Language Workshop, February.

Hindle, D. 1989. Acquiring disambiguation rules from text. Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics.

Hindle, D., and M. Rooth. 1993. Structural ambiguity and lexical relations. Computational Linguistics, 19:103-120.

Jelinek, F., J. D. Lafferty, and R. L. Mercer. 1991. Basic methods of probabilistic context-free grammars. Continuous Speech Recognition Group, IBM T. J. Watson Research Center.

Joshi, A., and Y. Schabes. 1992. Tree adjoining grammars and lexicalized grammars. Tree Automata and Languages, M. Nivat and A. Podelski (eds.). Elsevier, New York.

Lari, K., and S. J. Young. 1990. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35-56.

Magerman, D., and M. Marcus. 1991a. Parsing the Voyager domain using Pearl. Proceedings of the Fourth DARPA Speech and Natural Language Workshop, February.

Magerman, D., and M. Marcus. 1991b. Pearl: A probabilistic chart parser. Proceedings, Fifth Conference of the European Chapter of the Association for Computational Linguistics (EACL), Berlin, April.

Marcus, M., B. Santorini, and M. A. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313-330.

Merialdo, B. 1991. Tagging text with a probabilistic model. Proceedings of ICASSP-91, pp. 809-812.

Pereira, F., and Y. Schabes. 1992. Inside-outside reestimation from partially bracketed corpora. Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics.

Pustejovsky, J., S. Bergler, and P. Anick. 1993. Lexical semantic techniques for corpus analysis. Computational Linguistics, 19:331-358.

Resnik, P. 1992. WordNet and distributional analysis: A class-based approach to lexical discovery. Workshop Notes, AAAI-92 Workshop on Statistically-Based NLP Techniques, July.

Schabes, Y., M. Roth, and R. Osborne. 1993. Parsing the Wall Street Journal with the inside-outside algorithm. Proceedings, Sixth Conference of the European Chapter of the Association for Computational Linguistics (EACL), Utrecht, April.

Sharman, R. A., F. Jelinek, and R. Mercer. 1990. Generating a grammar for statistical training. Proceedings of the Third DARPA Speech and Natural Language Workshop, February.

Smadja, F. 1993. Retrieving collocations from text: Xtract. Computational Linguistics, 19:143-178.

Steedman, M. 1993. Categorial grammar. Lingua, 90:221-258.

Weischedel, R., M. Meteer, R. Schwartz, L. Ramshaw, and J. Palmucci. 1993. Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19:359-382.

Zue, V., J. Glass, D. Goodine, H. Leung, M. McCandless, M. Philips, J. Polifroni, and S. Seneff. 1990. Recent progress on the VOYAGER system.
Proceedings of the Third DARPA Speech and Natural Language Workshop, June.