Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 49
Automatic Text Understanding of
Content and Text Quality
Ani nenkovA
University of Pennsylvania
Reading involves two rather different kinds of semantic processing. One is
related to understanding what information is conveyed in the text and the other to
appreciating the style of the text—how well or poorly it is written. For people, text
content and stylistic quality are inextricably linked. For machines, robust under-
standing of written material has become feasible in many contexts but text quality
has been out of reach so far. The mismatch matters a great deal because people
rely on machines to locate and navigate information sources and increasingly read
machine-generated text, for example as machine translations or text summaries.
In this presentation I discuss some of the simple and elegant intuitions that
have enabled semantic processing in machines, as well as some of the emerging
directions in text quality assessment.
TEXT SEMANTICS (MEANING)
Reading and Understanding the Web
A single insight about language semantics has led to successes in a variety
of automatic text understanding tasks. Words tend to appear in specific contexts
and these contexts convey rich information about the type of word, its meaning,
and connotation (Harris, 1968). Computers can learn much semantic informa -
tion without human supervision simply by collecting statistics of (hundreds of)
thousands of texts.
The context of a target word, consisting of other phrases or words that occur
nearby in texts more often than expected by chance, is accumulated over large
text collections. For example, the word tea may be characterized by the context
49
OCR for page 50
50 FRONTIERS OF ENGINEERING
[drink:60, green:55, milk:40, sip:30, enjoy:10, . . .].
Each entry shows a word that appeared five words before or after tea, and the
number of times the pair was seen in a large text collection. Taking just the number
of occurrences of context words makes the representation even more convenient,
because various standard (geometric) approaches exist for comparing the dis -
tance between numeric vectors. In this manner, a machine can compute the simi -
larity between any two words.
Here is an example from Pantel and Lin (2002) of the 15 words most similar
to wine computed by this approach:
Wine: beer, white wine, red wine, Chardonnay,
champagne, fruit, food, coffee, juice, Cabernet,
cognac, vinegar, Pinot noir, milk, vodka, . . .
The list may not look immediately useful but is certainly impressive if one
considers how little similarity there is in the sequence of letters wine, beer,
Chardonnay.
Building upon these representations, it has become possible to automati-
cally discover words with multiple senses by clustering words similar to them
(plant: (plant, factory, facility, refinery) (shrub,
ground cover, perennial, bulb)), finding synonyms and antonyms.
To aid analysis of customer reviews, researchers at Google developed a large
lexicon of almost 200,000 positive and negative words and phrases, identified
through their similarity to a handful of predefined positive or negative words such
as excellent, amazing, bad, horrible. Among the positive phrases
in the automatically constructed lexicon were cute, fabulous, top of
the line, melt in your mouth; negative examples included subpar,
crappy, out of touch, sick to my stomach (Velikovich et al., 2010).
Another line of research in semantic processing exploits the stable meaning
of some contexts. For example, patterns like “X such as Y,” if occurring often in
texts, is very likely an indicator that Y is a kind of X (i.e., “Red wines such as
Cabernet and Pinot noir . . .”). Similarly a phrase like “The mayor of X” is a good
indicator that X is a city. NELL (Never Ending Language Learning, http://rtw.
ml.cmu.edu/rtw/) is a system that constantly learns unary and binary predicates,
corresponding to categories and relations such as isCity(Philadelphia)
and playsInstrument(George_Harrison, guitar). The learn-
ing of each type of fact starts with minimal supervision in the form of several
examples of category instances or entities between which a relation holds, given
by the researchers. Then the system starts an infinite loop in which it finds web
pages that contain the examples, finds phrase patterns that typically occur with the
examples, selects the best patterns that indicate the predicate with high probability,
and then applies the patterns to new texts to discover more instances for which
the predicate is true. Different flavors of this approach to machine understanding
OCR for page 51
51
AUTOMATIC TEXT UNDERSTANDING OF CONTENT AND TEXT QUALITY
have been developed to help search and question answering (Etzioni et al., 2008,
Pasca et al., 2006).
Reading and Understanding a Text
In the semantic processing I have discussed so far, the computer reads numer-
ous textual documents with the objective to learn representations of words, come
up with a lexicon of phrases with positive or negative connotation, or learn cat -
egory instances and relations. A more difficult task for a computer is to understand
a specific text.
Much traditional research related to computer processing of a single text has
relied on supervised techniques. Researchers invested effort to prepare collections
in which human annotators marked positive and negative examples of a semantic
distinction of interest. For example, they could mark the different senses of a
word, the part of speech of words, or would mark that Roger Federer is a person,
Bulgaria is a country. Then features describing the context of the categories of
interest would be extracted from the text, and a statistical classifier would use
the positive and negative examples to combine the features and predict the same
type of information on unseen text. More recently it has become clear that the
unsupervised approach in which computers accumulate knowledge and statistics
from large amounts of text and the supervised approach can be combined effec-
tively and result in better systems for semantic processing.
When reading a specific text, computers also need to resolve what entity in
the document is referred to by pronouns such as “he/his,” “she/her,” and “it/its.”
Systems are far from perfect but are getting better at this task. Usually pronouns
appear in the text near noun phrases, i.e., “the professor prepared his lecture,”
but in other situations gender and number information is necessary to correctly
resolve the pronoun, as in, “John told Mary he had booked the trip.” Machines
can rather accurately learn the likely gender of names and nouns, again by reading
large volumes of text and collecting statistics of co-occurrence. Statistics about
the co-occurrence of a pronoun of a given gender and the immediately preceding
noun or honorifics and names (Mr. John Black, Mrs. Mary White), collected over
thousands of documents, give surprisingly good guesses about the likely gender
of nouns (Bergsma, 2005).
TEXT QUALITY (STYLE)
Automatic assessment of text quality, or style, is a far more difficult task
compared to the acquisition of semantics, or at least considerably less researched.
Much of the effort in my lab has been focused on developing models of text
quality. I will discuss two successful endeavors: prediction of general and specific
sentences and automatic assessment of sentence fluency in machine translation
and summary coherence in text summarization.
OCR for page 52
52 FRONTIERS OF ENGINEERING
A well-written text contains a balanced mix of general overview statements
and specific detailed sentences. If a text contains too many general sentences it
will be perceived as insufficiently informative, and too much specificity can be
confusing for the reader.
To train a classifier, we exploit a resource of 1 million words of Wall Street
Journal text with discourse annotations (Louis and Nenkova, 2011a). The dis-
course annotations, among other things, specify the way two adjacent sentences in
the text are related. There could be an implicit comparison between two statements
(John is always punctual. Mary often arrives late.), or a contingency (causal) rela -
tion (I hurt my foot. I cannot go dancing tonight.), or temporal relations.
One of the discourse relations annotated in the corpus is instantiation. It holds
between two adjacent sentences where the second gives a specific example of infor-
mation mentioned in the first, as in, “He is very smart. He solved the problem in
five minutes.” We considered that the first sentence is general while the second is
specific in all instances of instantiation relation. We computed a number of features
that according to our intuition would distinguish between the two categories. We
expected that the presence of opinion or evaluative statements would characterize
the general sentences as well as unusual use of language that would later be inter-
preted or clarified in a specific sentence. Among the features were
• the length of the sentence.
• the number of opinion or subjective words, derived from existing
dictionaries.
• the specificity of words in the sentences, derived from corpus statistics
as the fraction of documents in one year of New York Times articles that
contain the word. The fewer documents contain the word, the more
specific it is.
• mentions of numbers and people, companies, and geographical locations;
such mentions are detected automatically.
• syntactic features related to adjectives, adverbs, verbs, and prepositions.
• probabilities of sequences of one, two, or three consecutive words com-
puted over one year of New York Times articles.
A logistic regression classifier, trained on around 2,800 examples of general
and specific sentences from instantiation relations, learned to predict the distinc -
tion incredibly well. On a completely independent set of news articles, five differ-
ent people were asked to mark each sentence as general or specific. For sentences
in which all five annotators agreed about the class, the classifier can predict the
correct class with 95 percent accuracy. For examples on which only four out of
the five annotators agreed, the accuracy is 85 percent. For all examples, which
included sentences for which people found it hard to classify in terms of general
and specific, the accuracy of prediction was 75 percent. Moreover, the confidence
OCR for page 53
53
AUTOMATIC TEXT UNDERSTANDING OF CONTENT AND TEXT QUALITY
of the classifier turned out to be highly correlated with annotator agreement, so
it was possible to identify which sentences would not fit squarely into one of the
classes. The degree of specificity of a sentence given by the classifier gives an
accurate indication of how a sentence will be perceived by people.
Applying the general-or-specific classifier to samples of automatic and human
summaries of clusters of news articles has demonstrated that machine summaries
are overly specific and has indicated ways for improving system performance
(Louis and Nenkova, 2011b).
Word co-occurrence statistics and subjective language have also been suc-
cessful in automatically distinguishing implicit comparison, contingency, and
temporal discourse relations (Pitler et al., 2009). Identification of such relations
is not only necessary for semantic processing of text, it is also required for robust
assessment of text quality (Pitler and Nenkova, 2008). Finally, statistics on types,
length, and distance between verb, noun, and prepositional phrases, as well as
probabilities of occurrence and co-occurrence of words, are highly predictive of
the perceived quality of summaries (Nenkova et al., 2010).
REFERENCES
Bergsma, S. 2005. Automatic acquisition of gender information for anaphora resolution. Proceed -
ings of the 18th Conference of the Canadian Society for Computational Studies of Intelligence,
Victoria, Canada, May 9–11, 2005, pp. 342–353.
Etzioni, O., M. Banko, S. Soderland, and D. S. Weld. 2008. Open information extraction from the
web. Communications of the ACM 51(12):68–74.
Harris, Z. S. 1968. Mathematical Structures of Language. New York: Wiley.
Louis, A., and A. Nenkova. 2011a. Automatic identification of general and specific sentences by
leveraging discourse annotations. Proceedings of the International Joint Conference in Natural
Language Processing, Chiang Mai, Thailand, November 8–13, 2011.
Louis, A., and A. Nenkova. 2011b. Text specificity and impact on quality of news summaries. Pro -
ceedings of the Workshop on Monolingual Text-To-Text Generation, Portland, Oregon, June 24,
2011, pp. 34–42.
Nenkova, A., J. Chae, A. Louis, and E. Pitler. 2010. Structural features for predicting the linguistic
quality of text: Applications to machine translation, automatic summarization and human-
authored text. Empirical Methods in Natural Language Generation, edited by E. Krahmer and M.
Theune. Lecture Notes in Artificial Intelligence, Vol. 5790. Berlin: Springer-Verlag. pp. 222–241.
Pantel, P., and D. Lin. 2002. Discovering word senses from text. Proceedings of ACM SIGKDD
Conference on Knowledge Discovery and Data Mining 2002, Edmonton, Canada, July 23–26,
2002, pp. 613–619.
Pasca, M., D. Lin, J. Bigham, A. Lifchits, and A. Jain. 2006. Names and similarities on the web: Fact
extraction in the fast lane. Proceedings of the 21st International Conference on Computational
Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics,
Sydney, Australia, July 2006.
Pitler, E., and A. Nenkova. 2008. Revisiting readability: A unified framework for predicting text
quality. Proceedings of the Conference on Empirical Methods in Natural Language Processing,
Waikiki, Hawaii, October 25–27, 2008.
OCR for page 54
54 FRONTIERS OF ENGINEERING
Pitler, E., A. Louis, and A. Nenkova. 2009. Automatic sense prediction for implicit discourse rela -
tions in text. Proceedings of the 47th Annual Meeting of the Association for Computational
Linguistics and the 4th International Joint Conference on Natural Language Processing of the
Asian Federation of Natural Language Processing, Singapore, August 2–7, 2009, pp. 683–691.
Velikovich, L., S. Blair-Goldensohn, K. Hannan, and R. McDonald. 2010. The viability of web-
derived polarity lexicons. Human Language Technologies: The 2010 Annual Conference of the
North American Chapter of the Association for Computational Linguistics, Los Angeles, Calif.,
June 1–6, 2010, pp. 777–785.