2
Text Categorization and Analysis

David Lewis and Hinrich Schütze

2.1 TEXT CATEGORIZATION

Automatic text categorization is the primary language retrieval technology in content filtering for children. Text categorization is the sorting of text into groups, such as pornography, hate speech, violence, and unobjectionable content. A text categorizer looks at a Web page and decides into which of these groups a piece of text should fall. Applications of text categorization include filtering of e-mail, chat, or Web access; text indexing; and data mining.

Why is content filtering a categorization task? One way to frame the problem is to say that the categories are actions, such as “allow,” “allow but warn,” or “block.” We either want to allow access to a Web page, allow access but also give a warning, or block access. Another way to frame the problem is to say that the categories are different types of content, such as news, sex education, pornography, or home pages. Depending on which category we put the page in, we will take different actions. For example, we want to block pornography and give access to news.

The automation of text categorization requires some input from people; the idea is to mimic what people do. Two parts of the task need to be automated. One is the categorization decision itself: the decision about what we should do with a given Web page. The second is rule creation: we want to determine automatically which rules to apply.

Automation of the categorization decision requires a piece of software that applies rules to text. This is the best architecture because then we can change the behavior by changing the rules rather than rewriting the software every time.

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.

This automatic categorizer applies two types of rules. One type is extensional rules, which explicitly list all sites that cannot be accessed (i.e., “blacklisted” sites) or, alternatively, all sites that can be accessed (e.g., kid-safe zones or “whitelisted” sites). The second type, which is technically more complicated, is intensional rules, or keyword blocking: we look at the content of the page and, if certain words occur, we take certain actions, such as blocking access to that page. A rule can be more complicated than a single word; for example, it can be logic based, using AND and OR operators, or it can be a weighted combination of different words.

Automated rule writing is called supervised learning. One or more persons provide samples of the kinds of decisions we wish to make. For example, we could ask a librarian to identify which of 500 texts or Web pages are pornography and which are not. This provides a training set of 500 sample decisions, and the rule-writing software attempts to produce rules that mimic those categorization decisions. The selection of the persons who provide the samples is fundamental, because whatever they decide becomes the gold standard that the machine tries to mimic; everything depends on the particular persons and their judgments. Research shows that supervised learning is at least as good as expert human rule writing. (Supervised learning is also very flexible. For example, foreign-language content is not a problem, as long as the content involves text rather than images.)
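The two rule types just described can be sketched in a few lines of Python. This is only an illustration: the site names, keywords, weights, and threshold below are all hypothetical, not taken from any real filter.

```python
# Toy content filter combining extensional rules (explicit site lists)
# with an intensional rule (a weighted keyword combination).
# All names and numbers are invented for illustration.

WHITELIST = {"kids.example"}              # extensional: kid-safe zone
BLACKLIST = {"badsite.example"}           # extensional: blocked sites
KEYWORD_WEIGHTS = {"xxx": 2.0, "porn": 2.0, "sex": 0.5}
THRESHOLD = 2.0                           # score at which we block

def decide(site: str, text: str) -> str:
    """Return one of the actions 'allow', 'warn', or 'block'."""
    # Extensional rules are checked first: they name sites explicitly.
    if site in WHITELIST:
        return "allow"
    if site in BLACKLIST:
        return "block"
    # Intensional rule: a weighted combination of keywords in the page text.
    words = text.lower().split()
    score = sum(KEYWORD_WEIGHTS.get(w, 0.0) for w in words)
    if score >= THRESHOLD:
        return "block"
    if score > 0:
        return "warn"
    return "allow"
```

A real filter would use far richer rules, but the division of labor is the same: the rules can be changed (by hand or by supervised learning) without touching the software that applies them.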
The effectiveness of these methods is far from perfect (there is always some error rate), but it sometimes approaches human levels of agreement. Still, the results differ from category to category, and it is not clear how directly they apply to, say, pornography. As discussed in the next presentation, there is an inevitable trade-off between false positives and false negatives, and categories vary widely in difficulty. Substantially improved methods are not expected in the next 10 to 20 years. It is not clear which text categorization techniques are most effective. Some recently developed techniques are not yet used commercially, so there may be incremental improvements. Nor is it clear how effective semiautomated categorization is, or whether the categories that are difficult for automated methods are the same as those that perplex people. As with spam e-mail, filters can be circumvented, and there is no foolproof way to filter; the question is whether the error rate is acceptable.

This all comes back to community standards. We can train the classifier to predict the probability that a person would find an item inappropriate, and training can give equal weight to any number of community volunteers. In other words, we can build a machine that mimics a community standard: we take some people from the community, get their judgments about what they find objectionable, and then build a machine that creates rules mimicking that behavior. But this does not solve the political questions of how to define the community, whom to select as its representatives, and where in that community to apply the filter. The technological capability does not settle these application issues in practice.

2.2 ADVANCED TEXT TECHNOLOGY

True text understanding will not happen for at least 20 or 30 years, and maybe never. Therein lies the problem, because to filter content with absolute accuracy we would need text understanding. As a result, there will always be an error rate; the question is how high it is.

The text categorization methods discussed above use the “bag-of-words” model, a simplistic machine representation of text. It takes all the words on a page and treats them as an unstructured list. If the text is “Dick Armey chooses Bob Shaffer to lead committee,” then a representative list would be: Armey, Bob, chooses, committee, Dick, lead, Shaffer. The structure and context of the text are completely lost. This impoverished representation is the basis of the text classification methods in existing content filters.

There are problems with this type of representation. It fails, in many cases, because of ambiguous words; context is important. A word such as “beaver” has both a hunter’s meaning and a graphic meaning, and using the bag-of-words model alone, you cannot tell which meaning is relevant. The bag-of-words model is inherently problematic for these types of ambiguous words.
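The bag-of-words representation just described can be sketched in a few lines; the tiny stopword list is illustrative, chosen so the example sentence reduces to the unordered list given above.

```python
from collections import Counter

STOPWORDS = {"to", "the", "a", "an", "of"}  # tiny illustrative stopword list

def bag_of_words(text: str) -> Counter:
    """Reduce a text to an unordered multiset of its words; word order,
    syntax, and surrounding context are discarded entirely."""
    words = [w.lower() for w in text.split() if w.lower() not in STOPWORDS]
    return Counter(words)

# The example sentence from the text:
print(sorted(bag_of_words("Dick Armey chooses Bob Shaffer to lead committee")))
# ['armey', 'bob', 'chooses', 'committee', 'dick', 'lead', 'shaffer']

# Order is lost, so the representation cannot distinguish word senses:
# "beaver" looks identical whether the page is about hunting or not.
print(bag_of_words("beaver dam") == bag_of_words("dam beaver"))
```

The second print shows the core weakness: two texts with the same words in any order get exactly the same representation.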
Other words, such as “breast” and “blow,” are not ambiguous in the same way but can be used pornographically. Again, a bag-of-words model loses the context needed to deal with these words properly. When context counts, the bag-of-words model fails. The problem cannot be resolved fully by looking for adjacent words, as search engines do when they give higher weight to information objects that match the query and have certain words in the same sentence.

There is a distinction between search engines and classification. Search engines compute a ranking of pages, and end users look at the top 10 or maybe the top 100 ranked pages. Because users look only at the pages in which the signal is strongest, and because the engine is making a relative judgment, this methodology works very well; the highest-rated pages are probably very relevant to the query.1 But in classification, we have to make a decision about one page by itself, which is a much more difficult problem. By looking at the words that lie nearby, we cannot always make a decent statistical guess as to whether a situation is innocuous. When context is important and the bag-of-words model fails, pornography filters and content filters make errors. Surprisingly, however, the bag-of-words model is effective in many applications, so it is not a hopeless basis for pornography filters despite its error rate. It always comes down to what error rate is acceptable.2

To go beyond the bag-of-words model, a number of technologies are currently available: morphological analysis, part-of-speech tagging, translation, disambiguation, genre analysis, information extraction, syntactic analysis, and parsing. Even with these technologies, thorough text understanding will remain in the distant future; a 100-percent-accurate categorization decision cannot be made today. But advanced text technologies can increase the accuracy of content filters, and this increased accuracy may be significant in three areas.

The first area relates to over-broad filters that block material that should not be blocked, raising free-speech issues. It is relatively easy to build an over-broad filter that blocks pornography very well but also blocks a lot of good content, like Dick Armey’s home page. These over-broad filters may suffice in many circumstances. For example, some parents would say, “As long as not a single pornographic page comes through, or it almost never happens, it is OK if my child cannot see a lot of good content.” But over-broad filters are problematic in many other settings, such as libraries, where there is an issue of free speech.
If a lot of good content is blocked, that is problematic. Advanced technology can make a real difference here: by increasing the accuracy of the filter, it reduces the amount of good content that is blocked.

1. Milo Medin said that various search engine companies have come up with a number of techniques to filter adult content, so that you have to turn on a capability to see certain types of references. Most of it is ranking based, but there are some other obvious things as well. Part of the challenge is that many adult sites are trying to get people to visit, so they fill their headers with all kinds of information that makes it obvious what is going on. The question is, how practical is that?

2. Milo Medin said that the people who run search engines have an economic interest in making their results as accurate as possible, to satisfy their subscribers. Normal large search engines want the adult-content filter to be as accurate as possible; if the filter is turned on, they basically want to eliminate adult content. The Google folks, as an example, have devoted a lot of energy to these issues, but the effort is not aimed directly at pornography; they focus on a broader set of issues to which pornography is a business input.
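The distinction drawn above between a search engine’s relative ranking and a classifier’s absolute, per-page decision can be sketched as follows; the page names and scores are invented for illustration.

```python
# Hypothetical relevance/objectionability scores for eight pages.
scores = {"page1": 0.97, "page2": 0.91, "page3": 0.78, "page4": 0.55,
          "page5": 0.52, "page6": 0.31, "page7": 0.12, "page8": 0.04}

# A search engine makes a *relative* judgment: rank every page and show
# only the top few, where the signal is strongest.
top3 = sorted(scores, key=scores.get, reverse=True)[:3]
print(top3)  # ['page1', 'page2', 'page3']

# A classifier must make an *absolute* yes/no decision about each page on
# its own, so borderline scores such as page4 and page5 all force a call.
blocked = {page for page, score in scores.items() if score >= 0.5}
print(sorted(blocked))  # ['page1', 'page2', 'page3', 'page4', 'page5']
```

The ranker is never asked to commit on the borderline pages; the classifier is, which is why errors concentrate there.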

The second area is pornography versus other objectionable content, such as violence and hate speech. The bag-of-words model is most successful under two conditions: (1) when there are unambiguous words indicating the relevant content, and (2) when a few of these indicators suffice. Pornography has these properties; probably about 40 or 50 words, most of them unambiguous, indicate pornography. Thus, the bag-of-words model is actually not so bad for this application, especially if you can accept an over-broad filter. In many other areas, however, such as violence and hate speech, the bag-of-words model is less effective. Often you must read four or five sentences of a text before identifying it as hate speech. Accuracy becomes important in such applications, and advanced technology can be helpful here.

The third area is automated blacklisting. Remember the distinction between extensional and intensional rules; extensional rules are lists of sites that you want to block. This is an effective content-filtering technique, mostly driven by human editors now, and a promising area for automation. Accuracy is important because blocking one site can block thousands of pages; you want to be sure you are doing the right thing. Advanced text technology can also play a role here.

A potential problem with these text technologies is their lack of robustness: they can be circumvented through changes in wording that preserve the meaning. If a pornographer wants to get through a filter that he knows and can test, then he will be able to get through it; it is simply a question of effort. But pornographers are not economically motivated to expend a lot of effort to get through these filters. I may be wrong, but my sense is that, because children do not pay for pornography, this is probably not a problem.
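The robustness problem can be made concrete with a naive keyword rule; the trigger word and the respellings below are hypothetical. Trivial changes in spelling preserve the message for a human reader but evade the rule.

```python
def naive_keyword_block(text: str, keyword: str = "porn") -> bool:
    """Block a page if the (hypothetical) trigger word appears verbatim
    as a whole word in the text."""
    return keyword in text.lower().split()

print(naive_keyword_block("free porn here"))        # True: plain spelling caught
print(naive_keyword_block("free p0rn here"))        # False: digit substitution
print(naive_keyword_block("free p o r n now"))      # False: inserted spaces
```

A filter builder can respond with character normalization, and the evader can respond in turn; as the text notes, whether a filter holds up is a question of effort on both sides.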
In summary, true machine-aided text understanding will not be available in the near term, which means there will always be a significant error rate with any automated method. Advanced text technologies improve accuracy, and that improvement may be important in contexts such as free speech in libraries, identification of violence and hate speech, and automated blacklisting. The extent of the improvement from these technologies depends on many parameters, and tests must be run.3 The latest numbers I know of are from Consumer Reports,4 but they are aggregated and not broken down by area. There is probably a big difference in accuracy between pornography and the other objectionable areas. There is also a trade-off between false positives and false negatives, and the extent to which advanced techniques make a difference depends on where in the trade-off you start out. If I had to give a number, I would expect a 20 to 30 percent improvement in accuracy over the bag-of-words model, assuming you want to let all good content through (that is, no over-blocking).

3. Milo Medin said that it is difficult to do good experiments and that sloppy experimentation is rewarded in a strange way. First, you run a very large collection of text through your filter and determine how much of the material identified as pornographic was, in fact, not. Second, you find out how much of the material identified as not pornographic was, in fact, a problem. If you do that analysis badly or carelessly, your filter looks better.

4. Consumer Reports, March 2001.
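The false-positive/false-negative trade-off discussed above can be made concrete with a score threshold; the scores and true labels below are invented for illustration, not measured data.

```python
# Hypothetical classifier scores paired with true labels
# (1 = actually pornographic, 0 = actually innocuous).
pages = [(0.95, 1), (0.80, 1), (0.60, 0), (0.40, 1), (0.20, 0), (0.05, 0)]

def error_counts(threshold: float):
    """Block pages scoring at or above the threshold; count false
    positives (innocuous pages blocked) and false negatives
    (pornographic pages let through)."""
    fp = sum(1 for score, label in pages if score >= threshold and label == 0)
    fn = sum(1 for score, label in pages if score < threshold and label == 1)
    return fp, fn

# An aggressive (over-broad) threshold lets no pornography through but
# blocks an innocuous page; a permissive threshold does the reverse.
print(error_counts(0.3))  # (1, 0)
print(error_counts(0.7))  # (0, 1)
```

Moving the threshold trades one kind of error for the other, which is why where you start out in the trade-off determines how much an accuracy improvement is worth.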