2 Text Categorization and Analysis
Pages 5-10

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 5...
... A text categorizer looks at a Web page and decides into which of these groups a piece of text should fall. Applications of text categorization include filtering of e-mail, chat, or Web access; text indexing; and data mining.
From page 6...
... This provides a training set of 500 sample decisions to be mimicked. The rule-writing software attempts to produce rules that mimic those categorization decisions.
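As a rough illustration of rules being induced from sample decisions, the sketch below learns keyword rules from a tiny labeled set. It is not the rule-writing software described in the chapter: the function names, the "block"/"allow" labels, the `min_count` parameter, and the four training samples are all hypothetical, and a real training set would hold hundreds of decisions.

```python
from collections import Counter


def learn_keyword_rules(samples, min_count=2):
    """Return words seen in at least min_count 'block' texts and in no 'allow' text."""
    block_counts = Counter()
    allow_words = set()
    for text, label in samples:
        words = set(text.lower().split())  # count each word once per text
        if label == "block":
            block_counts.update(words)
        else:
            allow_words.update(words)
    return {w for w, n in block_counts.items()
            if n >= min_count and w not in allow_words}


def categorize(text, rules):
    """Apply the learned rules: block if any rule word appears in the text."""
    return "block" if set(text.lower().split()) & rules else "allow"


# Hypothetical sample decisions standing in for the training set.
TRAINING = [
    ("cheap pills online", "block"),
    ("buy cheap pills now", "block"),
    ("library opening hours", "allow"),
    ("buy tickets online", "allow"),
]

RULES = learn_keyword_rules(TRAINING)
print(sorted(RULES))
print(categorize("cheap pills here", RULES))
print(categorize("library hours", RULES))
```

The learned rules only reproduce the sample decisions; how well they carry over to unseen text depends on how representative the samples are.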
From page 7...
... 2.2 ADVANCED TEXT TECHNOLOGY

True text understanding will not happen for at least 20 or 30 years, and maybe never. Therein lies the problem, because to filter content with absolute accuracy we would need text understanding.
From page 8...
... It always comes down to what error rate is acceptable.[2] To go beyond the bag-of-words model, a number of technologies are currently available: morphological analysis, part-of-speech tagging, translation, disambiguation, genre analysis, information extraction, syntactic analysis, and parsing. Even using these technologies, thorough text understanding will remain in the distant future; a 100-percent-accurate categorization decision cannot be made today.
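To make concrete what the bag-of-words model discards, here is a minimal sketch in which a text is reduced to word counts. The tokenizer and the two example sentences are assumptions chosen for illustration; they do not come from the chapter.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Reduce a text to lowercase word counts, discarding word order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


A = bag_of_words("Dog bites man")
B = bag_of_words("Man bites dog")

# Both sentences contain the same words once each, so their bags are
# equal even though the sentences mean different things.
SAME = A == B
print(SAME)
```

Technologies such as part-of-speech tagging and parsing aim to recover some of the structure that this representation throws away.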
From page 9...
... The advanced text technologies improve accuracy, which may be important in contexts such as free speech in libraries, identification of violence and hate speech, and automated blacklisting. The extent of the improvement from these technologies depends on many parameters, and tests must be run.[3] The latest numbers I know of are from Consumer Reports,[4] but they are aggregated and not broken down ...

[3] Milo Medin said that it is difficult to do good experiments and that sloppy experimentation is rewarded in a strange way.
From page 10...
... There is also a trade-off between false positives and false negatives. The extent to which advanced techniques make a difference depends on where in the trade-off you start out.
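The trade-off can be sketched with a filter that blocks any page whose score meets a threshold. The scores, labels, and thresholds below are made-up values for illustration only, not measurements from any real filter.

```python
def error_counts(scored, threshold):
    """Count (false positives, false negatives) when blocking scores >= threshold."""
    fp = sum(1 for score, is_bad in scored if score >= threshold and not is_bad)
    fn = sum(1 for score, is_bad in scored if score < threshold and is_bad)
    return fp, fn


# Hypothetical (score, truly objectionable?) pairs.
SCORED = [
    (0.95, True), (0.80, True), (0.65, False),
    (0.55, True), (0.40, False), (0.10, False),
]

TRADE_OFF = {t: error_counts(SCORED, t) for t in (0.3, 0.6, 0.9)}
for threshold, (fp, fn) in TRADE_OFF.items():
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

On this toy data, raising the threshold removes false positives at the cost of more false negatives, so the value of a more accurate scorer depends on which operating point is chosen.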


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.