6
Advanced Techniques for Automatic Web Filtering

Michel Bilello

6.1 BACKGROUND

As of 1999, the Web had about 16 million servers, 800 million pages, and 15 terabytes of text (comparable to the text held by the Library of Congress). By 2001, the Web was expected to have 3 billion to 5 billion pages.1

To prevent kids from looking at inappropriate material, one solution is to have dedicated, pornography-free Web sites—such as Yahoo!Kids and disney.com—and assign reviewers to look at those particular Web sites. This is useful in protecting children too young to know how to use a Web browser.

Filtering is mostly text based (e.g., Net Nanny, Cyber Patrol, CYBERSitter). The methods differ, and each has its problems; for example, Cyber Patrol reviews Web sites but has to update its lists all the time. You can also block by keyword, scanning each page and matching its words against a keyword list. But keyword blocking is usually not enough, because text embedded in images is not recognized as text.2 You could block all images, but then surfing an imageless Web would become boring, especially for children. A group at the Nippon Electric Company (NEC)

1  

Steve Lawrence and C. Lee Giles, “Accessibility and Distribution of Information on the Web,” Nature 400(6740): 107-109, July 8, 1999.

2  

Michel Bilello said that his group has used a technique that pulls text off images, such as chest X-rays used for research purposes. They process the X-ray image, detect the text, and then remove, for example, the name of the patient, which the researcher does not need to know.







Technical, Business, and Legal Dimensions of Protecting Children from Pornography on the Internet: Proceedings of a Workshop

tried to recognize the clustering communities within the Web. You could, for example, keep the user away from particular communities or exclude some communities from the allowed Web sites.

6.2 THE WIPE SYSTEM

In the Stanford WIPE system,3 we use software to analyze image content and decide whether an image is appropriate or not. Speed and accuracy are both issues; for example, we try to avoid both false positives and false negatives. The common image-processing challenges to be overcome include nonuniform image background; textual noise in foreground; and a wide range of image quality, camera positions, and composition.

This work was inspired by the Fleck-Forsyth-Bregler System at the University of California at Berkeley, which classifies images as pornographic or not.4 The published results were 52 percent sensitivity (i.e., 48 percent false negatives) and 96 percent specificity (i.e., 4 percent false positives). The Berkeley system had a rather long processing time of 6 minutes per image.

In comparison, the WIPE system has higher sensitivity, 96 percent, and somewhat lower (but still high) specificity, 91 percent, and the processing time is less than 1 second per image. This technology is most applicable to automated identification of commercial porn sites; it also could be purchased by filtering companies and added to their products to increase accuracy.

In the WIPE system, the image is acquired, feature extraction is performed using wavelet technology, and, if the image is classified as a photograph (versus drawing), extra processing is done to compare a feature vector with prestored vectors. Then the image is classified as either pornographic or not, and the user can reject it or let it pass on that basis.
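The step of comparing a feature vector with prestored vectors can be illustrated with a minimal sketch. It is not the WIPE implementation: the two-dimensional vectors, the Euclidean distance, the majority vote over the `k` nearest stored vectors, and the names `classify` and `STORED` are all assumptions made for illustration. WIPE derives its features from wavelet analysis and uses its own matching procedure.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(feature, stored, k=3):
    """Label a feature vector by majority vote among its k nearest stored vectors.

    `stored` is a list of (vector, label) pairs whose label is either
    "objectionable" or "benign". This is a stand-in for the real matching
    step, which is not specified in the text.
    """
    nearest = sorted(stored, key=lambda item: euclidean(feature, item[0]))[:k]
    votes = sum(1 for _, label in nearest if label == "objectionable")
    return "objectionable" if votes * 2 > len(nearest) else "benign"

# Hypothetical prestored vectors; real ones would come from feature extraction.
STORED = [
    ((0.90, 0.80), "objectionable"),
    ((0.80, 0.90), "objectionable"),
    ((0.85, 0.70), "objectionable"),
    ((0.10, 0.20), "benign"),
    ((0.20, 0.10), "benign"),
    ((0.15, 0.30), "benign"),
]

if __name__ == "__main__":
    print(classify((0.82, 0.78), STORED))
    print(classify((0.12, 0.18), STORED))
```

Raising or lowering the vote threshold in such a scheme is one simple way to trade false positives against false negatives.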
There is an assumption that only photographs, and not manually generated images such as an artist’s rendering, would be potentially objectionable. Manually generated images can be distinguished on the basis of tones: smooth tones for manually generated images versus continuous tones for photographs. Again, only photographs would require the next processing stage.

3

For a technical discussion, see James Z. Wang, Integrated Region-based Image Retrieval, Dordrecht, Holland: Kluwer Academic Publishers, 2001, pp. 107-122. The acronym WIPE stands for Wavelet Image Pornography Elimination.

4

Margaret Fleck, David Forsyth, and Chris Bregler, “Finding Naked People,” Proceedings of the European Conference on Computer Vision, B. Buxton and R. Cipolla, eds., Berlin, Germany: Springer-Verlag, Vol. 2, 1996, pp. 593-602.
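The sensitivity and specificity figures quoted for the Berkeley and WIPE systems can be related to raw classification counts with a short sketch. The counts below are hypothetical: they assume 100 objectionable and 100 benign test images per system, chosen only so that the ratios match the percentages in the text. The actual test-set sizes are not given here.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of objectionable images that are correctly flagged."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of benign images that are correctly passed."""
    return true_neg / (true_neg + false_pos)

# Hypothetical counts per 100 objectionable and 100 benign images.
SYSTEMS = {
    "Berkeley": {"tp": 52, "fn": 48, "tn": 96, "fp": 4},
    "WIPE": {"tp": 96, "fn": 4, "tn": 91, "fp": 9},
}

if __name__ == "__main__":
    for name, c in SYSTEMS.items():
        sens = sensitivity(c["tp"], c["fn"])
        spec = specificity(c["tn"], c["fp"])
        print(f"{name}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Under these assumed counts, the false-negative rate is one minus the sensitivity and the false-positive rate is one minus the specificity, which is how the text pairs 52 percent sensitivity with 48 percent false negatives.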

This work was based on an information-retrieval system that finds in a database all the images “close” to one selected image. Starting from the selected image, the software looks at thousands of images stored in the database and retrieves all the ones that are deemed “close” to it.

The images were tested against a set of 10,000 photographic images and a knowledge base. The knowledge base was built with a training system. For every image there is some trusted element; a feature vector can be defined that encompasses all the information: texture, color, and so on. Then images are classified according to the information in this vector.

The database contains thousands of objectionable images of various types and thousands of benign images5 of various types. In the training process, you process random images to see if the detection and classification are correct. You can adjust sensitivity parameters to allow tighter or looser filtering.

You could combine text and images or do multiple processing of multiple images on one site to decrease the overall error in classifying a site as objectionable or not. A statistical analysis was done showing that, if you download 20-35 images for each site, and 20-25 percent of downloaded images are objectionable, then you can classify the Web site as objectionable with 97 percent accuracy.6

Image content analysis can be combined with text and IP address filtering. To avoid false positives, especially for art images, you can skip images that are associated with the IP addresses of museums, dog shows, beach towns, sports events, and so on.

In summary, you cannot expect perfect filtering. There is always a trade-off between performance and processing effort. But the performance of the WIPE system shows that good results can be obtained with current technology.
The performance can improve by combining image-based and text-based processing. James Wang is working on training the system automatically as it extracts the features and then classifying the images manually as either objectionable or not.7

5

To develop a set of benign images, David Forsyth suggested obtaining the Corel collection or some similar set of images known to be not-problematic or visiting Web news groups, where it is virtually guaranteed that images will not be objectionable. He said this is a rare case in which you can take a technical position without much trouble.

6

David Forsyth took issue with the statistical analysis, because there is a conditional probability assumption that the error is independent of the numbers. In the example given earlier with images of puddings (in Forsyth’s talk in Chapter 3), a large improvement in performance cannot be expected because there are certain categories in which the system will just get it wrong again. If it is wrong about one picture of pudding and then wrong again about a second picture of pudding, then it will classify the Web site wrong, also.

7

For more information, see <http://WWW-DB.Stanford.EDU/IMAGE> (papers) and <http://wang.ist.psu.edu>.
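The idea of classifying a whole site from many downloaded images, and the independence objection raised in footnote 6, can be made concrete with a binomial sketch. It is not the statistical analysis described in the talk and does not attempt to reproduce the 97 percent figure. It assumes that each image is flagged independently of the others, which is exactly the assumption Forsyth disputes. The sample size of 30 images, the decision rule `min_flagged=6`, and the function names are hypothetical; the per-image sensitivity and specificity are the WIPE figures quoted in the text.

```python
from math import comb

def prob_at_least(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def site_flag_probability(n_images, frac_objectionable, sens, spec, min_flagged):
    """Probability that at least `min_flagged` of `n_images` are flagged.

    Assumes each image is flagged independently: an objectionable image
    with probability `sens`, a benign image with probability 1 - `spec`.
    """
    p_flag = frac_objectionable * sens + (1 - frac_objectionable) * (1 - spec)
    return prob_at_least(n_images, min_flagged, p_flag)

SENS, SPEC = 0.96, 0.91  # per-image WIPE figures quoted in the text

# Hypothetical rule: call a site objectionable if 6 or more of 30 images are flagged.
objectionable_site = site_flag_probability(30, 0.25, SENS, SPEC, min_flagged=6)
benign_site = site_flag_probability(30, 0.0, SENS, SPEC, min_flagged=6)

if __name__ == "__main__":
    print(f"P(flag | 25% objectionable images) = {objectionable_site:.3f}")
    print(f"P(flag | no objectionable images)  = {benign_site:.3f}")
```

If the errors are correlated, as in the pudding example, the benign-site probability can be much higher than this model suggests, because one misclassified category tends to be misclassified repeatedly.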