
6
Advanced Techniques for Automatic Web Filtering

Michel Bilello

6.1 BACKGROUND

As of 1999, the Web had about 16 million servers, 800 million pages, and 15 terabytes of text (comparable to the text held by the Library of Congress). By 2001, the Web was expected to have 3 billion to 5 billion pages.1

To prevent kids from looking at inappropriate material, one solution is to have dedicated, pornography-free Web sites—such as Yahoo!Kids and disney.com—and assign reviewers to look at those particular Web sites. This is useful in protecting children too young to know how to use a Web browser.

Filtering is mostly text based (e.g., Net Nanny, Cyber Patrol, CYBERsitter). There are different methods and problems; for example, Cyber Patrol looks at Web sites but has to update its lists all the time. You can also block keywords, scanning pages and matching the words against a keyword list, as in the sketch below. But keyword blocking is usually not enough, because text embedded in images is not recognized as text.2 You could block all images, but then surfing an imageless Web would become boring, especially for children.
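As a rough illustration of how simple this kind of matching is, a keyword blocker might look like the following sketch; the keyword set and threshold are invented for illustration and are not drawn from any actual product:

```python
# A minimal sketch of keyword blocking, assuming an invented keyword
# list and threshold; real products use much larger lists and more rules.
import re

BLOCKED_KEYWORDS = {"keyword1", "keyword2", "keyword3"}  # placeholder terms
THRESHOLD = 2  # block a page once this many distinct keywords appear

def should_block(page_text: str) -> bool:
    # Lowercase the page and split it into words, then count how many
    # distinct blocked keywords occur anywhere in the visible text.
    words = set(re.findall(r"[a-z']+", page_text.lower()))
    return len(words & BLOCKED_KEYWORDS) >= THRESHOLD
```

Note that such a sketch sees only the page text; any words rendered inside an image pass straight through, which is exactly the weakness described above.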

1. Steve Lawrence and C. Lee Giles, “Accessibility and Distribution of Information on the Web,” Nature 400(6740): 107-109, July 8, 1999.

2. Michel Bilello said that his group has used a technique that pulls text off images, such as chest X-rays used for research purposes. They process the X-ray image, detect the text, and then remove, for example, the name of the patient, which the researcher does not need to know.


A group at NEC (formerly the Nippon Electric Company) tried to recognize the clustering of communities within the Web. You could, for example, keep the user away from particular communities or exclude some communities from the allowed Web sites.
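The talk does not describe NEC's actual algorithm; as a stand-in, the sketch below uses generic modularity-based community detection (via the networkx library) and blocks every site in any link-graph community that contains a known objectionable seed. The graph, the seed set, and the blocking rule are all hypothetical.

```python
# Hypothetical sketch: treat the Web as a link graph, find densely
# connected clusters ("communities"), and block any community that
# contains a known objectionable seed site.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

link_graph = nx.Graph()
link_graph.add_edges_from([
    ("siteA", "siteB"), ("siteB", "siteC"),   # one cluster
    ("siteX", "siteY"), ("siteY", "siteZ"),   # another cluster
])
known_bad = {"siteX"}  # placeholder seed list

blocked = set()
for community in greedy_modularity_communities(link_graph):
    if community & known_bad:          # community contains a known bad site
        blocked |= set(community)      # exclude the whole community
print(blocked)                          # {'siteX', 'siteY', 'siteZ'}
```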

6.2 THE WIPE SYSTEM

In the Stanford WIPE system,3 we use software to analyze image content and make classification decisions as to whether an image is appropriate or not. Speed and accuracy are both issues; we try to avoid both false positives and false negatives. The common image-processing challenges to be overcome include a nonuniform image background; textual noise in the foreground; and a wide range of image quality, camera positions, and composition.

This work was inspired by the Fleck-Forsyth-Bregler System at the University of California at Berkeley, which classifies images as pornographic or not.4 The published results were 52 percent sensitivity (i.e., 48 percent false negatives) and 96 percent specificity (i.e., 4 percent false positives). The Berkeley system had a rather long processing time of 6 minutes per image.

In comparison, the WIPE system has higher sensitivity, 96 percent, and somewhat less specificity (but still high) at 91 percent, and the processing time is less than 1 second per image. This technology is most applicable to automated identification of commercial porn sites; it also could be purchased by filtering companies and added to their products to increase accuracy.

In the WIPE system, the image is acquired, feature extraction is performed using wavelet technology, and, if the image is classified as a photograph (versus drawing), extra processing is done to compare a feature vector with prestored vectors. Then the image is classified as either pornographic or not, and the user can reject it or let it pass on that basis. There is an assumption that only photographs—and not manually generated images, such as an artist’s rendering—would be potentially objectionable. Manually generated images can be distinguished on the basis of tones: smooth tones for manually generated images versus continuous tones for photographs. Again, only photographs would require the next processing stage.
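The sketch below follows that flow under loudly stated assumptions; it is not the WIPE implementation. Wavelet subband energies (via the PyWavelets library) stand in for WIPE's features, local tonal variation is a crude proxy for the smooth-versus-continuous-tone test, and a simple nearest-neighbor vote replaces the trained classifier; every threshold is invented.

```python
# Simplified stand-in for the pipeline described above (NOT actual WIPE).
import numpy as np
import pywt

def feature_vector(gray_image: np.ndarray) -> np.ndarray:
    # Multilevel wavelet decomposition; use subband energies as features.
    coeffs = pywt.wavedec2(gray_image, "db4", level=3)
    feats = [np.mean(np.abs(coeffs[0]))]           # approximation energy
    for (cH, cV, cD) in coeffs[1:]:                # detail subbands
        feats += [np.mean(np.abs(cH)), np.mean(np.abs(cV)), np.mean(np.abs(cD))]
    return np.array(feats)

def looks_like_photograph(gray_image: np.ndarray, tone_threshold: float = 8.0) -> bool:
    # Drawings tend toward smooth, flat tones; photographs show continuous
    # tonal variation. Horizontal pixel differences are a crude proxy.
    return float(np.std(np.diff(gray_image, axis=1))) > tone_threshold

def classify(gray_image, stored_vectors, stored_labels, k: int = 5) -> str:
    # Per the assumption above, only photographs can be objectionable.
    if not looks_like_photograph(gray_image):
        return "benign"
    v = feature_vector(gray_image)
    dists = np.linalg.norm(stored_vectors - v, axis=1)   # compare to prestored vectors
    nearest = stored_labels[np.argsort(dists)[:k]]       # k nearest neighbors vote
    return "objectionable" if (nearest == "objectionable").sum() > k // 2 else "benign"
```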

3. For a technical discussion, see James Z. Wang, Integrated Region-based Image Retrieval, Dordrecht, Holland: Kluwer Academic Publishers, 2001, pp. 107-122. The acronym WIPE stands for Wavelet Image Pornography Elimination.

4. Margaret Fleck, David Forsyth, and Chris Bregler, “Finding Naked People,” Proceedings of the European Conference on Computer Vision, B. Buxton and R. Cipolla, eds., Berlin, Germany: Springer-Verlag, Vol. 2, 1996, pp. 593-602.


This work was based on an information-retrieval system: given a selected image, the software searches thousands of images stored in a database and retrieves all the ones deemed “close” to it. The system was tested against a set of 10,000 photographic images and a knowledge base built with a training system. For every image, a feature vector can be defined that encompasses all the relevant information: texture, color, and so on. Images are then classified according to the information in this vector.

The database contains thousands of objectionable images of various types and thousands of benign images5 of various types. In the training process, you process random images to see whether the detection and classification are correct. You can adjust sensitivity parameters to allow tighter or looser filtering, as in the sketch below. You could combine text and images, or process multiple images from one site, to decrease the overall error in classifying the site as objectionable or not.
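A minimal sketch of that sensitivity knob, assuming a classifier that outputs a per-image score (the scores, labels, and thresholds below are toy data):

```python
# Trade false negatives against false positives by moving one threshold.
import numpy as np

def error_rates(scores, labels, threshold):
    # scores: higher = more likely objectionable; labels: 1 = objectionable
    predicted = scores >= threshold
    fn = np.mean(~predicted[labels == 1])   # objectionable images missed
    fp = np.mean(predicted[labels == 0])    # benign images wrongly blocked
    return fn, fp

scores = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.95])
labels = np.array([0,    0,    1,    1,    0,    1])
for t in (0.3, 0.5, 0.7):   # lower threshold = tighter filtering
    fn, fp = error_rates(scores, labels, t)
    print(f"threshold={t}: false-negative rate={fn:.2f}, false-positive rate={fp:.2f}")
```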

A statistical analysis was done showing that, if you download 20-35 images from each site, and 20-25 percent of the downloaded images are objectionable, then you can classify the Web site as objectionable with 97 percent accuracy.6 Image content analysis can be combined with text and IP address filtering. To avoid false positives, especially for art images, you can skip images associated with the IP addresses of museums, dog shows, beach towns, sports events, and so on.
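As a back-of-the-envelope check of this kind of claim, and assuming (as the cited analysis apparently does, and as Forsyth disputes in footnote 6) that per-image errors are independent, the site-level flagging probability is a binomial tail. The sample size, decision rule, and per-image rates below are illustrative, with the per-image rates matching the reported WIPE figures:

```python
# Site-level accuracy under an (assumed) independence model.
from math import comb

def p_site_flagged(n_images, p_image_flagged, min_flagged):
    """Probability that at least min_flagged of n_images are flagged,
    treating each image as an independent Bernoulli trial."""
    return sum(comb(n_images, k) * p_image_flagged**k *
               (1 - p_image_flagged)**(n_images - k)
               for k in range(min_flagged, n_images + 1))

n, rule = 25, 5   # sample 25 images; flag the site if >= 5 are flagged
# Objectionable site with 20% objectionable images, using WIPE's reported
# 96% per-image sensitivity and 91% specificity:
p_flag = 0.20 * 0.96 + 0.80 * (1 - 0.91)   # true hits plus false alarms
print(f"P(flag objectionable site) = {p_site_flagged(n, p_flag, rule):.3f}")
# Benign site (all benign images, 9% per-image false-positive rate):
print(f"P(flag benign site)        = {p_site_flagged(n, 0.09, rule):.3f}")
```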

In summary, you cannot expect perfect filtering. There is always a trade-off between performance and processing effort. But the performance of the WIPE system shows that good results can be obtained with current technology. Performance can be improved further by combining image-based and text-based processing. James Wang is working on training the system automatically as it extracts the features, with the images then classified manually as either objectionable or not.7

5. To develop a set of benign images, David Forsyth suggested obtaining the Corel collection or some similar set of images known to be nonproblematic, or visiting Web news groups where it is virtually guaranteed that images will not be objectionable. He said this is a rare case in which you can take a technical position without much trouble.

6. David Forsyth took issue with the statistical analysis because it assumes that classification errors on different images are independent. In the example given earlier with images of puddings (in Forsyth’s talk in Chapter 3), a large improvement in performance cannot be expected, because there are certain categories in which the system will simply get it wrong again: if it is wrong about one picture of pudding and then wrong again about a second picture of pudding, it will classify the Web site wrong as well.

7. For more information, see <http://WWW-DB.Stanford.EDU/IMAGE> (papers) and <http://wang.ist.psu.edu>.
