YOUTH, PORNOGRAPHY, AND THE INTERNET

C Selected Technology Issues

C.1 INFORMATION RETRIEVAL TECHNOLOGIES

Information retrieval, a function that supports people who are actively seeking or searching for information, typically assumes a static or relatively static database of digital objects against which people search. These digital objects may be documents, images, sound recordings, and so on. The general information retrieval problem is one of making a decision about which objects in the database to show to the user. Systems for information retrieval seek to maximize the material that a person sees that is likely to be relevant to his or her information problem, and to minimize the material that is not relevant to the problem.

An information retrieval system works by representing the objects in a database in some well-specified manner, and then representing the user's information problem ("information need") in a similar fashion. The retrieval techniques then compare needs with objects.

A key aspect of object representation is the language used to describe the objects in question. Typically, this language (more precisely, a "meta-language") consists of a vocabulary (and sometimes a precise grammar) that can characterize the objects in a form suitable for automated comparison to the user's needs.

The representation of information objects also requires interpretations by a human indexer, machine algorithm, or other entity. When people are involved, the interpretation of a particular text by one person is likely to be different from that of someone else, and may even be different for one person at different times. As a person's state of knowledge changes, his or her understanding of a text changes. Everyone has experienced the situation of finding a document not relevant at some point, but quite relevant later on, perhaps for a different problem or perhaps because we, ourselves, are different. Such subjectivity means that decisions about representation will be inconsistent. An extensive literature on interindexer consistency indicates that when people are asked to represent an information object, even if they are highly trained in using the same meta-language (indexing language), they might achieve 60 to 70 percent consistency at most in tasks like assigning descriptors.1

Machine-executable algorithms for representing information objects or information problems do give consistent representations. But any one such algorithm gives only one interpretation of the object, out of a great variety of possible representations, depending on the interpreter.

The most typical way to accomplish such representations is through statistically based automatic indexing. Techniques of this sort index (represent) documents by the words that appear in them, deleting the most common (articles, prepositions, and other function words), conflating different forms of a word into one (e.g., making plural and singular forms the same), and weighting the resulting terms according to some function of how often they appear in the document (term frequency: the more occurrences, the greater the weight) and how many documents they appear in (document frequency: the more documents, the lower the weight). A consistent result in information retrieval experimentation is that automatic indexing of the sort just described gives results that are at least as good as human indexing, and usually better.
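The statistically based automatic indexing described above can be illustrated with a minimal sketch. The stop list, the plural-stripping rule, and the particular weighting formula (term frequency multiplied by the logarithm of inverse document frequency) are assumptions chosen for illustration; the text says only that the weight is "some function" of the two frequencies.

```python
import math
import re
from collections import Counter

# Hypothetical stop list; an operational system would use a far longer one.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def normalize(word):
    """Crude conflation: lowercase and strip a trailing plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word


def index_terms(text):
    """Return the index terms of a text, with stop words deleted."""
    words = re.findall(r"[A-Za-z]+", text)
    return [normalize(w) for w in words if w.lower() not in STOP_WORDS]


def tf_idf(documents):
    """Weight each term by term frequency times log inverse document frequency."""
    term_lists = [index_terms(doc) for doc in documents]
    doc_freq = Counter()
    for terms in term_lists:
        doc_freq.update(set(terms))  # count each term once per document
    n_docs = len(documents)
    weights = []
    for terms in term_lists:
        counts = Counter(terms)
        weights.append(
            {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        )
    return weights


docs = [
    "The cats sat on the mat",
    "A cat and a dog",
    "Dogs chase the cat in the park",
]
weights = tf_idf(docs)
```

In this toy collection "cat" appears in every document and therefore receives zero weight, while a term confined to one document, such as "sat", receives the highest weight. That is the behavior the paragraph describes: the more documents a term appears in, the lower its weight.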
Performance is evaluated by a pair of measures, recall and precision.2 This result is often stated as meaning that automatic indexing is better than manual indexing, because manual indexing is so much more expensive than automatic. But it is important to keep in mind another consistent information retrieval result, that the sets (or lists) of documents that are retrieved using one technique are different from those retrieved by another technique (both relevant and non-relevant documents).

1 L.E. Lawrence, 1977, Inter-indexer Consistency Studies 1954-1975: A Review of the Literature and Summary of Study Results, University of Illinois, Graduate School of Library Science, Urbana-Champaign; K. Markey, 1984, "Interindexer Consistency Tests: A Literature Review and Report of a Test of Consistency in Indexing Visual Materials," Library and Information Research, 6(2): 155-177; L. Mai-Chan, 1989, "Inter-indexer consistency in subject cataloging," Information Technology and Libraries 8(4): 349-357.

2 "Recall" measures how complete the search results are, and is the proportion of relevant documents in the whole collection that have actually been retrieved. "Precision" measures how precise the search results are, and is the proportion of retrieved documents that are relevant.

Having constructed representations of both the information objects and the person's information problem, the information retrieval system is in a position to compare these representations to one another, in order to identify those objects that match the information problem description. Technologies for accomplishing this are called search or retrieval techniques. Because people in general cannot specify precisely what it is that they do not know (that for which they are searching), because people cannot represent accurately and completely what an information object is about, and because relevance judgments are inherently subjective, it is clear that such techniques must be probabilistic, rather than deterministic. That is, any decision as to whether to retrieve (or block) an information object is always a guess, more or less well informed. Thus, in information retrieval, we cannot ever say with confidence that an information object is (or is not) relevant to an information need, but can only make judgments of probability (or belief) of (non)relevance.

It can be easily seen that many of the problems that information retrieval faces are also problems that information filtering must face. Although filtering aims to not retrieve certain information objects, it still must do so on the basis of some representation of those objects, and some representation of the person for whom the filtering is done. Filtering works by prespecifying some information object descriptions that are to be excluded from viewing by some person or class of persons. Thus, it depends crucially on having an accurate and complete representation of what is to be excluded (this is analogous to the query or information problem representation), and also accurate and complete representations of the information objects that are to be tested for exclusion.
Just as in information retrieval, making the exclusion decision in information filtering is an inherently uncertain process.

C.1.1 Text Categorization and Representation

Automatic text categorization is the primary language retrieval technology in content screening. Text categorization is the sorting of text into groups, such as pornography, hate speech, and violence. A text categorizer looks at a text-based information object and decides the proper categorization for this object. Applications of text categorization are filtering of e-mail, chats, or Web access; and indexing and data mining.

The principal problem with text categorization is that text is ambiguous in many ways: polysemy, synonymy, and so on. For example, a bank can be either a financial institution or something on the side of a river (polysemy). The context matters a lot in the interpretation.

Efficient automatic text categorization requires an automated categorization decision that identifies, on the basis of some categorization rules, the category into which an object falls. (Note that if the rules are separated from the decision maker, the behavior of the decision maker can be changed merely by changing the rules, rather than requiring the rewriting every time of the software underlying the decision maker.) The decision-making software examines the text in the information object, and on the basis of these rules, categorizes the object. The simplest decision rules might base the decision on the mere presence of certain words (e.g., if the word "breast" appears, the object is pornographic). More sophisticated rules might search for combinations of words (e.g., if the word "breast" appears without the word "cancer," the object is pornographic), or even weighted combinations of different words.

Decision rules are developed by modeling the kinds of decisions that responsible human beings make. Thus, the first step in automated rule writing (also known as "supervised learning") is to ask a responsible individual to identify, for example, which of 500 digital text objects constitute pornography and which do not. The resulting categorization thus provides a training set of 500 sample decisions to be mimicked by the decision rules. Of course, the selection of the persons who provide the samples is crucial, because whatever they do becomes the gold standard, which the decision rules then mimic. Everything depends on the particular persons and their judgments, but the technology does not provide guidance on how to define the community or whom to select as representatives of that community.
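The three kinds of decision rule mentioned in this section (mere presence, combination, and weighted combination) can be sketched as data that is kept separate from the decision-making code, so that behavior changes when the rules change and the software does not. The rule format, the weights, and the threshold below are hypothetical and chosen only to mirror the "breast"/"cancer" example; they are not taken from any actual filtering product.

```python
import re

# Hypothetical rules, stored as data and kept apart from the decision code.
PRESENCE_RULE = {"type": "presence", "word": "breast"}
COMBINATION_RULE = {"type": "present_without", "word": "breast", "unless": "cancer"}
WEIGHTED_RULE = {
    "type": "weighted",
    "weights": {"breast": 1.0, "cancer": -1.0},  # assumed illustrative weights
    "threshold": 1.0,
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def is_flagged(text, rule):
    """Apply one rule to a text; True means the text is put in the category."""
    words = tokenize(text)
    if rule["type"] == "presence":
        return rule["word"] in words
    if rule["type"] == "present_without":
        return rule["word"] in words and rule["unless"] not in words
    if rule["type"] == "weighted":
        score = sum(rule["weights"].get(w, 0.0) for w in words)
        return score >= rule["threshold"]
    raise ValueError("unknown rule type: " + rule["type"])


medical_text = "Early screening for breast cancer saves lives"
results = {
    "presence": is_flagged(medical_text, PRESENCE_RULE),
    "combination": is_flagged(medical_text, COMBINATION_RULE),
    "weighted": is_flagged(medical_text, WEIGHTED_RULE),
}
```

On the medical sentence the presence rule produces a false positive, while the combination and weighted rules do not. In a supervised learning setting, the weights and threshold would be fitted to the human-labeled training set instead of being written by hand.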
Research indicates that supervised learning is at least as good as expert human rule writing.3 The effectiveness of these methods is far from perfect (there is always a high error rate), but sometimes it is near agreement with human performance levels. Still, the results differ from category to category, and it is not clear how directly they apply to, for example, pornography. As discussed below, there is an inevitable trade-off between false positives and false negatives (i.e., attributing an item to a category when it should not be; not attributing an item to a category when it should be), and categories vary widely in difficulty. Substantially improved methods are not expected in the next 10 to 20 years.

3 Fabrizio Sebastiani. 2002. "Machine Learning in Automated Text Categorization," ACM Computing Surveys 34(1): 1-47.
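The trade-off between false positives and false negatives can be made concrete with a small sketch. It assumes a categorizer that assigns each item a numeric score and flags items at or above a threshold; the scores, labels, and thresholds are invented for illustration. Recall and precision are computed as defined earlier in this appendix.

```python
def evaluate(scores, labels, threshold):
    """Count errors when items scoring at or above the threshold are flagged."""
    tp = fp = fn = 0
    for score, is_member in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_member:
            tp += 1  # correctly attributed to the category
        elif flagged and not is_member:
            fp += 1  # false positive
        elif not flagged and is_member:
            fn += 1  # false negative
    # When nothing is flagged (or nothing is relevant) the ratio is undefined;
    # returning 1.0 in that case is an assumed convention.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return {
        "false_positives": fp,
        "false_negatives": fn,
        "precision": precision,
        "recall": recall,
    }


# Invented scores and human judgments for six items.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]

strict = evaluate(scores, labels, threshold=0.7)
lenient = evaluate(scores, labels, threshold=0.35)
```

With the strict threshold there are no false positives but one member of the category is missed; with the lenient threshold nothing is missed but one item is wrongly flagged. Moving the threshold exchanges one kind of error for the other; it does not remove both.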
It is not clear which text categorization techniques are most effective. The best techniques are not yet used commercially, so there may be incremental improvements. Nor is it clear how effective semiautomated categorization is, or whether the categories that are difficult for automated methods are the same as those that most perplex people.

The simplest machine representation of text is what is known as the "bag-of-words" model, in which all of the words in an object are treated as an unstructured list. If the object's text is, "Dick Armey chooses Bob Shaffer to lead committee," then a representative list would be: Armey, Bob, chooses, committee, Dick, lead, Shaffer. (A slightly more sophisticated version associates a count of the number of times each word occurs with the word itself.)

Note that in such a representation, the structure and context of the text are completely lost. Thus, one weakness in the representation is due to the ambiguity of language, which is resolved in human discourse through context. For example, the word "beaver" has a hunter's meaning and a pornographic meaning. Other words, such as "breast" and "blow," may be less ambiguous but can be used pornographically or otherwise. When context is important in determining meaning, the bag-of-words representation is inadequate.

The bag-of-words representation is most useful when two conditions are met: when there are unambiguous words indicating relevant content, and when there are relatively few of these indicators. Pornographic text has these properties; probably about 40 or 50 words, most of them unambiguous, indicate pornography. Thus, the bag-of-words representation is reasonably well suited for this application, especially if a high rate of false positives is acceptable. However, in many other areas, such as violence and hate speech, the bag-of-words representation is less useful.
(For example, while a pornographic text such as the text on an adult Web page often can be identified by viewing the first few words, one must often read four or five sentences of a text before identifying it as hate speech.)

When fidelity of representation becomes important, a number of techniques can go beyond the bag-of-words model: morphological analysis, part-of-speech tagging, translation, disambiguation, genre analysis, information extraction, syntactic analysis, and parsing. For example, a technique more robust than the bag-of-words approach is to consider adjacent words, as search engines do when they give higher weight to information objects that match the query and have certain words in the same sentence. However, even with these technologies, true machine-aided text understanding will not be available in the near term, and there always will be a significant error rate with any automated method. Approaches that go beyond the bag-of-words representation improve accuracy, which may be important in contexts apart from pornography.
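The loss of structure in the bag-of-words model, and the partial remedy of considering adjacent words, can be seen in a short sketch. The function names are hypothetical, and the second sentence is an invented reordering of the example sentence used in this section.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return word counts; word order and sentence structure are discarded."""
    return Counter(re.findall(r"[A-Za-z]+", text))


def adjacent_pairs(text):
    """Return the list of neighboring word pairs, which retains some order."""
    words = re.findall(r"[A-Za-z]+", text)
    return list(zip(words, words[1:]))


original = "Dick Armey chooses Bob Shaffer to lead committee"
reordered = "Bob Shaffer chooses Dick Armey to lead committee"

same_bag = bag_of_words(original) == bag_of_words(reordered)
same_pairs = adjacent_pairs(original) == adjacent_pairs(reordered)
```

The two sentences say opposite things about who chose whom, yet their bags of words are identical, so `same_bag` is true. Their adjacent-word pairs differ, so `same_pairs` is false, which is one reason representations that look at neighboring words can be more robust.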
C.1.2 Image Representation and Retrieval

Images can be ambiguous in at least as many ways as text can be. Furthermore, there is no universal meta-language for describing images. People who are interested in images for advertising purposes have different ways to talk and think about them than do art historians, even though they may be searching for the same images. The lack of a common meta-language for images means that special meta-languages must be developed for images for use in different problem domains.

The process of determining whether a given image is pornographic involves object recognition, which is very difficult for a number of reasons. First, it is difficult to know what an object is; things look different from different angles and in different lights. When color and texture change, things look different. People can change their appearance by moving their heads around. We do not look different to one another when we do this, but we certainly look different in pictures.

Today, it is difficult for computer programs to find people, though finding faces can be done with reasonably high confidence. It is often possible to tell whether a picture has nearly naked people in it, but there is no program that reliably determines whether there are people wearing clothing in a picture.

To find naked people, image recognition programs exploit the fact that virtually everyone's skin looks about the same in a picture, as long as one is careful about intensity issues. Skin is very easy to detect reliably in pictures, so an image recognition program searching for naked people first searches for skin. So, one might simply assume that any big blob of skin must be a naked person.

However, images of the California desert, apple pies, and all sorts of other things are rendered in approximately the same way as skin. A more refined algorithm would then examine how the skin/desert/pie coloring is arranged in an image.
For example, if the coloring is arranged in patterns that are long and thin, that pattern might represent an arm, a leg, or a torso. Then, because the general arrangement of body parts is known, the location of an arm, for example, provides some guidance about where to search for a leg. Assembling enough of such pieces together can provide enough information for recognizing a person.

A number of factors help to identify certain pornographic pictures, such as those that one might find on an adult Web site. For example, in such pictures, the people tend to be big, and there is not much other than people in these pictures. Exploiting other information, such as the source of the image or the words and links on the Web page from which the image is drawn, can increase the probability of reliable identification of such an image.
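A rough sketch of the first two steps described in this section (find skin-colored pixels, then look at how they are arranged) follows. The RGB thresholds are an assumed rule of thumb and not a method given in the text, and the elongation of a bounding box is only a crude stand-in for detecting "long and thin" patterns. Images are represented as nested lists of (red, green, blue) tuples so that the sketch needs no imaging library.

```python
def looks_like_skin(pixel):
    """Crude color test for skin-like pixels (assumed thresholds)."""
    r, g, b = pixel
    return r > 95 and g > 40 and b > 20 and r > g and r > b and (r - g) > 15


def skin_fraction(image):
    """Fraction of the image's pixels that pass the skin-color test."""
    pixels = [p for row in image for p in row]
    return sum(looks_like_skin(p) for p in pixels) / len(pixels)


def elongation(image):
    """Long side divided by short side of the box around skin-like pixels."""
    coords = [
        (y, x)
        for y, row in enumerate(image)
        for x, p in enumerate(row)
        if looks_like_skin(p)
    ]
    if not coords:
        return 0.0
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return max(height, width) / min(height, width)


SKIN = (200, 150, 120)   # invented sample colors
SKY = (90, 140, 220)
SAND = (210, 180, 140)

# A 4-by-6 image of sky with one vertical strip of skin-colored pixels.
image = [[SKIN if x == 2 else SKY for x in range(6)] for _ in range(4)]

fraction = skin_fraction(image)
shape = elongation(image)
sand_passes = looks_like_skin(SAND)
```

The sand color passes the same test as the skin color, which is the desert and apple pie problem the text raises: color alone produces false positives, so the arrangement of the colored regions has to be examined as well.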
But it is essentially impossible for current computer programs to distinguish between hard-core and soft-core pornography, because what constitutes "hard-core" versus "soft-core" pornography is in the eyes of the viewer rather than the image itself. Consider also whether the photographs of Jock Sturges, many of which depict naked children, constitute pornography. Furthermore, in the absence of additional information, it is quite impossible for computer programs to distinguish between images that contain naked people in what might be pornographic poses, which are considered "high art" (e.g., paintings by Rubens), and what someone might consider a truly pornographic image.

What computer programs can do with reasonable reliability today (and higher reliability in the future) is to determine whether there might be naked people in a picture. But any of the contextual issues raised above will remain beyond the purview of automated recognition for the foreseeable future.

C.2 SEARCH ENGINES AND OTHER OPERATIONAL INFORMATION RETRIEVAL SYSTEMS

Information retrieval systems consist of a database of information objects, techniques for representing those objects and queries put to the database, and techniques for comparing query representations to information object representations. The typical technique for representing information objects is indexing them, according to words that appear in the documents, or words that are assigned to the documents by humans or by automatic techniques. An information retrieval system then takes the words that represent the user's query (or filter), and compares them to the "inverted index" of the system, in which all of the words used to index the objects in the collection are listed and linked to the documents that are indexed by them.
Surrogates (e.g., titles) for those objects that most closely (or exactly) match (i.e., are indexed by) the words in the query are then retrieved and displayed to the user of the system. It is then up to the user to decide whether one or more of the retrieved objects is relevant, or worth looking at.

From the above description, it is easy to see that "search engines" are a type of information retrieval system, in which the database is some collection of pages from the World Wide Web, which have been indexed by the system, and in which the retrieved results are links to the pages that have been indexed by the words in the user's query.

The basic algorithm in search engines is based on the "bag-of-words" model for handling data described above. However, they also use some kind of recognition of document structure to improve the effectiveness of a search. For example, search engines often treat titles differently from the body of a Web page; titles can often indicate the topic of a page. If the system can extract structure from documents, then this often can be used as an indicator for refining the retrieval process.

Many search engines also normalize the data, a process that involves stripping out capitalization and most of the other orthographic differences that distinguish words. Some systems do not throw this information away automatically, but rather attempt to identify things such as sequences of capitalized words possibly indicating a place name or person's name. The search engine often removes stop words, a list of words that it chooses not to index, typically quite common words like "and" and "the."4 In addition, the search engine may apply natural language processing to identify known phrases or chunks of text that properly belong together and indicate certain types of content.

What remains after such processing is a collection of words that need to be matched against documents represented in the database. The simplest strategy is the Boolean operator model. Simple Boolean logic says either "this word AND that word occur," or "this word OR that word occurs," and, therefore, the documents that have those words should be retrieved. Boolean matching is simple and easy to implement. Because of the volume of data on the Internet, almost all search engines today include an automatic default setting that, in effect, uses the AND operator with all terms provided to the search engine.

All Boolean combinations of words in a query can be characterized in a simple logic model that says, either this word occurs in the document, or it does not. If it does occur, then you have certain matches; if not, then you have other matches. Any combination of three words, for example, can be specified, such that the document has this word and not the other two, or all three together, or one and not the other of two.
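The inverted index and Boolean matching described in this section can be sketched in a few lines. The stop list, the three sample documents, and the function names are invented for illustration; an operational search engine would be far more elaborate.

```python
import re
from collections import defaultdict

# Hypothetical stop list of common words that are not indexed.
STOP_WORDS = {"a", "and", "in", "of", "the"}


def terms(text):
    """Normalize case and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def build_index(documents):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in terms(text):
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Documents containing every query term (the usual default)."""
    postings = [index.get(t, set()) for t in terms(query)]
    return set.intersection(*postings) if postings else set()


def search_or(index, query):
    """Documents containing at least one query term."""
    postings = [index.get(t, set()) for t in terms(query)]
    return set.union(*postings) if postings else set()


documents = {
    1: "The river bank flooded",
    2: "The bank raised interest rates",
    3: "Interest in the river grew",
}
index = build_index(documents)

both = search_and(index, "river bank")
either = search_or(index, "river bank")
synonym = search_and(index, "stream")
```

The AND query returns only document 1, the OR query returns all three, and the query "stream" returns nothing even though it is a near synonym of "river", because Boolean matching works only on the exact terms stored in the index.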
However, if the user does not specify the word exactly as it is stored in the index, then it will not be processed appropriately, and in particular the word cannot be a synonym (unless you supply that synonym), an alternate phrasing, or a euphemism.

4 Note that the stop list is a likely place to put a filter. For example, if "bitch" was included in the stop list, no Web site, including that of the American Kennel Club, would be found in searches that included the word "bitch."

Another strategy is the vector space model. A document is represented as an N-dimensional vector, in which N is the number of different words in the text, and the component of the vector in any given direction is simply the number of times the word appears in the text. The measure of similarity between two documents (or, more importantly, between a document and the search query similarly represented) is then given by the cosine of the angle between the two vectors. The value of this parameter ranges from zero to 1.0, and the closer the value is to 1.0, the more similar the document is to the query or to the other document. In this model, a perfect match (i.e., one with all the words present) is not necessary for a document to be retrieved. Instead, what is retrieved is a "best match" for the query (and of course, less good matches can also be displayed).

Most Web search engines use versions of the vector space model and also offer some sort of Boolean search. Some use natural language processing to improve search outcomes. Other engines (e.g., Google) use methods that weight pages depending on things like the number of links to a page. If there is only one link to a given page, then that page receives a lower ranking than a page with the same words but many links to it.

The preceding discussion assumes that the documents in question are text documents. Searching for images is much more difficult, because a similarity metric of images is very difficult to compute or even to conceptualize. For example, consider the contrast between the meaning of a picture of the Pope kissing a baby versus a picture of a politician kissing a baby. These pictures are the same in some ways, and very different in other ways.

More typically, image searches are performed by looking for text that is associated with images. A search engine will search for an image link tag within the HTML and the sentences that surround the image on either side, an approach with clear limitations. For example, the words, "Oh, look at the cute bunnies," mean one thing on a children's Web site and something entirely different on Playboy's site. Thus, the words alone may not indicate what those images are about.
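The vector space model described earlier in this section can be sketched as follows: each text becomes a vector of word counts, and similarity is the cosine of the angle between two such vectors. The documents, query, and function names are invented for illustration, and raw counts are used as vector components as in the description above (operational systems typically apply further weighting).

```python
import math
import re
from collections import Counter


def vector(text):
    """Represent a text as a sparse vector of word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine of the angle between two count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (similarity, document) pairs, best match first."""
    q = vector(query)
    return sorted(((cosine(q, vector(d)), d) for d in documents), reverse=True)


documents = [
    "river bank erosion",
    "bank interest rates",
    "mountain weather report",
]
ranked = rank("river bank", documents)
```

The document containing both query words ranks first, the document containing only "bank" still receives a nonzero score, and the unrelated document scores zero. This is the "best match" behavior: unlike a Boolean AND query, a perfect match is not required for a document to be retrieved.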
C.3 LOCATION VERIFICATION

Today, the Internet is designed and structured in such a way that the physical location of a user has no significance for the functionality he or she expects from the Internet or any resources to which he or she is connected. This fact raises the question of the extent to which an Internet user's location can in fact be established through technical means alone.

Every individual using the Internet at a given moment in time is associated with what is known as an IP address, and that IP address is usually associated with some fixed geographical location. However, because IP addresses are allocated hierarchically by a number of different administrative entities, knowing the geographical location of one of these entities does not automatically provide information about the locations associated with IP addresses that it allocates. For example, the National Academy of Sciences is based in Washington, D.C., and it allocates IP addresses to computers tied to its network. However, the Academy has employees in California, and also computers that are tied to the Academy network. The Academy knows which IP addresses are in California and in Washington, D.C., but someone who knew only that an IP address was one associated with the Academy would not know where that IP address was located.5

Under some circumstances, it can be virtually impossible to determine the precise physical location of an Internet user. Consider, for example, the case of an individual connecting to the Internet through a dial-up modem. It is not an unreasonable assumption that the user is most likely in the region in which calls to the dial-up number are local, simply because it would be unnecessary for most people to incur long-distance calling costs for such connections. Furthermore, the exchange serving dial-up modem access numbers can, in principle, employ caller-ID technology. However, the exchange associated with the telephone from which the dial-up call originates may not be technologically capable of providing caller-ID information; this would be the case in some areas in the United States and in much of the world. Or the user might simply suppress caller-ID information before making the dial-up modem call. In these instances, the number through which the individual connects to the Internet does not necessarily say anything about his location at that time.

Internet access routed through satellites can be difficult to localize as well. The reason is that a satellite's transmission footprint can be quite large (hundreds of square miles?), and more importantly is moving quite rapidly. Localization (but only within the footprint) can be accomplished only by working with a detailed knowledge of the orbital movements of an entire constellation of satellites.

However, those connecting to the Internet through a broadband connection can be localized much more effectively, though with some effort.
For example, while a cable Internet ISP may assign IP addresses to users dynamically, any given address must be mappable to a specific cable modem that can be identified with its media access control address. While such mapping is usually done for billing and customer care reasons, it provides a ready guide to geographical addresses at the end user's level. Those who gain access through DSL connections can be located because the virtual circuit from the digital subscriber line access multiplexer is mapped to a specific twisted pair of copper wires going into an individual's residence. Also, wireless connections made through cell phones (and their data-oriented equivalents) are now subject to a regulation that requires the network client to provide location information for E-911 (enhanced emergency 911) reasons. This information is passed through the signaling network and would be available to a wireless ISP as well.

5 While location information is not provided automatically from the IP addresses an administrative entity allocates, under some circumstances, some location information can be inferred. For example, if the administrative entity is an ISP, and the ISP is, for example, a French ISP, it is likely though not certain that most of the subscribers to a French ISP are located in France. Of course, a large French company using this ISP might well have branch offices in London, so the geographical correspondence between French ISP and French Internet user will not be valid for this case, though as a rule of thumb, it may not be a bad working assumption.

In principle, the information needed to ascertain the location of any IP address is known collectively by a number of administrative entities, and could be aggregated automatically. But there is no protocol in place to pass this information to relevant parties, and thus such aggregation is not done today. The result is that in practice, recovering location information is a complex and time-consuming process.

To bypass these difficulties, technical proposals have been made for location-based authentication.6 However, the implementation of such proposals generally requires the installation of additional hardware at the location of each access point, and thus cannot be regarded as a general-purpose solution that can localize all (or even a large fraction of) Internet users.

The bottom line is that determining the physical location of most Internet users is a challenging task today, though this task will become easier as broadband connections become more common.

C.4 USER INTERFACES

The history of information technology suggests that increasingly realistic and human-like forms of human-computer interaction will develop. The immediately obvious trends in the near-term future call for greater fidelity and "realism" in presentation. For example, faster graphics processors will enable more realistic portrayals of moving images, which soon will approach the quality of broadcast television.
Larger screens in which the displayed image subtends a larger angle in the eye will increase the sense of being immersed in or surrounded by the image portrayed. Goggles with built-in displays do the same, but also offer the opportunity for three-dimensional images to be seen by the user. Virtual reality displays carry this a step further, in that the view seen by the user is adjusted for changes in perspective (e.g., as one turns one's head, the view changes).

6 See, for example, Dorothy E. Denning and Peter F. MacDoran, 1996, "Location-Based Authentication: Grounding Cyberspace for Better Security," in Computer Fraud and Security, Elsevier Science Ltd., February. A commercial enterprise now sells authentication systems that draw heavily on the technology described in this paper. See <http://www.cyberlocator.com/works.html>.
Speech and audio input/output are growing more common. Today, computers can provide output in the form of sound or speech that is either a reproduction of human speech or speech that is computer-synthesized. The latter kind of speech is not particularly realistic today but is expected to become more realistic with more research and over time. Speech recognition is still in its infancy as a useful tool for practical applications, even after many years of research, but it, too, is expected to improve in quality (e.g., the ability to recognize larger vocabularies, a broader range of voices, a lower error rate) over time.

Another dimension of user interface is touch and feel. The "joystick" often used in some computer-based video games provides the user with a kinesthetic channel for input. Some joysticks also feature a force feedback that, for example, increases the resistance felt by the user when the stick is moved farther to one side or another. Such "haptic" interfaces can also, in principle, be built into gloves and suits that could apply pressure in varying amounts to different parts of the body in contact with them.

Finally, gesture recognition is an active field of research. Humans often specify things by pointing with their hands. Computer-based efforts to recognize gestures can rely on visual processing in which a human's gestures are viewed optically through cameras connected to the computer, and the motions analyzed. A second approach is based on what is known as a dataglove, which can sense finger and wrist motion and transmit information on these motions to a computer.7

Product vendors of these technologies promise a user experience of extraordinarily high fidelity. For example, it is easy to see how these technologies could be used to enhance perceived awareness of others: one might be alone at home, but through one's goggles and headphones, hear and see others sharing the same "virtual" space.
(In one of the simplest cases, one might just see others with goggles and headphones as well, but the digital retouching technologies that allow images to be modified might allow a more natural (though perhaps less realistic) depiction.)

7 See, for example, <http://www.ireality.com/Wireless_announce.html> for a 1997 product announcement by the General Reality company.