Engaging Privacy and Information Technology in a Digital Age

3
Technological Drivers

Privacy is an information concept, and fundamental properties of information define what privacy can—and cannot—be. For example, information has the property that it is inherently reproducible: If I share some information with you, we both have all of that information. This stands in sharp contrast to apples: If I share an apple with you, we each get half an apple, not a whole apple. If information were not reproducible in this manner, many privacy concerns would simply disappear.

3.1
THE IMPACT OF TECHNOLOGY ON PRIVACY

Advances in technology have often led to concerns about the impact of those advances on privacy. As noted in Chapter 1, the classic characterization of privacy as the right to be left alone was penned by Louis Brandeis in his article discussing the effects on privacy of the then-new technology of photography. The development of new information technologies, whether they have to do with photography, telephony, or computers, has almost always raised questions about how privacy can be maintained in the face of the new technology. Today’s advances in computing technology can be seen as no more than a recurrence of this trend, or can be seen as different in that new technology, being fundamentally concerned with the gathering and manipulation of information, increases the potential for threats to privacy.

Several trends in the technology have led to concerns about privacy. One such trend has to do with hardware that increases the amount of information that can be gathered and stored and the speed with which that information can be analyzed, thus changing the economics of what it is possible to do with information technology. A second trend concerns the increasing connectedness of this hardware over networks, which magnifies the increases in the capabilities of the individual pieces of hardware that the network connects. A third trend has to do with advances in software that allow sophisticated mechanisms for the extraction of information from the data that are stored, either locally or on the network. A fourth trend, enabled by the other three, is the establishment of organizations and companies that offer as a resource information that they have gathered themselves or that has been aggregated from other sources and then organized and analyzed by the company.

Improvements in the technologies have been dramatic, but the systems that have been built by combining those technologies have often yielded overall improvements that appear greater than the sum of the constituent parts. These improvements have in some cases changed what it is possible or economically feasible to do with the technologies; in other cases they have made what was once difficult so easy that anyone can perform the action at any time. The end result is that there are now capabilities for gathering, aggregating, analyzing, and sharing information about and related to individuals (and groups of individuals) that were undreamed of 10 years ago. For example, global positioning system (GPS) locators attached to trucks can provide near-real-time information on their whereabouts and even their speed, giving truck shipping companies the opportunity to monitor the behavior of their drivers. Cell phones equipped to provide E-911 service can be used to map to a high degree of accuracy the location of the individuals carrying them, and a number of wireless service providers are marketing cell phones so equipped to parents who wish to keep track of where their children are.

These trends are manifest in the increasing number of ways people use information technology, both for the conduct of everyday life and in special situations. The personal computer, for example, has evolved from a replacement for a typewriter to an entry point to a network of global scope. As a network device, the personal computer has become a major agent for personal interaction (via e-mail, instant messaging, and the like), for financial transactions (bill paying, stock trading, and so on), for gathering information (e.g., Internet searches), and for entertainment (e.g., music and games). Along with these intended uses, however, the personal computer can also become a data-gathering device sensing all of these activities. The use of the PC on the network can potentially generate data that can be analyzed to find out more about users of PCs than they anticipated or intended, including their buying habits, their reading and listening preferences, whom they communicate with, and their interests and hobbies.

Concerns about privacy will grow as the use of computers and networks expands into new areas. If we can’t keep data private with the current use of technology, how will we maintain our current understanding of privacy when the common computing and networking infrastructure includes our voting, medical, financial, travel, and entertainment records, our daily activities, and the bulk of our communications? As more aspects of our lives are recorded in systems for health care, finance, or electronic commerce, how are we to ensure that the information gathered is not used inappropriately to detect or deduce what we consider to be private information? How do we ensure the privacy of our thoughts and the freedom of our speech as the electronic world becomes a part of our government, central to our economy, and the mechanism by which we cast our ballots? As we become subject to surveillance in public and commercial spaces, how do we ensure that others do not track our every move? As citizens of a democracy and participants in our communities, how can we guarantee that the privacy of putatively secret ballots is assured when electronic voting systems are used?

The remainder of this chapter explores some relevant technology trends, describing current and projected technological capacity and relating it to privacy concerns. It also discusses computer, network, and system architectures and their potential impacts on privacy.

3.2
HARDWARE ADVANCES

Perhaps the most commonly known technology trend is the exponential growth in computing power—loosely speaking, the central processing unit in a computer will double in speed (or halve in price) every 18 months.
What this trend has meant is that over the last 10 years, we have gone through about seven generations, which in turn means that the power of the central processing unit has increased by a factor of more than 100. The impact of this change on what is possible or reasonable to compute is hard to overestimate. Tasks that took an hour 10 years ago now take less than a minute. Tasks that now take an hour would have taken days to complete a decade ago. The end result of this increase in computing speed is that many tasks that were once too complex to be automated can now be easily tackled by commonly available machines.

While the increase in computing power that is implied by this exponential growth is well known and often cited, less appreciated are the economic implications of that trend, which entail a decrease in the cost of computation by a factor of more than 100 over the past 10 years. One outcome of this is that the desktop computer used in the home today is far more powerful than the most expensive supercomputer of 10 years ago. At the same time, the cell phones commonly used today are at least as powerful as the personal computers of a decade ago. This change in the economics of computing means that there are many more computers in simple numbers than there were a decade ago, which in turn means that the amount of total computation available at a reasonable price is no longer a limiting factor in any but the most complex of computing problems.

Nor is it merely central processing units (CPUs) that have shown dramatic improvements in performance and dramatic reductions in cost over the past 10 years. Dynamic random access memory (DRAM), which provides the working space for computers, has also followed a course similar to that for CPU chips.[1] Over the past decade memory size has in some cases increased by a factor of 100 or more, which allows not only for faster computation but also for the ability to work on vastly larger data sets than was possible before.

Less well known in the popular mind, but in some ways more dramatic than the trend toward faster processors and larger memory chips, has been the expansion of capabilities for storing electronic information. The price of long-term storage has been decreasing rapidly over the last decade, and the ability to access large amounts of such storage has been increasing. Storage capacity has been increasing at a rate that has outpaced the rate of increase in computing power, with some studies showing that it has doubled on average every 12 months.[2] The result of this trend is that data can be stored for long periods of time in an economical fashion. In fact, the economics of data storage has become inverted.
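The doubling periods quoted above (CPU speed every 18 months, storage capacity every 12 months) imply the decade-scale factors cited in the text; a quick arithmetic sketch, treating the periods as rules of thumb rather than exact laws:

```python
# Growth implied by a fixed doubling period. The 18-month (CPU) and
# 12-month (storage) periods are the figures quoted in the text.
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` with the given doubling period."""
    return 2.0 ** (years / doubling_period_years)

cpu_decade = growth_factor(10, 1.5)      # about seven doublings
storage_decade = growth_factor(10, 1.0)  # ten doublings

print(f"CPU over a decade: ~{cpu_decade:.0f}x")      # a factor of more than 100
print(f"Storage over a decade: ~{storage_decade:.0f}x")
```

A factor of roughly 100 for computation against roughly 1,000 for storage over the same decade is consistent with the text's point that storage outpaced computing, which is what inverted the economics of data retention.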
Traditionally, data was discarded as soon as possible to minimize the cost of storing that data, or at least moved from primary storage (disks) to secondary storage (tape), where it was more difficult to access. With the advances in the capacities of primary storage devices, it is now often more expensive to decide how to cull data or transfer it to secondary storage (and to spend the resources to do the culling or transferring) than it is to simply store it all on primary storage, adding new capacity when it is needed.

The change in the economics of data storage has altered more than just the need to occasionally cull data. It has also changed the kind of data that organizations are willing to store. When persistent storage was a scarce resource, considerable effort was expended in ensuring that the data that were gathered were compressed, filtered, or otherwise reduced before being committed to persistent storage. Often the purpose for which the data had been gathered was used to enhance this compression and filtering, resulting in the storing not of the raw data that had been gathered but instead of computed results based on those data. Since the computed results were task-specific, it was difficult or impossible to reuse the stored information for other purposes; part of the compression and filtering caused a loss of general information such that it could not be recovered.

With the increase in the capacity of long-term storage, reduction of data as they are gathered is no longer needed. And although compression is still used in many kinds of data storage, that compression is often reversible, allowing the re-creation of the original data set. The ability to re-create the original data set is of great value, as it allows more sophisticated analysis of the data in the future. But it also allows the data to be analyzed for purposes other than those for which they were originally gathered, and allows the data to be aggregated with data gathered in other ways for additional analysis. Additionally, forms of data that were previously considered too large to be stored for long periods of time can now easily be placed on next-generation storage devices. For example, high-quality video streams, which can take up megabytes of storage for each second of video, were once far too large to be stored for long periods; the most that was done was to store samples of the video streams on tape.

[1] On the other hand, the speed with which the contents of RAM chips can be accessed has not increased commensurately with speed increases in CPU chips, and so RAM access has become relatively “slower.” This fact has not yet had many privacy implications, but may in the future.
[2] E. Grochowski and R.D. Halem, “Technological Impact of Magnetic Hard Disk Drives on Storage Systems,” IBM Systems Journal 42(2):338-346, July 2003.
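The distinction between lossy, task-specific reduction and reversible compression can be seen with any lossless codec; a minimal sketch using Python's standard zlib module (the sample record format is invented for illustration):

```python
import zlib

# Reversible (lossless) compression: the original data set can be
# re-created exactly, so stored data can later be re-analyzed for
# purposes other than the one it was originally gathered for.
record = b"2003-07-01T12:00:00,building-4,temp=21.5C,humidity=40%\n"
raw = record * 1000  # a repetitive sensor log compresses well

packed = zlib.compress(raw)
print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes")

assert zlib.decompress(packed) == raw  # exact round trip, nothing lost
```

A lossy pipeline, by contrast, would keep only a task-specific summary (say, the daily mean temperature) and could never reconstruct the raw readings, which is exactly why such data could not be reused for other purposes.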
Now it is possible to store large segments of real-time video footage on various forms of long-term storage, keeping recent video footage online on hard disks and then archiving older footage on DVD storage.

Discarding or erasing stored information does not eliminate the possibility of compromising the privacy of the individuals whose information had been stored. A recent study has shown that a large number of disk drives available for sale on the secondary market contain easily obtainable information that was placed on the drive by the former owner. Included in the information found by the study was banking account information, information about prescription drug use, and college application information.[3] Even when the previous owners of the disk drive had gone to some effort to erase the contents of the drive, it was in most cases fairly easy to repair the drive in such a way that the data that the drive had held were easily available. In fact, one of the conclusions of the study is that it is quite hard to really remove information from a modern disk drive; even when considerable effort has been put into removing the information, sophisticated “digital forensic” techniques can be used to re-create the data. From the privacy point of view, this means that once data have been gathered and committed to persistent storage, it is very difficult to ever be sure that the data have been removed or forgotten—a point very relevant to the archiving of materials in a digital age.

With more data, including more kinds of data, being kept in raw form, the concern arises that every electronic transaction a person ever enters into can be kept in readily available storage, and that audio and video footage of all of the public activities of that person could also be available. This information, originally gathered for purposes of commerce, public safety, health care, or some other reason, could then be available for uses other than those originally intended. The fear is that the temptation to use all of this information, whether by a governmental agency, by private corporations, or even by individuals, is so great that it will be nearly impossible to guarantee the privacy of anyone from some sort of prying eye, if not now then in the future.

The final hardware trend relevant to issues of personal privacy involves data-gathering devices. The evolution of these devices has moved them from generating analog data to generating data in digital form; from devices that were on specialized networks to devices that are connected to larger networks; and from expensive, specialized devices that were deployed only in rare circumstances to cheap, ubiquitous devices either too small or too common to be generally noticed.

[3] Simson L. Garfinkel and Abhi Shelat, “Remembrance of Data Past: A Study of Disk Sanitization Practices,” IEEE Security and Privacy 1(1):83-88, 2003.
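Because ordinary deletion merely unlinks a file, sanitization guidance generally calls for overwriting the data first. A minimal, file-level sketch (illustrative only: it does not defeat the drive-level forensic recovery described above, since remapped sectors, journals, and backups may retain copies, and the filename is invented):

```python
import os

def overwrite_and_delete(path: str, passes: int = 3) -> None:
    """Overwrite a file's bytes with random data before unlinking it.

    File-level only: the filesystem and the drive itself may still hold
    copies of the data, which is why forensic recovery remains possible.
    """
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))  # replace contents with noise
            f.flush()
            os.fsync(f.fileno())       # push the overwrite to the device
    os.remove(path)

# Example usage with a throwaway file:
with open("scratch.bin", "wb") as f:
    f.write(b"account number 1234-5678")
overwrite_and_delete("scratch.bin")
assert not os.path.exists("scratch.bin")
```

The gap between this sketch and what the cited study found on second-hand drives is the point: genuinely removing data from a modern drive takes far more than deleting, or even overwriting, the visible file.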
Biometric devices, which sense physiological characteristics of individuals, also count as data-gathering devices. These sensors, ranging from simple temperature and humidity sensors in buildings, to the positioning systems in automobiles, to video cameras used in public places to aid in security, continue to proliferate, showing the way to a world in which all of our physical environment is being watched and sensed by sets of eyes and other sensors. Box 3.1 provides a sampling of these sensing devices.

The ubiquitous connection of these sensors to the network is really a result of the transitive nature of connectivity. In most cases it is not the sensors themselves that are connected to the larger world. The standard sensor deployment has a group of sensors connected by a local (often specialized) network to a single computer. However, that computer is in turn connected to the larger network, either an intranet or the Internet itself. Because of this latter connection, the data generated by the sensors can be moved around the network like any other data once the computer to which the sensors are directly connected has received them.

The final trend of note in sensing devices is their nearly ubiquitous proliferation. Video cameras are now a common feature of many public places; traffic sensors have become common; and temperature and humidity sensors (which can be used as sensors to detect humans) are in many modern office buildings. Cell phone networks gather position information for 911 calling, which could be used to track the locations of their users. Many automobiles contain GPS sensors, as part of either a navigation system or a driver aid system. As these devices become smaller and more pervasive, they become less noticeable, leading to the gathering of data in contexts where such gathering is neither expected nor noticed.

BOX 3.1
A Sampling of Advanced Data-gathering Technologies

- Pervasive sensors and new types of sensors (e.g., “smart dust”)
- Infrared/thermal detectors
- GPS/location information
- Cell-phone-generated information
- Radio-frequency identification tags
- Chips embedded in people
- Medical monitoring (e.g., implanted heart sensors)
- Spycams and other remote cameras
- Surveillance cameras in most public places
- Automated homes with temperature, humidity, and power sensors
- Traffic flow sensors
- Camera/cell-phone combinations
- Toys for children that incorporate surveillance technology (such as a stuffed animal that contains a nanny-cam)
- Biometrics-based recognition systems (e.g., based on face recognition, fingerprints, voice prints, gait analysis, iris recognition, vein patterns, hand geometry)
- Devices for remote reading of monitors and keyboards
- Brain wave sensors
- Smell sensors

However, it should also be noted that data-gathering technologies need not be advanced or electronic to be significant or important. Mail or telephone surveys, marketing studies, and health care information forms, sometimes coupled with optical scanning to convert manually created data into machine-readable form, also generate enormous amounts of personal and often sensitive information.
The proliferation of explicit sensors in our public environments has been a cause for alarm. There is also the growing realization that every computer used by a person is also a data-gathering device. Whenever a computer is used to access information or perform a transaction, information about the use or transaction can be (and often is) gathered and stored. This means that data can be gathered about far more people in far more circumstances than was possible 10 years ago. It also means that such information can be gathered about activities that intuitively appear to occur within the confines of the home, a place that has traditionally been a center of privacy-protected activities. As more and more interactions are mediated by computers, more and more data can be gathered about more and more activities.

The trend toward ubiquitous sensing devices has only begun, and it shows every sign of accelerating at an exponential rate similar to that seen in other parts of computing. New kinds of sensors, such as radio-frequency identification (RFID) tags or medical sensors allowing constant monitoring of human health, are being mandated by entities such as Walmart and the Department of Defense. Single-sensor surveillance may be replaced in the future with multiple-sensor surveillance. The economic and health benefits of some ubiquitous sensor deployments are significant. But the impact that those and other deployments will have in practice on individual privacy is hard to determine.

3.3
SOFTWARE ADVANCES

In addition to the dramatic and well-known advances in the hardware of computing have come significant advances in the software that runs on that hardware, especially in the area of data mining and information fusion/data integration techniques and algorithms. Owing partly to the new capabilities enabled by advances in the computing platform and partly to better understanding of the algorithms and techniques needed for analysis, the ability of software to analyze the information gathered and stored on computing machinery has made great strides in the past decade.
In addition, new techniques in parallel and distributed computing have made it possible to couple large numbers of computers together to jointly solve problems that are beyond the scope of any single machine.

Although data mining is generally construed to encompass data searching, analysis, aggregation, and, for lack of a better term, archaeology, “data mining” in the strict sense of the term is the extraction of information implicit in data, usually in the form of previously unknown relationships among data elements. When the data sets involved are voluminous, automated processing is essential, and today computer-assisted data mining often uses machine learning, statistics, and visualization techniques to discover and present knowledge in a form that is easily comprehensible.

Information fusion is the process of merging or combining multiple sources of information in such a way that the resulting information is more accurate, reliable, or robust as a basis for decision making than any single source of information would be. Information fusion often involves the use of statistical methods, such as Bayesian techniques and random effects modeling. Some information fusion approaches are implemented as artificial neural networks.

Both data mining and information fusion have important everyday applications. For example, by using data mining to analyze the patterns of an individual’s previous credit card transactions, a bank can determine whether a credit card transaction today is likely to be fraudulent. By combining results from different medical tests using information fusion techniques, physicians can infer the presence or absence of underlying disease with higher confidence than if the result of only one test were available.

These techniques are also relevant to the work of government agencies. For example, the protection of public health is greatly facilitated by early warning of outbreaks of disease. Such warning may be available through data mining of the highly distributed records of first-line health care providers and pharmacies selling over-the-counter drugs. Unusually high buying patterns of such drugs (e.g., cold remedies) in a given locale might signal the previously undetected presence, and even the approximate geographic location, of an emerging epidemic threat (e.g., a flu outbreak). Responding to a public health crisis might be better facilitated with automated access to and screening analyses of patient information at clinics, hospitals, and pharmacies. Research on these systems is today in its infancy, and it remains to be seen whether such systems can provide reliable warning on the time scales needed by public health officials to respond effectively.

Data-mining and information fusion technologies are also relevant to counterterrorism, crisis management, and law enforcement.
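The medical-test example of information fusion above can be made concrete with Bayes’ rule; a sketch that fuses two test results under the common (and strong) assumption of conditional independence, with all prevalence, sensitivity, and specificity numbers invented for illustration:

```python
def fuse_tests(prior: float, tests: list[tuple[float, float, bool]]) -> float:
    """Posterior probability of disease after fusing test results.

    Each test is (sensitivity, specificity, result); tests are assumed
    conditionally independent given disease status, as in naive
    Bayesian fusion.
    """
    odds = prior / (1.0 - prior)  # convert prior probability to odds
    for sensitivity, specificity, positive in tests:
        if positive:
            odds *= sensitivity / (1.0 - specificity)   # likelihood ratio
        else:
            odds *= (1.0 - sensitivity) / specificity
    return odds / (1.0 + odds)    # back to a probability

# One positive test versus two concordant positive tests (invented numbers):
one = fuse_tests(0.01, [(0.9, 0.95, True)])
two = fuse_tests(0.01, [(0.9, 0.95, True), (0.8, 0.9, True)])
print(f"one test: {one:.2f}, two tests: {two:.2f}")
```

Two concordant positives raise the posterior far more than either test alone, which is precisely the “higher confidence” from fusing sources that the text describes.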
Counterterrorism involves, among other things, the identification of terrorist operations before execution through analysis of signatures and database traces made during an operation’s planning stages. Intelligence agencies also need to pull together large amounts of information to identify the perpetrators of a terrorist attack. Responding to a natural disaster or terrorist attack requires the quick aggregation of large amounts of information in order to mobilize and organize first responders and assess damage. Law enforcement must often identify perpetrators of crimes on the basis of highly fragmentary information—e.g., a suspect’s first name, a partial license number, and vehicle color.

In general, the ability to analyze large data sets can be used to discern statistical trends or to allow broad-based research in the social, economic, and biological sciences, which is a great boon to all of these fields. But the ability can also be used to facilitate target marketing, enable broad-based e-mail advertising campaigns, or (perhaps most troubling from a privacy perspective) discern the habits of targeted individuals.

The threats to privacy are more than just the enhanced ability to track an individual through a set of interactions and activities, although that by itself can be a cause for alarm. It is now possible to group people into smaller and smaller groups based on their preferences, habits, and activities. Nothing categorically rules out the possibility that in some cases the size of the group can be made as small as one, thus identifying an individual based on some set of characteristics having to do with the activities of that individual. Furthermore, data used for this purpose may have been gathered for other, completely different reasons. For example, cell phone companies must track the locations of cell phones on their network in order to determine the tower responsible for servicing any individual cell phone. But these data can be used to trace the location of cell-phone owners over time.[4] Temperature and humidity sensors used to monitor the environment of a building can generate data that indicate the presence of people in particular rooms. The information accumulated in a single database for one reason can easily be used for other purposes, and the information accumulated in a variety of databases can be aggregated to allow the discovery of information about an individual that would be impossible to find out given only the information in any single one of those databases.

The end result of the improvements in both the speed of computational hardware and the efficiency of the software that is run on that hardware is that tasks that were unthinkable only a short time ago are now possible on low-cost, commodity hardware running commercially available software.
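The shrinking of groups down to a single person described above can be seen by counting how many records share each combination of seemingly harmless attributes; a toy sketch over invented records:

```python
from collections import Counter

# Invented records: no names, only "harmless" attributes.
records = [
    {"zip": "20001", "age": 34, "car": "sedan"},
    {"zip": "20001", "age": 34, "car": "truck"},
    {"zip": "20001", "age": 51, "car": "sedan"},
    {"zip": "20002", "age": 34, "car": "truck"},
    {"zip": "20002", "age": 51, "car": "truck"},
    {"zip": "20002", "age": 51, "car": "sedan"},
]

def group_sizes(keys):
    """How many records share each combination of the given attributes?"""
    return Counter(tuple(r[k] for k in keys) for r in records)

print(group_sizes(["zip"]))                # each group holds 3 people
print(group_sizes(["zip", "age", "car"]))  # each group holds exactly 1
```

Each attribute alone leaves every group at size three, but the three attributes combined make every record unique: a “group” of one, which is what makes aggregation across databases a re-identification risk.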
Some of these new tasks involve the extraction of information about the individual from data gathered from a variety of sources. A concern from the privacy point of view is that, given the extent of the ability to aggregate, correlate, and extract new information from seemingly innocuous information, it is now difficult to know what activities will in fact compromise the privacy of an individual.

3.4
INCREASED CONNECTIVITY AND UBIQUITY

The trends toward increasingly capable hardware and software and increased capacities of individual computers to store and analyze information are additive; the ability to store more information pairs with the increased ability to analyze that information. When combined with these two, a third technology trend, the trend toward increased connectivity in the digital world, has a multiplicative effect.

The growth of network connectivity—obvious over the past decade in the World Wide Web’s expansion from a mechanism by which physicists could share information to a global phenomenon, used by millions to do everything from researching term papers to ordering books—can be traced back to the early days of local area networks and the origin of the Internet: Growth in the number of nodes on the Internet has been exponential over a period that began roughly in 1980 and continues to this day.[5] Once stand-alone devices that connected with each other through the use of floppy disks or dedicated telephone lines, computers are now networked devices that are (nearly) constantly connected to each other. A computer that is connected to a network is not limited by its own processor, software, and storage capacity; it can potentially make use of the computational power of the other machines connected to that network and the data stored on those other computers. The additional power is characterized by Metcalfe’s law, which states that the power of a network of computers increases in proportion to the number of pair-wise connections that the network enables.[6]

A result of connectivity is the ability to access information stored or gathered at a particular place without having physical access to that place. It is no longer necessary to be able to actually touch a machine to use that machine to gather information or to gain access to any information stored on the machine. Controlling access to a physical resource is a familiar concept for which we have well-developed intuitions, institutions, and mechanisms that allow us to judge the propriety of access and to control that access. These intuitions, institutions, and mechanisms are much less well developed in the case of networked access.

[4] Matt Richtel, “Tracking of Mobile Phones Prompts Court Fights on Privacy,” New York Times, December 10, 2005, p. A1.
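Metcalfe's law as stated above is just the count of possible pair-wise connections among n machines, n(n-1)/2, which grows roughly as the square of the number of nodes; a quick sketch:

```python
def pairwise_connections(n: int) -> int:
    """Number of distinct pairs among n networked machines: n(n-1)/2."""
    return n * (n - 1) // 2

# Doubling the nodes roughly quadruples the possible connections,
# which is the multiplicative effect connectivity adds.
for n in (10, 100, 1000):
    print(n, pairwise_connections(n))
```

The caveat in the accompanying footnote applies: this counts possible connections, and because not every pair is equally valuable in a real network, the actual value grows more slowly than the raw pair count suggests.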
The increased connectivity of computing devices has also resulted in a radical decrease in the transaction costs for accessing information. This has had a significant impact on the question of what should be considered a public record, and how those public records should be made available. Much of the information gathered by governments at various levels is considered public record. Traditionally, the costs (both in monetary terms and in terms of time and human aggravation) to access such

[5] Raymond Kurzweil, The Singularity Is Near, Viking Press, 2005, pp. 78-81.
[6] See B. Metcalfe, “Metcalfe’s Law: A Network Becomes More Valuable as It Reaches More Users,” Infoworld, October 2, 1995. See also the May 6, 1996, column at http://www.infoworld.com/cgi-bin/displayNew.pl?/metcalfe/bm050696.html. The validity of Metcalfe’s law is based on the assumption that every connection in a network is equally valuable. In practice, however, certain nodes in many networks are much more valuable than others, suggesting that the value may increase less rapidly than in proportion to the number of possible pair-wise connections.
3.8.2.2 Statistical Disclosure Limitation Techniques27

Other techniques can be used to reduce the likelihood that a specific individual can be identified in a data-mining application that seeks to uncover certain statistical patterns. Such techniques are useful to statistical agencies such as the Census Bureau, the Bureau of Labor Statistics, and the Centers for Disease Control and Prevention (to name only a few), which collect vast amounts of personally identifiable data and use those data to produce data sets, summaries, and other products for the public or for research uses, most often in the form of statistical tables (i.e., tabular data). Some agencies (e.g., the Census Bureau) also make available so-called microdata files, that is, files that show (while omitting specific identifying information) the full range of responses made on individual questionnaires. Such files can show, for example, how one household or one household member answered questions on occupation, place of work, and so on. Given the sensitive nature of much of this information and the types of analysis and comparison facilitated by modern technology, statistical agencies employ a wide range of techniques to prevent the disclosure of personally identifiable information and to ensure that the data that are made available cannot be used to identify specific individuals or, in some cases, specific groups or organizations. Descriptions of many of those techniques follow.

Limiting details. Both with tabular data and with microdata, formal identifiers and many geographic details are often simply omitted for all respondents.

Adding noise. Data can be perturbed by adding random noise (adding a small random amount or multiplying by a random factor close to 1, most often before tabulation) to help disguise potentially identifying values.
For example, each individual's or household's income values might be perturbed by a small percentage.

Targeted suppression. This method suppresses or omits extreme values, or values unique enough to constitute a disclosure.

Top-coding and bottom-coding. These techniques limit disclosure of data at the high or low end of a given range by grouping together values falling above or below a certain level. For instance, an income table could list every income below $20,000 simply as "below $20,000."

Recoding. Similar to top-coding and bottom-coding, recoding

27 Additional discussion of some of these techniques can be found in National Research Council, Private Lives and Public Policies, National Academy Press, Washington, D.C., 1993.
involves assigning individual values to groups or ranges rather than showing exact figures. For example, an income of $54,500 could simply be represented as falling within the range of "$50,000-$60,000." Such recoding can be adequate for many uses in which detailed data are not required.

Rounding. This technique involves rounding values (e.g., incomes) up or down according to a predetermined convention. For example, one might decide to round all incomes down to the nearest $5,000 increment. Another model makes a random decision about whether to round a given value up or down.

Swapping and/or shuffling. Swapping entails choosing a certain set of fields among a set of records in which values match, and then swapping all other values among those records. Records can also be compared and ranked according to a given value so that swapping can be based on values that, while not identical, are close to each other (so-called rank-swapping). Data shuffling is a hybrid approach that blends perturbation and swapping techniques.

Sampling. This method includes data from only a sample of a given population.

Blank and impute. In this process, values for particular fields in a selection of records are deleted, and the fields are then filled either with statistically modeled values or with values taken from other respondents.

Blurring. This method replaces a given value with an average. The average can be determined in a number of ways; for example, one might select the records to be averaged based on the values in another field, select them at random, or vary the number of values averaged.
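Three of the techniques above, adding noise, top-coding, and recoding, can be sketched in a few lines of Python. This is an illustrative toy, not any agency's actual procedure; the 5 percent noise scale, the $150,000 ceiling, and the $10,000 bin width are invented parameters.

```python
import random

rng = random.Random(0)  # seeded so the example is reproducible

def add_noise(value: float, scale: float = 0.05) -> float:
    """Adding noise: multiply by a random factor close to 1."""
    return value * rng.uniform(1 - scale, 1 + scale)

def top_code(value: float, ceiling: float = 150_000):
    """Top-coding: collapse everything at or above the ceiling."""
    return f"{ceiling:,.0f} or above" if value >= ceiling else value

def recode(value: float, width: int = 10_000) -> str:
    """Recoding: report only the range containing the value."""
    lo = int(value // width) * width
    return f"${lo:,}-${lo + width:,}"

income = 54_500
noisy = add_noise(income)
assert abs(noisy - income) <= income * 0.05
assert top_code(income) == income              # below the ceiling, unchanged
assert top_code(200_000) == "150,000 or above"
assert recode(income) == "$50,000-$60,000"     # the text's $54,500 example
```

Real disclosure limitation applies such transformations systematically across a file and then checks the result for residual disclosure risk; the sketch shows only the per-value operations.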
3.8.2.3 Cryptographic Techniques

The Portia project28 is a cross-institutional research effort attempting to apply cryptographic protocols to some of the problems of privacy. Such protocols make it possible, in principle, to run queries over multiple databases without revealing any information other than the answer to the particular query, ensuring that multi-database queries can be accomplished without the privacy-threatening aggregation of the data in those databases. Although there are theoretical protocols that can be proved to give these results, implementing them in a fashion efficient enough for common use is an open research

28 More information about the Portia project is available at http://crypto.stanford.edu/portia/.
problem. These investigations are in their early stages, so it is too soon to determine whether the resulting techniques will be appropriate for wide use.

A similar project is attempting to develop a so-called Hippocratic database, which the researchers define as one whose owners "have responsibility for the data that they manage to prevent disclosure of private information."29 The thrust of this work is to develop database technology that minimizes the likelihood that data stored in the database are used for purposes other than those for which they were gathered. While this project has produced results in the published literature, it has not resulted in any widely deployed commercial products.

3.8.2.4 User Notification

Another set of technologies focuses on notification. For example, the Platform for Privacy Preferences (P3P) facilitates the development of machine-readable privacy policies.30 Visitors to a P3P-enabled Web site can set their browsers to retrieve the site's privacy policy and compare it to a number of visitor-specified privacy preferences. If the site's policy is weaker than the visitor prefers, the visitor is notified of that fact. P3P thus seeks to automate what would otherwise be an onerous manual process of reading and comprehending the site's written privacy policy. An example of a P3P browser add-on is Privacy Bird.31 The result of the comparison between a site's policy and the user's preferences is displayed graphically in the browser's toolbar as a bird of varying color: green and singing for a site whose policy does not violate the requirements set by the user, red and angry when the policy conflicts with the user's preferences. Systems such as Privacy Bird cannot guarantee the privacy of the individual who uses them; such guarantees can be provided only by enforcement of the stated policy.
They do, however, address the privacy issue directly, allowing users to determine what information they are willing to reveal, along with what policies the recipient of the information intends to follow regarding the use of that information or its transfer to third parties. Also, the process of developing a P3P-compatible privacy policy is structured and systematic, so a Web site operator may discover gaps in its existing privacy policy as it translates that policy into machine-readable form.

29 Rakesh Agrawal, Jerry Kiernan, Ramakrishnan Srikant, and Yirong Xu, "Hippocratic Databases," 28th International Conference on Very Large Databases (VLDB), Hong Kong, 2002.

30 See http://www.w3.org/P3P/.

31 See http://www.privacybird.com/.
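The policy-against-preferences comparison that tools like Privacy Bird automate can be sketched as follows. Real P3P policies are XML documents with a standardized vocabulary of purposes, recipients, and retention; the dictionary keys below are invented stand-ins for that vocabulary.

```python
# Illustrative policy categories (invented names, not the P3P vocabulary).
site_policy = {
    "shares_with_third_parties": True,
    "retains_data_indefinitely": False,
    "uses_data_for_marketing": True,
}

user_preferences = {
    "shares_with_third_parties": False,  # the user objects to sharing
    "retains_data_indefinitely": False,
    "uses_data_for_marketing": True,     # the user accepts marketing use
}

def conflicts(policy: dict, prefs: dict) -> list:
    # A conflict arises when the site does something the user has disallowed.
    return [k for k, allowed in prefs.items() if policy.get(k) and not allowed]

issues = conflicts(site_policy, user_preferences)
# A Privacy Bird-style display: red/angry on conflict, green/singing otherwise.
print("red, angry bird" if issues else "green, singing bird", issues)
```

As the text notes, such a comparison only notifies the user; it cannot verify that the site actually follows its stated policy.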
3.8.2.5 Information Flow Analysis

Privacy can also be protected by tools for automated privacy audits. Some companies, especially large ones, may find it difficult to know the extent to which their practices actually comply with their stated policies. The purpose of a privacy audit is to help a company determine the extent to which it is in compliance with its own policy. Because the information flows within a large company are multiple and varied, automated tools are very helpful in identifying and monitoring such flows. When potential policy violations are identified, these tools bring the information flows in question to the attention of company officials for further review. Such tools often focus on information flows to and from externally visible Web sites, monitoring form submissions and cookie usage and looking for Web pages that accidentally reveal personal information. Tools can also tag data as privacy sensitive; when such tagged data are subsequently accessed, other software can check that the access is consistent with the company's privacy policy. Because of the many information flows in and out of a company, a comprehensive audit of a company's privacy policy is generally quite difficult. Although it is virtually impossible to deploy automated tools everywhere within a company's information infrastructure, automated auditing tools can do a great deal to improve a company's compliance with its own stated policy.
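The tag-and-check idea can be sketched in a few lines. The field names, purposes, and policy table below are all invented for illustration; a real tool would derive them from the company's stated policy and would monitor many kinds of flows, not just field access.

```python
# Fields tagged as privacy sensitive (hypothetical names).
SENSITIVE_FIELDS = {"ssn", "medical_history", "home_address"}

# Which sensitive fields each purpose is allowed to touch (hypothetical policy).
POLICY = {"billing": {"home_address"}, "analytics": set()}

def check_access(fields, purpose, audit_log):
    # Compare the requested sensitive fields against the policy for this purpose.
    requested = set(fields) & SENSITIVE_FIELDS
    violations = requested - POLICY.get(purpose, set())
    if violations:
        # Flag the flow for review by company officials.
        audit_log.append((purpose, sorted(violations)))
    return not violations

log = []
assert check_access(["name", "home_address"], "billing", log)   # permitted
assert not check_access(["ssn", "name"], "analytics", log)      # flagged
assert log == [("analytics", ["ssn"])]
```

The value of such a check lies less in blocking any single access than in the audit log it produces, which is what lets officials compare practice against policy after the fact.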
3.8.2.6 Privacy-Sensitive System Design

Perhaps the best approach to protecting privacy is to design systems that do not require the collection or retention of personal information in the first place.32 For example, systems designed to detect weapons hidden underneath clothing have been challenged on privacy grounds because they display the image recorded by the relevant sensors: what appears on the operator's screen is an image of an unclothed body. The system can instead be designed to display an indicator signaling the possible presence of a weapon and its approximate location on the body. This approach protects the privacy of the subject to a much greater degree than displaying an image, although it requires a much more technically sophisticated implementation, since the detected image must be analyzed to determine exactly what it indicates.

32 From the standpoint of privacy advocacy, it is difficult to verify the non-retention of data, since doing so would entail a full audit of a system as implemented. Data, once collected, often persist by default, and this may be an important reason that a privacy advocate might oppose even a system allegedly designed to discard data.
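The design principle, analyze the raw data internally and release only a derived indicator, can be sketched as follows. The per-region intensity list and the 0.8 threshold are invented stand-ins for the actual sensor processing.

```python
def screen_for_weapons(raw_scan: list) -> dict:
    # raw_scan: per-region sensor intensities (a stand-in for the body image).
    THRESHOLD = 0.8  # hypothetical detection threshold
    hits = [i for i, intensity in enumerate(raw_scan) if intensity > THRESHOLD]
    # Only the indicator and approximate location leave this function;
    # the raw scan is discarded rather than displayed or retained.
    return {"weapon_suspected": bool(hits), "regions": hits}

assert screen_for_weapons([0.1, 0.95, 0.2]) == {"weapon_suspected": True,
                                                "regions": [1]}
assert screen_for_weapons([0.1, 0.2, 0.3])["weapon_suspected"] is False
```

The privacy benefit comes from the interface, not the algorithm: nothing downstream of the function can misuse an image it never receives.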
When a Web site operator needs to know only whether a visitor's age is above a certain threshold (e.g., 13), rather than the visitor's exact age, collecting only an indicator of that threshold protects the visitor's privacy. More generally, systems can be designed to enable individuals to prove that they possess certain attributes (e.g., being authorized to enter a building, holding a diploma, being old enough to gamble or drink) without revealing anything more about themselves. Even online purchases could, in principle, be made anonymously using electronic cash. However, the primary impediments to the adoption of such measures appear to be economic and political rather than technological. That is, even though measures such as those described above appear to be technically feasible, they are not in widespread use. The reason seems to be that most businesses benefit from the collection of detailed personal information about their customers and thus have little motivation to deploy privacy-protecting systems. Law enforcement agencies also have concerns about electronic cash systems that might facilitate anonymous money laundering.

3.8.2.7 Information Security Tools

Finally, the various tools supporting information security, such as encryption and access controls, have important privacy-protecting functions. Organizations charged with protecting sensitive personal information (e.g., individual medical or financial records) can use encryption and access controls to reduce the likelihood that such information will be compromised by third parties. A CD-ROM with personal information that is lost in transit is a potential treasure trove for identity thieves, but if the information on the CD is encrypted, the CD is useless to anyone without the decryption key.
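The key property, that ciphertext is useless without the key, can be demonstrated with a deliberately simplified cipher. The repeating-key XOR below is a toy for illustration only and offers no real security; production systems use vetted ciphers such as AES.

```python
from itertools import cycle

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # XOR each byte with a repeating key; applying the same key
    # a second time restores the original data.
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

record = b"name=J. Doe; account=12345"  # illustrative personal record
key = b"illustrative-key"
ciphertext = xor_cipher(record, key)

assert ciphertext != record                              # unreadable at rest
assert xor_cipher(ciphertext, key) == record             # right key recovers it
assert xor_cipher(ciphertext, b"wrong-key") != record    # wrong key yields noise
```

The lost-CD scenario in the text is exactly this situation: whoever finds the disc holds only the ciphertext, and without the key the data are noise.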
Medical records stored electronically and protected with good access controls that admit only authorized parties are arguably more private than paper records to which anyone has access. Electronic medical records might also be protected by audit trails that record all accesses and prevent the records from being forwarded to unauthorized parties or even printed in hard copy. With appropriate authentication technologies deployed, records of queries made by specific individuals can also be kept for future analysis.33 Retention of such records can deter individuals from making privacy-invasive queries in the course of their work: in the event that personal information is compromised, a record might exist of queries that might

33 The committee is not insensitive to the irony that keeping query logs is arguably privacy-invasive with respect to the individual making the queries.
have produced that personal information and of the parties that may have made those queries.

3.9 UNSOLVED PROBLEMS AS PRIVACY ENHANCERS

Although much of the discussion above involves trends in technology that can lead to privacy concerns, many technical challenges must still be overcome before truly ubiquitous surveillance is possible. To the extent that these challenges remain unsolved, one can argue that many worries about technology and privacy are misplaced. For example, the problem of data aggregation is far more than simply the problem of finding the data to be combined and using the network to bring those data to a shared location. One fundamental issue is that of interpreting data collected by different means so that their meaning is consistent. Digital data, by definition, consist of bits that are either on (representing 1) or off (representing 0). But how these 1s and 0s are grouped and interpreted to represent more complex forms of data (such as images, transaction records, sound, or temperature readings) varies from computer to computer and from program to program. Even so simple a convention as the number of bits used to represent an alphanumeric character, an integer, or a floating-point number varies from program to program, and the order in which the bits are to be interpreted can vary from machine to machine. The fact that data are stored on two machines that can talk to each other over the network does not mean that there is a program that can understand the data stored on both, because the interpretation of the data is generally not stored with the data itself.

This problem is compounded when an attempt is made to combine the contents of different databases. A database is organized around groupings of information into records and indexes of those records. These groupings and indexes, known as a schema, define the information in the database.
Different databases with different schema definitions cannot be combined in a straightforward way; the queries issued to one of those databases might not be understood in the other (or, worse still, might be understood in a different way). Because the schema defines, in an important way, the meaning of the information stored in the database, two databases with different schemas store information that is difficult to combine in any meaningful way. Note that this issue is not resolved simply by searching multiple databases of similar formats. For example, although search engines facilitate the searching of large volumes of text spread among multiple databases, this does not mean that those data can be treated as belonging to a single database, for if that were the case, both the format and the
semantics of the words would be identical. The Semantic Web and similar research efforts seek to reduce the magnitude of the semantic problem by disambiguating syntactically identical words. But these efforts have little to do with aggregations of data in dissimilar formats, such as video clips and text, or information in financial and medical databases. This problem of interpretation is not new; it has plagued businesses trying to integrate their own data for nearly as long as there have been computers. Huge amounts of money are spent each year on attempts to merge separate databases within the same corporation, or on attempts by one company to integrate the information used by another company that it has acquired. Even when the data formats are known to the programmers attempting the integration, these problems are somewhere between difficult and impossible. The notion that data gathered about an individual by different sensors and sources can easily be aggregated by networked computers presupposes, contrary to fact, that this problem of data integration and interpretation has been solved.

Similarly, the claim that increases in the capacity of storage devices will allow data to be stored forever and used to violate individual privacy ignores another trend in computing: the formats used to interpret the raw data on storage devices are program specific and tend to change rapidly. Data are now commonly lost not because they have been removed from some storage device, but because no program remains that understands the format of the data, or no hardware remains that can even read the data.34 In principle, maintaining documentation adequate to allow later interpretation of data stored in old formats is a straightforward task; in practice, this rarely happens, and so data are often lost in this manner.
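The dependence of meaning on format conventions is easy to demonstrate with Python's standard struct module: the same four bytes yield entirely different values depending on the byte order and field widths assumed by the reader.

```python
import struct

# The integer 1025 encoded as 4 bytes, little-endian.
raw = struct.pack("<i", 1025)

# The bytes themselves carry no interpretation; the reader's
# assumptions determine what value they represent.
assert struct.unpack("<i", raw)[0] == 1025        # little-endian 32-bit int
assert struct.unpack(">i", raw)[0] == 17039360    # same bytes, big-endian
assert struct.unpack("<HH", raw) == (1025, 0)     # same bytes as two 16-bit ints
```

A program that loses track of which convention a file used does not merely read the data slowly or partially; it reads different data.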
As new media standards emerge, it also becomes more difficult to find or purchase systems that can read the media on which old data are recorded. A related degradation issue concerns the hardware itself. Many popular and readily available storage devices (CDs, DVDs, tapes, hard drives) have limited dependable lifetimes. The standards to which these devices were originally built also evolve to pack yet more data onto them, and so within several generations any given storage device may well be an orphan, with spare parts and repair expertise difficult to find. Data can thus be lost if, even though they have not been destroyed, they become unreadable and thus unusable.

Finally, even with the advances in the computational power available

34 National Research Council, Building an Electronic Records Archive at the National Archives and Records Administration: Recommendations for a Long-Term Strategy, Robert Sproull and Jon Eisenberg, eds., The National Academies Press, Washington, D.C., 2005.
on networks of modern computers, some tasks will remain computationally infeasible without far greater breakthroughs in computing hardware than have been seen even in the past 10 years. Some tasks, such as those that require the comparison of all possible combinations of sets of events, have a computational cost that rises combinatorially (i.e., faster than exponentially) with the number of entities being compared. Such computations attempted over large numbers of people are far too expensive to be carried out by any current or anticipated computing technology, and so such tasks will remain infeasible not just now but for a long time to come.35

Similar arguments apply to certain sensing technologies. For example, privacy advocates worry about the wide deployment of facial recognition technology. Today, this technology is reasonably accurate under controlled conditions in which the subject is isolated and the face is exposed in a known position, with no other faces being scanned. Attempts to apply the technology "in the wild," however, have largely failed. Recognizing an individual from a video scan in uncontrolled lighting, where the face is turned or tilted, where the face is part of a crowd, or where the subject is using countermeasures to defeat the technology, is far beyond current capabilities. Facial recognition research is quite active today, but it remains an open question how far and how fast the field will be able to progress.

3.10 OBSERVATIONS

Current trends in information technology have greatly expanded the ability of its users to gather, store, share, and analyze data. Indeed, metrics for the increasing capabilities provided by information technology hardware (storage, bandwidth, and processing speed, among others) could be regarded as surrogates for the impact of technological change on privacy.
The same is true, though in a less quantitative sense, for software: better algorithms, better database management systems,

35 To deal with such problems, statisticians and computer scientists have developed pruning methods that systematically exclude large parts of the problem space that must be examined. Some methods are heuristic, based on domain-specific knowledge and characteristics of the data, such as knowing that men do not get cervical cancer or become pregnant. Others are built on theory and notions of model simplification. Still others are based on sampling approaches that are feasible when the subjects of interest are in some sense average rather than extreme. If a problem requires identifying with high probability only some subjects, rather than an exhaustive search that identifies all subjects with certainty, these methods have considerable utility. But some problems, in particular searches for terrorists who are seeking to conceal their profiles within a given population, are less amenable to such treatment.
and more powerful query languages, and so on. These data can be gathered both from those who use the technology itself and from the physical world. Given these trends, there are many ways in which the privacy of the individual could be compromised, by governments, private corporations, and individual users of the technology alike. Many of these concerns echo those that arose in the 1970s, when the first databases began to be widely used. At that time, concerns over the misuse of the information stored in those databases and over the accuracy of the information itself led to the creation of the Fair Information Practice guidelines in 1973 (Section 1.5.4 and Box 1.3).

Current privacy worries are not as well defined as those that originally led to the Fair Information Practice guidelines. Whereas those guidelines were a reaction to fears that the contents of databases might be inaccurate, current worries concern the misuse of data gathered for otherwise valid reasons, or the ability to extract additional information from the aggregation of databases using the power of networked computation. Furthermore, in some instances, technologies developed without any conscious intent to affect privacy may, on closer examination, have deep ramifications for privacy. As one example, digital rights management technologies have the potential to collect highly detailed information on user behavior regarding the texts people read and the music they listen to. In some instances, they have the further potential to create security vulnerabilities in the systems on which they run, the exploitation of which might lead to security breaches and the consequent compromise of personal information stored on those systems. The information-collection aspect of digital rights management technologies is discussed further in Section 6.7.
At the same time, some technologies can promote and defend privacy. Cryptographic mechanisms that ensure the confidentiality of protected data, anonymization techniques that allow interactions to take place without the participants revealing their identities, and database techniques that allow some information to be extracted without revealing so much that the privacy of the data subjects is compromised are all active areas of research and development. However, each of these technologies imposes costs, both social and economic, on those who use them, a fact that tends to inhibit their use. A technology that has no purpose other than to protect privacy is likely to be deployed only when there is pressure to protect privacy, unlike privacy-invasive technologies, which generally invade privacy as a side effect of some other business or operational purpose.

An important issue is the impact of data quality on any system that involves surveillance and matching. As noted in Chapter 1, data quality
has a significant impact on the occurrence of false positives and false negatives. By definition, false positives subject individuals to scrutiny that is inappropriate and unnecessary given their particular circumstances, so data quality problems that increase the number of false positives lead to greater invasions of privacy. By definition, false negatives fail to identify individuals who should receive further scrutiny, so data quality problems that increase the number of false negatives compromise mission accomplishment.

Technology also raises interesting philosophical questions regarding privacy. For example, Chapter 2 raised the distinction between the acquisition of personal information and the use of that information. The distinction is important because privacy is contextually defined: use X of certain personal information might be regarded as benign, while use Y of that same information might be regarded as a violation of privacy. But even if one assumes that privacy violations might occur at the moment of acquisition, technology changes the meaning of "the moment." Is "the moment" the point at which the sensors register the raw information? The point after which computers have processed the bit streams from the sensors into a meaningful image or pattern? The point at which the computer identifies an image or pattern as worthy of further human attention? The point at which an actual human being sees the image or pattern? The point at which the human being indicates that some further action must be taken? There are no universal answers to such questions; contextual factors and value judgments shape the answers.

A real danger is that fears about what technology might be able to do, either currently or in the near future, will spur policy decisions that limit the technology in artificial ways.
Decisions made by those who do not understand the limitations of current technology may not only prevent the advancement of the technology in the direction feared but also limit uses of the technology that would be desirable and that do not, in fact, create a problem for those who treasure personal privacy. Consider, for example, that data-mining technologies are seen by many as tools of those who would invade the privacy of ordinary citizens.36 Poorly formulated limitations on the use of data mining may reduce its impact on privacy, but they may also inadvertently limit its use in other applications that pose no privacy issue whatever.

Finally, it is worth noting the normative question of whether technology or policy ought to have priority as a foundation for protecting privacy. One perspective holds that policy should come first (policy, and the associated law and regulation, set the performance requirements of the technology) and that technology should be developed and deployed to conform to the demands of policy. On the other hand, policy that is highly protective of privacy one day can be changed to policy that is less protective the next. A second view thus argues that technology should constitute the basis for privacy protection, because a technological foundation is harder to change or circumvent than a procedural one.37 Further, violations of technologically enforced privacy protections are generally much more difficult to accomplish than violations of policy-enforced protections. Whether such difficulties are seen as desirable stability (i.e., an advantage) or unnecessary rigidity (i.e., a disadvantage) depends on one's position and perspective. In practice, of course, privacy protections are founded on a mix of technology and policy, as well as on self-protective actions and cultural factors such as ethics, manners, professional codes, and a sense of propriety. In large bureaucracies, significant policy changes cannot be implemented rapidly and successfully, even putting aside questions related to the technological infrastructure. Indeed, many have observed that implementing appropriate human and organizational procedures aligned with high-level policy goals is often harder than implementing and deploying technology.

36 A forthcoming study by the National Research Council will address this point in more detail.

37 Lessig argues this point in Code, though his argument is much broader than one relating simply to privacy. See Lawrence Lessig, Code and Other Laws of Cyberspace, Basic Books, New York, 2000.