Appendix L

The Science and Technology of Privacy Protection

To the extent that there is a tension between counterterrorism efforts and protection of citizens’ privacy, it is useful to understand how it may be possible to design counterterrorism information systems to minimize their impact on privacy. This appendix considers privacy protection from two complementary perspectives—privacy protection that is built into the analytical techniques themselves and privacy protection that can be engineered into an operational system. The appendix concludes with a brief illustration of how government statistical agencies have approached confidential data collection and analysis over the years. A number of techniques described here have been proposed for use in protecting privacy; none would be a panacea, and several have important weaknesses that are not well understood and that are discussed and illustrated.

L.1 THE CYBERSECURITY DIMENSION OF PRIVACY

Respecting privacy interests necessarily means that parties that should not have access to personal information do not have such access. Security breaches are incompatible with protecting the privacy of personal information, and good cybersecurity for electronically stored personal information is a necessary (but not sufficient) condition for protecting privacy.

From a privacy standpoint, the most relevant cybersecurity technologies are encryption and access controls. Encryption obscures digitally stored information so that it cannot be read without having the key necessary to decrypt it.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.





Access controls provide privileges of different sorts to specified users (for example, the system may grant John Doe the right to know that a file exists but not the right to view its contents, and it may give Jane Doe both rights). Access controls may also be associated with audit logs that record which files were accessed by a given user.

Because of the convergence of and similarities between communication and information technologies, the technologies face increasingly similar threats and vulnerabilities. Furthermore, addressing these threats and vulnerabilities entails similar countermeasures or protection solutions. A fundamental principle of security is that no digital resource that is in use can be absolutely secure; as long as information is accessible, it is vulnerable. Security can be increased, but the value of increased security must be weighed against the increase in cost and the decrease in accessibility.

Human error, accident, and acts of God are the dominant sources of loss and damage in information and communication systems, but the actions of hackers and criminals are also of substantial concern. Terrorists account for a small percentage of losses, financial and otherwise, but could easily exploit vulnerabilities in government and business to cause much more serious damage to the nation. Security analysts and specialists report a large growth in the number and diversity of cyberthreats1 and vulnerabilities.2 Despite a concurrent growth in countermeasures (that is, security technologies3), penetrations and losses are increasing.
A data-breach chronology reports losses of 104 million records (for example, in lost laptop computers) containing personally identifiable information from January 2005 to February 2007.4 The Department of Homeland Security National Cyber Security Division reports that over 25 new vulnerabilities were discovered each day in 2006.5 The state of government information security is unnecessarily weak.

1. A.T. Williams, A. Hallawell, R. Mogull, J. Pescatore, N. MacDonald, J. Girard, A. Litan, L. Orans, V. Wheatman, A. Allan, P. Firstbrook, G. Young, J. Heiser, and J. Feiman, Hype Cycle for Cyberthreats, Gartner, Inc., Stamford, Conn., September 13, 2006.
2. National Vulnerability Database, National Institute of Standards and Technology Computer Security Division, sponsored by the U.S. Department of Homeland Security National Cyber Security Division/U.S. Computer Emergency Readiness Team (US-CERT), available at http://nvd.nist.gov/.
3. See note 1.
4. A Chronology of Data Breaches, Privacy Rights Clearinghouse.
5. See note 2.

For example, the U.S. Government Accountability Office (GAO) noted in March 2008 that

[m]ajor federal agencies continue to experience significant information security control deficiencies that limit the effectiveness of their efforts to protect the confidentiality, integrity, and availability of their information and information systems. Most agencies did not implement controls to sufficiently prevent, limit, or detect access to computer networks, systems, or information. In addition, agencies did not always effectively manage the configuration of network devices to prevent unauthorized access and ensure system integrity, patch key servers and workstations in a timely manner, assign duties to different individuals or groups so that one individual did not control all aspects of a process or transaction, and maintain complete continuity of operations plans for key information systems. An underlying cause for these weaknesses is that agencies have not fully or effectively implemented agencywide information security programs. As a result, federal systems and information are at increased risk of unauthorized access to and disclosure, modification, or destruction of sensitive information, as well as inadvertent or deliberate disruption of system operations and services. Such risks are illustrated, in part, by an increasing number of security incidents experienced by federal agencies.6

Such performance is reflected in the public’s lack of trust in government agencies’ ability to protect personal information.7 Security of government information systems is poor despite many relevant regulations and guidelines.8 Most communication and information systems are unnecessarily vulnerable to attack because of poor security practices, and the framework outlined in Chapter 2 identifies data stewardship as a critical evaluation criterion.9

6. Statement of Gregory C. Wilshusen, GAO Director for Information Security Issues, “Information Security: Progress Reported, but Weaknesses at Federal Agencies Persist,” Testimony Before the Subcommittee on Federal Financial Management, Government Information, Federal Services, and International Security, Committee on Homeland Security and Governmental Affairs, U.S. Senate, GAO-08-571T, March 12, 2008. Available at http://www.gao.gov/new.items/d08571t.pdf.
7. L. Ponemon, Privacy Trust Study of United States Government, The Ponemon Institute, Traverse City, Mich., February 15, 2007.
8. Appendix III, OMB Circular A-130, “Security of Federal Automated Information Resources” (Office of Management and Budget, Washington, D.C.), revises procedures formerly contained in Appendix III, OMB Circular No. A-130 (50 FR 52730; December 24, 1985), and incorporates requirements of the Computer Security Act of 1987 (P.L. 100-235) and responsibilities assigned in applicable national security directives. See also the Federal Information Security Management Act of 2002 (FISMA), 44 U.S.C. § 3541, et seq., Title III of the E-Government Act of 2002, Public Law 107-347, 116 Stat. 2899, available at http://csrc.nist.gov/drivers/documents/FISMA-final.pdf.

Although cybersecurity and privacy are conceptually different, they are often conflated—with good reason—in the public’s mind. Cybersecurity breaches—which occur, for example, when a hacker breaks into a government information system that contains personally identifiable information (addresses, Social Security numbers, and so on)—are naturally worrisome to the citizens who may be affected. They do not particularly care about the subtle differences between a cybersecurity breach and a loss of privacy through other means; they know only that their privacy has been (potentially) invaded and that their loss of privacy may have deleterious consequences for them. That reaction has policy significance: the government agency responsible (perhaps even the entire government) is viewed as being incapable of protecting privacy, and public confidence is undermined when it asserts that it will be a responsible steward of the personal information it collects in its counterterrorism mission.

L.2 PRIVACY-PRESERVING DATA ANALYSIS

L.2.1 Basic Concepts

It is intuitive that the goal of privacy-preserving data analysis is to allow the learning of particular facts or kinds of facts about individuals (units) in a data set while keeping other facts secret. The term data set is used loosely; it may refer to a single database or to a collection of data sources. Under various names, privacy-preserving data analysis has been addressed in various disciplines.

A statistic is a quantity computed on the basis of a sample. A major goal of official statistics is to learn broad trends about a population by studying a relatively small sample of members of the population. In many cases, such as in the case of U.S. census data and data collected by the Internal Revenue Service (IRS), privacy is legally mandated. Thus, the goal is to identify and report trends while protecting the privacy of individuals. That sort of challenge is central to medical studies: the analyst wishes to learn and report facts of life, such as “smoking causes cancer,” while preserving the privacy of individual cancer patients. The analyst must be certain that the privacy of individuals is not even inadvertently compromised.

9. Data stewardship is accountability for program resources being used and protected appropriately according to the defined and authorized purpose.

Providing such protection is a difficult task, and a number of seemingly obvious approaches do not work even in the best of circumstances, for example, when a trusted party holds all the confidential data in one place and can prepare a “sanitized” version of the data for release to the analyst or can monitor questions and refuse to answer when privacy might be at risk. (This point is discussed further in Section L.2.2 below.)

In the context of counterterrorism, privacy-preserving data analysis is excellent for teaching the data analyst about “normal” behavior while preserving the privacy of individuals. The task of the counterterrorism analyst is to identify “atypical” behavior, which can be defined only in contrast with what is typical. It is immediately obvious that the data on any single specific individual should have little effect on the determination of what is normal, and in fact this point precisely captures the source of the intuition that broad statistical trends do not violate individual privacy. Assuming a good knowledge of what is “normal,” counterterrorism requires technology that will scrutinize data in an automated or semiautomated fashion and flag any person whose data are abnormal, i.e., that satisfy a putatively “problematic” profile. In other words, the outcome of data analysis in this context must necessarily vary widely (“yes, it satisfies the profile” or “no, it does not satisfy the profile”), depending on the specific person whose data are being scrutinized. Whether the profile is genuinely “problematic” is a separate matter.

In summary, privacy-preserving data analysis may permit the analyst to learn the definition of normal in a privacy-preserving way, but it does not directly address the counterterrorism goal: privacy-preserving data analysis “masks” all individuals, whereas counterterrorism requires the exposure of selected individuals.
There is no such thing as privacy-preserving examination of an individual’s records or privacy-preserving examination of a database to pinpoint problematic individuals. The question, therefore, is whether the counterterrorism goal can be satisfied while protecting the privacy of “typical” people. More precisely, suppose the existence of a perfect profile of a terrorist: the false-positive and false-negative rates are very low. (The existence of such a perfect profile is magical thinking and contrary to fact, but suppose it anyway.) Would it be possible to analyze data, probably from diverse sources and in diverse formats, in such a way that the analyst learns only information about people who satisfy the profile? As far as we know, the answer to that question is no. However, it might be possible to limit the amount of information revealed about those who do not satisfy the profile, perhaps by controlling the information and sources used or by editing them after they are acquired. That would require major efforts and attention to the quality and utility of information in integrated databases.

L.2.2 Some Simple Ideas That Do Not Work in Practice

There are many ideas for protecting privacy, and what may seem like sensible ideas often fail. Understanding how to approach privacy protection requires rigor in two senses: spelling out what “privacy protection” means and explaining the extent to which a particular technique succeeds in providing protection.

For example, assume that all the data are held by a trustworthy curator, who answers queries about them while attempting to ensure privacy. Clearly, queries about the data on any specific person cannot be answered, for example, What is the sickle-cell status of Averill Harriman? It is therefore instructive to consider the common suggestion of insisting that queries be made only on large subsets of the complete database. A well-known differencing argument (the “set differencing” attack) demonstrates the inadequacy of the suggestion: if the database permits the user to learn exact answers to the two questions, How many people in the database have the sickle-cell trait? and, How many people—not named X—in the database have the sickle-cell trait? then the user learns X’s sickle-cell status. The example also shows that encrypting the data (another frequent suggestion) would be of no help. Encryption protects against an intruder, but in this instance the privacy compromise emerges even when the database is operated correctly, that is, in conformance with all stated security policies. Another suggestion is to monitor query sequences to rule out attacks of the nature just described.
Such a suggestion is problematic for two reasons: it may be computationally infeasible to determine whether a query sequence compromises privacy,10 and, more surprising, the refusal to answer a query may itself reveal information.11

A different approach to preventing the set differencing attack is to add random noise to the true answer to a query; for example, the response to a query about the average income of a set of individuals is the sum of the true answer and some random noise. That approach has merit, but it must be used with care. Otherwise, the same query may be issued over and over, each time producing a different perturbation of the truth. With enough queries, the noise may cancel out and reveal the true answer. Insisting that a given query always return the same answer is problematic in that it may be impossible to decide whether two syntactically different queries are semantically equivalent.

10. J. Kleinberg, C. Papadimitriou, and P. Raghavan, “Auditing Boolean attributes,” pp. 86-91 in Proceedings of the ACM Symposium on Principles of Database Systems, Association for Computing Machinery, New York, N.Y., 2000.
11. K. Kenthapadi, N. Mishra, and K. Nissim, “Simulatable auditing,” pp. 118-127 in Proceedings of the ACM Symposium on Principles of Database Systems, Association for Computing Machinery, New York, N.Y., 2005.
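Both failure modes can be demonstrated in a few lines. The sketch below is illustrative only: the toy database, the names, and the uniform-noise scale are assumptions, not anything from the source. It shows the set differencing attack on exact counts, and then the averaging attack on naively perturbed counts.

```python
import random

# Toy database: name -> has the sickle-cell trait (hypothetical values).
db = {
    "Alice": True, "Bob": False, "Carol": True,
    "Dave": False, "Eve": False, "X": True,
}

def count_trait(exclude=None):
    """Exact aggregate query over a 'large subset' of the database."""
    return sum(v for name, v in db.items() if name != exclude)

# Set differencing attack: two large-subset queries reveal X's status.
x_status = count_trait() - count_trait(exclude="X")  # 1 iff X has the trait
print("X has trait:", bool(x_status))

# Naive noise addition: fresh noise on each query can be averaged away.
def noisy_count(rng, scale=10.0):
    return count_trait() + rng.uniform(-scale, scale)

rng = random.Random(0)
estimate = sum(noisy_count(rng) for _ in range(10_000)) / 10_000
print("true count:", count_trait(), "recovered estimate:", round(estimate, 2))
```

Even with noise ten times larger than a single count's contribution, ten thousand repetitions drive the estimate back to the true answer, which is exactly the cancellation problem described above.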

Related lower bounds on noise (the degree of distortion) can be given as a function of the number of queries.12

L.2.3 Private Computation

The cryptographic literature on private computation addresses a distinctly different goal known as secure function evaluation.13 In this work, the term private has a specific technical meaning that is not intuitive and is described below. To motivate the description, recall the original description of privacy-preserving data analysis as permitting the learning of some facts in a data set while keeping other facts secret. If privacy is to be completely protected, some things simply cannot be learned. For example, suppose that the database has scholastic records of students in Middletown High School and that the Middletown school district releases the fact that no student at the school has a perfect 5.0 average. That statement compromises the privacy of every student known to be enrolled at the school—it is now known, for example, that neither Sergey nor Laticia has a 5.0 average. Arguably, that is no one else’s business. (Some might try to argue that no harm comes from the release of such information, but this is defeating the example without refuting the principle that it illustrates.) Similarly, publishing the average net worth of a small set of people may reveal that at least one person has a very high net worth; a little extra information may allow that person’s identity to be disclosed despite her modest lifestyle.

Private computation does not address those difficulties, and the question of which information is safe to release is not the subject of study at all.14 Rather, it is assumed that some facts are, by fiat, going to be released, for example, a histogram of students’ grade point averages or average income by block. The “privacy” requirement is that no information that cannot be inferred from those quantities will be leaked.
The typical setting is that each person (say, each student in Middletown High School) participates in a cryptographic protocol whose goal is the cooperative computing of the quantity of interest (the histogram of grade point averages) and that the cryptographic protocol will not cause any information to be leaked that a student cannot infer from the histogram and his or her own data (that is, from the grade point histogram and his or her own grade point average).

12. I. Dinur and K. Nissim, “Revealing information while preserving privacy,” pp. 202-210 in Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Association for Computing Machinery, New York, N.Y., 2003; C. Dwork, F. McSherry, and K. Talwar, “The price of privacy and the limits of LP decoding,” pp. 85-94 in Proceedings of the Annual ACM SIGACT Symposium on Theory of Computing, Association for Computing Machinery, New York, N.Y., 2007. See also the related work on compressed sensing cited in the latter.
13. O. Goldreich, S. Micali, and A. Wigderson, “How to solve any protocol problem,” pp. 218-229 in Proceedings of the ACM SIGACT Symposium on Theory of Computing, Association for Computing Machinery, New York, N.Y., 1987.
14. See note 13.
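As an illustration of the flavor of such protocols (a much simpler construction than the general secure-function-evaluation results cited above), the following sketch uses additive secret sharing: each participant splits a private 0/1 input into random shares that individually reveal nothing, yet the shares sum to the agreed-on released statistic. The student names, the modulus, and the "GPA of at least 4.5" question are illustrative assumptions.

```python
import random

P = 2**61 - 1  # a prime modulus; all arithmetic is done mod P

def share(secret, n, rng):
    """Split `secret` into n additive shares that sum to it mod P."""
    parts = [rng.randrange(P) for _ in range(n - 1)]
    parts.append((secret - sum(parts)) % P)
    return parts

rng = random.Random(42)
# Hypothetical private inputs: 1 if the student's GPA is >= 4.5.
inputs = {"Sergey": 0, "Laticia": 1, "Mei": 1, "Omar": 0}

# Each participant splits its input and distributes one share to each peer.
all_shares = {name: share(v, len(inputs), rng) for name, v in inputs.items()}

# Each participant publishes only the sum of the shares it received;
# an individual share is uniformly random and reveals nothing by itself.
column_sums = [sum(s[i] for s in all_shares.values()) % P
               for i in range(len(inputs))]

released = sum(column_sums) % P  # only the agreed-on aggregate is revealed
print("students with GPA >= 4.5:", released)
```

The protocol reveals exactly the count (here, 2). Anything a participant can then deduce, such as "some other student has a high GPA," is inferable from the released count plus that participant's own input, which is precisely the technical guarantee described above.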

L.2.4 The Need for Rigor

Privacy-preservation techniques typically involve altering raw data or the answers to queries. Those general actions are referred to as input perturbation and output perturbation,15 depending on whether the alterations are made before the queries or in response to them.

Various methods are used for input and output perturbation. Some involve redaction of information (for example, removing “real” identifiers, the use of indirect identifiers, selective reporting, or forms of aggregation) or alteration of data elements by adding noise, swapping, recoding (for example, collapsing categories), and data simulation.16

15. A relevant survey article is N. Adam and J. Wortmann, “Security-control methods for statistical databases: A comparative study,” ACM Computing Surveys 21(4):515-556, 1989. Some approaches post-dating the survey are given in L. Sweeney, “Achieving k-anonymity privacy protection using generalization and suppression,” International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10(5):557-570, 2002; A. Evfimievski, J. Gehrke, and R. Srikant, “Limiting privacy breaches in privacy preserving data mining,” pp. 211-222 in Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Association for Computing Machinery, New York, N.Y., 2003; and C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity of functions in private data analysis,” pp. 265-284 in Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, Association for Computing Machinery, New York, N.Y., 2006, and references therein.
16. Many of these methods are described in the following papers: S.E. Fienberg, “Conflicts between the needs for access to statistical information and demands for confidentiality,” Journal of Official Statistics 10(2):115-132, 1994; Federal Committee on Statistical Methodology, Office of Management and Budget (OMB), “Statistical Policy Working Paper 2: Report on Statistical Disclosure and Disclosure-Avoidance Techniques,” OMB, Washington, D.C., 1978, available at http://www.fcsm.gov/working-papers/sw2.html; Federal Committee on Statistical Methodology, OMB, “Statistical Policy Working Paper 22 (Second version, 2005): Report on Statistical Disclosure Limitation Methodology,” originally prepared by the Subcommittee on Disclosure Limitation Methodology, OMB, Washington, D.C., 1994, and revised by the Confidentiality and Data Access Committee, 2005, available at http://www.fcsm.gov/working-papers/spwp22.html. Many of these techniques are characterized as belonging to the family of matrix masking methods in G.T. Duncan and R.W. Pearson, “Enhancing access to microdata while protecting confidentiality: Prospects for the future (with discussion),” Statistical Science 6:219-239, 1991. The use of these techniques in a public-policy context is set by the following publications: National Research Council (NRC), Private Lies and Public Policies: Confidentiality and Accessibility of Government Statistics, G.T. Duncan, T.B. Jabine, and V.A. de Wolf, eds., National Academy Press, Washington, D.C., 1993; NRC, Expanding Access to Research Data: Reconciling Risks and Opportunities, The National Academies Press, Washington, D.C., 2005.
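A minimal sketch of two of these input-perturbation operations, recoding (collapsing categories) and swapping, applied to hypothetical microdata; the fields, values, and seed below are illustrative assumptions, not a prescribed disclosure-limitation procedure.

```python
import random

# Hypothetical microdata rows: (ZIP code, age, income).
rows = [
    ("20001", 34, 62_000),
    ("20001", 35, 250_000),
    ("20002", 34, 48_000),
    ("20003", 61, 91_000),
]

def recode(zip_code, age):
    """Collapse categories: keep a 3-digit ZIP prefix and a 10-year age band."""
    return zip_code[:3] + "**", (age // 10) * 10

def swap_incomes(data, rng):
    """Swap the income field among randomly paired records."""
    incomes = [r[2] for r in data]
    rng.shuffle(incomes)
    return [(z, a, inc) for (z, a, _), inc in zip(data, incomes)]

rng = random.Random(7)
released = [recode(z, a) + (inc,) for z, a, inc in swap_incomes(rows, rng)]
for rec in released:
    print(rec)
```

Swapping preserves the univariate income distribution (useful for aggregate analysis) while breaking the link between income and the quasi-identifying fields; recoding trades geographic and age detail for larger anonymity sets.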

But no matter what the technique or approach, there are two basic questions: What does it mean to protect the data? How much alteration is required to achieve that goal? The need for a rigorous treatment of both questions cannot be overstated, inasmuch as “partially protecting privacy” is an oxymoron. An extremely important and often overlooked factor in ensuring privacy is the need to protect against the availability of arbitrary context information, including other databases, books, newspapers, blogs, and so on.

Consider the anonymization of a social-network graph. In a social network, nodes correspond to people or other social entities, such as organizations or Web sites, and edges correspond to social links between them, such as e-mail contact or instant messaging. In an effort to preserve privacy, the practice of anonymization replaces names with meaningless unique identifiers. The motivation is roughly as follows: the social network labeled with actual names is sensitive and cannot be released, but there may be considerable value in enabling the study of its structure. Anonymization is intended to preserve the pure unannotated structure of the graph while suppressing the information about precisely who has contact with whom. The difficulty is that anonymous social-network data almost never exist in the absence of outside context, and an adversary can potentially combine this knowledge with the observed structure to begin compromising privacy, deanonymizing nodes and even learning the edge relations between explicitly named (deanonymized) individuals in the system.17

A more traditional example of the difficulties posed by context begins with the publication of redacted confidential data. The Census Bureau receives confidential information from enterprises as part of the economic census and publishes a redacted version in which identifying information on companies is suppressed.
At the same time, a company may release information in its annual reports about the number of shares held by particular holders of very large numbers of shares. Although the redaction may be privacy-protective, by using very simple linkage tools on the redacted data and the public information, an adversary will be able to add back some of the identifying tags to the redacted confidential data. Roughly speaking, those tools allow the merging of data sets that contain, for example, different types of information about the same set of entities.

17. L. Backstrom, C. Dwork, and J. Kleinberg, “Wherefore art thou R3579X? Anonymized social networks, hidden patterns, and structural steganography,” pp. 181-190 in Proceedings of the International Conference on World Wide Web, 2007, available at http://www2007.org/proceedings.html.
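The linkage attack just described can be reproduced with a simple merge on shared fields. In this sketch, the company names, industry codes, and revenue figures are entirely hypothetical; the point is only that a "redacted" table joins cleanly to public information on a handful of shared variables.

```python
# Hypothetical redacted confidential data: names suppressed, but
# industry code, size class, and region retained.
confidential = [
    {"industry": "3711", "size": "large", "region": "MI", "secret_revenue": 181},
    {"industry": "2086", "size": "large", "region": "GA", "secret_revenue": 35},
]

# Hypothetical public information deduced from annual reports.
public = [
    {"name": "MotorCo",  "industry": "3711", "size": "large", "region": "MI"},
    {"name": "SodaCorp", "industry": "2086", "size": "large", "region": "GA"},
]

KEY = ("industry", "size", "region")

def linkage(redacted, known):
    """Re-attach names to redacted records whose key combination is unique."""
    index = {tuple(rec[k] for k in KEY): rec["name"] for rec in known}
    return [{**rec, "name": index.get(tuple(rec[k] for k in KEY))}
            for rec in redacted]

for rec in linkage(confidential, public):
    print(rec["name"], "-> confidential revenue", rec["secret_revenue"])
```

No name ever appears in the redacted file, yet each confidential record is re-identified because the key combination is unique in both tables.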

The key point is that entities need not be directly identifiable by name to be identified. Companies can be identified by industrial code, size, region of the country, and so on. Any public company can be identified by using a small number of such variables, which may well be deduced from the company’s public information and thus provide a means of matching against the confidential data.

Similarly, individuals need not be identified only by their names, addresses, or Social Security numbers. The linkage software may use any collection of data fields, or variables, to determine that records in two distinct data sets correspond to the same person. And if the “privacy-protected” or deidentified records include values for additional variables that are not yet public, simple record-linkage tools might let an intruder identify a person (that is, match files) with high probability and thus leak this additional information in the deidentified files. For example, an adversary may use publicly available data, including newspaper accounts from New Orleans on the effects of Hurricane Katrina and who was rescued in what efforts, to identify people with unusual names in a confidential epidemiologic data set on rare genetic diseases gathered by the Centers for Disease Control and Prevention and thus learn all the medical and genetic information about the individuals that redaction was supposed to protect.

For a final, small-scale example, consider records of hospital emergency-room admissions, which contain such fields as name, year of birth, ZIP code, ethnicity, and medical complaint. The combinations of fields are known to identify many people uniquely. Such a collection of attributes is called a quasi-identifier.
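A toy illustration of quasi-identifier uniqueness, of the coarsening used in k-anonymization, and of the residual "neighbor" attack discussed in the surrounding text; the records, ZIP codes, and complaints are hypothetical.

```python
from collections import Counter

# Hypothetical emergency-room records: (year of birth, ZIP, complaint).
records = [
    (1971, "20001", "chest pain"),
    (1971, "20002", "fracture"),
    (1983, "20001", "overdose"),
    (1983, "20002", "chest pain"),
]

def quasi_id(rec):
    return rec[0], rec[1]

# Every (birth year, ZIP) pair is unique: each record is re-identifiable.
print(max(Counter(map(quasi_id, records)).values()))  # -> 1

def coarsen(rec):
    """Generalize: decade of birth, 4-digit ZIP prefix."""
    year, zip_code, complaint = rec
    return (year // 10) * 10, zip_code[:4] + "*", complaint

released = [coarsen(r) for r in records]
counts = Counter((y, z) for y, z, _ in released)
k = min(counts.values())
print("k =", k)  # every quasi-identifier value now covers at least k records

# Residual leak: a neighbor who knows a 1970s-born person from ZIP 2000x
# was admitted still narrows the complaint down to a small set.
suspects = {c for y, z, c in released if (y, z) == (1970, "2000*")}
print(sorted(suspects))  # -> ['chest pain', 'fracture']
```

Coarsening achieves k = 2 here, but, as the text explains, outside knowledge (the ambulance sighting) still reduces the neighbor's complaint to a two-element set.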
In microaggregation, or what is known as k-anonymization, released data are “coarsened”; for example, ZIP codes with the same first four digits are lumped together, so that for every possible value of the quasi-identifier, the data set contains at least k records. However, if someone sees an ambulance at his or her neighbor’s house during the night and consults the published hospital emergency-room records the following day, he or she can learn a small set of complaints that contains the medical complaint of the neighbor. Additional information known to that person may allow the neighbor’s precise complaint to be pinpointed.

Context also comes into play in how different privacy-preserving techniques interact when they are applied to different databases. For example, the work of Dwork et al. rigorously controlled the amount of information leaked about a single record.18 If several databases, all containing the same record, use the same technique, and if the analyst has access to all these databases, the cumulative erosion of privacy of the given record may be as great as the sum of the leakages suffered in the separate databases that contain it.

18. C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity of functions in private data analysis,” pp. 265-284 in Proceedings of the Theory of Cryptography Conference, Association for Computing Machinery, New York, N.Y., 2006.

And that is a good case! The many methods in fields spanning computer science, operations research, economics, and statistics deal with data of different types recorded in many forms. For a targeted set of methods and specific kinds of data, although there may be results that can “guarantee” privacy in a released data file or a system responding to a series of queries, many well-known approaches fail to offer such guarantees or even weaker assurances. For example, some literature on data imputation for privacy protection never defines privacy at all;19 thus, it is difficult to assess the extent to which the methods, although heuristically reasonable, actually guarantee privacy.

L.2.5 The Effect of Data Errors on Privacy

In the real world, data records are imperfect. For example,

• Honest people make errors when providing information.
• Clerical errors yield flawed recording of correct data.
• Many data values may be measurements of quantities that regularly fluctuate or that for various other reasons are subject to measurement error.

Because of imperfections in the data, a person may be mischaracterized as problematic. That is, the profile may be perfect, but the system may be operating with bad data. That appears to be an accuracy problem, but for several reasons it also constitutes a privacy problem.

Although we have not discussed a definition of privacy, the recent literature studies the appropriate technical definition at length. The approach favored in the cryptography community, modified for the present context, says that for anyone whose true data do not fit the profile, there is (in a quantifiable sense) almost no difference between the behavior of a system that contains the person’s data and the behavior of a system that does not.

19. D.B. Rubin, “Discussion: Statistical disclosure limitation,” Journal of Official Statistics 9(2):461-468, 1993; T.E. Raghunathan, J.P. Reiter, and D.B. Rubin, “Multiple imputation for statistical disclosure limitation,” Journal of Official Statistics 19:1-19, 2003. However, there is also a substantial literature that does provide an operational assessment of privacy and privacy protection. For example, see G.T. Duncan and D. Lambert, “The risk of disclosure for microdata,” Journal of Business and Economic Statistics 7:207-217, 1989; S.E. Fienberg, U.E. Makov, and A.P. Sanil, “A Bayesian approach to data disclosure: Optimal intruder behavior for continuous data,” Journal of Official Statistics 13:75-89, 1997; and J.P. Reiter, “Estimating risks of identification disclosure for microdata,” Journal of the American Statistical Association 100:1103-1113, 2005.

OCR for page 263
 PROTECTING INDIVIDUAL PRIVACY IN THE STRUGGLE AGAINST TERRORISTS tem that contains the person’s data and the behavior of a system that does not. That is, the behavior of the system in the two cases should be indistinguishable; it follows that the increase in the risk of adverse effects of participating in a data set is small. That approach allows us to avoid subjective decisions about which type of information leakage constitutes a privacy violation. Clearly, indistinguishability can fail to hold in the case of a nonterrorist whose data are incorrectly recorded. The harm to a person of appearing to satisfy the perfect profile may be severe: the person may be denied credit and the freedom to travel, be prevented from being hired for some jobs, or even be prosecuted. Finally, at the very least, such a misidentification will result in further scrutiny and consequent loss of privacy. (See Gavison on protection from being brought to the attention of others.20) The problem of errors is magnified by linkage practices because errors tend to propagate. Consider a database, such as the one assembled by ChoicePoint by linking multiple databases. Consider, say, three separate databases created by organizations A, B, and C. If A and B are extremely scrupulous about preventing data errors but C is not, the integrated data- base will contain inaccuracies. The accuracy of the integrated database is only as good as the accuracy of the worst input database. Furthermore, if each database contains errors, they may well compound to create a far greater percentage of files with errors in the integrated database. Finally, there are the errors of matching themselves, which are inherent in record linkage; if these are as substantial as the literature on record linkage sug- gests,21 the level of error in the merged database is magnified, and this poses greater risks of misidentification. 
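The compounding of errors just described can be made concrete with a small sketch. Under the simplifying assumption (ours, for illustration only) that each source database independently corrupts a given record with some probability, the fraction of merged records touched by at least one error is 1 - (1 - p1)(1 - p2)(1 - p3), which already exceeds the error rate of the worst single source, before any matching errors are added:

```python
# Sketch: how per-source error rates compound in a merged database.
# Assumes errors arise independently in each source -- an illustration
# of the argument in the text, not a model of any actual system.

def merged_error_rate(source_error_rates):
    """Probability that a merged record carries at least one error,
    given the per-source error probabilities."""
    p_clean = 1.0
    for p in source_error_rates:
        p_clean *= (1.0 - p)  # record survives this source unscathed
    return 1.0 - p_clean

# Two scrupulous sources (A, B) and one sloppy one (C):
rate = merged_error_rate([0.01, 0.01, 0.10])
print(f"{rate:.4f}")  # about 0.118 -- worse than the worst single source
```

Matching errors from the record-linkage step itself would then be layered on top of this baseline, which is why the text warns that merged files can be substantially less accurate than any of their inputs.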
All the above difficulties are manifested even when a perfect profile is developed for problematic people. But imperfect profiles combined with erroneous data will lead to higher levels of false positives than either alone. Moreover, if we believe that data are of higher quality and that profiles are more accurate than they actually are, the rate of false negatives—people who are potential terrorists but go undetected—will also grow, and this endangers all of us.

Record linkage also lies at the heart of data-fusion methods and has major implications for privacy protection and harm to people. The literature on record linkage22 makes it clear that to achieve low rates of error (high accuracy) one needs both “good” variables for linkage (such as names) and ways to organize the data by “blocks,” such as city blocks in a census context or well-defined subsets of individuals characterized by variables that contain little or no measurement error. As measurement error grows, the quality of matches deteriorates rapidly in techniques based on the Fellegi-Sunter method. Similarly, as the size of the blocks used for sorting data for matching purposes grows, so too do both the computational demands of comparing records in pairs and the probabilities of correct matches.

Low-quality record-linkage results will almost certainly increase the rates of both false positives and false negatives when merged databases are used to attempt to identify terrorists or potential terrorists. False negatives correspond to the failure of systems to detect terrorists when they are present and represent a systemic failure. False positives impinge on individual privacy. Government uses of such methods, either directly or indirectly through the acquisition of commercial databases constructed with fusion technologies, need to be based on adequate information on data quality, especially as related to record-linkage methods.

20 R. Gavison, “Privacy and the limits of the law,” pp. 332-351 in Computers, Ethics, and Social Values, D.G. Johnson and H. Nissenbaum, eds., Prentice Hall, Upper Saddle River, N.J., 1995.
21 W.E. Winkler, Overview of Record Linkage and Current Research Directions, Statistical Research Report Series, No. RRS2006/02, U.S. Bureau of the Census, Statistical Research Division, Washington, D.C., 2006, and W.E. Winkler, “The quality of very large databases,” Proceedings of Quality in Official Statistics, 2001, CD-ROM (also available at http://www.census.gov/srd/www/byyear.html as RR01/04).

L.3 ENHANCING PRIVACY THROUGH INFORMATION-SYSTEM DESIGN

Some aspects of information-system design are related to the ability to protect privacy while maintaining effectiveness, and there are many designs (and tradeoffs among those designs) for potential public policies regarding data privacy for information systems. Moreover, times and technology have changed, and a new set of policies regarding privacy and information use may be needed.
To be rational in debating and choosing the policies and regulations that will provide the most appropriate combination of utility (such as security) and privacy, it is helpful to consider the generic factors that influence both. This section lists the primary components of information-system design that are related to privacy and indicates the issues that are raised in considering various options.

L.3.1 Data and Privacy

A number of factors substantially influence the effects of a deployed information system on privacy. Debates and regulations can benefit from differentiating systems and applications on the basis of the following:

22 See, for example, T.N. Herzog, F.J. Scheuren, and W.E. Winkler, Data Quality and Record Linkage Techniques, Springer Science and Business Media, New York, N.Y., 2007.

• Which data features are collected. In wiretapping, recording the fact that person A telephoned person B might be less invasive than recording the conversation itself.
• Covertness of collection. Data may be collected covertly or with the awareness of those being monitored. For example, images of airport passengers might be collected covertly throughout the airport or with passenger awareness at the security check-in.
• Dissemination. Data might be collected and used only for a local application (for example, at a security checkpoint) or might be disseminated widely in a nationwide data storage facility accessible to many agencies.
• Retention periods. Data might be destroyed within a specified period or kept forever.
• Use. Data might be restricted to a particular use by policy (for example, anatomically revealing images of airport passengers might be available for the sole purpose of checking for hidden objects) or unrestricted for arbitrary future use. One policy choice of particular importance is whether the data are subject to court subpoena for arbitrary purposes or the ability to subpoena is restricted to specified purposes.
• Audit trail. An audit trail (showing who accessed the data and when) should be kept.
• Control of permissions. If data are retained, policy might specify who can grant permission for dissemination and use (for example, the collector of the data, a court, or the subject of the data).
• Trust. The perception of privacy violations depends heavily on the trust of the subject that the government and everyone who has access to the data will abide by the stated policy on data collection and use.
• Analytical methods involved. Analysis of the data collected or the presentation of analytical results might be restricted by policy. For example, in searching for a weapon at a checkpoint, a scanner might generate anatomically correct images of a person’s body in graphic detail. What is of interest is not those images but rather the image of a weapon, so analytical techniques that detected the presence or absence of a weapon in a particular scan could be used, and that fact (presence or absence) could be reported rather than the image itself.

L.3.2 Information Systems and Privacy

Chapter 2 describes a framework for assessing information-based programs. But the specifics of a program’s implementation make a huge difference in the extent to which it protects (or can protect) privacy. The following are some of the implementation issues that arise.

• Does the application require access to data that explicitly identify individuals? Applications such as searching a database for all information about a particular person clearly require access to data that are associated with individual names. Other applications, such as discovering the pattern of patient symptoms that are predictive of a particular disease, need not necessarily require that individual names be present.
• Does the application require that individually identified data be reported to its human user, and, if so, under what conditions? Some computer applications may require personally identified data but may not need to report personal identifications to their users. For example, a program to learn which over-the-counter drug purchases predict emergency-room visits for influenza might need personally identified data of drug purchases so that it can merge them with personally identified emergency-room records, but the patterns that it learns and reports to the user need not necessarily identify individuals or associate specific data with identifiable individuals. Other systems might examine many individually identified data records but report only records that match a criterion specified by a search warrant.
• Is the search of the data driven by a particular starting point or person, or is it an indiscriminate search of the entire data set for a more general pattern? Searches starting with a particular lead (for example, Find all people who have communicated with person A in the preceding week) differ from searches that consider all data equally (for example, Find all groups of people who have had e-mail exchanges regarding bombs). The justification for the former hinges on the justification for suspecting person A; the latter involves a different type of justification.
• Can the data be analyzed with privacy-enhancing methods? Technologies in existence and under development may in some cases enable discovery of general patterns and statistics from data while providing assurances that features of individual records are not divulged.
• Does the data analysis involve integrating multiple data sources from which additional features can be inferred, and, if so, are these features inferred and reported to the user? In some cases, it is possible to infer data features that are not explicit in the data set, especially when multiple data sets are merged. For example, it is possible in most cases to infer the names of people associated with individual medical records that contain only birthdates and ZIP codes if that data set is merged with a census database that contains names, ZIP codes, and birthdates.

L.4 STATISTICAL AGENCY DATA AND APPROACHES

Government statistical agencies have been concerned with confidentiality protection since early in the 20th century and work very hard to “deidentify” information gathered from establishments and individuals. They have developed methods for protecting privacy. Their goals are to remove information that could be harmful to a respondent from released data and to protect the respondents from identification. As a consequence, released statistical data, even if they may be related to individuals, are highly unlikely to be linkable with any reasonable degree of precision to other databases that are of use in prevention of terrorism. That is, the nature of redaction of individually identifiable information seems to yield redacted data that are of little value for this purpose.

L.4.1 Confidentiality Protection and Public Data Release

Statistical agencies often promise confidentiality to their respondents regarding all data provided in connection with surveys and censuses, and, as noted above, these promises are often linked to legal statutes and provisions. But the same agencies have a mandate to report the results of their data-collection efforts to others either in summary form or in tables, reports, and public-use microdata sample (PUMS) files. PUMS files are computer-accessible files that contain records of a sample of housing units with information on the characteristics of each unit and the people in it. The data come in the form of a sample of a much larger population; as long as direct identifiers are removed and some subset of other variables “altered,” there is broad agreement that sampling itself provides substantial protection. Roughly speaking, the probability of identifying an individual’s record in the sample file is proportional to the probability of selection into the sample (given that it is not known whether a given individual is in the sample).23 (In particular, if a person is not selected for the sample, the person’s data are not collected and his or her privacy is protected.)
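The “proportional to the probability of selection” heuristic can be put in rough numeric terms. The sketch below is a deliberately crude simplification of our own, not the record-level disclosure-risk measures of Elamir and Skinner cited in note 23: it treats the risk that a released record matching a target’s quasi-identifiers really is the target as the sampling fraction divided by the number of population members sharing those quasi-identifiers.

```python
# Back-of-the-envelope sketch of how sampling limits identification risk.
# Simplified assumptions (ours, for illustration): the intruder knows the
# target's quasi-identifiers, k people in the population share them, the
# intruder does not know whether the target was sampled, and every person
# is sampled independently with the same probability.

def reidentification_risk(sampling_fraction, k_population):
    """Rough chance that a released record matching the target's
    quasi-identifiers is the target's own record: the target must be
    sampled (factor = sampling_fraction), and among the k population
    members sharing those identifiers the match is correct with
    probability 1/k (a crude heuristic, not an exact measure)."""
    return sampling_fraction / k_population

# A 1-in-100 sample; target shares quasi-identifiers with 4 others:
print(reidentification_risk(0.01, 5))  # roughly 0.002 under these assumptions
```

Even a population-unique target (k = 1) faces risk bounded by the sampling fraction in this toy model, which is the intuition behind the broad agreement that sampling itself provides substantial protection.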
It is also possible to provide privacy guarantees even in the worst case (that is, in the worst case over sampling).24 Nonetheless, many of the methods used by the agencies are ad hoc and may or may not “guarantee” privacy on their own, let alone when used in combination with data from multiple databases. Nor would they satisfy the technical definitions of privacy described above. Rather, they represent an effort to balance data access with confidentiality protection—an approach that fits with technical statistical frameworks.25 Such trade-offs may be considered informally, but there are various formal sets of tools for their quantification.26

23 See E.A.H. Elamir and C. Skinner, “Record level measures of disclosure risk for survey microdata,” Journal of Official Statistics 22(3):525-539, 2006, and references therein.
24 A. Evfimievski, J. Gehrke, and R. Srikant, “Limiting privacy breaches in privacy preserving data mining,” pp. 211-222 in Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, ACM, New York, N.Y., 2003; C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity of functions in private data analysis,” pp. 265-284 in Proceedings of the 3rd Theory of Cryptography Conference, ACM, New York, N.Y., 2006.
25 For a discussion of the approaches to trade-offs, see the various chapters in Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J. Lane, J. Theeuwes, and L. Zayatz, eds., North-Holland Publishing Company, Amsterdam, 2001.
26 A framework is suggested in G.T. Duncan and D. Lambert, “Disclosure-limited data dissemination (with discussion),” Journal of the American Statistical Association 81:10-28, 1986. See additional discussion of the risk-utility trade-off by G.T. Duncan, S.E. Fienberg, R. Krishnan, R. Padman, and S.F. Roehrig, “Disclosure limitation methods and information loss for tabular data,” pp. 135-166 in Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J. Lane, J. Theeuwes, and L. Zayatz, eds., North-Holland Publishing Company, Amsterdam, 2001. A full decision-theoretic framework is developed in M. Trottini and S.E. Fienberg, “Modelling user uncertainty for disclosure risk and data utility,” International Journal of Uncertainty, Fuzziness, and Knowledge-Based Systems 10(5):511-528, 2002; and M. Trottini, “A decision-theoretic approach to data disclosure problems,” Research in Official Statistics 4(1):7-22, 2001.

Duncan and Stokes apply such an approach to the choice of “topcoding” for income, that is, truncating the income scale at some maximum value.27 They illustrate trade-off choices for different values of topcoding in terms of risk (of reidentification through a specific form of record linkage) and utility (in terms of the inverse mean square error of estimation for the mean or a regression coefficient). For some other approaches to agency confidentiality and data release in the European context, see Willenborg and de Waal.28

27 G.T. Duncan and S.L. Stokes, “Disclosure risk vs. data utility: The R-U confidentiality map as applied to topcoding,” Chance 3(3):16-20, 2004.
28 L. Willenborg and T. de Waal, Elements of Statistical Disclosure Control, Springer-Verlag Inc., New York, N.Y., 2001.

L.4.2 Record Linkage and Public Use Files

One activity that is highly developed in the context of statistical-agency data is record linkage. The original method that is still used in most approaches goes back to pioneering work by Fellegi and Sunter, who used formal probabilistic and statistical tools to decide on matches and nonmatches.29 Inherent in the method is the need to assess the accuracy of matching and the error rates associated with decision rules.30

29 I. Fellegi and A. Sunter, “A theory for record linkage,” Journal of the American Statistical Association 64:1183-1210, 1969.
30 See, for example, W. Winkler, The State of Record Linkage and Current Research Problems, Statistical Research Report Series, No. RR99/04, U.S. Census Bureau, Washington, D.C., 1999; W.E. Winkler, “Re-identification methods for masked microdata,” pp. 216-230 in Privacy in Statistical Databases, J. Domingo-Ferrer, ed., Springer, New York, N.Y., 2004; and M. Bilenko, R. Mooney, W.W. Cohen, P. Ravikumar, and S.E. Fienberg, “Adaptive name-matching in information integration,” IEEE Intelligent Systems 18(5):16-23, 2003.

The same ideas are used, with refinements, by the Census Bureau to match persons in the Current Population Survey (sample size, about 60,000 households) with IRS returns. The Census Bureau and the IRS provide the data to a group that links the records to produce a set of files that contain information from both sources. The merged files are redacted, and noise is added until neither the Census Bureau nor the IRS can rematch the linked files with their original files.31 The data are released as a form of PUMS file. Those who prepared the PUMS file have done sufficient testing to offer specific guarantees regarding the protection of individuals whose data went into the preparation of the file. This example illustrates not only the complexity of data protection associated with record linkage but also the likely lack of utility of statistical-agency data for terrorism prevention, because linked files cannot be matched to individuals.

31 For more details, see J.J. Kim and W.E. Winkler, “Masking microdata files,” pp. 114-119 in Proceedings of the Survey Research Methods Section, American Statistical Association, Alexandria, Va., 1995; and J.J. Kim and W.E. Winkler, Masking Microdata Files, Statistical Research Report Series, No. RR97-3, U.S. Bureau of the Census, Washington, D.C., 1997.
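The add-noise-until-rematching-fails loop described above can be mimicked in miniature. The sketch below is a toy of our own devising, not the Kim-Winkler masking procedure: it perturbs a single numeric field and measures how often a nearest-neighbor rematch against the original file still recovers the true record.

```python
import random

# Toy version of mask-then-try-to-rematch: add Gaussian noise to one
# numeric field, then check whether nearest-neighbor matching against the
# original file still identifies each record. Illustration only; actual
# agency masking procedures operate on many fields and are far more
# elaborate.

def mask(values, noise_scale, rng):
    """Return a noisy copy of the field (larger scale = more masking)."""
    return [v + rng.gauss(0.0, noise_scale) for v in values]

def rematch_rate(original, masked):
    """Fraction of masked records whose nearest original value is the
    record they actually came from."""
    hits = 0
    for i, m in enumerate(masked):
        nearest = min(range(len(original)), key=lambda j: abs(original[j] - m))
        hits += (nearest == i)
    return hits / len(original)

rng = random.Random(0)
incomes = [30_000 + 5_000 * i for i in range(20)]  # well-separated originals
for scale in (100.0, 20_000.0):
    rate = rematch_rate(incomes, mask(incomes, scale, rng))
    print(f"noise scale {scale:>9}: rematch rate {rate:.2f}")
```

In this toy setting, small noise leaves the file fully rematchable; the noise scale must approach the spacing between records before the rematch rate collapses. Real masking faces the same trade-off across many fields at once, which is why noise is added iteratively until neither source agency can rematch the linked files.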