Tadayoshi Kohno, the Short-Dooley Professor of Computer Science and Engineering at the University of Washington, began the session by noting that the next panel would also focus on emerging technologies, with an emphasis on analytics and the cloud. He encouraged the participants to prepare questions to pose to panelists during the open discussion session. Kohno, as moderator, then introduced the following panelists and gave each of them 5 minutes for opening comments:
- Carl Gunter, professor of computer science, University of Illinois;
- Roxana Geambasu, assistant professor of computer science, Columbia University;
- Steven M. Bellovin, Percy K. and Vidal L. W. Hudson Professor of Computer Science, Columbia University; and
- James L. Wayman, research administrator, San Jose State University.
Carl Gunter discussed privacy implications of the growing use and collection of digital health data. He distinguished between “health care” technologies (tools for diagnosis and treatment of disease) and “health” technologies (the quickly growing market of tools for disease prevention and encouragement of healthy habits, such as the Fitbit), and suggested that these two areas may be moving toward a disruptive convergence.
Gunter described emerging capabilities in analysis of both structured and semi-structured data, including doctors’ notes or even information from a Fitbit or an Apple Watch, and noted that data mining of electronic health records (EHRs) has led to the identification of prescription drug risks. Such capabilities could have enormous societal benefits, but they require access to large quantities of data about individuals, who may not want their records to be accessible even for such purposes.
He suggested that the rapidly changing field of health IT has a number of characteristics that could make it a useful laboratory for monitoring privacy trends and developments, including the following:
- There are many stakeholders with competing interests;
- Regulations and rules are evolving;
- Privacy provisions in existing laws such as the Health Insurance Portability and Accountability Act (HIPAA) and the Health Information Technology for Economic and Clinical Health (HITECH) Act were developed following much public debate and negotiation;
- The field is seeing increased use of distributed networks where institutions hold data to support research, but share answers to research queries on their data;
- Analysis of health data can yield great public benefit (in the form of medical breakthroughs and advances in public health); and
- Collection and analysis of data can pose privacy risks.
In particular, the field could be a valuable, evolving case study on balancing the use of information for public good with individual rights to privacy.
Roxana Geambasu discussed her research on increasing privacy online. Her work emphasizes enabling development and design with privacy in mind, and increasing user awareness of the privacy implications of their online actions.
She noted that privacy is scarce on the Internet; indeed, many users are eager to share their data online and many services aggressively collect and use that information. Today’s Web services collect immense amounts of information, including every click and every site we visit, and mine our documents and emails. This data can be used to target ads or fine-tune prices, sometimes to the benefit of the user and sometimes not. Users are generally unaware of how such data are used or abused by the collectors.
Geambasu described XRay, a tool developed by her research group that can reveal how Web services use personal data for targeting or personalization. It works by monitoring a user’s inputs to and outputs from these services, identifying correlations using test accounts populated with subsets of the user’s information. She noted that the tool has proved remarkably accurate (around 80-90 percent precision) with Gmail, YouTube, and Amazon. By increasing transparency about how Web services use data, tools like XRay increase user awareness and, potentially, pressure on services to behave responsibly. She noted that the tool could also be of use to privacy watchdogs, such as the Federal Trade Commission (FTC), and investigative journalists.
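The published XRay system is considerably more sophisticated, but the differential-correlation idea it relies on can be sketched briefly: populate shadow accounts with different subsets of a user’s inputs, record which outputs each account receives, and score each (input, output) pair by how much the input’s presence changes the output’s frequency. All names below (the inputs, the ads, and the `fake_service` stand-in for a real Web service) are illustrative assumptions, not part of XRay itself.

```python
from itertools import combinations

def correlation_scores(inputs, observe):
    """Score each (input, output) pair by how much the input's presence
    changes the output's frequency across shadow accounts.
    `observe(subset)` stands in for querying a test account populated
    with `subset` of the user's data and returning the outputs shown."""
    # Shadow accounts: all subsets of half the inputs (toy scale only).
    accounts = [set(c) for c in combinations(inputs, len(inputs) // 2)]
    observed = [(a, observe(a)) for a in accounts]
    outputs = set().union(*(o for _, o in observed))
    scores = {}
    for inp in inputs:
        with_inp = [o for a, o in observed if inp in a]
        without = [o for a, o in observed if inp not in a]
        for out in outputs:
            p_with = sum(out in o for o in with_inp) / max(len(with_inp), 1)
            p_without = sum(out in o for o in without) / max(len(without), 1)
            scores[(inp, out)] = p_with - p_without
    return scores

# Toy "service": ad_A targets the flights email; ad_B is untargeted.
def fake_service(subset):
    return {"ad_B", "ad_A"} if "email_flights" in subset else {"ad_B"}

scores = correlation_scores(["email_flights", "email_cats", "email_news"],
                            fake_service)
# The pair ("email_flights", "ad_A") receives the highest score.
```

In this toy run, the targeted pair scores 1.0 while uncorrelated pairs score at or below zero; the real system adds statistical rigor and scales the shadow accounts sublinearly in the number of inputs.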
She pointed out that the algorithms used to analyze online data can unintentionally lead to harmful, unintended, and/or unanticipated consequences, such as price discrimination. Her group has also created an infrastructure called FairTest to help programmers identify privacy bugs in their applications, enabling them to avoid discriminatory or other unintended effects. Geambasu suggested that the strategies embodied by these tools for enhancing online privacy could be applicable to privacy protection in other domains.
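FairTest itself offers a range of statistical association metrics; as a purely illustrative sketch of the underlying idea, one can compare mean outcomes (say, prices) across subpopulations and flag large disparities for review. The record fields and values here are assumptions invented for the example, not FairTest’s API.

```python
from collections import defaultdict

def disparity(records, group_of, outcome_of):
    """Compare mean outcomes across subpopulations and report the spread.
    A large spread flags a candidate "privacy bug" (e.g., price
    discrimination) for human review; real tools use rigorous statistics."""
    totals = defaultdict(lambda: [0.0, 0])
    for r in records:
        g = group_of(r)
        totals[g][0] += outcome_of(r)
        totals[g][1] += 1
    means = {g: s / n for g, (s, n) in totals.items()}
    return max(means.values()) - min(means.values()), means

# Hypothetical records: the same product priced by buyer ZIP code.
records = [{"zip": "10001", "price": 10.0},
           {"zip": "10001", "price": 12.0},
           {"zip": "60601", "price": 20.0}]
spread, means = disparity(records, lambda r: r["zip"], lambda r: r["price"])
# spread == 9.0: average price differs by $9 between the two ZIP codes.
```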
Steven M. Bellovin noted that there are many definitions of privacy. He identified one of the most common definitions as the ability to control what happens to one’s personal information. Based upon this definition, he identified the following two key types of privacy offenses: (1) using data for a purpose other than that for which they were originally collected and (2) linking data from two or more different sources. He noted that the second type of offense is a specific instance of the first.
He illustrated the first offense with the example of driver’s licenses, which are intended to indicate that an individual is legally qualified to drive, but are used secondarily for boarding airplanes or entering bars. He pointed out that many bars scan driver’s licenses to verify their validity, and some actually record a patron’s name, address, and demographic data, which may itself constitute a privacy violation.
Bellovin went on to discuss privacy issues related to biometrics data, including fingerprints or facial patterns. He pointed out that it is difficult to control the secondary use of biometric information. For example, an individual’s image could be obtained or captured without his or her knowledge in a public place, then matched to other sources. If linked to an individual’s Facebook profile, personal information about that person can be obtained, whether directly or through data analytics; for example, a student project from the Massachusetts Institute of Technology (MIT)1 showed that one can accurately infer an individual’s sexual orientation by analyzing that person’s Facebook network. Bellovin noted that compromised data from the recent Office of Personnel Management (OPM) breach include information that could potentially be linked to Facebook photos, leaked records from Ashley Madison, or other data sources to reveal sensitive information even if the other data sources contain no personally identifiable information (PII). This underscores the fact that privacy issues can arise even in the absence of PII. Even without a user’s name, a Web service such as Netflix or Amazon can build a dossier for that user. Health records, even in the absence of PII, are still extremely personal and can be re-identified.
1 C. Jernigan and B.F. Mistree, 2009, Gaydar: Facebook friendships expose sexual orientation, First Monday, 14(10).
He suggested that biometric data, when linked to other sources, present tremendous potential for privacy violation. He proposed that using salted hashes of biometric data might be more privacy-preserving than direct use of biometric data such as a facial image or a human fingerprint.
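As a minimal illustration of the salted-hash approach Bellovin alluded to, and assuming the biometric sample has already been reduced to a stable, repeatable template (which in practice requires techniques such as fuzzy extractors, since raw biometric readings are noisy), a verifier can check a match without storing the template itself, and per-enrollment salts keep two databases’ records unlinkable:

```python
import hashlib
import hmac
import os

def enroll(template: bytes):
    """Store only a salted hash of a stabilized biometric template.
    (Raw biometric samples are noisy; producing a repeatable `template`
    requires techniques such as fuzzy extractors, not shown here.)"""
    salt = os.urandom(16)  # unique per enrollment
    return salt, hashlib.sha256(salt + template).digest()

def verify(template: bytes, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.sha256(salt + template).digest()
    return hmac.compare_digest(candidate, digest)  # constant-time compare

salt, digest = enroll(b"stabilized-fingerprint-features")
assert verify(b"stabilized-fingerprint-features", salt, digest)
assert not verify(b"someone-else-entirely", salt, digest)
# Distinct salts mean two databases enrolling the same person store
# unlinkable digests, unlike a raw fingerprint image.
```

In a deployed system a deliberately slow key-derivation function would be preferable to a single SHA-256 pass, since biometric templates have limited entropy.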
James Wayman discussed two major themes: (1) meaning derived from the absence of information and (2) the privacy of members of the intelligence community (IC).
He began by pointing out the idea from Zen philosophy that the empty space between objects is just as important as the objects themselves. He carried this into the intelligence field, noting that the utility of information is often inverted from what one might expect. Sometimes the absence of information reveals a lot, as when “listening in the gaps” between pieces of information during the intelligence practice of traffic analysis. Wayman provided a specific example: fugitive terrorists are likely to be off the grid, meaning that they may be the very people for whom no communications data exist. He suggested that the IC would like to do more data reduction, but persists with data retention because it is hard to know what to throw away when the ability to recognize the absence of a signal may also be important.
He also suggested a need to consider not only the emerging IC technologies that could threaten privacy, but also how emerging technologies threaten the privacy of members of the IC.
Bellovin suggested that unintended consequences often arise in the context of secondary use. He recalled a statistic suggesting that commercial data brokers may have more than a thousand data points on the average American. Secondary use of data can lead analysts to draw spurious conclusions from observed correlations. Incorporating such conclusions into hiring decisions or insurance qualifications could have unfair and detrimental consequences.
Geambasu suggested that unintended consequences may become increasingly significant as we move into a data exchange-based world. Primary- and third-party-collected data are obtained by fourth parties—data brokers—with whom users may never interact. Data brokers hold vast quantities of data about users, the flow of which cannot be effectively tracked or managed.
Wayman noted that the U.S. VISIT2 program, through which the U.S. government collects biometric information about foreign visitors to the country, led other nations to collect biometric information from non-national travelers. This, combined with the leak of fingerprint records from the recent OPM breach, could have significant consequences for those within the intelligence community.
Wayman also addressed the unintended consequences of the absence of data in a data-rich world. He noted that people who turn off their phones when entering an IC facility to avoid tracking might inadvertently raise a red flag. Bellovin highlighted an example from World War II: when the research of U.S. nuclear physicists ceased to be published, the absence of papers tipped off the Soviets that the work had been taken out of the public eye.
Panelists identified the technologies whose privacy implications they found most worrisome.
Bellovin noted that he was most concerned about the potential privacy implications of remote (or involuntary) capture of biometric information, and those of machine learning, which can already arrive at sensitive correlations—and these technologies continue to advance.
Gunter reiterated his concerns around the convergence between health care and health technologies. For example, in the health care sphere, the security of a wirelessly controllable defibrillator is scrutinized by regulators, but that of a fitness monitor like Fitbit is not. He worries that the incentive to add capabilities to lower-end products could lead to a host of insecure mid-level products, such as an insulin pump that transmits information to—or is even controlled by—a mobile phone, bringing with them significant security and privacy risks.
2 This program has been superseded by the Office of Biometric Identity Management program, enacted in 2013.
Wayman pointed out that iris recognition at a distance, kick-started through the Defense Advanced Research Projects Agency’s Human Identification at a Distance program, is already commercially available and can work at a distance of up to approximately 10 feet. The Intelligence Advanced Research Projects Agency’s Janus project is currently focusing on improving facial recognition under a variety of conditions.
Geambasu pointed out that the online data landscape is complex, and information is tracked by many agents on a variety of websites. Whether or not these trackers know a user’s name, they may have information about other sites a user has visited, or even other devices used, and may exchange cookies with others. Neither users nor researchers fully understand this landscape, and all that can currently be done is to try to break the black box around such exchange. She noted that one (possibly controversial) solution would be to make such exchange of data explicitly legal and then devise an infrastructure that would ensure rigorous compliance with a set of appropriate controls.
Gunter added that there are similar issues around the architecture of advertising on mobile phones. Specifically, on Android phones, an app’s advertisers can have the same privileges as the app itself. It is thus possible for an advertiser to access a phone’s microphone or camera, the effects of which some have been trying to measure. There is a large potential for harm in this space; advertisement-supported apps are quite popular because they tend to be free, but advertisers may have access to sensitive data, such as medical information, collected by the apps themselves.
Bellovin reiterated the extent of online tracking, referring to a statistic that as much as 40 percent of people’s Internet bandwidth goes to trackers and ads. He agreed that it is difficult even for a knowledgeable person to understand where his or her information is going.
He noted that the Fair Information Practice Principles (FIPPs) instantiated in the Privacy Act of 1974 do not apply to the IC, and that U.S. law is unclear on the circumstances under which the government can purchase data from third parties—an action that could enable circumvention of other provisions in the law. For example, under the Stored Communications Act, communications companies cannot sell or give certain information to the government. However, there seems to be no prohibition against the government obtaining this information from a data broker.
Geambasu noted that there could be value in engaging auditors to monitor and provide oversight of data practices that users cannot see. Gunter asked who the auditors might be, and noted that allowing a company or the government to perform this role would raise trust issues and potential conflicts. Geambasu pointed out that the financial sector has an established infrastructure, though it may not work perfectly, and suggested that a similar infrastructure could be established for auditing the Web.
Gunter pointed out that direct-to-consumer genomic testing results can reveal hereditary information and thus enable inference about a subject’s family members. He pointed out that medical professionals have well-defined protocols for revealing the presence of genetic markers that could have dire implications for a patient’s family members, but there is no regulation or guidance on this in the direct-to-consumer space. It is not uncommon for an individual to post his or her entire genetic sequence online, which could have unwanted effects on family members.
Gunter also discussed a study3 that addressed the notion of ownership of identity, describing one of its conclusions—namely, that someone who holds data about an individual actually owns that identity, whether or not it is complete or accurate. He also pointed out that an individual’s own self-identity could be less accurate than the identity held by another party, because the individual might engage in self-delusion. Bellovin suggested that the findings of this study could be useful to workshop participants.
Bellovin also noted that Facebook’s tagging function could enable an individual to reveal information about someone else. Coincidence of location data can also be used to infer information about an individual whose behavior is linked to that of others about whom much is known. For example, machine learning correlations within a given data set have enabled identification of marital/relationship status and ethnicity of individuals about whom only location data over time had been collected.
Geambasu also raised the example of Google Glass, which can collect information on people other than its users, including through video. This and other augmented reality technologies present significant challenges to managing privacy. She pointed out that setting data access controls with mainstream technologies such as Facebook is already difficult, and the management problem will likely increase significantly.
Geambasu suggested that companies such as Google have infrastructures for auditing access to data and for protecting data not in active use, involving encryption and the minimization and compartmentalization of access. Such strategies, along with anonymization of data moving between services, could be a good model.
Bellovin pointed out the value of minimization: If data do not exist, they cannot be abused. Geambasu agreed, but also pointed out that some data holders keep seemingly unnecessary data in case they might be useful in the future; these parties can take other strategies to separate and sequester data that are not currently valuable to reduce the risk of misuse.
Gunter proposed the idea of developing abstract frameworks that could allow analogies between different sectors. For example, this could enable an understanding of how people feel about privacy with respect to smart electric meters to inform strategies for managing privacy in connected vehicles. He noted an idea, from a recent workshop related to intelligence, of creating a framework that distinguishes data collection from data use—an idea that has not been emphasized in other sectors.
Bellovin pointed out that a recent President’s Council of Advisors on Science and Technology report4 emphasized controls on data use rather than on collection. He noted that some privacy advocates are uncomfortable with this, because even the best use control policies can be changed in ways that open up pathways for unintended or harmful use of stored data.
Wayman referred to the idea that “privacy is a concept in disarray”; people struggle to articulate its meaning.5 He shared an anecdote from past work on an International Organization for Standardization committee for terminology development, where a representative pointed out that there is no single word for privacy in Russian. Wayman suggested that it could prove fruitful to focus instead on more carefully articulated rights. He pointed out a recent academic paper discussing a “theory of creepy.”6 He proposed that whether or not a practice is perceived as “creepy” could be a very useful benchmark.
3 National Research Council, 2003, Who Goes There?: Authentication Through the Lens of Privacy, Washington, D.C.: The National Academies Press.
4 President’s Council of Advisors on Science and Technology, 2014, Big Data and Privacy: A Technological Perspective, Washington, D.C., https://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdf.
5 D.J. Solove, 2006, A taxonomy of privacy, University of Pennsylvania Law Review 154:477-564.
Gunter suggested that privacy, like friendship or security, will never have a precise definition, and that this should not dissuade people from respecting it or thinking about it. Another participant suggested that there are many words and terms in any culture that embody facets of the values we associate with the term privacy.
One participant, picking up on the earlier discussion of controls on collection and use, suggested that reasonable limits on data collection could be impractical and difficult to define, partly because of the vast quantity and range of data that might be collected and partly because the appropriateness of collection depends on the ultimate use—which is largely unknowable. The participant wondered about the possibility of instead developing a framework for data control and use, noting that the appropriateness of control and use is situation-dependent.
The group discussed the idea of a mathematical framework that might enable objective and automated generation of limits on data use. Several participants noted recent work attempting to develop formal models of the data use rules contained in HIPAA, aiming to enable computers rather than attorneys to make data-sharing decisions. One participant noted that researchers had found holes in this approach. One of the panelists suggested that there are no general concepts of use that are immediately and universally applicable; every concept of use would require its own ontology to achieve a context-specific meaning.
A participant identified some of the limitations of restrictions on use:
- They are difficult to enforce, and enforcement falls to generally under-resourced agencies.
- Because restrictions are generally imposed only after something bad has happened, they are more punitive than preventative.
- Use restrictions may be subject to attack under the First Amendment; if data were lawfully collected, what is the legal justification for limiting their use?
A panelist raised an example of advertising targeted at an individual whose online activities displayed characteristics associated with depression, pointing out that targeted advertisements could be helpful (for example, advertisements for a support group) or detrimental (for example, advertisements for alcohol). Another participant suggested that rather than focusing on the ethics of the outcome in this scenario, we should actually be more concerned with the ethics of conducting this level of profiling without the user’s consent in the first place, whether or not there is currently a legal mechanism to restrict such profiling.
Bellovin suggested that one benchmark to consider for use restriction is whether possession or analysis of the data in question is likely to result in a data holder taking an action that he or she would not have otherwise taken. Weighing the benefits of a given use against its “creepiness” factor could be helpful.
6 O. Tene and J. Polonetsky, 2015, A theory of creepy: Technology, privacy, and shifting social norms, Yale Journal of Law and Technology 16(1):2.
Geambasu proposed some important privacy practices that could be deployed at the service level:
- Conduct extensive privacy testing while developing large and complex applications. Privacy implications are often unintended consequences, and must be actively sought in order to be prevented. Such testing could be required by companies and conducted by programmers.
- Manage data effectively. If data are not being used, keep them separate and secure—even require permission all the way up the management chain before they can be accessed.
Bellovin proposed a few more practices:
- Avoid globally unique identifiers, which make it easy to link data across time and among different applications.
- Avoid looking at data that are not necessary. For example, the Outlook mail service does not read email content when selecting ads to display.
- Do not collect data that are not needed. If they do not exist, then they cannot be abused.
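Bellovin’s first suggestion, avoiding globally unique identifiers, is often implemented by deriving a distinct pseudonym per service from a locally held secret, so that records cannot be joined across services. The key, user, and service names below are purely illustrative assumptions:

```python
import hashlib
import hmac

# Illustrative only: a locally held secret, never shared with services.
SECRET_KEY = b"device-local-secret"

def pseudonym(user_id: str, service: str) -> str:
    """Derive a per-service identifier: stable within one service,
    but unlinkable across services without the secret key."""
    mac = hmac.new(SECRET_KEY, f"{service}:{user_id}".encode(), hashlib.sha256)
    return mac.hexdigest()[:16]

mail_id = pseudonym("alice", "mail")
maps_id = pseudonym("alice", "maps")
assert mail_id != maps_id                      # no cross-service join key
assert mail_id == pseudonym("alice", "mail")   # stable within one service
```

Because each service sees only its own derived identifier, the linkage across time and applications that Bellovin warned about requires the secret key rather than coming for free with the identifier itself.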
A participant noted that the discussion had centered on technologies in the context of academia and the private sector. It was pointed out that the workshop was in fact meant to help expose the IC to outside research and practice around privacy, to provide new perspectives, and to help enrich thought about privacy within the IC. This was followed by some discussion of privacy in the context of the IC.
Wayman noted that the IC is clever about using data, suggesting that any general rules about the IC’s use of data could have minimal impact. He suggested that the IC does a good job of protecting the individual privacy of members of the public, but that the privacy risks for those within the IC may be substantially higher.
Another participant noted that translating the FIPPs or other policies into concrete and substantive operational requirements is challenging across any industry, and suggested that technologies to help with this translation could be useful to those who design applications.
Bellovin pointed out that the FIPPs apply to the U.S. government and have analogues in the data protection regimes of other developed nations; they are not, however, broadly applicable to the commercial sector, with the exception of HIPAA. He noted that the FIPPs may be rather obsolete, because they focus on identity; he reiterated that profiling and inference of sensitive information can occur whether or not a data set contains identity information. He said that part of his research centers on creating a new formal definition of privacy and the harms that result from various activities. He suggested that for the IC, privacy violations are more likely to arise when focusing on a specific person—but much of the IC’s work is concerned with larger trends, rather than individuals.
Gunter added that it makes sense to think of FIPPs as a starting point, and ask how they should be extended, suggesting that looking beyond the FIPPs could be an important strategy. He also suggested that more progress could be made on privacy by drilling down to sector-specific contexts and ontologies than by focusing on high-level ideas.