Advances in computational social science provide new ways to understand human behaviors, generating insights that may facilitate intelligence analysis and provide decision support. Many researchers in the social and behavioral sciences (SBS) and public officials are excited about the possibilities that big data and computational social science hold for enhanced analysis of open-source intelligence in particular, including intelligence gleaned from social media, sensor data, and other digital information produced by routine human actions and behaviors (Harman, 2015). But many of the qualities that make big data exciting and accessible to researchers also create risks of harm. Big datasets are not just large in volume: they often contain sensitive information at a grain size that could allow individual identities to be uncovered. These data are durable and can be shared across institutional and national borders at scale and at high speed.
As more people live more of their lives online and as sensor technologies proliferate, the volume and range of sensitive digital data will grow. These factors increase the potential for harms from research, including loss of privacy and autonomous control over personal information for the individuals who, knowingly or not, are the sources of the data. The reach of big data and its shareability also increase the potential for larger numbers of people to be harmed. A Department of Homeland Security report emphasizes that “the relative ease in engaging multitudes of distributed human subjects (or data about them) through intermediating systems speeds the potential for harms to arise and extends the range of stakeholders who may be impacted” (Dittrich and Kenneally, 2012, p. 3; see also Boyd and Crawford, 2012).
Although big data traces may appear as disembodied points of information, it is important to bear in mind that data are generated by people. The Council for Big Data, Ethics, and Society notes that “the scope, scale, and complexity of many forms of big data creates a rich ecosystem in which human participants and their communities are deeply embedded and susceptible to harm” (Zook et al., 2017, p. 1). In this appendix, we note some of the challenges that have emerged, and consider their particular implications in the national security context.
NEW ETHICAL CHALLENGES
One primary issue for researchers who use large datasets is that these datasets raise new ethical questions that have not yet been systematically addressed. While SBS researchers have long been accustomed to addressing research ethics via the Common Rule, computational social science research transcends traditional human subjects protections and raises a number of new ethical questions. To name but a few: Are data subjects the same as human subjects, or are they different? What reasonable expectations of privacy do people have for their digital traces, and how do those expectations change in different digital venues? Is informed consent possible, realistic, or required in big data research?
Researchers and ethicists stress that there are no straightforward answers to these questions (Buchanan, 2017; Zook et al., 2017). They also note that because digital research affords more distance between researchers and their human subjects, such research is susceptible to ethical distancing (Dittrich and Kenneally, 2012; Lyon, 2001). Furthermore, because digital research methods are advancing rapidly, and because machine-learning tools create decision rules that may not be transparent or intuitive, the applications of these methods and tools may lead to unanticipated ethical questions. The multidisciplinary nature of digital research further complicates the landscape of digital research ethics: while SBS researchers have experience dealing with research ethics challenges, their collaborators in computer science traditionally have had less training in ethical reasoning (Salganik, 2017).
A second key complication is the involvement of new stakeholders in the research process. Stakeholders such as individuals whose data are collected online may have different expectations than researchers have regarding what privacy protections, practices for informed consent, and other approaches to research ethics are reasonable; this further complicates the ethical landscape of digital research (Association of Internet Researchers Ethics Working Committee, 2012). For example, in a recent study demonstrating the power of social network analysis to identify and analyze online extremist networks, Benigni and colleagues (2017) note that social media
users may not be comfortable knowing that their online behavior was used to support diplomacy, military operations, or intelligence analysis. Yet digital researchers are not required to obtain consent for such research as long as they respect platforms’ terms of service and privacy policies and afford appropriate privacy protections, including deidentification.
In part because of the risks to individuals’ privacy, governing agencies in the United States and elsewhere have a voice as well. For example, the European Union recently enacted the General Data Protection Regulation to address privacy and other issues. Some legal scholars have argued, however, that despite the intentions behind this regulation, it hinders rather than facilitates the conduct of computational social science research (Forgó et al., 2017). At present, the most robust consensus is that digital research ethics is not “one size fits all.” Researchers and ethicists take a variety of positions on what expectations should govern digital research, in part because there is no clear consensus on what the threshold for risk of harm should be in this research area. While some analysts argue that ethical practices must account for the potential for indirect harms, others disagree (Bellaby, 2012). Many researchers and ethicists stress the importance of examining questions of privacy, autonomy, harm, and justice in the context of each dataset, research methodology, and intended use of research outcomes (Benigni et al., 2017). Below, we discuss in greater detail some of the most common ethical concerns and issues that arise in digital research and some commonly proposed ways to address them.
PRIVACY

Traditional norms of privacy relevant to research ethics are oriented toward protecting individuals by ensuring that neither personally identifying information nor sensitive personal information is exposed. Digital researchers typically deidentify the individuals represented in digital datasets, but even this approach does not guarantee that the subjects of digital research will remain fully anonymous, as a National Research Council (2008) report on the subject stresses. A number of studies have shown that the scale, granularity, and durability of digital data make reidentification possible: databases can be stitched together to identify specific people despite deidentification (Acquisti and Gross, 2009; Lewis et al., 2008). This breach of privacy may be particularly problematic when the identified individuals are members of vulnerable or stigmatized groups (see, e.g., Lee, 2014).
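The linkage attacks behind such reidentification can be surprisingly simple. The sketch below, using entirely fabricated records and hypothetical field names, shows how joining a “deidentified” research dataset to a public roster on a few shared quasi-identifiers (here a three-digit ZIP prefix, birth year, and sex) can reattach names to sensitive records:

```python
# Illustrative sketch of linkage-based reidentification. All records are
# fabricated; "zip3", "birth_year", and "sex" stand in for the kinds of
# quasi-identifiers that deidentified releases often retain.

# A "deidentified" research dataset: names removed, sensitive field kept.
research_rows = [
    {"zip3": "208", "birth_year": 1985, "sex": "F", "diagnosis": "asthma"},
    {"zip3": "208", "birth_year": 1985, "sex": "M", "diagnosis": "diabetes"},
    {"zip3": "604", "birth_year": 1990, "sex": "F", "diagnosis": "flu"},
]

# A public dataset (e.g., a voter roll) that pairs names with the same
# quasi-identifiers.
public_rows = [
    {"name": "A. Jones", "zip3": "208", "birth_year": 1985, "sex": "M"},
    {"name": "B. Smith", "zip3": "604", "birth_year": 1990, "sex": "F"},
]

def link(research, public):
    """Join the two datasets on their shared quasi-identifiers."""
    key = lambda r: (r["zip3"], r["birth_year"], r["sex"])
    index = {key(p): p["name"] for p in public}
    matches = []
    for r in research:
        name = index.get(key(r))
        if name is not None:
            matches.append((name, r["diagnosis"]))
    return matches

reidentified = link(research_rows, public_rows)
# Each unique quasi-identifier combination yields a confident match,
# attaching a name to a supposedly anonymous sensitive record.
print(reidentified)  # → [('A. Jones', 'diabetes'), ('B. Smith', 'flu')]
```

Real attacks scale this join to millions of records; Sweeney’s well-known demonstration that a small set of quasi-identifiers uniquely identifies a large share of the U.S. population rests on exactly this kind of linkage.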
Some researchers argue that Internet users should have no or little expectation of privacy for digital information shared on such public sites as Twitter, Facebook, and YouTube (Walsh and Miller, 2016). But others argue that it is important to think of privacy not in binary terms—where information is either public or private—but instead as a situational and
contextual category (Association of Internet Researchers Ethics Working Committee, 2012; Keller et al., 2016; Zook et al., 2017). According to this view, the public accessibility of data does not by itself make use of the data ethical. Rather, users’ expectations of privacy vary across cultural, national, and digital contexts. Ethical challenges may arise when researchers inadvertently violate user expectations and norms in the course of their research. Facebook’s study of emotional contagion, in which the emotional content of the newsfeeds of more than 600,000 users was altered without their knowledge to investigate the relationship between social media exposure and emotion, sparked controversy for this reason (Boyd, 2016; Hancock, 2017).
Information science expert Helen Nissenbaum suggests that ethical digital research requires a nuanced understanding of the right to privacy. She argues that privacy “is neither the right to secrecy nor a right to control but a right to appropriate flow of personal information.” Users’ understandings of appropriate information flow are governed by “context-relative informational norms” that vary according to the digital space involved and the type of data produced (Nissenbaum, 2010, p. 127). Understanding these context-relative informational norms will require further SBS research into the assumptions, expectations, and values that users bring to different digital spaces (Hancock, 2017).
USER AUTONOMY AND CONSENT
The Common Rule requires that researchers obtain informed consent from research subjects to protect subjects’ autonomy. But many Internet researchers believe that informed consent is unrealistic in the online domain for at least two reasons. First, it is not possible to obtain consent from the millions of web users whose digital traces are being studied, and it is a matter of some dispute whether the individuals whose data is studied deserve the same protections as human subjects (Barocas and Nissenbaum, 2014; Buchanan, 2017; Dittrich and Kenneally, 2012). Second, informed consent requires that researchers anticipate the potential risks posed by any given study. To meet the criteria of informed consent, users would have to know what is being collected, with whom it will be shared, under what constraints, and for what purposes. This is challenging not only because so much of the data used in digital research has already been collected, but also because research methods evolve. The many uses to which digital data may be put are often not clear at the time of collection. Researchers also may use material that people post about other individuals, further complicating the meaning of consent (Barocas and Nissenbaum, 2014).
Another provision of the Common Rule is that consent must not be coerced. But if the cost of opting out of having one’s online activities monitored, recorded, preserved, and analyzed is opting out of digital services that have become central parts of everyday financial, medical, work, and social transactions, users of digital systems may not be able to exercise noncoerced choice. Some digital ethics experts also question the freedom of choice implied in terms-of-service agreements for this reason; they argue that including consent to monitoring and research in terms of service fails to meet the requirements of informed consent (Brunton and Nissenbaum, 2013).
A number of researchers have taken advantage of the opportunity the web offers for experimenting on people without their knowledge (Salganik, 2017). But these practices have generated controversy. For example, defenders of the Facebook study mentioned above, in which the emotional content of users’ feeds was altered, insisted that the study was no different from the product testing commonplace in industry and that such research is essential to developing sound algorithms. Detractors insisted that the research violated user expectations of informational and emotional autonomy on Facebook (Boyd, 2016; Hancock, 2017).
SOCIAL MEDIA AND ONLINE GAMING DATA
Of particular interest to SBS researchers is social media and online gaming data, which raises not only privacy issues but others as well. While ostensibly in the public domain, data associated with gaming (much of which is released under a Creative Commons license) is nonetheless policed by the companies that produce the technologies used in its production (e.g., Twitter and Facebook). These companies self-police. This means that
- not all members of the research community have equal access to data collected through games;
- data collected through a game may be retroactively changed, and its availability may change as well;
- the rules of engagement leading to the generation of the data may change;
- how these technology companies actually generate data, how they store it, how they prioritize its delivery, and other policies are generally not made public;
- the data themselves may contain malware that can infect the scientists’ machines or destroy other data; and
- the conditions under which members of the research community can share the data may be overly restrictive, rapidly changing, or both.
These realities mean that the system for generating data from gaming sites can be exploited and manipulated, making the creation of false data easy. Much of the research performed with such data is done quickly, with no direct replication, and extensive, continuous effort is needed to revamp data collection tools to keep up with the changing technology. Adversaries can identify exploits and take advantage of them more easily than scientists can gain access to the data to learn counterstrategies.
Most social media and online data comes from server logs that were not designed for the express purpose of SBS research. Instead, this digital trace data (sometimes referred to dismissively as digital exhaust data)1 is collected to help programmers debug errors in the system. However, technology companies are discovering the value of this data for business, to support, for instance, game design or customer retention. The result has been a more concerted effort by technology companies to actively collect specific data. Therefore, in its quest for more open access, the research community has the opportunity, and indeed the obligation, to partner with companies in the development of platforms that can advance SBS research.
SECONDARY USES OF DATASETS
Research with large datasets may have the effect of identifying individuals and their affiliations with undesirable groups, or perhaps falsely identifying individuals as members of marginalized or dangerous communities. Researchers stress the importance of using multiple checks to guard against misidentification (Benigni et al., 2017). They also stress the importance of avoiding the imputation of guilt by association. Social network research often categorizes and makes judgments about individuals and groups based on their relationships. These associations, whether false or accurate, can have material effects on the lives and well-being of those individuals categorized, particularly when the categories carry social stigma or imply that categorized individuals pose security threats because of their social networks (Lyon, 2007).
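To make the risk concrete, consider a deliberately simplified, hypothetical flagging rule that labels an account suspicious whenever most of its neighbors are confirmed members of a network of interest. The graph, names, and threshold below are invented for illustration and do not reflect any real system:

```python
# A toy social graph: edges between accounts. "r1", "r2", and "r3" are
# confirmed members of an extremist network; "journalist" follows all three
# to report on them; "friend" is connected only to the journalist.
edges = [
    ("r1", "r2"), ("r2", "r3"), ("r1", "r3"),
    ("journalist", "r1"), ("journalist", "r2"), ("journalist", "r3"),
    ("friend", "journalist"),
]
confirmed = {"r1", "r2", "r3"}

def neighbors(node):
    """Collect every account that shares an edge with this one."""
    return {b for a, b in edges if a == node} | {a for a, b in edges if b == node}

def flag_by_association(node, threshold=0.5):
    """Naive rule: flag an account if most of its neighbors are confirmed."""
    nbrs = neighbors(node)
    return sum(n in confirmed for n in nbrs) / len(nbrs) >= threshold

# The journalist is flagged purely by association, despite having no
# membership in the network under study.
print(flag_by_association("journalist"))  # → True
print(flag_by_association("friend"))      # → False
```

The false positive here is an artifact of the rule itself, which is why researchers urge multiple independent checks before any categorization of individuals based on their ties.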
Researchers may also need to consider the multiple contexts in which their research tools can be used. Social network analysis designed to identify ISIS supporters on Twitter or Facebook, for example, can also be applied to identify members of peaceful political dissident groups, from civil rights advocates in the United States to advocates of democratization in China (Buchanan, 2017). Researchers may need to weigh the harms of carrying out such work against the potential harms of leaving research undone and
forgoing tools for enhancing human safety and security (Wesolowski et al., 2014).
Algorithms can “reproduce existing patterns of discrimination, inherit the prejudice of prior decision makers, or simply reflect the widespread biases that persist in society” (Barocas and Selbst, 2016, p. 674). If training data contain racial bias, for instance, an algorithm trained on those data will replicate that bias in operation. For example, if a computer system for categorizing medical school applicants had been designed on the basis of previous admissions decisions that discriminated against racial minorities and women, the algorithm would replicate that bias (Barocas and Selbst, 2016). Algorithms that respond to users can also replicate users’ prejudices (Sweeney, 2013). The negative ethical consequences of discriminatory algorithms are magnified in contexts where the algorithms are used to predict outcomes. For example, researchers have shown that an algorithmic system used to predict the likelihood of criminal recidivism was twice as likely to mislabel black offenders as recidivism risks as it was to mislabel white offenders (Angwin et al., 2016).
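This dynamic can be illustrated with a minimal sketch. The “training” step below simply estimates historical acceptance rates per group from fabricated past decisions and turns them into a decision rule; because the historical decisions were biased, so is the learned rule:

```python
from collections import defaultdict

# Fabricated historical decisions: applicants are equally qualified, but
# past reviewers rejected applicants from group "B" far more often.
history = [
    ("A", True), ("A", True), ("A", True), ("A", False),
    ("B", False), ("B", False), ("B", False), ("B", True),
]

def learn_rule(records):
    """'Train' by estimating the historical acceptance rate per group."""
    totals, accepts = defaultdict(int), defaultdict(int)
    for group, accepted in records:
        totals[group] += 1
        accepts[group] += accepted
    rates = {g: accepts[g] / totals[g] for g in totals}
    # The learned rule simply imitates past decisions: accept a group's
    # applicants only if that group was historically accepted most of the time.
    return lambda group: rates[group] >= 0.5

decide = learn_rule(history)
print(decide("A"))  # → True: group A applicants are admitted
print(decide("B"))  # → False: past prejudice is now an automated rule
```

No explicit discriminatory intent appears anywhere in the code; the disparity enters entirely through the training data, which is precisely the failure mode Barocas and Selbst describe.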
Such systems may unintentionally violate users’, researchers’, and government officials’ commitments to fairness and justice. They may also reduce the accuracy, value, and efficiency of results. To avoid these negative unintended consequences, some researchers have suggested that data scientists and ethicists work together to build systems, training data, and other tools that mitigate bias (Barocas and Boyd, 2017). Other researchers have outlined techniques for assessing the impact of algorithms. While many of their recommendations concern public disclosure of automated systems, an approach not suited to intelligence contexts, other recommendations, such as self-assessment of existing and proposed automated decision systems, are valuable (Reisman et al., 2018).
GREATER SALIENCE OF THESE CHALLENGES IN THE NATIONAL SECURITY CONTEXT
In the contemporary digital landscape, the means used to gather data often reflect significant power imbalances (Brunton and Nissenbaum, 2013). Internet users typically do not know what traces of their everyday lives are monitored; what happens to their information; or what, if anything, occurs because of it. The potential for security community surveillance afforded by digital data compounds the power imbalances already present in digital spaces. Failure to address ethical concerns can have a chilling effect on research. Research projects that raise serious ethical questions may lose their funding or be terminated by sponsoring agencies, as discussed in Chapter 10.
The public may lose its trust in the SBS research community, as well as in government agencies that sponsor such research, including the security community. And widespread concern that the Internet is a space where actions are monitored, stored, and analyzed rather than a site of free information exchange may cause people to censor their online behavior, constraining Internet use itself (Brunton and Nissenbaum, 2013; Mayer-Schönberger, 2009; Penney, 2016). Some digital scholars recommend that Internet users learn practices of “data obfuscation” to protect themselves from digital surveillance and analysis (Brunton and Nissenbaum, 2013).
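As a rough illustration of the obfuscation idea, a TrackMeNot-style obfuscator interleaves genuine queries with decoys so that an observer of the query stream cannot easily isolate the real ones. The decoy pool, function name, and parameters below are invented for illustration only:

```python
import random

# Sketch of query obfuscation in the spirit of tools like TrackMeNot:
# genuine queries are interleaved with decoys drawn from a generic pool,
# so an observer of the query stream sees a noisier behavioral profile.
DECOY_POOL = ["weather", "news", "recipes", "sports scores", "movie times"]

def obfuscate(real_queries, decoys_per_query=2, rng=None):
    """Mix each real query with random decoys, then shuffle the stream."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch deterministic
    stream = []
    for q in real_queries:
        stream.append(q)
        stream.extend(rng.choice(DECOY_POOL) for _ in range(decoys_per_query))
    rng.shuffle(stream)
    return stream

stream = obfuscate(["rare disease symptoms"])
# The real query is still present, but hidden among plausible decoys.
print(stream)
```

The sketch shows only the mechanics; real obfuscation tools must also make decoys statistically indistinguishable from genuine queries, which is the hard part Brunton and Nissenbaum discuss.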
It is especially important for researchers working on security issues to recognize that they are working in the context of public concern about the power imbalance between Internet companies and individual users and between security agencies and citizens. Concerns about overreach by the Intelligence Community (IC) have marred the history of American national security, as discussed in Chapter 9. Recent news stories about such U.S. government actions as the Total Information Awareness program (later renamed the Terrorism Information Awareness program), in which data mining was used to monitor potential security threats, have reignited old fears that the American national security community is unjustly monitoring domestic communications. Previously documented abuses, such as the domestic surveillance of civil rights groups revealed by the Church Committee in 1975, are the backdrop for such concerns (Electronic Frontier Foundation, 2004; National Academies of Sciences, Engineering, and Medicine, 2016; Walsh and Miller, 2016).
Researchers and the IC will continue to grapple with the need to balance privacy and autonomy on the one hand and security on the other (Walsh and Miller, 2016). These rights both overlap and conflict. Nissenbaum (2005, p. 64) explains that security has a number of definitions: “security as safety, freedom from the unwanted effects of another’s actions, the condition of being protected from danger, injury, attack (physical and non-physical), and other harms, and protection against threats of all kinds.” The language of security thus contains a tension between security as privacy, the protection of individuals from more powerful coercive actors, and security as a national and international moral good, the prevention of harmful acts of violence, which gives security research its moral force (Nissenbaum, 2005). Having examined this challenge in the context of counterterrorism research and policy, a National Academies committee argued strongly in 2008 that “even under the pressure of threats as serious as terrorism, the privacy rights and civil liberties that are the cherished core values of our nation must not be destroyed.” That committee’s report recommends that government agencies establish and apply “technical, operational, legal, policy, and oversight processes to minimize privacy intrusion and the damage it causes” (National Research Council, 2008, p. 4).
REFERENCES

Acquisti, A., and Gross, R. (2009). Predicting social security numbers from public data. Proceedings of the National Academy of Sciences of the United States of America, 106(27), 10975–10980.
Angwin, J., Larson, J., Mattu, S., and Kirchner, L. (2016). Machine bias. ProPublica, May 23. Available: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing [December 2018].
Association of Internet Researchers Ethics Working Committee. (2012). Ethical Decision-Making and Internet Research (Version 2.0). Available: https://aoir.org/reports/ethics2.pdf [December 2018].
Barocas, S., and Boyd, D. (2017). Engaging the ethics of data science in practice. Communications of the ACM, 60(11), 23–25.
Barocas, S., and Nissenbaum, H. (2014). Big data’s end run around anonymity and consent. In J. Lane, V. Stodden, S. Bender, and H. Nissenbaum (Eds.), Privacy, Big Data, and the Public Good (pp. 45–75). New York: Cambridge University Press.
Barocas, S., and Selbst, A.D. (2016). Big data’s disparate impact. California Law Review, 104, 671. doi:10.2139/ssrn.2477899.
Bellaby, R. (2012). What’s the harm? The ethics of intelligence collection. Intelligence and National Security, 27(1), 93–117.
Benigni, M.C., Joseph, K., and Carley, K.M. (2017). Online extremism and the communities that sustain it: Detecting the ISIS supporting community on Twitter. PLoS One, 12(12), e0181405. doi:10.1371/journal.pone.0181405.
Boyd, D. (2016). Untangling research and practice: What Facebook’s “emotional contagion” study teaches us. Research Ethics, 12(1), 4–13. doi:10.1177/1747016115583379.
Boyd, D., and Crawford, K. (2012). Critical questions for big data. Information, Communication, and Society, 15(5), 662–679. doi:10.1080/1369118X.2012.678878.
Brunton, F., and Nissenbaum, H. (2013). Political and ethical perspectives on data obfuscation. In M. Hildebrandt and K. de Vries (Eds.), Privacy, Due Process, and the Computational Turn (pp. 164–188). New York: Routledge.
Buchanan, E. (2017). Considering the ethics of big data research: A case of Twitter and ISIS/ISIL. PLoS One, 12(12), e0187155. doi:10.1371/journal.pone.0187155.
Dittrich, D., and Kenneally, E. (2012). The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research. Washington, DC: U.S. Department of Homeland Security. Available: http://www.caida.org/publications/papers/2012/menlo_report_ethical_principles [November 2018].
Electronic Frontier Foundation. (2004). Total/Terrorism Information Awareness (TIA): Is It Truly Dead? Available: http://libertyparkusafd.org/Hale/Special%20Reports/ADVISE/Total-Terrorism%20Information%20Awareness%20--%20%20Is%20It%20Truly%20Dead.htm [December 2018].
Forgó, N., Hänold, S., and Schütze, B. (2017). The principle of purpose limitation and big data. In M. Corrales, M. Fenwick, and N. Forgó (Eds.), New Technology, Big Data and the Law. Perspectives in Law, Business and Innovation (pp. 17–42). Singapore: Springer.
Hancock, J.T. (2017). Introduction to ethics of digital research. In Handbook on Networked Communication (p. 5). Oxford, UK: Oxford University Press.
Harman, J. (2015). Disrupting the Intelligence Community. Foreign Affairs, March/April. Available: https://www.foreignaffairs.com/articles/united-states/2015-03-01/disrupting-intelligence-community [December 2018].
Keller, S.A., Shipp, S., and Schroeder, A. (2016). Does big data change the privacy landscape? A review of the issues. Annual Review of Statistics and Its Application, 3, 161–180. doi:10.1146/annurev-statistics-041715-033453.
Lee, N. (2014). Trouble on the radar. Lancet, 384(9958), 1917.
Lewis, K., Kaufman, J., Gonzalez, M., Wimmer, A., and Christakis, N. (2008). Tastes, ties, and time: A new social network dataset using Facebook.com. Social Networks, 30(4), 330–342. doi:10.1016/j.socnet.2008.07.002.
Lyon, D. (2001). Facing the future: Seeking ethics for everyday surveillance. Ethics and Information Technology, 3(3), 171–181.
Lyon, D. (2007). Surveillance Studies: An Overview. Cambridge, UK: Polity Press.
Mayer-Schönberger, V. (2009). Delete: The Virtue of Forgetting in the Digital Age. Princeton, NJ: Princeton University Press.
National Academies of Sciences, Engineering, and Medicine. (2016). Privacy Research and Best Practices: Summary of a Workshop for the Intelligence Community. Washington, DC: The National Academies Press. doi:10.17226/21879.
National Research Council. (2008). Protecting Individual Privacy in the Struggle Against Terrorists: A Framework for Program Assessment. Committee on Technical and Privacy Dimensions of Information for Terrorism Prevention and Other National Goals, Committee on Law and Justice and Committee on National Statistics, Computer Science and Telecommunications Board, Division of Behavioral and Social Sciences and Education, Division on Engineering and Physical Sciences. Washington, DC: The National Academies Press. doi:10.17226/12452.
Nissenbaum, H. (2005). Where computer security meets national security. Ethics and Information Technology, 7(2), 61–73. doi:10.1007/s10676-005-4582-3.
Nissenbaum, H. (2010). Privacy in Context: Technology, Policy, and the Integrity of Social Life. Palo Alto, CA: Stanford University Press. Available: https://crypto.stanford.edu/portia/papers/privacy_in_context.pdf [December 2018].
Penney, J.W. (2016). Chilling effects: Online surveillance and Wikipedia use. Berkeley Technology Law Journal, 31(1), 117. doi:10.15779/Z38SS13.
Reisman, D., Schultz, J., Crawford, K., and Whittaker, M. (2018). Algorithmic Impact Assessments: A Practical Framework for Public Agency Accountability. Available: https://ainowinstitute.org/aiareport2018.pdf [December 2018].
Salganik, M.J. (2017). Bit by Bit: Social Research in the Digital Age. Princeton, NJ: Princeton University Press. Available: http://www.bitbybitbook.com/en/preface [December 2018].
Sweeney, L. (2013). Discrimination in online ad delivery. Queue, 11(3), 10. doi:10.1145/2460276.2460278.
Walsh, P., and Miller, S. (2016). Rethinking “five eyes” security: Intelligence collection policies and practice post Snowden. Intelligence and National Security, 31(3), 345–368.
Wesolowski, A., Buckee, C.O., Bengtsson, L., Wetter, E., Lu, X., and Tatem, A.J. (2014). Commentary: Containing the Ebola outbreak: The potential and challenge of mobile network data. PLoS Currents, September 29. Available: http://currents.plos.org/outbreaks/index.html%3Fp=42561.html [December 2018].
Zook, M., Barocas, S., Boyd, D., Crawford, K., Keller, E., Gangadharan, S.P., Goodman, A., Hollander, R., Koenig, B.A., Metcalf, J., Narayanan, A., Nelson, A., and Pasquale, F. (2017). Ten simple rules for responsible big data research. PLoS Computational Biology, 13(3), e1005399. doi:10.1371/journal.pcbi.1005399.