In an increasingly interconnected world, perhaps it should come as no surprise that international collaboration in science and technology research is growing at a remarkable rate. Indeed, the share of multiple-author scientific papers with collaborators from more than one country more than doubled between 1990 and 2015, from 10 to 25 percent (Wagner et al., 2017). As science and technology capabilities grow around the world, U.S.-based organizations are finding that international collaborations and partnerships provide unique opportunities to enhance research and training.
International research agreements can serve many purposes, but data are always involved in these collaborations. The kinds of data in play within international research agreements vary widely, ranging from financial and consumer data, to Earth and space data, to population behavior and health data, to specific project-generated data; this narrow set of examples nonetheless illustrates the breadth of possibilities. The uses of these data are equally varied and require accounting for the effects of data access, use, and sharing on many different parties. Cultural, legal, policy, and technical concerns are also important determinants of what can be done to maintain privacy, confidentiality, and security, and ethics is a lens through which the issues of data, data sharing, and research agreements can be viewed as well.
To examine international research collaborations in a systematic way, the Government-University-Industry Research Roundtable (GUIRR) launched a Working Group on International Research Collaborations (I-Group) in 2008. Convened across the National Academy of Sciences, the National Academy of Engineering, and the National Academy of Medicine, GUIRR serves as a forum for dialogue among the top leaders of governmental and nongovernmental research organizations. GUIRR seeks to advance relations between the sectors and facilitate research collaborations. The goal of the I-Group is to work with stakeholders
1 The planning committee’s role was limited to planning the workshop, and the workshop proceedings has been prepared by the workshop rapporteurs as a factual summary of what occurred at the workshop. Statements, recommendations, and opinions expressed are those of individual presenters and participants, and are not necessarily endorsed or verified by the National Academies of Sciences, Engineering, and Medicine, and they should not be construed as reflecting any group consensus.
to develop a more structured approach to cross-boundary collaborations and help companies and universities deal with various cultural, administrative, and legal complexities in undertaking them. According to its Statement of Purpose, the I-Group “engages in dialogue and discussion to facilitate international collaborations among academic, government, and industrial partners by: (1) identifying policies and operations that enhance our ability to collaborate; (2) identifying barriers to collaboration—policies and operations that could be improved; (3) developing a Web-based resource or other compendium of successful strategies and methodologies; and (4) suggesting how barriers might be addressed.”
Workshops that bring together subject matter experts from universities, government, industry, and professional organizations in the United States and other nations are an important means for the I-Group to carry out its work. In planning these workshops, the I-Group strives to be as inclusive as possible, bringing the appropriate parties to the table for meaningful discussion and solutions. The first workshop organized by the I-Group, Examining Core Elements of International Research Collaboration, was held on July 26-27, 2010, in Washington, DC, with the goal of enhancing international understanding and diminishing barriers to research collaboration (NRC, 2011). To expand on some of the themes explored in this first workshop, the I-Group held two additional workshops. The second, Culture Matters: An Approach to International Research Agreements, was held on July 29-31, 2013, in Washington, DC, and addressed how culture and cultural perception influence the process by which research agreements are made and negotiated across international boundaries (NRC, 2014). This was followed by the subject of this proceedings, Data Matters: Ethics, Data, and International Research Collaboration in a Changing World, which was held on March 14-16, 2018, in Washington, DC, and explored the changing opportunities and risks of data management and use across disciplinary domains. In this third workshop in the series, representatives from around the world and from GUIRR’s three constituent sectors—government, university, and industry—gathered to examine advisory principles for developing international research agreements, with the aim of highlighting promising practices for sustaining and enabling international research collaborations at the highest ethical level possible.
The intent of the workshop was to explore, through an ethical lens, the changing opportunities and risks associated with data management and use across disciplinary domains—all within the context of international research agreements. The broad questions considered during the workshop included the following:
- What are data and information?
- Who holds the data and where do they live?
- Who assesses and maintains their quality, stability, and metadata?
- How are the data accessed?
- Why and for what purposes can access be granted?
- Who sets criteria and oversees the process?
- What can go wrong?
- How do business ethics differ across global boundaries?
- How are confidentiality and privacy handled in different global settings?
Following the workshop, the rapporteurs prepared this proceedings, which reports the main themes that emerged from the workshop presentations and discussions. The organization of the proceedings follows that of the workshop by focusing on the “core elements” of international research collaborations identified in the planning committee charge. The goal for the workshop and the proceedings is to serve as an informational resource for participants and others interested in international research collaborations. It will also aid the I-Group in setting its future goals and priorities. Financial support for the activity was provided by the Office of Naval Research, the National Institutes of Health, the Office of the Director of National Intelligence, the U.S. Department of Agriculture, and the industry and university partners of GUIRR. Partner companies Elsevier and Siemens Corporation provided additional sponsorship. The workshop agenda is included in Appendix A, and the statement of task is in Appendix B.
In her opening remarks to the workshop, Barbara Mittleman, chief strategy officer of Waymark Systems, commented that this was the most difficult meeting she had ever been involved in organizing because the intersection of the topics of ethics, data, and international research collaborations is “very underpopulated,” in large part because this workshop was ahead of the curve. As a result, she and her planning committee colleagues worked to provide enough substance on each of the three distinct subjects, as well as enough overlap among them, to generate discourse on the intersectional subject of the workshop. “I think we are going to find that the results of this workshop are going to be very interesting to many people who are not as far along in thinking about it as we now are, collectively,” predicted Mittleman.
The I-Group’s original intent was for this third workshop to be about ethics, but it soon became apparent that this was too big a topic to cover in the space of two and a half days. After thinking about how to delimit the subject and realizing that data science is progressing rapidly, both technically and in terms of how data are being generated and used, the I-Group’s members decided that data would be the lens through which they would look at the topic of ethics. Mittleman noted that “data is the currency of science, and the huge technical advances that have been evolving about generating data, analyzing data, sharing data, new kinds of analytics and connections in data, are allowing us to do all kinds of things we could not do before.” At the same time, she said, there are risks that come along with these advances, which makes it important for this
2 In this section and other sections summarizing presentations, views and opinions are attributed to the presenter unless stated otherwise.
workshop to examine the ethical aspects of the benefits that come from using data and also risks associated with data collection and use. While there are existing frameworks around the world for managing research ethics, few of these structures have been specifically built to accommodate ethical concerns for data management and use.
Of the many people who are affected by research and data collection, some are well aware through an informed consent process that data are being collected and will be used for a variety of purposes. Others, however, do not know that data are being collected as they go about their normal daily activities, such as driving a car, shopping in the grocery store, or exploring on the Internet. “They do not know who gets the data, who has access to it, what they are going to do with it, and what the implications of that are,” said Mittleman. “It is not always going to be relevant in every research collaboration, but I think as long as we are undertaking the task to look at this in terms of the big picture, this is also part of what we need to look at, and another piece that you will hear something about during the meeting is the open science conversation.” Over the course of the workshop, she explained, the discussions would focus on the social side of data to illuminate some of the concerns of people who are collecting, using, and/or being affected by data.
To set the stage for the workshop’s discussions, Jacob Metcalf, researcher at the Data & Society Research Institute and founding partner of Ethical Resolve, spoke about how the processes and infrastructures that enable rigorous thinking about data ethics differ from the norms and infrastructures developed over the past 70 years to deal with research ethics. The latter, he said, were built to handle very different types of harms and address various kinds of human subjects. “The harms of data science are often rendered invisible with the existing tools, policies, and conceptual practices that we are familiar with,” said Metcalf. “When we see these epistemologies that increase scale and speed, always assume an ability to repurpose the data, use analytics in a new kind of way, presume indefinite storage of data, and are continuously updating data sets, we should expect that we are going to struggle significantly with how to conceptualize what it means to do ethical research in those conditions.” In fact, said Metcalf, some of the most interesting work on data ethics is taking place in what he called unexpected and informal venues.
Ethics, he said, can refer to many different things, both formally, within the discipline of philosophy, and colloquially. Ethics can mean the rigorous study of which actions are morally justifiable or unjustifiable according to some sort of philosophical perspective, the act of passing a judgment of a particular action or choice, or the social activities and structures that are necessary to render that judgment as a collection of people. He noted that when he talks to researchers, particularly in technology companies, he finds that they are abstractly considering the
first definition and worried about the second one—that people will judge their actions and generate negative public attention.
What they are not working on, and where they need to be spending most of their energy, is building the capacity to think together. “What we are looking to do is find a loop between one and three, with number two as a byproduct of if we do it right,” said Metcalf. “We want to be figuring out what it means to think collectively and rigorously and come up with actionable decisions about how to handle this new kind of knowledge we are producing.” The goal is to develop norms, which he explained requires common understandings that allow people to think together; in turn, thinking together on the basis of those common understandings helps to develop better norms. Norms should be expected to develop in response to technological and social change, which creates the need for infrastructures that support the adoption, adaptation, interrogation, and enforcement of norms. “What folks like us and in meetings such as this should be working on is to facilitate that loop,” said Metcalf. He noted that to “do ethics” is to sit down and think together, something that the science and technology community is good at doing, though it is not well practiced at doing so when it comes to ethics.
The reason why norms and infrastructures are the focus of so much thought when it comes to ethics, said Metcalf, is that the tools and infrastructures that exist have been built around a failure of norms, such as the experimentation on subjects in Nazi death camps, the Milgram obedience experiments, the Stanford prison experiment, and the Tuskegee syphilis experiment. In each of those cases, researchers abdicated their responsibility to provide autonomy to the research subjects and, as a result, harmful things were done without the requisite consent and explanation. A less harmful but no less egregious example was the experiment run by researchers at MIT and funded by Quaker Oats in which cognitively impaired students were fed radioactive calcium, with a cumulative radiation dose of approximately 20 chest x-rays, to study how bones take up this key mineral. The conditions of this research were so blatantly wrong that they contributed to the 1979 Belmont Report and the subsequent development of the Common Rule and resulting establishment of institutional review boards (IRBs) to approve all federally funded human subjects research.
In the context of data science, the Common Rule defines research as the creation of new data in pursuit of generalized knowledge, and it defines human subjects as individuals in whose lives or bodies a researcher intervenes in the collection of the data, explained Metcalf. He noted that the Common Rule is not the be-all and end-all of research ethics, and it does have shortcomings, but it has become the touchstone for ethics training within the nation’s universities. Regardless of research field, the Belmont Report and Common Rule have become the lingua franca and conceptual framework of practical applied scientific ethics in the United States and, by extension, in other parts of the world. Metcalf noted that the first major revision of the Common Rule was issued in 2017 to address big data in biomedicine.
The Common Rule, he said, tries to deal with the conflict of interest inherent in being a researcher and a physician at the same time. Researchers have the duty of contributing to general human knowledge, while physicians have the duty of providing care to individual patients, which creates a conflicting social role. The hands-on, practical elements of the research ethics established by the Common Rule are about resolving the core risk to human subjects that their autonomy will be compromised and that they will not receive adequate care if the researcher role dominates. The application of the Common Rule to other research fields means that the previously established social roles, obligations, and epistemologies respected in those fields must fit under a set of rules that were designed for a very particular historical task of addressing the conflict of being a researcher-physician, explained Metcalf.
Historically, the academic departments that constitute what is now called data science—mathematics, statistics, and computer science—have not had contact with IRBs, which means they often do not have contact with any formal ethics training. Another complicating issue is that the Common Rule defines research as the creation of new data in pursuit of generalizable knowledge, but data science deals mostly with preexisting, public, or repurposed data sets to discover new phenomena or generate new mathematical or computer science insights. In addition, the Common Rule deals with individuals, and IRBs are excluded by statute from being concerned with potential downstream impacts of research on society or communities.
Another problem with the Common Rule as it concerns data science is that data science makes anyone a researcher and everyone a research subject. Data science is also about making humanity predictable and “leverageable,” and when research switches from aiming to be generalizable to being predictable, it shifts all the harms and benefits downstream and it makes consent nearly impossible. “How do you consent 10 million people?” asked Metcalf. “We do not know how to make sense of that sort of practical handshake that is supposed to happen between researcher and research subject.”
As an example of how this plays out in practical terms, Metcalf cited a study published in 2018 that used a deep neural network, trained on data from a dating website where people self-identify as heterosexual or homosexual, to predict sexual orientation from facial structure more accurately than humans can (Wang and Kosinski, 2018). The researchers claimed they conducted this study for a specific ethical reason: to show that tools for intensive discrimination can be built with off-the-shelf machine learning components and widely available social data, and that the public is largely unaware of this. While the researchers reported on Twitter, in response to the public outcry that followed the release of this study, that the study had been approved by their institution’s IRB, Metcalf noted that this study would have been outside the purview of an IRB, which is meant to ensure adequate protection of the rights of human research subjects, with informed consent, in terms of the research method. “All of the risk here comes downstream for society,” he said. “It has very, very, very little risk for the people whose faces were scraped off of the website, because those
faces are never going to be made public.” Those downstream risks, he explained, would include a country where homosexuality is illegal using such a tool to identify homosexuals and put them in prison or worse.
A second example involved outing the previously and intentionally anonymous graffiti artist Banksy using public records and an algorithm originally developed to help prevent terrorist attacks. If traditional research is about generalizability, this work took a research practice and used it to produce something specific and predictive about an individual. The question this study raises, said Metcalf, is how society is going to respond to this inversion of the conventional research method, in which generalizable knowledge and public data sets are used to examine individuals.
In 2014, Facebook wanted to determine if it was contributing to emotional contagion, a well-validated psychological phenomenon. The company conducted a study, in collaboration with a Cornell researcher who is an expert in empirical social psychology, that looked at the posts of almost 700,000 Facebook users and counted how many happy and sad words they used on their pages. Facebook then changed the newsfeed algorithm for those users so that half of them saw slightly happier words in their newsfeeds and half saw slightly sadder words. The study showed that people who saw happier stories used more happy words, while those who saw sadder stories used more sad words (Kramer et al., 2014). A simple but interesting result, said Metcalf, but the findings were explosive because the public suddenly realized that Facebook had an algorithm that was not just showing what friends were posting but choosing what people saw in their newsfeeds. Furthermore, it made Facebook’s use of page data for research purposes apparent to users. The Cornell researcher’s IRB ruled that this was permissible research because he was using data that had already been collected by the time he became involved with the study.
Looking at these examples reveals some common threads, said Metcalf. First, research occurs in many diverse venues and that multiplies the area where the norms of research ethics can and should apply. Second, it is important to recognize that data sharing is always also model sharing, so a data sharing agreement is about moving models and interpretations between collaborators as much as it is about moving spreadsheets back and forth. “If you are building a deep neural network off of a data set, you are also sharing what a machine will learn from that data set, so the assumptions about who the research subjects are and what we can know about them travels with that human subject data in complex and subtle ways,” he said.
A third thread is that research is now happening at an unprecedented scale, and the tools do not exist for tracking harms when research happens at that scale. “Is a tiny harm to 700,000 people worse than a significant harm to 700 people? I do not know,” said Metcalf. “We have not been able to have that conversation yet and figure out what the actual metric is that we should use to measure harm in those situations.” A fourth thread is the need to grapple with the fact that data science enables researchers to target individuals in a manner that was previously inconceivable to academic researchers.
The takeaway message is that collaboration between academics and industry is both necessary and tricky as work traverses different venues and different sets of norms and expectations. “We should recognize up front that our conversations are not always going to be smooth and embrace the bumpiness,” said Metcalf. The good news, though, is that all of this tumult, as he put it, means there are unexpected venues for conversations about research ethics, including corporate efforts at creating IRB-like structures (Jackman and Kanerva, 2016; Tene and Polonetsky, 2016). The examples Metcalf cited also raised some unanswered questions, such as whether to look at formal or informal models of ethics review and where ethics review should be located within industry. He noted that Facebook built its ethics review process in parallel with its privacy review, but the people who run the ethics review reside in the policy office.
The European Union’s new General Data Protection Regulation (GDPR) has prompted many new conversations because it requires companies to have an accountable paper trail regarding certain ethics decisions. “You do not have to make them public, but they have to be there, and you have to show you did your work to measure harms and benefits,” Metcalf explained. Companies are also developing transparency metrics, and Twitter recently called for researchers to develop metrics for the health of public conversation on the platform and is funding researchers who want to explore what it means to have a healthy conversation on Twitter and how the company can track and measure that.
In closing, Metcalf pointed out that there are significant differences across countries in the way they treat data ethics. The United States, for example, focuses its leverage at the point of data collection; at the point data are collected they are either deemed private or public. There are gray areas, however. A doctor’s records are private, but a credit card receipt from the drug store, which could reveal medical information, is not considered protected medical data. In Europe, GDPR focuses on accountability, explanation, and justification, and Japan has a model that he said is somewhere in between the two that focuses on de-identification in centralized databases. As a final comment, he reiterated his earlier message: “We need to be looking at the loop between norms and infrastructures,” he said.
To illustrate how some of the challenges Metcalf described play out in actual international collaborations, Ghassem Asrar, director of the Joint Global Change Research Institute of the Pacific Northwest National Laboratory, presented three examples from the world of environmental monitoring, modeling, and prediction. He began by noting that the need for science-based information for defining policies and practices continues to grow, especially as it pertains to complex systems, such as Earth’s environment, climate, and weather that are difficult to unravel by looking at each component independently. By necessity, he said, those kinds of problems require multiple disciplines to come together to define the problem, devise experiments or methodologies to answer those questions, and then use the common goals and objectives to convince the nations that have
a stake in these efforts to provide the resources and support needed to accomplish those stated objectives. Over the course of five decades, the international environmental sciences community has retooled and reorganized itself to parse the enormous complexity of the planetary ecosystem into its logical, disciplinary, and often local components and then work collaboratively, across international boundaries, to integrate the resulting pieces of knowledge and produce holistic answers to the questions it is pursuing.
While research can glean important insights through retrospective analysis of existing data, some of the most important questions concern future trajectories and the possible consequences of current actions, said Asrar. To address those questions, researchers depend on models to provide some notion of what might happen in the future. The results of those projections are themselves treated as data, which raises the issue of how to treat information derived from models that are imperfect, abstract representations of reality. Modeling, though, requires data, and data come from observations. In the environmental sciences, observational technology has changed dramatically over the past five decades, introducing an additional element of complexity that must be considered when developing experiments and the policies and practices for using the data generated by these technologies, said Asrar. Adding to the technological complexity is the fact that different nations use different technologies and different approaches to collect data, creating the challenge of establishing best practices for collecting, curating, processing, and ultimately managing and sharing data so that all nations can use these data in a consistent manner.
Asrar listed several scientific and technical challenges that were considered when defining the data sharing policies and practices that the nations participating in global environmental studies had to agree on so that the knowledge derived from those data would be useful for its intended purpose. These challenges included the multiple scales of time and space over which the data are generated and multiple sources of those data; the complex nature of the system and the feedbacks among its components; the uncertainty in the measurements and analyses; the need for data validation, quality assurance, curation, stewardship, dissemination, and sharing; national differences in computation, visualization, and analytical capabilities; and a lack of data, particularly regarding socioeconomics.
One way the environmental sciences community was able to manage these technical complexities and create an integrated system was through the information management system it created. In addition, the community worked to implement best practices and policies for every component of the system, both upstream for collecting the data and downstream for the use of the data, and to resolve issues involving intellectual property and data ownership, said Asrar. In the end, after 20-plus years of discussion, debate, and negotiation, an international body of close to 115 organizations emerged—the Group on Earth Observations (GEO)—that agreed to adopt one set of practices and policies that would enable every nation, including those with advanced and less sophisticated technological capabilities, to use the end results.
The result of that process is the Global Earth Observation System of Systems (GEOSS) platform and a set of data sharing principles that includes full and open data exchange as the default condition, the generation of data and products with minimum time delay and at minimum costs, and reproduction of data at no cost. In its 12 years of operation, GEOSS has generated over 400 million observations from over 5,000 data providers covering all seven continents, produced over 170 brokered catalogs, and informed 60 projects. In 2016, users made nearly 4.5 million inquiries of data from the GEOSS platform. In 2015, GEO issued an assessment of the benefits that resulted when the U.S. government decided to enable free open access to data from its Landsat satellites (International Council for Science Committee on Data for Science and Technology, 2015). “This is the type of analysis this group has been using for convincing the governments and entities to share their data and information for the good of all involved and the rest of the world,” said Asrar.
The World Meteorological Organization (WMO), the United Nations’ specialized agency on weather, climate, and water, has existed in some form since 1873, and it coordinates the work of 200,000 national meteorological and hydrological experts and a global observing network of more than 10,000 stations and operational weather satellites. Data come from surface observations, balloon soundings, ocean weather stations, and satellites and include air quality and greenhouse gas measurements. WMO both coordinates among the nations who contribute data and knowledge and shares the weather forecasts that are the product of the network’s observations and models.
As a result of this consortium’s work, weather forecasting over 3, 5, 7, and more recently 10 days has improved significantly since the 1970s. Asrar said there is clear, demonstrated evidence that two factors—observations contributed by all 191 member nations and territories and improved forecasting models—were responsible for most of this improvement. Not being satisfied with the progress made to date, this coalition of willing nations has set its sights on improving the predictability of weather on the timescale of months, seasons, and year to year, and climate on the timescale of years to decades. “This is the type of effort that no one nation can take on,” said Asrar. “It lends itself to international collaboration and cooperation because it requires the expertise and the resources of more than one nation and the common goal that they adopt and pursue together.”
What enables all nations to benefit from WMO’s efforts is its policy on data acquisition, access, and sharing, which provides for the free and unrestricted exchange of hydrological, meteorological, and related data and products (Box 1-1). One element missing from this is the commercial aspect, said Asrar, and this has been a thorny issue on which the countries and participating organizations could not come to agreement. As a result, none of the data can be sold. The United States, however, has a different policy that allows private-sector companies to add value to public-sector data and develop tailored products for which they can charge a fee. They are not, however, allowed to sell the source data.
The third example Asrar discussed was the Earth System Grid Federation (ESGF), an international collaboration of centers working together to manage and provide access to climate system data, models, and observations. Started a decade ago, ESGF is now the world’s premier data-focused technology infrastructure, built on a consortium of modeling groups in the United States, Europe, Asia, and Australia, to support Earth system science. From its inception, ESGF’s archive has been available to all users except for commercial applications. To its credit, said Asrar, the consortium has been bringing in experts from the beginning to help work on issues of data security and privacy and on how to manage the data stream while maintaining the quality and integrity of the data that may ultimately be used to make critical infrastructure and national security decisions.
One project ESGF has enabled has been to model the geography of food, water, and energy globally to identify potential hotspots that will not be able to produce enough food to meet local demand. This modeling exercise calculated that humans in North America use approximately 30 percent of the terrestrial ecosystem’s supply of net primary production, while South America uses 8 percent,
western Europe uses 86 percent, south central Asia uses 96 percent, and Southeast Asia uses 300 percent (Imhoff and Bounoua, 2006; Imhoff et al., 2004). This modeling activity identified areas that are going to be under tremendous pressure with changes in climate that will affect rainfall. “Having that knowledge gives us some indications of how we need to retool agricultural practices to manage the risk associated with these challenges that we and the rest of the world face,” said Asrar.
He noted that none of the examples he discussed could have been successful without the leadership and scientific contributions of the research community at large. “Clearly, science can and should play a major role in this process, not only in creating a system but in making sure that the integrity of what results from these systems is maintained and that the results are used in the way they were intended to be used.” As examples, he said that the role of research in data development can include providing advice on the best data sets to use for various purposes, as well as on the merits and limitations of those data sets. Research can also identify high-priority research needs, promote sound data stewardship, help make data sets accessible and usable, and promote data quality and uncertainty characterization.
In conclusion, Asrar said today’s complex socioeconomic and environmental opportunities and challenges transcend individual disciplines and nations, and efforts to address them require science-based information generated by multidisciplinary collaboration and expertise from around the world. The benefits of collaborating on, sharing, and using knowledge outweigh any imagined risk, as the examples he provided show. In each example, he said, collaborators developed and promoted the use of data standards, formats, documentation, quality assurance, and calibration, evaluated against national and international standards in data curation, stewardship, dissemination, and sharing. These partnerships and collaborations were able to share expertise, resources, and experience to develop capacity and infrastructure where needed and to sustain, improve, and expand that capacity and infrastructure over multiple decades. He noted that one lesson learned from these three examples was the importance of bringing into these consortia those nations with less developed technological and scientific resources and expertise and helping them build capacity to ensure that they, too, benefited from the data and knowledge produced through these programs.
One challenge going forward, Asrar said in closing, will be to make sure efforts such as these can sustain what they have built and evolve in the face of emerging technology and the complexities involved in sharing data and determining who gets to use it for what specific purposes. He noted that the slowest progress has been made in nations and regions that had the greatest needs for data and information and that progress has been slower than expected in adopting and implementing data stewardship and sharing principles uniformly across the globe.
During the discussion that followed Metcalf’s and Asrar’s presentations, Metcalf was asked how to deal with infrastructure and technology that is changing so rapidly, something that Asrar mentioned. Metcalf said that while the public face of technology changes rapidly, the underlying dynamics are steadier than the outward face suggests. In his opinion, opportunities exist to gain traction even as technology advances and to build a virtuous cycle of technology development and
collaborative ethical evolution. Asrar added that networks established as part of international consortia can provide a good venue for discussing the ethical issues around data use from evolving technologies.
Asrar was asked if he had examples of two countries responding to a common challenge and acting together, and he replied that development organizations are funding and enticing countries to come together to tackle challenges related to food and water. His organization is supporting an effort in South America that has developed an agreement involving Chile, Uruguay, Argentina, Colombia, and others in the region to tackle problems of energy, food, and water using data and modeling contributed by countries with the technological capacity to assist these efforts. Similar programs are emerging in Asia, too, he said. One thing that is different about these efforts is that they are true collaborations that involve these nations in the analysis rather than simply handing them results.
In terms of counterexamples, Asrar noted that there are contentious issues involving dam building in Iran, Afghanistan, and Pakistan and its effect on downstream farms, but perhaps if these countries had access to the type of analyses they are now being provided, they would be better prepared to deal with this type of situation. He also explained how topographic data, which are mostly used for routing water, could also be used for issues related to a nation’s security. After major debate among the networks, the decision was made that such data should be shared only with the consent and permission of the affected country.
Metcalf responded to a question about ethical considerations related to publishing in areas outside of the purview of IRBs by noting that many conferences have started including an ethics review before accepting a paper for presentation. The problem with this crowdsourcing approach to ethics is that the research has already been done, so, arguably, the harms have already occurred and there is no discussion around harm mitigation. Another issue is that conference ethics committees do not publish a set of rules that say not to break the terms of service. Metcalf noted that half of the research community believes it is okay to break the terms of service because those rules were written to protect the company and not necessarily the users. One recent study of Twitter users, for example, found that the vast majority did not realize that research is being done using their tweets, even though that information is clearly stated in the terms of service agreement (Fiesler and Proferes, 2018).
A workshop participant observed, based on the two presentations, that data do matter and that the ethics of data matters even more today because of the seriousness of the public availability of data. This participant predicted that the lack of public engagement and accountability, and individuals’ lack of any power to influence and shape the data collected about them, do not bode well for future collaborations.
In the final framing presentation, Simson Garfinkel,3 senior computer scientist for confidentiality and data access at the U.S. Census Bureau, used the decennial U.S. census to illustrate the challenges of making data accessible and protecting individual privacy. The U.S. Constitution, he explained, requires the Census Bureau to perform an enumeration of the U.S. population every 10 years as directed by law and share the results with the president, the states, and the American people. In creating this record of the U.S. population, the Census Bureau has two overriding—and conflicting—requirements. The first requirement is that it collect and publish an accurate accounting of the population for use in reapportioning the House of Representatives, for drawing legislative boundaries in the states, and for the U.S. Department of Justice to use to enforce the Voting Rights Act.
The second requirement, said Garfinkel, is that the Census Bureau is prohibited from making any publication or release of data that allows the contribution of any individual or establishment to be explicitly identified. After the 2010 census, the Bureau published the results at the block level, which means that for every block, it reported the number of people living there, the distribution of their ages, and the distribution of their races. That might seem like a reasonable approach, except in the case of a block containing a single person. The solution the Census Bureau deployed at the time, he explained, was a technique tested by the Census Bureau’s statistical research division called swapping, which exchanges data from households in one location with data from households in a different geographical location that have identical characteristics on a certain set of variables. Which households were swapped is not public information, nor is the list of characteristics used to identify which households to swap, and the selection process was targeted at the records most at risk of disclosure. The problem, though, is that it is impossible to prove mathematically that swapping protects privacy. “To the contrary, I can construct counterexamples in which swapping probably cannot protect privacy for some members of the population,” said Garfinkel.
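The mechanics of swapping can be illustrated with a minimal sketch. This is an illustration only: the Bureau’s actual matching variables, targeting criteria, and swap rates are not public, and the `block` field and record layout below are hypothetical.

```python
import random

def swap_households(records, match_keys, swap_fraction=0.05, seed=0):
    """Illustrative record swapping: exchange the geographic identifier
    ("block", a hypothetical field name) between pairs of households
    that agree on the matching variables. The real procedure targets
    records most at risk of disclosure; here pairs are chosen at random.
    """
    rng = random.Random(seed)
    # Group household indices by their values on the matching variables.
    groups = {}
    for i, rec in enumerate(records):
        key = tuple(rec[k] for k in match_keys)
        groups.setdefault(key, []).append(i)
    eligible = [g for g in groups.values() if len(g) > 1]
    swapped = [dict(rec) for rec in records]
    n_swaps = max(1, int(swap_fraction * len(records)))
    for _ in range(n_swaps):
        if not eligible:
            break
        i, j = rng.sample(rng.choice(eligible), 2)
        # Exchange only the geography; every other characteristic stays
        # with its original record, so tabulations of the matching
        # variables within each geography are perturbed while overall
        # totals are preserved.
        swapped[i]["block"], swapped[j]["block"] = (
            swapped[j]["block"], swapped[i]["block"])
    return swapped
```

Because only the geographic identifier moves, the multiset of published blocks is unchanged, which mirrors why swapping leaves marginal totals intact while obscuring who lives where.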
As a result, the Census Bureau will be using a different approach for the 2020 census. Instead of swapping, it will be adding statistical noise to any table that reports on the distribution of age, race, or other characteristics. Garfinkel explained that this noise will be small for areas with a large number of people, but it will be significant for areas with just a few people. “It is important to realize that the Census Bureau is not the only official statistics agency that makes changes in the statistics that it publishes for the purpose of protecting privacy,” he said. “The practice is nearly universal because statistical agencies understand that individuals, businesses, and even society as a whole would be less supportive of statistics programs if the data from these programs could be used to single out specific individuals or establishments.” The Census Bureau uses the term “disclosure avoidance” to describe the practice of making these changes in the statistical product prior to release; other agencies call this practice “statistical disclosure control” or “statistical disclosure limitation.”
3 Simson Garfinkel appeared in an official capacity, but the views and opinions he offered were his own and did not reflect the policy of the U.S. Census Bureau.
From an ethical point of view, the obligation to protect the privacy of people in a data set must be balanced against the positive societal benefits that can result from the use of that data set, said Garfinkel. For example, census data are used to apportion the House of Representatives, and, by law, the apportionment must be perfect. Therefore, the Census Bureau cannot add statistical noise to the count of any state or any specific geographic area for which it publishes statistics, though it is not prohibited from using noise elsewhere. In particular, he said, noise can be added to protect privacy without materially affecting the determination under Section 203 of the Voting Rights Act.
For the 2020 census, the Census Bureau is using a mathematical framework called differential privacy that helps it determine how much noise to add (Dwork et al., 2006). The amount of noise added is proportional to the inverse of the number of people included in a particular statistic. “We add the noise when we compute those statistics, and then we reconstruct a data set that could generate those statistics,” he explained. “When we reconstruct that data set, we now have a private data set that is different than the real data set, with the difference being the privacy edits that we apply.”
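The basic machinery of this framework, the Laplace mechanism described by Dwork et al. (2006), is compact enough to sketch. The epsilon value and counts below are illustrative assumptions, not the Census Bureau’s actual parameters or production algorithm.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws with mean `scale`
    # is a Laplace(0, scale) random variable.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng):
    """Release a population count under epsilon-differential privacy.
    A counting query has sensitivity 1 (adding or removing one person
    changes it by at most 1), so Laplace noise of scale 1/epsilon
    suffices. The noise does not depend on the size of the count,
    which is why small areas are affected far more in relative terms.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
# The same noise scale is negligible for a metropolis but significant
# for a block with a handful of residents (epsilon chosen arbitrarily).
big = noisy_count(1_000_000, epsilon=0.5, rng=rng)
small = noisy_count(3, epsilon=0.5, rng=rng)
```

This inverse relationship between group size and relative error is exactly the behavior Garfinkel describes: the noise is fixed in absolute terms, so large-area statistics stay accurate while tiny-area statistics are heavily masked.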
Garfinkel noted that 2020 census data that are not privatized will be kept in a secure location and not released to the public. The penalty for releasing those data is $600,000 and 6 years in prison, so everyone at the Census Bureau takes the responsibility of securing those data seriously. He added that the Census Bureau is going to release the source code for its software, which has never been done before, to allow researchers to see how differential privacy affects accuracy versus privacy using publicly released data from the 1930 and 1940 censuses.
Differential privacy allows the Census Bureau to prove mathematically that the noise it adds has the desired effect. Another important protection that comes from using differential privacy, said Garfinkel, is that the privacy it affords remains regardless of additional external data. This is a critically important feature because it will keep census data off the long list of cases in which a public agency, company, or individual researcher released a set of data believed to be properly de-identified, only to have somebody eventually figure out how to extract the identifying information from the original data.
One of the first known instances of this happening occurred in 2006, when America Online (AOL) released a set of search terms that its users had entered into the AOL search engine. It turned out that for some people, it was possible to take the constellation of search terms and identify the person, which The New York Times and eventually others did, resulting in a class action lawsuit against AOL and a $5 million settlement pool. Two years later, Netflix experienced a similar issue—and a similar legal outcome—when it released 100 million ratings given to 17,770 movies by roughly 500,000 users that it thought it had thoroughly de-identified, only to have a graduate student and his advisor at the University of Texas at Austin show that they could, in fact, identify specific people in the Netflix data set. “Both cases involve data that clearly contained personal information,
and the question in both of these cases was whether or not the information was personally identifying,” said Garfinkel. “AOL and Netflix thought that it was not, and in both cases, researchers showed otherwise.”
In these cases and others, the data that were thought to be anonymous were re-identified by finding some way to link them with data in another data set. “From the point of view of the data provider, this is frightening because whether or not a data set is vulnerable depends on the existence of another data set somewhere else on the planet, and you do not know about that other data set until it is too late,” said Garfinkel. He likened this to the quantum physics phenomenon known as quantum entanglement, in which two particles created at the same time can be entangled and remain linked even though they are separated by a great distance. In the case of data, two data sets are entangled because they are based on the same people or the same events. Differential privacy addresses this problem by attenuating the connection between two data sets through the addition of noise that creates carefully controlled uncertainty, explained Garfinkel. Today, Google uses differential privacy to protect the data it gathers from its Chrome browser, and Apple uses it to protect information it collects from iPhone users. There are simple applications of differential privacy, he noted, and the Census Bureau is using a significantly more complex application of the technique with the 2020 census.
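One of the simplest such applications is classic randomized response, the idea underlying telemetry systems of the kind Garfinkel mentioned. The two-coin protocol below is a textbook sketch under assumed fair-coin probabilities, not either company’s production mechanism.

```python
import random

def randomized_response(true_answer, rng):
    """Each respondent flips a fair coin before answering a sensitive
    yes/no question: heads, answer truthfully; tails, answer according
    to a second coin flip. Any single response is plausibly deniable.
    """
    if rng.random() < 0.5:
        return true_answer
    return rng.random() < 0.5

def estimate_true_rate(responses):
    # Invert the randomization: observed_rate = 0.5 * true_rate + 0.25,
    # so the population's true rate is still recoverable in aggregate.
    observed = sum(responses) / len(responses)
    return (observed - 0.25) / 0.5
```

No individual answer can be taken at face value, yet the aggregate estimate converges on the true rate as responses accumulate, which is the essential trade that all differential privacy mechanisms make.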
In conclusion, said Garfinkel, data collected from individuals frequently contain personally identifiable information even when that information is not readily visible and only becomes apparent when linked with another data set. The only way to provably break the linkage between two data sets is to add noise to the underlying data. “If we use noise, we can control the trade-off between privacy loss and the usefulness of the remaining data, and it turns out that the converse is also true,” he said. “If you do not use noise, you lose the ability to exert fine control over that relationship.” His final point was that noise can be applied to specific parts of a data set to protect specific types of data. For the 2020 census, for example, the Census Bureau intends to use noise to protect respondents’ age, race, and other identifiable characteristics, but it will not add noise to any of the geographically based population counts that will be used for apportionment and redistricting. That way, he said, the Census Bureau will be protecting individual privacy while also providing an accurate enumeration as required by law.
Garfinkel noted in response to a question that differential privacy is not applicable to clinical trials data for several reasons. The first is that clinical trials data contain text elements, which do not work well with mathematically based privacy mechanisms. The second problem relates to the Health Insurance Portability and Accountability Act (HIPAA), which states that when data are de-identified, they are no longer protected legally and can be shared. However, as several cases have shown, the “de-identified” data can still be re-identified, creating a conflict with HIPAA because, as he explained, “what the law says to do does not result in what the law says will happen,” which is that de-identifying data will protect an individual’s identity.