This chapter summarizes presentations on a number of challenges associated with the sharing of data, including obstacles to releasing data, privacy and confidentiality problems, and informed-consent issues. The discussions concerning these issues can be found in the last section of the chapter.
There are a number of different obstacles associated with the release of data. Some are related to concerns of the scientists who have generated the data, some are related to the concerns of businesses or other organizations that have paid for the collection of the data, and some are practical issues having to do with the administration of the data.
Concerns About Adversarial Science
It is, unfortunately, the case that the science surrounding environmental health issues can sometimes become very contentious and adversarial. As was the case with tobacco companies and the research showing a link between smoking and lung cancer, companies that profit from the production or use of certain chemicals or products can resist research that indicates that those chemicals have health risks. And this can put researchers in a bind, said Daniel Greenbaum, president of the Health Effects Institute (HEI), during his presentation in Session 4 of the workshop. “There is an inherent tension between collaborative scientific data sharing and what often can be adversarial science,” he said. “The challenge here is that investigators invested time. They probably should
be able to make their data available, but they are being engaged by advocates in controversial cases who are not there really to advance knowledge. Their primary purpose is in undermining the study.” If the investigators believed that these advocates were interested in advancing scientific knowledge, the investigators would be willing to share, but believing that the advocates are only interested in looking for weaknesses in the data that can be used to discredit the researchers makes them hesitant to make the data public.
In a worst-case scenario, the original researchers might even find themselves accused of scientific fraud by such advocates. This happened in California not too long ago, Greenbaum said. The case was meritless and “got thrown out,” he said, but it was still quite unpleasant for the investigators who were accused. “You can imagine—that is a pretty chilling activity.”
As a result of such adversarial tension, Greenbaum said, “the new investigators, many of whom are talented biostatisticians and epidemiologists, get frustrated because they cannot even call up the original investigators once they have the data” and ask questions about the data, such as details about how they were generated. “The original investigators do not want to talk to them.”
With the sharing of data encumbered in this way and the usual open scientific dialogue closed down somewhat, Greenbaum explained that it becomes much harder to advance the science in these areas—not just replicating the initial work but also extending it and carrying out new analyses on the data sets. All of these valuable outcomes of data sharing become the victims of distrust and suspicions.
Tensions Between Researchers and Opponents
To reduce some of the problems surrounding the sharing of data, Greenbaum suggested a different approach: finding ways to ease tensions between the researchers who collect the data and their opponents who are interested in discrediting the data or the conclusions that have been drawn from them. “Is there a way to facilitate a true dialogue,” he asked, “between those who fund the adversarial reanalysis—and I am not limiting that only to industry, ... although it tends to happen more with industry—and the scientific community that can set a foundation for more thoughtful and independent testing of key studies? Is there a way to do that in a way that would make sense and that would start to lower the
decibel volume around this incredible distrust ... between the two parties here?”
There may well be no bigger issue in the broad discussion about data sharing than how to ease tensions between these two opposing camps, Greenbaum said. “There are lots of issues around the technical and the other stuff, but it is this issue of how to lower the temperature of these discussions that allows science to shine through on the other end.”
If researchers are to be comfortable sharing sensitive data with others, they must have ways to make sure that the confidentiality of individuals in a data set can be preserved when the data set is shared. This can be particularly challenging for environmental health research, Greenbaum noted. “To do a good air quality study and estimate exposure, you do need to know where somebody lives,” he said. “And once you know where somebody lives and that they have died on a certain date, it is pretty easy to figure out who they are.”
EPA has asked the HEI for advice on how to share confidential data without compromising confidentiality, Greenbaum said, and, according to the institute, there are basically three options for doing that.
The first option, which is very common, is to have full collaborative data sharing with original investigators. “If you go to the American Cancer Society website, you can apply to them and ask to collaborate with them,” Greenbaum said. “There are many other mechanisms for doing that.”
The second option is the sharing of data through data use agreements. The National Institutes of Health (NIH) has a number of policies for doing this, Greenbaum said. “This is essentially what we did with our own HEI reanalysis. This provides the new investigators much greater access to the underlying data, but with a very significant requirement that they will protect any personal information from disclosure. I do not think I knew it was a $250,000 fine [for failing to protect any personal information] when I signed the data use agreement for the Harvard Six Cities Study and the American Cancer Study, but I knew it was significant penalties, and we pay attention to it. That is a mechanism that is out there, and it is used on a regular basis.”
The third option, which has become increasingly popular in recent years, is to share deidentified data files. “It is possible to do that, for example, for the air quality data sets for the cohort populations,” Greenbaum said, “but the reality is you couldn’t even properly replicate those studies today ... because a deidentified one would not allow you to have location and a variety of other things that are absolutely essential.”
Similarly, it would be impossible to do any sort of reanalysis or additional analysis of the data without having the location information and other types of information that could allow reidentification of the subjects. “You really have to have that kind of information,” he said.
A variation on the second option that Greenbaum mentioned—the sharing of data through data use agreements—is the use of secure data enclaves for the sharing of data. In this setup, sensitive data are kept in a particular physical location, and researchers use the data there so that the data themselves never leave the secure enclave.
Business Considerations Related to Data Sharing
Businesses must take into account various issues when deciding whether to share their data. Two of the main ones are concerns about opening themselves up to liability and other costs and worries about losing the value of confidential business information.
During Session 5 of the workshop, Joseph Rodricks, principal of ENVIRON, discussed his experience with businesses whose products might have health effects. “The concern in this country about product liability and toxic tort litigation is very high,” he said. “When something gets elevated to a known human ‘something,’ litigation breaks out all over. That is the experience of most companies. You might have at least some understanding why they might be concerned about such a finding.”
In particular, Rodricks spoke about how businesses react when studies appear that seem to have a strong potential of adversely affecting their business interests. “The ones of most concern ... are those emerging from human studies where authors are looking at causal relationships between some exposure and a disease,” he said. “Industry gets [really] nervous when studies like that appear if it is their product.” They are concerned about increased regulation as well as about liability.
“I find companies have different attitudes about what they want to do in those circumstances,” Rodricks said. “I discuss a lot of possibilities with them, including doing additional studies. I have heard from some companies that they want to find a way to undermine a study. I just say, ‘Go away. I will not engage in that kind of activity.’”
He agreed with Greenbaum’s concerns about adversarial science. “I have witnessed these very adversarial confrontations, and they are very difficult to even watch,” he said. “I certainly do not want to get involved in them.”
Rodricks noted that the process could be improved by beginning to assemble some guidelines on best practices for sharing data from environmental health research. These guidelines could help people organize their data management process so that some data are prepared for sharing when needed. He pointed out that a document from the Oak Ridge National Laboratory on best practices for preparing environmental data sets to be shared and archived has excellent detail on what investigators can be doing to prepare to publicly release the data as they develop them.1
The Business Value of Data
During Session 3 of the workshop, Greg Bond from the Dow Masters Fellowship Program at the University of Michigan spoke about some of the other business considerations related to data sharing. He noted that any raw data that industry generates to support a product registration are accessible to the relevant government authorities under a variety of statutes in the United States and also abroad. “The vast majority of these data will have been from animal toxicology studies that were conducted under good laboratory practices,” he said. “I can personally attest, as someone who ran a large, in-house toxicology laboratory, that the U.S. EPA comes in unannounced and will audit the data, all the raw data. And as a consequence, we have to keep the tissue blocks and the slides for 35-plus years so that they can audit again in the future, if necessary.”
Regulatory agencies are generally obligated to protect the commercial value of the data collected by companies. The key fact, Bond said, is that the data generated by industry researchers have commercial value. “This is recognized under FIFRA [the Federal Insecticide, Fungicide, and Rodenticide Act] and similar pesticide regulations globally,” he said. “Each of those regulations provides for exclusive use of the data by those who bore the cost of generating the data and for fair and just compensation [for] them by others who want to use the same data to achieve their own registrations and access to the market. The goal here is to prevent free riders and preserve the incentive for companies to invest in innovation and research.”
However, under pesticide regulations there are always time limits on how long a company can be compensated for those data. They are usually in the range of 5 to 15 years, Bond said. “That data compensation is becoming even more important as the rise of low-cost competitors, particularly from Asia and Eastern Europe, has increased,” he said. “These competitors already have some built-in advantages, and, frankly, we do not need to subsidize them by giving our data away to them.”
In particular, he said, he saw firsthand the sorts of advantages that these competitors have over U.S. companies when he spent 3 years working in China and observed the difference between how American companies and Chinese companies dealt with environmental health and safety (EH&S) concerns. “Our folks would wonder why it cost us $30 million to build a plant in China when the local producers could build it for $10 million,” he said. “A lot of that [was] EH&S protections that were built in.”
Thus, it makes sense, Bond said, that the companies that pay to collect environmental data consider those data to be business assets and are generally not eager to share them without compensation. In particular, data may offer the company that collected them some competitive advantage over their competitors—some particular knowledge that their competitors do not have or some insights that may lead to solutions that the competitors might not come up with. As a result, companies often treat environmental data as confidential business information and try to protect them from their competitors. The preservation of such confidential business information can come into conflict with the imperative to share research data, and finding the proper balance between the need to keep business information confidential and the desire to share scientific data can be difficult.
One way that businesses deal with the conflict, Bond said, is to provide higher-level information that is based upon the data. “Certainly, we believe that information should be publicly accessible that is sufficient for people to take action,” he said. For example, companies provide product labels and material safety data sheets with summary information that is sufficient to allow people to determine the steps that they should take, if necessary, to protect themselves. It may be necessary in some cases to improve the quality and quantity of that information, he said, but still, the company would be supplying information as opposed to the underlying data that the information was based on.
Bond explained what happens when someone wishes to question the conclusions of a study that, for instance, was submitted in support of a
position or a rule that EPA is considering. In that case, it would not be enough to look at higher-level information based on the data; the original data themselves would be required. First, Bond said, EPA itself always has access to the data, so it can check whether the conclusions of the study did indeed follow from the data. And, indeed, EPA generally does audit the data. Beyond that, there may be a time limit on how long the company can keep the data, particularly toxicology data, confidential; once that limit expires, the data become available. However, he added, he could not think of a single instance when someone asked for access to the data underlying a particular toxicology study. “It has not been an issue that we have had to deal with very often.”
Finally, Bond said, there is the issue of who pays for access to data that a company has paid to collect. The companies that collected data deserve and expect compensation for supplying those data to others, but where should that compensation come from? If another company wishes to use the data, the answer is clear, but if, say, academic researchers are looking to use the data, the answer is not so clear.
In addition to scientists’ concerns and companies’ concerns, data sharing is also complicated by various administrative issues, Bond said. For example, one issue that arises on occasion is the question of who owns the data. This is particularly an issue when data have been collected by groups of companies or institutions. A reasonable number of toxicology studies have been done by industry consortia, for instance. These were groups of competitors who came together years ago to share the cost of conducting the fundamental toxicology research, and over the years some of those companies have been absorbed into other companies via mergers and acquisitions. In a few cases, such absorptions have happened multiple times. And in some cases the companies involved in one of the consortia have gone out of business. The result, Bond said, is that it can be very difficult to sort all of this out to find someone who can make a decision about data sharing.
“As you can imagine,” Bond said, “not all competitors play nicely together. Some even resort to gamesmanship to try to exclude competitors from the market. Things can get nasty and messy in a hurry in these discussions.”
According to Bond, among the other administrative challenges is the fact that organizational policies and procedures and logistical issues may
differ from organization to organization. For example, institutional review boards at different universities may have different rules governing the procedures for sharing data.
How Will Data Be Made Available and Who Will Pay for Data Sharing?
Jerry Blancato, director of the Office of Science and Information Management at EPA’s Office of Research and Development, discussed governance and budget issues during his remarks in Session 5 of the workshop. He noted that multiple people talked about the need for definitions during the workshop, but from the point of view of someone responsible for the government’s data assets, the question really is “where are the data?” Decision makers may then ask: How does one make the data publicly available? Are the data readable by today’s machines? How should archived data be handled? And how should one plan for future data requests 20 or 30 years from now?
One major obstacle to extensive data sharing is its cost. “This is not a zero sum game,” said Blancato during his presentation. “This is going to cost us. It is going to cost us in dollars. There is no question about it.”
“The problem is that the budgets are not expanding to cover this,” he explained. “A balance has to be made.” If data are going to be made publicly available, then the money to do that is going to have to come from somewhere, perhaps from direct research support.
“One could make the argument ... that from a societal point of view, sharing the data will have far greater benefits,” he said. “But we have to measure that. And, as someone else said, we have to do some work to communicate that and convince the community—both the research community and the public community—that that in fact is true.... We are being screamed at from Congress, from the White House, from internal forces, from external forces. ‘Make the data available. Why can’t my data get out there so people can see it?’” But the money has to come from somewhere, and right now there is no agreement on where the money will come from.
Blancato noted that governance is key to figuring out how the data will be made available. Technology may be able to help formulate a solution to the sharing of data publicly, but pieces will still need to be protected and not shared. It is important to take advantage of partnerships with industry experts in Silicon Valley and elsewhere who know how to reasonably store tremendous amounts of data relatively cheaply. He
reinforced the importance of predicting and utilizing upcoming technology in planning for the future so that the data can be used and read 5, 10, or 30 years from now.
One of the obstacles to the release of environmental data is how to ensure, as much as possible, the privacy of the individuals whose data have been collected, as some of these data, such as medical history data or employment data, can be quite sensitive. These individuals were often promised confidentiality before they agreed to provide those data. This section summarizes presentations that focused on the risk that participant confidentiality could be breached in some way or another.
What Is the Empirical Knowledge About Reidentification in Data Sets?
Concerns about reidentification are causing some organizations to pause in their plans to share data, said Julia Brody, executive director of the Silent Spring Institute, during her presentation in Session 2 of the workshop. For example, she indicated, the institute would like to share its data set through EPA’s ExpoCast, an online data resource, but there are questions around whether the data may be reidentifiable, especially when linking air pollution data and personal information, such as household characteristics and consumer product purchases. These questions led Brody and her colleagues to establish a partnership, with funding from the National Institute of Environmental Health Sciences (NIEHS), to empirically investigate the sources of privacy risk in environmental health studies.
“We have looked at 11 major environmental health studies that have collected personal-level exposure measurements to see what kind of data they include that might be vulnerable,” Brody said. “We are focusing particularly on data that might be linked to real estate databases, property transfers, tax records, [and] zoning records and data that might be linkable to professional licensing registries, for example, pesticide applicators in the Agricultural Health Study or teachers in the California Teachers Study.” The project is also looking at a category of data that includes such things as what a neighbor might know about a person: what kind of fuel is used in a person’s house, how many pets a person has, when a house was remodeled, and so on.
One goal of the study is to empirically evaluate the reidentifiability of data that the Silent Spring Institute had collected by doing a reidentification experiment to see how many participants could be correctly reidentified by linking study data and public data sets. The institutional review boards that reviewed the original studies have approved this, but the Massachusetts Cancer Registry has not. “We are still talking to them about it,” she said. “In order to generate empirical data about this question, we will need to continue that discussion with them and hope we reach a resolution.”
Another goal of the study is to find out how study participants think about privacy issues. “We interviewed participants in the Personal Genome Project, which is an open-consent study in which participants go through a training process and post their data on the Web,” Brody said. “These participants are not representative of people in studies, but they are really on the frontier of giving up their privacy for science and for public health benefit.”
The study hopes to answer three questions, Brody said. “One, what is our empirical knowledge about what could lead to reidentification in our data sets? Two, what are the potential technical solutions to that in terms of masking or synthesizing data or creating server systems that would respond to queries? And three, what can and should we promise to our study participants in the informed consent?”
A key question is how likely reidentification is for subjects in existing data sets that have had the obvious personal identifying information removed. Two presenters offered very different answers to that question; their answers are summarized in the sections that follow.
Risks of Participants Being Identified
One of the standard ways to share sensitive data in which the privacy of the subjects must be respected is to “deidentify” the data before they are released.2 However, Daniel Barth-Jones, assistant professor of clinical epidemiology at Columbia University’s Mailman School of Public Health, pointed out during his presentation in Session 3 of the workshop that removing identifying data from a data set decreases their usefulness.
He focused specifically on the value of sharing deidentified data, or data stripped of names and other information, such as quasi-identifiers, which in combination could identify the people who supplied the information in a data set. “I believe that there is a great societal value to deidentified data,” he said, “that it provides invaluable public good, and that it is an essential tool for our society in supporting scientific innovation and research. It helps drive forward innumerable scientific and health research advances. It greatly benefits our society as a whole and yet still provides strong privacy protections for individuals.”

2 There is not a precise definition of deidentified data, and much debate and uncertainty about what constitutes deidentified data remain.
“As we move towards expanded health information technology and electronic medical records,” he concluded, “it will yield even more deidentified clinical data, which I believe will support important advances in health science.”
“The inconvenient truth is that we are stuck with a trade-off,” he said. “When we deidentify data, we necessarily degrade that data in different ways, either through overt information loss, like restricting dates that are more specific than a year in HIPAA [the Health Insurance Portability and Accountability Act], or through the trade-off of grouping people together, for example, so that they are not observable as specific individuals” (see Figure 4-1). The less information that is available about the individuals in a data set, the less that can be determined about how different variables are related to environmental exposures. For instance, he said, if information about race is removed from a data set to make it more difficult to identify the subjects, then it becomes impossible to determine if a particular environmental hazard affects members of one race more than another. Furthermore, removing many of the identifying data does not completely guarantee that the subjects cannot be “reidentified”—that is, have their identities deduced from the remaining information, such as sex, age, race, and general location (e.g., zip code or Census tract).
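The grouping trade-off that Barth-Jones describes can be illustrated with a minimal sketch. All of the records below are invented, and the `generalize` and `smallest_group` helpers are hypothetical names chosen for this example; the idea is simply that coarsening quasi-identifiers (10-year age bands, 3-digit zip prefixes) makes the smallest group of indistinguishable people larger, at the cost of detail that an analysis might need.

```python
from collections import Counter

# Hypothetical records: (age, sex, zip) are quasi-identifiers.
records = [
    (34, "F", "02139"), (34, "F", "02139"),
    (35, "M", "02139"), (38, "M", "02141"),
    (61, "F", "02140"), (62, "F", "02140"),
]

def generalize(rec):
    """Coarsen quasi-identifiers: 10-year age bands, 3-digit zip prefix."""
    age, sex, zip_code = rec
    band = f"{age // 10 * 10}-{age // 10 * 10 + 9}"
    return (band, sex, zip_code[:3])

def smallest_group(recs):
    """Size of the smallest group of identical records (the 'k' in k-anonymity)."""
    return min(Counter(recs).values())

print(smallest_group(records))                           # fine-grained: 1 (some people stand alone)
print(smallest_group([generalize(r) for r in records]))  # coarsened: 2 (everyone blends into a group)
```

The coarsened records protect the lone 35- and 38-year-old men by merging them, but any analysis that needed exact age or full zip code has lost that information, which is exactly the degradation Barth-Jones points to.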
Point: The Risks of Reidentification Are Relatively Low
Barth-Jones stressed in his presentation that much of the current discussion about privacy policies has been driven mainly by anecdotes and, in particular, by a few dramatic accounts of reidentification. However, Barth-Jones said, instead of basing policy on a few dramatic examples like this, it is important to look at the entire body of evidence for a group of people and determine what the average risks are. “It is important to know what the vulnerabilities are,” he said, “but if we do not
get down to the specifics of what the overall risks are, we will not be able to make good policy decisions.”
In particular, Barth-Jones said, he was focusing on the issue of reidentifying patients from data sets that have been deidentified according to HIPAA protocols. These data are collected by doctors and other health care professionals during the provision of health care, and they can be made available for scientific research if they are deidentified as specified by HIPAA.
“It has been claimed that deidentification does not work,” Barth-Jones said, “and I think in the pre-HIPAA context, that is probably the
case.” In particular, he pointed to recent work by a fellow workshop participant, Latanya Sweeney, professor of government and technology in residence at Harvard University, who examined various deidentified data sets and was able to use such information as birth date, gender, and zip code to reidentify a significant percentage of the individuals in those data sets: from 27 percent of individuals in the Personal Genome Project to as many as 87 percent in an earlier study. However, he said, things are different now under HIPAA.
“The reality is that HIPAA-compliant deidentification provides important privacy protections,” he said. “The estimate for the ‘safe harbor’ deidentification3 was that about 4 in 10,000 individuals might be reidentifiable.” Barth-Jones mentioned that in a systematic review by El Emam and colleagues (2011), of the 14 reidentification attacks that they surveyed, only 2 were done on data sets that had been deidentified according to current standards, and only one of those data sets contained health data. In that case, the reidentification attempt managed to identify only 2 out of the 15,000 individuals whose data were in the data set. “I think at that point we have to question why someone would go through the effort to do that,” he said.
On the other hand, though, there is no such thing as perfect and permanent deidentification. And because the future will always bring new ways to reidentify data, it is impossible to guarantee individuals that their data will never be identified. It is possible to make reidentification difficult, but not impossible, he said.
There are various approaches to deidentifying data, each with its own advantages and disadvantages, Barth-Jones noted. For example, the “expert determination” method is a little bit more flexible than the “safe harbor” approach.4 “It helps us balance the competing goals of privacy protection and preserving the utility and statistical accuracy of deidentified data,” he said.
In thinking about the deidentification and possible reidentification of data, there are two things that are important to understand, Barth-Jones said. “One is this myth that we can build a perfect population register.” It is difficult to develop a population register that is comprehensive, that is, that includes everyone. For instance, many population registers depend on voter registration lists, but only about 70 percent of Americans are registered to vote.

3 The safe harbor approach requires removal of all 18 listed identifiers under the HIPAA Privacy Rule.
4 The expert determination method can be used when it is preferable to not remove all 18 listed identifiers. The method requires confirmation from a qualified statistician that the risk of identification is very small on the basis of the retained identifiers.
The other thing to keep in mind is that various errors and inconsistencies will inevitably arise in linking data between the sample and the population, creating what Barth-Jones referred to as “data divergence.” The basic approach to reidentification is to match variables from a population data set—a list of registered voters, for example—with variables from the sample data set under investigation. But there are various ways that the data in the population data set can diverge from the data in the sample data set. First, the values of the data variables can change over time. People move from zip code to zip code, their income bracket changes, or their marital status changes. All of these things make it more difficult to draw a link between the sample in question and the overall population. There are also missing and incomplete data as well as keystroke and other coding errors in any data set, which, again, will make it harder to link people in the sample data set with individuals in the population data set.
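The exact record linkage approach described above, along with the effect of data divergence, can be sketched in a few lines. Everything here is invented for illustration: the names, the quasi-identifier values, and the `link` helper are all hypothetical, and a real population register and attack would be far messier.

```python
# Sketch of exact record linkage: match quasi-identifiers in a "deidentified"
# sample against a population register (e.g., a voter list).

population = [  # register entries: (name, birth_year, sex, zip)
    ("Alice Adams", 1950, "F", "02139"),
    ("Beth Brown",  1950, "F", "02139"),
    ("Carl Clark",  1962, "M", "02140"),
    ("Dana Davis",  1971, "F", "02141"),
]

sample = [  # deidentified records: (birth_year, sex, zip, diagnosis)
    (1950, "F", "02139", "asthma"),    # two register matches: ambiguous
    (1962, "M", "02140", "diabetes"),  # exactly one match: reidentified
    (1971, "F", "02142", "asthma"),    # subject moved (data divergence): no match
]

def link(sample, population):
    results = []
    for by, sex, z, dx in sample:
        matches = [name for name, pby, psex, pz in population
                   if (pby, psex, pz) == (by, sex, z)]
        # Only a unique match yields a confident reidentification.
        results.append((dx, matches[0] if len(matches) == 1 else None))
    return results

for dx, who in link(sample, population):
    print(dx, "->", who)
```

Only the middle record is reidentified; the first fails because two register entries share its quasi-identifiers, and the third fails because the subject's zip code changed after the study data were collected, which is the divergence Barth-Jones describes.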
A fundamental problem with efforts to protect individuals from identification, Barth-Jones said, is that as more is done to deidentify or anonymize the data, the less useful the information is for statistical analyses. To illustrate, Barth-Jones showed an original analysis that clearly showed a race-based effect: as depicted in Figure 4-2, African American, Hispanic, and Asian groups all scored differently from white Americans (the leftmost group). But when the various racial groupings other than white were combined to anonymize the data and make it harder to identify individuals, the effect disappeared, and all that remained was a white group and an “other” group (the rightmost group) between which there were no statistically significant differences.
One important thing to keep in mind when trying to deidentify data, Barth-Jones said, is the difference between whether someone is unique in the sample under consideration or unique in the larger population. “If they are unique in the sample, we call them ‘sample unique.’ If they are unique in the larger population, we call them ‘population unique.’” In general, in any given sample there may be individuals with unique identifying information; for example, there may be only one person in the sample with a particular age, sex, location, and degree of education, but in the larger population there may be several people with that particular set of identifiers. “It is really only those records that are unique
in the sample and in the population that are at a clear risk of being reidentified using exact record linkage,” he explained.
Unfortunately, he said, efforts to deidentify samples sometimes do not take into account the differences between sample uniqueness and population uniqueness. Some have suggested that deidentification methods collapse variables to the point that only a tiny percentage of the sample is unique. This is generally not necessary because the real question is often not how many records are sample unique but rather how many of those sample-unique records are also population unique. When people ignore this distinction and try to minimize the sample-unique records, it can cause the data set to lose most of its value for understanding environmental effects on health. “If we are going to do this to our statistical analyses,” he said, “we might as well all give up and go home.”
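The sample-unique versus population-unique distinction can be made concrete with a short sketch (the quasi-identifier fields and counts below are hypothetical): under exact record linkage, only records unique in both data sets carry a clear reidentification risk.

```python
from collections import Counter

# Hypothetical population and a sample drawn from it, each record reduced
# to its quasi-identifiers (age, sex, 3-digit zip).
population = [
    ("45", "F", "021"), ("45", "F", "021"), ("45", "F", "021"),
    ("62", "M", "606"),
    ("30", "F", "100"), ("30", "F", "100"),
]
sample = [("45", "F", "021"), ("62", "M", "606"), ("30", "F", "100")]

pop_counts = Counter(population)
sample_counts = Counter(sample)

# "Sample unique": the combination appears once in the sample.
sample_unique = [k for k, n in sample_counts.items() if n == 1]

# Only records unique in BOTH the sample and the population are at a
# clear risk from exact record linkage.
at_risk = [k for k in sample_unique if pop_counts[k] == 1]

print(len(sample_unique))  # 3 records are sample unique...
print(len(at_risk))        # ...but only 1 is also population unique.
```

All three sample records are sample unique, yet only one is population unique, which is the point Barth-Jones makes: collapsing variables until almost nothing is sample unique attacks the wrong denominator and destroys analytic value in the process.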
Barth-Jones noted that it is important to obtain a better understanding of the risk that deidentified patient data can be reidentified in order to improve patient confidentiality. “How do we move beyond anecdotes to a
rigorous scientific evidence-based risk management approach for dealing with reidentification risks?” he asked. “I would suggest we want to do that using quantitative policy analyses. These have been used for decades by agencies like the EPA and the Energy Department to address difficult risk management questions. There is a lot of uncertainty in reidentification risk assessment, and I think we can use these methods here.”
Counterpoint: We Need a New Way of Thinking About the Risks of Reidentification
At various points in the workshop, both Barth-Jones and Sweeney referred to perhaps the best known case of reidentification, which occurred in 1997 when Sweeney, then a graduate student, was able to pick out from a state insurance data set Massachusetts Governor William Weld, whose data had been deidentified. The fact that she was able to identify Weld, even though the data set had been stripped of anything that directly identified the people in it, seemed to be a striking example of how easy it is to circumvent deidentification. Sweeney offered her own thoughts on reidentification during her presentation in Session 5 of the workshop. Work on reidentification over the past 15 years has resulted in a new way of thinking about the risks of reidentification. She discussed how reidentification experiments are conducted—for example, by computer science scholars—to evaluate the privacy risks of peer-reviewed and published data sets so the reidentification methods and results can inform privacy protection policies and strategies.
The possibility that data will be reidentified scares many people, Sweeney said. Scientists worry that if the subjects whose data are in their data sets are reidentified, it will discredit them and perhaps make it more difficult to attract subjects in the future. But as a computer scientist, she said, she thinks of reidentification in a very different way.
“[R]eidentification is a sufficient and necessary condition for improving privacy protection,” she said. Without real-world demonstrations of data reidentification that indicate actual vulnerabilities and risks, there will be bad policy and little scientific progress in privacy-enhancing technologies. To zero in on the appropriate types of privacy protections—whether they are stronger than today’s protections, weaker, or, more likely, totally different from what is now known—it is necessary to go through cycles of data being reidentified, protections being changed in response, and so on.
“Computer science had this exact same problem many years ago in encryption,” she said. “It used to be the case that some people would rely
on encryption for national security. Somebody would break it. They would publish it. The national security people would freak out. Businesses would freak out. Somebody else would come up with a better system, and somebody else would break that. They created this cycle until eventually we got the strong encryption that we use today. Today, I can tell you exactly how your credit card is encrypted over the Internet, and even knowing how it is encrypted does not make it possible for you to break it. We could have never gotten there if we did not have those cycles.”
Sweeney described current thinking on reidentification as being like an argument between two lawyers who have different political positions. “One camp says everything can be reidentified, and the other camp says nothing can be reidentified. And neither one of them is right. But now how do we clarify it?”
The current situation, Sweeney argued, is the result of 15 years of talking about the risks of reidentification with very few people actually looking at the details of how reidentifications occur and how they can be prevented. For example, her work in reidentifying William Weld from the Group Insurance Commission in Massachusetts is well-known. “It is cited in the HIPAA privacy regulation preamble,” she said. “It is cited in preambles in other countries’ privacy regulations.” But although she has made more than 20 attempts to publish a paper describing the details, it has never been published. “Not because I did not try,” she said. “Not because I did not write the papers, but because, in general, reidentifications are disruptive. Despite a significant history after all those attempts, we do not have a history to turn to.”
The next large-scale reidentification she did was a case in southern Illinois for the Department of Public Health. She reidentified children in a cancer registry. Her results were scored, and she was told that she was absolutely accurate in 20 out of 22 of the cases, she said. “The judge saw the results and sealed the case and said you may never tell anyone how you did that.”
She did reidentifications in New York City and for the U.S. Census Bureau. In both of those cases, she found that the data were vulnerable to reidentifications even as the data “were already going out the door.” In each case, she said, she was asked to give the organizations time to figure out how to deal with the vulnerabilities. “I did,” she said, “and for years there was nothing done. They just continued to ship the data in the same way as if the vulnerability never existed, and the papers did not either.”
The bottom line, she said, is that it is necessary to study situations in careful detail—including what information is available and who has access to it—if one is to understand the true risks of reidentification.
She is now working at assembling that evidence, she said. As part of that effort she has put together a map of how health information travels between organizations and individuals (see Figure 4-3). “This is what the data map looks like today,” she said. “You can visit it on thedatamap.org. Every node, if you click on it, will list names of entities and document how we know they engage in that sharing practice.” The entities in bold font are those that are not covered by the privacy rules in HIPAA—and they are surprisingly numerous.
One of the major nodes in the map is hospital discharge data, she said. “There are about 33 states that share the [deidentified] data. Only three states use HIPAA standards. The other 30 states use standards that are weaker than HIPAA.” When she used Freedom of Information Act (FOIA) requests to find out who is getting these data, she found out that researchers are nowhere near the top 50 recipients on the list.
Sweeney showed a sample of deidentified hospital discharge data that she bought for $50 from Washington State, which she then linked to newspaper articles that appeared during the same time period. “We stick with news stories,” she explained, “because news stories have the same kind of information that an employer knows about you, a creditor knows about you, your family and friends know about you, and you could link the data as well.”
She linked those data to publicly available data from various lists. “We do not use voter lists anymore,” she said. “We use public records because they are readily available. They are far more comprehensive. Everybody is in it. Your nicknames are in it. Your cell phones are in it.” From that publicly available information she was able to get zip codes for individuals, which she linked with the hospital data to identify people who had been discharged from the hospital.
“We found that we had accurately reidentified 45 percent,” Sweeney said. They determined that accuracy rate by giving all of the names to Bloomberg News; the publishers of Bloomberg News and Jordan Robertson, the reporter, then contacted each person on the list under an agreement that a name would be released only if the person agreed.
This is the reidentification problem in 2014, she said. There is so much information floating around that is readily available to anyone who knows where to look that it is very difficult to get a handle on the exact risks of reidentification. But the first step in understanding those risks is to get a clear picture of exactly what information is available and how the different sources of information can be combined to identify people who appear in data sets that are supposed to be deidentified.
Informed-Consent Issues

Before an individual can take part in a scientific or medical research study, it is generally necessary for that person to provide “informed consent.” That is, the person must be told enough about the study that he or she develops a reasonable understanding of the risks of participating and then still consents to take part. In universities, governmental agencies, and other organizations, the process of getting informed
consent—including details on how the specifics of the study are presented—is overseen by an institutional review board, whose purpose is to make sure that subjects in studies are not put at unnecessary risk and that they understand whatever risks might be involved.
Kevin Casey, the associate vice president for public affairs and communications at Harvard University, offered a detailed example of how informed-consent issues arise and affect scientific research during his presentation during Session 3 of the workshop. Researchers at Harvard followed a large number of subjects in six cities for nearly two decades as a way of studying the relationship between exposure to air pollution and mortality risk. The so-called Six Cities Study,5 data from which were first published in 1993 (Dockery et al., 1993), played a major role in the establishment of air quality standards.
According to Casey, the informed-consent form signed by participants in the study said, “Harvard University School of Public Health hereby gives the assurance that your identity and your relationship to any information obtained by reason of your participation in this study of respiratory symptoms will be kept confidential and will not otherwise be disclosed except as specifically authorized by you.” In short, Casey said, the subjects were assured that their data would remain private and that the data would remain in the hands of Harvard University.
The data for the Six Cities Study were of three types: some data consisted of basic information concerning the individuals and their health histories, some data described the subjects’ respiratory symptoms when they were admitted to hospitals and where they were living at the time, and the third type of data consisted of death records. The location data were important, Casey noted. “That is different than other kinds of biological issues, where, if someone is eating broccoli, you do not care where they are located.... In this instance, it is important to know where they are because it intersects with the air quality data at that time when they might be admitted.” The combination of the various types of data made the study particularly valuable for policy makers, Casey said, because it allowed researchers to draw connections between a person’s exposure to air pollution and that person’s health and (for those who died before the study was finished) age at death.
5 The Harvard Six Cities Study evaluated a well-characterized cohort of adults from six communities (Harriman, Tennessee; Portage, Wisconsin; St. Louis, Missouri; Steubenville, Ohio; Topeka, Kansas; and Watertown, Massachusetts) on whom the health effects of air pollution were followed prospectively, beginning in 1974. The objective of the study was to estimate the effects of air pollution on mortality when controlling for age, sex, individual smoking status, and other risk factors. The study found that mortality risk was strongly associated with the levels of fine particulate air pollution (particles smaller than 2.5 microns, or PM2.5). This research has led to EPA air quality standards and regulations on PM2.5.
But the combination of information also made it much easier to identify the subjects in the data set, he noted, “because if you know what the air quality is at a particular time and you know when somebody died in a particular day and you know the town where they were located in, it is not very hard.... Two people died in Watertown on July 2, 1997. One is a woman. One is a man. We actually know who it is. We provide [those] data. It violates exactly what we had promised when these people enrolled.”
Casey added that the death index data were also provided with a promise from the researchers that they would not allow the data to be used in a way that would disclose any individuals who were part of that work.
But the situation sets up a tension between the importance of using the data for scientific and regulatory purposes and the need for maintaining the privacy of individuals who provided their data with the promise that their identities would not be revealed. The data from the study are “foundational information” used in setting clean air rules, Casey said. “And the Clean Air Act states that it should be revised periodically, and, when it is, it should draw upon the published research that is available, and when they do, it has directed attention to the Harvard School of Public Health. Our work percolates. It bubbles back up.” Unfortunately, the need to protect the confidentiality of the subjects enrolled in the study leads to what Casey described as “our inability to provide enough information to certain policy makers to make them feel like they have all of the data to support the determinations of EPA and others on clean air standards.” This in turn creates pressure to provide more data, which could make breaches of confidentiality more likely.
“It is a major issue,” Casey said, “because the whole concept of enrolling people in studies is at risk.” People are likely to respond much differently if they are told that their data will be kept private than if they are told that their data might be provided to members of Congress and other people who will use it as they like. “I think that it would have a chilling impact,” he said. “It is not to cast aspersions on the Congress or others. It is an actual conundrum that we are facing.”
The issues being raised by members of Congress are legitimate, Casey said. It is true that there are certain data that the researchers have not been willing to supply. “The reasonable-thinking people who are serious about clean air, who are serious about solid policy, who want the federal government’s work to be valid—those people have reasonable questions that they are raising,” he said. “We would like to get to the bottom of them, also. Our researchers are not trying to hide the ball. Some of them have been working for 20 years and have not gone to a meeting where they have not been confronted with this issue. If there were a way that they could thread that needle and provide the amount of information to satisfy reasonable policy makers in the pledge that they made to their subjects, they would do it tomorrow.”
Discussion

This section summarizes the discussions on the challenges associated with data sharing that took place throughout the workshop.
Issues Associated with Reanalysis of Data
A workshop participant watching over the Web noted during the discussion after Session 2 that one of the concerns that researchers have about sharing their data is that whoever does a reanalysis of the data may not hold it to particularly high standards and may come up with incorrect results. This is a particular worry when advocates for one side of an issue are looking at the data, but the problem is broader than that and extends to researchers who may not have a particular axe to grind but are simply not very careful in their analyses. Specifically, how does one ensure the quality of a reanalysis of data? The example offered was a reanalysis of the association between vaccines and autism in children that found that an association did exist. Although the reanalysis was done poorly, its results were quickly picked up by the press and may have contributed to mothers choosing not to vaccinate their children. Gwen Collman of NIEHS suggested that any reanalysis should be subjected to the same standards as the original analysis. Reanalyses should go through peer review before they are released and certainly before they are used in any sort of rule making by either the regulatory agencies or the courts. Anthony Cox, chief science officer of NextHealth Technologies, noted that the original analyses are often not of very high quality, which is why the
scientific literature has so many false positives. Logically, he added, false negatives and false positives are equally important, but, empirically, false positives are the problem, which is why he focused on them. Finally, he commented that he agreed with Collman that “all analyses that are published should have to jump through the same hoops.”
Al McGartland of EPA commented that there is a similar problem with reports in the gray literature, that is, literature containing the findings of research that is reported by associations or companies but that has not gone through peer review. Although these studies are not published in academic journals, they are often circulated on Capitol Hill and cited by the press. Should the researchers who publish gray literature release their data and models for use in reanalyses? Cox responded by stating, “I do not think there is a good, clean answer other than buyer beware or reader beware.” If the peer-reviewed scientific literature can be cleaned up and made “closer to the gold standard that we all want it to be,” Cox said, then the peer-reviewed literature will inevitably have increasing value compared to the gray literature and the gray literature will get less attention.
During the discussion after Session 2 of the workshop, Alan Morrison of the George Washington University Law School highlighted two basic issues with informed consent as it concerns environmental health data. “The real problem today,” he said, “is that when you ask somebody for consent, particularly a broad consent, neither you, the requestor, nor the person being requested has any notion at all as to what that means in terms of how it is going to be used because we do not even know what it is going to be used for down the road. The second thing is [that] when people asked for consent a long time ago, we understood that consent meant you were not going to hand out something to somebody else, but as long as we took your name off it, [it] was okay. Obviously, that is not enough anymore.” Given all of the reidentification methods that have since been developed, people who were once essentially unidentifiable are now identifiable even when the identifying data have been removed. This changes the game and means that the consent that people gave 20 years ago can no longer be considered “informed” consent because neither the subjects nor the researchers had any way of knowing just how identifiable the records in data sets would become. Thus, Morrison concluded, “the notion of informed consent as it was
developed in the law is, in my view, essentially meaningless. We have to think about it in a different way.”
Bernard Lo, president and chief executive officer of The Greenwall Foundation, agreed with Morrison on both issues. Concerning the informed consent that researchers today are obtaining from subjects for the sharing of their data, he said that there is a tendency to promise too much. “By implying that we will only do what you consent to and that we will protect your privacy in very strict terms so there is no probability that people can reidentify you may be promising too much. It is unrealistic.”
As for the informed consent provided by subjects in older research studies, Lo agreed that it is questionable whether the consent can hold up today. “We heard data presented from investigators saying they looked retrospectively at consent forms for completed drug clinical trials. What did they say about sharing? It was split. Some trials did not mention it at all. They were silent.... Others gave broad consent: ‘This could be shared with other researchers.’ The participants may not have known what that means.” So what should be done with clinical trials where the original consent did not mention data sharing or said, for example, that the data should be shared only with other researchers in the same institution? It is not clear.
Linda Birnbaum, director of NIEHS of NIH, noted that the U.S. Department of Health and Human Services is currently working on the issue of informed consent and, in particular, is working on a proposed new rule related to changes in the informed-consent language that would offer broader informed consent, with researchers being given the opportunity to go back to the participants to ask them for permission to use the data in a new way each time that the researchers wish to reanalyze the data.
In the opening remarks for Session 3 of the workshop, Glenn Paulson, science adviser in the Office of the Administrator at EPA, said that there is a new working group on modernizing the Common Rule, the federal rule governing the treatment of research subjects in many federal departments. Changes in the Common Rule could affect such issues as informed consent and the sharing of data from surveys or trials involving human subjects.
The Harms of Reidentification
Assuming that some records in a data set are reidentified, the obvious question to ask is: What are the potential harms associated with that reidentification? What damage can be done to people who contributed data to a data set if their identities become known?
John Gardenier, a retired member of the confidentiality staff at the National Center for Health Statistics, raised the issue in a question that he posed to Daniel Barth-Jones in the discussion after Session 3 of the workshop: “We can all hypothesize the theoretical or hypothetical harms that could come from a false-positive reidentification,” he said, “but I do not know of any specific study that has been done in which anybody has documented an actual case of actual harm as an example of what the real societal risk is. Can you comment on that?”
Barth-Jones answered that he also was unaware of a demonstration of actual harm that reidentification had caused to a real person. “I think our evidence base at this point is fairly slim,” he said. “One of the things I would argue for is doing more systematic studies where we actually do good sampling and then look at how many people can be reidentified. And then I think we are at the point of having to make some subjective assessments of the sensitivity of the information that may be revealed and whether it could lead to potential harms.”
This discussion led to a series of other comments concerning the harms of reidentification. One workshop participant suggested that harms can come about in ways other than people in a data set being identified. “I think another model that exists out there is, if I know an individual, can I go to this data set and find out if that individual is in the data set?” This might come up, for example, if a plaintiff in a lawsuit claiming harms from environmental exposures of concern had taken part in an environmental health study. The opposing lawyer might search through the data set from the study, looking for additional information on the plaintiff, which would be accessible if it was possible to identify the plaintiff from among the people who took part in the study.
John Howard, director of the National Institute for Occupational Safety and Health, offered a second example. An employer could search for his employees or potential employees in a data set from an occupational health study in an effort to get information about the disease status of those people.
Howard also suggested that there does not have to be this sort of real-world harm for harm to have happened. Once a person who has been
promised privacy is identified within a data set, that breach of the promise of confidentiality is itself a harm, he said.
Barth-Jones agreed and described a similar situation that arose with reidentification of people taking part in the Personal Genome Project. “Even though there was no promise of confidentiality in that situation,” Barth-Jones said, “many of the people who were reidentified in that situation without having been consulted first seemed to express displeasure.”
Reidentification Risk as a Participation Deterrent
The individuals whose data are in a data set are not the only ones at risk from reidentification. If reidentification becomes a serious issue, it is possible that this could make it less likely that individuals will be willing to take part in studies.
Stacy-Ann Allen-Ramdial, intern with the House Committee on Science, Space, and Technology, raised the question explicitly during the discussion after Session 3 of the workshop: “I have heard a lot about the importance of sharing data for researchers and for the federal agency,” she said, “but wouldn’t the participants be a little bit more comfortable with the transparency early on, knowing that their information could potentially be used by the federal government for regulation? Or is there a risk that you will lose so many participants that you will not have enough for your research studies?”
Casey answered that people are trying to find the proper balance between the need to share data and privacy concerns, but he suggested that it would have a “chilling effect” on people considering enrolling in a study that would take their personal information and study them over many years “if they knew that their personal information might become part of the public discourse record—individual information, not aggregated, not collected and shared in terms of overall studies.” Researchers have not really had to deal with this issue up to now, he said, and so there is no direct experience that indicates what would happen. However, he added, “I suspect that that would have an actual damaging impact on the ability to enroll people, not only in this kind of [clean air] study but potentially downstream in other kinds of clinical trials and other areas where things could become controversial.”
This exploitation should have its limits, though, warned Casey. For example, the data in the Harvard Six Cities Study have been, if anything, overanalyzed in the two decades since the study appeared. “The original
publication was 1993,” he said, “and there continues to be a fetish about reanalyzing that initial base of data when there have been over 20 studies done over the last 20 years by international and national groups that have come to similar conclusions and have drawn upon different sets of participants in the groups. To continually go back to the 1993 data is acting as if science was frozen two decades ago.”
Glenn Paulson, science adviser at EPA’s Office of the Administrator, offered his own experience from serving on the executive committee for an institutional review board at a major medical research institution. The institution carried out a wide range of studies, from clinical trials to epidemiological and behavioral research. Recruiting people to participate in research projects was always a problem, he said, but it never seemed to be due to the worries about privacy and confidentiality issues. The consent forms generally said that information about the subjects would be kept private “unless it is requested by a court of law or something to that effect,” he said, and this language did not seem to be a reason for people not being willing to join research studies.
Lynn Goldman, dean of the Milken Institute School of Public Health at George Washington University, spoke of the repercussions of a ruling to publicize data that a researcher had promised to keep private. In 2005, Bruce Lanphear and colleagues published a study looking at the effects of lead on intelligence quotient (IQ) in children. Working with colleagues who had collected data from children around the world, he pooled the data and found that exposure to lead lowered IQ in children (Lanphear et al., 2005). Because EPA had funded Lanphear’s systematic review, the decision was made that the data in his study had to be shared, so without the consent of any of his colleagues who had provided the data—and without the consent of the subjects who had provided the data—the data were made public to anybody who wanted to use them. “My guess is that he [Lanphear] repaired the relationships with those other investigators, and that they are okay,” Goldman said, “but there are people around the world who may feel cautious about collaborating with U.S. scientists in any setting where they might be asked to do a data-sharing agreement, knowing that if the federal government is funding it, their data then may become just available freely.”
“I do not think that that is helpful in terms of generating good will about scientific collaboration and cooperation among countries,” Goldman continued, “or for that matter eagerness for participating in systematic review. I feel that is a harm, to be honest.”
Ellen Silbergeld, professor at the Johns Hopkins Bloomberg School of Public Health and editor-in-chief of Environmental Research, added that, for the sorts of studies that she does, this situation would make it very difficult to get people to volunteer for studies. “If I told anybody in the studies we do with Native Americans, people in the Amazon, people in inner city Baltimore, and workers in South Carolina that the data might become available through a lawsuit, they would not participate.”
During her presentation in Session 5 of the workshop, Birnbaum suggested that the chilling effect is most likely to be seen in smaller studies involving people who are followed for many years. “This may not be as much of a problem in some of the very large, statistically based, almost ecological kinds of epidemiological studies,” she said, “but it may be extremely important when we are dealing with the smaller studies where we have small cohorts that we recruit and we want to follow.”
Explaining Risks Better to Subjects
Howard commented during the discussion after Session 3 of the workshop that it is no longer possible to believe—as it was several decades ago—that the confidentiality of research subjects can be absolutely assured and that researchers thus have an obligation to talk about confidentiality risks differently than they did many years ago. “It would seem to me,” he said, “that we are at a stage of developing obligation on the part of scientists [where] they should know that the situation of making those kinds of promises to a potential study participant in 2014 versus 1970 really isn’t the reality of the world we live in.” In particular, he said, it is important to educate researchers on the best way to talk to subjects about confidentiality risks, but it is not clear to him that many institutions are making sure that their researchers are up to speed. The situation concerning confidentiality risks has changed so much in the past 10 to 15 years that it may be the case that many scientists are still making promises to their subjects that they may not be able to keep.
Barth-Jones, D. C. 2014. Challenges associated with data-sharing: HIPAA deidentification. Presentation at the workshop Principles and Obstacles for Sharing Data from Environmental Health Research, Washington, DC.
Dockery, D. W., C. A. Pope, X. Xu, J. D. Spengler, J. H. Ware, M. E. Fay, B. G. Ferris, Jr., and F. E. Speizer. 1993. An association between air pollution and mortality in six U.S. cities. New England Journal of Medicine 329(24):1753–1759.
El Emam, K., E. Jonker, L. Arbuckle, and B. Malin. 2011. A systematic review of re-identification attacks on health data. PLoS ONE 6(12):e28071. Available at http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028071 (accessed October 27, 2015).
Lanphear, B., R. Hornung, J. Khoury, K. Yolton, P. Baghurst, D. C. Bellinger, R. L. Canfield, K. N. Dietrich, R. Bornschein, T. Greene, S. J. Rothenberg, H. L. Needleman, L. Schnaas, G. Wasserman, J. Graziano, and R. Roberts. 2005. Low-level environmental lead exposure and children’s intellectual function: An international pooled analysis. Environmental Health Perspectives 113(7):894–899. Available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1257652 (accessed October 27, 2015).
Sweeney, L. 2014. Inconvenient truths of re-identification discourse. Presentation at the workshop Principles and Obstacles for Sharing Data from Environmental Health Research, Washington, DC.