David Weir, of the University of Michigan, Ann Arbor, introduced a session on the state of the art in technologies for data sharing and in the rules and regulations that govern data sharing. In every area of research, he noted, the goal should be to find the proper balance and trade-offs among risk, consent, and procedures for protecting data and making them available to researchers. This balance determines the amount of protection that is accorded to the human subjects who provide the data.
George Alter discussed issues that arise in the storing and sharing of confidential data, citing his experiences as director of the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan. ICPSR is a collaboration of more than 700 member universities around the world that contribute to the archiving of social science data that is shared among the member institutions. The consortium also provides sponsored archives for about 20 different agencies, including institutes at the National Institutes of Health and agencies in the Department of Justice. “For those sponsors we set up web portals to preserve and make available the data that they or their grantees are collecting,” he explained.
Disclosure Risks in Social Science Data
Alter briefly noted some of the factors that are increasing the risks of disclosure in social science data and, in particular, in data from which direct identifiers have been removed so that the data seem to be anonymous. As other speakers had noted, even with de-identified data it can be possible to identify individuals from the information that remains in the dataset, and several trends are increasing concerns about re-identification. One is that more and more research is being done with geocoded data, he noted. Another is increasing use of longitudinal datasets, such as the Health and Retirement Study, “where the accumulation of information about each individual makes them identifiable,” he said. Finally, many datasets have multiple levels—data on student, teacher, school, and school district, for example, or on patient, clinic, and community—which can make it possible to identify individuals by working down from the higher levels.
Protecting Confidential Data
With respect to protecting confidential data, Alter said, it is useful to think in terms of a framework that considers protecting confidentiality with four different but complementary approaches: safe data, safe places, safe people, and safe outputs. “You can approach making data safe in all of these different ways,” he said, “and in general we try to do some of each.”
It is possible to take steps to make data safer both before and after they are collected. Before data are collected, one can design studies in such a way that disclosure risks are reduced. Researchers can, for example, carry out their studies at multiple sites because “a study that is designed in one location, especially when that location is known, is much more difficult to protect from disclosure risk than a national survey or a survey in multiple sites.” Researchers can also work to keep the sampling locations secret, releasing characteristics of the contexts without providing locations.
After the data are collected, there are many procedures that researchers can use to make the data more anonymous. They can group values, for instance, or aggregate over geographical areas. They can suppress unique cases or swap values, and there are a variety of more intrusive approaches, such as adding noise to the data or replacing real data with synthetic data that preserve the statistical relationships in the original data.
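The techniques Alter mentioned can be illustrated with a brief sketch. The following Python fragment is an invented example of three common statistical disclosure limitation steps—generalizing values into bands, suppressing rare values, and adding noise; the field names, records, and thresholds are hypothetical and do not come from ICPSR's procedures.

```python
import random

# Illustrative sketch of disclosure limitation techniques; field names
# and thresholds are invented for the example.

def generalize_age(age, band=10):
    """Group exact ages into coarse bands, e.g. 37 -> '30-39'."""
    low = (age // band) * band
    return f"{low}-{low + band - 1}"

def suppress_rare(records, key, threshold=3):
    """Mask values of `key` that occur fewer than `threshold` times,
    since unique or rare values are re-identification risks."""
    counts = {}
    for r in records:
        counts[r[key]] = counts.get(r[key], 0) + 1
    return [
        {**r, key: r[key] if counts[r[key]] >= threshold else "SUPPRESSED"}
        for r in records
    ]

def add_noise(value, scale=1.0, rng=None):
    """Perturb a numeric value with small bounded random noise."""
    rng = rng or random.Random(0)
    return value + rng.uniform(-scale, scale)

records = [
    {"age": 37, "zip": "48104"},
    {"age": 41, "zip": "48104"},
    {"age": 83, "zip": "49968"},  # a unique, hence risky, ZIP code
    {"age": 35, "zip": "48104"},
]
coarsened = [{**r, "age": generalize_age(r["age"])} for r in records]
safe = suppress_rare(coarsened, "zip")
print(safe[2])  # the unique ZIP code is masked
```

Real disclosure review weighs the loss of analytic value against the protection gained, so such transformations are tuned to the dataset rather than applied mechanically.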
The second approach is to provide the data in places that are safe, Alter said, and there are three levels to this type of safety. The first is providing the data to a researcher under a data protection plan and making the researcher responsible for the data protection. Alter said that the consortium has been working on improving data protection plans. Because the technology for handling data is changing so quickly, ICPSR is trying to develop data protection plans that focus not on the technology but rather on the risks and on how a researcher plans to deal with them.
The second level of safety is using remote submission and execution. In this approach, the data are held in a data center and the researcher submits program code that is executed at the data center, with the results returned to the researcher. A virtual data enclave is an easier-to-use version of remote execution, Alter noted. Researchers gain access to and manipulate the data remotely from their own computers, but the data and the manipulations remain on the data center’s computers. Results are sent to the researcher after being reviewed. However, the researcher can still see confidential data on his or her screen so there must be a data use agreement as well.
The third level is the use of a physical enclave, to which the researcher must go in person to gain access to the data. “We have a room in our basement that is locked up, and we frisk people when they come in, and we go through their pockets when they come out to make sure that they are not taking anything that they should not,” Alter said. The physical enclave provides the most control over the data, but it is the most intrusive for the researchers because they must travel to get to the data.
There are two main ways to create safer people, Alter said. The first is to use data use agreements. The University of Michigan, ICPSR’s parent institution, signs data use agreements with both the researchers who produce the data and the researchers who use the data. For data producers, there is a data dissemination agreement that specifies how ICPSR will manage and preserve the data. Researchers who receive data sign an agreement describing how they will protect the confidentiality of research subjects.
ICPSR requires researchers to provide a research plan, IRB approval, a data protection plan, behavior rules, and also an institutional signature. In the institutional agreement, the member institution must agree that if the consortium alleges a violation or breach of the agreement by the researcher, the institution will treat that as research misconduct and pursue that individual under its own research misconduct policies. “I consider this one of the strongest things that we do,” Alter said, “because we are saying it is not just the individual’s responsibility, it is really the institution’s responsibility to assure compliance with the agreement.”
The second approach is training. Until recently, Alter said, ICPSR has not done a particularly good job in training researchers about disclosure protection, but it is now developing an online training course that researchers will need to complete before they get access to the consortium’s data. It is designed to teach researchers about disclosure risk, about how they can use information technology to protect the data, and how they can make sure that their research is not published in ways that will reveal identities.
Safe outputs—making sure that the results of data analysis do not threaten confidentiality—are often ensured in the context of safe places. For example, remote submission and execution allows individuals at the data center to control what is returned to the researcher and to make sure that nothing in the results threatens confidentiality.
ICPSR has been developing ways of releasing data in which they adjust the requirements of release to the characteristics of the data, Alter said. For example, he explained, in the case of a national survey such as an opinion poll, which has very few identifying questions, “we certify the data as having very low risk of re-identification or harm” and provide it under a simple terms-of-use agreement, in which the user agrees not to re-identify anyone in the sample. For more complex data that have greater risks, ICPSR imposes a stronger user agreement and such technology as the virtual data enclave. “And for the stuff that is really radioactive,” he said, “we put it in our basement in the enclosed data center.” In this way, ICPSR is able to control outputs from the data and make sure that nothing threatens the confidentiality of the individuals whose sensitive information appears in the dataset.
Who Should Be Responsible?
Alter also addressed the question of who should be responsible for making sure that the confidentiality of shared data is protected. In general, he said, it is usually the IRBs of the data producers and the data producers themselves who think most deeply about the issues of risk and harm, because they are most closely associated with the research subjects. Thus, he said, “we usually rely on the data producer to tell us the terms of dissemination.”
On the other hand, many IRBs do not have expertise in disclosure risk, which can get very technical. Furthermore, the data in a data center may persist longer than the IRB itself, or at least longer than the membership of the IRB. So it is important, Alter said, that there be centers of expertise in disclosure risk that can advise IRBs of what to do, and that there also be a system that provides for an IRB to take over supervision of a dataset if the original IRB is unable or unwilling to continue. Data repositories can play a role here, he said.
Ultimately, though, it is the institutions that receive data that are responsible for the security of those data, Alter said. Ideally, the IRBs of the data recipients should defer to the protocols established by the original IRBs. The recipients’ institution, having signed an institutional agreement, is responsible for compliance with the data use agreement and for investigating any alleged violations. The recipient institution is also responsible for making sure that data users understand disclosure risks and that they behave safely.
Finally, there is the question of who is responsible for paying the costs of sharing confidential data. Often the institutions that pay for the data are willing to assume the cost of distribution, Alter said, “but I think for many things we are going to be moving to a situation where the data user, because using confidential data has special costs associated with it, is going to have to pay user fees for access to confidential data.”
Taylor Martin, of Utah State University, described her research in mathematics education and discussed what the proposed changes to the Common Rule could mean for education research. Martin studies how people learn mathematics and how mathematics education can be improved. The recent explosion in computer learning methods, such as online courses or online games designed to teach math skills, offers a “new microscope,” she noted, that can be used to study how people learn and then apply those insights to helping them learn better. For example, she explained, by analyzing how children interact with a game that teaches them about fractions, she was able to identify several patterns of learning. Some approached the game haphazardly, she explained, trying different things in a seemingly random way. Others carried out a more careful exploration, trying things in a very structured way. Still others thought carefully before each step, trying to zero in on the answer as efficiently as possible. Having identified these different patterns, Martin said, it becomes possible to see how the patterns relate to the effect of playing the game on students’ test scores, to explore which teaching strategies work best for different types of students, and to see ways to modify the game to encourage children playing it to try different approaches to maximize their learning.
The data derived from such observations can be combined with various other types of data, Martin said, such as neural activation patterns recorded during learning sessions, to provide more insights into learning. The lessons learned from such learning games can be used to develop general learning principles that can be applied to other games.
Martin characterized her approach to improving mathematics education as “big vision, big data.” The “big vision” includes four components: (1) personalized learning, (2) connected learning, (3) anytime/anywhere learning, and (4) increasing opportunity for all children in science, technology, engineering, and mathematics (STEM) education. There is a major push by education businesses to develop personalized learning, which involves watching how children interact with their learning environment—what kinds of resources they use, what lessons they learn and how they learned them, and so on—and then personalizing the environment to reflect individual children’s learning styles. Keeping track of a student’s progress, and providing feedback to students, parents, and teachers, is an important component of this approach.
Connected learning, Martin explained, refers to creating connections between the various places where children learn. Children spend a relatively small percentage of their lives in school classrooms, and they learn in many other settings—in such places as the Exploratorium, the science learning center in San Francisco, and in online sites where they may learn to program, talk to their friends about programming, share their programs, and do other programming-related activities.1 Ideally, these learning settings should be connected. Anytime/anywhere learning refers to the possibilities offered by online learning, including massive open online courses (MOOCs). MOOCs and related approaches to online learning allow students to listen to lectures, do practice problems, and take tests anywhere, at any time.
The “big vision” Martin described calls for using personalized learning, connected learning, and anytime/anywhere learning to help interest kids in and teach them about STEM topics, giving all children the opportunity to learn about math and science. Achieving that vision will be helped along by the growing presence of “big data.” Martin characterized the present state of affairs as a “biggish data” world rather than a big data world, but believes that a world characterized not only by tremendously large amounts of data but also by rich data streams that provide a great deal of data on any one individual, by connected data streams, and by shared data, is fast approaching.
The Effect of the Proposed Changes on Education Research
Martin spoke about the effects that the proposed changes to the Common Rule would have on her research and on education research in general. She shares with other speakers the goal of having more readable and understandable consent forms and agrees that continuing review should not be “one size fits all.” She believes IRB forms should be simplified, and that multisite studies should have a single IRB.
Focusing on the issue of information risk and educational data, Martin noted that she has been running education studies for 25 years and has kept her data stored in a locked filing cabinet. Nothing she has done in those studies has put a child at risk, she noted. However, she pointed out that massive datasets and powerful computers will increase the potential for exposure of information and introduce a new type of risk. IRBs have not traditionally been trained to assess this new type of risk, she added, so it will be important to develop standards to guide them.
Meanwhile, educational technology companies are collecting a great deal of data, and they are not subject to the restrictions of the Common Rule. Schools, school districts, and states often base their educational decisions on these companies’ analyses. In her view, partnerships with these companies, which have extensive product development capabilities and broad national scope, would be beneficial for many academic research groups. However, she noted, because such companies fall outside the Common Rule, standards for data privacy in these partnerships would need to be clarified.
Potential Information Risk Solutions
Martin suggested a few solutions to the information risks she had raised. First, she advocated that standards be set for risk in the “real universe.” Continuing work on what “de-identified” means will be needed as the possibilities for re-identifying de-identified data grow, she noted. Funding agencies should support the establishment of safe data repositories that follow standard guidelines, she added. For her, the most useful step would be to provide templates for institutional IRBs that instruct them on how to set up data management and safety plans. This would help not only her university IRB but also school district IRBs, many of which are struggling with these information risk issues.
Susan Bouregy, of the Yale University Human Research Protection Program, discussed some of the ways in which the proposed revisions to the Common Rule would lead toward unified data security requirements for human research and what some of the consequences of that might be. One of Yale University’s five IRBs, she explained, is devoted exclusively to reviewing social and behavioral research, and it handles a very diverse range of research, from cognitive development in children to video ethnographies of marginalized communities. Although Yale’s healthcare clinic, faculty medical practice, and self-insured health plan are covered by the Health Insurance Portability and Accountability Act (HIPAA), research carried out by faculty, staff, and students outside these areas is governed by HIPAA only when it makes use of information from those health-related entities. Thus, most social and behavioral research that takes place at Yale is not covered by HIPAA at this time.
Bouregy’s presentation focused on the potential effects of Section 5 of the ANPRM, which deals with strengthening data protections. She identified three areas: (1) harmonizing the concept of “individually identifiable” information, (2) requiring data security protections to be indexed to identifiability, and (3) using HIPAA security and breach notification standards as the model for data protection schemes.
Harmonizing the Concept of Individually Identifiable Information
One of the key proposed revisions related to data protection in the ANPRM is that the Common Rule should adopt HIPAA standards regarding what constitutes individually identifiable information, a limited dataset, and de-identified information. Adopting the HIPAA definition of individually identifiable information would not be a major change, Bouregy said, because the current Common Rule definition is very similar. The major difference is that under the Common Rule identifiability is largely determined based on whether the investigator can identify the participants, whereas HIPAA is much broader.
The two regulations differ more sharply with respect to the question of how data are de-identified, she noted. The Common Rule leaves it to the IRB and the investigator to decide what must be done to data for them to be considered de-identified, while HIPAA is much more specific about what must be done. It lists 18 identifiers that must be removed for data to be considered de-identified. Alternatively, a statistician can perform a documented risk assessment to show that there is very little risk that the data can be re-identified.
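The identifier-removal approach can be sketched in a few lines of code. The following Python fragment is a minimal, hypothetical illustration loosely modeled on the HIPAA Safe Harbor idea; the field list is a small invented subset of the 18 identifier categories, not the actual rule, and the record is fabricated for the example.

```python
# Hypothetical subset of direct-identifier fields; the real Safe Harbor
# list has 18 categories and additional conditions (e.g., on dates and
# geographic units), which are not reproduced here.
DIRECT_IDENTIFIER_FIELDS = {
    "name", "address", "zip", "phone", "email",
    "ssn", "medical_record_number", "birth_date",
}

def deidentify(record):
    """Return a copy of `record` with direct-identifier fields removed."""
    return {k: v for k, v in record.items()
            if k not in DIRECT_IDENTIFIER_FIELDS}

interview = {
    "name": "J. Doe",        # fabricated example record
    "zip": "06511",
    "age": 42,
    "responses": ["..."],
}
released = deidentify(interview)
# `released` retains only the non-identifier fields, here age and responses
```

The Common Rule, by contrast, leaves this judgment to the IRB and investigator, so under the current rule the same record might be treated as de-identified or not depending on the reviewing body.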
The practical effect of modifying the Common Rule to meet the HIPAA standard, she said, would be that a great deal of data that would generally be considered by an IRB to be de-identified will no longer meet the criteria. For example, ethnographic interviews that include a zip code or some other geographic information would now be considered identified. This is important because the issue of whether data have been de-identified will affect the data security requirements and the level of review for a study. In particular, a study whose data are not considered to be de-identified cannot be exempted from review.
On the other hand, Bouregy said, this change would address the problem that there is no single, generally accepted term in the literature that is used to convey the concept of “de-identified” or “anonymous” or “unlinked.” There are dozens of different terms used to convey this idea, and “it would be really nice to have a unified term,” she noted. The adoption of a clear definition of “de-identified” could also help clear up confusion on the part of IRBs and investigators regarding what constitutes de-identified data, she said.
Indexing Data Security Protection to the Level of Identifiability
For identifiable data, under the proposed changes, the Common Rule would mandate a minimum level of data security that is indexed to the identifiability of the data. In particular, the proposal would use HIPAA data security standards as the model. This would change how researchers and institutions deal with data in several important ways. First, the standards require encryption of data at rest (in desktops, laptops, thumb drives, smart phones, etc.), which comes into conflict with export control issues. It is illegal, for example, to take an encrypted laptop to certain countries. Second, HIPAA requires secure transmission of data, and the necessary e-mail encryption is difficult to use. Strong physical security is required, which can be a problem for researchers working in a remote location. Access control and logging are other HIPAA requirements that can be difficult to adhere to in the field.
Several issues would arise if the HIPAA requirements were adopted, Bouregy said. First, in her view, IRBs are not necessarily the best place for determining appropriate data security plans, but the proposed rule would require IRBs to become even more involved in data security plans than they are now. Under the proposed rule, even excused research would be subject to these data security standards, she added. Second, she noted that not all identified data are risky, and not all studies promise confidentiality. “We have plenty of studies where there is no risk to the participants by having their name associated [with the data], and so there is no need to go through this process,” she said. Also, the proposed rules would greatly increase the cost of reviews, because IRBs would have to review more studies and go into greater detail regarding their data security plans.
Bouregy described several types of studies that were performed by members of the political science department and noted that the risks associated with identification of subjects ranged from great to essentially nonexistent. For example, some studies of efforts to promote voting in the United States have included data on who voted in local elections, which is publicly available information. These data do not require a stringent security plan, she observed, but added that a similar study that was carried out in an emerging democracy could put some participants in the study at risk. “So it is not necessarily the identifiability of the data but the sensitivity of the data in context that needs to be taken into account,” she said.
Thus instead of a minimum level of required data security protection, she said, she would prefer to see some sort of detailed guidance for IRBs and researchers that evolves over time. The IRB is best suited to determining the risk of harm, and the principal investigator is best suited to determining what is manageable in the field. In her view, the best approach would be to provide them with guidance concerning the appropriate data security plan for low-, medium-, and high-risk data.
Incorporating the HIPAA Breach Notification Requirement
Bouregy also discussed using HIPAA security and breach notification standards as the model for data protection schemes. A breach, she noted, is unauthorized acquisition of or access to data. Under the HIPAA regulations, any access, use, or disclosure of personal information in a manner not in compliance with the rule is presumed to be a breach unless a risk assessment demonstrates that there is a low probability the data have been compromised. “That is pretty stringent,” she observed.
By contrast, under the Common Rule, a data breach is treated as an adverse event or unanticipated problem that must be reported to the IRB. The IRB can then consider notifying participants as part of a risk-mitigation strategy. In making that decision, Bouregy said, IRBs generally take into account such factors as the possible extent of the harm, whether anything can be done to further mitigate the problem by notifying the participants, and whether the subjects would want to know that their data were compromised, given the nature of the data and any confidentiality promises that were made. The HIPAA approach would not allow so much flexibility, and adopting it for the Common Rule would likely lead to increased costs. According to estimates, she said, it costs about $200 per record to do an investigation and notify participants of a breach incident.
The most relevant difference for social and behavioral researchers, however, may be that under the HIPAA approach the IRB would not have the ability to consider the context of a breach, which will influence both its significance and the value of providing notice of the breach. For example, if a researcher conducting a study in another country lost the data after returning to the United States, the risk to the subjects would likely be quite low. “The idea of going back and notifying that population back in the little village in some other country gets a little absurd,” Bouregy said, and it would not really be of much value to the subjects.