Since the publication of the Common Rule in 1991, no aspect of human society has changed so dramatically as information and its rapid production, availability, and retention. The amount of information storage has grown at an annual rate of 25 percent, and the technological capacity to process information has grown even more rapidly (Hilbert and Lopez, 2011). So much information about individuals is freely and openly available that informational risk is ubiquitous in society. In many respects, informational risk is an everyday aspect of life in the 21st century, and it has the potential to change the meaning of informed consent.
While the level of risk varies, some risk exists in all forms of information: public or private, digitized and rapidly generated or not, collected for research purposes or not, readily identifiable or not, and mundane and routine or personal and sensitive. For most social and behavioral research, the primary risk is informational. Thus, this report devotes special attention to informational risk and the different forms of information used, harvested, or collected by investigators as they pertain to the Federal Regulations for the Protection of Human Subjects.
In this chapter, the committee addresses informational risk and data protection as an extension of the Chapter 2 recommendations concerning the newly proposed category of excused research set forth in the Advance Notice of Proposed Rulemaking (ANPRM; 76 Fed. Reg. 44,512). Consistent with the ANPRM and as discussed in Chapter 2, the excused category is intended and particularly well suited for addressing informational risk involved in (a) surveys, questionnaires, or other methods of information
gathering from individuals or (b) the use of pre-existing research or non-research data that include private information. The new category would cover a large proportion of studies in the social and behavioral sciences in which the research procedures themselves involve informational risk, but where that risk is no more than minimal when appropriate data security and protection plans are in place. Chapter 2 dealt specifically with the definition and characteristics of excused research and with issues related to its registration. This chapter focuses on the required data protection that needs to be calibrated to the type and level of informational risk in order to avoid inadvertent disclosure or to reduce the level of any potential risk to participants to no more than minimal.
The issue of data protection spans the spectrum of methods and modes of inquiry in the social and behavioral sciences, whether qualitative or quantitative, longitudinal or experimental, observational or questionnaire-based, or micro- or macro-level or large-scale. With excused research, investigators need to address data protection appropriate to the research and calibrated to informational risk.
The consideration of data protection and informational risk draws on expertise within the social and behavioral sciences. These research fields, the federal statistical agencies, and data providers for the social and behavioral sciences have, over decades, pioneered procedures and mechanisms for vetting data as public-use data files and providing access to restricted data under various data protection plans calibrated to the level of risk. For more than 30 years, the National Research Council (NRC) has issued reports and guidance that take into account changing information-risk circumstances. For example, awareness of the increased capacity to re-identify data has led to a greater emphasis on restricted-use data and the development of procedures for using and protecting such data. Similarly, awareness of the research potential of video observational data in classrooms or other group settings has led to access to such data under restricted-use conditions.
Helpful guidance on data protection plans and data use agreements is available from federal agencies (in particular, the federal statistical agencies), from data providers such as the Inter-university Consortium for Political and Social Research (ICPSR), from large-scale multi-investigator data projects, and from NRC reports and the scholarly literature (e.g., National Research Council, 2003; O’Rourke et al., 2006). More than 10 years ago, Seastrom (2002) provided an overview of agency-specific features of data use agreements and licenses. Also in 2002, the National Human Research Protections Advisory Committee issued recommendations on confidentiality and research data protections that include a compilation of federal
research confidentiality statutes and codes useful to investigators and their institutions.1
The thrust of the guidance is to seek to maximize use consonant with confidentiality protection of private information. Reports, such as Expanding Access to Research Data (National Research Council, 2005) and Putting People on the Map (National Research Council, 2007), offer useful roadmaps on mechanisms to protect data and facilitate use. Plans to protect against and minimize inadvertent disclosure or intentional intrusions include institutional as well as technical and statistical approaches. Licensing agreements with strong penalties for infraction, data enclaves, and secure access mechanisms (where data stewards execute the analyses) are typically used when there is strong risk of disclosure. From a technical point of view, data limitation, alteration, and simulation can also be used, although they limit the data that are available for analysis (National Research Council, 2007, Chapter 3).
Building on this foundation, the chapter opens with a definition, description, and general discussion of informational risk in research. While agreeing wholeheartedly with the ANPRM desire to reduce the amount of time institutional review boards (IRBs) spend evaluating informational risk, the committee disagrees strongly with the ANPRM view that the Health Insurance Portability and Accountability Act of 1996 (HIPAA) provides an appropriate standard for specifying data protection plans generally or specifically with respect to social and behavioral research. The chapter specifically discusses HIPAA limitations in this context. Data protection issues and mechanisms are also described, and committee recommendations are offered for strengthening data protection.
Looking to the future, the committee proposes that the federal government (specifically, the U.S. Department of Health and Human Services, HHS) take steps to continue to promote institutional and methodological mechanisms that maximize researcher access to data while protecting the confidentiality of data and ensuring informational risk that is no more than minimal. As noted earlier in this chapter and in Chapter 2, the social and behavioral science community and related institutions and federal statistical agencies have played a leadership role in reconciling researcher access to private information with confidentiality protection and risk reduction (see also Levine and Sieber, 2007). However, given rapid developments in data production, dissemination, and use, it would be timely and wise for revisions to the Common Rule to be accompanied by investment in some form of organizational or institutional entity dedicated to addressing new types of informational risk and mechanisms of risk reduction. For heuristic purposes, the committee outlines one such approach in the form of a national
1See http://www.hhs.gov/ohrp/archive/nhrpac/documents/nhrpac14.pdf [December 2013].
center with sufficient expertise in data protection to inform investigators, IRBs, and data providers about (a) how to carry out ethically responsible use of private information made possible through new technologies, (b) innovative use of institutional arrangements and technology for managing informational risk, (c) standard typologies of risk, and (d) standard solutions for managing risk that researchers could readily adopt.
The chapter also discusses the continued need to facilitate data sharing, a longstanding practice in social and behavioral research. This topic is considered here because of the ANPRM proposals on the use of pre-existing research and non-research data, the benefits to human subjects as well as science and society of further analysis of existing information, and the importance of data sharing consonant with data protection and minimizing informational risk. Finally, the committee notes that, in the rapidly changing environment of information and information technology, an ongoing research program is needed to ensure that regulation of informational risk continues to be adequate and appropriate.
Informational risk is the potential for harm from disclosure of information about an identified individual. For much of social and behavioral research, informational risk is the only or the primary risk, so social and behavioral research is particularly concerned with its management. However, all research on human subjects contains some element of informational risk, as Lowrance (2012) noted. Data sharing, which is common in social and behavioral research and is becoming increasingly common in biomedical research, requires specific plans for managing informational risk. While changing circumstances can create new challenges for managing informational risk, the social and behavioral sciences bring decades of experience and accumulated expertise for doing so effectively (Levine et al., 2011; National Research Council, 2003, 2007, 2010).
As with all other types of risk, the central criterion for determining whether the informational risk in research requires IRB review is the benchmark of minimal risk. Understanding this benchmark, and evaluating whether the risk in a particular study or data-sharing activity falls above or below it, necessitates careful consideration by investigators before they decide whether to classify and register their research as “excused” as set forth in Chapter 2. Minimal risk is conventionally defined as no greater than the risk encountered by the general population in everyday life.2
2For the current interpretation of “minimal risk” under the Common Rule and the committee’s suggested revised definition, see the section, “Defining Minimal Risk,” in Chapter 3.
As with any participant risk that arises in the context of research, investigators have an ethical obligation to minimize the informational risk involved in achieving the goals of the research, but compromising research goals to reduce risk that is already below minimal is not in the best interests of science or of the human subjects of that research.
As discussed in the Chapter 3 section “Calculating the Probability and Magnitude of Harm,” risk in the language of the Federal Regulations for the Protection of Human Subjects is the product of two considerations: the probability of an outcome occurring and the magnitude of harm from that outcome. The most relevant harms3 from information disclosure are potential economic harms (e.g., loss of job, insurance coverage, or economic assets), social harms (e.g., loss of or damage to social relationships such as marriage), and criminal or civil liability (e.g., arrest for illegal behavior). Information made known in some contexts can also increase the risk of physical harm (e.g., spouse abuse) or psychological harm (e.g., personal information that, if revealed, could trigger depression). The magnitude of harm depends on the type or content of information being collected about participants in a study. Highly sensitive information, such as illegal activity or HIV status, has greater potential for harm than less sensitive information such as participants’ opinions or hours of work. Currently, IRBs have the task of assessing the sensitivity of information and the magnitude of harm, and in that task they vary in their likelihood of overestimating the potential for harm from information (Green et al., 2006).
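The probability-times-magnitude calculus described above can be sketched in code. This is a heuristic illustration only: the probability values, the 0–10 harm scale, and the data-element names are hypothetical, not drawn from the regulations or from any agency guidance.

```python
# Illustrative sketch: informational risk as the product of disclosure
# probability and magnitude of harm, computed per data element.
# All numeric values below are hypothetical.

def informational_risk(p_disclosure, harm_magnitude):
    """Expected harm = probability of disclosure x magnitude of harm."""
    if not 0.0 <= p_disclosure <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return p_disclosure * harm_magnitude

# Hypothetical elements: (name, P(disclosure), harm on an arbitrary 0-10 scale)
elements = [
    ("hours of work", 0.05, 1.0),    # low sensitivity: modest harm if disclosed
    ("HIV status", 0.05, 9.0),       # high sensitivity: same probability, far greater harm
    ("illegal activity", 0.02, 8.0), # tighter controls can lower the probability term
]

for name, p, m in elements:
    print(f"{name}: expected harm = {informational_risk(p, m):.2f}")
```

The sketch makes the committee's point concrete: the same disclosure probability yields very different risk levels depending on the sensitivity of the element, which is why protection should be calibrated rather than uniform.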
Much more difficult, for IRBs and researchers alike, is determining the probability of disclosure. Disclosure occurs when information about a human subject is available to unauthorized personnel and can be associated with that subject’s identity. There are two basic ways this can happen: through negligence in protecting identified data or through re-identification of a participant from information in a dataset that has presumably been de-identified (also called “secondary disclosure”). The de facto goal of current practice, maintaining the risk of secondary disclosure at near-zero levels, may be a worthwhile aim in some cases, but only as long as it does not produce hyper-regulation in scrutinizing minimal risk research. As noted earlier, the proposed introduction of an excused category aims to insulate research from overestimation of disclosure risk when that risk is no more than minimal or may already be at or near zero. From a cost-benefit perspective on optimal regulation, current IRB practice over-regulates informational risk.
3See also the section in Chapter 3 titled “Potential Harms Resulting from Inadequate Confidentiality Protections for Social-Behavioral Research.”
The continuing challenge for investigators, IRBs, institutions, and data providers is twofold: (1) how to build adequate data protection plans in an environment where both the nature of private information and the technology to protect or disclose such information can change rapidly, and (2) how to do so while meeting the twin goals of minimizing individual risk of harm and maximizing research benefit. The former challenge requires a deep analysis of the granularity of the data in any one dataset, the relationships between datasets, and the potential for identity disclosure, as well as of the strength of the data protection plan and how and under what conditions access will be provided to users.
Informational risk can be conceptualized as the probability of unintended disclosure in storing, using, and reporting on research data, multiplied by the magnitude of the harm from such disclosure. The measure of harm is not static: there is some evidence that norms associated with informational risk and informed consent are evolving. Nissenbaum (2011, p. 34) notes that it is increasingly difficult for many people to understand where the old norms end and new ones begin because “[d]efault constraints on streams of information from us and about us seem to respond not to social, ethical, and political logic but to the logic of technical possibility: that is, whatever the Net allows.” And these views are changing rapidly. The sources of the norms that have guided IRB decisions, particularly with respect to consent, identifiability, public interest, safeguards, and indeed the very notion of “privacy,” have also changed, not just in this country but in many others (Lowrance, 2012). Research data are less likely than in the past to be a carefully curated dataset produced by a statistical agency or research institute and resulting from careful experimental or longitudinal design. New norms that use different types of controls are evolving (Landwehr, in press; Pentland et al., in press). While federal statistical agencies, data providers, and others who allow use of restricted data have set standards for access and use, there needs to be continuing attention to trends in data protection and disclosure risk over time.
Technology has also changed the research risk-and-benefit calculus. In the past, the focus was often on de-identification to avoid the risk, but such an approach is now less likely to preserve the research utility of the data. Norms on identifiers and outliers must be reconsidered if research benefit is to be maximized. Identifiers, or key data elements, now need to be retained so that data from one source can be linked to multiple other sources. Data are more likely now to be part of a communally developed data infrastructure or observatory. Identifiers are necessary in order to match with other population datasets and make appropriate statistical inferences. Data on atypical cases need to be preserved. While early social
and behavioral research focused on describing population characteristics, modern research in the social and behavioral sciences also studies the behavior of individuals or businesses at the tail ends of the population distribution (e.g., health care costs that are disproportionately driven by a small proportion of the population or innovative business activities that result from the creative energies of a few unusual entrepreneurs). As a result, it is much more important to retain data on outliers: standard disclosure limitation techniques thus do not always apply. When direct identifiers (name, address, etc.) must be retained for future use, best practice is to maintain them on storage systems that are isolated from the storage systems holding information about the subjects. Protection of direct identifiers can be handled by good data management.
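The isolation practice described above, keeping direct identifiers apart from the research data, can be sketched as follows. The stores, names, and variables here are hypothetical; in a production system the two stores would live on physically separate, access-controlled systems rather than in one process.

```python
# Sketch of identifier isolation: direct identifiers are held in a separate
# store, linked to the analytic records only by a random study ID.
import secrets

identifier_store = {}  # study_id -> direct identifiers (isolated system)
analytic_store = {}    # study_id -> research variables only

def enroll(name, address, research_vars):
    """Register a participant, separating identifiers from analytic data."""
    study_id = secrets.token_hex(8)  # random link key, not derivable from identity
    identifier_store[study_id] = {"name": name, "address": address}
    analytic_store[study_id] = dict(research_vars)
    return study_id

sid = enroll("Jane Doe", "123 Main St", {"hours_worked": 38, "opinion_score": 4})
# Analysts receive only analytic_store; re-linkage requires access to both stores.
assert "name" not in analytic_store[sid]
```

Because the link key is random, a breach of the analytic store alone discloses no direct identifiers; this is the sense in which good data management handles protection of direct identifiers.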
There have also been massive changes in the risk of re-identification, given the public datasets that exist to support re-identification and the tools available, both to anonymize and to de-anonymize the data. In addition, the baseline levels of both risk and harm have changed, given the vast amount of information already in the public domain. Determining the risk level of the data becomes harder in this environment, and experts are needed to understand the risk of harm from a given dataset. Re-identification in turn depends on the subject, the level of detail, the type of media, and the availability of possible match factors. None of these elements is static, and fundamental challenges will be faced in getting the calculus right. If IRBs are too cautious, they risk suppressing valuable social and behavioral research.4 If they are not cautious enough, they risk harming individuals. The benefit of understanding social and behavioral science trends over time must be balanced with the need to protect personal data.
Informational risk will continue to increase. The volume and type of data used for social and behavioral research will introduce many new types of identifying elements; the potential for re-identification will increase with more and better types of matching tools and algorithms. Fortunately, the very same technological change that has led to increased potential for loss of confidentiality and other harms has also led to enormous advances in the tools available to protect confidentiality. For IRBs to meet the goal of enabling valuable social and behavioral research, a more flexible system must be developed that better measures and minimizes informational risk.
4The social benefit from using the data must be a consideration. Since the tragic events of September 11, 2001, for example, the need for behavioral research to understand the human characteristics and dynamics in extremism has grown significantly (see, for example, Atran, 2003).
As stated above, the best way to protect human subjects while minimizing the regulatory burden on IRBs and researchers is through adequate protection against disclosure. Matching levels of risk to levels of protection simplifies regulation and allows for clearer communication to participants about the actual level of risk. The ANPRM proposes that elements of the HIPAA Privacy Rule be adopted as the mandated data security and information protection standard for all research data.5 As argued below, a single standard based on HIPAA is not a workable solution.
The ANPRM asks whether study subjects would be sufficiently protected from informational risks if investigators were required to adhere to a strict set of data security and information protection standards modeled on the Privacy Rule and Security Rule elements of HIPAA. The guidance offered by HIPAA is neither necessary nor sufficient, for several reasons: the disconnect between the two rules, the failure to quantify risk, the failure to take into account the research value of data elements, and the focus on individual rather than group risk. These reasons are explained in the next two sections.
Disconnect Between the Privacy Rule and Security Rule
The disconnect between the two HIPAA rules stems from the fact that the Security Rule does not provide guidance on protecting information in proportion to its risk of disclosure; it only identifies mechanisms that can be enacted or not. Although information security requirements from the Security Rule might be combined with confidentiality requirements from the Privacy Rule, the Privacy Rule was not designed as a flexible confidentiality protection framework. The Security Rule, for its part, offers relevant guidance on constructing an information security framework, but it pays little attention to maintaining the confidentiality of information beyond limiting access to authorized users. Limiting access is an important principle of data protection, but it is not sufficient for mitigating informational risk.
In particular, the HIPAA Security Rule focuses on administrative, physical, and technical mechanisms in order to prevent the misuse of information in transmission or inappropriate access to data residing on a computer’s hard drive. Within these mechanisms, it enumerates specific controls (e.g., unique log-ins for users of data), which are either “required”
576 Fed. Reg. 44,525.
or “addressable.” When a control is addressable, the organization (or researcher) managing the data must document why it chose not to implement the control in question.
Failure to Quantify Risk, Failure to Account for Value of Research, and Failure to Consider Group versus Individual Risk
The failure of the HIPAA Privacy Rule to protect social and behavioral research data stems from its underlying approach, which allows data derived from participants to be studied in one of three ways.
1. Information can be used in an identifiable form if it has already been collected (or is “on the shelf”) and it is impracticable to obtain consent. In such a case, the requirement for consent can be waived and data that contain explicit identifiers (e.g., personal name) can be used for research, provided appropriate protection mechanisms (such as those specified in the HIPAA Security Rule) are set in place.
2. Less oversight is necessary if data are disclosed as a “limited dataset.” In this case, the data must be stripped of 16 enumerated features associated with the participant, such as Social Security numbers, telephone numbers, and specific residential addresses. In addition, the recipient of the limited dataset and the organization sharing the data must enter into a binding contract that prohibits the recipient from attempting to re-identify the records and from using the data for purposes other than those specified in the contract. This approach to data protection is clearly less risky than using fully identified data under a waiver of consent, but the enumerated list is a heuristic that provides little quantification of the actual risk. Benitez and Malin (2011) have shown that application of such a policy leads to variable risk, depending on the region of the country from which the research participants come.
3. If a dataset is de-identified, then it is no longer covered by HIPAA. This is the case when health information “does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual” (45 C.F.R. § 164.514). The Privacy Rule provides several ways in which de-identification can be achieved. The first is an extension of the limited dataset from 16 to 18 identifiers, plus an attestation that the provider of the data “does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information” (45 C.F.R.
§ 164.514). This strategy does have less risk than a limited dataset, but it, too, suffers from the fact that its guidelines are independent of the actual data and do not provide an actual quantification of risk. The 18 enumerated features are common to medical records, which HIPAA was designed to regulate, but do not include other potentially identifying data elements that might be present in social and behavioral research data. Conversely, the presence of one or even several of the enumerated elements in isolation from the others may not lead to any significant risk of re-identification in, for example, large population-based samples.
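The limitation just described, that enumerated-list rules operate on field names rather than on the actual data, can be illustrated with a short sketch. The field list below is abbreviated and hypothetical, not the full HIPAA enumeration, and the record is invented.

```python
# Hedged illustration of an enumerated-list (Safe Harbor-style) rule:
# strip listed fields regardless of what the data actually contain.
# This field list is abbreviated and hypothetical.
ENUMERATED_FIELDS = {"name", "ssn", "telephone", "street_address", "email"}

def strip_enumerated(record):
    """Remove every field on the enumerated list; keep everything else."""
    return {k: v for k, v in record.items() if k not in ENUMERATED_FIELDS}

record = {"name": "J. Doe", "ssn": "000-00-0000", "zip3": "481", "diagnosis": "flu"}
print(strip_enumerated(record))  # {'zip3': '481', 'diagnosis': 'flu'}
```

Note that the surviving quasi-identifiers (here, a three-digit ZIP prefix combined with a diagnosis) may still permit re-identification in combination, which is exactly the variable residual risk that Benitez and Malin (2011) document.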
Alternatively, the HIPAA Privacy Rule states that de-identification can be achieved when “[a] person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable:
i. Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and
ii. Documents the methods and results of the analysis that justify such determination.”
This mechanism is noteworthy in that it requires actual quantification of risk, and there are various ways in which such risk can be measured. Even with this option available, however, several concerns remain.
First, the de-identification standard is an either/or policy: either a dataset falls outside HIPAA’s protections because it is deemed de-identified, or it remains covered because it is identifiable. There is no quantification of risk beyond this binary distinction.
Second, the HIPAA de-identification policy does not relate confidentiality to the utility of the data. In other words, the priority is put on privacy and not on the balance between the need to protect the data and the need to learn from the data via worthwhile scientific endeavors.
Third, the HIPAA de-identification model emphasizes individual identification and does not address issues associated with group-based risks or with the publication of aggregated summary statistics derived from the data.
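By contrast, the expert-determination route does call for measuring risk against the actual data. One common measure, an assumption offered here for illustration rather than a HIPAA-prescribed method, is the size of the smallest equivalence class over quasi-identifiers, as in k-anonymity; the records and field names below are hypothetical.

```python
# Sketch of a data-dependent risk measure: k-anonymity, the size of the
# smallest group of records sharing the same quasi-identifier values.
# Records in small groups are the most re-identifiable.
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the minimum equivalence-class size over the quasi-identifiers."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())

data = [
    {"zip3": "481", "age_band": "30-39", "score": 4},
    {"zip3": "481", "age_band": "30-39", "score": 7},
    {"zip3": "606", "age_band": "60-69", "score": 2},  # unique combination -> k = 1
]
print(k_anonymity(data, ["zip3", "age_band"]))  # 1
```

Unlike the enumerated-list rules, a measure of this kind responds to the dataset at hand: the same field list can yield high or low risk depending on how many records share each combination of values.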
Based on these arguments, the committee concludes that HIPAA would not be the most suitable standard for the protection of many types of research, including research in the social and behavioral sciences.
Recommendation 5.1: HHS should not mandate HIPAA as the standard for data security and information protection.
In recommending that HIPAA not be mandated as the data protection and security standard, the committee is not suggesting that another particular set of standards be mandated for the social and behavioral sciences, but rather that investigators draw on an array of data protection approaches that best fit the needs of the research. These can include
• planning data protection with the concept of a portfolio approach considering safe people, safe projects, safe data, safe settings, and safe outputs;
• utilizing a wide range of statistical methods to reduce risk of disclosure;
• consulting resources and data protection models that can help researchers and IRBs, such as university research data management service groups, individual IT/data protection experts, and specialized institutions such as ICPSR and NORC at the University of Chicago;
• drawing on existing standards for data protection promulgated by the National Institute of Standards and Technology (NIST); and
• developing a future national center to define and certify the levels of information risk of different types of studies and corresponding data protection plans to ensure risks are minimized.
These approaches will be discussed in more detail in the next sections.
Once the risk profile is determined, the next step is to define a data protection plan that addresses the risks identified in the research. The changing technological environment discussed above means that researchers and IRBs need a current and reliable source from which to determine what reasonable measures can be taken to protect confidentiality that rely less exclusively on statistical approaches. Data protection plans should use a diversified approach to minimize disclosure risk: safe projects (valid research aims), safe people (trusted researchers), safe data (data treated to reduce disclosure risk), safe settings (physical and technical controls on access), and safe outputs (reviewing products for disclosure risk) (Ritchie, 2009). Yet the same changing technology that has made it much more difficult for individual investigators and IRBs to know how to ensure such safe use has also made it possible to identify new types of controls.
As noted earlier in this report, diverse sources of guidance exist for selecting among approaches for protecting data that can be calibrated to the level of informational risk and the identifiability of the data. Prior NRC reports set forth in considerable detail different approaches for data protection, data use, and data sharing (see National Research Council, 2005, 2007). The issues are sufficiently compelling that they continue to be examined as new forms of data or new technologies emerge. Forthcoming examples include responsive rules-based systems governance and fine-grained authorizations for distributed rights management (Pentland et al., in press), as well as approaches that institute access control and information flow policies or use media encryption, attribute-based encryption, or secure multiparty computation (Landwehr, in press).
Protection also means limiting the set of people who get access to a dataset (or resource) or limiting the information that is disclosed to the people who can get access. Protection could also be addressed through audit and liability requirements. These protection measures can be implemented as elements of data curation for the dataset. The aim of any of these approaches should be to maximize research accessibility relative to the level of disclosure risk.
An appropriate data protection plan outlines the mitigations for lowering the informational risk. It should outline both the physical and logical controls to be implemented—not just in securing the data but also in ensuring that only authorized users can access them. Some universities have special research data management service groups to guide researchers in developing data protection plans. For example, the ICPSR website includes guidance and samples, as well as links to resources at other universities in the United States and internationally.6 Federal statistical agencies such as the National Center for Education Statistics (NCES) offer resources and a procedures manual on the use of restricted identifiable data.7
There are multiple examples of new approaches for data protection. NIST Special Publication 800-63-1 is a generally accepted standard for information assurance in protecting information system transactions; it has a tiered scale of protection based on the level of the data (National Institute of Standards and Technology, 2011). The NORC data enclave at the University of Chicago protects data to NIST standards yet enables secure remote access to confidential microdata; statistical agencies use it as a secure method of data dissemination. The enclave also archives and curates data and provides space for virtual collaboration by researchers. A similar
6The webpage described is at http://www.icpsr.umich.edu/icpsrweb/content/datamanagement/dmp/resources.html#a02. [December 2013].
approach has been developed by the UK Data Service at the University of Essex.8 The European Union’s Data Without Boundaries project9 has been funded to enable both onsite and secure remote access to official microdata for research. The University of Michigan likewise has a good data protection model for data collected and distributed through its Panel Study of Income Dynamics: access to unrestricted data is free and requires only a user name and password, while access to restricted data requires a legal agreement and proof of compliance with the university-supplied data protection plan.10 These models, combined with the NIST standard for secure information transactions, could be used by the Office for Human Research Protections (OHRP) to illustrate an appropriate foundation for establishing data protection plans for social and behavioral research data.
When researchers develop data protection plans for specific studies, the plans will vary depending on the nature of the data and, if pre-existing data are being used, on the requirements of the data provider. Some of the major elements of a data protection plan include11
• nature of the data and degree of identifiability (e.g., continuum ranging from highest level of individual-level data with personal identifiers, to lowest level of aggregated community-level data with no identification of community);
• computing environment in which the data will be used (e.g., platform, number of computers, type of computers, network or standalone computers, access to and security of physical environment);
• locations and methods of data storage;
• controls used to secure the data;
• methods of transmitting the data between research team members;
• methods of encryption;
• methods of storage of computer output (electronic and paper);
• specification of who has access to what types of data (e.g., raw, identifiable, de-identified, summary data) and how access is managed (e.g., password management or not, onsite and/or remote access); and
• audit capabilities to track access activity.

11The list of elements was summarized from these sources: Restricted Data Use Agreement with ICPSR. Available: http://www.icpsr.umich.edu/files/ICPSR/access/restricted/all.pdf [December 2013]. Johns Hopkins School of Public Health (May 8, 2011). Data Security Guidelines for Community-Based Research: A Best Practices Document Prepared by the Ad-Hoc Committee for Data Security, Program for Global Disease Epidemiology and Control, Department of International Health. Available: http://www.jhsph.edu/offices-and-services/institutional-review-board/_pdfs-and-docs/ih-nutrition-gdec-data-security-guidelines-final-2011-05-10.pdf [December 2013]. Partners Healthcare Enterprise Research Infrastructure and Services System Information Risk Evaluation. Available: http://rc.partners.org/eris [December 2013].
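The last of these elements, audit capability, can be sketched in a few lines of Python. The user names, file names, and log format below are purely illustrative, not part of any standard; the point is simply that each access event is recorded with an actor, a timestamp, and a content fingerprint so that access activity can later be reviewed.

```python
import hashlib
from datetime import datetime, timezone

class AccessAudit:
    """Minimal audit trail: records who accessed which dataset and when,
    plus a content fingerprint so later tampering can be detected."""

    def __init__(self):
        self.entries = []

    def record(self, user, dataset_name, data_bytes):
        self.entries.append({
            "user": user,
            "dataset": dataset_name,
            "sha256": hashlib.sha256(data_bytes).hexdigest(),
            "accessed_at": datetime.now(timezone.utc).isoformat(),
        })

    def accesses_by(self, user):
        """All access events attributed to one user."""
        return [e for e in self.entries if e["user"] == user]

# Two hypothetical analysts accessing the same (toy) dataset.
audit = AccessAudit()
data = b"id,age\n101,34\n"
audit.record("analyst1", "survey_wave1.csv", data)
audit.record("analyst2", "survey_wave1.csv", data)
```

A production system would write such entries to tamper-evident storage rather than an in-memory list, but even this minimal form answers the audit question a data protection plan must address: who accessed what, and when.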
Guidance Recommended: OHRP should provide guidance for investigators and IRBs on models for data protection plans that illustrate acceptable practices for reducing disclosure risk for research with less than minimal risk, minimal risk, or higher levels of risk. To ensure that the guidance remains relevant in an ever-changing data environment, OHRP should periodically request that Federal agencies, approved data repositories, and scientific societies offer examples and models of best practices for OHRP guidance and assist with FAQs (frequently asked questions).
Recommendation 5.2: In light of rapid changes in data of scientific value and in technologies that can be harnessed to reduce or increase informational risk, HHS should consider developing an institutional or organizational entity such as a national center to define and certify the levels of informational risk of different types of studies and corresponding data protection plans to ensure that risks are minimized.
An entity such as the national center referred to in Recommendation 5.2 could support IRBs and researchers in facilitating the science and in understanding both the risks and the procedural and technical approaches to data protection. Whether it would be better to use existing organizations or to set up a new organizational form within a government agency could be determined through further study. In either case, such an entity could provide essential guidance, as well as anticipate new challenges in informational risk by looking ahead.
Existing data repositories within the United States are actively engaged in addressing how to approach the massive increase in new forms of digital data in order to make them available for analysis and use. Issues of data protection and risk assessment are integral to data access and sharing. Most recently, 22 U.S. data repositories in the social and natural sciences met at ICPSR, a meeting that led to the release of a white paper, Sustaining Domain Repositories for Digital Data (Ember and Hanisch, 2013).
Other countries have also recognized the need to enlarge such services. For example, the Economic and Social Research Council—the UK equivalent of the Social, Behavioral, and Economic Sciences Directorate of the U.S. National Science Foundation—has announced two calls for proposals to establish two key elements within an Administrative Data Research Network.
One call is for proposals to establish four Administrative Data Research Centres (ADRCs), one each in England, Wales, Scotland, and Northern Ireland. The second call is for proposals to set up the Administrative Data Service to the ADRCs. The ADRCs will have the following roles:
• Provide state-of-the-art facilities for research access to de-identified administrative data by accredited researchers.
• Provide data management and statistical analysis support functions for external researchers accessing the data.
• Commission and create new linked administrative data resources for a growing research agenda.
• Conduct original research using linked administrative data and related analytical and methodological approaches.
• Engage in training, capacity building, and public engagement.
• Work in collaboration with other elements of the Administrative Data Research Network.
Another example is the Australian National Data Service (ANDS), which is supported by the Australian government. According to its website,12 ANDS is transforming Australia’s research data environment to
• make Australian research data collections more valuable by managing, connecting, enabling discovery, and supporting the reuse of this data; and
• enable richer research, more accountable research, more efficient use of research data, and improved provision of data to support policy development.
The United States has developed similar capacities, albeit not government supported. For example, the ICPSR at the University of Michigan “provides leadership and training in data access, curation, and methods of analysis for a diverse and expanding social science research community”13 but is supported largely by project-specific grants and contracts. Similarly, the NORC at the University of Chicago “provides a wide range of data services to researchers and data producers… [and] offers the full cycle of data services, ranging from study design and concept to data archiving and access… [and] a main service of providing a confidential, protected environment within which authorized researchers can access sensitive microdata remotely.”14 Rich frontier and practical knowledge has been developed at the Human Dynamics and Media Labs of the Massachusetts Institute of Technology,15 as well as at Microsoft Research.16 However, these specialized organizations are not mandated to provide guidance to IRBs, nor are they likely to have the support staff to do so given their current configuration of resources.

13See http://www.icpsr.umich.edu/icpsrweb/content/membership/about.html [December 2013].
This rich capacity within the United States, as well as in other countries, suggests the value of a dedicated entity that could lead, coordinate, and build upon the depth of knowledge and experience that exists; keep pace with data and technological innovations; and foster research. One attractive option worthy of consideration is to establish a national center of expertise in research data protection technologies. This center could be charged with providing operational guidance to investigators, institutions, or IRBs, derived from interactions among commercial, academic, and government experts. Such a center could have the following features:
• Authority. The center could be authorized by HHS to carry out the activities identified in Recommendation 5.2. It could serve as a resource to support improvements in enhancing data protection and addressing informational risk under varying conditions. It could use its convening authority to bring together broad-based experts. Also, it could serve as a catalyst for research.
• Staffing. The center could employ a research staff to ensure that changes in technology are promptly recognized and researched.
• Expertise. The center could be charged with identifying experts who could certify both established and frontier approaches used by research organizations to protect different types of research data and with providing guidance about the advantages and disadvantages of both.
• Products. The center could be responsible for producing three key products: (1) current guidance about the characteristics of datasets that could be used to create discrete informational risk profiles, conditional on different levels of research utility; (2) a menu of certified data protection plans that would be appropriate to use for each of the risk levels and that researchers and IRBs can use in their work; and (3) a set of recommendations for limiting disclosure when publishing results.
• Dissemination. The center could be responsible for maintaining a constantly updated website for IRBs and researchers to use that characterizes the informational risk profiles of different types of datasets, matches data protection plans to those risk profiles, and provides guidance to IRBs in determining informational risk.

16See http://research.microsoft.com/apps/pubs/default.aspx?id=80239 [December 2013].
Data sharing has been referenced in some of the discussion above concerning data protection, but in this final section the committee discusses specific needs to foster and guide data sharing and responsible use, which is a longstanding practice in social and behavioral research (Levine and Sieber, 2007; National Research Council, 1985). Implicit in encouraging data sharing is encouraging agencies, organizations, and institutions to make administrative records accessible consonant with confidentiality agreements (see, e.g., National Research Council, 2005, 2007). Data sharing is a highly desirable component of an open and democratic scientific community. It allows verification of the original investigators’ findings through replication; it permits novel investigations by researchers with hypotheses different from those of the original investigators; and it creates research opportunities for students and junior investigators without resources for large original data collections. It is increasingly required by federal funding agencies as a condition of research awards.17
Many investigators have neither the expertise nor the continuity of funding to sustain the effort of making data available, particularly if restricted-access arrangements are needed. Data archiving organizations can play a valuable role in promoting data sharing. Their roles could be enhanced if there were credentialing procedures or other guidance to help investigators make appropriate choices among data-archiving organizations.
Guidance Recommended: OHRP should facilitate data sharing by issuing a list of participating and approved data archives that have been reviewed by an OHRP expert panel as having (a) the technical expertise to provide public-use data files and restricted-access data files and (b) the procedures in place for review of such data. Investigators obtaining data from participating archives must adhere to guidelines for public-use data files and to data use agreements in the case of restricted-use data. Adherence to these conditions is essential for classifying investigator use of public-use files as not human-subjects research and of restricted-use data as excused.
17See the Memorandum for the Heads of Executive Departments and Agencies on Increasing Access to the Results of Federally Funded Scientific Research from the Executive Office of the President, Office of Science and Technology Policy, February 22, 2013, at http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf.
Researchers using secondary data remain ethically obligated to protect the privacy of human subjects, whether or not data providers make such conditions explicit. Attempts to identify human subjects in secondary data, or to describe to others methods for doing so, should be considered research misconduct and punished appropriately. The only exception is analysis of disclosure risk when it is authorized by the data provider.
Recommendation 5.3: As a condition of undertaking secondary research on public-use or restricted-access data, investigators have the responsibility to protect the confidentiality of the data and honor the data protection plan and other agreements with the data provider, whether the data provider is one of the primary researchers involved in the study, an agency or institution, or a data distribution organization. The revised regulations and OHRP guidance on data use should make clear that secondary users must honor confidentiality agreements but that no further consent from human subjects is needed to use such data. The revised regulations should also make clear that data providers may share data without consent of human subjects as long as users adhere to the original confidentiality agreements and other conditions of use.
Guidance Recommended: OHRP should clarify that the determination of whether research data collected from human subjects can be distributed to other researchers through public-use or restricted-access agreements should be made by (a) the investigators who collected the data or (b) a data distribution organization delegated by the original investigators and approved by the IRB as the distributing organization.
As set forth in Chapter 2, research on public-use data files is not human-subjects research and falls outside the Federal Regulations for the Protection of Human Subjects. Those preparing such data for public use need to ensure that the data have been de-identified and that the risk of re-identification is zero or near zero. In certifying data for public use, IRBs make a judgment based on this defining characteristic of public-use data.
Research data are not appropriate for public use when they involve informational risk that is potentially more than minimal because they include (a) highly sensitive, private information that could lead to civil or criminal liability or economic, social, or psychological harm or (b) information that could increase the likelihood of re-identification. High standards for de-identification and stringent data disclosure tests may reduce informational risk, though certain variables may need to be excluded from public-use data files. Alternatively, when such data have scientific value such that making them available for research purposes is desirable, there are a number of possible mechanisms to reduce informational risk while allowing research access. As discussed in the context of data protection plans, these mechanisms include licensing agreements and the use of secure enclaves.
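One widely used disclosure test of the kind mentioned above is a k-anonymity check: count how many records share each combination of quasi-identifiers (attributes that could be matched against external data), and flag any combination held by too few people. The sketch below uses only hypothetical field names and toy records for illustration.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the smallest group size (k) over all combinations of
    quasi-identifier values; k == 1 means some record is unique and
    therefore at elevated risk of re-identification."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical survey records (field names are illustrative only).
records = [
    {"zip": "48104", "age_band": "30-39", "sex": "F", "income": 52000},
    {"zip": "48104", "age_band": "30-39", "sex": "F", "income": 61000},
    {"zip": "48104", "age_band": "40-49", "sex": "M", "income": 48000},
]

k = k_anonymity(records, ["zip", "age_band", "sex"])
# The third record is unique on these quasi-identifiers, so k == 1;
# it would need generalization or suppression before public release.
```

Coarsening a quasi-identifier raises k: using only `["zip"]` here gives k = 3, which illustrates the trade-off between disclosure risk and the analytic utility of the released variables.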
Restricted-use data are data about human subjects that retain or include potentially identifiable information and so require special data protection plans to protect against disclosure. In general, the option of combining restricted-use data with public-use data is an expected part of a data-sharing system and should be accounted for in the data protection plan for the restricted-use data. Addition of new public data in a research activity should be registered but does not require additional review. Combining multiple types of restricted-use data may significantly increase informational risk and so requires approval of the data provider and registration of a new data protection plan.
Guidance Recommended: OHRP guidance should clarify that investigators with access to restricted-use data or datasets must have the approval of the data provider to integrate additional restricted-use data. Under such circumstances, the guidance should cover the following situations: (1) Investigators must obtain approval and modify as necessary their data protection plan to account for additional use of restricted data. Such additional study remains excused but must be registered with an updated data protection plan. (2) Under circumstances where investigators have access to restricted-use data and are enhancing these data with publicly available information, they may do so without the approval of the data provider as long as a new data protection plan is registered that accounts for the use of additional public information.
Data linkage is a powerful tool for increasing the scientific value of data collected from human subjects. Opportunities for linkage may arise after contact with human subjects has ceased. Many sources of linked data, such as government administrative records, can only be obtained with consent of the individual whose records are sought. The Common Rule should not impose or encourage such a requirement where it does not exist. Rather, it should in all cases regulate the protection of data so that informational risk from data linkage is managed appropriately.
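One common technical control for managing the informational risk of such linkage, sketched below, is to replace direct identifiers with keyed (HMAC) hashes: two files can then be joined on a shared pseudonym without either party exchanging raw identifiers. The field names and the secret key are illustrative assumptions; in practice the key would be held by a trusted third party or linkage unit.

```python
import hmac
import hashlib

# Illustrative only: in practice this key is held by a trusted linker,
# never by the analysts receiving the linked file.
LINK_KEY = b"shared-secret-held-by-the-trusted-linker"

def pseudonym(identifier: str) -> str:
    """Keyed hash of a direct identifier: stable, so it supports linkage,
    but not reversible without the key (unlike a plain unsalted hash)."""
    return hmac.new(LINK_KEY, identifier.encode(), hashlib.sha256).hexdigest()

# Toy survey and administrative files keyed by pseudonym, not by SSN.
survey = {pseudonym("ssn:123-45-6789"): {"score": 41}}
admin = {pseudonym("ssn:123-45-6789"): {"benefit": "SNAP"}}

# Link the two files on the shared pseudonym.
linked = {pid: {**survey[pid], **admin[pid]} for pid in survey if pid in admin}
```

The linked file carries the combined attributes but no direct identifier, which is one reason a data protection plan for linked data must still be registered: the combination of attributes itself can raise re-identification risk even when names and numbers are gone.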
The specification of an appropriate arrangement is the responsibility of the data provider and the associated IRB. Researchers gaining access to restricted-use data through these arrangements, and their institutions, accept responsibility to protect the data. Conditions often include stiff penalties for violations; for the NCES, violations are a class E felony subject to up to 5 years in prison and/or up to $250,000 in penalties. The terms of the agreements should not in general require review by the IRB of the recipient. Secondary use of restricted-access data, however, should be registered as excused.
Guidance Recommended: OHRP should issue guidance that investigators with access to restricted-use data through site licenses, data enclaves, or other mechanisms operated by government agencies and other data providers are excused from IRB review. They are, however, responsible for registering their research at their own institution, including filing the approval for use of such data and the conditions under which they have obtained access.
Recommendation 5.4: If investigators collected data from human subjects (i.e., primary data collection), additional consent from those subjects is not necessary to subsequently link to other pre-existing data, except under circumstances where human subjects are being asked to participate further in the research or where their original consent prohibited future data linkage. The fact that additional consent is not required to link data does not reduce the responsibility of investigators to modify and register their data protection plans.
Recommendation 5.5: Investigators using non-research private information (e.g., student school or health records) need to adhere to the conditions for use set forth by the information provider and prepare a data protection plan consonant with these conditions, calibrated to the level of risk, and sufficient to reduce the risk of disclosure. Further consent from such individuals is not required as long as investigators pledge to adhere to confidentiality agreements.
Finally, the committee concludes that, in the rapidly changing environment of information and information technology, an ongoing research program is needed to ensure that regulation of informational risk is adequate and appropriate. The following research recommendation is consistent with that of several important NRC reports released over the past 10 years.
Research Needed: (1) Research is needed on innovations in the data use of non-research information and records, new ways of collecting and linking data, and new methods for measuring and quantifying risk and risk reduction techniques. (2) Since it is increasingly difficult to know whether existing disclosure limitation mechanisms sufficiently balance disclosure risks and the utility inherent in social and behavioral research datasets, the committee recommends that (a) disclosure limitation mechanisms be tested against social and behavioral research datasets to identify appropriate methods and develop best practices, and (b) information-disclosure risk assessment and risk mitigation strategies be developed that are consistent with the nature of social and behavioral research datasets.
While the earlier sections of this chapter often had as reference points quantitative, large-scale data surveys and administrative records, the recommendations apply to all forms of data. Qualitative studies, including ethnographic methods and in-depth observational projects, are also amenable to sharing with high standards for protection to ensure that the data are not identifiable. A separate section is nonetheless devoted here to protecting qualitative data because the nature of the interaction between researchers and participants, the data collection process, and the resulting data are substantially different in qualitative methods than in quantitative studies. Consider fieldwork as an example: sociocultural anthropologists, ethnographic sociologists, religion scholars, market researchers, and many others employ it, each in slightly different ways. Fieldwork most generally refers to data collection taking place outside of specialized, researcher-controlled settings or contexts (e.g., a laboratory or survey questionnaire). It can entail everything from observation of rural villagers with little social interaction between researcher and research participants, through short-term “participatory-action” research involving collaboration between a researcher and an urban community in solving a social problem, to long-term, discovery-oriented “participant observation” during which the researcher becomes closely involved with a community or organization and research objectives shift in response to new information. We discuss below issues related to protecting qualitative data and approaches for ensuring that private information acquired is secure.
Protection for Primary Data
As part of their professional ethics in protecting research participants, fieldworkers and other qualitative researchers are trained to keep their notes and recordings secure. They have an ethical obligation to keep confidences not just in note taking but also in their social interactions. Ethnographic field materials (e.g., field notebooks and other notes based on participant observation and interviewing; recordings and transcripts; personal materials collected from informants, such as letters and drawings; photographs, whether created for personal or research reasons; and similar materials created by the ethnographer or given to the ethnographer by persons with whom she or he has a field relationship) need to be protected through secure storage by the researcher, such as locked office file boxes to which only the researcher has access, password-protected computers, and locked thumb drives. Over the past 40 years, the American Anthropological Association has developed a diverse set of case materials and references to an expanding published case literature on ethics and data protection.18 More recently, the American Sociological Association has made extensive case materials available on its website,19 and other professional associations are doing likewise.
Protection for Data Sharing
Qualitative research poses major challenges for privacy protection and data sharing, challenges that are important to recognize in light of funders’ relatively new data-sharing advisories and requirements. Irwin (2013, p. 297) points out that making qualitative data available for secondary analyses is not feasible for many types of ethnographic and field studies because it is not possible to cleanse field notes and other research materials “of the contextual, conceptual, and interactional context in which they were produced and through which they could be understood.” In these cases, research materials are securely curated by the researcher for personal use; upon the death of the researcher, in some cases these materials are archived in repositories having extensive experience curating context-rich documents (e.g., the Smithsonian Institution Archives20). However, qualitative data resulting from formal and some kinds of semi-structured interviews, or from research questions whose answers do not depend on context-rich information and extensive social interaction between the researcher and the respondent, could have more value for secondary analyses by third parties (Irwin, 2013).
In view of new funder requirements to make qualitative data available for secondary analyses, Parry and Mauthner (2004) describe another set of issues that make archiving and reusing qualitative data more challenging than they are for quantitative data. In some cases when copyright, or ownership, of data is transferred to archives, both respondents and researchers lose control of deposited data. This loss is particularly meaningful for qualitative data, which are inherently more personal, in-depth, and developmental. Even when respondent data appear to be anonymized, in some qualitative studies confidentiality may not be achievable because of very small numbers of participants and distinctive community circumstances inextricable from the central research questions. In such cases, removing or masking demographic variables and geographical information may change the meaning of the data or limit their utility. Given these and other related challenges, Parry and Mauthner (2004) urge special provisions for protecting qualitative data.

18See http://www.aaanet.org/cmtes/ethics/Ethics-Resources.cfm [December 2013].
While ICPSR is known more for archiving quantitative data, it also archives qualitative datasets.21 In archiving qualitative data, ICPSR directs researchers to guidelines on its webpages that explain how to keep data confidential by replacing names with generalized text, replacing dates, and removing unique or publicized items.22 However, this advice reflects ICPSR’s central interest and experience with quantitative datasets and, as suggested above, may not be appropriate for many qualitative materials. The ICPSR website also refers to an archive in the United Kingdom that is specifically dedicated to archiving qualitative data and that works with social scientists in developing protection methods that fit these challenging data (Corti et al., 2000).
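The ICPSR-style redactions described above, replacing names with generalized text and generalizing exact dates, can be sketched mechanically. The sketch below is a minimal illustration under assumed inputs: the name-to-label mapping is hypothetical (in practice the researcher builds it from the study roster), and real field notes would need far more careful, human-reviewed treatment than simple substitution.

```python
import re

# Illustrative mapping from real names to generalized text; a researcher
# would construct this from the study's participant roster.
NAME_MAP = {
    "Maria": "[PARTICIPANT 1]",
    "Dr. Okafor": "[PHYSICIAN]",
}

def deidentify(text, name_map=NAME_MAP):
    """Replace listed names with generalized labels, then generalize
    exact dates (MM/DD/YYYY) to a year-only marker."""
    for name, label in name_map.items():
        text = text.replace(name, label)
    text = re.sub(r"\b\d{1,2}/\d{1,2}/(\d{4})\b", r"[DATE, \1]", text)
    return text

note = "Maria saw Dr. Okafor on 03/14/2012 about the diagnosis."
clean = deidentify(note)
```

As the surrounding discussion cautions, such mechanical substitution is at best a first pass for qualitative materials: context alone can re-identify a participant even after every name and date has been masked.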
Researchers and regulators need to be aware that there are many other repositories with decades of experience handling qualitative research data, both specialized (e.g., the University of California’s Melanesian Archives23) and general (e.g., the National Archives24), not to mention the special collections held by the libraries of research universities. These repositories contain collections serving humanities disciplines such as history and are appropriate for the long-term management of the research materials generated by qualitative social research using interpretive methods.
21Observational video data are archived at the ICPSR, along with quantitative measures, as part of the Measures of Effective Teaching (MET) Project Longitudinal Database. See http://www.icpsr.umich.edu/icpsrweb/content/METLDB/about/index.html [December 2013].
22See http://www.icpsr.umich.edu/icpsrweb/content/deposit/guide/chapter3qual.html [December 2013].

References

Atran, S. (2003). Genesis of suicide terrorism. Science, 299(5612):1534-1539.
Benitez, K., and Malin, B. (2011). Evaluating re-identification risks with respect to the HIPAA privacy rule. Journal of the American Medical Informatics Association, 17(2):169-177.
Corti, L., Day, A., and Backhouse, G. (2000). Confidentiality and informed consent: Issues for consideration in the preservation and provision of access to qualitative data archives. Forum: Qualitative Social Research, 1(3). Available: http://www.qualitative-research.net/index.php/fqs/article/view/1024/2207 [December 2013].
Ember, C.R., and Hanisch, R.J. (2013). Sustaining Domain Repositories for Digital Data. Available: http://datacommunity.icpsr.umich.edu/sites/default/files/WhitePaper_ICPSR_SDRDD_121113.pdf [December 2013].
Green, L.A., Lowery, J.C., Kowalski, C.P., and Wyszewianski, L. (2006). Impact of institutional review board practice variation on observational health services research. Health Services Research, 41(1):214-230.
Hilbert, M., and Lopez, P. (2011). The world’s technological capacity to store, communicate, and compute information. Science, 332(6025):60-65.
Irwin, S. (2013). Qualitative secondary data analysis: Ethics, epistemology, and context. Progress in Development Studies, 13(4):295-306.
Landwehr, C. (in press). The operational framework: Engineered controls. In J. Lane, V. Stodden, H. Nissenbaum, and S. Bender (Eds.), Privacy, Big Data and the Public Good. Cambridge, UK: Cambridge University Press.
Levine, F., and Sieber, J. (2007). Ethical issues related to linked social-spatial data. Appendix B. In National Research Council, M.P. Gutmann and P.C. Stern (Eds.), Putting People on the Map: Protecting Confidentiality with Linked Social-Spatial Data. Washington, DC: The National Academies Press.
Levine, F.J., Lempert, R.O., and Skedsvold, P.R. (2011). Social and Behavioral Sciences White Paper on Advanced Notice of Proposed Rulemaking (ANPRM). Available: http://www.aera.net/Portals/38/docs/Education_Research_and_Research_Policy/SBS%20White%20Paper%20Report%20Final10-26-11.pdf [December 2013].
Lowrance, W.W. (2012). Privacy, Confidentiality, and Health Research. Cambridge, UK: Cambridge University Press.
National Institute of Standards and Technology. (2011). Electronic Authentication Guideline: Information Security. Gaithersburg, MD: U.S. Department of Commerce. Available: http://csrc.nist.gov/publications/nistpubs/800-63-1/SP-800-63-1.pdf [November 2013].
National Research Council. (1985). Sharing Research Data. Committee on National Statistics. S.E. Fienberg, M.E. Martin, and M.L. Straf (Eds.). Commission on Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.
National Research Council. (2003). Protecting Participants and Facilitating Social and Behavioral Sciences Research. Panel on Institutional Review Boards, Surveys, and Social Science Research. C.F. Citro, D.R. Ilgen, and C.B. Marrett (Eds.). Committee on National Statistics and Board on Behavioral, Cognitive, and Sensory Sciences. Washington, DC: The National Academies Press.
National Research Council. (2005). Expanding Access to Research Data: Reconciling Risks and Opportunities. Panel on Data Access for Research Purposes. Washington, DC: The National Academies Press.
National Research Council. (2007). Putting People on the Map: Protecting Confidentiality with Linked Social-Spatial Data. Panel on Confidentiality Issues Arising from the Integration of Remotely Sensed and Self-Identifying Data. M.P. Gutmann and P.C. Stern (Eds.). Committee on the Human Dimensions of Global Change. Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press.
National Research Council. (2010). Conducting Biosocial Surveys: Collecting, Storing, Accessing, and Protecting Biospecimens and Biodata. Panel on Collecting, Storing, Accessing, and Protecting Biological Specimens and Biodata in Social Surveys. R.M. Hauser, M. Weinstein, R. Pool, and B. Cohen (Eds.). Committee on National Statistics and Committee on Population, Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press.
Nissenbaum, H. (2011). A contextual approach to privacy online. Daedalus, the Journal of the American Academy of Arts & Sciences, 140(4):32-48.
O’Rourke, J.M., Roehrig, S., Heeringa, S., Reed, B.G., Birdsall, W.C., Overcashier, M., and Zidar, K. (2006). Solving problems of disclosure risk while retaining key analytic uses of publicly released microdata. Journal of Empirical Research on Human Research Ethics, 1(3):63-84.
Parry, O., and Mauthner, N.S. (2004). Whose data are they anyway?: Practical, legal, and ethical issues in archiving qualitative research data. Sociology, 38(1):139-152.
Pentland, A., Greenwood, D., Sweatt, B., Stopczynski, A., and de Montjoye, Y.-A. (in press). The operational framework: Institutional controls. In J. Lane, V. Stodden, H. Nissenbaum, and S. Bender (Eds.), Privacy, Big Data and the Public Good. Cambridge, UK: Cambridge University Press.
Ritchie, F. (2009). Designing a National Model for Data Access. Paper presented at the Comparative Analysis of Enterprise (Micro) Data Conference, Tokyo, Japan. Available: http://gcoe.ier.hit-u.ac.jp/CAED/papers/id213_Ritchie.pdf [December 2013].
Seastrom, M.M. (2002). Licensing. Pp. 279-296 in P. Doyle, J. Lane, J.J.M. Theeuwes, and L. Zayatz (Eds.), Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies. Amsterdam, North Holland: Elsevier.