National Academies Press: OpenBook

Conducting Biosocial Surveys: Collecting, Storing, Accessing, and Protecting Biospecimens and Biodata (2010)

Chapter: 3 Protecting Privacy and Confidentiality: Sharing Digital Representations of Biological and Social Data

Suggested Citation: "3 Protecting Privacy and Confidentiality: Sharing Digital Representations of Biological and Social Data." National Research Council. 2010. Conducting Biosocial Surveys: Collecting, Storing, Accessing, and Protecting Biospecimens and Biodata. Washington, DC: The National Academies Press. doi: 10.17226/12942.

3
Protecting Privacy and Confidentiality: Sharing Digital Representations of Biological and Social Data

One of the advantages of collecting biological specimens as part of social surveys is that digital representations of the data derived from the specimens—such as measurements of lipid levels or indicators of the presence or absence of genes or diseases—can be appended to the survey data and shared with other researchers. Wide dissemination of data facilitates advances in research and public policy. Indeed, the benefits of wide access to data have led the National Institutes of Health (NIH) to require data sharing as a criterion for funded proposals. However, biological and social data cannot be widely shared without consideration for the rights and interests of study participants from whom the data were derived, including their interest in confidentiality. As has been noted in a number of previous reports by the National Research Council (1993, 2000, 2005, 2007), there is an inherent conflict between confidentiality and data access: the obligation to protect confidentiality pushes data disseminators to restrict access, whereas researchers’ demands push them to share highly detailed data. Balancing these conflicting demands can be complicated, especially for large surveys that combine biological and social data.

This chapter examines methods of sharing digital representations of biological and social data. It begins with a discussion of the risks inherent in sharing such data. The second section reviews existing data sharing approaches and considers their potential usefulness in the context of this report. The final section summarizes some of the federal regulations related to the confidentiality of combined biological and social data.

RISKS TO CONFIDENTIALITY IN LINKED BIOLOGICAL AND SOCIAL DATA

It is well understood that, before sharing data, organizations must remove all direct identifiers, such as names and addresses, from the files. Deidentification refers to a reversible process whereby the data are key-coded, encrypted, or pseudonymized to remove personal information, but a key is generated that allows the data to be reassociated with the personal information.1 By contrast, anonymization is an irreversible process in which the data are completely stripped of all identifying information that can be linked to the study participants (Elger and Caplan, 2006; Shostak, 2006).

Deidentification and anonymization of biodata should decrease the risk that unauthorized individuals can identify the participants who were the source of the data. However, deidentification and even anonymization may not be sufficient to protect participants’ confidentiality when the released data contain other variables that, in combination, might enable a malicious data user (hereafter called an intruder) to identify individuals in the file. For example, given precise values of such demographic variables as age, race, sex, education, and occupation, an intruder might be able to link records in a released file to records in other databases that include data subjects’ names. Or, given a unique medical profile, such as a diagnosis of a rare disease, an intruder might be able to use public knowledge, health records, or research data sets to link to identifiers or other data. This determination of an individual’s identity from data that have been deidentified or anonymized is referred to as reidentification, and it is a risk that is growing as more and more data become readily available electronically. For example, electronic health records may enable an intruder to access individuals’ medical data and possibly link those data to social surveys with overlapping medical variables.
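The linkage scenario described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the variable names, the choice of quasi-identifiers, and the toy records are hypothetical, and a real intruder would typically also have to cope with measurement error and inexact matches.

```python
from collections import defaultdict

# Hypothetical quasi-identifiers shared by the released file and the external file.
QUASI_IDENTIFIERS = ("age", "sex", "occupation")

def link_records(released, external, keys=QUASI_IDENTIFIERS):
    """Map released record ids to names when exactly one external record matches."""
    index = defaultdict(list)
    for person in external:
        index[tuple(person[k] for k in keys)].append(person["name"])
    matches = {}
    for record in released:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match is a reidentification
            matches[record["id"]] = candidates[0]
    return matches

# Toy deidentified survey file: no names, but detailed demographics.
released = [
    {"id": "r1", "age": 47, "sex": "F", "occupation": "pharmacist", "ldl": 162},
    {"id": "r2", "age": 33, "sex": "M", "occupation": "teacher", "ldl": 101},
]
# Toy external database that carries names.
external = [
    {"name": "A. Example", "age": 47, "sex": "F", "occupation": "pharmacist"},
    {"name": "B. Example", "age": 33, "sex": "M", "occupation": "teacher"},
    {"name": "C. Example", "age": 33, "sex": "M", "occupation": "teacher"},
]

reidentified = link_records(released, external)
```

In this toy case only record r1 is linked to a name, because two external records share the demographic profile of r2. The more people who share a profile, the less an exact match reveals.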

Broadly speaking, confidentiality risks are of two types: identification disclosure risk and attribute disclosure risk. Identification disclosure occurs when an intruder learns that information on a targeted individual is in a particular shared file; if this happens, it may be possible for the intruder to determine which of the records in the file belongs to the targeted individual by examining the demographic or other variables. Attribute disclosure occurs when an intruder learns the value of a sensitive variable for a targeted individual, which may make it possible to identify records belonging to that individual. (There are other types of disclosure, such as perceived identification disclosure and inferential disclosure, but they are not discussed here.)

1

Note that some biological measures vary sufficiently across time that they do not raise a risk of reidentification that is any greater than that for most social, economic, or psychological characteristics. On the other hand, some of the latter characteristics are unchanging and thus pose a greater risk. In this chapter, the focus is on characteristics that are unchanging or sufficiently stable that they raise a risk of reidentification.

Suggested Citation:"3 Protecting Privacy and Conἀdentiality: Sharing Digital Representations of Biological and Social Data." National Research Council. 2010. Conducting Biosocial Surveys: Collecting, Storing, Accessing, and Protecting Biospecimens and Biodata. Washington, DC: The National Academies Press. doi: 10.17226/12942.
×

A great deal of research has been done to assess the risks of disclosure for biological and social data (e.g., National Research Council, 2005, 2007). Conceptually, the combination of biological and social data presents no new issues related to disclosure risk. What is important in the context of this report is that disclosure risk increases as more variables are added, and the risk of identification disclosure is greater with combined data than with any single type of data. Suppose, for example, that an individual is nearly identifiable from a combination of demographic variables that are publicly available; this would be the case, for instance, if only a small number of people matched those particular characteristics. Then adding a set of biomarkers that was available to an intruder could enable the intruder to identify that person. Similarly, an individual might not be identifiable from a small number of genes provided in a released file, but adding demographic data could provide enough information to result in identification.
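One rough way to see how risk grows as variables are added is to count the records that are unique on a given set of variables. The sketch below is an illustration under stated assumptions: the field names (including the hypothetical `apoe` genotype field) and the four toy records are invented, and sample uniqueness is only a crude proxy for identification disclosure risk.

```python
from collections import Counter

def unique_fraction(records, keys):
    """Share of records whose combination of values on `keys` occurs only once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

# Toy file: nobody is unique on demographics alone.
records = [
    {"age_group": "40-44", "sex": "F", "apoe": "e3/e4"},
    {"age_group": "40-44", "sex": "F", "apoe": "e3/e3"},
    {"age_group": "40-44", "sex": "M", "apoe": "e3/e3"},
    {"age_group": "40-44", "sex": "M", "apoe": "e3/e3"},
]

demographic_only = unique_fraction(records, ["age_group", "sex"])
with_biomarker = unique_fraction(records, ["age_group", "sex", "apoe"])
```

Here no record is unique on age group and sex, but half of the records become unique once the biomarker is appended, which mirrors the argument that combined data carry more identification risk than either type alone.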

Additionally, the degree of potential harm from attribute disclosure is greater in the case of combined data. For example, an intruder’s learning that someone has a particular set of genes might not be especially damaging, but the potential damage would be much greater if the intruder also learned phenotypic information from social data, such as criminal histories or sexual habits. Likewise, learning someone’s identity from social science data might be innocuous, but if the person’s disease status were also uncovered, harm could result—for example, from discrimination.

Assessing the risks of sharing combined data is complex. First, the data steward must determine to the extent possible which variables would be available to intruders. The answer depends on the nature of the variables and may change over time. Furthermore, some biological and social variables are effectively permanent characteristics, so that sharing them poses risks for both the present and the future. A key example is genetic data. While there is currently no way to search for a person’s identity using his or her DNA, ongoing work of researchers, government, and companies such as 23andMe may well make individuals’ genomes, in whole or in part, available to intruders in the near future. Hence, because of the permanence of genetic data, any sharing of such data now could lead to disclosure risks in the future. Indeed, research by Malin and Sweeney (2004) indicates that genetic data already pose risks: the authors show that unlabeled DNA sequences stripped of demographic data and identifiers, when interpreted for common disease genes and screened against publicly available data, could be narrowed down to several individuals.

Second, anyone who disseminates such data must also take into account the variation that may exist across databases in the variables available to intruders. High variation provides an additional layer of confidentiality protection since the same record may have different values in the shared file and in the intruder’s external database, making correct matching difficult. While such variations can easily be found in social science data because of inconsistencies in self-reporting, some clinically derived biological data are measured with very little error, so this additional layer of protection does not exist. Genetic data are again a good example, as the error rates for genotyping are known to be very small (Hao et al., 2004; Saunders, Brohede, and Hannan, 2007).

A third consideration is future technologies that might be used by intruders. Data that appear safe today may not be safe in the future. As an example, a recent report by researchers at the Translational Genomics Research Institute in Phoenix, Arizona, and the University of California, Los Angeles, shows that there is a much greater risk of identifying participants in a genetic survey than was previously thought (Homer et al., 2008). The report relates to genome-wide association studies (GWASs), which use the complete genomes from a number of individuals with a particular disease to scan for markers that may be associated with that disease. The surprising finding was that if one knows the DNA of a particular person, it is possible to tell whether that person is included in the GWAS database simply by looking at the allele frequencies. This finding led NIH to put some of its data behind a firewall, a step that was taken by the Wellcome Trust as well (Clayton, 2008). Once an intruder knows that someone is in a particular data set, that knowledge may provide information about the person’s health or disease status, since GWASs typically focus on a particular phenotype—dementia, for example, or breast cancer.
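The intuition behind this kind of membership inference can be sketched in a few lines. The code below is a simplified illustration in the spirit of a distance-based comparison of allele frequencies, not a reproduction of the published method: the function name, the four toy SNPs, and all frequencies are assumptions, and a real analysis would use many thousands of SNPs together with a formal significance test.

```python
def membership_score(genotype, pool_freq, reference_freq):
    """Mean over SNPs of |y - reference| - |y - pool|.

    `genotype` holds one person's allele frequency per SNP (0.0, 0.5, or 1.0).
    A positive score means the person is closer to the study pool than to the
    reference population, which hints at membership in the pool.
    """
    terms = [
        abs(y - ref) - abs(y - pool)
        for y, pool, ref in zip(genotype, pool_freq, reference_freq)
    ]
    return sum(terms) / len(terms)

# Toy allele frequencies (assumed values, four SNPs only).
reference_freq = [0.50, 0.50, 0.50, 0.50]  # general population
pool_freq = [0.60, 0.40, 0.60, 0.40]       # published study summary
participant = [1.0, 0.0, 1.0, 0.0]         # pulls the pool toward their alleles
outsider = [0.0, 1.0, 0.0, 1.0]

score_in = membership_score(participant, pool_freq, reference_freq)
score_out = membership_score(outsider, pool_freq, reference_freq)
```

With these toy numbers the participant's score is positive and the outsider's is negative. Each single SNP contributes only a tiny signal, and it is the aggregation over very many SNPs that makes published summary frequencies revealing.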

Finally, while much of the research on identifiability has involved genomic data, similar risks can be expected for other highly detailed and high-dimensional data, such as proteomic and metabolomic data.2 Given ongoing rapid technology advances and the likelihood that electronic health records will become commonplace in the next few years, it is possible that these data will be available to intruders in the near future.

APPROACHES TO SHARING BIOLOGICAL AND SOCIAL DATA

The literature on data sharing describes two broad approaches to protecting the confidentiality of individuals whose records appear in data collections: restricting access and restricting data. The restricted access approach involves controlling who can access the data for analysis and under what conditions. Specific strategies include licensing agreements and data enclaves. The restricted data approach involves providing a redacted version of the data to those who wish to use it. Redaction strategies are often termed statistical disclosure limitation (SDL); they include such approaches as recoding variables, suppressing data, and perturbing data.

2

Proteomics refers to the branch of molecular biology that deals with the full set of proteins encoded by a genome; proteomic data are data associated with proteins expressed by a genome, cell, tissue, or organism. Metabolomics refers to the study of the chemical fingerprints that specific cellular processes leave behind; metabolomic data are the small-molecular metabolite profiles present in cells or tissue.

Restricted Access

There are four primary strategies for restricted access: (1) licensing, (2) remote execution systems, (3) data enclaves, and (4) virtual data enclaves.

Licensing

To obtain data with little or no redaction other than removal of direct identifiers such as names and addresses or geocodes, researchers sign a licensing agreement not to use the data for malicious purposes, such as identifying individuals and subsequently taking injurious actions based on those identifications. A number of statistical agencies, including the National Center for Education Statistics, use this approach, as do public data archives such as the Inter-university Consortium for Political and Social Research (ICPSR) and the principal investigators of many social science and biosocial surveys. Licensing allows researchers to obtain highly detailed data and thus facilitates secondary analyses of the data. However, this approach relies on researchers not violating the terms of the license. Enforcement is generally the responsibility of the funding agency. For certain types of violations, substantial penalties can be levied, such as revoking funding from the responsible investigator and institution. One disadvantage of licensing is the difficulty that often characterizes the process; Box 3-1 describes some ways to make the process less burdensome for both researchers and data stewards.

Remote Execution Systems

In this approach, confidential data are maintained in a computer system owned by the data disseminator, and a secondary researcher who wishes to perform a study with the data submits a query to the system. As long as the information is not confidential, the system provides the results of the query (which may be computed by the system or by staff at the data-disseminating organization) to the researcher without revealing the individual data. The National Center for Health Statistics and the U.S. Bureau of the Census maintain such systems. These systems are not foolproof, however, as intruders can use judicious queries to glean sensitive information about data subjects. To minimize this risk, the system must limit the types of analyses that can be performed, which reduces the utility of the data to outside researchers. Furthermore, without full access to individual-level data, researchers find it difficult to perform exploratory data analyses or to check the fit of models.
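The weakness noted above, that judicious queries can leak individual values, can be sketched as follows. This is a hypothetical toy system, not a description of any agency's software: the class name, the minimum cell size of 3, and the records are all assumptions made for illustration.

```python
MIN_CELL_SIZE = 3  # hypothetical threshold below which a query is refused

class RemoteQuerySystem:
    """Toy remote execution system that returns aggregates, never records."""

    def __init__(self, records):
        self._records = records  # individual-level data stay inside the system

    def total(self, variable, predicate):
        """Sum `variable` over records matching `predicate`, if enough match."""
        selected = [r[variable] for r in self._records if predicate(r)]
        if len(selected) < MIN_CELL_SIZE:
            raise PermissionError("query refused: too few records")
        return sum(selected)

records = [
    {"town": "A", "score": 6},
    {"town": "A", "score": 7},
    {"town": "A", "score": 5},
    {"town": "B", "score": 9},  # the only respondent in town B
]
system = RemoteQuerySystem(records)

# Asking directly about town B would be refused, but two permitted
# queries can be differenced to recover the same value.
everyone = system.total("score", lambda r: True)
town_a = system.total("score", lambda r: r["town"] == "A")
leaked = everyone - town_a
```

Each query passes the cell-size check on its own, yet their difference equals the lone town B respondent's value. Guarding against this requires restricting or auditing combinations of queries, which is one reason such systems limit the analyses they allow.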

BOX 3-1

Ways to Facilitate the Licensing Process

A number of authors have noted that the licensing process can be difficult, time-consuming, and even costly and thus may limit the number of researchers who use the specimens and data in a storage facility (see, for example, Nolte and Keller, 2004; National Research Council, 2005). The most successful researchers, who have many options from which to choose, may eschew projects for which the licensing process is too demanding; indeed, concern has been expressed that some investigators may not use data unless they have easy access. At the same time, managing licenses is difficult for data stewards. Those who produce data for research often lack the capacity to manage numerous contracts. Even professional data archives can be overwhelmed by the requirements of dealing with file cabinets full of paper contracts that require constant monitoring.

Thus it is important to find ways to make licensing easier. One recommendation to that end is offered in the National Research Council report Expanding Access to Research Data: Reconciling Risks and Opportunities: “Statistical and other agencies that provide data for research should work with data users to develop flexible, consistent standards for licensing agreements and implementation procedures for access to confidential data” (National Research Council, 2005, p. 79). A variety of options could be explored along these lines. For example, it might be possible to have a central licensing agency that would license individual researchers and laboratories in such a way that they would not have to seek a new license each time they sought access to a new data set. Or there could be a tiered licensing system with differing licensing requirements depending on such things as the sensitivity of the data or the amount of access requested, so that investigators whose projects presented a lesser risk to the confidentiality of the data might be subject to a less onerous licensing process. In one effort to contribute to solving problems with licensing, the NIH-sponsored Data Sharing for Demographic Research Project at the Inter-university Consortium for Political and Social Research (ICPSR) published guidance for developing and implementing a restricted-use data contract or license (see http://www.icpsr.umich.edu/DSDR/rduc/ [accessed May 27, 2010]). In the next step of this project, plans are to develop a largely automated system for managing licenses and for validating the security of computers with which licensed data will be analyzed.

Data Enclaves

With this approach, an investigator works in a room dedicated to accessing the data. Only approved researchers are allowed in the enclave. Computers in the room are not connected to the Internet or to other external resources. Researchers cannot take individual-level data from the enclave, and all results they obtain are checked by the data disseminator for potential confidentiality breaches before they can be taken out of the enclave. Because of these features, this approach offers the highest level of security. It also carries a variety of costs, however. Data enclaves are inconvenient for investigators who must travel to them. Moreover, most enclaves require that data users pay a substantial fee to cover the costs of maintaining the facility and its staff; they also require potentially time-consuming background checks and proposal approvals. Hence, the amount of analysis likely to be carried out on the data in an enclave is limited (National Institutes of Health, 2006).

Virtual Data Enclaves

Virtual data enclaves combine features of licensing, remote execution systems, and data enclaves. The data are housed in a system owned by the data disseminator. Licensed secondary researchers access the data remotely; that is, the researcher’s computer serves as a dumb terminal. The National Opinion Research Center (NORC) at the University of Chicago maintains such a system, called the NORC Data Enclave. Users of a virtual data enclave cannot store the data on their computers, and certain functions on their computers, such as printing and the ability to copy data to removable media (including disks and micro-vault storage media), are disabled. Virtual data enclaves thus allow researchers to access the data without traveling to a secure data enclave. They also avoid some of the disclosure risks posed by licensing researchers to store data on their own machines, such as the accidental loss of CD-ROMs or sharing of data with unapproved investigators or students. As with licensing, however, confidentiality protection depends on researchers not violating terms of the data use agreement; thus, virtual data enclaves are less secure than physical ones.

Restricted Data

Many data sets containing biological or social data are shared after identifying or sensitive values have been altered. Alterations can be made in a variety of ways, including coarsening the variables by, for example, releasing ages in 5-year categories rather than as exact values; top-coding, as in reporting incomes that exceed a threshold T simply as “above T”; swapping, or exchanging small amounts of data between records, with the goal of introducing uncertainty about identities; adding noise to numerical variables; and replacing sensitive values with synthetic data derived from a probability model (see the detailed discussion in National Research Council, 2007). One could characterize these methods as protecting confidentiality by obscuring relatively high-order features of the data. Top-coding, for example, destroys analyses of the tails of the distributions; swapping attenuates the correlations among swapped and non-swapped attributes; adding random noise distorts distributions and attenuates relationships; and synthetic data are guaranteed to preserve only the relationships in the synthesis models.
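Several of the alterations listed above are simple enough to sketch directly. The functions below are minimal illustrations, not production disclosure-limitation tools: the function names, the default 5-year band, the threshold, and the noise level are all assumptions, and real applications choose such parameters through a formal assessment of risk and utility.

```python
import random

def coarsen_age(age, width=5):
    """Report age as a band, e.g. 47 becomes '45-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def top_code(value, threshold):
    """Report values above the threshold T only as 'above T'."""
    return f"above {threshold}" if value > threshold else value

def add_noise(value, sd, rng):
    """Perturb a numerical value with zero-mean Gaussian noise."""
    return value + rng.gauss(0, sd)

def swap_values(records, variable, n_swaps, rng):
    """Exchange `variable` between randomly chosen pairs of records (on a copy)."""
    out = [dict(r) for r in records]
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i][variable], out[j][variable] = out[j][variable], out[i][variable]
    return out

rng = random.Random(0)  # fixed seed so the sketch is reproducible
records = [{"id": i, "income": v} for i, v in enumerate([40000, 55000, 72000, 250000])]
swapped = swap_values(records, "income", n_swaps=2, rng=rng)
banded = coarsen_age(47)
capped = top_code(250000, 150000)
```

Swapping leaves the marginal distribution of incomes unchanged but weakens its association with the other variables in each record, and top-coding hides the upper tail entirely. These are the losses of higher-order features described above.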

Generally speaking, greater modifications of the data lead to greater protection but also to greater reduction in the utility of the data. Thus, determining the amount and type of data alteration involves trade-offs. Weighing the type and extent of protection against the reduction in the utility of the data requires developing a way of measuring both the risk of disclosure for data altered in various ways and the utility of the data. Although some metrics for data utility exist (Karr et al., 2006), the assessment is a nontrivial task. In general, it is difficult for researchers not trained in methods of confidentiality protection to apply these methods in ways that optimize the utility of data while adequately protecting confidentiality.

It is difficult to imagine that standard protection methods will be effective for multidimensional biological and social data. Consider, for example, using these methods to protect 660,000 single nucleotide polymorphisms (SNPs) that are planned for release. It is not clear how many and which SNPs are needed to identify individuals, particularly if the data also contain psychological, social, or economic variables. If the data disseminator decides to protect a large proportion of the SNP data by, for example, swapping or synthesis, it is inevitable that many interaction effects among the SNPs will be nearly destroyed. This outcome is problematic since analyses of gene–gene or gene–environment interactions are inherently high-dimensional. One way to get around the limitations of data restriction is to combine this type of approach with restricted access. An example of such a combined approach is provided in Box 3-2.

Choosing a Data Sharing Strategy

The restricted access and restricted data strategies described above provide varying degrees of confidentiality protection, and each has its limitations with respect to preserving the utility of the data for research purposes. In general, there is a trade-off between the level of protection that is afforded and the level of utility that is preserved.

Of the restricted access strategies, remote execution systems and data enclaves offer the highest level of protection, but they also impose the greatest limitations on the utility of the data: the former because the types of analyses that can be performed are significantly limited, and the latter because researchers bear a heavy burden in having to travel to the enclave, pay substantial fees, and undergo background checks and proposal approvals. Licensing and virtual data enclaves impose fewer limitations, but the protection they provide depends on researchers abiding by the terms of the license or data use agreement. Nonetheless, the panel believes that, given rigorous enforcement (see the discussion later in this chapter), these two strategies hold the greatest promise for sharing combined biological and social data in ways that support and sustain promises of confidentiality while preserving the utility of the data. As discussed in Chapter 5, the panel also believes that an effort must be made to improve the licensing process so as to make it less complex and time-consuming, employing options such as those outlined in Box 3-1.

BOX 3-2

Data Sharing for the Health and Retirement Study

The Health and Retirement Study (HRS) at the University of Michigan uses a combination of approaches to share data while protecting confidentiality (Nolte and Keller, 2004). Investigators who wish to use sensitive or potentially identifiable data from the HRS can apply for a license for full access to the data if they agree to certain conditions, such as developing and implementing a data protection plan to protect participant confidentiality; allowing yearly inspections; providing annual reports; and in some cases, submitting to a prepublication review of the analysis results. The major disadvantages of this approach are the length of time required to obtain approval for a nonstandard data protection plan and the fact that only principal investigators of a federally funded project who are also affiliated with an institution with an NIH-certified human subjects review process can apply for a license. Thus, junior faculty members and students, for example, have difficulty accessing the data through such agreements (Nolte and Keller, 2004).

For this reason, the HRS also makes its data available through the Michigan Center on the Demography of Aging (MiCDA) Data Enclave in Ann Arbor. Investigators who do not meet the requirements for a data license can come to the data enclave to perform data analyses. There are few restrictions other than a review of the analyses by the data enclave staff to ensure that they include no information that could compromise the confidentiality of the participants whose data are in the database. The main disadvantage is that an investigator must go to the data enclave to perform an analysis (Nolte and Keller, 2004). To avoid this disadvantage, the HRS began a virtual data enclave program, through which researchers can access data on the HRS server remotely from their own institutions. To maintain confidentiality, all the data are kept in the HRS computers. Remote users send instructions to perform analyses that are carried out on the HRS computers and then receive the results. Various security systems ensure that no confidential data are released, and the data enclave staff still review the results of the analyses to ensure that no breaches of confidentiality occur. The main disadvantage of this approach is the cost of setting up secure computer systems at the investigators’ home institutions (Nolte and Keller, 2004).

The HRS approaches to data access illustrate two access models available for dealing with biospecimens and associated data.

Data restriction strategies are both imperfect in protecting confidentiality and likely to significantly reduce the utility of the data. Combining these strategies with some form of restricted access, as is done for the Health and Retirement Study (Box 3-2), can address these limitations by enhancing security and necessitating less extreme alterations of the data.

Regardless of which strategy for data sharing is chosen, it is essential to

Suggested Citation:"3 Protecting Privacy and Conἀdentiality: Sharing Digital Representations of Biological and Social Data." National Research Council. 2010. Conducting Biosocial Surveys: Collecting, Storing, Accessing, and Protecting Biospecimens and Biodata. Washington, DC: The National Academies Press. doi: 10.17226/12942.
×

BOX 3-3

The Importance of Planning for Data Sharing: Two Case Examples

Because placing biosocial survey data into data archives poses so many risks to participants’ confidentiality, it is important to take the obstacles to sharing data into account when planning a survey, preparing consent forms, and working with the relevant Institutional Review Boards (IRBs). If the data sharing strategy is not properly planned in advance, it can be difficult to arrange later. A brief description of the types of issues encountered by two projects based in Italy is sufficient to illustrate some of those difficulties.

The SardiNIA Study of Aging is an international collaboration funded primarily by the National Institute on Aging (NIA) to study the effects of various genetic variants on aging in a group of people in Sardinia, the second-largest island in the Mediterranean Sea. The research has helped identify genetic variants linked to lipid levels and risk for coronary artery disease. But difficulties arose when the principal investigators were asked to provide their data to the Database of Genotypes and Phenotypes (dbGaP), a database operated by the National Center for Biotechnology Information to archive and distribute data from genome-wide association studies (GWASs).

Because the SardiNIA study began before dbGaP had been fully established, no provisions were made during the planning for the study to place the data in dbGaP. The SardiNIA investigators were able to provide dbGaP a listing of the traits and diseases that had been studied, a data dictionary, and the genotyping platform that had been used, along with p-values for every single nucleotide polymorphism (SNP) for every trait that had been studied. The investigators were able to supply that information because it did not include any personal data that could lead to breaches of confidentiality. On the other hand, the investigators were unable to provide dbGaP with information on family relationships and raw data connecting each participant to corresponding trait values and genotypes.

Placing the SardiNIA data into dbGaP would require approval from the IRB overseeing the study. In their original informed consent forms, the investigators had not discussed the possibility of depositing data into dbGaP or any other archive. Instead, the consent forms had specified only that investigators and staff with the National Research Council of Italy would have access to the data, and that NIA would have access only to data that were anonymously coded. Depositing all of the SardiNIA data in dbGaP would require participants’ reconsenting, and only data from surviving participants who reconsented would be able to be deposited. Obtaining reconsents would require a great deal of additional expense, effort, and time.

One option the investigators suggested was providing a limited amount of data to dbGaP—only general information about the study and statistics regarding the prevalence of various genetic variants in the population—and then accepting requests for more complete data from individual researchers. The raw data would remain with the SardiNIA group rather than at dbGaP, and the SardiNIA IRB would decide on a case-by-case basis who would receive access to the more complete data.

Similar issues arose with the proposed archiving of data from the InCHIANTI (Invecchiare in Chianti, or “aging in the Chianti area”) Study, which examined aging among the populations of two small towns in the Tuscany region of Italy. In this case, the principal investigator of the study spoke with the study’s steering committee, explaining the position of the National Institutes of Health that data from research paid for with public funds should be shared. The steering committee agreed that at least part of the data should be archived, but its members were concerned that doing so might not be possible given the language in the consent forms, which said the data could be shared with the collaborators of the InCHIANTI group but did not mention making the data available in a public database. To archive the InCHIANTI data would require obtaining new consent forms from the participants that specifically allowed for placing the data in a public archive. In the case of participants who had since died, it would be possible to place some of their data in an archive because of the wording of the original consent form, but none of their genetic data could be included.

One clear lesson to be drawn from these two examples is the importance of including a discussion of data sharing in consent forms. Without such explicit discussion, it may be impossible to make data publicly available.

make the decision during the process of planning for the study and to ensure that consent forms contain the necessary detail to enable the strategy’s implementation. Box 3-3 provides two case examples of the importance of planning in advance for data sharing.

SALIENT FEDERAL REGULATIONS

A variety of federal regulations deal with issues related to data sharing and privacy. This section provides an overview of those that are most important to researchers dealing with biodata in social science surveys.

Federal Privacy Regulations

Among the federal regulations pertaining to privacy are the Common Rule (45 Code of Federal Regulations [CFR] Part 46) and the Standards for Privacy of Individually Identifiable Health Information (45 CFR Part 160 and Part 164, Subparts A and E) (U.S. Department of Health and Human Services, 2000, 2005b), issued by the Department of Health and Human Services and commonly referred to as the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. These regulations are intended to protect the human subject’s personal health information and identity while allowing society to benefit from the use of that information, including for research purposes (U.S. Department of Health and Human Services, 2003).

The application of these federal regulations is complex. The HIPAA Privacy Rule, for example, applies only to health data obtained from covered entities, such as physicians, hospitals, and Medicare or Medicaid records, and not to reports or biodata obtained from individuals in the course of a social science survey. Moreover, both rules may be interpreted differently depending on the kind of research in question: direct human subjects research, for instance, versus research using leftover biospecimens in a biorepository or biodata stored in a biobank. Generally speaking, biorepositories and biobanks are not considered covered entities under the HIPAA Privacy Rule unless their contents were obtained in research requested and approved by a covered health provider.

The Common Rule and the HIPAA Privacy Rule need to be clarified to alleviate the scientific community’s fears that they could be misinterpreted in a way that could be damaging to research on biospecimens. In particular, it is important to (1) develop a standard for what constitutes “minimal risk,” especially in the context of the use of existing biospecimens and biodata, and (2) define “human subject” more clearly in light of the fact that research can be carried out on specimens and data that were collected from people who later died (Meslin and Quaid, 2004, p. 230). Under the Common Rule, someone who has died is not considered a “human subject,” but in the case of biospecimens, it can be argued that the surviving interests of the deceased person, such as his or her reputation, should be protected. Another point of confusion is whether, if the research carried out on deceased persons can generate information about living people, those individuals should also be considered “subjects” (DeRenzo, Biesecker, and Meltzer, 1997; Meslin and Quaid, 2004, p. 230).

In some instances, the Common Rule and the HIPAA Privacy Rule actually contradict each other, as in informed consent for future, unspecified research: the Common Rule allows authorization of such unspecified uses, but the HIPAA Privacy Rule requires a specific research purpose for each authorization of release of protected health information (Vaught et al., 2007). Furthermore, the HIPAA Privacy Rule distinguishes between the creation of a biospecimen repository or a database containing protected health data and the release of data from such resources for research purposes. The rule requires a participant’s authorization for each instance of data release unless a waiver is granted by the Institutional Review Board (IRB). This inconsistency between existing federal guidelines, along with differing interpretations of the guidelines, has prompted some experts to lament the resulting burden on researchers and to question the rules’ effectiveness in protecting the privacy of research participants (Bankhead, 2004; Nosowsky and Giordano, 2006).

Since the privacy, confidentiality, and identifiability of human research subjects are important concerns, many institutions devote a good deal of time and resources to the discussion of best practices in these areas. The National Human Genome Research Institute (NHGRI) held a workshop in October 2006 to address these issues and how they pertain to genomic research. The discussion document developed for this workshop and the workshop report provide useful points to consider, especially with respect to the definition of “identifiable” data, strategies for data deidentification, and the risks that deidentified data may be reidentified (Lowrance, 2006a, 2006b). According to the workshop report, some of the “themes that were accepted as granted” are that (1) at this stage of advanced genomic science, the scientific community must do everything possible to respect and protect the privacy of human subjects whose genomic information is being used for research, and (2) everyone in the chain of data collection shares the responsibility for privacy and identity protection.

On this subject, it is also worth noting two other points concerning privacy. First, the Common Rule does not apply to research done by private companies, organizations, or individuals that do not receive support from the U.S. government, except in the case of institutions (e.g., universities) that have accepted blanket responsibility for enforcement of the rule, regardless of the source of research support. Second, neither the Common Rule nor the HIPAA Privacy Rule will protect an individual’s privacy and identity in the case of a court-ordered subpoena requesting personal information, but that level of protection can be provided by a federal Certificate of Confidentiality.

Genetic Discrimination

On May 21, 2008, after 13 years of debate in Congress, President Bush signed into law the Genetic Information Nondiscrimination Act. This act protects Americans from being discriminated against in either employment or health insurance based on their genetic information (U.S. House of Representatives, 2008). Although this legislation provides a certain amount of protection from genetic discrimination, it is limited to asymptomatic individuals and thus does not protect individuals already presenting readily detectable symptoms of a genetic condition (Rothstein, 2008). At this point, there is relatively little in the judicial record to indicate how courts will deal with employers, insurers, or others discriminating against an individual on the basis of information about that individual’s genetic makeup. No cases of genetic discrimination have yet been brought before federal or state courts, for example.

One case that offers some insight into how the courts may deal with the issue is a 2001 suit filed by the Equal Employment Opportunity Commission (EEOC) against Burlington Northern Santa Fe (BNSF) Railroad for testing its employees without their consent for a rare genetic condition that causes carpal tunnel syndrome. The doctors hired by BNSF also screened the employees for diabetes and alcoholism without their knowledge, and at least one BNSF employee was threatened with termination for refusing testing (National Human Genome Research Institute, 2008). The EEOC argued on behalf of BNSF employees that these tests were unlawful based on the Americans with Disabilities Act (Public Law 101-336) as they were not job related, and any change in employment due to the results would constitute discrimination based on a disability. This lawsuit was quickly settled, and BNSF agreed with all the EEOC’s requests.

Legal Sanctions and Enforcement

At present, NIH relies on its ability to restrict grant funding to ensure that best practices are upheld, data are shared, confidentiality is respected, and so forth. If, for example, an investigator signs a licensing agreement to obtain data from a repository and then fails to abide by the terms of the agreement, NIH can cut off all or part of that researcher’s funding. The general perception is that enforcement of these agreements is weak, but there has been no evidence of breaches of confidentiality (National Institutes of Health, 2006).

As more and more biospecimens and biodata are collected by and made available through biorepositories and biobanks, it will become increasingly important to ensure effective enforcement of the rules governing the use of these materials and data, coupled with strong legal sanctions when the rules are broken. To this end, a number of questions need to be answered: What sort of enforcement scheme should be used? Should digital representations of biological data be turned over to an archive such as the Inter-university Consortium for Political and Social Research? Given that most researchers do not want to do their own policing, who should police the use of biospecimens and biodata—the Office of Research Integrity or some other agency?

