RISKS TO CONFIDENTIALITY IN LINKED BIOLOGICAL AND SOCIAL DATA

It is well understood that, before sharing data, organizations must remove all direct identifiers, such as names and addresses, from the files. Deidentification refers to a reversible process whereby the data are key-coded, encrypted, or pseudonymized to remove personal information, but a key is generated that allows the data to be reassociated with the personal information.1 By contrast, anonymization is an irreversible process in which the data are completely stripped of all identifying information that can be linked to the study participants (Elger and Caplan, 2006; Shostak, 2006).
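The distinction between the two processes can be made concrete with a minimal sketch. It assumes a toy list of records in which `name` and `address` are the only direct identifiers; the field names, the sample values, and the helper `key_code` are hypothetical and are not drawn from any particular study or software package.

```python
import secrets

def key_code(records, id_fields=("name", "address")):
    """Split records into a coded data file and a separate linking key."""
    coded, key = [], {}
    for rec in records:
        code = secrets.token_hex(8)  # random study code, unrelated to identity
        key[code] = {f: rec[f] for f in id_fields}
        row = {f: v for f, v in rec.items() if f not in id_fields}
        row["code"] = code
        coded.append(row)
    return coded, key

# Hypothetical records; all values are invented for illustration.
records = [
    {"name": "A. Example", "address": "1 Hypothetical St", "age": 52, "hdl": 41},
    {"name": "B. Sample", "address": "2 Invented Ave", "age": 37, "hdl": 58},
]

coded, key = key_code(records)

# Deidentified (reversible): whoever holds `key` can reassociate rows with names.
reidentified = [{**key[row["code"]], **row} for row in coded]

# Anonymized (irreversible): the key is discarded and the codes are dropped,
# so no route back to the direct identifiers remains in the file itself.
anonymized = [{f: v for f, v in row.items() if f != "code"} for row in coded]
```

In this sketch the shared file would be `coded` or `anonymized`, and the linking key would be held separately under restricted access. Note that both files still carry `age` and the biomarker value, which is why removing direct identifiers alone may not be enough.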

Deidentification and anonymization of biodata should decrease the risk that unauthorized individuals can identify the participants who were the source of the data. However, deidentification and even anonymization may not be sufficient to protect participants’ confidentiality when the released data contain other variables that, in combination, might enable a malicious data user (hereafter called an intruder) to identify individuals in the file. For example, given precise values of such demographic variables as age, race, sex, education, and occupation, an intruder might be able to link records in a released file to records in other databases that include data subjects’ names. Or, given a unique medical profile, such as a diagnosis of a rare disease, an intruder might be able to use public knowledge, health records, or research data sets to link to identifiers or other data. This determination of an individual’s identity from data that have been deidentified or anonymized is referred to as reidentification, and it is a risk that is growing as more and more data become readily available electronically. For example, electronic health records may enable an intruder to access individuals’ medical data and possibly link those data to social surveys with overlapping medical variables.
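The linkage scenario described above can be illustrated with a short sketch. It assumes that the intruder matches on exact agreement of three demographic variables and holds an external file that includes names; the variable list `QUASI_IDS`, the helper functions, and all records are hypothetical. Real linkage attempts may use approximate matching, and uniqueness in a sample does not by itself establish uniqueness in the population.

```python
from collections import Counter

QUASI_IDS = ("age", "sex", "occupation")  # hypothetical demographic variables

def combo(rec):
    """Return the record's combination of demographic values."""
    return tuple(rec[q] for q in QUASI_IDS)

def unique_records(released):
    """Records whose demographic combination appears only once in the file."""
    counts = Counter(combo(r) for r in released)
    return [r for r in released if counts[combo(r)] == 1]

def link(released, external):
    """Pair unique released records with named external records,
    keeping only unambiguous one-to-one matches."""
    by_combo = {}
    for e in external:
        by_combo.setdefault(combo(e), []).append(e)
    matches = []
    for r in unique_records(released):
        candidates = by_combo.get(combo(r), [])
        if len(candidates) == 1:
            matches.append((candidates[0]["name"], r))
    return matches

# Released file: direct identifiers removed, demographic variables retained.
released = [
    {"age": 52, "sex": "F", "occupation": "pilot", "diagnosis": "rare disease X"},
    {"age": 37, "sex": "M", "occupation": "teacher", "diagnosis": "none"},
    {"age": 37, "sex": "M", "occupation": "teacher", "diagnosis": "asthma"},
]
# External database that includes names (invented for illustration).
external = [
    {"name": "A. Example", "age": 52, "sex": "F", "occupation": "pilot"},
    {"name": "B. Sample", "age": 37, "sex": "M", "occupation": "teacher"},
]

matches = link(released, external)
```

Here only the first released record is unique on the three variables, so it is the only one linked to a name. The two records that share a combination cannot be told apart by this method, which suggests why coarsening precise values can reduce the risk.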

Broadly speaking, confidentiality risks are of two types: identification disclosure risk and attribute disclosure risk. Identification disclosure occurs when an intruder learns that information on a targeted individual is in a particular shared file; if this happens, it may be possible for the intruder to determine which of the records in the file belongs to the targeted individual by examining the demographic or other variables. Attribute disclosure occurs when an intruder learns the value of a sensitive variable for a targeted individual, which may make it possible to identify records belonging to that individual. (There are other types of disclosure, such as perceived identification disclosure and inferential disclosure, but they are not discussed here.)
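The two types of risk can be separated in a small sketch. It assumes that the intruder already knows the target's demographic values and knows that the target is in the file; the function `classify_disclosure`, the variable names, and the data are hypothetical. The sketch also assumes one common route to attribute disclosure, in which every record matching the target carries the same sensitive value, so that the value is learned even when the exact record is not.

```python
def classify_disclosure(target, released, quasi_ids, sensitive):
    """Return the disclosure types possible for an intruder who knows
    the target's values on `quasi_ids` and that the target is in the file."""
    matches = [r for r in released if all(r[q] == target[q] for q in quasi_ids)]
    risks = set()
    if len(matches) == 1:
        # Exactly one candidate record: the target's record is pinned down.
        risks.add("identification")
    if matches and len({r[sensitive] for r in matches}) == 1:
        # All candidate records agree on the sensitive value.
        risks.add("attribute")
    return risks

# Hypothetical released file with coarsened age.
released = [
    {"age_group": "50-59", "sex": "F", "diagnosis": "diabetes"},
    {"age_group": "50-59", "sex": "F", "diagnosis": "diabetes"},
    {"age_group": "30-39", "sex": "M", "diagnosis": "asthma"},
    {"age_group": "30-39", "sex": "M", "diagnosis": "none"},
    {"age_group": "40-49", "sex": "F", "diagnosis": "none"},
]
quasi = ("age_group", "sex")

shared_value = classify_disclosure(
    {"age_group": "50-59", "sex": "F"}, released, quasi, "diagnosis")
ambiguous = classify_disclosure(
    {"age_group": "30-39", "sex": "M"}, released, quasi, "diagnosis")
singleton = classify_disclosure(
    {"age_group": "40-49", "sex": "F"}, released, quasi, "diagnosis")
```

In this toy file the first target faces attribute disclosure without identification, the second faces neither, and the third faces both because a single matching record reveals its sensitive value as well.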

1 Note that some biological measures vary sufficiently across time that they do not raise a risk of reidentification that is any greater than that for most social, economic, or psychological characteristics. On the other hand, some of the latter characteristics are unchanging and thus pose a greater risk. In this chapter, the focus is on characteristics that are unchanging or sufficiently stable that they raise a risk of reidentification.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.