In the panel’s first report, we briefly reviewed some of the major privacy issues related to combining multiple data sources, including the relevant laws and approaches that statistical agencies have taken to fulfill those laws and protect their data while providing access to researchers. In this chapter we expand on that discussion, focusing on legal issues that arise in collecting, acquiring, and combining multiple data sources. We discuss privacy from a legal perspective and a computer science perspective, attempting to reconcile these different views. We also discuss how moving to a world of combined multiple data sources changes threats to privacy and introduces new threats. We then address the implications for federal statistical agencies, including the additional privacy and confidentiality laws that apply to statistical data, as well as the legal and policy issues that arise with linking records from different data sources. We continue discussion of privacy issues in the next chapter, expanding on the discussion in our first report on how federal statistical agencies can use security measures, computer science technologies, statistical methods, and administrative procedures to protect data and permit access for statistical purposes.
Almost all federal household and economic surveys assure respondents that the information they provide will be protected and will not be used to harm them. For example, respondents to the American Community
Survey are told: “We never reveal your identity to anybody else. Ever.”1 The website explains that “[a]ll Census Bureau employees take an oath of nondisclosure and are sworn for life to protect all information that could identify individuals.” Information that could identify individuals is referred to as “personally identifiable information” (PII), and each federal agency has regulations and procedures for protecting PII.
Combining data sources has the potential to reveal more information about individuals in the data sources. For example, if records from a survey of college graduates were linked with university records and information about the subsequent work history of the respondents, the original, limited information from the survey now is joined with more detailed information. It is possible that a public-use dataset published from the linked dataset might have enough information to allow for the identification of individual respondents even if all information such as names, addresses, dates of birth, and names of universities and employers had been deleted from the records. The additional information available through the linkage also makes it possible to publish statistical summaries, such as cross-tabulations on more variables, and, in some instances, those tables taken together might be used to identify individuals in the survey even though the individual tables contain only statistical information on groups of records.
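The linkage risk just described can be made concrete with a toy sketch. Everything below is fabricated: the field names (birth_year, zip3, degree_field), the records, and the auxiliary source are hypothetical, chosen only to show how a unique combination of quasi-identifiers can re-attach an identity to a "de-identified" record.

```python
# Toy illustration (hypothetical data): even after names and other direct
# identifiers are deleted, a few "quasi-identifier" fields may combine to
# single out one record when two sources are linked.

deidentified_survey = [
    {"birth_year": 1985, "zip3": "021", "degree_field": "physics", "salary": 91000},
    {"birth_year": 1985, "zip3": "021", "degree_field": "history", "salary": 52000},
    {"birth_year": 1990, "zip3": "100", "degree_field": "physics", "salary": 88000},
]

# Auxiliary source (e.g., a public alumni listing) that carries names.
university_records = [
    {"name": "A. Smith", "birth_year": 1985, "zip3": "021", "degree_field": "history"},
]

def reidentify(survey, auxiliary):
    """Link on quasi-identifiers; a unique match re-attaches an identity."""
    hits = []
    for aux in auxiliary:
        matches = [s for s in survey
                   if (s["birth_year"], s["zip3"], s["degree_field"]) ==
                      (aux["birth_year"], aux["zip3"], aux["degree_field"])]
        if len(matches) == 1:  # unique combination -> re-identification
            hits.append((aux["name"], matches[0]["salary"]))
    return hits

print(reidentify(deidentified_survey, university_records))
# -> [('A. Smith', 52000)]
```

The point of the sketch is that no single field is identifying; it is the joint combination, made available by the linkage, that isolates one record.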
The phrase “personally identifiable information” is central to the development of modern privacy law. The phrase appears frequently in federal law, judicial opinions, and legal scholarship. It is also an imprecise term that has led to confusion, particularly between legal scholars and technology experts. When legal scholars use the phrase, they anticipate that a determination will be made, within the context of laws and legal institutions, as to what constitutes PII. As with many terms in law, the phrase takes on meaning in the context of specific use: data that may be considered PII in some circumstance may not be considered PII in other circumstances. Computer scientists, in contrast, have a different view of the phrase. They would argue that all information could be viewed as PII, and, as a consequence, threats to individual privacy are not adequately addressed by the PII/non-PII dichotomy. Moreover, computer scientists would argue that in a networked world, the protection of privacy will require mathematically rigorous notions that can be translated with algorithms into numerical outcomes.2
The core differences, and the source of much confusion, may be understood as the difference between the central place of PII in modern privacy law and the ability of modern computer science to breach individual privacy, that is, to turn non-PII into PII in situations in which it is not obvious. There is no simple way to resolve these two views of PII: one relies on legal constructs, the other on scientific specifications.
1 See https://www.census.gov/programs-surveys/acs/about/is-my-privacy-protected.html [August 2017].
2 See the section “Examples Elucidating the PII/Non-PII Issue” for a more detailed discussion of both views.
However, there may be a way to integrate the insights of both disciplines to inform current understanding of PII. In law, the concept of PII carries with it legal rights and responsibilities, often described as fair information practices. The aim is to ensure the protection of PII, which is a legal obligation typically assigned to the entity in possession of the PII. There is a good reason to assign this responsibility to the data holder and not the data subject: the entity in possession of the personal data is in a better position to reduce the risks that might result from adverse use or a security breach. As explained in the famed 1973 report that led to the establishment of the Privacy Act (Turn and Ware, 1976, p. 1):
Privacy is an issue that concerns the computer community in connection with maintaining personal information on individual citizens in computerized record-keeping systems. It deals with the rights of the individual regarding the collection of information in a record-keeping system about his personal activities, and the processing, dissemination, storage, and use of this information in making determinations about him.
The corollary is that such obligations do not apply if the dataset does not contain PII. In recent years, computer scientists have helped make clear that what may not appear to be PII is in fact PII when new techniques or additional data are considered.3
The current situation has led some experts to suggest that PII is no longer a workable category because PII and non-PII are no longer readily distinguished. But if the legal purpose of PII—to assign rights and responsibilities in the collection and use of data—is combined with the scientific ability to reveal the existence of individual privacy compromise when it is not obvious, then the better solution is to recognize that the legal definition of PII should include both data that are obviously PII and data that are “latently PII,” that is, data that can be transformed into PII or, more broadly, that enable individual privacy compromise.
In essence, the panel believes that the legal PII category remains relevant and that the insight of scientists should inform how the law understands the term. One obvious implication is that the concept of PII becomes more important in a world of simultaneous use of multiple data sources.
3 A simple example of this is provided by a Social Security number. By itself, an SSN may appear not to be PII because the actual identity of the person associated with the SSN is not clear. However, if there is a lookup table that matches SSNs to individuals, the problem becomes trivial. The law understood this problem from the outset and always treated SSNs as PII (see U.S. Department of Health, Education, and Welfare, 1973; also see more detailed discussion in Chapter 5).
Our discussion of the legal context of statistical data analysis in the federal government begins with the language of the federal Privacy Act (see Box 4-1). That law sets out a wide range of responsibilities for federal agencies that collect, use, and disclose personal information, namely, PII. That information can include everything from employment records for agency personnel to license applications for pilots to the investigative records of law enforcement agencies. When a record is contained in a system of records, many legal obligations are created, including obligations to ensure the accuracy and integrity of the record, to ensure its security, and to make it available to those individuals to whom it pertains. However, an important exception for records maintained by federal agencies is made for “statistical records.” It is these records that are the focus of our discussion.
The Privacy Act describes a statistical record as “a record in a system of records maintained for statistical research or reporting purposes only and not used in whole or in part in making any determination about an identifiable individual” (5 U.S.C. 552a(a)(6)), with a few exceptions. Other sections of the Privacy Act limit matching of datasets except those that “produce aggregate statistical data without any personal identifiers” (5 U.S.C. 552a(a)(8)(B)(i)) or are “performed to support any research or statistical project, the specific data of which may not be used to make decisions concerning the rights, benefits, or privileges of specific individuals” (5 U.S.C. 552a(a)(8)(B)(iii)). Another provision of the law limits disclosure of personal records maintained by federal agencies except “to a recipient who has provided the agency with advance adequate written assurance that the record will be used solely as a statistical research or reporting record, and the record is to be transferred in a form that is not individually identifiable” (5 U.S.C. 552a(b)(5)). Records are also excluded from certain obligations, including privacy and accuracy, if they are “required by statute to be maintained and used solely as statistical records” (5 U.S.C. 552a(k)(4)).
It is noteworthy that these legal constructs are quite distinct from those common in statistics. In statistics, a statistical record is an aggregate of individual records: the notion of a single statistical record conflicts with the fact that statistics are based on aggregates of multiple records. Statisticians more commonly refer to “statistical uses” of record systems, implying that the statistics are summaries of attributes of many records in a record system.
Under federal privacy law, statistical data may be widely gathered, exchanged, and disseminated with the understanding that the federal agency does not have the ability to make determinations about “identifiable individuals” or to gather “personal identifiers,” and that the records are not “individually identifiable.”
Critical to understanding the significance of the term “statistical data” in the context of federal agency systems is the recognition that many of the responsibilities assigned to federal agencies for the collection and use of personal data are relaxed for the category of statistical data. To better understand the current situation, we turn to a bit of history.
Prior to the enactment of the Privacy Act, there was a lengthy review of federal record-keeping systems that resulted in a major report. The U.S. Department of Health, Education, and Welfare (HEW) closely examined the issue of statistical data, and many of the insights in that report are reflected in the law that followed.
The report noted that, with few exceptions (U.S. Department of Health, Education, and Welfare, 1973):
[T]here is little to prevent anyone with enough time, money, or perseverance from gaining access to a wealth of information about identifiable participants in surveys or experiments. This should not be the case . . . (p. 93)
Social scientists and others whose research involves human subjects are vocal about the importance of being able to assure individuals that information they provided for statistical reporting and research will be held in strictest confidence and used only in ways that will not result in harms to them as individuals [emphasis in original]. (p. 93)
At the same time, the report noted the value of statistical data:
The obverse of the problem of data confidentiality is the need to make basic data more accessible for reuse or reanalysis by all qualified persons or institutions. Personal data systems for statistical reporting and research are largely in the hands of institutions that wield considerable power in our society. Hence, it is essential that data which help organizations to influence social policy and behavior be readily available for independent analysis. (p. 94)
The report even anticipated some of the current issues:
In principle, there should be no conflict between informing the public about how the government conducts its business and protecting the individual data subject from harm. If data cannot be made available for reuse or reanalysis without disclosing the identity of data subjects, special precautions may have to be taken before making basic data accessible to qualified persons outside the collecting organization, but such precautions should be taken. (p. 95)
Overall, the HEW report describes many proposed safeguards for statistical record systems that were eventually adopted in the federal law.
As we explain below, the concept of PII varies among different laws that relate to privacy, which has fueled further skepticism among computer scientists about the use of the term. For example, it is not at all obvious why PII should create a different boundary condition for medical records than it does for video viewing records. However, many definitions of PII include both what is obviously PII and what could, through additional steps, become PII. The recently adopted General Data Protection Regulation of the European Union, which will likely be enormously influential in the years ahead, defines “personal data” as:4
. . . any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.
This definition contains the key phrase “identified or identifiable,” which conveys the view that actual identification may not be immediately apparent. This interpretation is strengthened by the subsequent modifier that a person may be identified “directly or indirectly.” Many privacy laws adopt this view of PII: that is, PII encompasses both information that is personally identifiable and information that could become personally identifiable.
As noted above, computer scientists look at privacy and statistical data through a different lens, questioning whether there even exists a meaningful boundary dividing PII from non-PII. Drawing on the tools of cryptography, they ask whether such a distinction can survive the wealth of possible side (“auxiliary”) information that is available, such as other accessible datasets, last year’s statistics, newspapers, blog posts, and tracking information.5
Privacy laws and their implementing regulations include numerous examples of specific rules for identifying concrete characteristics as PII, as well as more open-ended decision tools (standards) for categorizing data as PII or non-PII. Thus, while PII is a core concept in modern privacy law, there is wide variation in the definition of PII (and similar terminology, such as “individually identifiable information” and “personal information”) and how it is interpreted in practice, even among federal agencies. For example, the Family Educational Rights and Privacy Act and the Health Insurance Portability and Accountability Act provide different approaches to protecting data and enabling statistical use by external researchers. A clean mathematical separation between PII and non-PII could pave the way for unfettered access to the teachings of non-PII data on an Internet scale.
5 Statisticians would also be concerned about available auxiliary information and would apply statistical disclosure limitation methods.
Thus, to computer scientists, the relevant questions about the appropriate meaning of PII are, “What is the law trying to promote? What is it trying to proscribe?” Equipped with answers to these questions, not only can one evaluate a proposed definition and treatment of PII and non-PII in light of these goals, but one can also begin to address these goals directly, bypassing the definitional question.
A focus on distinguishing between inferences about individuals and inferences about groups provides essential tools for reasoning about privacy when the data are not collections of records, each belonging to a single individual. For example, the inference approach permits reasoning about the privacy implications of sets of statistics computed from individual records, even if these records have subsequently been destroyed. The “post-PII” question of what can be inferred from the statistics is a persistent privacy concern even in this extreme setting. We cannot overemphasize this point: the mere fact that there is no record to “re-identify” or to associate with a unique individual (because the records have been destroyed) does not mean there is no residual risk of disclosure specific to an individual through inference from the statistics that are available (together with auxiliary information). Thus, for such a collection of statistics, PII sounds like a misnomer.
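A minimal sketch of this point, with invented numbers: two exact statistics released before the microdata were destroyed can still disclose one person's attribute by simple subtraction, given the right auxiliary information.

```python
# Toy differencing sketch (fabricated numbers): disclosure from released
# statistics alone, even after the underlying microdata are destroyed.
# Suppose an agency published two exact counts from a small unit:
released_count_all = 4        # employees with a given diagnosis, all 20 staff
released_count_minus_one = 3  # same count recomputed after Alice departed

# Auxiliary information: Alice is the only person included in the first
# count but not the second. Subtracting the two published statistics
# reveals her diagnosis, with no record left to "re-identify."
alice_diagnosed = (released_count_all - released_count_minus_one) == 1
print(alice_diagnosed)  # True
```

Nothing in this sketch requires access to any individual record; the breach arises entirely from the pair of aggregates plus outside knowledge of who is in each.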
At this point in the argument, it is useful to note that “inference” in this context is distinct from the word as used in much of statistics and, by extension, in the federal statistical system. Much of the descriptive power of sample surveys is based on the use of probability samples of large populations. The use of probability sampling provides a mathematical basis for the relationship between a statistical aggregate in a sample and the corresponding aggregate for the whole population. In this sense, sample-based statistics have known inferential characteristics in relationship to the full population from which the sample was drawn. In contrast, “inference” with regard to the possibility of identifying a specific individual in a once-existing record system concerns the probability of inferring something about an individual based on information extractable from the record system. In this case, an “inference” runs from a set of information to the identity of an individual record.
In this section we explore the PII/non-PII issue through the lens of inference. Our goal will be to distinguish the case in which it is possible to infer a sensitive attribute about a single individual from the case in which it is only possible to infer that attribute about members of a group as a whole. One can call this the difference between an individual privacy breach and
a group privacy loss. Generally, statistical analysis accepts group privacy loss. For example, one may learn that people with a specific gene have an increased risk of developing a particular illness, which is a fact about people in general. In one sense, “group privacy loss” may accurately be viewed as “scientific discovery.” Suppose the data from the study about this increased risk contain the medical history of an individual, Alice, who has been diagnosed with the illness. If the study enables one to infer that Alice has been diagnosed with this illness, then it is an individual privacy breach. It is not a fact about the population as a whole; an individual diagnosis does not logically follow from increased risk (or even from illness). Anything that can be learned about Alice as a result of her participation in the study that could not have been learned had she not participated in the study is an individual privacy breach. In contrast, however, if Bob—who may or may not have been in the study—publishes his genetic data, and the study allows one to infer that Bob is at increased risk of the illness, it would not be considered an individual privacy breach for Bob.6
With the distinction between group privacy loss and individual privacy breach in mind, it is useful to consider such subjects as water salinity data, ice shelf measurements, and location of the jet stream, which do not appear to be about people at all or have any implications for individual privacy. In the legal view, these data are not PII. However, consider air quality statistics that summarize levels of a pollutant produced only by automobiles in a small town. Since this is a direct measurement of something produced by human-driven vehicles, it is clearly “about” people’s driving patterns. Indeed, given enough information about the driving of all the inhabitants but one, and given the measured pollutant level, it is possible to learn how much the “final” inhabitant drove. There are ways of getting at this personal information: for example, by comparing measurements on days when she is ill (and off the road) to measurements on days when she is healthy. Although this leads, in theory, toward classifying the pollutant level as PII (especially in small towns), it is logical to adopt a “watch and wait” approach for this scenario because a number of factors suggest that breaching individual privacy may be difficult. Two such factors are “noise” in the measurements caused by atmospheric conditions such as wind and precipitation, and the possible difficulty of obtaining repeated measurements of the type (currently) known to be useful, when combined, for an individual privacy breach.
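The “final inhabitant” arithmetic can be sketched as follows. The emission factor, residents, and measurements are all invented, and the sketch deliberately ignores the measurement noise discussed above, which in practice would blur this exact subtraction.

```python
# Hedged sketch of the "final inhabitant" differencing: if the town-wide
# pollutant total is a deterministic function of everyone's miles driven,
# knowing all contributions but one reveals the last person's driving.
# All values here are hypothetical.

EMISSION_PER_MILE = 0.25  # hypothetical grams of pollutant per mile driven

# Auxiliary knowledge: miles driven by every resident except one.
known_miles = {"resident_1": 120.0, "resident_2": 45.0, "resident_3": 200.0}
measured_total = 106.25   # town-wide pollutant measurement, in grams

known_total = EMISSION_PER_MILE * sum(known_miles.values())
final_resident_miles = (measured_total - known_total) / EMISSION_PER_MILE
print(final_resident_miles)  # -> 60.0
```

Real measurements would carry atmospheric noise, so the inference would be statistical rather than exact, which is precisely why a “watch and wait” posture can be defensible here.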
Taking this example one step further, consider a dataset obtained by linking the pollutant measurements to health records. Assume further that the data exist for a large city, rather than a small town. A public health goal might be to learn about correlations between pollutant levels and the incidence of chronic bronchitis. Such correlations are aggregate statistics about the population and would not constitute PII. However, if one could learn that an individual in the dataset experiences chronic bronchitis, it would be an individual privacy breach unless the same conclusion could be drawn if the individual were not in the dataset. Thus, the individual health records should be viewed as PII, while the link between pollution level and chronic bronchitis is a statistical fact about the population. Moreover, in this example, the records could be queried in a way designed to breach individual privacy, something that appears harder to do in the example of the atmospheric measurements in a small town.
6 This example also highlights the danger of categorizing information: when Bob chose to publish his genetic data, he might not have anticipated the (future) scientific discovery of his increased risk of disease; had he done so, he might not have published the information.
We close with a compelling example of the subtlety of the individual privacy breach determination: allele frequency statistics in genome-wide association studies:7
A genome-wide association study is an approach that involves rapidly scanning markers across the complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease. Once new genetic associations are identified, researchers can use the information to develop better strategies to detect, treat and prevent the disease. Such studies are particularly useful in finding genetic variations that contribute to common, complex diseases, such as asthma, cancer, diabetes, heart disease and mental illnesses.
The large number of measurements in a genome-wide association study presents a privacy problem that has only relatively recently come to be understood (Homer et al., 2008; Dwork et al., 2015): if one has only the statistics for the case group (people diagnosed with the illness) and the control group (healthy individuals), together with the DNA of an individual, it is possible to determine whether this individual is a member of the case group. Since the members of the case group have been diagnosed with the illness, this determination is an individual privacy breach. This situation is different from learning that someone’s DNA suggests an increased risk of the disease, which is what can be inferred for someone not in the study. In our approach to the PII/non-PII problem, learning the markers associated with an illness is a scientific fact about the population; learning that an individual has been diagnosed with an illness is an individual privacy breach.
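The sketch below is in the spirit of the Homer et al. (2008) test statistic, heavily simplified and run on simulated data; the marker counts, group sizes, and scoring function are illustrative assumptions, not the published method's exact form.

```python
import random

# Simplified, simulated sketch of a Homer-style membership test: compare an
# individual's genotype to published case-group allele frequencies versus
# reference-population frequencies. Real studies use hundreds of thousands
# of markers and carefully calibrated statistics.

random.seed(0)
N_SNPS, CASE_SIZE = 2000, 100

# Reference-population allele frequencies (assumed publicly known).
pop_freq = [random.uniform(0.1, 0.9) for _ in range(N_SNPS)]

def draw_genotype(freqs):
    # Per SNP, fraction of 2 alleles carrying the variant: 0, 0.5, or 1.
    return [(int(random.random() < f) + int(random.random() < f)) / 2
            for f in freqs]

case_group = [draw_genotype(pop_freq) for _ in range(CASE_SIZE)]
# The only "published" statistics: per-SNP case-group allele frequencies.
case_freq = [sum(person[j] for person in case_group) / CASE_SIZE
             for j in range(N_SNPS)]

def membership_score(genotype):
    # Positive scores mean the genotype sits closer to the case-group
    # frequencies than to the reference population, suggesting membership.
    return sum(abs(y - p) - abs(y - c)
               for y, p, c in zip(genotype, pop_freq, case_freq))

insider = case_group[0]              # someone in the case group
outsider = draw_genotype(pop_freq)   # someone not in the study

print(round(membership_score(insider), 1),
      round(membership_score(outsider), 1))
```

Across many markers, the small per-SNP pull that a member exerts on the published frequencies accumulates, so insiders typically score markedly higher than outsiders, which is the disclosure the text describes.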
The formulation set out in this chapter respects the perspectives of both law and computer science. The panel’s aim is to uphold the approach set out in the original Privacy Act of 1974 for the treatment of statistical data but to recognize that, over time, new techniques have emerged that have changed non-PII statistical data as defined in the original law to PII because of the availability of auxiliary information. We are not the only ones to attempt to bridge these views. Nissim et al. (2016, p. 33) wrote:
Legal and computer science concepts of privacy are evolving side by side, and it is becoming increasingly important to understand how they can work together. The field of computer science can benefit from an understanding of legal thinking on privacy, and the legal field can similarly be influenced by computer science thinking. The influence of one discipline on another can be very valuable in the future.
One way to resolve these two perspectives is simply to acknowledge that data that at one time may have been viewed as statistical data (i.e., non-PII data in the legal sense) are no longer statistical data. The practical consequence that follows from this acknowledgment is that the data must be protected in line with the higher standards typically associated with PII. The acknowledgment has a further implication: it establishes that the privacy status of data is dynamic over time, that datasets that are not individually identifiable today may in the future become individually identifiable.
There are at least two policy consequences of this acknowledgment. First, a determination that a dataset is statistical data should likely now include a date of certification that establishes when the data were deemed to be non-PII. Since it is a federal agency that is responsible for the management of the data systems, the federal agency should likely be responsible for this certification. These determinations should be periodically reviewed as more auxiliary data and new techniques are developed.
Second, if one wishes to establish that a dataset will remain statistical data for as long as one can foresee, it should be provably so. This characterization can be seen in such data as water salinity in the Chesapeake: because the data never contain PII, there is no auxiliary information or technique that would make them PII at any time.
To this point in the chapter, we have contrasted legal and computer science definitions of PII and emphasized that the common legal interpretation of the PII status of data is not a simple, invariant function. Rather, the PII status of a record is a dynamic feature, not a static one.
Procedures to protect the privacy of individuals are therefore an ongoing responsibility of the holder of a record system, which, for federal statistics, will involve a federal agency. Thus, this section considers the implications for federal agencies of the dynamic features of records.
This chapter represents a significant part of the panel’s assessment because when multiple record systems are combined, new issues of privacy protection may arise. Consider, for example, the case of a public-use file that is released by a statistical agency (agency A) for statistical analysis after careful attempts to anonymize the data. Now consider another record system with personal identifiers that has been kept totally confidential in a program agency (agency B), never subjected to statistical analysis, and not released to the public. Since no information was ever disseminated from agency B, the work by agency A to protect the identities of individuals in the public-use file was not informed by the information held by agency B. If, however, the agency B dataset were combined with the dataset from agency A that generated the public-use file and statistical analyses were disseminated on the combined set, the probability of re-identification of an individual in agency A’s dataset might be altered.
In short, moving into a world in which multiple datasets are combined can change threats to privacy. In the next sections of this chapter, we examine other privacy and confidentiality laws that apply to statistical data, as well as the legal and policy issues that arise with linking records from different data sources.
RECOMMENDATION 4-1 Because linked datasets pose greater privacy threats than single datasets, federal statistical agencies should develop and implement strategies to safeguard privacy while increasing accessibility to linked datasets for statistical purposes.
Other Laws Protecting Statistical Information
Administrative records systems on individuals are likely covered by the Privacy Act and subject to the permitted statistical uses described above. When statistical agencies use these records as a sampling frame for their surveys and append the survey data to that frame, the entire dataset is covered by the Privacy Act. In addition to the Privacy Act, federal statistical agencies are required by the Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA) and their own organic statutes to protect the confidentiality of identifiable information that they collect or acquire. The confidentiality of Census Bureau data is governed by Title 13, Section 9, of the U.S. Code, which specifies that:
Neither the Secretary, nor any other officer or employee of the Department of Commerce or bureau or agency thereof, or local government census liaison, may . . .
- use the information furnished under the provisions of this title for any purpose other than the statistical purposes for which it is supplied; or
- make any publication whereby the data furnished by any particular establishment or individual under this title can be identified; or
- permit anyone other than the sworn officers and employees of the Department or bureau or agency thereof to examine the individual reports.
No department, bureau, agency, officer, or employee of the Government, except the Secretary in carrying out the purposes of this title, shall require, for any reason, copies of census reports which have been retained by any such establishment or individual. Copies of census reports which have been so retained shall be immune from legal process, and shall not, without the consent of the individual or establishment concerned, be admitted as evidence or used for any purpose in any action, suit, or other judicial or administrative proceeding.
Similarly, CIPSEA Subtitle A, Section 512, requires that data be used only for statistical purposes and not be disclosed in identifiable form without consent:
- Use of Statistical Data or Information.—Data or information acquired by an agency under a pledge of confidentiality and for exclusively statistical purposes shall be used by officers, employees, or agents of the agency exclusively for statistical purposes.
- Disclosure of Statistical Data or Information.—Data or information acquired by an agency under a pledge of confidentiality for exclusively statistical purposes shall not be disclosed by an agency in identifiable form, for any use other than an exclusively statistical purpose, except with the informed consent of the respondent.
In addition to protecting PII, statistical agencies must protect identifiable information from businesses, schools, health care providers, and many other organizations from which they collect or acquire data. Although the Privacy Act generally does not apply to these respondents, CIPSEA and the agencies’ organic statutes do apply and impose strict requirements on agencies to ensure that they do not disclose identifiable information (e.g., see U.S. Office of Management and Budget, 2007). In addition, other laws, such as the Trade Secrets Act or exemptions in the Freedom of Information Act, may protect some of the information that statistical agencies collect from these organizations.
As we discuss in Chapter 2, combining some data sources will likely involve using record linkage techniques to match records from two different data sources on the same individuals or entities. These linkages could involve linking survey responses with administrative records, linking two or more administrative records sources, or linking private-sector information, such as credit reports or credit card transactions, to survey or administrative data.
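The kind of matching described above can be sketched in a few lines of code. The following is a minimal, purely illustrative example of deterministic record linkage between a survey file and an administrative file; the field names and records are hypothetical, and actual agency linkage relies on far more careful standardization and on probabilistic methods (e.g., Fellegi-Sunter scoring) rather than exact matching.

```python
# Illustrative sketch of deterministic record linkage. Field names and
# data are hypothetical; real linkage systems use probabilistic scoring
# and extensive identifier standardization.

def match_key(record):
    """Build a crude match key from name and date of birth."""
    return (record["name"].strip().lower(), record["dob"].strip())

def link(survey_records, admin_records):
    """Return (survey record, admin record) pairs sharing a match key."""
    index = {}
    for rec in admin_records:
        index.setdefault(match_key(rec), []).append(rec)
    pairs = []
    for rec in survey_records:
        for candidate in index.get(match_key(rec), []):
            pairs.append((rec, candidate))
    return pairs

survey = [{"name": "Ada Smith", "dob": "1980-01-02", "degree": "BS"}]
admin = [
    {"name": "ada smith ", "dob": "1980-01-02", "employer": "Acme"},
    {"name": "Bo Jones", "dob": "1975-06-30", "employer": "Widgets"},
]

linked = link(survey, admin)
```

Even this toy example shows why linkage heightens privacy risk: the linked pair now carries both the survey variables and the administrative variables for the same person.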
Ivan Fellegi traced the origins of record linkage methods to the 1960s and three simultaneous developments: the accumulation of large data files about businesses and individuals, new computing capabilities that enabled processing those files, and increased demand for more detailed information. These developments, in turn, increased demands for privacy safeguards. He wrote about the development of linkage policy at Statistics Canada (Fellegi, 1999, p. 12):
As a society we did not want comprehensive population registers, largely because we did not want a large scale and routine merging of information contained in government files. But we did not want to rule out some merging for some well justified purposes. So, as a matter of conscious public policy, we made linkage very difficult. However, we allowed the development of record linkage methodology for use in exceptional circumstances. The applications were indeed important, often requiring a high level of accuracy, so we refined the methodology, and also made it vastly more efficient.
Statistics Canada (2017, p. 2) recently updated its directive on microdata linkage, which acknowledges the “inherent privacy-invasive nature of the activity” of record linkage and expects that (1) the linked data results in information for the public good, (2) confidentiality will be maintained and the information will be used only for statistical purposes, and (3) the linkages offer demonstrable cost or respondent burden savings over other alternatives, or are “the only feasible option to meet the project objectives.” The directive includes omnibus authority for linkages for specific purposes, which can include linking surveys or administrative data.
In the United States, linking records has historically raised privacy concerns. The Computer Matching and Privacy Protection Act of 1988 amended the Privacy Act to establish procedures ensuring that computer matching cannot be used to terminate an individual's program benefits without notice, and to guard against illegitimate uses of matching. Computer matching refers to the comparison of records, often containing PII, between two or more systems; agencies can match records with one another to ensure that federal benefits are distributed properly. For example, records from the Temporary Assistance for Needy Families program can be matched with information from the National Directory of New Hires to determine whether program participants have acquired a job and therefore are no longer in need of program benefits. Computer matching has saved New York an estimated $62 million annually, on average (U.S. Government Accountability Office, 2014). But computer matching has also raised significant concerns about the protection of individuals. New York City recently sought to delete data about New Yorkers that could be used to prosecute immigration cases.8
The requirements of the Computer Matching and Privacy Protection Act do not apply to all federal agencies: exemptions cover matching for statistical or research purposes, law enforcement investigations, and certain tax-related matching, so the act does not directly affect statistical agencies. The statistical and research exemptions are provided in 5 U.S.C. § 552a(a)(8), Records maintained on individuals:
- but does not include—
- matches performed to produce aggregate statistical data without any personal identifiers;
- matches performed to support any research or statistical project, the specific data of which may not be used to make decisions concerning the rights, benefits, or privileges of specific individuals.
Consent for Record Linkage
In reviewing the privacy risks inherent in record linkage and discussing obtaining consent from respondents to link records, the U.S. General Accounting Office (2001, p. 57) noted:
The issue of consent to linkage derives from a core concept of personal privacy: the notion that each individual should have the ability to control personal information about himself or herself.
However, the report also stated that consent to linkage may not be necessary (p. 58):
If certain safeguards are in place, such as review by a group with the interests of the data subjects in mind or use of appropriate confidentiality and security protections.
The original Fair Information Practices described above do not explicitly refer to consent for record linkage. A subsequent version of the Fair Information Practices, set out by the OECD in 1980, contemplates consent in two specific instances: (1) to allow the collection of personal information and (2) to use personal data for purposes other than those originally stated.9 Recently, the U.S. Office of Management and Budget updated its Circular A-130 (U.S. Office of Management and Budget, 2016), “Managing Information as a Strategic Resource,” which provides policies and requirements for federal agencies to follow for the management of federal information. This circular included the following “Fair Information Practices Principle”:10
8 National Public Radio: “City Officials Go to Court to Protect New Yorkers with Municipal IDs.” Available: http://www.npr.org/2016/12/20/506285207/city-officials-go-to-court-to-protect-new-yorkers-with-municipal-ids [September 2017].
Individual Participation. Agencies should involve the individual in the process of using PII and, to the extent practicable, seek individual consent for the creation, collection, use, processing, storage, maintenance, dissemination, or disclosure of PII. Agencies should also establish procedures to receive and address individuals’ privacy-related complaints and inquiries.
In the United States, there is no uniform policy that guides consent requirements for linking records. Some statistical agencies are in departments that are signatories to the Common Rule for the protection of human subjects (45 CFR Part 46), while others are not. Under the Common Rule, organizations must have an Institutional Review Board (IRB) determine whether the risks to human subjects have been minimized and informed consent has been obtained. However, even federal statistical agencies subject to the Common Rule may receive a waiver from the IRB or may not be required to go through an IRB or obtain consent from respondents for linking data because of the strong confidentiality protections they have for data they collect or acquire.
Currently, there are differences in policies and practices across statistical agencies regarding consent. For some surveys, including the National Health Interview Survey sponsored by the National Center for Health Statistics, interviewers ask respondents for explicit consent for record linkage. In contrast, the Survey of Income and Program Participation, sponsored by the Census Bureau, sends survey respondents an advance letter that states
9OECD Guidelines on the Protection of Privacy and Transborder Flows of Personal Data. Available: http://www.oecd.org/sti/ieconomy/oecdguidelinesontheprotectionofprivacyandtransborderflowsofpersonaldata.htm [September 2017]. This elaborates the discussion in the panel’s first report.
10 Available: https://www.whitehouse.gov/omb/circulars_a130_a130trans4 [September 2017].
11 We note, however, that the individual participation principles in the OECD Privacy Guidelines (see fn. 9, above) do not address the issue of consent, but focus instead on the right of the individual to obtain information about the personal data that are held by others and to seek correction or deletion if requested.
the Census Bureau will obtain administrative records from other agencies and asks respondents to “opt out” if they don’t want this to occur:12
To be efficient, the Census Bureau attempts to obtain information you may have given to other agencies if you have participated in other government programs. We do so because it helps to ensure your data are complete, and it reduces the number of questions you are asked on this survey. The same confidentiality laws that protect your survey answers also protect any additional information we collect (Title 13, U.S.C., Section 9). If you wish to request that your information not be combined with information from other agencies, we ask that you notify the field representative at the time of the interview.
The Census Bureau has a provision in the Privacy Act that permits it to receive identifiable information from other agencies (5 U.S.C. § 552a(b)(4)) for use in censuses, surveys, or other activities under Title 13; it does not need specific consent from the respondents to those censuses, surveys, or other activities.
Whether a statistical agency requires consent for linkage can have implications for the quality of the resulting linked data. Surveys that require consent to link records have reported declines in the percentage of respondents giving that consent, paralleling the decline in survey response rates (Fulton, 2012; Kreuter et al., 2016). As with declining response rates, declining consent rates do not necessarily introduce bias, but the potential for bias grows as consent rates fall. Bias arises in the linked data when the people who consent differ systematically from those who do not.
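The mechanics of consent bias can be seen in a toy example. All numbers below are invented for illustration: suppose willingness to consent to linkage is correlated with the variable being measured, here earnings. Statistics computed from the linked (consenting) records alone will then differ systematically from those for the full sample.

```python
# Toy illustration of consent bias in linked data. All values are
# invented: higher earners here are assumed less likely to consent.

full_sample = [
    {"earnings": 30_000, "consented": True},
    {"earnings": 40_000, "consented": True},
    {"earnings": 90_000, "consented": False},
    {"earnings": 100_000, "consented": False},
]

def mean_earnings(records):
    """Average earnings over a list of records."""
    return sum(r["earnings"] for r in records) / len(records)

overall = mean_earnings(full_sample)
linked_only = mean_earnings([r for r in full_sample if r["consented"]])

# The linked subsample understates mean earnings because consent is
# correlated with the outcome of interest.
```

In this contrived example the full-sample mean is $65,000 while the linked subsample's mean is $35,000; the gap is the consent bias, and it would persist no matter how large the sample grew.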
Because of the many new data sources available, including those from the private sector and the Internet, further questions have been raised about the feasibility of asking for consent and about what informed consent means when all the uses of these data cannot be known (see, e.g., Barocas and Nissenbaum, 2014). There have been international efforts to produce a set of ethical principles and to discuss governance, legal frameworks, and other key issues, such as privacy, consent, and data sharing, for research using these new forms of data (OECD, 2016). This work goes beyond the scope of this panel, but we believe that federal statistical agencies should be aware of and follow these efforts and adapt from them any policies and best practices they believe appropriate.
12 See https://www.census.gov/programs-surveys/sipp/information/sipp-faqs.html [September 2017].
Public Attitudes About Record Linkage
Public concerns about record linkage activities will need to be carefully monitored and addressed by federal statistical agencies as they move forward with combining multiple data sources. The New York Times13 recently noted the public outcry in reaction to a decree from the French government to merge information from passports and identity cards into one large database containing photographs, names, addresses, marital status, weight, and fingerprints. This database was to be used for identity verification, not statistical purposes, but there may not be a clear delineation between the two purposes in public perception. If publicity about record linkage leads to greater public mistrust, that mistrust can carry over to other aspects of the federal statistical system.