Our discussion so far has largely focused on federal statistical agencies’ production of statistics. However, the mission of statistical agencies is not only to produce statistics, but also to provide statistical research access to the data that they collect while protecting the confidentiality of that information. Although federal statistical agencies provide descriptive statistics that are policy relevant (National Research Council, 2013b), data from the federal statistical agencies are also a key resource for policy analysis and evaluation conducted by researchers and evaluators outside of government (National Research Council, 2005). Indeed, for decades U.S. society has profited from applied social science research and policy analysis using federal data, conducted by universities, nonprofit organizations, think tanks, and advocacy groups. These activities are supported by government grants and contracts, private foundations, and corporate and individual donors (National Research Council, 2005).
It is also well recognized that having external users analyze statistical data is key to improving the quality of federal statistical agency processes (Abowd et al., 2004). The importance of microdata access goes beyond the ability to perform primary analyses relevant to policy; it also includes evaluating the data generation process, replicating scientific findings, and building a knowledge infrastructure (Bender et al., 2016).
In this chapter, we first briefly review the work of previous panels of the National Research Council and National Academies of Sciences, Engineering, and Medicine that have examined issues of researcher access to federal data. We then discuss the risks of access and the challenges that statistical agencies face when providing data access and statistics in an ever-changing data environment. Next, we review the common legal foundations for privacy and confidentiality that apply to the federal statistical agencies and federal administrative records: agencies must follow these laws when acquiring information from data providers, linking different datasets, and providing access to external researchers. We note some inadequacies of these laws. We summarize a variety of approaches used by federal statistical agencies for providing access to confidential data for statistical purposes, as well as access models from other countries, focusing on those that involve combining multiple data sources. We conclude with a brief discussion of some relevant privacy-enhancing technologies, broadly describing their roles in the context of a single dataset (or statistical agency) and the implications for bringing together multiple data sources.
We note at the outset that the collection and use of personal information by the federal government, including use for statistical purposes, has long raised privacy concerns. Consequently, there are strong protections to guard the privacy of individuals, such as those for the U.S. decennial census, to encourage public participation and to ensure that data are used for the purpose for which they were collected (Allen and Rotenberg, 2016). These protections help ensure that statistical data are not used in ways that could cause adverse impacts on individuals.
Currently, however, the structures of privacy protections for statistical data are under increasing pressure, for several reasons. First, there is increasing use of government statistical data by private organizations that seek to link data collected for statistical purposes with identifiable individuals.1 This privacy risk arises from public-private partnerships that may lead to data uses not originally anticipated. Second, there are increasing and more sophisticated instances of data breaches and identity theft in the United States. Even data that are gathered for statistical purposes may be subject to misuse by others as a consequence of a data breach.2
1 See, for example, http://www.experian.com/marketing-services/insource-demographics.html [December 2016].
2 In Australia, for example, the Australian Bureau of Statistics (ABS) has had 14 data breaches since 2013, though “none of the notifications related to disclosure of mishandling of any census data, or attempts by an external party to expose or steal information” (see https://www.theguardian.com/australia-news/2016/jul/29/australian-bureau-of-statistics-reports-14-data-breaches-since-2013 [December 2016]). ABS has also faced criticism from a number of privacy and civil liberties groups over changes to the 2016 census that involved the length of retention of Australians’ names and addresses. This will mean that for the first time, the census will retain identifiable information on all Australians for 4 years. ABS has said this will allow it to form a “richer and dynamic statistical picture” of the country (see https://www.theguardian.com/technology/2016/jul/25/census-2016-australians-who-dont-complete-form-over-privacy-concerns-face-fines [December 2016]).
The “fair information practices” of the U.S. Department of Health, Education, and Welfare (1973) are the foundation for most modern privacy laws, including the laws that regulate personal information collected by federal agencies. These practices require notice to the individual about what information is being collected, consent by the individual to the collection, the individual’s ability to access his or her own data, assurance that the data are kept securely, and enforcement mechanisms for these protections (see Gellman, 2016).
Since then, many studies have examined issues of protecting confidentiality while providing data access to researchers. Private Lives and Public Policies (National Research Council, 1993, pp. 15-16) provided a key conceptual framework for these issues, clearly defining the tension between privacy and the value of data-informed policy making:
Private lives are requisite for a free society. To an extent unparalleled in the nation’s history, however, private lives are being encroached on by organizations seeking and disseminating information. . . . In a free society public policies come about through the actions of the people. Those public policies influence individual lives at every stage [of life]. . . . Data provided by federal statistical agencies…are the factual base needed for informed public discussion about the direction and implementation of those policies. Further, public policies encompass not only government programs but all those activities that influence the general welfare, whether initiated by the government, business, labor, or not-for-profit organizations. Thus, the effective functioning of a free society requires broad dissemination of statistical information.
Private Lives and Public Policies defined “informational privacy” as “an individual’s freedom from excessive intrusion in the quest for information and an individual’s ability to choose the extent and circumstances under which his or her beliefs, behaviors, opinions, and attitudes will be shared with or withheld from others” (National Research Council, 1993, p. 22). It is distinguished from “confidentiality,” which refers broadly to an obligation not to transmit that information to an unauthorized party. A more modern definition of information privacy addresses the rights and responsibilities associated with the collection and use of personal information (Rotenberg, 2000).
Another aspect to privacy is concerned with what can be inferred about an individual based on the combination of publicly available information sources and the release of statistical information (Fellegi, 1972; Homer et al., 2008) or the taking of publicly observable actions based on statistical information (Calandrino et al., 2011).
A sustained series of research efforts has sought to understand how survey respondents think about the privacy and confidentiality of their information and
the risks of disclosure, and how these perceptions affect their behavior (e.g., see Singer, 2003; Singer and Couper, 2010; Singer et al., 2003). Federal statistical agencies typically make pledges to survey respondents that they will keep their information confidential and use it only for statistical purposes. Agencies believe that this is fundamental to gaining cooperation from respondents as well as obtaining accurate and complete information (National Research Council, 1979, 2005).
A decade after Private Lives and Public Policies, another National Research Council report (2005) noted that both the societal need for data and public concern about privacy and confidentiality had increased significantly since 1993. These trends have continued, if not accelerated, in recent years. High-profile data breaches at companies, health care providers, universities, and federal agencies have raised people’s concerns about the security of their information. In particular, the 2015 Office of Personnel Management (OPM) breach raised concerns about the government’s ability to secure information. According to OPM, the personnel records of 21.5 million current, former, and prospective federal employees and contractors were stolen, including 5.6 million digitized fingerprints used as biometric identifiers to confirm identity, as well as the user names and passwords that applicants used for their background investigation forms. The Office of Management and Budget (OMB) reported to Congress in 2016 that the U.S. Computer Emergency Readiness Team had received notice of 77,183 incidents over the past year3 and that cyberattacks against federal agencies are likely on the rise.
There are also access problems other than breaches. One example comes from Australia, which experienced a temporary denial-of-service shutdown of the website for the 2016 Australian census (Ramzy, 2016). Even prior to this event, the 2016 census had generated a great deal of concern about the privacy of census records (Chirgwin, 2016; Warren, 2016) because ABS had decided to retain names and addresses with the data records for 4 years, instead of the 18 months used for previous censuses, in order to match census data with other sources. The privacy concerns intensified after reports that the census website had been cyberattacked (Ramzy, 2016). And as noted above, there have been multiple breaches of ABS since 2013; these kinds of threats can undermine the social science research enterprise itself, resulting in lower voluntary participation in surveys (National Research Council, 2005, 2013a).
CONCLUSION 5-1 Data breaches and identity theft pose risks to the public.
De-identified individual-level microdata files are of great benefit to researchers because they permit a wide variety of detailed analyses. However, they also present a risk to privacy because of the likelihood that combinations of many characteristics would uniquely identify an individual or organization. Survey data linked with administrative data offer even greater benefits to researchers, but they present even greater risks because they bring together more information than is found in a single source.
CONCLUSION 5-2 Combining multiple data sources increases risks to the public from data breaches and identity theft.
The proliferation of publicly accessible data outside of the statistical agencies has dramatically increased the risks inherent in releasing microdata, because these other data sources can be used to re-identify putatively anonymized data. For example, a medical record stripped of the individual’s name but containing the individual’s date of birth, gender, and zip code might be linked with a record from a different source, such as a voter registration list, that contains the individual’s name together with those same fields.4
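To make the mechanics concrete, the following sketch (in Python, with entirely invented records and field names) shows how a simple join on shared quasi-identifiers can re-attach names to “de-identified” records:

```python
# Hypothetical illustration of linkage re-identification. Both datasets
# below are invented; only the shared quasi-identifiers (date of birth,
# zip code, sex) are needed to join them.

# "De-identified" medical records: name removed, random ID retained.
medical = [
    {"rand_id": "a91", "dob": "1945-07-31", "zip": "02138", "sex": "M",
     "diagnosis": "hypertension"},
    {"rand_id": "c04", "dob": "1982-02-14", "zip": "02139", "sex": "F",
     "diagnosis": "asthma"},
]

# Public voter roll: names alongside the same quasi-identifiers.
voters = [
    {"name": "W. Weld", "dob": "1945-07-31", "zip": "02138", "sex": "M"},
    {"name": "J. Doe",  "dob": "1982-02-14", "zip": "02139", "sex": "F"},
]

def reidentify(medical, voters):
    """Join the two sources on the shared quasi-identifiers."""
    index = {(v["dob"], v["zip"], v["sex"]): v["name"] for v in voters}
    matches = {}
    for m in medical:
        key = (m["dob"], m["zip"], m["sex"])
        if key in index:
            matches[index[key]] = m["diagnosis"]
    return matches

print(reidentify(medical, voters))
```

When the quasi-identifier combination is rare or unique in the population, a single exact match suffices, which is why removing names alone is not protective.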
Similarly, supposedly “anonymized” Netflix rating records were re-identified by linking them to Internet Movie Database (IMDb) reviews on the basis of the titles and approximate dates of as few as three movies (Narayanan and Shmatikov, 2008). Because these kinds of linkages can occur even when de-identification systems are properly implemented, and require no breach or security violation, it is difficult to know how frequently they occur. In the words of the President’s Council of Advisors on Science and Technology (2014, p. xi):
Anonymization is increasingly easily defeated by the very techniques that are being developed for many legitimate applications of big data. In general, as the size and diversity of available data grows, the likelihood of being able to re-identify individuals (that is, re-associate their records with their names) grows substantially.
The assumption underlying the use of attempted anonymization for privacy-preserving data analysis is that the data analyst can learn nothing about any individual while still learning the desired statistics. This assumption
4 This was the approach used to re-identify the anonymized medical encounter data of Massachusetts Governor William Weld, which linked birthday, zip code, and gender fields in the voter registration records (Sweeney, 1997).
is problematic at best. One idea for mitigating this problem is to provide an analyst only with “statistical” access to the data, without direct access to the raw (anonymized or not) data. Although such an approach has promise, it is not in itself protective of privacy. There are at least two kinds of attacks that can be launched against systems of this sort: reconstruction attacks and membership attacks.
In a reconstruction attack, an analyst can learn the value of a confidential attribute (e.g., suffers from depression) for almost every member of a dataset, or of a targeted subpopulation, by combining merely approximate estimates of the fraction of members having the attribute in sufficiently many random subsets of that population (Dinur and Nissim, 2003; Dwork et al., 2007; Kasiviswanathan et al., 2013; Muthukrishnan and Nikolov, 2012).
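The following toy sketch conveys the idea with invented data: it is a brute-force miniature, not the constructions in the papers cited above. A secret bit per person is hidden behind a query interface that returns only noisy subset counts, yet the answers to enough random queries pin the secret down:

```python
import itertools
import random

# Miniature reconstruction attack (illustrative only). Each of n people
# has a secret bit (e.g., "has the attribute"). The attacker sees only
# counts over random subsets, each perturbed by noise of magnitude <= 1.

random.seed(0)
n = 8                                   # small, so brute force is feasible
secret = [random.randint(0, 1) for _ in range(n)]

def noisy_count(subset):
    """Query interface: count of the attribute over a subset, plus noise."""
    true = sum(secret[i] for i in subset)
    return true + random.choice([-1, 0, 1])

# Attacker records answers to many random subset queries...
queries = []
for _ in range(40):
    subset = [i for i in range(n) if random.random() < 0.5]
    queries.append((subset, noisy_count(subset)))

# ...then searches for any bit vector consistent with every answer to
# within the noise bound. With enough queries and small noise, only
# vectors at or very near the secret survive.
def reconstruct(queries, n, bound=1):
    for bits in itertools.product([0, 1], repeat=n):
        if all(abs(sum(bits[i] for i in s) - a) <= bound
               for s, a in queries):
            return list(bits)

guess = reconstruct(queries, n)
errors = sum(g != s for g, s in zip(guess, secret))
print(errors)   # few or no wrong bits
```

The brute-force search here stands in for the efficient linear-algebraic recovery in the literature; the point is that "noisy aggregate statistics only" does not, by itself, prevent recovery of individual records.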
In a membership attack, an analyst could, say, learn whether a target individual is a member of a case group (e.g., people diagnosed with a particular disease) in a genome-wide association study. The attack requires only the DNA of the target individual (easily obtainable, as a person sheds DNA on everything touched) together with approximate allele frequency statistics for the case group and a control group (Dwork et al., 2015a; Homer et al., 2008). That is, the released information is merely a set of statistics revealing the approximate frequencies of “C” and “T,” or of “G” and “A,” at a large number of locations in the DNA for the case and control groups. In response to the work of Homer and colleagues (2008), the National Institutes of Health prohibited publication of allele frequency statistics in the studies it funds.
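A toy sketch of the idea follows, using simulated, entirely artificial “genotypes” and a simple distance-based score rather than the precise statistic of Homer et al.; it shows how published per-group frequencies can reveal membership:

```python
import random

# Illustrative membership attack on published allele frequencies.
# All numbers are simulated; no real genomic data is involved.

random.seed(1)
num_snps = 2000          # many locations, as in the attacks cited above

# Hypothetical population (reference) allele frequency at each location.
pop_freq = [random.uniform(0.1, 0.9) for _ in range(num_snps)]

def sample_genotype(freqs):
    """Allele fraction per location: 0, 0.5, or 1 (two random alleles)."""
    return [(int(random.random() < f) + int(random.random() < f)) / 2
            for f in freqs]

# A small case group; the target of interest is member 0.
case_group = [sample_genotype(pop_freq) for _ in range(30)]
target_in = case_group[0]
target_out = sample_genotype(pop_freq)     # someone not in the case group

# The only published statistics: per-location mean frequency in the case group.
case_freq = [sum(g[j] for g in case_group) / len(case_group)
             for j in range(num_snps)]

def membership_score(genotype):
    """Positive when the genotype sits closer to the case-group
    frequencies than to the reference frequencies, summed over locations."""
    return sum(abs(g - p) - abs(g - c)
               for g, p, c in zip(genotype, pop_freq, case_freq))

print(membership_score(target_in) > membership_score(target_out))
```

Each location leaks only a tiny amount, but the leakage accumulates across thousands of locations, which is why aggregate frequency tables were withdrawn from open publication.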
A report of the Institute of Medicine (2009, p. 100) considered the general problem of how to enable researchers’ use of medical data while safeguarding privacy:
In recent years, a number of techniques have been proposed for modifying or transforming data in such a way so as to preserve privacy while statistically analyzing the data (reviewed in Aggarwal and Yu, 2008; NRC, 2000, 2005, 2007b,c). Typically, such methods reduce the granularity of representation in order to protect confidentiality. There is, however, a natural trade-off between information loss and the confidentiality protection because this reduction in granularity results in diminished accuracy and utility of the data, and methods used in their analysis. Thus, a key issue is to maintain maximum utility of the data without compromising the underlying privacy constraints.
The report also noted (p. 101):
Precisely how this body of developing methodologies may be effectively used in the types of health research of the sort envisioned in this report
remains an open question and this is an area of active research. Thus, alternative mechanisms for data protection going beyond the removal of obvious identifiers and the application of limited modifications of data elements are required. These mechanisms need to be backed up by legal penalties and sanctions.
Much of the relevant law that governs federal agencies regarding maintenance of information about individuals derives from the fair information practices (U.S. Department of Health, Education, and Welfare, 1973), which set out the rights and responsibilities associated with the collection and use of personally identifiable information. Among the rights of individuals is the ability to know what personal information about them is collected, how it is used, and who has access to it. Among the responsibilities of those who collect and use personal information are the obligations to ensure the data are used for their intended purpose, are timely and accurate, and are protected against unauthorized use or disclosure.
The Privacy Act of 1974 (5 U.S.C. § 552a) established limitations on the use of a person’s Social Security number (SSN), which was viewed as the primary key for combining databases. Section 7 of the Privacy Act provides that any agency requesting disclosure of an SSN must “inform that individual whether that disclosure is mandatory or voluntary, by what statutory authority such number is solicited, and what uses will be made of it.” Congress recognized the privacy interest, and the dangers of widespread use of SSNs as universal identifiers, by making it unlawful for a government agency to deny any right, benefit, or privilege because of an individual’s refusal to disclose his or her Social Security number (5 U.S.C. § 552a). The relevant Senate committee stated that the widespread use of SSNs as universal identifiers in the public and private sectors is “one of the most serious manifestations of privacy concerns in the Nation.”5 Short of prohibiting the use of SSNs, the provision in the Privacy Act attempts to limit the use of the number to those purposes for which there is clear legal authority to collect an SSN.
In addition to its provisions for the basic protection of individuals’ records, the Privacy Act includes provisions that pertain to the use of records for statistical purposes. The law sought to enable the continued use of data generated by federal agencies while safeguarding privacy (Allen and Rotenberg, 2016).
5 S. Rep. No. 1183, 93d Cong., 2d Sess., reprinted in 1974 U.S. Code Cong. & Admin. News 6916, 6943.
There are exceptions to Privacy Act obligations for the use of federal agency data for solely statistical purposes. Records may be matched between federal agencies when the matches “produce aggregate statistical data without any personal identifiers” (5 U.S.C. § 552a(a)(8)(B)(i)). Matches may also be performed “to support any research or statistical project, the specific data of which may not be used to make decisions concerning the rights, benefits, or privileges of specific individuals” (5 U.S.C. § 552a(a)(8)(B)(ii)). Agencies are also permitted to disclose records “to a recipient who has provided the agency with advance adequate written assurance that the record will be used solely as a statistical research or reporting record, and the record is to be transferred in a form that is not individually identifiable” (5 U.S.C. § 552a(b)(5)).
Agencies are permitted to create exemptions from Privacy Act obligations that would otherwise apply if the records are “required by statute to be maintained and used solely as statistical records” (5 U.S.C. § 552a(k)(4)). However, the definition of statistical data in the act is narrow: a statistical record means “a record in a system of records maintained for statistical research or reporting purposes only and not used in whole or in part in making any determination about an identifiable individual” (5 U.S.C. § 552a(a)(6)).
CONCLUSION 5-3 Privacy laws have established clear limitations on the collection and use of personally identifiable information for statistical purposes. There are also limits on the use of identifiers, such as Social Security numbers, that enable the linkage of distinct record systems. These laws reflect concerns about the use of personal data gathered by federal agencies.
When federal statistical agencies collect survey data from respondents, they usually pledge to keep the information they collect confidential and to use it only for statistical purposes.6 Statistical agencies are able to make this pledge to respondents because of authority in their authorizing statutes (e.g., Census Bureau’s Title 13) or through the Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA). Prior to the passage of CIPSEA (P.L. 107-347), there was a patchwork of legislation protecting the statistical information collected by various federal statistical agencies (National Research Council, 1993; Wallman and Harris-Kojetin, 2004), with some agencies, such as the Census Bureau, having very strong legal protections for the confidentiality of the data they collected, and other
6 There are some notable exceptions, such as the Census of Governments, which collects public information from state and local governments, and this information is published in identifiable form. Similarly, the National Center for Education Statistics does not pledge to keep information in the Common Core of Data confidential as this basic information about K-12 schools is considered public and is widely distributed and used by the U.S. Department of Education and others.
agencies having no statutory authority to protect the confidentiality of the data they collected for statistical purposes.
CIPSEA established uniform confidentiality protections for information acquired by agencies—including both the principal statistical agencies and recognized statistical units—under a pledge of confidentiality and for exclusively statistical purposes. CIPSEA requires that such information be used exclusively for statistical purposes and not be disclosed by an agency in identifiable form, for any use other than an exclusively statistical use, without informed consent. CIPSEA defines a statistical purpose as “the description, estimation, or analysis of the characteristics of groups, without identifying the individuals or organizations that comprise such groups” (§502(9)(A)) and explains that the definition includes “the development, implementation, or maintenance of methods, technical or administrative procedures, or information resources that support such purposes” (§502(9)(B)).
CIPSEA provides a high and uniform floor of legal protections and includes criminal penalties for disclosure and unauthorized uses of the information. Specifically, intentional disclosure of confidential information is a class E felony, punishable by a fine of up to $250,000, imprisonment for up to 5 years, or both (see U.S. Office of Management and Budget, 2007). In conjunction with the authorizing statutes for the federal statistical agencies, CIPSEA provides the foundation for acquiring and protecting data not only from survey respondents, but also from administrative agencies and other data providers.
Federal statistical agencies are required to report to OMB annually on their implementation of CIPSEA and compliance with the requirements in OMB’s implementation guidance (see U.S. Office of Management and Budget, 2007). Agencies are required to use a uniform CIPSEA pledge for all of their data collections that are covered by CIPSEA. These pledges must state that the information will be kept confidential, will be used only for statistical purposes, and will not be disclosed to anyone except agency employees and their agents without the data provider’s consent; the pledges also state the penalties noted above for willful disclosure. Agencies are also required to train all employees and agents with access to data covered by CIPSEA and to certify annually that they have completed CIPSEA training and that all statistical products have been reviewed to ensure there is no disclosure of identifiable information.
CIPSEA permits recognized federal statistical agencies or units7 to designate external researchers to obtain access to confidential statistical data for exclusively statistical purposes by giving these agencies the
7 For a list of OMB-recognized statistical agencies and units, see https://obamawhitehouse.archives.gov/omb/inforeg_statpolicy/bb-principal-statistical-agencies-recognized-units [February 2017].
authority to make such researchers their designated agents and to bind them to the same restrictions in their use of the data and the same criminal penalties for disclosure and misuse as the agencies’ employees. This authority has enabled greater opportunities for access and analysis of federal statistical data by agencies that did not previously have this authority.
Recently, federal statistical agencies were concerned that a provision in the Cybersecurity Act of 2015 could be used to undermine the confidentiality protections for the data they collect. Specifically, the Federal Cybersecurity Enhancement Act of 2015 (Title II, Subtitle B of the Cybersecurity Act) gives the secretary of the U.S. Department of Homeland Security (DHS) authority to access any information traveling to or from an agency information system, notwithstanding any other law. While federal statistical agencies need and welcome cybersecurity protection from DHS, they were concerned that personally identified data could be accessed and used for purposes unrelated to their agency’s mission. After coming to an agreement with DHS, the statistical agencies modified their confidentiality pledges to acknowledge the screening of information by DHS.8
Statistical agencies are required by law to protect the confidentiality of the data they collect while maximizing their utility. Threats from data breaches and the growing availability of other sources of data that might be used to re-identify individuals or entities require statistical agencies to reconsider how they can maintain data confidentiality. The publication of statistics covering various groups and subgroups requires careful consideration of how to safely release statistical products and of the potential privacy losses that might occur. In this section, we discuss several different approaches to protecting the privacy of data, including minimizing the personal data that are collected, minimizing disclosure risk by restricting the data that are released, controlling access to and use of the data, encrypting data, and using differential privacy techniques to measure and control cumulative privacy loss.
One approach to data protection in a statistical environment is to minimize or eliminate the collection of personally identifiable information (Agre and Rotenberg, 1998), that is, information that can be used to distinguish or trace an individual’s identity, either alone or when combined with other
8 For example, see Federal Register, Vol. 81, No. 235, pp. 88270-88272 (December 7, 2016).
information that is linked or linkable to a specific individual.9 The concept can be understood along multiple dimensions. First, in many areas of government statistics, little or no personally identifiable information is gathered (Solove and Schwartz, 2015). For example, the National Oceanic and Atmospheric Administration (NOAA) collects vast amounts of data that are provided in statistical formats to aid government planning and private-sector activity. The data include weather forecasts, hurricane warnings, and climate trends. Other such data include NOAA’s recreational fisheries statistics, which provide catch estimates across various regions of the United States for time periods ranging from monthly to annual. The data from these queries are used by state, regional, and federal fisheries scientists and managers to maintain healthy and sustainable fish stocks.
However, many government datasets do involve the collection of information about individuals, and many of these are valuable for policy analysis. For these record systems, agencies need to consider whether it is necessary to include sensitive data elements that may have an adverse effect on individuals if disclosed. Broadly understood, the aim is data minimization, a concept central to modern privacy law (Allen and Rotenberg, 2016). Educational records pose unique challenges because of the interest in longitudinal studies that may track individuals over their lifetimes, from the educational environment through employment. The more information that is available on a single individual, the more likely it is that a single data record can be used to re-identify the person, and details about an individual’s education and work history could harm the person if used inappropriately. As a consequence, strong methods need to be employed to ensure that statistical data cannot readily be used to re-identify any individual or to otherwise compromise privacy, for example, by revealing sensitive attributes (the two are not the same: sensitive attributes can be learned without re-identification). Minimizing the data collected to those elements with pre-specified, necessary uses reduces the possibilities for re-identification.
Restricting data includes removing explicit identifiers and applying a variety of statistical disclosure limitation methods to the dataset to reduce the risk of disclosure (see Federal Committee on Statistical Methodology, 2006). Before releasing microdata files, statistical agencies remove all obvious identifiers. This approach is not sufficient,
9 Definition of personally identifiable information as given on page 33: https://obamawhitehouse.archives.gov/sites/default/files/omb/assets/OMB/circulars/a130/a130revised.pdf [February 2017].
however, because some people or entities have characteristics or combinations of characteristics that are rare or unique and make them identifiable. Consequently, there are a variety of statistical disclosure limitation techniques that federal statistical agencies use to reduce the disclosure risk of microdata files. These methods include reducing the amount of information released, perturbing microdata by adding noise, and creating simulated microdata.
The amount of information released is often reduced by recoding variables into broader or fewer categories than were originally collected. For example, occupation might be recoded into 10 high-level categories rather than the detailed job titles originally recorded from respondents. Similarly, income is often top-coded (e.g., “more than $250,000”) or bottom-coded (e.g., “less than $30,000”) to avoid revealing very large or very small incomes.
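These recoding and coding steps can be sketched as follows (the thresholds and categories are illustrative only, not those of any particular agency):

```python
# Minimal sketch of two common statistical disclosure limitation steps:
# recoding detailed values into broad categories, and top-/bottom-coding
# extreme values. Thresholds and category names are invented.

def recode_occupation(detailed_title, mapping):
    """Collapse a detailed job title into a broad category."""
    return mapping.get(detailed_title, "Other")

def code_income(income, bottom=30_000, top=250_000):
    """Report extreme incomes only as coarse ranges."""
    if income < bottom:
        return f"less than ${bottom:,}"
    if income > top:
        return f"more than ${top:,}"
    return income

broad = {"pediatric cardiologist": "Healthcare practitioners",
         "high school teacher": "Education"}

print(recode_occupation("pediatric cardiologist", broad))  # Healthcare practitioners
print(code_income(1_200_000))                              # more than $250,000
print(code_income(85_000))                                 # 85000
```

Both steps discard detail irreversibly, which is precisely what protects the rare, identifying values at the tails of the distribution.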
A variety of means are also used to perturb microdata, including swapping, blanking and imputing, data blurring, and a combination of micro agglomeration, substitution, subsampling, and calibration (MASSC). Data swapping (or switching) is done by matching records that have a high risk of disclosure on a predetermined set of variables and then exchanging the values of the other variables between the matched records. This approach introduces uncertainty as to whether a given record actually reflects real values. The blank-and-impute method involves deleting some respondents’ values for some variables and replacing them with imputed values. Data blurring and microaggregation involve aggregating values across small sets of respondents for selected variables and then replacing the actual value of some other variables with the average. MASSC is a four-step procedure in which the dataset is partitioned into risk strata; within these strata, values of sensitive variables are swapped; records are then randomly subsampled within each stratum for retention in the dataset; and, in the final step, calibration weights are assigned to the retained records to preserve the total weighted counts from the original dataset (for more information, see Federal Committee on Statistical Methodology, 2006).
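A minimal, illustrative sketch of data swapping follows (toy records and variable names; production systems select swap partners far more carefully):

```python
import random

# Illustrative data swapping: for a record flagged as high disclosure
# risk, exchange its non-matching attributes with a partner record that
# agrees on a predetermined set of match variables. Records are invented.

random.seed(2)

records = [
    {"county": "A", "age_group": "30-39", "income": 52_000, "industry": "retail"},
    {"county": "A", "age_group": "30-39", "income": 940_000, "industry": "finance"},  # high risk
    {"county": "B", "age_group": "50-59", "income": 61_000, "industry": "education"},
]

MATCH_VARS = ("county", "age_group")   # must agree; kept fixed
SWAP_VARS = ("income", "industry")     # exchanged between partners

def swap(records, risky_index):
    """Swap SWAP_VARS between a risky record and a matching partner."""
    risky = records[risky_index]
    partners = [i for i, r in enumerate(records)
                if i != risky_index
                and all(r[v] == risky[v] for v in MATCH_VARS)]
    if not partners:
        return records             # no valid partner: leave unchanged
    j = random.choice(partners)
    for v in SWAP_VARS:
        records[risky_index][v], records[j][v] = records[j][v], records[risky_index][v]
    return records

swapped = swap(records, risky_index=1)
print(swapped[0]["income"], swapped[1]["income"])   # 940000 52000
```

After the swap, marginal totals within each county-by-age cell are preserved while an intruder can no longer be sure the unusual income belongs to the record it appears on.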
Another approach is to create a completely synthetic dataset based on the relationships among the variables observed in the confidential data. Such synthetic datasets use statistical models to create microdata records that are plausible predictions of individual records; in total, the synthetic dataset can reproduce many of the statistical conclusions available from the actual dataset. For example, the Survey of Income and Program Participation created a “synthetic beta” by applying multiple imputation techniques to the data after they were linked to earnings data from the Social Security Administration (Benedetto et al., 2013). We note, however, that privacy is not an automatic consequence of the data being synthetic (we touch on this further below).
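The basic idea can be sketched as follows; this is a deliberately simplified model-then-sample illustration with invented numbers, not the multiple-imputation method actually used for the SIPP synthetic beta:

```python
import random

# Minimal fully synthetic microdata sketch: fit a simple model to the
# confidential data, then release draws from the model rather than real
# records. Data and coefficients here are invented for illustration.

random.seed(3)

# "Confidential" data: (years of education, income) pairs.
confidential = [(12, 38_000), (14, 46_500), (16, 61_000),
                (16, 58_000), (18, 72_500), (12, 35_000)]

# Fit income = a + b * education by ordinary least squares, by hand.
n = len(confidential)
mx = sum(e for e, _ in confidential) / n
my = sum(y for _, y in confidential) / n
b = (sum((e - mx) * (y - my) for e, y in confidential)
     / sum((e - mx) ** 2 for e, _ in confidential))
a = my - b * mx
resid_sd = (sum((y - (a + b * e)) ** 2 for e, y in confidential) / n) ** 0.5

def synthesize(n_records):
    """Release model-based draws instead of real records."""
    educations = [e for e, _ in confidential]
    out = []
    for _ in range(n_records):
        e = random.choice(educations)
        y = a + b * e + random.gauss(0, resid_sd)
        out.append((e, round(y)))
    return out

synthetic = synthesize(6)
print(synthetic)
```

The released records preserve the fitted education-income relationship while containing no real respondent's values, though, as the text notes, synthesis alone does not guarantee privacy: an overfit model can leak the records it was trained on.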
These statistical disclosure limitation techniques come with a cost: they decrease the precision of the variables in the dataset and they introduce errors into the dataset, which can affect estimates of population parameters, as well as relationships among variables. As discussed below, this is unavoidable. Moreover, these techniques are not equipped with a single, unifying, measure of privacy loss that can be tracked across multiple applications to yield meaningful statements about cumulative privacy loss as the data are used and reused.
Restricting access uses administrative procedures and technology to limit who can access the dataset and what kinds of analyses can be performed with the data, thereby reducing the risk of disclosure. Federal statistical agencies have also used a number of different modes for researchers to access and analyze “restricted use” datasets. These methods include licensing agreements, remote access, and online data analysis systems (see Federal Committee on Statistical Methodology, 2006).
Online data analysis systems are currently available for some statistical agency datasets. For example, the National Center for Education Statistics (NCES) has three online data query systems: the Data Analysis System (DAS), the National Assessment of Educational Progress (NAEP) Data Explorer, and QuickStats. These systems are able to provide a user with tabulations and correlation matrices and to construct simple weighted least squares or logistic regression models. (QuickStats does not provide modeling tools.) All three of these systems run the queries on data that have already been statistically perturbed. Furthermore, the tabulation output is limited to categorical statistics, weighted counts, and percentages. A formal data use agreement is not necessary to access these systems, but researchers do need to agree that the data will be used only for statistical purposes.
Although statistical agency online analysis systems have grown in sophistication and flexibility in recent years, they do not allow researchers to use popular statistical software, nor do they provide the capability for sophisticated statistical modeling by users for policy analysis. The National Center for Health Statistics (NCHS) has for a number of years used a remote access analysis system10 whereby researchers send the system, by way of file transfer protocol, a file containing a program, which is scanned for nonallowable commands. The system then attempts to verify that the program is not trying to access unauthorized data files. Assuming there are no detected problems with the scanned files, the program is run on the real
data. After the program is executed, the output is computer-scanned for disclosure problems and, if none are detected, sent back to the researcher within minutes.11 If there are any issues, an NCHS staff member conducts a manual inspection. If the staff member approves of the output, it is then e-mailed back, usually within a few hours, depending on staff availability.
Another way to restrict access is through licensing agreements, which provide more flexibility than online data analysis systems. Licensing agreements allow researchers access to restricted data from their home institution through the use of strict security procedures and legally binding agreements. NCES has used licensing agreements for many years for many of its survey datasets, especially the longitudinal studies.12
To obtain an NCES license, a researcher must submit a written proposal that demonstrates the need for the data, as well as affidavits for any and all people who will be working with the data; the license document itself, which includes information concerning the laws that protect the data and states the penalties for violating the terms of the agreement; and a data security plan. The license must be signed by an official from the researcher’s institution who can legally bind the institution.13 Researchers must also agree to unannounced onsite inspections of the research facilities where the data are secured, to a review of statistical output before any public release, and to return and destroy the original data along with any derived files at the end of the license. Although some site inspections have uncovered lapses in procedures by individual investigators, there have not been any documented cases of unlawful disclosures of confidential data from the NCES data licensing program.
Federal Statistical Research Data Centers
Another approach for providing access to data for researchers is through federal statistical agency research data centers. Such centers have been used to provide more stringent controls on who has access to data and the conditions under which they have access. A number of statistical agencies have permitted researchers access to their sensitive data only at
11 For more information on current and planned capabilities, see https://fcsm.sites.usa.gov/files/2016/03/J3_Meyer_2015FCSM.pdf [December 2016].
12 The Bureau of Labor Statistics uses a similar licensing approach for the National Longitudinal Survey of Youth microdata, and the National Center for Science and Engineering Statistics uses a similar approach for licensing microdata for several of its surveys.
13 Researchers have to be associated with an academic or other research institution.
a data center located within their headquarters or a regional office.14 The Census Bureau pioneered research data centers in other locations around the country beginning in the 1980s.
In 2015, the Census Bureau’s Research Data Centers were rebranded as the Federal Statistical Research Data Centers (FSRDCs) to reflect the fact that a number of statistical agencies have at least some datasets available through FSRDCs and that there is growing interest in building, sharing, and governing this infrastructure across the statistical agencies. This infrastructure also opens up the possibility of external researchers linking and combining datasets from different statistical agencies, which, in many cases, current statutes do not permit the agencies themselves to do.
FSRDCs are Census Bureau facilities, housed in partner institutions that meet all physical and information security requirements for access to restricted-use microdata of the agencies whose data are accessed at the FSRDCs. There are currently 24 FSRDCs, and they partner with more than 50 research organizations, including universities, nonprofit research institutions, and government agencies.15
The FSRDCs provide computing capacity located behind the Census Bureau firewalls to handle large datasets, and researchers can also collaborate with other researchers across the country through that secure computing environment. All FSRDC researchers must obtain Census Bureau special sworn status, which requires a background check and a lifetime pledge to protect respondent confidentiality; there are significant financial and legal penalties under Title 13 and Title 26 for failing to do so.
Currently, four federal agencies (the Agency for Healthcare Research and Quality, the Census Bureau, the Bureau of Labor Statistics, and the National Center for Health Statistics) directly provide data through FSRDCs, and each agency has its own review and approval process. In addition to the agencies that directly provide their data, nine other agencies that sponsor surveys also participate in the FSRDC program by allowing surveys they cosponsor to be made available. In a further expansion of the role of FSRDCs, administrative data from other federal agencies are also being made more accessible to researchers through them.
Nongovernment Data Enclaves
NORC at the University of Chicago has created a data enclave that provides various data services, including archiving, curating, and indexing the
14 For example, the National Agricultural Statistics Service provides external researchers access to CIPSEA protected microdata for statistical purposes at its data lab in headquarters or at data labs in its 12 regional field offices.
data, as well as statistically protecting confidential information. Researchers are also provided access to analytic data tools to work with in the secured environment. Researchers have the ability to access data both onsite and remotely. Remote access also allows researchers to share and collaborate while working with each other on the data. NORC staff manage the datasets, including education and training of users in order to ensure that the data are appropriately used, disclosed, and kept confidential. Datasets can be provided by federal, state, or local government agencies as well as private firms, universities, foundations, and other institutions. The providing entity sets the parameters for access and use (including linkage) of their datasets, which are administered and implemented by NORC.
The National Agricultural Statistics Service and the Economic Research Service (ERS) jointly sponsor and conduct the Agricultural Resource Management Survey and provide access to these data through the NORC data enclave.16 NORC staff are designated as agents of the statistical agencies and must adhere to all CIPSEA requirements as agency employees. Researchers submit a research proposal and, if it is approved, they are able to access the data remotely from their worksite. The output is reviewed by an ERS employee for potential disclosures before the researcher is permitted to download it.
Some universities are creating their own data enclaves, which can also house federal statistical data. The Center for Urban Science and Progress at New York University is developing a data facility as a secure research setting with datasets, tools, and expert staff to provide research support services to students, faculty, and government employees. The data facility is designed to be user friendly: it includes user authentication and provides services, such as data curation, research project workspace, data access, and database creation. The primary goals of the data facility are to ensure that new and existing data are made available to and used by current and future members of the research community and that both staff in government agencies and local citizens can use the facility in addressing important research problems.
Data Access in Other National Statistical Systems
As noted in Chapter 3, many European countries make more use of administrative records for national statistics than does the United States. They have also created systems to allow access to administrative data for statistical and research purposes (Card et al., 2010).
16 The available data are from Phases II and III of the Agricultural Resource Management Survey, and the Tenure, Ownership and Transition of Agricultural Land. For procedures to access these data, see https://www.ers.usda.gov/data-products/arms-farm-financial-and-cropproduction-practices/contact-us/#RequestAccess [December 2016].
Statistics Denmark provides researchers de-identified data from combined administrative datasets for research projects through a secure server. Researchers and agencies can access administrative data from all government branches, beginning in 1970, on topics including population and demography, labor markets, earnings, income, consumption, prices, general economic statistics, agriculture, manufacturing industries, construction and housing, service sector, transport, environment and energy, external trade, and national accounts and balance of payments. Statistics Denmark manages, combines, and de-identifies information from multiple administrative databases for projects seeking data on the individual, family, household, workplace, and company levels.
Researchers request data through Danish universities and are accepted on the basis of scientific merit. Researchers have to be part of a Danish research environment; foreign researchers have to be affiliated with a Danish authorized environment. If approved, researchers access data remotely through a secure server (Statistics Denmark, 2014). Data can also be easily linked to other data sources, such as survey data or data from other government agencies. In addition, Statistics Denmark can carry out interview surveys customized to subject needs.
Another model for providing access to administrative data for statistical and research uses is the Administrative Data Research Network (ADRN) in the United Kingdom. The ADRN is made up of four data centers, one for each of the countries of the United Kingdom, and each data center is a secure location where researchers are able to request de-identified datasets for economic and social research. The ADRN functions by serving as a data broker that is able to acquire data for research purposes. The main partners that provide data to the ADRN are the UK Statistics Authority, the Economic and Social Research Council, and data custodians (government departments and agencies and national statistics authorities). To access the data for research purposes, the proposed project must be noncommercial and feasible, provide a public benefit, have scientific merit, and be ethically approved by the ADRN.
In addition to acquiring datasets, the ADRN is able to link and de-identify datasets for researchers. Two examples of linkage the ADRN has carried out are linking benefits and earnings data with health data to learn more about the impact of poverty on health and linking education data with crime data to understand how education affects criminality (Administrative Data Research Network, 2015). However, because there is no mandate requiring that data be given to the ADRN, it has at times had difficulty acquiring datasets.
CONCLUSION 5-4 Federal statistical agencies have a strong tradition of confidentiality and data stewardship. There are growing
threats to data repositories and personal privacy that need to be addressed to support this tradition.
CONCLUSION 5-5 A continuing challenge for federal statistical agencies is to produce data products that safeguard privacy. This challenge is increased by the use of multiple data sources.
Using Computer Science and Cryptography to Protect Privacy
So far we have approached the issue of privacy and the protection of the confidentiality of data generally from two directions: government efforts to define and protect privacy and confidentiality through legislation and statistical agencies’ attempts to balance the need to make data accessible while still respecting privacy and ensuring confidentiality. We now approach the issues from the domains of theoretical computer science and cryptography.
There is a distinction between privacy and security, set out by Turn and Ware (1976, p. 1):
Privacy is an issue that concerns the computer community in connection with maintaining personal information on individuals in computerized record-keeping systems. It deals with the rights of the individual regarding the collection of information in a record-keeping system about his person and activities, and the processing, dissemination, storage and use of this information in making determinations about him. . . . Computer security includes the procedural and technical measures required (a) to prevent unauthorized access, modification, use, and dissemination of data stored or processed in a computer system, (b) to prevent any deliberate denial of service, and (c) to protect the system in its entirety from physical harm. . . .
Turn and Ware note that privacy and security issues emerged separately in the 1960s until the “privacy cause célèbre,” which was the proposal for a National Data Center, intended to be a centralized databank of all personal information collected by federal agencies for statistical purposes. More recently, a report of the President’s Council of Advisors on Science and Technology (2014, p. 33) points out:
Poor cybersecurity is clearly a threat to privacy. Privacy can be breached by failure to enforce confidentiality of data, by failure of identity and authentication processes, or by more complex scenarios such as those compromising availability.
But privacy and security are not equivalent (p. 34):
When people provide assurance (at some level) that a computer system is secure, they are saying something about applications that are not yet invented: They are asserting that technological design features already in the machine today will prevent such application programs from violating pertinent security policies in that machine, even tomorrow. Assurances about privacy are much more precarious. Since not-yet-invented applications will have access to not-yet-imagined new sources of data, as well as to not-yet-discovered powerful algorithms, it [is] much harder to provide, today, technological safeguards against a new route to violation of privacy tomorrow. Security deals with tomorrow’s threats against today’s platforms. That is hard enough. But privacy deals with tomorrow’s threats against tomorrow’s platforms, since those “platforms” comprise not just hardware and software, but also new kinds of data and new algorithms.
We distinguish two scenarios relevant to the discussion of bringing together multiple data sources: sharing and cooperating. For simplicity, we use “dataset” to describe the data held by one organization, although of course in reality a given organization holds many datasets. The key point is that a party has full access to its own dataset. In the sharing scenario, two or more parties (e.g., statistical agencies) pool their data so that all parties have access to all of the data. In the cooperating scenario, the multiple parties agree to cooperate in a computation on the combination of their multiple datasets, but that is the extent of the collaboration. That is, entity A should learn no more about the datasets of entities B and C than can be learned from the result of the computation. The cooperating scenario is the subject of the field of secure multiparty computation, studied in the cryptographic literature since the late 1980s (see, e.g., Goldreich et al., 1987). Both the sharing and the cooperating scenarios could be used when combining data from different sources.
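The cooperating scenario can be illustrated with the simplest secure multiparty protocol, an additive secret-sharing sum. The sketch below is a toy of our own construction, not a production protocol: it assumes honest parties and secure channels. Each party splits its private value into random shares that sum to that value, distributes one share to each party, and only sums of shares are ever pooled, so the total is revealed while no individual value is.

```python
import random

Q = 2**61 - 1  # public modulus; all shares are values mod Q

def share(secret, n_parties, rng):
    """Split `secret` into n additive shares that sum to it mod Q.
    Any n-1 of the shares together look uniformly random."""
    shares = [rng.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def secure_sum(secrets, seed=0):
    """Toy 'cooperating' computation: each party shares out its private
    value, each party sums the shares it receives, and only those partial
    sums are combined, revealing the total but no single input."""
    rng = random.Random(seed)
    n = len(secrets)
    all_shares = [share(s, n, rng) for s in secrets]
    # Party j adds up the j-th share it received from every party...
    partials = [sum(all_shares[i][j] for i in range(n)) % Q
                for j in range(n)]
    # ...and only the partial sums are pooled to reveal the total.
    return sum(partials) % Q
```

In this toy, entity A sees only uniformly random shares of B's and C's data, so it learns nothing beyond the final sum, which is exactly the guarantee the cooperating scenario requires.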
Data could be encrypted using state-of-the-art technology both in transit and at its destination to provide protection against harm in the case of data breaches or inappropriate data access. This can be done using mature technology. Surprisingly, advances in cryptography have shown encryption not to be an insurmountable impediment to data utility. Fully homomorphic encryption schemes (Gentry, 2009) permit arbitrary computations on encrypted data, with no need to decrypt anything except the outputs. In functional encryption, a user operating on encrypted data is given a special “key” that will allow the user to learn only the result of a specific computation (Boneh et al., 2011; Sahai and Waters, 2005). We note, however, that secure multiparty computation, fully homomorphic encryption, and functional encryption are not yet mature technologies.
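A small taste of computing on encrypted data: textbook RSA, shown below with standard toy parameters far too small to be secure and used here only for illustration, is multiplicatively homomorphic, so multiplying two ciphertexts yields a ciphertext of the product. Fully homomorphic schemes of the kind Gentry introduced extend this idea to arbitrary computations by supporting both addition and multiplication on ciphertexts.

```python
# Textbook RSA with tiny demonstration parameters (NOT secure).
n, e, d = 3233, 17, 2753  # n = 61 * 53; e, d chosen so e*d = 1 mod phi(n)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

# Multiplicative homomorphism: Enc(a) * Enc(b) decrypts to a * b (mod n),
# so a party holding only ciphertexts can still compute the product.
c = (encrypt(4) * encrypt(5)) % n
assert decrypt(c) == 20
```

The party doing the multiplication never sees 4, 5, or 20; only the holder of the decryption key learns the result, which is the essential point of computing on encrypted data.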
Privacy-Protective Data Analysis
We now turn to privacy concerns that are independent of security and encryption, that is, problems that arise even when all the encryption and security technologies operate perfectly: threats to privacy that come from the desired outputs of statistical data analysis systems. Protection against these threats is the goal of privacy-protective data analysis.
Two important lessons have been learned from the past 15 years of research on confidential data. First, there are fundamental mathematical limits on “how much” can be computed while maintaining any reasonable notion of privacy: extremely detailed estimates of too many statistics can effectively result in a complete loss of privacy (Dinur and Nissim, 2003; Dwork et al., 2007, 2015a; Homer et al., 2008; Kasiviswanathan et al., 2013; Muthukrishnan and Nikolov, 2012). This body of work has come to be called the fundamental law of information recovery (for a review, see Dwork et al., 2017). This law holds even if all data security, encryption, access control, authorization protocols, password protection, network security, data enclave protocols, and programs are working perfectly.
Second, there are mathematical and algorithmic tools to formally quantify and control privacy loss; in some cases, these tools yield the best possible tradeoffs subject to the fundamental limit. There is hope that these tools, or new approaches yet to be invented, can match the fundamental limit in all cases. This is an extremely active area of research.
In other words, together, these findings delineate the tradeoff between the information that one gains through statistical analysis of a dataset and the loss of privacy that can result from those analyses. As statistical information is extracted from a dataset, there is increasing risk of disclosure of individuals in the dataset. This cumulative privacy loss can be conceptualized as a “privacy loss budget”: when a specified level of cumulative risk has been attained, the privacy loss budget would have been fully expended. Using a privacy loss budget means acknowledging that increased accuracy must come at the social cost of increased privacy loss. Conversely, to limit privacy loss to a budgeted total, controls or limits must be placed on analysis. This approach would raise a host of implications, such as prioritizing data usage. Who should be given the right and responsibility of setting a privacy loss budget for a given dataset? Who should be given the first choice of statistical analysis? To date, there is no developed social policy for these questions (Abowd and Schmutte, 2016); there is no technical panacea and no mathematical or computer science substitute for what are ultimately issues of judgment. We discuss these implications further below, and we will elaborate on them in our second report.
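The bookkeeping side of a privacy loss budget is simple to state, even though setting the budget itself is the policy question just raised. The sketch below is a hypothetical interface of our own devising: it tracks cumulative privacy loss under basic sequential composition, in which the losses of successive analyses simply add, and refuses further analyses once the budget is spent.

```python
class PrivacyBudget:
    """Toy privacy-loss-budget accountant (hypothetical interface).
    Tracks cumulative privacy loss (epsilon) under basic sequential
    composition and refuses analyses once the budget is exhausted."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def request(self, epsilon):
        """Charge the budget and return True if the analysis may run;
        return False once the requested loss would exceed the budget."""
        if self.spent + epsilon > self.total:
            return False  # budget exhausted: further analysis is refused
        self.spent += epsilon
        return True
```

Under such an accountant, the questions in the text become concrete: someone must choose `total_epsilon`, and someone must decide which analyses get to call `request` first.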
Success in privacy-preserving data analysis, however, does not obviate the need for strong encryption and other conventional cybersecurity
measures. All of these problems arise in the context of particular datasets. A data warehouse exacerbates security concerns by providing a more valuable target whose compromise is more devastating than the compromise of a single source. However, it may improve the situation with respect to privacy-preserving data analysis both for technical reasons (see Dwork et al., 2012) and because it allows for better coordination in decision making about the allocation of the data resource.
The most heavily studied approach to privacy-protective data analysis is differential privacy, which is a definition of privacy tailored to statistical analysis of large datasets, together with a set of algorithmic techniques for carrying out statistical analyses while adhering to the definition (Dwork, 2006; Dwork and Roth, 2014; Dwork et al., 2006, 2015b). Differential privacy is a promise, with specified levels of assurance, that an individual described by a data record in a dataset will not be affected, adversely or otherwise, by allowing that person’s data to be used in any study or analysis, no matter what other computational techniques, studies, datasets, or information sources are or become available. At their best, differentially private algorithms can make confidential data widely available for useful data analysis, without resorting to data clean rooms, data usage agreements, data protection plans, or restricted use enclaves. Differential privacy also permits the measurement and control of privacy loss that accumulates over multiple analyses.
Differential privacy can also be characterized as the requirement that the probability of any observed output be essentially unchanged whether any given individual opts into or out of a dataset. The probabilities are taken over random choices made by the data analysis algorithm; “essentially” is quantified in the precise privacy loss guarantee. This simple requirement has many powerful consequences. First, it provides a formal measure of privacy loss. This measure allows one to track privacy loss as it accumulates over multiple computations. It also allows the construction of complex algorithms from simple differentially private building blocks (much as a complex program is the combination of simpler subroutines) while tracking and controlling the privacy loss measure. Finally, any output of a differentially private analysis is “future-proof,” meaning that it is robust to all algorithmic attacks and information resources that do not yet exist.
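The canonical differentially private building block is the Laplace mechanism. The sketch below is our own illustration: it answers a counting query, whose sensitivity is 1 because adding or removing one person changes the count by at most 1, by adding Laplace noise with scale 1/epsilon. Releasing the noisy count then consumes epsilon of the privacy loss budget.

```python
import math
import random

def dp_count(data, predicate, epsilon, rng):
    """Epsilon-differentially private count via the Laplace mechanism.
    A counting query has sensitivity 1, so Laplace(0, 1/epsilon) noise
    suffices; smaller epsilon means stronger privacy, noisier answers."""
    true_count = sum(1 for x in data if predicate(x))
    # Sample Laplace(0, b) as the difference of two Exp(1/b) draws;
    # using 1 - rng.random() keeps the log argument in (0, 1].
    b = 1.0 / epsilon
    noise = b * (math.log(1 - rng.random()) - math.log(1 - rng.random()))
    return true_count + noise
```

Note that the noise distribution depends only on epsilon and the query's sensitivity, never on the data, which is why the accuracy guarantee can itself be published without any privacy loss.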
Of course, differential privacy cannot be a panacea: the fundamental law can no more be circumvented than can the laws of physics. Moreover, like the other statistical disclosure limitation techniques noted above, differential privacy introduces error; the fundamental law shows that some error is unavoidable. Sometimes differential privacy introduces no more error than the fundamental law shows to be necessary. In other cases, there is a gap between the error achievable with currently known differentially private techniques and the known minimal amounts of noise. It is possible that the gaps will be closed by
further research or that another, not-yet-invented, technology can help in these cases.
Using such formal privacy guarantees requires a new skill set, foreign to most statistical agencies, social science researchers, and data scientists. Synthetic data generated in a differentially private fashion can address these concerns (in general, privacy is not an automatic consequence of the data being synthetic, but a consequence of the method by which the synthetic data are produced). Differentially private synthetic data may be queried in an ad lib manner, with no risk of further privacy loss beyond that incurred in generating the synthetic data. The U.S. Census Bureau uses synthetic data generated with a variant of differential privacy in the agency’s OnTheMap tool, which provides aggregate information about where people work and where workers live (see Machanavajjhala et al., 2008). A drawback is that the task of generating synthetic data with rigorous privacy-protective guarantees can require excessive computational resources; moreover, as is always the case with synthetic data, the synthetic dataset has known properties only for the estimates of the specific statistics it has been designed to capture. If an analyst wants to ask a different question, there is no assurance that the estimates would have the needed properties.
A similar problem occurs when new data are incorporated. For example, consider the commonly used randomized response design with a 50-50 chance of answering “yes” to the not-sensitive question and a 50-50 randomization between the sensitive and not-sensitive questions. A “yes” answer produces a 3:1 Bayes factor for the unobserved true state being “yes” (rather than “no”). However, if the same person is engaged a second time (with another randomization) and again answers “yes,” the Bayes factor is now 9:1. A more likely scenario is that the respondent is asked about a different behavior or attitude, using a randomized response approach. A second “yes” produces a Bayes factor of between 3:1 and 9:1 for at least one underlying “yes” rather than “no”-“no,” with the value depending on the underlying association of the two behaviors or attitudes. Bounding the association gives a bound on the Bayes factor. One needs to keep in mind that this threat to privacy operates whenever the amount of linked data is broadened beyond that used to develop the privacy budget.
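The Bayes factors in this example follow directly from the likelihoods of a “yes” answer under each true state. A short verification of the arithmetic for the 50-50 design described above (the function and its parameter names are our own):

```python
def rr_likelihoods(p_sensitive_q=0.5, p_yes_innocuous=0.5):
    """P(answer 'yes' | true state) for one randomized-response round:
    with probability p_sensitive_q the sensitive question is asked and
    answered truthfully; otherwise the innocuous question is asked and
    'yes' is given with probability p_yes_innocuous."""
    p_yes_if_true_yes = p_sensitive_q * 1.0 + (1 - p_sensitive_q) * p_yes_innocuous
    p_yes_if_true_no = p_sensitive_q * 0.0 + (1 - p_sensitive_q) * p_yes_innocuous
    return p_yes_if_true_yes, p_yes_if_true_no

py, pn = rr_likelihoods()          # 0.75 and 0.25 for the 50-50 design
bayes_factor_one = py / pn         # 3.0  -> the 3:1 factor in the text
bayes_factor_two = (py / pn) ** 2  # 9.0  -> 9:1 after a second 'yes'
```

The squaring in the last line assumes the two randomizations are independent, which is exactly why repeated or linked responses erode the protection a single round provides.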
The fundamental law of information recovery makes clear that meaningful privacy guarantees come at a price. Differentially private algorithms are equipped with explicit tradeoffs between privacy and utility. The statistical nature of the utility loss is a property of the algorithm, not the dataset, and as such can be made public with no loss of data privacy. This characteristic can guide a data analyst in interpreting the outputs, much as the margin of error in an opinion poll informs the public of how to understand the reported results.
Some of the newer, more formal, statistical disclosure limitation techniques have been shown to compare well with the more traditional methods described above. Haney and colleagues (2017) state:
We design private algorithms and show that they have utility comparable to the existing ad-hoc protection system for an establishment-based data product published by the U.S. Census Bureau.17
Anonymization techniques face similar challenges. For example, a study on privacy of data from massive open online courses from MITx and HarvardX on the edX platform reported that standard anonymization methods force changes to datasets that threaten replication and extension of baseline analyses (Daries et al., 2014).
Like secure multiparty computation, homomorphic encryption, and functional encryption, differential privacy is not a fully mature technology. Moreover, most statistical agencies do not currently have staff with skills in these techniques. Nonetheless, these technologies are gaining ground: for example, Apple has introduced differential privacy into iOS 10. The Census Bureau is setting up teams to begin to use differentially private methods in its programs. Indeed, Abowd (2016a, pp. 27-28) recently articulated the predicament currently faced by statistical agencies:
Almost all current disclosure limitation methods used by statistical agencies around the world are based on ad hoc criteria for measuring their effectiveness. They fail the criterion of equal protection under the law because their effectiveness is measured in terms of an agency’s best efforts to insure that the ensemble of publications does not violate the confidentiality of any respondents. Those best efforts, while diligently and competently delivered, were predicated on the assumption that most of the information that could be used to compromise the disclosure limitation procedure was inside the agency’s firewall. Such an assumption is simply no longer tenable. It must be replaced by assumptions that allow the agency to release the statistical summaries without fear of future attacks. Formally private disclosure limitation procedures meet this condition. And they are really the only player left standing.
CONCLUSION 5-6 As federal statistical agencies move forward with linking multiple datasets, they must simultaneously address quantifying and controlling the risk of privacy loss.
CONCLUSION 5-7 Privacy-enhancing techniques and privacy-preserving statistical data analysis can potentially enable the use of private-sector data sources for federal statistics.
17 See http://tpdp16.cse.buffalo.edu/abstracts/TPDP_2016_3.pdf [December 2016].
RECOMMENDATION 5-1 Statistical agencies should engage in collaborative research with academia and industry to continuously develop new techniques to address potential breaches of the confidentiality of their data.
RECOMMENDATION 5-2 Federal statistical agencies should adopt modern database, cryptography, privacy-preserving, and privacy-enhancing technologies.
As noted above, the fundamental law of information recovery has ramifications for statistical agencies’ disclosure limitation activities. Statistical agencies are accustomed to protecting data from individual inappropriate uses and reviewing each statistical product for disclosure risks; they are not accustomed to limiting statistical analysis or prioritizing analyses based on considerations of cumulative privacy loss or a privacy loss budget (Abowd, 2016b; Abowd and Schmutte, 2016). For example, how would one decide the privacy loss budget for the Census of Population and Housing, and how much of that budget should be assigned to analyses for legally required redistricting activities, production of statistical summary information, and microdata analyses for general social science investigations? These kinds of questions are not the domain of the statistical experts inside the agencies, nor of those who create the privacy-preserving analysis systems. These policy issues will need to be confronted by the leaders of the agencies, the data users, and stakeholders, including respondents and privacy advocates. We will explore these issues further in our second report, but we note here that answers to this wide set of issues are beyond the scope of this panel.