Risks of Access: Potential Confidentiality Breaches and Their Consequences
Chapter 3 has argued that to fulfill their function in a democratic society, statistical and research agencies must provide access to the data they collect. Yet, at the same time, they are charged with protecting the data’s confidentiality. That charge rests on three underlying considerations: ethical, legal, and pragmatic. The ethical obligation, rooted in the Belmont Report (National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, 1979), requires agencies to strive for a favorable balance of risks and harms for survey respondents. Legally, they are bound by federal laws to honor the promises of confidentiality they make, with potential civil and criminal penalties if they fail to do so. On a pragmatic level, their ability to collect high-quality data from respondents will be compromised by real or perceived breaches of confidentiality. This chapter elaborates on all three of these considerations.
A pledge of confidentiality stipulates that publicly available data—whether summary data or microdata and including any data added from administrative records or other surveys—will be anonymized or otherwise masked to ensure that they cannot be used to identify a specific person, household, or organization, either directly or indirectly by statistical inference. Such a pledge also means that more readily identifiable data will be made available for research purposes only through restricted access modalities that impose legal obligations and penalties to minimize the risk that researchers with access to such data might disclose them to others. An example of such more readily identifiable data is a set of household survey records that, although stripped of names and addresses, contains codes for small geographic areas.
The reason for confidentiality pledges and for stringent procedures to prevent disclosure is that they improve the quality of data collected from individuals, households, and firms. It is essential that respondents believe they can provide accurate, complete information without any fear that the information will be disclosed inappropriately. Indeed, if the information was disclosed, harm might come to an individual respondent. Many government-sponsored surveys ask about sensitive topics (e.g., income or alcoholic beverage consumption), as well as about stigmatizing and even illegal behavior. The disclosure of such information might subject a respondent to loss of reputation, employment, or civil or criminal penalties. Furthermore, the breach of a confidentiality pledge would violate the principle of respect for those consenting to participate in research, even if the disclosure involved innocuous information that would not result in any social, economic, legal, or other harm (see National Research Council, 2003b:Ch.5).
The occurrence of a breach also threatens the research enterprise itself, because concerns about privacy and confidentiality are among the reasons often given by potential respondents for refusing to participate in surveys, and those concerns have been shown to affect behavior as well. Any confidentiality breach that became known would be likely to heighten such concerns and, correspondingly, reduce survey response rates. Efforts to increase researchers’ access to data must, therefore, take into account the need to avoid increasing the actual and perceived risks of confidentiality breaches.
This chapter begins by reviewing research linking survey nonresponse to concerns about confidentiality. The rest of the chapter discusses some of the ways in which confidentiality breaches might occur, with special attention to how increasing access might increase both the actual and perceived risks of confidentiality breaches. Although much of this report focuses on statistical disclosure—re-identification of respondents or their attributes by matching survey data stripped of direct identifiers with information available outside the survey—these sections serve as a reminder that statistical disclosure is by no means the only, and perhaps not even the most important, way in which confidentiality breaches might occur. They also serve as a reminder that public perceptions that personal data are being misused may be as potent a deterrent to participation by potential survey respondents as an actual breach of confidentiality.
CONFIDENTIALITY CONCERNS AND NONRESPONSE IN CENSUSES AND SURVEYS
The first experimental demonstration that confidentiality concerns increase refusal to participate in a government survey comes from a National Research Council study sponsored by the U.S. Census Bureau in the late 1970s (National Research Council, 1979), but most of the evidence comes from a series of surveys commissioned by the Census Bureau in the 1990s. In the 1990 census, for example, people who were concerned about confidentiality and saw the census as an invasion of privacy were significantly less likely to return their census form by mail than those who had fewer privacy and confidentiality concerns (Singer, Mathiowetz, and Couper, 1993; Couper, Singer, and Kulka, 1998). Although such attitudes explained a relatively small proportion of the variance in census returns (1.3 percent), this proportion represented a significant number of people who had to be followed up in person to obtain information required for the census.
Analysis of the mail returns of a sample of respondents in the 2000 census yielded similar results. Once again, respondents with greater privacy and confidentiality concerns were less likely to return their census forms by mail. The variance in census returns explained by attitudes toward privacy and confidentiality was very similar to that obtained in 1990 (Singer, Van Hoewyk, and Neugebauer, 2003). In 2000, respondents with greater privacy and confidentiality concerns were also significantly less likely to provide an address to Gallup survey interviewers for the purpose of matching their survey responses to the file of census returns, and they were much less likely to respond to a question about their income.
Another way of looking at the effect of confidentiality concerns is to look at the relationship between beliefs that the census may be misused for law enforcement purposes and the propensity to mail back the census form. Of the 478 respondents in the Gallup survey following the 2000 census who believed that census data are used for none of three purposes (identifying illegal aliens, keeping track of troublemakers, and using census answers against respondents), 86 percent returned their census form by mail. The percentage dropped to 81 percent among those who selected exactly one of the three items (N = 303), to 76 percent among those who selected exactly two items (N = 255), and to 74 percent among the 171 respondents who selected all three items (Singer, Van Hoewyk, and Neugebauer, 2003). In 1990, census return rates declined from 78 percent to 55 percent on a similar index of confidentiality concerns (Singer, Mathiowetz, and Couper, 1993). Given the cost of obtaining census information that is not sent by mail, this reduction in the likelihood of returning the census form has significant consequences. Other research on the 2000 census is in accord with these findings: one study (Hillygus et al., 2006) concludes that the census return rate in 2000 would have been approximately 5 percent higher if there had not been public anxieties over privacy and what was characterized in the media and by some political leaders as unwarranted “intrusiveness.”
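The 2000 figures above can be combined in a short arithmetic sketch. The group sizes and return rates are those reported in the text; the variable names are illustrative, and because the published percentages are rounded, the pooled rate is only approximate.

```python
# Each tuple: (number of perceived misuses selected, respondents,
# share of that group returning the census form by mail).
groups_2000 = [
    (0, 478, 0.86),
    (1, 303, 0.81),
    (2, 255, 0.76),
    (3, 171, 0.74),
]

total_respondents = sum(n for _, n, _ in groups_2000)

# Weighted average of the group rates, weighting each group by its size.
pooled_rate = sum(n * rate for _, n, rate in groups_2000) / total_respondents

# Gap in percentage points between the least and most concerned groups.
gap_2000 = round((groups_2000[0][2] - groups_2000[-1][2]) * 100)
gap_1990 = 78 - 55  # corresponding 1990 return rates quoted in the text

print(f"respondents: {total_respondents}")
print(f"pooled mail return rate: {pooled_rate:.1%}")
print(f"gap, 2000: {gap_2000} points; gap, 1990: {gap_1990} points")
```

Under these inputs the pooled rate comes out at roughly 81 percent, and the gap between the least and most concerned groups is 12 points in 2000 against 23 points in 1990.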
There is also indirect evidence that requests for information on the census form that respondents consider sensitive lead to higher nonresponse rates for both the sensitive item and the entire questionnaire. For example, a 1992 experiment involving the Census Bureau’s request for Social Security numbers led to a decrease of 3.4 percent in the return of the census form and an increase of 17 percentage points in the number of questionnaires returned with missing data (Dillman, Sinclair, and Clark, 1993). An experiment involving a request for Social Security numbers conducted during the 2000 census led to an almost identical result (Guarino, Hill, and Woltman, 2001:17).
Of particular interest in this context is the finding that concerns about confidentiality and negative attitudes toward data sharing increased substantially between 1995 and 2000 (Singer et al., 2001:Tables 2.16-17, 2.21-29). People’s stated willingness to provide their Social Security numbers also declined, from 68 percent in 1996 to 55 percent in 1999 (Singer et al., 2001:Table 2.45). Several studies (summarized in Bates, 2005) have also documented that it has become increasingly difficult for the Census Bureau to obtain Social Security numbers. In the Survey of Income and Program Participation, there was an increase in refusals to provide them from 12 percent in the 1995 panel to 25 percent in the 2001 panel; in the Current Population Survey, there was an increase in refusals from approximately 10 percent in 1994 to almost 23 percent in 2003.
Evidence about the effects of concerns about privacy and confidentiality on response to nongovernmental surveys is provided by a series of small-scale experiments carried out in the context of the Survey of Consumer Attitudes (SCA). The SCA is a national telephone survey fielded every month at the University of Michigan, primarily to measure economic expectations and attitudes.
The first experiment, conducted in 2001, was designed to investigate what risks and benefits respondents perceived in two specific surveys—the National Survey of Family Growth (NSFG) and the Health and Retirement Study (HRS)—and how these perceptions affected their willingness to participate in the research. After hearing the description of each study, respondents were first asked whether or not they would be willing to take part in the survey, and if not, why not; they were then asked whether or not they thought each of several groups (family, businesses, employers, and law enforcement agencies) could gain access to their answers and how much they would mind if they did. Both the perceived risk of disclosure (how likely various groups were seen as gaining access to respondents’ answers along with their names and addresses) and the perceived harm of disclosure (how much respondents would mind such disclosure) significantly predicted people’s willingness to participate in the survey described. Perceived benefits, as well as the ratio of risk to benefit, were also highly significant.
In January and April 2003, two virtually identical experiments were carried out, again on the SCA (Singer, 2004). The introductions to both surveys mentioned the possibility of record linkage—medical records in the case of NSFG and government (financial) records in the case of HRS. Respondents who indicated that they would not be willing to take part in the survey described (48 percent of the sample) were asked why they would not do so. The most frequent reasons given—59 percent of all first-mentioned reasons—were that the surveys were too personal or intrusive or that they objected to giving out financial or medical information or providing access to medical or financial records. As in the previous experiment, perceptions of disclosure risk, disclosure harm, individual and social benefit, and the ratio of risk to benefit were strong and significant predictors of people’s willingness to participate. Similarly, an experiment in connection with the 2000 census found that respondents primed to consider privacy issues had higher rates of item nonresponse to census long-form questions than a control group (Hillygus et al., 2006).
These experiments point to the importance of perceptions of disclosure risk, as well as of actual risks. Public awareness of confidentiality breaches in nongovernment surveys may adversely affect perceptions of the risks arising from participation in government surveys. That is, public knowledge of a breach of confidentiality by an employee of a government benefit agency or private insurance company may increase concern about such breaches by federal statistical agencies, such as the Census Bureau. Similarly, public knowledge of legal demands for identified records, such as subpoenas for data about individuals by law enforcement agencies or attorneys for plaintiffs or defendants, may increase such concerns. Similar concerns and effects may result from identity theft, through unauthorized access to an individual’s credit card account and Social Security numbers; from misuse of medical records by entities (e.g., insurance companies) that are entitled access to them for administrative purposes; or from misuse of administrative records or survey records by employees of a data collection agency. And, as noted above, such concerns about confidentiality adversely affect the likelihood of participation in government surveys.
WHY CONFIDENTIALITY BREACHES MIGHT OCCUR
Carelessness and Illegal Intrusions
Survey researchers have identified various ways in which the confidentiality of individual respondents might be breached. Perhaps the most obvious and common threat to confidentiality protection of research data arises from simple carelessness—not removing identifiers from questionnaires or electronic data files, leaving cabinets unlocked, not encrypting files containing identifiers, talking about specific respondents with others not authorized to have this information. Although there is no evidence of respondents having been harmed as a result of such negligence, it is important for government data collection agencies and private survey organizations to be alert to these issues, provide employee guidelines for appropriate data management, and ensure that the guidelines are observed.
Confidentiality may also be breached as a result of illegal intrusions into the data. For example, in 1996, ten Social Security employees (bribed by outsiders) were found to have stolen confidential information from agency computers. The key piece of information was mothers’ maiden names, which were stored in a database with password protection but less stringent security than that protecting earnings statements and other private information. The information was used to activate credit cards of residents in the New York area. Identity theft has been increasingly in the news since then.
As detailed data collected under a pledge of confidentiality are increasingly made available to researchers through licensing agreements or in research data centers, the potential for inadvertent disclosure as a result of carelessness and through deliberate illegal intrusions may also increase unless strong educational and oversight efforts accompany such means of access. In Chapter 5 we offer several recommendations designed to strengthen protections against these sources of disclosure of information about individuals.
However, the extent of the problem is not easily determinable, either by assessing past experience or predicting future effects. Numerous media stories have documented harms of identity theft from such sources as credit card and banking data. In contrast, there is no documented evidence of harms from misuse of research data or carelessness by researchers or others. Overall, very little is known about how many breaches of confidentiality may actually occur in such settings or how many people are harmed as a result. Under most circumstances, attempted breaches are difficult to detect, and relying on self-reports is problematic. A July 1993 survey by Harris, for example, reported that between 3 percent and 15 percent of the public, depending on the person or organization asked about, believed that medical information about them had ever been improperly disclosed, and about one-third of these said they had been harmed by the disclosure (Singer, Shapiro, and Jacobs, 1997). But the accuracy of these reports is unknown. Moreover, disclosure of medical information to an insurance company may be permitted by law but regarded by survey respondents as improper. For many people, questions about breaches of confidentiality may be highly abstract so that their ideas about the uses that might be made of their medical information are limited. As a result, little is really known about what people have in mind when they answer such questions, and even less about the actual state of affairs. Again, in Chapter 5 we offer some recommendations to address this concern.
Law Enforcement and National Security
Potentially more serious threats to confidentiality than simple carelessness are legal demands for identified data, which may come in the form of a subpoena or as a result of a Freedom of Information Act (FOIA) request. Requests may also come from a law enforcement or national security agency to a statistical or other government agency; the legal status of such requests is not fully resolved, as discussed below. Individual records from surveys that collect data about such illegal behaviors as drug use are potentially subject to subpoena by law enforcement agencies. To protect against this possibility, researchers and programs studying mental health, alcohol and drug use, and other sensitive topics, whether federally funded or not, may apply for certificates of confidentiality from the U.S. Department of Health and Human Services. The National Institute of Justice (in the U.S. Department of Justice) also makes confidentiality certificates available for criminal justice research supported by agencies of the U.S. Department of Justice. Such certificates, which remain in effect for the duration of a study, protect researchers in most circumstances from being compelled to disclose names or other identifying characteristics of survey respondents in federal, state, or local proceedings (42 Code of Federal Regulations Section 2a.7, “Effect of Confidentiality Certificate”). The confidentiality protection afforded by certificates is prospective; researchers may not obtain protection for study results after data collection has been completed.
Protection for identifiable statistical data collected by federal agencies or their agents under a promise of confidentiality is also provided by the Confidential Information Protection and Statistical Efficiency Act (CIPSEA), which was enacted as Title V of the E-Government Act of 2002 (P.L. 107-347). The legislation is intended to “safeguard the confidentiality of individually identifiable information acquired under a pledge of confidentiality for statistical purposes by controlling access to, and uses made of, such information.” The statute includes a number of safeguards to ensure that information acquired for statistical purposes under a pledge of confidentiality “shall be used by officers, employees, or agents of the agency exclusively for statistical purposes,” and “shall not be disclosed by an agency in identifiable form, for any use other than an exclusively statistical purpose, except with the informed consent of the respondent.” Identifiable information can be disclosed, under proper conditions, for “statistical activities,” which are broadly defined to include “the collection, compilation, processing, or analysis of data for the purpose of describing or making estimates concerning the whole, or relevant groups or components within, the economy, society, or the natural environment” as well as “the development of methods or resources that support those activities, such as measurement methods, models, statistical classifications, or sampling frames.”
CIPSEA also imposed additional responsibilities on statistical agencies, requiring them to “clearly distinguish data or information [they collect] for nonstatistical purposes,” and to “provide notice to the public, before the information is collected, that the data could be used for nonstatistical purposes.” Nonstatistical purposes are defined as “any administrative, regulatory, law enforcement, adjudicatory, or other purpose that affects the rights, privileges, or benefits of a particular identifiable respondent” and include disclosure under the Freedom of Information Act. The act also provides criminal penalties for a knowing and willful breach of confidentiality by employees of the sponsoring agency and any of its “agents,” who may be data collectors or outside analysts.
CIPSEA offers great promise for increasing researcher access to confidential data. Fulfillment of that promise requires, in the first place, coordination of access and protection procedures across the various agencies in order to satisfy the uniform protection promised by the act. At the time of this report, the Office of Management and Budget is preparing regulations to implement the safeguards under CIPSEA. These implementing regulations will be critically important in translating a statutory right into clear rules that protect research participants across all federal agencies. The regulations are expected to define both the reach of protection for confidential statistical records and the opportunity for research access.
The regulations will have to cover a wide range of questions, such as:
Other than federal agency personnel, who can qualify as an “agent” under the statute and thereby be eligible for research access to identifiable records?
Does a licensing agreement between an agency and a private researcher for research access fall within the coverage of the statute?
What degree of risk of inadvertent disclosure of identifiable information will govern the release of anonymized records?
What form of public notice is required when a statistical agency collects identifiable information for nonstatistical purposes?
Which, if any, of the CIPSEA protections extend to identifiable administrative records that are used for research purposes?
How does CIPSEA affect existing regulations and practices under other agency statutes that protect research records?
What procedural safeguards are required to monitor the work of agency staffs and nonagency personnel who are deemed “agents” under CIPSEA?
Fulfillment of the potential for research access to data sharing under CIPSEA will ultimately also require companion legislation that would permit the Census Bureau to share tax information that it receives from the Internal Revenue Service (IRS) with the Bureau of Labor Statistics and the Bureau of Economic Analysis in order to reconcile the business lists built by the three agencies. In the absence of such legislation, data sharing for research among the three agencies is restricted to information that does not include, or derive from, tax data. Another topic that may need future legislative attention is the sharing of individual data, since the data-sharing provisions of CIPSEA currently apply only to business data.
The seeming clarity of the protections afforded by CIPSEA is clouded by concerns about potential conflict with access to identifiable data for national security purposes. In the past, government agencies have attempted to use confidential data collected by a statistical agency for law enforcement purposes, especially in times of heightened national security concerns. Seltzer and Anderson (2003) review attempts by various government agencies to obtain confidential census data between 1902, when the Census Bureau was established as a permanent agency, and 1965. A few of these attempts in the years before enactment of Title 13 in 1929—especially those involving national security—were successful and, in at least some of them, actual disclosure of information about individuals for national security or law enforcement purposes occurred. In 1917, for example, personal information from the 1910 census was released to courts, draft boards, and the Justice Department for several hundred young men suspected of not complying with the draft (Barabba, 1975:27, cited in Seltzer and Anderson, 2003). During World War II, according to Prewitt (2000:1): “The historical record is clear that senior Census Bureau staff proactively cooperated with the internment [of Japanese Americans], and that census tabulations were directly implicated in the denial of civil rights to citizens of the United States who happened also to be of Japanese ancestry.”1 In 2004 the Census Bureau provided information about the residences of Arab Americans to the Customs and Border Protection agency of the U.S. Department of Homeland Security, but that information was also available on a public-use site and involved data masked to protect confidentiality. Although this incident was not a violation of law, it was perceived as such by many people, as well as a violation of trust (see Clemetson, 2004).
The 2001 USA Patriot Act, which is being considered for renewal by Congress as this report is being written, includes provisions for access by the U.S. Attorney General to identifiable research records of the National Center for Education Statistics (in the U.S. Department of Education). This provision appears to be unique: the panel is not aware of any other provisions for access to confidential research data for national security purposes. Both the Homeland Security Act of 2002 (P.L. 107-296) and the Intelligence Authorization Act for Fiscal Year 2003 (P.L. 107-306) make clear that exchange of federal agency information for homeland security needs does not include exchange of individually identifiable information collected solely for statistical purposes. Nevertheless, as Seltzer and Anderson have shown, national security crises have in the past led to circumventions or actual violations of confidentiality guarantees.2
1. In that same speech, former Census Bureau Director Kenneth Prewitt apologized on behalf of the agency for its activities in connection with the internment of Japanese Americans. For a detailed history of Census Bureau cooperation with national security activities during World War II, see Seltzer and Anderson (2000).
2. Although it is not directly relevant to national security, the Shelby Amendment (part of P.L. 105-277) and the Data Quality Act (see Chapter 2) also have implications for confidentiality protection that have not yet been fully determined.
Breaches of confidentiality due to carelessness, as well as those from illegal intrusions, are obviously more likely to occur if a data file contains direct identifiers—name, address, or Social Security number, for example. Yet there is increasing awareness that even without such identifiers, statistical disclosure may be possible. “Statistical disclosure” refers to the re-identification of respondents to a survey (or their attributes) even though direct identifiers such as names and addresses have been removed from the data file. Statistical disclosure involves using data available outside the survey to breach the protection thought to have been afforded a survey data set by various data deletion and masking techniques. Re-identification of respondents may be increasingly possible because of high-speed computers, external data files containing names and addresses or other direct identifiers as well as information about a variety of individual characteristics, and sophisticated software for matching survey and other files. In Chapter 2 we noted some of the factors that may increase statistical disclosure risk and harm for respondents in government-sponsored surveys, including factors that are integral to the survey design and factors that are external to data collection agencies and researchers. In addition, there is a growing concern by data collection agencies (see below) that wider dissemination of research data may itself increase disclosure risk.
For a breach of confidentiality due to statistical disclosure to occur, there must be the technical or legal means, as well as the motivation to use them. With regard to motive, there are (at least) four: curiosity, sport (e.g., hackers), profit (e.g., identity theft), and law enforcement or national security.3
Breaches occurring because of curiosity or sport may never become known to the respondent. However, the confidentiality pledge has been violated, and ethical harm has been done, even if all that has happened is that someone has identified a record in a data file and not used it for any purpose.
The further harm a breach of confidentiality may cause depends in part on the type of intruder and the type of data. Federal regulations for the protection of human subjects of research (in the Common Rule, 45 Code of Federal Regulations 46) focus mainly on the potential harm to an individual’s reputation, livelihood, or liberty resulting from the disclosure of confidential information, suggesting that disclosure of deviant or illegal behavior or unpopular beliefs is most likely to be harmful. However, if an intruder’s aim is identity (or property) theft, then anything that permits the appropriation and abuse of another’s identity may be harmful to that individual. If the intruder is a hacker simply out to embarrass the survey organization, then public identification of one or more survey participants may be enough to do harm to the data collection and research enterprise, even if the information is not sensitive and the participants are not directly harmed.
A survey design factor that, prima facie, would seem to increase the risk of statistical disclosure is the increasing number and diversity of attributes asked about and stored on the data record for each respondent. The greater the number of attributes about which information is provided, the greater is the theoretical potential for re-identification. Since the late 1960s, surveys have become more detailed on several dimensions. Thus, more and more surveys are collecting detailed socioeconomic attributes for individuals and households; more and more surveys are asking about individual behaviors, including those that are risky and even illegal; and more and more surveys are longitudinal in design, collecting repeated measurements on the same individuals. In addition, a growing number of both cross-sectional and longitudinal surveys collect data about an individual from multiple sources: for example, surveys of children in which data are obtained from parents, schoolteachers, and others, and surveys that collect information about individuals, the schools they attend, and the neighborhoods in which they live.
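The link between the number of attributes and the theoretical potential for re-identification can be illustrated with a minimal sketch. The six records and the attribute names below are invented for illustration and come from no actual survey; the helper `unique_fraction` is likewise hypothetical. It counts the share of records whose combination of released attributes is shared with no other record in the file.

```python
from collections import Counter

# Synthetic records, invented for illustration only.
records = [
    {"age_band": "60-64", "sex": "F", "county": "A", "occupation": "teacher"},
    {"age_band": "60-64", "sex": "F", "county": "A", "occupation": "nurse"},
    {"age_band": "60-64", "sex": "M", "county": "A", "occupation": "teacher"},
    {"age_band": "60-64", "sex": "M", "county": "B", "occupation": "farmer"},
    {"age_band": "55-59", "sex": "F", "county": "B", "occupation": "nurse"},
    {"age_band": "55-59", "sex": "F", "county": "B", "occupation": "nurse"},
]


def unique_fraction(rows, attrs):
    """Share of rows whose values on `attrs` occur exactly once in `rows`."""
    combos = Counter(tuple(row[a] for a in attrs) for row in rows)
    unique_rows = sum(1 for row in rows
                      if combos[tuple(row[a] for a in attrs)] == 1)
    return unique_rows / len(rows)


attribute_order = ["age_band", "sex", "county", "occupation"]

# Release one more attribute at each step and recompute the unique share.
fractions = [unique_fraction(records, attribute_order[:k])
             for k in range(1, len(attribute_order) + 1)]

for k, share in enumerate(fractions, start=1):
    print(f"{k} attribute(s): {share:.0%} of records are unique")
```

A record that is unique on a set of attributes stays unique when more attributes are added, so the unique share can never fall as detail increases. Uniqueness within a file is only a theoretical upper bound on risk: an intruder would still need an outside source that carries the same attributes together with names or other direct identifiers.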
More recently, a small but growing number of surveys are making use of new technologies for collecting biological and geographic information, which in turn make it easier to identify respondents—or more difficult to conceal their identity (see, e.g., National Research Council, 1998, 2001a). Such information, which includes DNA samples, biological measurements, and geospatial coordinates, complicates the problem of making data files anonymous and heightens the dilemma of data collection agencies and researchers who want to increase access to the data they collect while protecting the confidentiality of respondents (see, e.g., Abowd and Lane, 2004).
Other factors that may increase the risk of statistical disclosure are external to the survey organization and researcher. As noted above and in Chapter 2, these factors include the increasing availability of files in the external environment that are suitable for matching to survey records and, in addition, contain names and addresses or other direct identifiers; the ready availability of matching software; and quantum increases in the processing and storage capabilities of computer hardware and software, which make it possible to manipulate multiple files with rapidity and relative ease. If microdata have been stripped of direct identifiers but no added steps have been taken to minimize disclosure risk, it is relatively easy to match the file with external databases that contain some of the same variables as the original microdata (plus names and addresses) and thus to identify some respondents (see, e.g., Winkler, 1988). Similar research has been conducted by others (see, e.g., Sweeney, 2001). However, the panel knows of no information on whether this has been done other than in a research context.
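The kind of matching described above can be sketched in a few lines. This is a simplified exact-match illustration on invented data, not the probabilistic record linkage methods used in the research cited; the file contents, field names, and the helper `link_unique` are all hypothetical.

```python
from collections import defaultdict

# Synthetic survey microdata with direct identifiers removed.
survey = [
    {"id": "s1", "birth_year": 1941, "sex": "F", "zip": "48104", "income": 52000},
    {"id": "s2", "birth_year": 1941, "sex": "M", "zip": "48104", "income": 31000},
    {"id": "s3", "birth_year": 1950, "sex": "F", "zip": "48105", "income": 78000},
]

# Synthetic external file that carries names plus some of the same variables.
external = [
    {"name": "Person A", "birth_year": 1941, "sex": "F", "zip": "48104"},
    {"name": "Person B", "birth_year": 1950, "sex": "F", "zip": "48105"},
    {"name": "Person C", "birth_year": 1950, "sex": "F", "zip": "48105"},
    {"name": "Person D", "birth_year": 1962, "sex": "M", "zip": "48103"},
]


def link_unique(survey_rows, external_rows, keys):
    """Return {survey id: name} for key values unique in both files."""
    def key_of(row):
        return tuple(row[k] for k in keys)

    survey_index = defaultdict(list)
    for row in survey_rows:
        survey_index[key_of(row)].append(row)

    external_index = defaultdict(list)
    for row in external_rows:
        external_index[key_of(row)].append(row)

    links = {}
    for key, rows in survey_index.items():
        candidates = external_index.get(key, [])
        # Claim a match only when exactly one record on each side shares the key.
        if len(rows) == 1 and len(candidates) == 1:
            links[rows[0]["id"]] = candidates[0]["name"]
    return links


linked = link_unique(survey, external, ["birth_year", "sex", "zip"])
print(linked)
```

In this toy case only one survey record is linked: the second has no counterpart in the external file, and the third matches two named individuals and so remains ambiguous. Coarsening or masking the shared variables works by pushing more records into the unmatched or ambiguous groups.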
Statistical agencies and survey organizations understandably worry that wider access to ever more complex datasets, in an era of cheap, capacious computing technology and many outside data sources for matching, will increase the risk of statistical disclosure and the potential for harm to respondents, as well as to survey participation. Although many factors seem to increase the risk of disclosure, there is some evidence suggesting that increasing the number of attributes in a data record does not necessarily lead to increased disclosure. For example, the Retirement History Survey (RHS), which followed people who were aged 58-63 in 1969 for 10 years, made more information publicly available than the HRS, which has followed people aged 51 and older since 1992. Yet there are no known instances of a breach of confidentiality for the RHS, from which microdata have been publicly available for more than 30 years. Similarly, there are no known instances of disclosure or consequent harm for other richly detailed and long-available datasets, such as the Panel Study of Income Dynamics, which has followed families and their descendants for more than 35 years. Although this evidence is suggestive, it is important for statistical and other agencies to know how often inappropriate disclosures of information actually occur and what the risk of disclosure is in different circumstances.
Ultimately, decisions about how much disclosure risk is acceptable in order to achieve the benefits of greater access to research data involve weighing the potential harm posed by disclosure against the benefits potentially forgone, as well as a judgment about who should make those decisions. The panel does not resolve these difficult issues. Rather, in Chapter 5 we recommend research to reduce disclosure risk while preserving data utility. We also recommend research that improves estimation of disclosure risk and procedures for monitoring the actual frequency of disclosure. Finally, we recommend continuing consultation with data users and data providers about all of these issues.