The United States at present has the most extensive array of data collection programs undertaken by federal statistical, research, and administrative agencies in its history. Collectively, these data yield a detailed portrait of population groups and of organizations that affect people’s lives (employers, educators, health care providers, and others). When made available in the form of microdata, particularly linked, longitudinal microdata, federal data collections provide an unparalleled resource for policy analysis and research on important social issues. The interest in such data is exemplified by a trend toward studies with great richness and detailed information, such as the proposed national children’s study on environmental and genetic effects on health and development (see www.nationalchildrensstudy.gov).
Yet this very trend has increased the risk of violating the confidentiality of those who provide the information. Recent innovations in information technology, such as the widespread availability of data about individuals on the Internet, have also increased that risk. In response, many data collection agencies have reduced the amount of detail in publicly available microdata sets, although they have also worked with researchers to develop new methods and arrangements for data access that protect confidentiality and respect privacy.
Privacy has many dimensions. The emphasis in this report is on informational privacy, which encompasses an individual’s freedom to choose the extent and circumstances under which personal information will be shared with others, and how it will be used. Confidentiality refers broadly to an obligation not to transmit identifiable information—for an individual or a business—to an unauthorized party. More specifically, this report is concerned with the explicit or implicit promises made to respondents regarding how their data will be used and the extent to which they will be protected against the risk that the data they provide may allow others to identify them (see National Research Council, 1993:22).
The nation needs to use its statistical data, especially properly protected microdata, for credible, detailed analyses of current and proposed government programs and policies in such areas as education, health care, and taxation. These data are also needed for basic research in the social, behavioral, and economic sciences that can advance the quality and scope of policy analyses. Much basic and policy research will be undertaken outside the federal government, in universities and other research centers. Thus, there are questions about how to provide researchers—inside and outside government—access to data that can both inform public policy and protect the privacy of respondents and the confidential nature of the information they provide.
SCOPE AND STRUCTURE OF REPORT
In response to those questions, the Panel on Data Access for Research Purposes undertook a study to understand and propose ways to resolve the tension between the goal of facilitating researchers’ access to federal data collections, particularly detailed microdata sets, and that of maintaining confidentiality. The panel was convened by the National Academies’ Committee on National Statistics (CNSTAT) at the request of the National Institute on Aging, which supports the collection of microdata and funds research that depends on the availability of those data for analysis. The panel began its work early in 2003, building on earlier efforts by CNSTAT. Those efforts included a major comprehensive review of the issues more than a decade ago, which produced Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics (National Research Council, 1993) and a workshop held in 1999 (National Research Council, 2000). A CNSTAT report from two decades ago on the benefits of sharing research data is also still relevant (National Research Council, 1985).
The panel was given the following specific charge for its work:
This study will assess competing approaches to promoting exploitation of the research potential of microdata—particularly linked longitudinal microdata—while preserving respondent confidentiality. The ultimate goal is for the panel to make recommendations about how microdata should optimally (from a societal standpoint) be made available to researchers. This will require, among other things, thinking about how to measure the value of the research good made possible by data production and access, as well as the risk (and associated cost) of disclosures. Such measures are needed in order to assess the tradeoff between the benefits derived from increased protection of data versus those derived from fuller data access.
The panel may also focus on (1) technical, legal and statistical ingredients needed to promote arrangements within and between agencies, and also between government and private sector data producers; (2) enforcement of legal protections for data subjects and appropriate penalties for misuse, and how breakdowns in security are detected, assuming they are, and traced to responsible parties.
The panel will also consider the relative advantages associated with various approaches to data protection and form recommendations about (1) alternative, less burdensome systems (e.g., Internet, remote access, etc.) of providing researchers with access to restricted data and (2) cutting-edge statistical techniques for manipulating data in ways that claim to preserve important statistical properties and allow for broader general data release.
In undertaking its work, the panel quickly discovered that the range of issues, as well as developments since the earlier major CNSTAT report, precluded detailed consideration of the optional elements in our charge. We took as our task a broad overview of the basic issues; we did not explore in detail all data that are or might be available and all ways to protect them. For example, we did not consider the resources and structure of data collection agencies, although we recognize that agencies’ histories, priorities, stakeholders, and incentives are factors that affect data access. In addition, although we acknowledge efforts in other countries to develop innovative, workable methods for research access to microdata (notably, the work of the Luxembourg Income Study and the German Socio-Economic Panel Study), we limited our study to the United States because of the differences in this country’s laws, organizational structures, and public attitudes. And we touch only briefly on the role of nongovernment survey organizations.
Yet the basic responsibilities and techniques for protecting privacy and confidentiality while promoting data access for research are applicable across all kinds of data, including administrative data on individuals and businesses linked to microdata and individuals’ biological data collected in surveys. For example, many kinds of biological information—such as blood pressure, weight, or cholesterol—can be released publicly after some alteration, just as data on income or hours of work can, because the information is not unique. In contrast, genetic data (such as a DNA sample) and data from geographic information systems (GIS) are unique, as are Social Security numbers, and pose much more difficult issues of protection (see National Research Council, 2001a).
This report is intended to take stock of the present situation; it should be seen as one in a line of periodic assessments that will be required over time. It does not pretend to have all the answers nor, given constrained resources, to represent a detailed investigation of alternative data access methods and arrangements. It provides a broad view of the issues, noting why imaginative data access methods are required to satisfy the public need for sophisticated policy analysis and basic social science research.
In addition to regular meetings, the panel held a workshop in fall 2003 to obtain a wide range of views on how issues of data access and confidentiality protection have changed over the past decade since Private Lives and Public Policies was published (see Appendix A for a summary of the workshop). The rest of the panel’s work was carried out through intense discussions and sharing of draft materials at its meetings.
The rest of this chapter and Chapter 2 provide context for the panel’s work. We begin in this chapter with a brief overview of Private Lives and Public Policies (National Research Council, 1993). Our report draws on the conceptual framework presented there, though it does not revisit in detail the issues or recommendations covered in that study (some of which have not yet been implemented). Rather, our focus is on changes in key areas in the past decade, detailed in Chapter 2: increased public concerns about privacy and confidentiality; society’s increased need for data for policy analysis and evaluation and, consequently, for basic social science research; changes in information technology; changes in laws and regulations; and developments in methods for providing access to data for research.
Chapter 3 discusses the benefits to society from the research use of data collected by federal agencies, especially from rich microdata for individuals, organizations, and businesses. Chapter 4 discusses the potential costs to data providers in terms of possible breaches of confidentiality. Chapter 5 proposes ways to reconcile the tensions between the benefits and risks with recommendations for improved access to data for research purposes while protecting the data’s confidentiality.
Appendix A is a summary of our workshop, for which we commissioned papers on a range of topics relevant to the panel’s task: the economics of data confidentiality (Abowd and Lane, 2003); the role of data access in scientific replication (Bailar, 2003); balancing individual rights and societal benefits from data collection and analysis (Barquin and Northouse, 2003); the role of longitudinal microdata in research and policy (Brown, 2003); the Census Bureau’s research data center network (Hildreth, 2003); recent legislation relevant to privacy, confidentiality, and data sharing (McMillen, 2003); protecting confidentiality of research data through legal means (Perritt, 2003); evaluating inferences from synthetic data (Raghunathan, 2003); estimating probabilities of identification for microdata (Reiter, 2003); monitored remote microdata access systems (Rowland, 2003); licensing and enforcement mechanisms for promoting data access and protecting confidentiality (Seastrom, Wright, and Melnicki, 2003); and the historical record of disclosure and risk (Seltzer and Anderson, 2003). In addition, Michael Larsen gave a talk on technical, legal, and organizational barriers to data linking.
PRIVATE LIVES AND PUBLIC POLICIES: A DECADE LATER
Just over 10 years ago, the Panel on Confidentiality and Data Access produced the report, Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics (National Research Council, 1993). Commissioned by the Committee on National Statistics in collaboration with the Social Science Research Council, the report emphasized the inherent tension between protecting the privacy of individuals and obtaining and disseminating accurate, detailed data to inform public policies. Society affirms to individuals the value of assuring their information privacy and confidentiality, but this affirmation must be balanced with the need of the community for data about individuals and organizations. The first two paragraphs (National Research Council, 1993:15) establish the competing forces:
Private lives are requisite for a free society. To an extent unparalleled in the nation’s history, however, private lives are being encroached on by organizations seeking and disseminating information. In their stewardship of data collection and data dissemination, federal statistical agencies have had a long-standing concern for the privacy rights of the data providers, but they now face mounting demands for privacy…
In a free society, public policies come through the actions of the people. Those public policies influence individual lives at every stage—financing of prenatal care, state aid to school districts, job training and placement, law enforcement, and determining retirement benefits. Data provided by federal statistical agencies … are the factual base needed for informed public discussion about the direction and implementation of those policies. Further, public policies encompass not only government programs but all those activities that influence the general welfare, whether initiated by government, business, labor, or not-for-profit organizations. Thus, the effective functioning of a free society requires broad dissemination of statistical information.
The report’s thorough analysis stressed that government data collection operations must reflect both the obligation to supply the information that is needed to inform the country’s democratic and free society and the sometimes competing obligation to respect the individual (or organization) who provides the often highly personal responses on which those data are based. Most importantly, the report laid out a series of recommendations for helping to resolve the tension between these two fundamental mandates. Today, many of these recommendations have been implemented and have led to better information practices. Why, then, look to this issue again?
At root, the tension between the concern for the data provider manifest in the phrase, “private lives,” and society’s need for data, signaled by “public policies,” is structural and can never be fully resolved, no matter how enlightened the practices of a statistical or other data collection agency may be. However, changing conditions can increase (or reduce) the level of tension. As brokers between the data provider and the data user, statistical and research agencies need to continually examine changing conditions and attempt to resolve the resulting tension as best they can.
There is no doubt that these early years in the 21st century challenge statistical and other data collection agencies with a sharply increased level of tension between the two mandates, which, in turn, calls for reexamination of information practices. Several key changes since Private Lives and Public Policies thus motivated our study:
There is evidence of increased public concern about personal privacy and distrust of government assurances of confidentiality; such concern is predictive of reduced cooperation with censuses and surveys (see, e.g., Singer, Mathiowetz, and Couper, 1993; Singer, Van Hoewyk, and Neugebauer, 2003; National Research Council, 2004b; Hillygus et al., 2006).
There is also evidence of considerable public unease over the burgeoning capability of businesses and private organizations (such as credit rating firms) to gather personal information about millions of individuals (see, e.g., Dash and Zeller, 2005). In early 2001, large percentages of Americans expressed concern about on-line credit card theft (87 percent), Internet fraud (80 percent), hacking of government computers (78 percent), and hacking of business computers (76 percent) (Fox and Lewis, 2001:2). More recently, there have been widely publicized instances of unauthorized release of personal records maintained by large data warehouse firms and credit card companies.
Assessment of complex public policies requires increasingly detailed data, and researchers increasingly have the ability to carry out a wide range of causal analyses.
Statistical and research agencies have successfully carried out major data collections, including longitudinal surveys, that can answer important policy questions for which aggregate, cross-sectional data would be inadequate. There is an obligation to ensure that this substantial investment of public monies yields factual evidence that can inform debate in public policy areas.
New kinds of individually identifiable data, such as unique genetic information and increasingly precise geospatial detail, can be collected and disseminated.
Statistical agencies, which conduct substantial methodological research on data collection and estimation, often do not have either the internal resources or the political mandate to carry out the causal analyses needed for policy formulation and the advancement of scientific knowledge relevant to policy; this capability is more readily found in the research and policy analysis community.
Advances in information technology have raised both fears of privacy intrusion and expectations about access to information. Two key factors increasing the risk of disclosure are the existence of comprehensive databases with individual identifiers and the development of sophisticated record-linkage and data-mining methodologies, many of which are readily available on the Internet.
The legal framework that guides the information process has changed in important ways, notably with the enactment of the Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA) and the fact that medical records, including those used for research, are subject to new confidentiality regulations under the Health Insurance Portability and Accountability Act of 1996 (HIPAA). At the same time, the USA Patriot Act of 2001 overturned the protection previously accorded education records gathered and maintained by the National Center for Education Statistics (see National Research Council, 2005:35).
New techniques have been developed for producing restricted data products that can be made publicly available because the data have been altered to minimize the risk of individual identification. Techniques have also been developed for analyzing the effects of different alteration methods on disclosure risk and data utility.
Agencies and researchers now have some experience with restricted data access procedures put in place during the last decade that permit authorized researchers to use confidential data that are not publicly available. Those procedures include protected enclaves, commonly known as research data centers; licensing arrangements; and methods for secure, monitored on-line access to data.
The next three chapters explore these issues in more depth.