1

Introduction

In October 1999, the Committee on National Statistics (CNSTAT), in consultation with the Institute of Medicine, convened a 2-day workshop to identify ways of advancing the often conflicting goals of exploiting the research potential of microdata and preserving confidentiality. The emphasis of the workshop was on longitudinal data that are linked to administrative records; such data are essential to a broad range of research efforts, but can also be vulnerable to disclosure. Administrative data are collected to carry out agency missions and constitute the majority of agency data. An additional—much smaller—amount of data is collected specifically for research and other public purposes. It is sometimes feasible and useful to merge the latter data with the more extensive administrative records.

CNSTAT has had an active history working in the area of data confidentiality and access, culminating with the panel study that produced the volume Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics (National Research Council and Social Science Research Council, 1993). That study resulted in a series of recommendations for advancing researchers' access to data without compromising the ability to protect the confidentiality of survey respondents. This workshop brought together several participants from that study and many others representing various communities—data producers from federal agencies and research organizations; data users, including academic researchers; and experts in statistical disclosure limitation techniques, confidentiality policies, and administrative and legal procedures.







Improving Access to and Confidentiality of Research Data: Report of a Workshop

KEY ISSUES

The development of longitudinal data sets linked to health, economic, contextual geographic, and employer information has created unique and growing research opportunities. However, the proliferation of linked data has simultaneously produced a complex set of challenges that must be met to preserve the confidentiality of information provided by survey respondents and citizens whose administrative records are entrusted to the government. Unprecedented demand for household- and individual-level data, along with the continuing rapid development of information technology, has drawn increasing attention to these issues.

Technological advances have rapidly improved the range and depth of data; opportunities to access, analyze, and protect data have grown as well. However, technology has concurrently created new methods for identifying individuals from available information, of which longitudinal research data are but one of many sources.

Longitudinal files that link survey, administrative, and contextual data provide exceptionally rich sources of information for researchers working in the areas of health care, education, and economic policy. To construct such files, substantial resources must be devoted to data acquisition and to the resolution of technical, legal, and ethical issues. In most cases, requirements designed to protect confidentiality rule out the type of universal, unrestricted data access that custodians, and certainly users, of such databases may prefer.

Several modes of dissemination are currently used to provide access to information contained in linked longitudinal databases. Dissemination is typically restricted either at the source, at the access point, or both. Products such as aggregated, cross-tabulation tables are published regularly and made available to all users, but of course offer no record-level detail. This type of data does not support research into complex individual behavior. Public-use microdata files, on the other hand, offer detail at the individual or household level and are available with minimal use restrictions. However, producers of microdata must suppress direct identifier fields and use data masking techniques to preserve confidentiality. Additional methods, such as licensing agreements, data centers, and remote and limited access, have been developed to limit either the types of users allowed access to the data, the level of data detail accessible by a given user, or both. Restricted access arrangements are generally designed to provide users with more detail than they would get from a public-use file.

It is within this context that the workshop participants debated the key issues, which can loosely be organized at two levels. The first is the tradeoff that exists between increasing data access on the one hand and improving data security and confidentiality on the other.¹ To examine this tradeoff, it is necessary to quantify, to the extent possible, disclosure risks and costs, as well as the benefits associated with longitudinal microdata and with linking to administrative records. Decisions about what types of data can be made available, to whom, and by what method hinge on the assessment of these relative costs and benefits. Researchers typically appeal for greater access to unaltered data, while stewards of the data are understandably often more focused on assessing and minimizing disclosure risk.

At the second level of discourse, participants discussed alternative approaches to limiting disclosure risk while facilitating data access. Given that all longitudinal microdata require some protections, the compelling question is which approach best serves data users while maintaining acceptable levels of security. The choice reduces essentially to two options: (1) restricting access (physically limiting who gets to see the data), or (2) altering the data sufficiently to allow for safe broader (public) access. Other elements, such as legal deterrents, also come into play. Workshop participants articulated in detail the merits and relative advantages of alternative approaches. Their arguments are summarized in this report.

¹ Early on in the workshop, a participant clarified the distinction between “privacy” and “confidentiality.” Privacy typically implies the right to be left alone personally, the right not to have property invaded or misused, freedom to act without outside interference, and freedom from intrusion and observation. In the context of research data, confidentiality is more relevant. The term refers to information that is sensitive and should not be released to unauthorized entities. It was suggested that confidentiality implies the need for technical methods of security.

WORKSHOP GOALS

As noted above, a central objective of the workshop was to review the benefits and risks associated with public-use research data files and to explore alternative procedures for restricting access to sensitive data, especially longitudinal survey data that have been linked to administrative records. Doing so requires considering the impact on each group involved (survey respondents, data producers, and data users) of measures designed to reduce disclosure risk.

Presenters from the academic community reviewed the types of research that are enhanced, or only made possible, by the availability of linked longitudinal data. Participants also identified and suggested methods for improving current practices used by agencies and research organizations for releasing public-use data and for establishing restricted access to nonpublic files. The overarching theme was the importance of advancing methods that maximize the social return on investments in research data, while fully complying with legal and ethical requirements.

The workshop, then, was designed with the following goals in mind:

• To review the types of research that are enhanced, or only made possible, using linked longitudinal data.
• To review current practices and concerns of federal agencies and other data-producing organizations.
• To provide an overview of administrative arrangements used to preserve confidentiality.
• To identify ways of fostering data accessibility in secondary analysis.
• To assess the utility of statistical methods for limiting disclosure risk.

To date, efforts to address these themes have been hindered by inadequate interaction between researchers who use the data and agencies that produce them and regulate their dissemination. Researchers may not understand and may become frustrated by access-inhibiting rules and procedures; on the other hand, agencies and institutional review boards are not fully aware of how statistical disclosure limitation measures affect data users. The workshop brought the two groups together to help overcome these communication barriers.

REPORT ORGANIZATION

Workshop topics were organized into the following sessions: (I) linked longitudinal databases: achievements to date and research applications, (II) legal and ethical requirements for data dissemination, (III) procedures for releasing public-use microdata files, and (IV) procedures for restricted access to research data files. This report is structured slightly differently to focus on themes as they emerged during the workshop.

Chapter 2 outlines the tradeoff between data access and confidentiality. Presentations on the research benefits of linked longitudinal data are summarized, along with discussions of disclosure risk assessment and quantification. Chapter 3 reviews presentations that addressed ethical and legal aspects of data dissemination, as well as discussion on the role of institutional review boards. Chapter 4 summarizes participants' assessments of competing approaches to limiting disclosure risk and facilitating user access; the focus is on two primary competing approaches: data perturbation and access limitation. Agency and organization practices are the subject of Chapter 5. In addition, two appendices are provided: Appendix A is a list of the workshop participants; Appendix B is the workshop agenda.