Policy makers need information about the state of the nation—from the national economy to household use of Medicare—in order to evaluate existing programs and to develop new ones. That information often comes from research based on data collected by statistical agencies or others under a pledge of confidentiality. The most critical data are microdata—data about individual people, households, and businesses and other organizations.
The benefit of providing wider access to microdata for researchers and policy analysts is better-informed public policy. The cost is an increased risk of breaching the confidentiality of the data.
Both data collection and research are decentralized activities in the United States. Many federal agencies collect data, from the decennial census to statistics on traffic patterns, and some sponsor data collection through universities and other nongovernment institutions. Although some agencies have in-house staffs of policy analysts and researchers, most researchers are based at universities and other nongovernment institutions. The value of this decentralized system is that it ensures a variety of perspectives and approaches to both data collection and research. The challenge is to safeguard the confidentiality of the data while making them available to researchers and analysts in a wide variety of settings. One consequence of the decentralized system is a frequent lack of understanding about how data could and will be used, and a lack of planning for those uses.
The charge to the Panel on Data Access for Research Purposes was “to assess competing approaches to promoting exploitation of the research potential of microdata—particularly linked longitudinal microdata—while preserving respondent confidentiality.” The panel was asked to consider the tradeoffs between the benefits and risks of data access and to make recommendations about “how microdata should optimally (from a societal standpoint) be made available to researchers.”
The panel concludes that no one way is optimal for all data users or all purposes. To meet society’s needs for high-quality research and statistics, the nation’s statistical and research agencies must provide both unrestricted access to anonymized public-use files and restricted access to detailed, individually identifiable confidential data for researchers under carefully specified conditions.
Research using detailed confidential data is needed not only for well-informed policy making but also to improve the quality of public-use files, which are the most widely used microdata products made available by statistical and other data collection agencies. In turn, wide access to public-use data leads to new analyses and conclusions that must be tested on the more detailed confidential data available only through restricted access.
High-quality public-use files require continuing research into methods of assuring the inferential validity of the data while safeguarding their confidentiality. A great deal of promising work has been done on this topic, but more is clearly needed.
At the same time, the continuing need for restricted access to more detailed microdata means that the conditions for obtaining such access need improvement on a continuing basis. The use of licensing agreements as a mechanism for granting wider access to confidential microdata should be expanded. Especially important is easier access to research data centers, such as those maintained at universities and other host institutions by the U.S. Census Bureau. Such centers, which several other agencies maintain at their headquarters, are currently the only places where researchers have access to key microdata that provide the level of detail (e.g., small geographic areas) needed for many important analyses. Research to facilitate secure remote access to these data centers is also needed in order to remove the burden on researchers of traveling to a distant site.
We believe that the changes we recommend will result in wider access to high-quality anonymized public-use files as well as to potentially identifiable microdata. But such expanded access requires expanded procedural and legal protections. The panel believes that users, like agencies, should be held accountable for safeguarding the confidentiality of microdata files to which they are granted access. We recommend that statistical agencies set up procedures for monitoring any breaches of confidentiality that may occur, as well as their causes and consequences. We recommend that agencies require auditing of license holders and penalties for violations of the license. We also recommend that agencies institute confidentiality agreements for public-use data files and meaningful penalties for all data users who willfully violate such agreements.
However, laws, enforcement, and penalties are not enough to safeguard the confidentiality of research records. What is needed in addition to the legal sanctions is a system of norms and values concerning the ethical use of such data. Everyone working with confidential research records—interviewers, data entry clerks, statistical analysts, and social and behavioral scientists—requires education and training in these ethical principles and practices. The statistical system of the United States ultimately depends on the willingness of the public to provide the information on which research data are based. To ensure such willingness, there must be scrupulous attention to assuring the informed consent of data providers, as well as continuing research into public attitudes relevant to data collection, privacy, and confidentiality.
The panel’s recommendations should be read in the context of the many existing reports that have addressed similar issues of data access and confidentiality protection in the past. In particular, we have drawn heavily on Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics, published in 1993, though we have not attempted to make recommendations in all of the areas considered in that report. Rather, our recommendations focus on the needs highlighted by legal, social, and technological changes during the last decade.
The panel offers four recommendations on basic issues of documentation and access:
maintenance of bibliographies of research and policy analysis publications by government and nongovernment data collection agencies in order to provide tangible evidence of the benefits of making data widely available for analysis;
use of a variety of modes for data access, including restricted access to confidential data and unrestricted access to appropriately altered public-use data, in order to meet research needs for high-quality data with different levels of detail and precision;
research to guide more efficient allocation of resources among different data access modes; and
greater involvement of users in planning modes of access to agencies’ data in order to better accommodate their needs.
The panel offers four recommendations focused specifically on public-use data. The first two are intended to increase access to data, while the third and fourth try to balance increased access with increased safeguards against misuse:
research on techniques for providing useful, innovative public-use data sets that increase informational utility without increasing disclosure risk;
a new system of access to public-use microdata through existing and new data archives (following recommendations in Protecting Participants and Facilitating Social and Behavioral Science Research), intended to speed researchers’ access to such files;
a warning on all public-use data that the data are provided for statistical purposes only and a requirement that all users attest to having read the warning; and
restriction of access to public-use data to those who agree to abide by the confidentiality protections governing such data and the institution of meaningful penalties for willful misuse of those data.
Looking more closely at restricted access to confidential data, the panel offers five recommendations on research data centers, remote access, and licensing agreements:
for the Census Bureau, broadening interpretation of the criteria for access to data, maintaining a continuous cycle for reviewing research proposals, and taking account of prior scientific review of those proposals in order to facilitate and speed researcher access;
research by statistical and other agencies that sponsor data collection on cost-effective means of providing secure access through remote data access mechanisms, with the aim of increasing the availability of remote access to confidential data;
the use of licensing agreements for access to confidential data by statistical and other agencies that do not now have them, and expansion of the data files for which a license may be obtained;
development of flexible, consistent standards for licensing agreements and implementation procedures by statistical agencies, with the involvement of data users; and
inclusion in licensing agreements of auditing procedures and appropriate legal penalties for the willful misuse of confidential data, in order to balance expanded access with appropriate confidentiality safeguards.
People’s perceptions of benefits and trust that they will not be harmed as a result of the information they provide are crucial to their cooperation with data collection efforts. The panel offers two recommendations on this topic:
provision by data collection agencies of basic information about confidentiality and data access to everyone asked to participate in statistical surveys; and
continuing research on the views of data providers and the public about research benefits and risks.
Because the panel believes that laws and enforcement alone are inadequate for protecting confidential data, it offers four recommendations on training, monitoring, and education to complement legal, administrative, and technical protections. The first two are directed to data collection agencies:
providing employees with continually updated written guidelines for confidentiality protection of individually identifiable data and training in confidentiality practices and data management; and
ongoing research into violations of confidentiality protection procedures and breaches of confidentiality that may occur, as well as the causes and consequences of those breaches.
The last two are directed at educational and professional organizations, which are an important source for the development of professional norms and ethical standards:
training in ethical issues related to research for all those involved in the design, collection, analysis, and distribution of data obtained under pledges of confidentiality; and
development of strong codes of ethical conduct to protect the confidentiality of personal data and education about those codes.
The panel is confident that, taken together, these recommendations can improve access to and use of data for research and so improve the quality and relevance of those data for social science research and public policy, while providing appropriate protection for the confidentiality of identifiable data.