ACCESS TO RESEARCH DATA: ASSESSING RISKS AND OPPORTUNITIES OCTOBER 16-17, 2003
The panel held a workshop early in its deliberations to hear from experts about how microdata can best be made available to researchers while protecting respondent confidentiality. The workshop goals were to generate information for the panel’s use and to provide a venue for the papers commissioned by the panel to be presented and discussed in a public forum.
This summary is organized around the six topics that were the subjects of the commissioned papers:
the changing legal landscape;
facilitating data access;
measuring the risks and costs of disclosure;
the impact of multiple imputation on disclosure risk and information utility;
assessing the benefits of researcher access to longitudinal microdata; and
assessing research and policy needs—the economics of data access.
The papers, presenters, and discussants are listed at the end of the
Appendix. The papers are available electronically (www7.nationalacademies.org/cnstat/Data_Access_Panel.html).
BACKGROUND AND OVERVIEW
In 1999 the Committee on National Statistics (CNSTAT) held a workshop focused on the procedures used by agencies and organizations for releasing public-use microdata files and for establishing restricted access to nonpublic files. Tradeoffs between research and other data user needs and confidentiality requirements were articulated, as were the relative advantages and costs of data alteration techniques versus restricted (physical) access arrangements. The report of that workshop, Improving Access to and Confidentiality of Research Data (National Research Council 2000), provided a starting point for the panel’s work.
The panel’s workshop followed up on many of the topics discussed in the 1999 workshop, but the focus was less on what agencies are currently doing and more on emerging opportunities, specifically relating to research access to longitudinal microdata. Participants provided an in-depth look at a number of topics ranging from the role of licensing and penalties for infringing on licensing agreements to the potential of data linking (e.g., between survey and administrative data), particularly the technical, legal, and statistical arrangements that would be needed to promote linking within and between agencies and between government and private-sector data producers. The workshop also sought to promote discussion of how to measure the risks and costs associated with data use, disclosure, and limiting access; what levels of risk are acceptable; and public perceptions about privacy as they pertain to market data in comparison with government data.
The first day was devoted to three topics on risks and opportunities: legal, technical and organizational, and normative. The papers and discussion in the first session examined various aspects of the legal landscape, emphasizing important recent changes, particularly the Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA). The papers highlighted that the legal framework offers a range of opportunities for promoting wider access to research data. For example, through the use of licensing agreements, some of the legal responsibility for maintaining confidentiality can be shifted to data users by the agencies that collect the data.
The papers and discussion in the second session focused primarily on technical and organizational opportunities, both at a general level and as manifested by special arrangements such as the Census Bureau’s Research Data Centers and remote access to centrally stored data. The third session was devoted to the difficult problem of accurately assessing disclosure risks associated with access to microdata and to the question of what harms, if any, have come to participants as a result of the disclosure of the information they provided to government agencies.
The second day of the workshop focused on ways of dealing with potentially conflicting goals: information utility and confidentiality protection. Session four focused on one particular method for accommodating those two values, the creation of multiply imputed synthetic data, and assessed how well analyses based on such data might approximate models estimated from unaltered data.
Session five was devoted primarily to discussing the scientific and practical benefits of providing restricted as well as unrestricted access to research data. Participants also considered scientific replication and the usefulness of access to data by multiple parties in that process. The final session attempted to assess costs and benefits associated with different approaches to providing data access and protecting confidentiality.
THE CHANGING LEGAL LANDSCAPE
David McMillen opened the session with the presentation of his paper “Privacy, Confidentiality, and Data Sharing,” which reviewed new legislation dealing with these issues. He also provided an overview of the legislative history of CIPSEA, which is Title V of the E-Government Act of 2002, and offered some thoughts about its implementation.
McMillen focused on the principle and application of informed consent agreements. He underscored the point that the legislative history on privacy indicates that when people provide information to the government, or to a private entity, they do not give up all rights over how those data are used. Correspondingly, when the government receives information from the public, it is not free to use that information for purposes other than those for which it was collected. The central issue for McMillen, then, is what the terms of this implied contract between data providers and their government are, and what responsibility the government has for making those terms clear and explicit.
McMillen argued that agencies that collect information from the public should be as clear and detailed as possible in explaining to respondents how the information will be used and what the limits of confidentiality protection are. He concluded with the statement that government is based on open access to the citizens it serves, and that openness should be one of the principles that guide the development of policies about informing respondents of their rights and responsibilities when asked for information.
During the discussion of McMillen’s paper, there was considerable
divergence of opinion about how detailed informed consent agreements should or could be. Some participants articulated the view that McMillen’s prescriptions would lead to a serious decrease in the utility and value of government-collected microdata.
Marilyn Seastrom, Candice Wright, and John Melnicki then presented their paper, “The Role of Licensing and Enforcement Mechanisms in Promoting Access and Protecting Confidentiality.” Licensing agreements allow researchers to use protected confidential data files in a secure environment at their home institution, subject to the terms and responsibilities specified in an agreement. In the first part of their presentation, the authors reviewed the strengths and weaknesses of various licensing arrangements currently in practice in the United States and abroad. They also described instances in which enforcement has led to sanctions and administrative penalties have been imposed for misuse of data. The authors also reviewed application procedures, data security plans, and the many types of data agreements in place.
The authors concluded that the enforcement mechanisms are, for a number of reasons, quite weak and that, consequently, there have been violations, though most of them are relatively minor (e.g., computers left unattended, failure to maintain a log for check in/out of data, or data not properly stored when not in use). They concluded with three recommendations: (1) more routine use of security inspections, (2) implementation of termination procedures, and (3) maintenance of a tracking database.
First, given the potential importance of security inspections as a means of monitoring and enforcement of data-use agreements, all agencies that license external researchers to use confidential microdata files should give serious consideration to instituting security inspections on a regular basis. Second, to meet the legal requirements associated with individually identifiable data, entities licensing the use of confidential data must have procedures in place for monitoring the disposition of the data files at the completion of a research project. This requirement can help ensure that the data are not subsequently used for unauthorized purposes. Third, to run an effective data-use agreement program, an agency must have and maintain complete, accurate, and thorough records for each data agreement. Such records are essential for monitoring the authorized users, the approved uses of the data, and the security of the data.
Henry H. Perritt, Jr., concluded the opening session with the presentation of his paper, “Efficacy of Different Theories of Enforcement.” The paper suggests a framework within which the efficacy of legal protections of confidentiality can be evaluated, offers qualitative standards for evaluating the effectiveness of existing law, and identifies alternative approaches for strengthening legal protections. The paper begins with an evaluation of the possibility that federal or state law might compel researchers to disclose confidential data received from federal government sources. It identifies the two kinds of private interests that warrant shielding data from disclosure and the sources of law that prohibit disclosure of data identified as confidential by the government agency from which it was received.
Perritt concluded that legal liability is only a weak protection for data confidentiality because the principal privacy statutes do not recognize private rights of action for wrongful disclosure, and the case law under common-law legal theories provides sparse support, at best, for recovery for disclosure. Moreover, difficulties in detection, proof, establishment of damages, and the high cost of litigation make it unlikely that victims of wrongful disclosure would seek relief in the courts. Perritt noted that at least one respected commentator agrees with these shortcomings of existing privacy law.
Perritt proposed two promising ways to afford legal protection against wrongful research disclosure: (1) to require researchers who receive confidential data to establish internal protections, on pain of contract cancellation and bars to receiving grants in the future, and (2) to put nondisclosure language in license agreements that supports “third-party-beneficiary” recovery by data subjects.
During the discussion of Perritt’s paper, participants said that an additional protection exists because the institution at which the researcher works has the ability to discipline the researcher further; bringing the institution into the arrangement can strengthen the potential sanctions for disclosure. Perritt concurred, recommending that the design of the institutional mechanism make clear that individual researchers have responsibility and accountability and that they will be subject to discipline or discharge if they violate their obligations under the agreement. Furthermore, institutional liability itself is important since, in some sense, the institution has more at risk than an individual does. This incentive can be exploited to promote conformity to data protection rules.
Joe Cecil, one of the formal discussants for the session, said he found Perritt’s argument convincing: it would be difficult to create a meaningful private right of action for an individual and have it work in a way that would give agencies confidence that they are not left responsible for a breach of confidentiality. He argued that the notion expressed at the 1999 workshop of transferring this responsibility to data users, and at the same time giving agencies greater confidence, is perhaps a false hope. He suggested that the focus should perhaps be on strengthening the institutional mechanisms that Seastrom and others explored and on developing data-use agreements for researchers who want to download public-use files.
Katherine Wallman, the second formal discussant, provided extensive clarification about the specifications of CIPSEA. She noted that CIPSEA is the culmination of the work of not just four Congresses, but also almost 25 years of work by her, her predecessors as chief statistician of the U.S. Office of Management and Budget, and many others. The legislation is only the most recent in a long history of efforts to strengthen the legal protection of confidentiality for statistical information collected by the federal government.
The other major objective addressed in CIPSEA concerns the sharing of information among agencies with various kinds of confidentiality protections and others who are legitimate users of the information, including licensed researchers at universities, licensed researchers in public-sector and pro bono organizations, and others. She noted, however, that the dual objectives of protecting confidentiality and increasing data sharing have sometimes caused confusion about what is in CIPSEA and what is not, and about who is covered and who is not. For example, although only three named agencies are covered by the data-sharing provisions of the legislation, CIPSEA’s confidentiality protections extend to all federal agencies that collect statistical data under a pledge of confidentiality. Wallman concluded her remarks by briefly outlining the plans for implementing CIPSEA’s provisions.
FACILITATING DATA ACCESS
Michael Larsen gave a presentation on the technical, legal, and organizational barriers to data linkage. Larsen first identified the benefits motivating the goal of data linkage—how such data would be used to enhance analyses. He then discussed technical and legal barriers inhibiting data linkage. He concluded by discussing the role of data enclaves and the example of data linking between the Health and Retirement Study (HRS) and Social Security Administration (SSA) records.
Several themes emerged from the presentation. First, Larsen clearly articulated the importance of data access and data linkages to research, noting examples of questions that could not be answered without access to linked data. In addition to the technical challenges associated with accurately matching records across sources, it is important when seeking respondents’ permission at the beginning of a project to think carefully about potential linkages. Proactive work is needed both to make linkage possible and to have respondents’ support.
The paper by Andrew Hildreth, “The Census Research Data Center Network: Problems, Possibilities and Precedents,” assessed the track record of research data centers (RDCs) and the potential and problems associated with them. The RDC system stems from the desire to permit access to confidential data sets housed at the U.S. Census Bureau. The program started as a pilot in 1994; it was initially funded through the U.S. National Science Foundation (NSF) in partnership with the National Bureau of Economic Research (NBER). The goal was to make data sets such as the Longitudinal Research Database more accessible to researchers by making them available in locations other than Washington, D.C.
Hildreth began with an overview of how to apply for access, the kinds of research projects that are undertaken, and what applicants can expect in terms of process, particularly how long it might take to get to an RDC and start working with the data. He also discussed questions relating to the long-term financial prospects of the data centers. The key problem that Hildreth focused on was that of time delays. He spoke strongly in favor of a system that allows continual review of project applications.
In conclusion, Hildreth said his most important recommendation was to improve the proposal submission and review process for junior users. Wider access will bring wider recognition of what the research has meant to the Census Bureau’s data programs and what work the RDCs do and can do. Second, RDCs can be a way to achieve better alignment of the data programs with the Census Bureau’s goals. Third, some kind of core funding would be very helpful, perhaps through local institutional support, so that RDCs and the researchers who want to use them do not face yearly worries around budget time.
J. Bradford Jensen related his experiences “from the trenches” in trying to design a national framework for a data enclave model. He characterized the RDC enterprise as expensive, fragile, and tenuous. He suggested that the U.S. Census Bureau experience is representative of those of other countries and other contexts. Jensen confirmed many of Hildreth’s observations about the difficulties that lie ahead for the RDC system. However, he, too, noted the immense potential to advance research at the RDCs and was hopeful that the obstacles to their continued and improved operation could be overcome.
Sandra Rowland presented a paper, “An Examination of Monitored, Remote Microdata Access Systems,” that focuses on monitored remote (electronic) access to confidential microdata. Many national statistical offices disseminate microdata in three ways: public-use microdata files on CD-ROM or online, research centers or licensed sites, and remote access. Rowland’s paper covers a sampling of systems in national statistical offices that permit monitored remote access to confidential microdata: six foreign systems and three systems in the United States. The foreign systems are the Luxembourg Income Study, Statistics Canada, Statistics Denmark, Statistics Netherlands, the Australian Bureau of Statistics, and Statistics Sweden. The U.S. agencies are the National Center for Health Statistics, the National Center for Education Statistics, and the Census Bureau.
The paper reviews the type of methodology used in each of the systems because the methodology influences the kinds of access and results given to users. Rowland reviewed the use of each system and the kinds of research that have benefited from remote access to the extent that such information is available.
Joseph Hotz, the formal discussant for the session, emphasized the tradeoffs associated with different confidentiality protection methods. He said he was struck by the number of dimensions on which the various approaches differ, which makes an “optimal” decision difficult: there is no simple way of deciding on “the right method.” Across the different modes, from public access to data enclaves to remote access to licensing, there are differences not only in degree of access, but also in ease of use, cost, appropriateness for the types of data and information available, and the ability to customize versus having to rely on standardized data. The alternative methods also differ substantially in how they are, and how they might be, financed. Hotz concluded that in evaluating different methods, one has to consider much more than simply access.
MEASURING THE RISKS AND COSTS OF DISCLOSURE
The paper by Jerome Reiter, “Estimating Probabilities of Identification for Microdata,” describes methods for measuring identification disclosure risks, including those associated with re-identification by matching public-use microdata to external databases. The paper describes general methods for calculating sampled units’ probabilities of re-identification from the released data, given assumptions about intruder behavior.
When agencies release microdata to the public, intruders may be able to match the information in those data to records in external databases. Reiter presented specific methods for altering data to prevent such matching, including global recoding of variables, data swapping, and adding random noise. He illustrated the methods with data from the Current Population Survey, including random swapping of a subset of the values of the variables needed to protect sample “uniques” (combinations of variables such as age, sex, race, and marital status), with an age recode used in addition to swapping to strengthen the protection. He noted that knowledge of property taxes greatly increases the probability of re-identification, and that adding noise to positive tax values is not sufficient to eliminate uniques, though top coding helps.
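As an illustrative sketch only (not code from Reiter’s paper), the following shows why noise addition alone may leave extreme values recognizable, while top coding removes uniques at the upper tail. The data, distribution, and thresholds are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "property tax" values for 1,000 released records.
taxes = rng.lognormal(mean=8.0, sigma=0.7, size=1000)

# Adding random noise perturbs each value, but the largest records
# remain far from the rest of the distribution, so sample uniques
# on this variable can survive this step alone.
noisy = taxes + rng.normal(0.0, 100.0, size=taxes.shape)

# Top coding: censor everything above the 95th percentile at that
# threshold, so no single record stands out at the upper tail.
top = np.quantile(taxes, 0.95)
protected = np.minimum(noisy, top)

print(f"max before: {taxes.max():.0f}, max after: {protected.max():.0f}")
print(f"records sharing the top-coded value: {(protected == top).sum()}")
```

Because all top-coded records share the same released value, an intruder who knows a target’s true (high) tax payment can no longer single out that record.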
William Seltzer and Margo Anderson presented the paper, “Government Statistics and Individual Safety: Revisiting the Historical Record of Disclosure, Harm and Risk,” which examines the sparse but important historical record of disclosure, harm, and risk. In the broadest terms, the paper has two interrelated objectives: presentation of a body of facts and presentation of a reconceptualization of a number of the issues related to disclosure and statistical confidentiality in order to understand the implications of the facts assembled. The latter is rooted in the ethical, statistical policy, and statutory origins of the idea of statistical confidentiality.
The focus of the presentation was on issues of disclosure, harm, and risk that have emerged from the use of government statistical agencies or programs to assist in the nonstatistical task of targeting individuals or population subgroups for administrative action. The paper sets out the available evidence concerning such government efforts, which the authors argued have led to serious human rights abuses. Seltzer and Anderson also described a number of barriers to the study of disclosures, harms, and risks associated with government activities.
George Duncan, as the formal discussant, framed his comments on how to evaluate disclosure limitation methods in the context of measuring the risks and costs of disclosure. He cited limitations in current methods for measuring the presence of population uniques: most methods ignore the knowledge state of data snoopers (e.g., a snooper may or may not know that the target individual is in a sample); they provide little information about continuous data; and they provide minimal guidance for evaluating alternative disclosure limitation procedures. Duncan applauded Reiter’s application (using data from the March 2000 Current Population Survey) to demonstrate a framework based on probability of identity disclosure and for rigorously exploring the efficiency of such disclosure limitation approaches as global recoding, data swapping, and adding random noise.
THE IMPACT OF MULTIPLE IMPUTATION ON DISCLOSURE RISK
The presenter and discussants in this session focused on the advantages and disadvantages of using synthetic data as a method of protecting confidentiality while at the same time providing greater access to data and preserving their informational utility. The presentation and discussion concentrated on three questions: Could use of a multiple imputation method improve data confidentiality without significantly compromising informational utility? How well do the statistical inferences from multiply imputed data match the results that are obtained using the original
data? Do multiple imputations provide proper balance between data confidentiality and accessibility?
The paper presented by Trivellore Raghunathan, “Evaluation of Inferences from Multiple Synthetic Data Sets Created Using a Semiparametric Approach,” examined evidence on the difference in modeling results between original data and masked data. Techniques of data alteration (such as data swapping, post-randomization, masking, subseparation, truncation, rounding, and collapsing categories) may protect confidentiality, but they may also introduce bias in statistical inferences. The idea of using multiple imputation to create synthetic data sets for public release was introduced by Rubin (1993); the paper reviews that pioneering work and the related work of Little (1993), presents extensions, and evaluates the methodology with simulated data sets. Raghunathan outlined a general-purpose semiparametric approach for creating multiple synthetic data sets and showed it to be especially useful when underlying relationships are nonlinear. The goal of Raghunathan’s approach was twofold: to protect confidentiality and to preserve the key statistical properties of the original data.
Raghunathan mentioned several advantages of creating synthetic samples. For example, one can link the data, synthesize the linked data, and enhance the missing data in the original data file (as is currently being attempted for the HRS and Supplemental Security Income (SSI) variables). For that application, the plan is to take some HRS public-use data and the SSI data and then create a full synthesis of that data set.
Although the method of generating synthetic data sets is computationally intensive, Raghunathan emphasized that these multiple data sets can be analyzed using existing software packages with little additional effort. Moreover, he suggested, users of synthetic data should be able to construct an unbiased estimate from the altered data without knowing what exact alteration procedure was used to protect confidentiality.
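A minimal sketch of the basic idea, using only a toy linear model rather than Raghunathan’s semiparametric procedure: fit a model to confidential data, release several data sets drawn from the fitted model, and average the analysts’ estimates across the released sets. All data and parameters here are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Confidential data: income depends linearly on age (simulated).
n = 2000
age = rng.uniform(25, 65, n)
income = 1000 + 50 * age + rng.normal(0, 200, n)

# Step 1: fit a simple parametric model to the confidential data.
X = np.column_stack([np.ones(n), age])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)
resid_sd = np.std(income - X @ beta)

# Step 2: release m synthetic data sets, each drawn from the
# fitted model rather than copied from the real records.
m = 5
synthetic_slopes = []
for _ in range(m):
    syn_age = rng.uniform(25, 65, n)
    Xs = np.column_stack([np.ones(n), syn_age])
    syn_income = Xs @ beta + rng.normal(0, resid_sd, n)
    # An analyst refits the model on the synthetic data alone.
    b_syn, *_ = np.linalg.lstsq(Xs, syn_income, rcond=None)
    synthetic_slopes.append(b_syn[1])

# Step 3: combine across the m released sets by averaging, the
# point estimate in multiple-imputation inference.
combined_slope = np.mean(synthetic_slopes)
print(round(beta[1], 1), round(combined_slope, 1))  # slopes should be close
```

The point of the exercise is the one Raghunathan emphasized: each synthetic data set can be analyzed with ordinary software, and the combined estimate approximates what the analyst would have obtained from the confidential data.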
John Rust, serving as the formal discussant, agreed with Raghunathan’s goal of being able to have some statistical procedure that protects confidentiality without altering inference, but he expressed strong distrust of any completely mechanistic procedure to generate synthetic samples. He said that multiple imputation methods might work for some data sets, but he sees many problems with the application of this approach to such complex data as, for example, the HRS.
During the general discussion, Michael Hurd suggested—and several other discussants supported—the idea of an experiment whereby multiple data sets are imputed from actual data, and then one group of researchers analyzes the actual data while another group does the same analysis on the synthetic data; the differences between their results could then be compared. Discussants agreed that such a test would provide a great deal of valuable information. John Abowd said that such experiments are already under way at the Census Bureau, with a link between data from the Survey of Income and Program Participation (SIPP) and detailed Social Security earnings records and other administrative data from the SSA. An extension of this experiment, proposed by George Duncan, would be to also bring in data snoopers, using whatever tools they might have, to try to identify records in both data sets.
EMPIRICAL ASSESSMENTS OF THE BENEFITS OF RESEARCH ACCESS TO LONGITUDINAL MICRODATA
John Bailar presented a paper, “The Role of Data Access in Scientific Replication,” that describes the underlying issues raised by the role of access to data in scientific replication and, more broadly, the value of scientific replication. His focus was on data generated by nongovernmental sources, primarily in academia, and balancing the concerns of those who generate the data against the public interest in broader use. He noted that the state of understanding about this aspect of academic research is nowhere near as advanced as thinking about confidentiality of and access to federal microdata.
Bailar addressed several conflicts that arise in the context of data access in scientific replication. One such conflict is that society has a strong interest both in protecting privacy and confidentiality and in assuring that scientific findings and interpretations are as close to correct as the state of the art allows. Another conflict is that much research information has personal and proprietary value, which creates barriers to broad access to the data. A third conflict reflects the fact that data are the stock in trade for most research scientists, and scientific rewards are based almost exclusively on the generation and interpretation of data.
Bailar concluded with several propositions. First, few researchers would be happy to give away their final data—and especially the intermediate products of their investigations—if the products of their work are going to be examined by hostile interests bent on destroying the credibility of the findings. Nor is hostile scrutiny likely to advance the state of the science. It may discourage the best scientists from engaging in certain kinds of work that could lead to loss of exclusive access to data. Bailar strongly opposed the view that hostile examination is the best way to uncover the truth. Broad data access also raises questions about being scooped by competitors. Although this is certainly a big concern to researchers and often a barrier to sharing the data, it may have little effect on practice for the simple reason that if nobody generates data, everybody will soon be out of business.
Bailar’s second proposition was that the people who are good at generating data are not always the best at analyzing the data. He suggested that there may often be good reasons (though with some limitations) for separation of support for data generation from that for analysis. One limitation is that such separation should be considered case by case. He also noted that those who generate data are not always diligent about completing their own work and making the results public.
Charles Brown presented the session’s second paper, “The Value of Longitudinal Data for Public Policy Decisions that Have Been Taken over Time,” which assessed the effects of research that uses longitudinal data on public policy. Assessing the effects on policy is more difficult than assessing the effects on academic research: legislators (and other decision makers) rarely cite academic papers and, when they do refer to academic work, it is fair to question whether the research changed the vote (or program decision) or whether the vote (or decision) was based on other considerations, which simply prompted reference to supporting research. However, Brown did attempt to identify policy-related findings based on longitudinal data and to ask whether policy appears to have responded to such findings.
Longitudinal data can make two contributions to research. First, they allow more accurate reporting of transitions between states, durations in a particular state, and changes in variables of interest than is typically possible from a single cross-sectional data collection. Second, longitudinal data allow a researcher to control for otherwise-omitted variation in outcomes among individuals, as long as this variation is constant for given individuals. Both of these contributions are evident in the examples discussed in Brown’s paper, drawn from five policy areas: welfare reform, job training, unemployment insurance, preschool programs, and retirement.
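Brown’s second point can be illustrated with a small simulated sketch (the variables and effect sizes are hypothetical, not drawn from his paper): a time-constant unobserved trait biases a pooled cross-sectional estimate, but subtracting each person’s mean over time removes anything constant within a person.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical panel: n people each observed for t periods.
n, t = 500, 4
ability = rng.normal(0, 1, n)            # unobserved, time-constant trait
person = np.repeat(np.arange(n), t)      # person index for each row

# The regressor is correlated with ability, so a pooled regression
# of the outcome on x alone is biased upward. True effect of x is 2.
x = rng.normal(0, 1, n * t) + ability[person]
y = 2.0 * x + 3.0 * ability[person] + rng.normal(0, 1, n * t)

# Naive pooled (cross-sectional-style) slope estimate.
naive = np.cov(x, y)[0, 1] / np.var(x)

# Within transformation: subtract each person's own mean, which
# wipes out anything constant within a person, including ability.
def demean(v):
    means = np.bincount(person, weights=v) / t
    return v - means[person]

xd, yd = demean(x), demean(y)
within = np.cov(xd, yd)[0, 1] / np.var(xd)

print(f"true effect 2.0, pooled estimate {naive:.2f}, within estimate {within:.2f}")
```

The pooled estimate absorbs the omitted trait, while the within (fixed-effects) estimate recovers the true effect, which is exactly the advantage of repeated observations on the same individuals.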
Though longitudinal data have played an important role in these policy areas, the contribution of research is constrained by a fundamental and inherent tension—policy research often demands prompt “answers,” and longitudinal data take time to be collected. Brown made several suggestions that he said would be particularly helpful for strengthening the contributions of longitudinal microdata to policy analysis:
Persistence in studying long-run effects. Often, because of funding issues, data are not collected over a long enough period to fully exploit the opportunities to study long-run effects.
Mining regulatory data. Academic researchers can make important contributions to policy debates about regulatory activities if more data can be made available.
Matching. Data linking opens up many research opportunities. As an example, creation of data about firms matched to workers’ records would be extremely helpful to a range of research questions about business dynamics and the economy.
Dan Newlon, the formal discussant for the session, agreed with John Bailar that researchers should, on publishing their results, make data available at data archives, so that researchers who want to verify the results or extend them can do so. Newlon pointed out that the NSF funded a study by Bill Dewald, then editor of the Journal of Money, Credit, and Banking, of the replicability of research results published in the journal. The disturbing surprise was that a third of the authors were unable or unwilling to provide data to support their published results. Another third of authors provided data, but without adequate documentation, so that it was impossible to replicate the published results.
Newlon disagreed with Bailar’s position that researchers should not be forced to share their data with others and that the value of giving other researchers access to data was outweighed by the possibility of critical scrutiny that would require investigators to divert energy, time, and effort away from their own research. Newlon explained the essence of the current NSF policy on data sharing: there is a grace period, but once a researcher’s grant is finished and the researcher has started publishing results based on those data, then the data are expected to be in the public domain so others can use the data and extend and check the validity of the results.
During the open discussion, Richard Suzman provided another example of an area in which replication, in the form of meta-analysis, has been done and is still needed: research on levels and trends of disability in the older population. Many studies have been done, and they yield very different results. There have been concerns that some survey results could not be replicated; the issue is not just one of making the data available, but also of making the documentation clear.
Keith Rust pointed out that the Journal of Applied Econometrics has an online data archive, and the journal just introduced a replication policy as well, which encourages submission of articles replicating results. Suzman supported the idea of withholding some fraction of a grant award that involves data collection (to ensure funds to make the data available), although there are a few data sets that, if they had to be shared, would never be collected in the first place. He also recommended reprinting both “Sharing Research Data” (National Research Council, 1985) and “Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics” (National Research Council, 1993) and using them as a required component in training grants.
Richard Rockwell commented on the costs to data producers of archiving data. In his experience, almost all of these very real costs revolve around the documentation, not the data. For replication, and for secondary analysis of all sorts, researchers and data producers need funding to enable them to archive their data in a usable form.
Trivellore Raghunathan added the point that the current situation with data sharing is much better in social science than in medical science, where the prevailing attitude seems to be that “it is my data, and I have a 25-year plan for the data analysis, and only after the data-analysis plan is exhausted can I think about sharing the data.” He said he finds it disturbing that policy decisions can be made on the basis of some data analyses, but researchers and others cannot verify and replicate the findings that underpin those decisions.
SESSION VI: ASSESSING RESEARCH AND POLICY NEEDS AND CONFIDENTIALITY CONCERNS—THE ECONOMICS OF DATA ACCESS
The final session of the workshop was designed to facilitate discussion of the tradeoff between societal benefits of data dissemination and confidentiality concerns. It began with the presentation of a paper by Ramon Barquin (coauthored by Clayton Northouse), “Data Collection and Analysis: Balancing Individual Rights and Societal Benefits.” Barquin focused on government data collections, which provide the basis for analyzing factors involved in such issues as poverty, health care, education, traffic, public safety, and the environment. He described five benefits of data dissemination:
The wide dissemination of government statistical data informs policy research.
The findings that emerge from the analysis of statistical data undergo reexamination and reinvigoration when disseminated to the research community; this process improves data quality by exposing errors.
Data that are used for one purpose can be put to other uses without substantial investments in new data collection. In addition, data can be combined and result in much more powerful tools for examining the problems facing society.
When research techniques are shared along with the data, the research community and other data centers improve and hone their own techniques.
Sharing data with the research community actively involves researchers in the problems confronting the nation and policy makers.
Barquin outlined a framework for how government agencies can attempt to balance the privacy concerns of the individual with the societal good generated from the use of data, offering three guidelines. First, establishing contractual relations with nongovernmental researchers offers a wealth of opportunity without causing undue risks to privacy and confidentiality. The process of applying for and receiving unfettered access to limited sets of government statistical data should require researchers to fully justify their projects and to demonstrate why it is necessary to have access to all the data, rather than to a restricted set with identifiers blurred or stripped. The contract should also rigorously uphold the principles of informational privacy, namely, security, accountability, and consent. Second, the Census Bureau’s efforts to establish research data centers across the United States offer a fruitful opportunity to share Census Bureau data and provide a good model for the sharing of other types of government statistical data. Third, in balancing the public good and individual rights, data collection institutions must effectively manage the three components of this balance: they must supply the technology, provide the correct policy, and cultivate an ethical environment of good will and trust.
Julia Lane presented a paper (coauthored with John Abowd), “The Economics of Data Confidentiality,” in which she focused on cost-benefit analysis of data dissemination and confidentiality. In considering how statistical agencies might pursue optimal policies, Lane stated that this goal relies on accurate assessment of the benefits derived from the use of such data, the risks of access (and other costs), and the tradeoff between the two.
Lane noted the substantial social benefits associated with releasing microdata (benefits that are not always realized by the statistical institutes themselves), citing examples similar to those noted by other presenters:
it permits analysis of complex questions;
it allows researchers to calculate marginal, not just average, effects;
it creates scientific safeguards when it ensures that other scientists can replicate research findings;
it promotes improvements to data quality: although statistical institutes expend enormous resources to ensure that they produce the best feasible product, there is no substitute for actual research use of microdata to identify data anomalies;
it promotes development of a core constituency: the funding of a statistical agency depends on developing a constituency and broadening the use of its data, including the creation of new products from existing data, which fosters a constituency beyond those who have direct access to the data.
There are three costs of microdata use that must be weighed against the benefits of providing access. First is the cost of providing access itself, which clearly depends on the modality; statistical institutes around the world have developed several, including public-use microdata, licensing, remote access systems, and research data centers. The second cost is reputational. Most agencies expend enormous effort to ensure that published statistics bearing their imprimatur are of high quality and take precautions to protect the confidentiality of the data; they would also have to expend sufficient funds to provide access securely. The third cost is that of the potential disclosure of respondent identities. The ultimate cost to an agency is for an external researcher to disclose the identity of a business or individual respondent. Although the penalties for such a breach are typically substantial (up to 10 years in jail and a $250,000 fine in the United States), the consequences could be devastating to respondent trust and response rates.
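The benefit-cost calculus that Lane described can be sketched in a simple, purely illustrative formalization (this notation is not drawn from the Abowd-Lane paper itself): an agency chooses a level of access to maximize expected net social benefit.

```latex
% Illustrative (hypothetical) formalization of the access tradeoff:
% the agency chooses an access level a to maximize expected net benefit.
\[
  \max_{a}\; W(a) \;=\; B(a) \;-\; C(a) \;-\; p(a)\,D ,
\]
% where
%   B(a): social benefit of research access at level a
%   C(a): cost of providing access (depends on the modality chosen)
%   p(a): probability of a disclosure breach at access level a
%   D:    damage from a breach (penalties, lost reputation and trust)
% At an interior optimum,  B'(a) = C'(a) + p'(a)\,D .
```

In words: under these assumed definitions, access should be expanded until its marginal research benefit just equals the marginal cost of provision plus the marginal expected disclosure damage, which is one way to read the tension between benefits and the three costs described above.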
Lane said it is clear that statistical agencies will increasingly be challenged to provide more access to microdata. This pressure provides a chance to fulfill a critical societal mission. However, since increased access does not come without increased costs, it would seem reasonable to try to control those costs by combining research efforts. Areas in which joint research and development might provide substantial dividends include:
the creation of inference-valid synthetic data sets;
the protection of microdata that are integrated across several dimensions (such as workers, firms, and geography);
the quantification of the risk-quality tradeoff in confidentiality protection approaches; and
the effect on response rates of increased microdata access.
Michael Hurd served as the discussant for the Barquin and Lane presentations. Commenting on Barquin’s paper, he expressed the view that, although the issues of ethics and trust are important, self-interest is a powerful mechanism limiting the risk of data disclosure. For an academic researcher, being involved in data disclosure through improper use of data would seriously impair that person’s career.
Hurd noted that one omission from both papers is the role of training and education. Researchers ought to be trained in ethics, but they also ought to be continually trained in proper procedures, so that they are aware that this is a serious issue.
On the Abowd and Lane paper Hurd remarked that, at the individual level, researchers want more data, and the reason they want more data is that they get great benefits from more data but bear very little of the cost if something goes wrong. In contrast, agencies are more or less in the opposite position. This creates the tension between the two groups. In economists’ terms, those interests need to be internalized in a utility framework in which, as a society, people can make a more informed decision that benefits society rather than the individual actors, namely, the researchers and the agencies. John Rust suggested broadening the two dimensions and including politics, which is another form of self-interest; it provides incentives that determine not just what data are released, but what data are collected. This political dimension can also drive the enterprise to inefficient solutions.
Abowd explained that he and Lane are trying to distinguish the different analytic frameworks that economists and statisticians bring to this problem, not because the frameworks are in conflict, but because they address two different parts of the problem. Statisticians, he said, have helped enormously in quantifying the tradeoffs in data production between risk measures and information-loss measures for the basic data. The economic tradeoffs, however, involve other considerations, such as the benefits to society of research and the costs to individuals of disclosure.
Newlon noted that one needs to distinguish between academic users and other users. In the case of the academic user, there are reputational effects. In the Nordic countries, an academic researcher does not have to have a sworn status to access the data that government employees have because the academic user is viewed as part of the same research and policy advisory community. That is the kind of long-term goal he would like to see emerge.
Suzman raised the sociological issue of what sorts of data people don’t want to be released about themselves, and why. There seem to be huge variations in what can and cannot be asked among different people or groups. The ethos of what people consider to be confidential is simply not understood. For example, in some states property values and property taxes are publicly available on the Web, yet these appear to be “confidential” data in other states. That is an area that requires more study.
PAPERS AND PARTICIPANTS
Authors and Papers
John M. Abowd and Julia Lane, “The Economics of Data Confidentiality”
John Bailar, “The Role of Data Access in Scientific Replication”
Ramon Barquin and Clayton Northouse, “Data Collection and Analysis: Balancing Individual Rights and Societal Benefits”
Charles Brown, “The Value of Longitudinal Data for Public Policy Decisions that Have Been Taken over Time”
Andrew Hildreth, “The Census Research Data Center Network: Problems, Possibilities and Precedents”
David McMillen, “Privacy, Confidentiality, and Data Sharing”
Henry H. Perritt, Jr., “Efficacy of Different Theories of Enforcement”
Trivellore Raghunathan, “Evaluation of Inferences from Multiple Synthetic Data Sets Created Using a Semiparametric Approach”
Jerome Reiter, “Estimating Probabilities of Identification for Microdata”
Sandra Rowland, “An Examination of Monitored, Remote Microdata Access Systems”
Marilyn Seastrom, Candice Wright, and John Melnicki, “The Role of Licensing and Enforcement Mechanisms in Promoting Access and Protecting Confidentiality”
William Seltzer and Margo Anderson, “Government Statistics and Individual Safety: Revisiting the Historical Record of Disclosure, Harm and Risk”
John M. Abowd (panel member) is Edmund Ezra Day professor of industrial and labor relations at Cornell University, director of the Cornell Institute for Social and Economic Research (CISER), and a distinguished senior research fellow at the U.S. Census Bureau.
Margo Anderson is a professor of history and director of the Urban Studies Program at the University of Wisconsin-Milwaukee.
John Bailar is professor emeritus at the University of Chicago. His research is in the fields of medicine and statistics, and he has written extensively about science conduct and ethics.
Ramon Barquin is president of Barquin International and was previously the first president of the Data Warehouse Institute. His work is directed to developing information system strategies and data warehousing for the public and private sectors.
Charles Brown is a professor of economics and program director at the Survey Research Center at the University of Michigan whose research focuses on a wide range of topics in empirical labor economics.
Joe S. Cecil (panel member) is project director in the Program on Scientific and Technical Evidence of the Division of Research at the Federal Judicial Center. He is responsible for judicial education and training in the area of scientific and technical evidence.
George T. Duncan (panel member) is a professor of statistics in the H. John Heinz III School of Public Policy and Management and the Department of Statistics at Carnegie Mellon University.
Andrew Hildreth is research director at the California Census Research Data Center and a professor in the Department of Economics at the University of California at Berkeley.
V. Joseph Hotz (panel member) is a professor and chair of the Department of Economics at the University of California at Los Angeles.
Michael Hurd (panel member) is senior economist and director for the RAND Center for the Study of Aging.
J. Bradford Jensen is deputy director of the Institute for International Economics, having recently moved there from serving as director of the Center for Economic Studies at the U.S. Census Bureau. At the Census Bureau, he directed the center’s internal and external research programs, managed its Research Data Center network, and supervised its relationships with collaborating universities and research organizations.
Diane Lambert (panel member) is the director of statistics and data mining research at Bell Labs. She has made seminal contributions to fundamental statistics theory and methods and has been a leader in defining a role for statistics in data mining and massive data problems.
Julia Lane is a principal research associate in the Labor and Social Policy Center at the Urban Institute, concentrating in the areas of income and wealth distribution, labor markets, employment, and education.
Michael Larsen is a professor in the Department of Statistics at Iowa State, working in the areas of survey sampling, administrative records and record linkage, missing data problems, finite mixture models and latent class models, small-area estimation, and Bayesian statistical modeling.
David McMillen is the information and government organization specialist with the Committee on Government Reform and Oversight, U.S. House of Representatives, and the government information specialist for Henry A. Waxman (D-CA). He covers issues involving the collection, dissemination, and preservation of government information, and he has worked extensively on legislation relating to confidentiality and data sharing.
John Melnicki is president and CEO of Harbor Lane Associates, Inc., in Washington, D.C. In addition, he is the senior security advisor for restricted data for the U.S. Department of Education, the National Science Foundation, and various research and educational institutions around the world.
Dan Newlon is program director for economic science at the U.S. National Science Foundation, where his job is to select directions for investment in research.
Henry H. Perritt, Jr., is a professor of law and vice provost at the Illinois Institute of Technology and director of the Center for Law and Financial Markets at Chicago-Kent College of Law.
Jerome Reiter is a professor at the Institute of Statistics and Decision Sciences at Duke University. He has worked with the U.S. Census Bureau and recently joined the Digital Government Project of the National Institute of Statistical Sciences.
Trivellore Raghunathan is a research professor at the Institute for Social Research and professor of biostatistics, both at the University of Michigan.
Richard C. Rockwell (panel member) is a professor of sociology at the University of Connecticut.
Sandra Rowland recently retired from the U.S. Census Bureau, where she was the Internet data dissemination system team leader. Among other duties, she managed the advance query interactive web system.
John Rust is a professor of economics at the University of Maryland, with major research interests in numerical dynamic programming and retirement behavior.
Marilyn Seastrom is chief statistician and program director for the Statistical Standards Program at the National Center for Education Statistics, U.S. Department of Education. She has written extensively on data access, licensing, and confidentiality issues.
William Seltzer is a senior research scholar at the Institute for Social Research of the Department of Sociology and Anthropology at Fordham University.
Eleanor Singer (panel chair) is a research professor at the Survey Research Center of the Institute for Social Research at the University of Michigan. Her research focuses on motivation for survey participation and has touched on many of the important issues in survey methodology, such as informed consent, incentives, interviewer effects, and nonresponse bias.
Richard Suzman is associate director for Behavioral and Social Research at the National Institute on Aging (NIA). Suzman developed NIA’s Economics of Aging program, one of the first of its kind to look at socioeconomic factors and health.
Katherine Wallman is chief statistician at the U.S. Office of Management and Budget. She is responsible for overseeing and coordinating federal statistical policies, standards, and programs; developing and fostering long-term improvements in federal statistical activities; and representing the federal government in international organizations.
Candice Wright is an analyst at the U.S. Office of Management and Budget. She recently completed her M.S. in public policy from Carnegie Mellon University and holds a B.S. in management from Bentley College. Her current interests include data privacy and information security.