Improving Access to and Confidentiality of Research Data: Report of a Workshop

4

Alternative Approaches for Limiting Disclosure Risks and Facilitating Data Access

Presentations throughout the workshop confronted aspects of the restricted data, restricted access debate. Data alteration allows for broader dissemination, but may affect researchers' confidence in their modeling output and even the types of models that can be constructed. Restricting access may create inconveniences and limit the pool of researchers that can use the data, but generally permits access to greater data detail. This chapter reviews the presentations and discussions addressing these two approaches, highlighting their advantages and disadvantages. It also summarizes the discussion of potential technical solutions and the use of legal sanctions to modify the behavior of individuals with access to the data.

DATA ALTERATION

There are both technical and statistical solutions for protecting data security. Several workshop participants argued that these solutions need to be blended with the substantive knowledge of researchers to solve disclosure problems in a way that satisfies all interested communities. This section reviews the discussion of statistical approaches, for which a presentation by Arthur Kennickell, titled “Multiple Imputation in the Survey of Consumer Finances,” was the centerpiece. Technical approaches are discussed later in this chapter.

Participants representing the research community expressed frustration with some of the standard perturbation methods employed by the large longitudinal surveys. Finis Welch of Texas A&M University articulated several concerns. He argued that the introduction of noise to variables can, in certain instances, create major headaches for the modeler. For instance, adding noise to a field that will be used as a dependent variable in models may be acceptable as long as the expected value of the disturbance is zero, and hence the unbiasedness of estimates is preserved. When dispersion is added to fields that will be used as explanatory variables, however, the resulting errors tend to be correlated with the observed variable values, which biases coefficient estimates. Welch believes priority should be given to developing perturbation methods that, if invoked, will preserve the key statistical properties of a data set.

He made some specific recommendations as well. First, he advocated top-coding at fixed quantile levels, over time, rather than at absolute values; when the percentage of records top-coded changes, serious problems arise in longitudinal or panel modeling contexts. Welch also would like to see scrambling, as opposed to truncation, of “sensitive” data above the top-coded cutoff points to maintain the full distribution but eliminate knowledge of where an individual record fits into the distribution.

Kennickell's use of a multiple imputation technique offers a more sophisticated form of data perturbation, one that could potentially improve data security (and, hence, allow greater accessibility) without seriously compromising modeling utility. Several workshop participants expressed support for the idea of exploring this type of approach to gain a clearer idea of how models might perform using imputed data and to assess the promise of the technique in terms of data protection. There is now a large multiple imputation apparatus in the Survey of Consumer Finances (SCF); Kennickell has shown it can be done. What remains to be seen is how effective and how useful the technique will be. Kennickell's research is moving toward answering that question.
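Welch's contrast between noise in a dependent variable and noise in an explanatory variable can be checked with a small simulation. The sketch below is illustrative only and is not drawn from the workshop: the function names (`ols_slope`, `compare_noise_placement`), the linear model, the Gaussian noise, and all parameter values are assumptions chosen for the example. It estimates a one-variable least-squares slope three ways: on the clean data, with zero-mean noise added to the dependent variable, and with the same kind of noise added to the explanatory variable.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x for a single regressor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def compare_noise_placement(n=20000, beta=2.0, noise_sd=1.0, seed=0):
    """Compare slopes when zero-mean noise is added to y versus to x.

    All settings here are hypothetical; x has unit variance and the
    true relationship is y = beta * x + error.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta * xi + rng.gauss(0.0, 1.0) for xi in x]

    # Perturb one variable at a time with independent zero-mean noise.
    y_noisy = [yi + rng.gauss(0.0, noise_sd) for yi in y]
    x_noisy = [xi + rng.gauss(0.0, noise_sd) for xi in x]

    return {
        "original": ols_slope(x, y),
        "noise_in_y": ols_slope(x, y_noisy),
        "noise_in_x": ols_slope(x_noisy, y),
    }


if __name__ == "__main__":
    for label, slope in compare_noise_placement().items():
        print(f"{label:>12}: {slope:.3f}")
```

Under these assumptions, the slope estimated with a noisy dependent variable stays close to the true value, while the slope estimated with a noisy explanatory variable shrinks toward zero by roughly the factor var(x) / (var(x) + var(noise)), the standard errors-in-variables attenuation result. With both variances set to one, as here, that factor is one-half.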
The SCF is conducted by the Federal Reserve Board, with survey information collected by the National Opinion Research Center at the University of Chicago. The data include sensitive and detailed information about respondents' assets and liabilities, as well as extensive demographic and geographic information. The survey, which oversamples wealthy households, is derived from statistical records based on tax returns maintained by the Statistics of Income Division of the Internal Revenue Service. To gain access to this information, the Fed agrees to a disclosure review similar to that for the public Statistics of Income files.

Because the SCF is subject to legal constraints on data release, and because it contains sensitive information on a sample that includes a high-wealth population, the survey is a logical candidate for the multiple imputation experiment as a means of disclosure limitation. Because missing data have always been an important problem in the SCF, substantial resources have been devoted to the construction of an imputation framework that can be used to simulate data to replace those originally reported.

For the public-release version of the SCF, the survey applies standard perturbation and data limitation techniques: rounding, collapsing categories, providing geographic detail only at the Fed division levels, truncating negative values, withholding variables, and a variety of more minor changes that cannot be disclosed. In addition, the full final internal version of the data is used to estimate models that are, in turn, used to simulate data for a subsample of observations in the public version of the data set.

Kennickell's multiple imputation system deals with all variables, imputing them one at a time. It is iterative and generates, as the name implies, many imputations. The models, which are different for binary, continuous, and categorical variables, involve essentially relaxing data around reported values. The method requires a full probability specification at the outset; the notion behind the multiple imputation is to then sample from the full posterior distribution. It is this sampling that generates the variability needed for disclosure limitation.

Kennickell described the application of multiple imputation for the SCF as a type of structured blurring: “a set of cases that are identified as unusual plus another set of random cases are selected, and selected variables within those cases are imputed subject to a range constraint (unspecified to the public), but they are constrained to come out somewhere in a broad neighborhood around the original values.” The knowledge of which data values have been intentionally altered is also partially disguised. The cumulative effect of the process is to decrease a user's confidence that any given record represents an actual original participant.
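The selection-and-constraint logic in Kennickell's description of structured blurring can be sketched in a few lines. This is a toy illustration under loud assumptions, not the SCF procedure: the SCF draws replacement values from fitted imputation models and keeps its range constraint undisclosed, whereas the sketch below swaps in a uniform draw within a proportional band around each original value. The function name `structured_blur` and the parameters `unusual_cutoff`, `random_share`, `band`, and `n_implicates` are all hypothetical.

```python
import random


def structured_blur(values, unusual_cutoff, random_share, band,
                    n_implicates, seed=0):
    """Return several blurred copies of `values` and the altered indices.

    Assumed stand-in rules (not the SCF's):
      * a case is "unusual" if its value is at or above `unusual_cutoff`;
      * a `random_share` fraction of the remaining cases is also selected;
      * each selected value is redrawn uniformly within +/- `band`
        (a proportion) of its original value.
    """
    rng = random.Random(seed)

    # Unusual cases are always selected for blurring.
    unusual = {i for i, v in enumerate(values) if v >= unusual_cutoff}

    # A random set of ordinary cases is added, so that being altered
    # does not by itself reveal that a case was unusual.
    ordinary = [i for i in range(len(values)) if i not in unusual]
    n_random = round(random_share * len(ordinary))
    selected = unusual | set(rng.sample(ordinary, n_random))

    # Repeating the draw yields multiple implicates of the same file.
    implicates = []
    for _ in range(n_implicates):
        blurred = list(values)
        for i in sorted(selected):
            low = values[i] * (1 - band)
            high = values[i] * (1 + band)
            blurred[i] = rng.uniform(low, high)
        implicates.append(blurred)
    return implicates, selected


if __name__ == "__main__":
    net_worth = [10.0, 20.0, 30.0, 40.0, 5000.0]  # made-up values
    copies, altered = structured_blur(
        net_worth, unusual_cutoff=1000.0, random_share=0.5,
        band=0.2, n_implicates=3, seed=1)
    for row in copies:
        print([round(v, 1) for v in row])
```

In a released file the set of altered indices would be withheld, which is what leaves a user unsure whether any given value is as originally reported.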
Kennickell's method is computationally intensive, but he described it only “as a modest step in the direction of generating fully simulated data.” He argued that it is possible to simulate data that do a good job of reproducing all simple statistics and the distributional characteristics of the original reported data.

The extent to which imputed data will be able to provide a satisfactory basis for the leading-edge research required for fully informed policy is not yet clear. It is not known how imputation affects error structures of complicated models; what sampling error means in a fully simulated data set; what happens to complex relationships among variables; and, more generally, how researchers will interpret modeling results. One way to begin addressing these questions is to create synthetic versions of existing data sets with known nonlinear relationships or complex interactions and see whether those relationships can still be detected in the simulated data. Many of the workshop participants agreed that these performance questions require serious attention and that the answers will ultimately determine the success of imputation methods. Quantitative assessments of the extent to which disclosure risks can be reduced using these methods are also needed.

At this point, social science researchers are skeptical about the accuracy of analyses not based on “original” data. Richard Suzman of the National Institute on Aging (NIA) added that all leading researchers currently supported by NIA are opposed to the imposition of synthetic data.1 Finis Welch and Suzman each noted that the value of synthetic data sets in longitudinal research is unproven; with the exception of the 1983–1989 panel, the SCF is cross-sectional. While complex longitudinal data increase disclosure risks, it is also more difficult to preserve key relationships and interactions among variables when this type of data is altered. Perturbation, therefore, may be more damaging to analyses that rely on longitudinal data than to those that rely on cross-sectional data.

These criticisms notwithstanding, Stephen Fienberg of Carnegie Mellon University considers Kennickell's work a major success in the use of modern statistical methods and disclosure limitation research. Fienberg made the point that all data sets are approximations of the real data for a group of individuals. Rarely is a sample exactly representative of the group about which researchers are attempting to draw statistical inferences; rather, it represents those for whom information is available. Even a population data set is not perfect, given coding and keying errors, imputed missing data, and the like. Fienberg finds the argument that a perturbed data set is not useful for intricate analysis not altogether compelling. Yet researchers are more critical of controlled modifications to the data and the introduction of structured statistical noise than of sampling noise.

Thus two clear perspectives emerged among workshop participants. On one side are those who believe that, in addition to its role in statistical disclosure limitation, replacing real samples with records created from posterior distributions offers great potential in terms of maintaining fidelity to the original data goal (as opposed to the original data).
On the other side are researchers who are concerned that synthetic data do not fit the model used by top researchers as they actually work with the data. Their position is that, in addition to delaying data release, imputation programs blur data in ways that create inaccuracies, such as those described earlier. Suzman expressed the need for the National Institutes of Health and others to advance empirical research that would address these issues. As these methods are advanced, it may become possible to provide researchers with clearer explanations of how imputation affects the data. Moreover, data management programs may be developed that offer the option of choosing between using an altered public data set and submitting to additional safeguards to gain access to raw data.

1 Suzman did acknowledge a role for synthetic data in creating test data sets on which researchers could perform initial runs, thus reducing time spent in data enclaves.

RESTRICTED ACCESS

Presentations during Session I illustrated the benefits that can be derived from studies using complex research data files, that is, microdata files with longitudinal data, contextual information for small areas, or linked administrative data. Each research example cited was carried out under restricted data access arrangements. None could have been undertaken solely using microdata files that are available to the general public with no restrictions. Many of the analyses required the use of files that link survey data with administrative record information for the same persons.

A survey of organizations that released complex research data files as public-use files was described during Session III by Alice Robbin, Indiana University (for further details about this survey, see Chapter 5). Files with linked administrative data were seldom released in that format, primarily because of concerns that users with independent access to the administrative source files might be able to re-identify persons whose records were included in the research file. Restricted-access files were distinguished from public-use files by the inclusion of more detailed information on the geographic location of sample persons and contextual data, such as median income classes and poverty rates, for the communities where they lived.

While usually applied in a way that preserves basic statistics, masking procedures used to reduce disclosure risks associated with public-use files may introduce substantial biases when more complex methods of analysis are applied to the data. Therefore, arrangements for providing special or restricted access are often used to satisfy the needs of users for whom the available public-use data files are insufficient.
Several such arrangements have been developed in recent years; primary among these are (1) use of the data by users at their own work sites, subject to various restrictions and conditions (commonly referred to as licensing); (2) controlled access at sites (often called research data centers) established for the purpose by custodians of the data; and (3) controlled remote access, in which users submit their analytical programs electronically to the custodian, who runs them and reviews the outputs for disclosure risk prior to transmission to the users.2 The next chapter reviews workshop presentations that described current and planned restricted-access arrangements at various agencies.

2 There are other possible arrangements, such as the release of encrypted microdata on CD-ROMs with built-in analytical software, but these methods are not widely used at present and were not discussed at the workshop.

ADVANTAGES AND DISADVANTAGES OF DATA ALTERATION VERSUS RESTRICTED ACCESS

The most frequently cited advantage of data alteration, as opposed to restricted access, is that it facilitates broader public release and simpler user acquisition. Stephen Fienberg articulated the advantage of data perturbation concisely: “These methods (particularly if they can be developed to produce more acceptable statistical properties than people are willing to admit) address a very compelling public need, which is the sharing of data collected at public expense that are a public good and would otherwise not be broadly accessed.”3 Proponents argue that sophisticated perturbation methods offer one of the few tools that may help meet simultaneously the need of researchers to access better data and the need to protect respondents who supply information.

The primary disadvantage of data alteration generally and advanced perturbation specifically is researchers' decreased confidence in modeling output; use of such data is believed by some to limit modeling flexibility as well. Researchers at the workshop expressed concern that data alteration inhibits complex modeling, particularly when the relationships that have the greatest policy relevance are nonlinear or when causal modeling requires correct ordering of temporal events. For example, it may be difficult to accurately model real-world retirement behavior, which is thought to be driven by jumps in eligibility and benefit rules faced by workers, if blurring techniques are used to smooth such spikes in the data. Moreover, even if key statistical properties are preserved, researchers must be convinced that this is the case before they will use the data; they must also learn how to interpret and report the results of models estimated from altered data. These are real costs associated with data perturbation.

The challenge for proponents of data imputation approaches is to determine how accurately relationships among data fields can be preserved and to communicate their findings to researchers. The extent to which this challenge can be met is, as of now, uncertain. Robert Boruch articulated a strategy for addressing this need.
He suggested that it is important to monitor the performance of increasingly complex models, when data used in estimation procedures are altered in various ways, by building a knowledge base of calibration experiments.

3 Ivan Fellegi and others believe that if data linkage continues to increase, it may not be possible to safely offer public release files at all. While this view may be extreme, it does point to the tradeoff: given more linking possibilities and richer native data, more restriction is required to hold disclosure risk constant.

The advantage of restricted access is that those granted permission have fuller access to primary data. On the other hand, costs are incurred in enforcing access rules and in operating data enclaves and remote programs. Restricted access arrangements also impose an operational burden on researchers. These operational costs can be significant; for instance, Kennickell reported that the Fed does not have the research budget to establish data centers or even licensing agreements. While the Fed's multiple imputation program is a major undertaking, it is a less costly method of providing broad access to researchers. Kennickell also believes that, among users of the SCF, unrestricted access to the data is a higher priority than is access to unrestricted data.

While researchers at the workshop did express a general preference for limited access to full data, as opposed to public access to limited data, they also noted that the on-site requirement can be burdensome. Thus, they voiced enthusiasm for the idea of developing flexible remote access systems. Researchers also want assurances that restricted access data sets at centers would not replace the types of data now publicly available. Most of the researchers indicated a preference for the licensing option, which is viewed as least burdensome (since they plan to follow access rules).

Agency representatives noted that the sanctions in existing laws could be strengthened (see the next section). However, it is impossible to ensure that all users are familiar with the laws, and federal agencies are ultimately responsible for data safety. Licensing is effective because it transfers a portion of that responsibility to users, allowing agencies greater latitude in dissemination (see also the discussion of licensing in Chapter 5).

Ultimately, if different types of users can be identified reliably, appropriate levels of access can be established for each. Researchers are probably willing, for selected studies, to go through the required steps to use less-altered data under restricted access arrangements. In some cases, the existence of legal penalties for misuse will provide a sufficient deterrent, and access to full raw data may be allowed. Participants voiced the view that a one-size-fits-all approach to data access is unsatisfactory, since it would likely produce data of insufficient detail for cutting-edge research while perhaps unnecessarily disclosing information not needed for more general research.
Similarly, marketers or the general public who want fast Web access likely cannot be granted the same access to the highest-quality data as those who undergo security precautions.

Participants were also optimistic about the ability to use technology to obtain a proper balance between confidentiality and accessibility. Latanya Sweeney and others described evolving approaches that may eventually advance confidentiality protection within both data alteration and restricted access frameworks. For example, there may be ways to improve remote access using rapidly evolving net-based foundations that would allow researchers to run interactive programs externally (see also the discussion of remote access in Chapter 5). More sophisticated linking may also be possible, particularly if methods can be developed to monitor the combinations of variables regularly used by researchers. Once a clear research need to link certain variables or data sources has been established, it may be safer to link at the source instead of having copies of both data sets go out in their entirety each time a researcher needs to make the link. Similar approaches may enhance the ability to establish joint data centers or centers with multiple sources of data.

ROLE OF LEGAL SANCTIONS

Stricter legislative protections offer another potentially efficient means of improving confidentiality, efficient because the probability of disclosure can be decreased without imposing costs on rule-abiding researchers. Indeed, several participants, including Richard Suzman, suggested that perhaps this method of data protection should be given the highest priority. These participants cited a recommendation from the report Private Lives and Public Policies that there should be “legal sanctions for all users, both external and agency employees, who violate requirements to maintain the confidentiality of data” (National Research Council and Social Science Research Council, 1993:7) and added that existing penalties should be stiffened.4

J. Michael Dean pointed out that the potential for unintended disclosure exists at the data source as well as at the user stage. A primary reason that agencies are able to maintain confidentiality is their ability to impose a high cost penalty for misbehavior (violators can lose their jobs). Dean argued that this regulation should be expanded across agencies and institutional lines, thereby creating more linking opportunities; in other words, the regulatory approach to native databases could be extended to linked data. Again, such approaches are generally applauded by researchers, who prefer regulation of the people using databases over alteration of the databases themselves.

Presenters from the HRS noted that, in part because of a lack of confidence in the adequacy of sanctions, the funding agency (NIA) demands that the University of Michigan provide linked data only to individuals working under federal grants so that disregard for the confidentiality guidelines will be subject to federal rules. The situation is different for agencies that have a licensing mechanism tied to the data, which allows for more options.
Other agencies must operate purely within the realm of the Privacy Act. Many participants believe that increased harmonization of the legal framework is needed, if for no other reason than to allow researchers to know roughly what is expected without that expectation shifting from context to context.

4 Donna Eden pointed out that there is considerable room for increasing penalties. For instance, in HIPAA there exists no private right of action for subjects. The act sets forth a broad criminal prohibition, but only minor criminal sanctions for disclosure of data in violation of regulations. A $100 civil penalty is unlikely to be effective against corporate or even most individual abuses. Also, it should be noted that the size of a penalty is of limited importance if it is rarely imposed.