Reconciling the Benefits and Risks of Expanded Data Access
The charge to this panel—and the challenge to those who collect data from individuals and organizations and those who use them—is to understand and weigh the tradeoffs between the benefits and risks of increased access to research data. The benefits of increased access are better data for policy analysis and research; the risks are breaches of confidentiality and their consequences. As noted in Chapter 4, breaches of confidentiality can occur in a variety of ways. The work of this panel has focused primarily on statistical disclosure—the re-identification of individual respondents (or their attributes) through the matching of survey data with information available outside the survey.
Achieving the benefits of access to research data presupposes a willingness by people in households and organizations to provide detailed and sometimes sensitive information for government-sponsored statistical surveys and censuses. Such willingness requires public trust: trust that the data will be used for important research and policy purposes and that the confidentiality of the information will be maintained. Thus, the agencies that collect data have an obligation to communicate clearly to respondents the purposes of the data they collect and to assure respondents of the confidentiality of the information they provide.
In this final chapter of the report, we draw on technological, legal, administrative, and statistical sources to offer recommendations that we believe will facilitate access to research data while protecting the confidentiality of information provided by the public. We offer recommendations in eight broad areas: documenting the use of research data; planning
for access to data through a variety of modes; expanding access to public-use files; facilitating access to research data centers; expanding remote access capabilities; broadening the use of licensing and bonding agreements; assuring informed consent; and safeguarding confidentiality through training, monitoring, and education in research ethics.
As discussed in Chapter 3, data collected by government agencies benefit society by providing the basis for research and policy analysis, which, in turn, can inform policy makers and the public. Longitudinal surveys that obtain data for analyzing the determinants and consequences of social and economic behaviors have been a major positive development for research and policy in the past 30 years. Linking survey and administrative data can create particularly rich datasets that, in some cases, can substitute for additional surveys, thus reducing respondent burden as well as government costs. To realize the full potential of these data, researchers outside of government need access to them.
The United States has a decentralized, pluralistic research structure, in which not only the staffs of statistical and other data collection agencies, but also researchers in many institutions, with different policy preferences and perspectives, can and should have access to data. This broad access by independent researchers and analysts provides checks and balances on the government’s dominant role in public policy. The fruits of the diverse, decentralized social science research enterprise include not only studies that contribute to longer-term understanding of the dynamics of individual, household, and organizational (e.g., business) behavior, but also analyses that contribute directly to public discourse and the development and evaluation of public policies.
The rapid expansion of information and communication technologies in the past decade has enhanced the potential for society to benefit even more from the collection and provision of data for social science research. With greater computing power and better software, the value of data, especially complex longitudinal microdata, has increased. With such data, investigators can estimate and test models that are closer approximations to reality and that directly address problems of causal inference. Although the new technologies make it more difficult to protect the confidentiality of respondents, they also enhance the possibilities for disseminating research data more widely. Failure to use such technologies inhibits the ability of society to exploit fully the rich data collected by federal agencies or others on their behalf.
In part because of the public goods aspect of data collection and research and in part because of the decentralized structure of both data collection and data analysis activities, it is difficult for data collection agencies, research organizations, or society to assess the value of the data produced. Although the benefits of data access are compelling (see Chapter 3), no one has developed a generally accepted approach for quantifying the extent and value of that access or placing a quantifiable value on the uses of the data.
Careful tracking of the numbers and types of users and the body of research produced would provide a sense of the importance of various data products. More broadly, we conclude that a more comprehensive record of the research use of data would be valuable to agencies, policy makers, and researchers. Registries and documentation of data use could help foster understanding by both the public and policy makers of the value of various kinds of data, including microdata, and the research that these data inform.
In addition to a record of the use of research data, it is important to know how many requests for data are received and how many of those are denied, as well as the time required to gain access when the request is granted. The panel that produced Private Lives and Public Policies explicitly recommended that procedures be established for keeping records of data requests denied or partially fulfilled (National Research Council, 1993:100). This panel endorses that recommendation, which has not yet been implemented. Records of such requests for confidential data may also be useful to agencies in monitoring confidentiality protection procedures and actual breaches of confidentiality (see “Research on Breaches of Confidentiality,” in this chapter).
A first step in documenting use would be to assemble bibliographies of research papers for particular data sets. The bibliographies maintained for some existing data sets offer models of the kind of documentation we envision. They include the National Longitudinal Surveys of Labor Market Experience (NLS, housed at the Center for Human Resource Research at Ohio State University), the Panel Study of Income Dynamics (PSID, housed at the University of Michigan), the Health and Retirement Study (HRS, housed at the University of Michigan), and the General Social Survey (GSS, housed at the University of Chicago). The U.S. Census Bureau has a bibliography of nearly 2,000 references to published and unpublished work using data from the Survey of Income and Program Participation (SIPP) and the predecessor Income Survey Development Program, but the bibliography has not been updated since 1998 (www.sipp.census.gov/sipp/aboutbib.html). In addition to research publications, such as articles and books, we encourage bibliographies of research presentations at scholarly meetings and of research analyses presented in the form of software applications.
Federal statistical agencies could use such bibliographies, not only to
assess the use of their data sets, but also as a sampling frame for contacting researchers to obtain feedback on the quality and usefulness of the data. Similarly, research funding agencies could usefully commission analyses that build on research bibliographies to assess the extent, quality, and importance of social science research conducted with statistical data, particularly microdata.
Recommendation 1 As a first step to facilitate systematic study of the extent and value of data access for research, public and private agencies that collect social science research data should maintain up-to-date bibliographies of research and policy analysis publications, presentations, and software applications that use the data.
ACCESS THROUGH MULTIPLE MODES
The actual data collected for statistical purposes from households, individuals, business establishments, and other organizations through censuses and surveys under a pledge of confidentiality are never made available to users. Instead, data are made available either in the form of confidential, restricted-access data files or in the form of anonymized data products, including published tables and microdata files.
Confidential files omit direct identifiers such as names and addresses but retain the observational structure of the original data and include all of the value added by an agency to generate its published statistics (such as analysis weights, imputation for unit and item nonresponse, data quality edits, geocoding, industry coding, and occupation coding). They also contain details (such as place of residence, occupation, industry, income, and wealth) that cannot be made available on public-use files. In some cases an agency may also create links to administrative records or other data files.
Public-use microdata files, constructed from the confidential files, contain data that have been masked through various steps (rounded, aggregated, edited) or that have been altered through such techniques as multiple imputation to ensure that individual respondents and their attributes cannot be identified. Public-use files are the most accessible and widely used microdata products made available by statistical agencies, but their value for much policy-relevant research is limited. To exploit the full research and policy value of microdata, researchers will often need access to the confidential files. Modes of access to such data (restricted access modes) include access at supervised locations, remote access with prior review of data output, and access through licensing and bonding agreements (discussed below).
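The masking steps mentioned above can be illustrated with a minimal sketch. The field names, thresholds, and band widths below are hypothetical choices for illustration only, not any agency's actual procedure:

```python
# Illustrative sketch of three common disclosure limitation steps applied
# before public release: top coding a sensitive amount, coarsening a
# quasi-identifier, and rounding. All values here are assumptions.

def mask_record(record, income_cap=150_000, age_band=5, round_to=100):
    """Return a masked copy of one respondent record."""
    masked = dict(record)
    # Top code: extreme incomes are reported only as the cap value.
    masked["income"] = min(record["income"], income_cap)
    # Coarsen: replace exact age with the bottom of a 5-year band.
    masked["age"] = (record["age"] // age_band) * age_band
    # Round: report dollar amounts to the nearest 100.
    masked["property_tax"] = round(record["property_tax"] / round_to) * round_to
    return masked

record = {"income": 612_450, "age": 57, "property_tax": 4_387}
print(mask_record(record))
# {'income': 150000, 'age': 55, 'property_tax': 4400}
```

Each step trades detail for protection: the masked record supports many analyses of income and age, but a snooper who knows a neighbor's exact income or age can no longer single out that record by matching on it.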
Because public-use files are available to all, statistical agencies must
exercise great care to ensure their anonymity, as well as their usefulness and accuracy. The ultimate goal is to provide public-use data that will yield the same statistical inferences that would be derived from the confidential data. But since many of the analyses that will be performed on public-use files cannot be foreseen at the time of their release, the process of assuring their quality requires a continual feedback relationship between the public-use files and the underlying confidential data. Analyses performed on the confidential data can be used to improve the next generation of public-use data in several ways. For example, such analyses can identify errors in the original data, the public-use data, or both, as well as anomalies that should be corrected in statistical procedures (such as imputation). They can also identify ways in which the public-use data could be made more useful for research by altering the procedures for confidentiality protection for some items. Such analyses can also suggest priority areas for improvement and enhancement of data content in subsequent data collections.
This feedback relationship serves two related purposes. First, it improves the quality and relevance of public-use files, an outcome of great importance to the statistical agencies because they must attest to the usefulness of these products as a source of statistical information to a broad audience. Second, it justifies a substantial investment in facilitating access to the underlying confidential microdata because such an investment supports an agency’s core mission of assuring the quality and usefulness of its public-use products. Both kinds of access—restricted access to the confidential data and unrestricted access to inference-valid public-use data—are needed, not only to accommodate different types of users with different purposes, but also to maintain and improve the quality of public-use files and to obtain accurate estimates of the error entailed by their use.
Economists and statisticians have begun to model the optimum mix of different data access modes by examining the costs and benefits of providing research access through public-use data products or through restricted access modes (Abowd and Lane, 2004). Such modeling is in its infancy but could be valuable at several levels. For individual variables, cost-benefit modeling might identify specific items that could be moved from restricted access to public-use data without impairing confidentiality protection and, conversely, items that should be moved from public-use products to restricted access modes.
At a broader level, cost-benefit modeling could be used to evaluate the tradeoffs among the various forms of restricted access—research data centers, remote access, licensing—as well as among different ways of restricting data (through various masking techniques and various ways of producing synthetic data). Cost-benefit modeling entails the use of a large number of assumptions, including the expected number, variety, and
value of uses in each mode; the disclosure risk and associated costs for each mode; the user costs for access through a research data center compared with a public-use file or a licensing arrangement; and the producer costs for preparing the public-use product compared with running a research data center and approving and overseeing licensees. For realistic estimation, cost-benefit modeling will require empirical estimates of such factors as disclosure risks and costs, numbers and benefits of research uses, user costs of access for various modes, and producer costs of providing various modes of access. Such estimates do not currently exist, and some of them, including disclosure risks and costs and the benefits of research use, are not easy to develop, although promising work is under way on estimating disclosure risk (see Reiter, 2003; see also “Public-Use Data” in this chapter).
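The structure of such a cost-benefit calculation can be sketched in a few lines. Every number below is a placeholder assumption, since, as noted, the empirical estimates do not yet exist; the point is only how the terms combine, not the resulting ranking:

```python
# Toy comparison of access modes in the spirit of the cost-benefit
# modeling described above. All figures are hypothetical placeholders.
# Structure: net benefit = use value - producer cost - user cost
#                          - disclosure probability * breach cost.

modes = {
    #             use     producer  user   disclosure  breach
    #             value   cost      cost   probability cost
    "public_use": (900,    120,      10,    0.010,      5000),
    "rdc":        (400,    300,     150,    0.001,      5000),
    "license":    (600,    180,      60,    0.005,      5000),
}

def net_benefit(value, producer_cost, user_cost, p_disclosure, breach_cost):
    """Expected net benefit of one access mode under the toy model."""
    return value - producer_cost - user_cost - p_disclosure * breach_cost

for mode, params in modes.items():
    print(f"{mode}: {net_benefit(*params):.1f}")
```

With real estimates in place of the placeholders, the same structure could be evaluated variable by variable to decide which items belong in public-use files and which require restricted access.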
Given the importance of facilitating research access to statistical microdata, statistical and research agencies should encourage research on the most efficient allocation of resources among different access modes to guide their planning and to support changes (including legislative changes, if needed) to facilitate one or other type of access. If a variety of agencies, as well as users, are involved in such research and planning efforts to develop data access programs, the data that agencies produce are more likely to be widely used, ultimately leading to better research and policy analysis and to important feedback to data producers that can enable them to enhance data quality and relevance.
The kind of user and producer involvement we envision is exemplified by the major longitudinal surveys that are funded through grants to academic survey organizations (including the HRS and the PSID), which have active boards of users and potential users that guide their development. Although most federal statistical agencies also have outside advisory groups, those groups rarely focus on data access programs for particular data sets. One example of focused user involvement for a statistical agency microdata collection, however, is the Association of Public Data Users Working Group on SIPP Data Products, which was active from 1989 to 1994 and played a major role in the development of more user-friendly data products and comprehensive user documentation from the Survey of Income and Program Participation.
Recommendation 2 Data produced or funded by government agencies should continue to be made available for research through a variety of modes, including various modes of restricted access to confidential data and unrestricted access to public-use data altered in a variety of ways to maintain confidentiality.
Recommendation 3 The National Science Foundation, the National Institutes of Health, and major statistical agencies should support research to guide more efficient allocation of resources among different data access modes.
Recommendation 4 Statistical and other data collection agencies should involve users more fully in planning modes of access to their data.
Public-use files, introduced in the 1960s, are the most widely available form of research data. As described above, information generated from investigating confidential data in a restricted access environment can be used to improve their quality and relevance. Improved public-use data files, in turn, can reduce the need and demand for restricted access for some data sets. If, for example, it is possible to move certain variables, such as summary measures or average values, from restricted-access to public-use files while maintaining confidentiality protections, overall access to research data could be increased at reduced cost to users and producers. Doing so requires further sustained research and development of methods that permit data collection agencies to assess the increase in disclosure risk posed by the addition of specific variables to existing public-use files.
Data from the HRS provide an example of the kind of assessment needed. The primary insurance amount (PIA), which is calculated from the detailed Social Security earnings records of HRS respondents, is a very desirable variable for many researchers, most of whom do not need the detailed Social Security administrative records. PIA information would seem to present little additional disclosure risk because many patterns of lifetime earnings will produce the same PIA. Thus, one cannot recover any particular detailed earnings history from a single PIA number. But because HRS policy requires that any variable derived from Social Security records must be a confidential variable, all researchers must apply for access to the confidential files (a tedious and costly process) in order to gain access to the PIA data. Research that demonstrated the absence of increased disclosure risk might make it possible to alter the HRS policy so that PIA data could be made available as part of the public-use file. Such a step would be likely to substantially reduce the demand for detailed earnings data. Moreover, and paradoxically, by increasing access to the public-use file, the overall risk of disclosure might be reduced because fewer people would have access to the confidential data.
Other improvements to public-use files also require research. Currently, these files rely on a variety of disclosure limitation methods, such as rounding, coarsening, data swapping, and top coding (see Duncan, 2002). Research is needed to examine and quantify the tradeoffs between disclosure risk and data utility when these methods are used. Since all public-use files require disclosure limitation in order to protect confidentiality, it is crucial that the methods used achieve the best attainable balance between the utility of the data and the protection of confidentiality, recognizing that a zero risk of disclosure can never be guaranteed.1
Disclosure risk research has begun to advance from the delineation of general approaches with simulated data to empirical estimation of the disclosure risk in existing microdata sets. For example, Reiter (2003) estimated changes in disclosure risk by altering variables, such as age and property taxes, in the Current Population Survey, under alternative assumptions about what a data snooper might know (see also Duncan et al., 2001; Duncan and Stokes, 2004). Much could be learned from more empirically based work with different variables and different files, including work by agencies that do not currently release much, if any, public-use microdata, such as the Social Security Administration. That work might determine that such disclosure limitation methods as calculating summary measures or average values from confidential data and attaching these measures to other commonly used microdata could result in highly useful, fully protected public-use microdata for research and policy analysis on such important topics as taxation and retirement income security.
Currently, studies of the probabilities of disclosure include estimates of the technical possibility of matching public-use survey data with other widely available information, but they do not include estimates of the likelihood that such matching would be attempted. Consequently, estimates of disclosure risk may overstate the potential risk. Yet without data on people’s propensities for snooping through different types of survey data (which may not be the same as their propensities for hacking into credit card company records or other data sets that present clear financial incentives), one cannot empirically estimate the likelihood of attempts to re-identify records in particular microdata sets. However, it would be possible to use different assumptions about those propensities to help agencies set bounds on their estimates of disclosure risk for particular types of data.
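One way to carry out the bounding exercise suggested here is to pair a purely technical matchability measure, such as the count of records that are unique on a set of quasi-identifiers, with low and high assumed snooping propensities. The quasi-identifiers, records, and propensity values below are illustrative assumptions, not empirical estimates:

```python
from collections import Counter

# Sketch: bracket the expected number of potentially successful
# re-identifications by combining sample uniqueness on assumed
# quasi-identifiers with assumed propensities to attempt matching.

def sample_uniques(records, keys):
    """Count records whose quasi-identifier combination is unique."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)

records = [
    {"age": 55, "county": "A", "occupation": "nurse"},
    {"age": 55, "county": "A", "occupation": "nurse"},
    {"age": 71, "county": "B", "occupation": "pilot"},
    {"age": 44, "county": "A", "occupation": "farmer"},
]
unique = sample_uniques(records, ("age", "county", "occupation"))

# Low and high assumed propensities that matching is attempted at all.
for label, propensity in (("low", 0.001), ("high", 0.05)):
    print(f"{label}: {unique * propensity:.3f} expected re-identifications")
# low: 0.002 expected re-identifications
# high: 0.100 expected re-identifications
```

The two duplicate "nurse" records are safe under this measure because neither is unique; only the unique "pilot" and "farmer" records contribute to the bound, and the spread between the low and high figures shows how sensitive the risk estimate is to the unmeasured propensity assumption.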
Another factor that could be considered in the calculus of the disclosure risk-benefit tradeoff is the degree of harm that might result from disclosure of particular types of information about an individual or organization. It may be that some degree of uncertainty about the risk of disclosure could be tolerated for data that are not likely to put an individual at risk of serious harm, while a worst-case assumption about disclosure risk would be prudent for highly sensitive data (e.g., reports of illegal drug use or detailed information about financial assets). In making a decision about acceptable levels of disclosure risk, it is important that data providers’ views be considered along with those of the data collection agency and potential research users (see below).
Research on disclosure limitation methods should include establishment (business) data as well as household and individual data. This topic has so far received relatively little attention for business data because of the presumption that establishments are too easy to identify unless their attributes are so heavily masked that the resulting public-use microdata would have little analytic value. Yet there may be methods that could be effective in protecting confidentiality and preserving research utility for establishment-based public-use microdata. Without research on different methods, access to establishment data is likely to continue to be severely restricted, limiting research and policy analysis on important topics.
Methods of disclosure limitation based on synthetic or virtual data, which are constructed from confidential data through partial or complete multiple imputation techniques, show promise in safeguarding confidentiality and permitting the estimation of complex models; they should continue to be explored as an alternative to other disclosure limitation methods (see, e.g., Abowd and Woodcock, 2001; Doyle et al., 2001; Raghunathan, 2003). The chief drawback of synthetic data, like that of masked data, is their potential for yielding misleading results, especially when complex models are estimated. Empirical estimates of the amount of error introduced by imputation-based disclosure limitation methods under various assumptions are needed both to demonstrate their utility and to suggest ways in which they can be improved. Such research, which should also be conducted for perturbation and data swapping methods, will become increasingly important as highly useful but highly sensitive variables—such as biological markers, including DNA samples, and geospatial coordinates—are increasingly linked with survey responses.
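The replace-with-model-draws idea behind partially synthetic data can be sketched as follows. This toy example substitutes a single sensitive variable with draws from a simple linear model fitted to the confidential values; actual synthetic data methods use multiple imputations and far richer models, so this illustrates only the basic mechanism:

```python
import random

# Toy partially synthetic data: release model-based income values in
# place of actual responses. The variables and model are illustrative.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x with one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def synthesize_income(records, noise_sd=2_000, seed=0):
    """Replace each income with a draw from the fitted model plus noise."""
    rng = random.Random(seed)
    a, b = fit_linear([r["age"] for r in records],
                      [r["income"] for r in records])
    return [dict(r, income=round(a + b * r["age"] + rng.gauss(0, noise_sd)))
            for r in records]

confidential = [{"age": 30, "income": 42_000},
                {"age": 45, "income": 61_000},
                {"age": 60, "income": 75_000}]
for r in synthesize_income(confidential):
    print(r)
```

The released incomes preserve the fitted age-income relationship (and hence some analytic utility) while no record carries its actual reported value, which is the sense in which synthetic data can protect confidentiality; the drawback noted above is that analyses relying on relationships absent from the imputation model can be misleading.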
Synthetic public-use microdata could also facilitate data access by providing a means for researchers to explore, test, and refine estimation models at relatively low cost before incurring the higher costs of access to confidential data through a research data center or another restricted access mode. For this purpose, the synthetic data would need to meet a high standard for supporting valid inference but not as high a standard as
would be necessary for research and policy analysis that relied on the synthetic data alone.
Recommendation 5 Agencies that sponsor data collection should conduct or sponsor research on techniques for providing useful, innovative public-use data that minimize the risk of disclosure. Such research should also be a funding priority for the National Institutes of Health and the National Science Foundation. In particular, research should be directed to:
(1) developing measures for quantifying disclosure risk;
(2) estimating the effect on disclosure risk of adding selected variables from confidential data files to public-use files;
(3) estimating and improving the utility-disclosure limitation tradeoffs of alternative disclosure limitation methods, including synthetic data; and
(4) developing disclosure limitation methods for establishment data.
Facilitating Access to Public-Use Files
Academic researchers in the United States need approval from an Institutional Review Board (IRB) to conduct research involving human participants, including, in many cases, secondary analyses. Currently many, perhaps most, IRBs lack the expertise required to review the adequacy of the confidentiality protection for research that involves original data collection. As a result, researchers spend much time justifying to IRBs proposed reanalyses of public-use microdata from federal agencies and established data archives that incorporate best practices for confidentiality protection (see National Research Council, 2003b).
The Panel on Institutional Review Boards, Surveys, and Social Science Research recommended a new confidentiality protection system, built on existing and new data archives and statistical agencies, to facilitate secondary analysis of public-use microdata (National Research Council, 2003b:138 [Recommendations 5.2 and 5.3]). Such a system would permit IRBs to exempt secondary analysis with such data from review as a matter of standard practice under clause 46.101(b)(2) of 45 CFR 46, Subpart A, Federal Policy for the Protection of Human Subjects.
The system could be developed as follows: the Office for Human Research Protections in the U.S. Department of Health and Human Services would work with statistical agencies, appropriate interagency groups, and data archives to develop a certificate to accompany the release of public-use data sets. Data producers and archives could obtain certification for all of their public-use data sets or for individual files if they rarely produce public-use data. Such a certificate would attest that the public-use
file reflects good practice for confidentiality protection and that the data were collected with appropriate concern for informed consent and other human research participant protection issues. With such a certificate, the IRB would exempt from further review any analysis that proposes to use only the data from the certified files.
In supporting its recommendation, the panel noted (National Research Council, 2003b:138-139):
We argue that IRB review of secondary analysis with public-use microdata is unnecessary and a misuse of scarce time and resources … If the data in a file have been processed to minimize the risk of re-identifying a respondent by using widely recognized good practices for confidentiality protection, then the research is eligible for exemption under the Common Rule….
Development of the process of certification assumes heavy participation by the statistical agencies, the Office of Human Research Protections, the Office of Management and Budget Statistical and Science Policy Office, and interested data archives and would clearly require an initial investment of significant amounts of time and money. But if IRBs throughout the country accepted certification as sufficient to exempt from review research involving data from a statistical agency or a nationally recognized survey organization, access to research microdata would be considerably enhanced. We therefore endorse the recommendations of the earlier panel.
Recommendation 6 To enhance access to public-use files for secondary analysis, we endorse the recommendations of the Panel on Institutional Review Boards, Surveys, and Social Science Research concerning establishment of a new system of confidentiality protection for public-use microdata based on existing and new data archives and statistical agencies. Statistical agencies and participating archives would certify that public-use data sets obtained from them were sufficiently protected against statistical disclosure to be acceptable for secondary analysis, and IRBs would exempt such analyses from review on the basis of the certification provided.
Extending Legal Obligations to Data Users
At present, the obligation to protect individual respondents falls primarily on those who collect the data, thereby creating a disincentive for providing access to other researchers. We believe this obligation should be extended to the users of public-use data as well. All releases of statistical data by federal agencies, including public-use data files, should include a warning that the data are provided for research purposes only
and that any attempt to identify an individual respondent in the data file is a violation of federal law and will result in penalties comparable to those currently imposed only on agency personnel and licensed users. Such a warning currently accompanies public-use records released by the National Center for Education Statistics (NCES). Although such restrictions may be difficult to enforce, especially for public-use data, the legal sanction will stand as an expression of professional norms regarding the use of research data.
The ability to seek penalties may require new legislation for most agencies. The language that is in place for data sets from the NCES (P.L. 107-279, Education Sciences Reform Act of 2002, Section 183(d)(6)) is an example of the kind of penalties that would be appropriate to invoke against users of public-use data from federal statistical agencies who breach confidentiality:
Any person who uses any data provided by the Director, in conjunction with any other information or technique, to identify any individual student, teacher, administrator, or other individual and who knowingly discloses, publishes, or uses such data for a purpose other than a statistical purpose, or who otherwise violates subparagraph (A) or (B) of subsection (c)(2), shall be found guilty of a class E felony and imprisoned for not more than five years, or fined as specified in section 3571 of title 18, United States Code, or both.
Recommendation 7 All releases of public-use data should include a warning that the data are provided for statistical purposes only and that any attempt to identify an individual respondent is a violation of the ethical understandings under which the data are provided. Users should be required to attest to having read this warning and instructed to include it with any data they redistribute.
Recommendation 8 Access to public-use data should be restricted to those who agree to abide by the confidentiality protections governing such data, and meaningful penalties should be enforced for willful misuse of public-use data.
As noted above, new legislation would be required for some agencies to have the authority to impose such penalties.
FACILITATING ACCESS TO RESEARCH DATA CENTERS
One key way to provide researcher access to confidential data is through research data centers (RDCs), including the eight centers maintained by the Census Bureau and those maintained at the headquarters of
the Agency for Healthcare Research and Quality and the National Center for Health Statistics, in the Department of Health and Human Services, the Bureau of Labor Statistics in the Department of Labor, and some other agencies.2
As noted in Chapter 2, the Census Bureau’s RDCs represent an important step toward facilitating research access to confidential data; however, they are believed to be underused, and their use appears to be declining. Two of the three reasons for this trend are the length of the review process and the costs involved in doing research away from one’s home institution. The third reason is a very stringent interpretation of the five criteria for approving a research project, which must demonstrate:
a likely benefit to the Census Bureau under Title 13 and, indeed, that its predominant purpose is to provide one or more Title 13 benefits, such as improving imputations for nonresponse (see www.ces.census.gov/ces.php/guidelines);
scientific merit in terms of a project’s likelihood to contribute to existing knowledge (which is similar to the criterion for research-funding agencies, such as the National Science Foundation [NSF] and the National Institutes of Health [NIH]);
a clear need for nonpublic data;
from the applicant, a willingness to accept all confidentiality protection and disclosure review requirements, including strict limits placed on how much and how often intermediate output can be taken out of the RDC (e.g., by a graduate student for review by his or her professor) and the requirement that the addition of new investigators to a project requires a de novo review and approval.
Stringent application of these criteria may discourage applications or prompt the withdrawal of proposals before a decision is reached, and it probably contributes to the length of the review process. Yet such stringency arguably neither enhances confidentiality protection nor furthers the Census Bureau’s mission to facilitate data use. In particular, the panel concludes that the first criterion has been interpreted in a way that actually impedes furtherance of the agency’s mission. Research that uses Title 13 data should be deemed eligible for approval so long as the researcher agrees to provide information to the Census Bureau about the quality and usefulness of the data, without the requirement to demonstrate that the research’s predominant purpose is data improvement under Title 13 (see Hildreth, 2003; see also discussion in Appendix A). The research use of the data is a key part of the all-important feedback cycle that contributes to improvement of published statistics, public-use microdata, and summary products, as well as the Census Bureau’s knowledge about its data.

2. For a description of the Census Bureau’s RDC operations, see Hildreth (2003); see also the summary of the panel’s workshop, Appendix A.
With regard to the length of review, (incomplete) data from the Census Bureau’s RDC network indicate that it takes an average of 7 months for approval of a project with economic data and an average of as much as 20 months for review of a project using matched administrative data, such as state Medicaid and unemployment records matched with CPS or SIPP data. (These figures do not include proposals that require revision and resubmission or that are withdrawn from consideration.) One reason that the use of matched data takes so long to approve is that the custodian of the administrative data undertakes its review following that of the Census Bureau. If a researcher is then asked to revise and resubmit the proposal, and the researcher has, say, a 2-year grant period, the project is impossible to carry out in the allotted time.
Until recently, the process was further slowed because there were only three review cycles each year. In 2004, the Census Bureau implemented a continuous review process in which reviews are being conducted on an “on demand basis.” Although it is too early to assess the effects of this change, it is an important step in improving access to confidential data under the RDC program. However, the Internal Revenue Service, because of staff limitations, continues to have only three review cycles each year for projects that propose to use data from Social Security earnings records or income tax returns.3
As noted in Chapter 2, the Census Bureau has indicated openness to other ideas for streamlining the application and review process for RDC projects, and some ideas may also be relevant for RDC operations at other agencies. For example, some research projects that are proposed for implementation at an RDC have already been reviewed and recommended for funding by an agency (e.g., the NSF or NIH) through its own peer review process. The Census Bureau (or other agency) could accept that funding recommendation as part of its review process, concentrating only on the appropriateness of the RDC for the work.
3. For a description of the agreement between the Census Bureau and the IRS regarding access to confidential IRS data, see “Criteria for the Review and Approval of Census Projects that Use Federal Tax Information” (www.ces.census.gov/download.php?document=50 [May 2005]). This document applies with full rigor only to proposals using both Title 13 (Census) and Title 26 (IRS) data. When the proposal uses only Title 13 data, the Census Bureau may interpret the statute without IRS review.
Recommendation 9 To achieve the research potential and cost-effective operation of the Census Bureau data centers, the Census Bureau should (1) broaden the interpretation of the criteria for assessing the benefits of access to data; (2) maintain the continuous review cycle; and (3) take account of prior scientific review of research proposals by established peer review processes.
If IRS data are involved, that agency would also have to agree on the new criteria.
Other steps to consider for stimulating use of research data centers include broader advertising of the centers and the procedures for using them, and special proposal submission and review processes for junior researchers. In addition, the Census Bureau and other statistical agencies should explore ways to house confidential data from as many agencies as possible in a single supervised location in a number of host institutions in order to add to their value for research use. The 2002 Confidential Information Protection and Statistical Efficiency Act (CIPSEA) may facilitate this process.
Currently, statistical agencies have few resources to facilitate access to confidential files by external researchers, which is a disincentive to maintain, let alone expand, the operations of research data centers and other modes of restricted access. Similarly, potential host organizations often lack adequate resources to contribute to the operations of research data centers, and they are unlikely to increase their contributions if the access process is so cumbersome that it deters researchers from seeking to use confidential data. In order to provide adequate access, research data centers need funds for a range of tasks, from processing applications and overseeing access, to preparing and updating user-friendly documentation and access tools, to checking researchers’ work to ensure that breaches of confidentiality have not occurred. We note that the Census Bureau RDCs are supported in part through grants from the NSF and the National Institute on Aging (see National Research Council, 2000:48). Increased funding through a variety of mechanisms, and from a variety of agencies, should be explored, contingent on improved data access.
EXPANDING AND IMPROVING REMOTE ACCESS
One way to reduce the costs in time and money involved in traveling to a research data center is to expand access to the confidential data stored in those centers from a remote computer. Because the methodology used by an agency in processing and archiving its data affects how remote access to the data can be structured, there are no simple designs for remote access (see Rowland, 2003). Furthermore, access from a remote computer
poses significant challenges to the maintenance of confidentiality because of the risk posed by repeated queries to the database and the potential ability to infer individual attributes by comparing results for some table cells against others (see Duncan and Mukherjee, 2000). At this stage of software development for disclosure review, manual monitoring before output is sent back to a user may be more effective at protecting confidentiality. It may also, as in the NCHS system, allow users to request a broader array of outputs (e.g., regressions of various types in addition to tables). However, manual monitoring is more costly for the sponsor agency and precludes rapid response to user submissions.
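The inference risk from comparing query results can be made concrete with a small sketch. The data, field names, and queries below are purely hypothetical (they do not represent any agency's actual system or schema); the point is only that two individually "safe" aggregate totals can be differenced to isolate a single respondent's value.

```python
# Hypothetical microdata: (county, age, income). The lone respondent
# over age 60 in county A is the one whose value will leak.
records = [
    ("A", 34, 52_000),
    ("A", 41, 61_000),
    ("A", 58, 47_000),
    ("A", 67, 90_000),  # only county A respondent over 60
]

def total_income(rows, pred):
    """Aggregate query: sum of income over rows whose age matches pred."""
    return sum(inc for (_, age, inc) in rows if pred(age))

# Query 1: total income, all of county A -- an apparently safe aggregate.
q1 = total_income(records, lambda age: True)

# Query 2: total income, county A residents aged 60 or under -- also
# apparently safe on its own.
q2 = total_income(records, lambda age: age <= 60)

# Differencing the two totals isolates the single respondent over 60.
# This is exactly the kind of inference a disclosure reviewer screens for.
leaked = q1 - q2
print(leaked)  # 90000
```

Automated disclosure-review software must detect such overlapping query pairs (and longer chains of them), which is why manual review of outputs remains the more conservative option at present.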
Research that will permit expansion of this mode of access to confidential data is needed. The research should focus on efficient disclosure limitation methods for remote access that allow users to request a wide range of outputs and obtain output within reasonable time limits.
Recommendation 10 Statistical agencies and other agencies that sponsor data collection should conduct or sponsor research on cost-effective means of providing secure access to confidential data by means of a remote access mechanism, consistent with their confidentiality assurance protocols.
An alternative to research data centers, one that reduces burden to users because it does not require them to travel to a different location, is a licensing agreement. Licensing agreements, which are a valuable means of access to confidential data, have developed in different ways for different datasets. Although the Census Bureau does not currently have the authority to allow access to its confidential data under licensing agreements, the Bureau of Labor Statistics, the NCES, and NSF’s Division of Science Resources Statistics, among other agencies, license the use of confidential data to researchers who meet certain criteria. The HRS, which is carried out at the University of Michigan with funding from the National Institute on Aging, also licenses researchers to use its data. These licenses enable researchers to work at their home institution, without incurring the costs of relocating.
NCES—which currently uses licensing more than any other agency—requires potential users (such as state and local agencies, contractors, and researchers) to complete an application designed for the specific type of user. The process involves preparing and submitting a formal letter of request, a license document, an affidavit of nondisclosure, and a security plan.4 Users of confidential HRS data must be affiliated with an institution that has a human subjects review process (including an IRB that is registered with and has been approved by the Office for Human Research Protections in the U.S. Department of Health and Human Services) and be a current recipient of federal research funds. Users of HRS data are required to submit for approval a research proposal and a data protection plan, as well as documentation of IRB review and a signed agreement for use of restricted data.5 Most licensing agreements are time limited and require users to return or destroy the confidential data files.

4. For details, see “Restricted-Use Data Procedures Manual” (nces.ed.gov/statprog/rudman/ [November 2004]).
Although potentially very useful for expanding access to confidential data, licensing is not yet widely used by statistical agencies. This mechanism could be significantly expanded: agencies that currently lack authority for licensing should investigate obtaining such authority, and agencies that currently license only a few data sets should consider expanding the number of data sets for which a license may be obtained. In expanding the use of licensing agreements, agencies should sponsor consultations among data users and producers in developing the standards governing such agreements in order to assure the widest possible access consistent with confidentiality protection. Implementation of CIPSEA will facilitate—indeed, require—developing relatively uniform procedures across agencies.
Recommendation 11 Statistical and other agencies that provide data for research and do not yet use licensing agreements for access to confidential data should implement such an access mechanism. Agencies that use licensing for only a few confidential data sets should expand the files for which a license may be obtained.
For some agencies, such a mechanism may require new legislation.
Recommendation 12 Statistical and other agencies that provide data for research should work with data users to develop flexible, consistent standards for licensing agreements and implementation procedures for access to confidential data.
Both the NCES and the HRS licensing agreements include two important enforcement provisions. One is random auditing of the licensed research site by a qualified auditor for adherence to the conditions of the license, including storage of the data on secure servers, restriction of access to personnel named on the agreement, and encryption of the data when in transit. The second is severe penalties for serious violations of the agreement. In the case of the HRS, for example, the penalties include forfeiture by the investigator—and, possibly, the investigator’s entire institution—of all current funding, and denial of future funding by the sponsoring agency.

5. For details, see “HRS Restricted Data: Application Materials: Basic Requirements” (hrsonline.isr.umich.edu/rda/rdapkg_req.htm#reqoutline [November 2004]).
An early review of the results of audits by NCES revealed that the violations uncovered resulted from simple carelessness, did not result in confidentiality breaches, and did not trigger the imposition of penalties (see McMillen, 1999). A more recent review concluded that enforcement mechanisms throughout the government are quite weak (see Seastrom, Wright, and Melnicki, 2003), contributing to violations; however, most if not all of the violations resulted from carelessness or not following proper procedures, rather than from willful misuse of data, and, again, there was no evidence of disclosure of individual data.
Although the panel recognizes that broadening access through licensing agreements may increase the risk of disclosure by increasing the number of people with access to confidential data, we believe that the risk is outweighed by the benefits of wider access. In order to provide as much protection as possible, we recommend that future licensing agreements include the two key enforcement features—auditing and penalties for violations—that are designed to minimize that risk. We also recommend that all data providers be informed that their data may be used in unanticipated ways, and by researchers other than those carrying out the data collection, but only for research purposes (see Recommendation 14).
Recommendation 13 Licensing agreements should include auditing procedures and appropriate legal penalties for willful misuse of confidential data.
INFORMING RESPONDENTS OF DATA USE
As we stress throughout this report, the foundation for achieving the benefits of data for research and policy is the public’s willingness to supply the information requested. In turn, all agencies that collect data have an obligation to inform respondents about the purposes for which the data are being collected and how they will be used.
Recommendation 14 Basic information about confidentiality and data access given to everyone asked to participate in statistical surveys should include notification about:
(1) planned record linkages for research purposes;
(2) the possibility of future uses of the data for other research purposes;
(3) the possibility of future uses of the data by researchers other than those collecting the data;
(4) planned nonstatistical uses of the data; and
(5) a clear statement of the level of confidentiality protection that can be legally and technically assured, recognizing that a zero risk of disclosure is not possible.
This recommendation substantially mirrors one from Private Lives and Public Policies (National Research Council, 1993:220-221).
The following paragraph may provide a model for a brief statement that responds to the spirit of items (1) – (5).6
Your information is being collected for research purposes and for statistical analysis by researchers in our agency and in other institutions. Your data will not be used for any legal or enforcement purpose [unless required by the Patriot Act]. The researchers who have access to your data are pledged to protect its confidentiality and are subject to fines and prison terms if they violate it. Data will only be provided to researchers outside our agency in a form that protects your identity as an individual. Some uses of your data may require linking your responses to other records, always in a manner that honors our pledge to protect your confidentiality.
The panel also believes that in formulating policies about data access, neither statistical agencies nor IRBs should assume that they know what kinds of data members of the public consider sensitive or what disclosure risks they are willing to tolerate. Instead, these policies should take the views of the public into account.
Recommendation 15 Statistical and funding agencies should support continuing research to monitor the views of data providers and the general public about research risks and benefits, including such topics as the sensitivity of questions, data sharing for statistical purposes, methods of obtaining consent for survey participation, the importance of privacy and confidentiality, and similar topics.
SAFEGUARDING CONFIDENTIALITY: TRAINING, MONITORING, AND EDUCATION
So far, we have discussed ways of expanding research access while protecting confidentiality, focusing mainly on risks of statistical disclosure and how to measure and safeguard against them. In this concluding section, we address the issue of confidentiality protection more generally,
acknowledging, as we did in Chapter 4, that wider access to confidential data is likely to increase the risk of confidentiality breaches, but that statistical disclosure is not the only, or even the main, threat. We consider three aspects of confidentiality protection: (1) training employees in procedures to safeguard confidential data, (2) research on violations of confidentiality protection procedures and actual breaches of confidentiality, and (3) educating researchers and staff in the ethical foundations of privacy and confidentiality.
One common threat to confidentiality protection of research data arises from simple carelessness—not removing identifiers from questionnaires or electronic data files, leaving cabinets unlocked, not encrypting files containing identifiers, talking about specific respondents with others not authorized to have this information. Just as institutional review boards currently require researchers to undergo training in human subjects protection issues before undertaking research involving human participants, so statistical agencies and private survey organizations should provide their employees with guidelines for confidentiality protection, as well as regularly updated training in appropriate data management (such as secure storage of identifiable information) to ensure that the guidelines are observed. Data collection agencies that have such guidelines and training should regularly review their procedures to ensure that they are up to date and systematically enforced.
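One of the routine practices such guidelines cover, removing direct identifiers before a file leaves the secure environment, can be sketched in a few lines. The field names below are illustrative assumptions, not any agency's actual schema.

```python
# Minimal sketch of one data-hygiene guideline: strip direct identifiers
# from a record before it is stored or shared outside the secure enclave.
# Field names here are hypothetical, chosen only for illustration.

DIRECT_IDENTIFIERS = {"name", "ssn", "address", "phone"}

def strip_identifiers(record: dict) -> dict:
    """Return a copy of the record with direct identifiers removed."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

raw = {"name": "J. Doe", "ssn": "000-00-0000",
       "age": 42, "income": 61_000}

safe = strip_identifiers(raw)
print(safe)  # {'age': 42, 'income': 61000}
```

A step like this addresses only direct identifiers; as the differencing example earlier in the chapter suggests, indirect identification through combinations of attributes requires separate statistical disclosure review.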
Recommendation 16 Statistical agencies and survey organizations that collect individually identifiable data should provide written guidelines for confidentiality protection, as well as training in confidentiality practices and data management that guard against disclosure, for all staff who work with or have access to such data.
Such training should cover all aspects of data management—entering, storing, manipulating, and analyzing electronic records. Everyone who handles electronic records needs to be fully aware of the need to protect them, just as they must protect paper records.
Research on Breaches of Confidentiality
Just as better information is needed about the use made of research data (see Chapter 3), information is also needed about violations of confidentiality protection practices and the actual occurrence of confidentiality breaches. Without knowing how many breaches occur in an agency, it is
impossible to know, for example, whether laws and penalties designed to prevent improper disclosure of confidential information are effective or whether other kinds of deterrents are needed. Statistical agencies and individual researchers have generally resisted suggestions for research on confidentiality breaches, yet such research is necessary to evaluate the effectiveness of data access mechanisms in preventing unwarranted disclosure.
The Office of Research Integrity in the U.S. Department of Health and Human Services is currently funding research into such violations of research ethics as data fabrication and plagiarism; a few such studies have been published (see, e.g., Swazey, Anderson, and Louis, 1993; Martinson, Anderson, and de Vries, 2005). Research into the extent, nature, and causes of confidentiality breaches is long overdue. If well-designed and executed research and monitoring finds little evidence of such breaches, it would do much to reassure the public and the agencies themselves that the benefits accruing from wider dissemination of research data will not incur undue costs in terms of breaches of confidentiality. The U.S. Government Accountability Office has in the past expressed an interest in undertaking such research.
Recommendation 17 Statistical agencies should set up procedures for monitoring, on an ongoing basis, violations of confidentiality protection practices and instances of confidentiality breaches that may occur. The system should be designed to obtain information on the causes and consequences of these breaches.
Education in Research Ethics
Laws and procedures designed to prevent confidentiality breaches and punish their occurrence will not be optimally effective unless they are accompanied by internalized norms of research ethics and fair information practices (see Barquin and Northouse, 2003; Duncan, 2004). To inculcate these among current and future researchers and the staffs of the statistical agencies, universities as well as agencies that collect data from the public should be encouraged to develop curricula (presented in courses, workshops, and other educational forums) dealing explicitly with the requirements of fair information practices, as well as with the requirements for conducting ethical research with human beings. The two are not identical, and both have a role to play in the training of researchers and others who will work in the field of government statistics. Such education programs should deal with ethical, legal, and data quality issues, as well as with administrative and technical procedures for confidentiality protection, data security, disclosure limitation, and informed consent.
Statistical agencies could make important contributions to the development of such training programs by providing advice based on their experience and expertise. Funding organizations, such as the NSF and the NIH, could contribute to the necessary financial support. There is also an important role in education for professional associations, many of which have codes of professional conduct and ethical standards. Such associations as the American Statistical Association, the American Sociological Association, the Population Association of America, the American Economic Association, the American Association for Public Opinion Research, and their counterparts for other disciplines and fields can contribute significantly to the development of strong norms for fair and ethical practices in research and information gathering.
Recommendation 18 Training in ethical issues related to research, including fair information practices, as well as principles and practices related to research with human participants, should be part of the professional training of all those involved in the design, collection, distribution, and use of data obtained under pledges of confidentiality. Such training should be updated at intervals after the end of formal schooling.
Recommendation 19 Professional associations should develop strong codes of ethical conduct that reflect the need to protect the confidentiality of personal data and make adherence to these codes an integral part of their educational activities.
In addition to encouraging educational and research institutions to add training to their programs, consideration should also be given to requiring completion of a specialized training program as a condition for use of confidential data. Such a program might be designed along the lines of the training and certification programs required of all researchers who are subject to IRBs. Professional associations may be one kind of organization to provide such training.
The challenge facing statistical and other data collection agencies in disseminating the best data as widely as possible in order to foster sound public policy and research while protecting the confidentiality of those data is formidable, but it can be met. With appropriate safeguards, and recognizing that the technological and legal environment is likely to be one of continual change, the nation can reap enormous benefits from the information the public provides.