Improving Access to and Confidentiality of Research Data: Report of a Workshop

5  Current Agency and Organization Practices

This chapter reviews presentations that described current practices aimed at preserving the confidentiality of microdata. Practices related to release of public-use files and restriction of data access are reviewed in turn.

RELEASE OF PUBLIC-USE FILES

More than 20 years ago the Federal Committee on Statistical Methodology (1978) recommended that all federal agencies releasing statistical information formulate and apply policies and procedures designed to avoid unacceptable disclosures. In releasing public-use files—typically made available with no restrictions other than, in some cases, imposition of a user fee—agencies generally comply with this recommendation through the use of various forms of statistical disclosure limitation. Two workshop presentations described policies and procedures used for the release of microdata. Alice Robbin provided an overview of results from a survey on the statistical disclosure limitation (SDL) practices used by government agencies and research organizations that distribute public-use microdata files with longitudinal, linked administrative, or contextual data for small areas. Alvan Zarate of the National Center for Health Statistics (NCHS) presented an overview of the Interagency Confidentiality and Data Access Group's (ICDAG) Checklist on Disclosure Potential of Proposed Data Releases, developed to help ensure that
principal safeguards are in place when electronic data files are released for public use.1

Statistical Disclosure Limitation Practices

Alice Robbin, Thomas Jabine, and Heather Koball conducted a survey of organizations that produce and distribute complex research microdata files. The survey was intended to contribute empirical evidence on how knowledge about SDL procedures has been applied by these organizations in the production of public-use microdata. Information was gathered on the types of microdata that are released publicly, current SDL practices applied to public-use data, and organizations' demarcation between public-use and restricted-access data. Several themes emerged from this information, including (1) the extent of variation in SDL practices across organizations, (2) special risks to data confidentiality, and (3) the tension between the data needs of researchers and data confidentiality.

Variation in Organizational SDL Practices. Survey respondents conveyed familiarity with the broad issue of data confidentiality. All respondents knew that direct identifiers of respondents should not be released, and all expressed concern about protecting respondents' identities. Furthermore, because of concerns about data confidentiality, few organizations release public-use geographic/contextual data for small areas. Similarly, linked administrative data are generally confined to a restricted-access format. On the other hand, the survey revealed considerable variation across organizations in knowledge about SDL techniques. This variation is a function of the extent of practitioners' knowledge about deductive disclosure, the type of organization, and the timing of decisions related to release of public-use files. Some respondents appeared to be unfamiliar with terminology and concepts associated with data confidentiality, while others were well versed in these matters.2

The treatment of special "at-risk" variables, such as age and income, varies widely by organization. Most organizations appear to base their SDL decisions for public-use longitudinal files on a cross-sectional model. That is, they assess the risks of disclosure for a given cross section, with little consideration of longitudinal effects. One factor that may contribute to the relatively liberal policies applied to longitudinal data is the fact that follow-up data are often released several years after earlier panels. Decisions about previous data releases may or may not play a role in decisions pertaining to the release of longitudinal files. A second factor that appears to influence decisions about release policies for longitudinal data is knowledge of user preferences. Staff are sensitive to the fact that longitudinal data are deemed particularly useful when the data contain the same variables over time.

The survey responses indicated greater variation among the standards of universities than among those of government agencies.3 Government agencies, particularly the Bureau of the Census and NCHS, have standards on which they base decisions about release of public-use data and SDL techniques. Many of the Census Bureau's standards have had a significant influence on other organizations that distribute microdata. The Census Bureau also has a Disclosure Review Board, which reviews microdata sets prior to release. NCHS has an IRB; a data confidentiality committee; and a data confidentiality officer, who makes final decisions about SDL techniques for public-use data.

Special Risks. In general, issues related to deductive disclosure have been brought to the attention of organizations only in recent years; as a result, the SDL techniques applied to microdata sets have changed. Moreover, older longitudinal microdata sets are at particular risk for deductive disclosure because they contain more detailed information about respondents than would be released under current practices. These data sets also follow respondents over long periods of time, so they contain a wealth of detailed information, some of which is revealed only as a result of the longitudinal structure.

1 The ICDAG was recently renamed the Committee on Data Access and Confidentiality.

2 This generalization—that there is wide variation in knowledge and practice of SDL techniques—was corroborated by Erik Austin of the Inter-university Consortium for Political and Social Research. Austin has been involved with this issue for three decades; his organization has examined thousands of files and reviewed SDL plans so that data can be released publicly.
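Two of the SDL techniques that recur in this chapter, top-coding of a sensitive numeric variable and coarsening of "at-risk" variables such as age, can be sketched in a few lines. This is a hypothetical illustration only: the thresholds, band width, and record layout are invented for the example and do not reflect any agency's actual standards.

```python
# Hypothetical SDL transformations: top-coding income and banding age.
# All thresholds below are invented for illustration.

INCOME_TOP_CODE = 150_000  # hypothetical top-code threshold
AGE_BAND_WIDTH = 5         # hypothetical 5-year age bands

def top_code(income, cap=INCOME_TOP_CODE):
    """Replace any value above the cap with the cap itself."""
    return min(income, cap)

def age_band(age, width=AGE_BAND_WIDTH):
    """Collapse an exact age into a coarse interval such as '35-39'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

records = [
    {"age": 37, "income": 62_000},
    {"age": 81, "income": 410_000},  # extreme values: re-identification risk
]
masked = [
    {"age": age_band(r["age"]), "income": top_code(r["income"])}
    for r in records
]
print(masked)
```

Note that a cross-sectional check of this kind says nothing about longitudinal risk: a respondent whose banded age advances consistently across waves, or whose top-coded income is predictable from wave to wave, may still be identifiable from the trajectory, which is precisely the gap in the cross-sectional model described above.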
The combination of changing SDL standards and the compilation of data on respondents over time may make older longitudinal data sets particularly vulnerable. At the same time, it is often the longitudinal structure that makes these microdata sets particularly useful to researchers.

Needs of Researchers Versus Data Confidentiality. Organizations appear to be keenly aware that their microdata sets are more useful if they provide greater detail. One respondent stated that his organization increased data availability because of the demands of users: the organization raised the level at which income was top-coded in response to complaints about the lack of data detail. Other respondents indicated that their decisions to release data are based, in part, on providing as much data as possible to researchers. They "advertise" public-use data on Web sites by highlighting the data's detailed and longitudinal nature. This advertising is designed to increase use of the data and to demonstrate to program funders that the data distribution function is being performed. Ease of access for researchers was also cited by respondents as an aspect of the tradeoff between data utility and the need for confidentiality. The Internet makes access to public-use data increasingly easy; this ease of access facilitates the research process, while also increasing the risk of deductive disclosure.

The survey findings are generally consistent with the body of empirical evidence that has accumulated on organizational decision making. They reveal that members of organizations are sensitive to the external environment, and that structural and political factors influence decisions related to release of public-use data. The findings also reveal that structures within organizations can be fragmented. Organizational units are governed by different policies, some of which are contradictory. Further, staff turnover contributes to loss of institutional memory, and historical records about data release decisions are often not maintained to compensate. Survey design, data collection, and preparation of public-use files may be overseen by different units of the same organization or by different organizations, and this can affect information flows. Management control also differs across organizations and units; some project managers are familiar with the nuts and bolts of data release decisions, while others are not. Agency staffs are sensitive to the consequences of releasing data that could identify individuals, particularly in light of legislative initiatives responding to public concerns about confidentiality.

3 Austin agreed with this point as well, noting that academic data producers typically have less knowledge about SDL techniques than their agency counterparts. He recommended establishing venues for communicating SDL techniques more effectively.
One agency respondent noted that his organization's dissemination of a public-use file of survey data had ceased because of a recently enacted statute that was interpreted as preventing distribution of the data. Regardless of whether the statute was properly interpreted, what is important is that a perceived threat from the external environment resulted in the restriction of important data. Furthermore, policies governing data access and confidentiality are subject to change. Institutional interpretations of these policies influence decisions about the release of public-use microdata, how data will be prepared, and the conditions under which access will be permitted.

Robbin, Jabine, and Koball offered a number of policy recommendations regarding the release of public-use files. These recommendations were directed to producers, distributors, and analysts of large-scale microdata files, as well as to funding agencies and project managers.

Communicating and Educating About Statistical Disclosure Risk and Limitation Procedures. Appropriate policy and policy compliance require improved communication about current research on disclosure risk, as well as education of professionals about good practice. Research into disclosure risk has been conducted for more than 20 years. Statistical agencies have published documents analyzing the risk and providing guidelines for good practice. Peer-reviewed journals have published articles on the subject. The American Statistical Association's Committee on Privacy and Confidentiality has prepared informational materials that are available at its Web site, and ICDAG has disseminated its Checklist on Disclosure Potential of Proposed Data Releases (discussed below). Yet agency and organization staffs appear inadequately aware of current SDL practices. The result is that, in some cases, statutory confidentiality requirements go unmet, while in others, data are overly restricted.

To facilitate dissemination of information about good SDL practices and standards, Robbin, Jabine, and Koball recommended producing and circulating a bibliography of key publications on methods for evaluating deductive disclosure risk. The American Statistical Association's Committee on Privacy and Confidentiality has prepared bibliographies on the subject, and the committee's work can serve as the basis for selecting additional informational resources. Documents on the subject need to be available at a basic technical level to be useful for staff who may have less statistical expertise but are on the front lines of data production. Responsibility for ensuring that data organizations employ good SDL practices should not lie only with data processing staff (the survey results indicated that, in many cases, programmers were given nearly sole responsibility for preparing public-use files). Internal review units should be available to evaluate proposed releases of microdata files; outside of government, IRBs and other groups can incorporate experts on deductive disclosure.
The survey revealed that a number of respondents, while familiar with the general issues of data confidentiality, were not knowledgeable about disclosure risk and SDL techniques. Thus, there is a clear opportunity to achieve advances through further education. Workshops and panels at annual professional meetings offer an appropriate forum for launching such efforts. Interactive environments such as IRBs and data centers represent additional ongoing opportunities.

Institutionalizing Communication to Improve SDL Practices. A general set of rules governing data release is not possible because virtually every proposed release is unique in some way, even within the same agency and program. It is important to obtain expertise on the subject in the initial planning stages of statistical programs and research projects, and then later during evaluation and testing to prepare public-use files. Improved documentation is also an essential aspect of communicating survey objectives and methods. Detailed documentation can minimize the loss of institutional memory that results from staff turnover and other factors.
Data User Participation in Data Release. There are multiple approaches to developing good SDL practices. Data users have important knowledge to contribute during the early stages of organizational decision making on the practices to be employed.

Checklist on Disclosure Potential of Proposed Data Releases

The introduction to the Checklist (Interagency Confidentiality and Data Access Group, 1999:1) clearly describes its function:

Federal statistical agencies and their contractors often collect data from persons, businesses, or other entities under a pledge of confidentiality. Before disseminating the results as either public-use microdata files or tables, these agencies should apply statistical methods to protect the confidentiality of the information they collect. . . . [The Checklist] is one tool that can assist agencies in reviewing disclosure-limited data products. This Checklist is intended primarily for use in the development of public-use data products. . . . The Checklist consists of a series of questions that are designed to assist an agency's Disclosure Review Board to determine the suitability of releasing either public-use microdata files or tables from data collected from individuals and/or organizations under an assurance of confidentiality.

Zarate's overview of the Checklist was presented within the broader theme of how agencies operate "caught between the twin imperatives of making usable data available, while also protecting the confidentiality of respondents." Zarate noted that, while disclosures can occur, it is not justifiable to withhold data that are valuable for legitimate research purposes. Zarate explained that the Office of Management and Budget's Statistical and Policy Office helped form the ICDAG in 1991 to coordinate and promote research on the use of statistical disclosure methods and to catalog related developments at agencies and among academic researchers.
The Checklist is intended for use in risk assessment by agency statisticians in charge of data release. Although it is not a formal regulatory document, its widespread visibility should motivate a closer look at organizational methods. Though the Checklist offers nontechnical discussion and advice on all basic SDL techniques, users must be familiar with survey design and file content. Zarate pointed out that none of the rules can be followed blindly. There are real constraints on any attempt to standardize data protection; for instance, rules may be very different when applied to data for a demographically unusual group or for a survey topic that involves especially sensitive information. The Checklist is not a substitute for knowing the data that are to be released.

Researchers at the workshop voiced the concern that, if users are not adequately knowledgeable about the data and the associated risks and benefits, they may misuse documents such as the Checklist as a rationale for overprotection. With institutions' reputations on the line, standardization can lead to conservatism in release policies, which researchers worry will inevitably limit the availability of data required for the most important and innovative research. In this context, the Checklist must be an evolving document. Users must be educated enough to adapt it to fit their specific requirements, reflecting, as Zarate put it, "that there is an art as well as a science to disclosure analysis."

Currently, the Checklist emphasizes proper handling of geographic information. "Small areas" are defined by the Census Bureau as areas with fewer than 100,000 people; previously, geographic information was available only at the 250,000-person sampling unit level. The more recent definition reflects a rule of thumb without a real quantitative basis. Zarate argued that there is a real research need to develop empirical evidence to justify recommendations regarding geographic specificity. In fact, the disclosure risk posed by geographic delimitation can be assessed only in the context of other variables that are available in data records, as well as information about ease of external linkage. Adding detailed geographic identifiers to specific age, race, and other contextual variables makes data more useful, but also increases the probability of disclosure. The Checklist directs attention to the variety of available external files (e.g., voter registration, birth and death records) that could be linked to disclose record identity. It may also help guide decisions when data are being issued in a format that is easily manipulated. Finally, Zarate suggested that the Checklist needs further development to direct special attention to longitudinal data. At the time of the workshop, the Checklist had only scratched the surface in alerting data disseminators to the additional risks that arise when records are followed through time.
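The point that disclosure risk comes from combinations of variables rather than from geography alone can be made concrete with a simple uniqueness count over quasi-identifiers. The records and field names below are invented for illustration; a real review would also weigh sampling fractions and the specific external files available for linkage.

```python
from collections import Counter

# Hypothetical records described only by quasi-identifiers
# (area, age band, group). None of these values is a direct identifier,
# yet a combination held by a single record could be matched against
# external files such as voter registration lists.
records = [
    ("area_01", "35-39", "group_a"),
    ("area_01", "35-39", "group_a"),
    ("area_01", "35-39", "group_a"),
    ("area_02", "80-84", "group_b"),  # unique combination: at risk
]

counts = Counter(records)
# Flag every combination that fewer than 2 records share.
at_risk = [combo for combo, n in counts.items() if n < 2]
print(at_risk)
```

Adding a finer geographic identifier to an otherwise safe file can turn previously shared combinations into unique ones, which is why the Checklist treats geographic detail jointly with the other variables on the record rather than in isolation.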
If an agency is locked into certain procedures, it can become clear over time that appropriate levels of security are not in place. For instance, characteristics that can be predicted from one period to the next may not be masked by top-coding or other techniques. The Checklist does not currently address these issues, but is expected to do so in the future.

RESTRICTION OF DATA ACCESS

Several of the workshop presentations described existing and planned restricted access arrangements for managing complex research data files. Paul Massell of the Census Bureau provided a comparative overview of licensing arrangements used by six U.S. agencies and two university-based social science research organizations. Marilyn McMillen of the National Center for Education Statistics (NCES) gave a detailed description of licensing procedures used by NCES, with emphasis on inspection procedures used to monitor observance of the conditions of access. Mark McClellan of Stanford University described procedures employed to protect confidentiality at an academic research center that is using microdata files from multiple sources under licensing arrangements. Several presentations addressed research data centers: J. Bradford Jensen of the Census Bureau's Center for Economic Studies and Patrick Collins of the recently established California Census Research Data Center at the University of California, Berkeley, described their centers. John Horm reported on NCHS's Research Data Center, which offers both on-site and remote access to the agency's non-public-use data sets. And Garnett Picot of Statistics Canada outlined his agency's current restricted access procedures, as well as its plans to establish several research data centers.

The following sections summarize the features of the three principal kinds of restricted access arrangements as presented and discussed at the workshop, as well as one special procedure involving respondent consent that has been used by Statistics Canada. The main features of interest of these arrangements are the adequacy of the data for the desired analyses, eligibility requirements, the means used to provide adequate protection for individually identifiable data, the costs of obtaining and providing access, and the way these costs are shared by users and custodians of the files.

Licensing

NCES was one of the first organizations to issue licenses to researchers that allow them to receive and use nonpublic microdata sets at their own work sites. Nearly 500 licenses for files from several different NCES surveys have been issued since 1991. There are no specific restrictions by type of organization; licenses have been issued to government agencies at all levels, universities, research corporations, and associations.
Applicants must provide a description of their research plan and demonstrate that it requires the use of restricted data, identify all persons who will have access to the data, and prepare and submit a computer security plan. They must execute a license agreement signed by an official with authority to bind the organization legally to its conditions and submit affidavits of nondisclosure signed by all persons who will have access to the data. They must also agree to unannounced inspections of their facilities to monitor compliance with security procedures; an NCES contractor carries out a systematic program of inspections. Licensees are subject to severe criminal penalties for confidentiality violations, as specified in the National Education Statistics Act of 1994.

During the past decade, several other agencies and organizations have developed and used licensing agreements for access to restricted data sets. Specific conditions vary. Some licensors provide access only to institutions certified by the National Institutes of Health as having met procedural criteria for IRBs or human-subject review committees. The duration of license agreements varies, with extensions available in most instances. Some licensors require that publications based on the data be submitted to them for disclosure review; others leave this responsibility to the licensee. Most agreements allow for unannounced inspections of facilities, but not all licensors have a systematic inspection program such as that conducted for NCES.

Every licensee must cover the costs of going through the application process, which generally requires a significant amount of paperwork, and of establishing the physical and other security safeguards required to obtain approval of its computer security plan. Unlike NCES, which uses agency funds to cover the costs of processing applications and conducting inspections, some licensors charge user fees to cover these costs fully or partially. Potential penalties for violations vary substantially. Federal agencies other than NCES that release files may be able to impose penalties under the Privacy Act or other legislation; however, these penalties would be less severe than those available to NCES. Penalties available to universities and other licensing organizations are generally of a different kind: immediate loss of access and denial of future access to data, forfeiture of a cash deposit, notification of violations to federal agencies that fund research grants, and possible liability to civil suits for violating contract provisions.

Research Data Centers

The Census Bureau pioneered the distribution of public-use microdata files from the decennial census and household surveys. However, microdata from establishment censuses and surveys cannot be publicly released because of the higher associated disclosure risks, and the Census law does not permit release of restricted data to users under licensing arrangements. Thus, the only viable option is to provide access to such files at secure sites maintained by the Census Bureau.
Access is allowed only to persons who are regular or special sworn Census employees, who are subject to the penalties provided in the law for violations of its confidentiality provisions. The Census Bureau's Center for Economic Studies, established in the mid-1980s at Census headquarters in Suitland, Maryland, initially constructed longitudinal files of economic data that were used for research by Census staff and by academic research fellows working at the Center as special sworn employees. Since then, additional research data centers have been established in the Bureau's Boston regional office; at Carnegie Mellon University in Pittsburgh; and at the University of California, Los Angeles, and the University of California, Berkeley. Another center is scheduled to open at Duke University in 2000. To date, only files of economic data for firms or establishments have been available, but the centers are planning to add restricted data sets from the decennial census and major household surveys, as well as linked employer–employee data sets.

All researchers desiring to use the research data centers' facilities must submit proposals that are reviewed for feasibility, scientific merit, disclosure risk, and potential benefits to the Census Bureau.4 The applicant must explain why the research cannot be done with publicly available data files. To minimize disclosure risks, projects are limited to those that emphasize model-based estimation, as opposed to detailed tabulations. To ensure that no confidential data are disclosed, all research outputs are reviewed by center staff and may not be removed from a center without the administrator's approval. Fees are charged for use of the centers' facilities, but some fellowships are available on a competitive basis to partially defray these costs; grantees of the National Science Foundation and the National Institute on Aging (NIA) are exempted.

NCHS recently established a research data center that provides both on-site and remote access to nonpublic data files from several NCHS surveys (see below for discussion of the center's remote access arrangements). The main requirements and conditions for on-site access are similar to those of the Census Bureau's research data centers. Research proposals must be submitted and are reviewed by a committee for disclosure risk, consistency with the mission of NCHS, and feasibility given the availability of the center's resources. All outputs are subject to disclosure review before being taken off site. Users are charged a basic fee for use of the center's facilities and an additional fee for any programming assistance provided by the center staff.

Statistics Canada recently decided to establish six to eight research data centers to provide access to data from five new longitudinal surveys of households and persons, including one with linked employer data. Other restricted data sets will be added as needed. The features of the centers will be similar in most respects to those of the Census Bureau's regional centers.
They will be located at secure sites that have stand-alone computing systems and are staffed by Statistics Canada employees. They will operate under the confidentiality requirements of the Canadian Statistics Act, and only "deemed employees" will be allowed access to the data. Proposed research projects will be subject to a peer review process led by Canada's Social Sciences and Humanities Research Council. To be considered "deemed employees," users must produce an output or service for Statistics Canada. To meet this requirement, each user must produce a research paper that will be part of a series sponsored by the agency. The research paper will not include policy comments, but after meeting the requirement to produce the paper, researchers will be free to publish their results anywhere, accompanied by their interpretation of the policy implications.

4 Access to identifiable data by special sworn employees is permitted only when such access is deemed to further the agency's mission, as defined by law.
Remote Access

Arrangements for provision of remote access to NCHS restricted data files, which preceded the establishment of the research data center, have now been taken over by the center. Center staff construct the data files needed for various analyses, a process that can include merging user-supplied files with the appropriate NCHS data files. The center provides a file of pseudo-data in the same format as the real data so users can debug their programs; in this manner, the number of back-and-forth iterations is substantially reduced. SAS is the only analytical software that can be used, and some of its functions, such as LIST and PRINT, are disabled. Users submit their programs to the center by e-mail; following disclosure analysis, output is returned to them by e-mail. Charges for access to a file for a specified time period depend on the number of records included in the file. There is also a charge for file construction and setup services provided by center staff.

Statistics Canada has used remote access procedures on an ad hoc basis. An example is the provision of access to nonpublic data from a longitudinal survey of children and youth. The results have been mixed; a small evaluation survey and informal contacts with researchers have indicated that some judge the system cumbersome to use and not sufficiently interactive.

Respondent Consent Procedure

Section 12 of the Canadian Statistics Act permits Statistics Canada to share nonpublic survey data with an incorporated organization for statistical purposes, provided that survey respondents have given their permission to do so. The agency has used this procedure, known as a Section 12 agreement, to make such data sets available to other federal departments and to provincial statistical agencies. The respondent consent procedure must specify the organizations that will have access to the data.
Typically, from 90 to 95 percent of respondents, who can be either persons or firms, give permission for their data to be used in this way.

Discussion

One trend that emerged from the presentations and discussion of restricted access was the rapid growth during the 1990s in the number of researchers obtaining access to complex research data files through all three of the principal methods of restricted access. The establishment of regional research data centers and the inclusion of demographic files at the Census Bureau's centers are likely to fuel a further expansion. NCES and the National Science Foundation are collaborating on the development of a manual of licensing procedures, which could potentially be used by other agencies and organizations with the authority to employ this arrangement. Adoption of a more uniform set of procedures could reduce the cost and time required to submit proposals to licensors.

There have been some negative user reactions to the controlled remote access approach because of limitations on software options, delays, and the relative difficulty of interaction between researcher and data source. However, a substantial research effort is now under way to develop more effective procedures for controlled access to microdata sets via the Internet.

Although expanded access to research files by a variety of methods is likely, there are some legal obstacles. Both the Census Bureau and NCHS, for example, have concluded that they do not have the legal authority to issue licenses that would allow researchers to use restricted data sets at their own facilities. Some workshop participants suggested that the Census Bureau should make restricted data files of other agencies available at its research data centers. Even if this were done, however, access to files from various agencies at a central point would probably still require different administrative procedures because of differences in the laws governing access to each agency's data. Under interagency agreements, NCES has undertaken distribution of files created by other agencies using licensing arrangements; it has the legal authority to do so as long as the files in question include data relevant to education.

The use of restricted access arrangements, which has been deemed necessary to provide adequate protection for confidential information about individuals and businesses, results in increased costs to conduct research. Custodians of the data files need additional resources to process applications, operate inspection systems, staff research data centers, and inspect outputs to ensure that disclosure does not occur.
Researchers require resources to prepare applications for access, to provide appropriate physical security for the data, or to visit a secure site. At present, these costs are being covered partly by federal agency budgets and partly by user fees. The Census Bureau's research data centers have been supported in part by grants from the National Science Foundation and NIA, but may eventually have to recover more of their costs from users. Several workshop participants suggested that, if possible, graduate students should be exempted from such user fees.

Various restricted access arrangements offer different levels of protection for the confidentiality of individually identifiable information. Researchers working in research data centers are more closely supervised by agency employees than those licensed to work with the data at their own facilities. Although there have been no known disclosures of individual information from NCES data files released under licenses, inspections have turned up numerous violations of the license requirements, such as failure to notify the agency of changes in personnel authorized to use the data. Protection of the data under controlled remote access arrangements depends primarily on the effectiveness of automated screening systems and the vigilance of agency staff responsible for manual reviews of outputs prior to their release to users.

Most arrangements for restricted access are time-limited, and licensees are generally required to return or destroy their files and derived work files containing potentially identifiable records. It was pointed out that such provisions can make it difficult or impossible for other researchers to attempt to replicate research findings, or for either the original or other researchers to pursue leads generated by the initial results.