The Changed Landscape
This chapter details significant changes in the past decade that affect researchers’ access to government microdata: increasing public concern about issues of privacy and confidentiality; society’s increased need for data; changes in information technology; changes in the legal environment; developments in limiting data identifiability; and developments in methods and procedures for restricted access, including research data centers, monitored remote access, and licensing. We end the chapter with a brief comment on the potential for devising new approaches to data access while taking account of these changes.
INCREASING PUBLIC CONCERN
Private Lives and Public Policies noted public concerns about privacy and confidentiality, but did not describe them in any detail. This report, more so than its predecessor, takes account of changing public attitudes about privacy and confidentiality issues as they bear on principles of data collection and data access.
From recent analyses of data on public attitudes (see Chapter 4), two central findings emerge. First, levels of public concern about the intrusiveness of government inquiries and about whether there is or might be unauthorized disclosure of individual data appear to have increased in recent decades. Second, people who are worried about privacy and confidentiality issues are less likely to cooperate with government surveys. Response to the 2000 census strongly confirmed this second finding, as
summarized in Chapter 4 and elsewhere (see National Research Council, 2004b; Hillygus et al., 2006). In addition, there is survey evidence that many members of the public do not believe the government’s pledge that data will be kept confidential. In one survey, less than half of the public said that the promise of census confidentiality could be trusted. Also, nearly as many Americans agree as disagree that census answers can be used against them (Hillygus et al., 2006).
A vivid expression of public concern about the privacy and confidentiality of government statistics emerged in spring 2000, when talk show hosts, editorial pages, late-night comics, and even political leaders attacked the 2000 census long form on grounds that it was too intrusive. Responding to a public outcry, President George W. Bush, then a presidential candidate, said he understood “why people don’t want to give over that information to the government. If I had the long form, I’m not so sure I would do it either” (Prewitt, 2004:1452). The U.S. Senate passed a nonbinding resolution urging that “no American be prosecuted, fined, or in any way harassed by the federal government” for not answering certain questions on the census long form, in effect telling the public it was acceptable to break the law (Prewitt, 2004). Many more census respondents in 2000 than in 1990 answered long-form questions only selectively, leading to unprecedented levels of imputed values for missing responses (National Research Council, 2004b:283-285).
Relevant research does not draw a clear distinction between the effects of privacy concerns and confidentiality concerns on survey response. However, the long-form controversy suggests that it will be useful in future research to determine when respondents are resisting “unwarranted intrusiveness” simply because they do not like particular questions (a privacy concern) and when they are uncooperative because of fears about “unauthorized disclosure” (a confidentiality concern).
The Census Bureau recognized the importance of this distinction when, in its statement of privacy principles for the general public, first developed in 2000, it acknowledged the importance of balancing the need for statistical information with a respect for individual privacy. The Census Bureau now offers a principle, titled “respectful treatment of respondents,” under which it offers two promises (for voluntary surveys): “we promise to respect your right to refuse to answer any specific questions or participate in the survey” and “we promise to set reasonable limits on our efforts to obtain completed questionnaires and will restrict the number of follow-up contacts we conduct” (www.census.gov/privacy/files/data_protection/002822.html).
This newly articulated principle on the part of the Census Bureau is indicative of how much has changed in the few years since Private Lives and Public Policies was issued. Few people would have suggested in 1993
that the Census Bureau, which has built its reputation on persistence in getting complete answers from nearly 100 percent of its sample respondents, would only a decade later offer a principle that seems to contradict long-established practices.
SOCIETY’S INCREASED NEED FOR DATA
Public policies very often focus on population groups defined in terms of one or more characteristics: low-income families, veterans, Medicare patients, preschool children, software engineers, drug addicts, homeowners, to name a few from a long list. Policy design proceeds on the basis of knowing how many people are in these groups, how they are geographically distributed, and how they differ in characteristics. Other public policies focus on establishments: small businesses, public schools, military bases, banks, hospitals, prisons, insurance firms, and so on—all entities that are subject to statistical measurement and for which detailed information is required in order to inform policy choices.
Complex policy-making requires multivariate causal thinking about policy alternatives, which, in turn, requires complex, multivariate, often longitudinal data. For example, how will changing the age of eligibility for Social Security affect retirement decisions across different occupations and regions of the country? Over what time-frame (if at all) does Head Start close the educational gap between children from disadvantaged families and children from better-off families? At what levels of state and local unemployment will single mothers in welfare-to-work programs find secure jobs, and what will be the consequences for their children?
As the economy grows more complex and the population becomes more diverse, increasingly detailed data and data analyses are required for policies to match well with economic and demographic realities. This is true not only for policy making, but also for policy assessment and evaluation.1 A nation learns how well policies are working by comparing their intended effects with the actual outcomes. These comparisons draw on statistical information. One example makes this obvious point. Congress recently debated whether undocumented college-age students who have lived in the country for at least 5 years and performed well in secondary schools should be eligible for in-state tuition if they enroll in a public university in their state of residence (American Association of State Colleges and Universities, 2003). On one side of the debate are people
who assert that rewarding undocumented people in this fashion will be an incentive for increased illegal immigration. On the other side are people who assert that there will be long-term economic contributions to society if this group attends colleges and universities. Either, neither, or both of these assertions could turn out to be correct. Evaluating these alternative assertions appropriately requires an exercise in data-based policy analysis.
In addition to federal, state, and local government policy uses of statistical information, commentators on American democracy—starting with George Washington and Thomas Jefferson—have repeatedly stressed that an uninformed public is incompatible with preserving democratic principles and practices. Just as statistical data are used by political leaders to design policy, they are used by the electorate to assess how well those policies have worked and thus to rate the effectiveness of the government, especially when data are available on trends. What are often referred to as social and economic indicators play an important role in democratic accountability, as the public gauges the quality of public life by taking note of what is trending up and what is trending down. Crime rates, air quality, access to health care, charitable giving, education levels, homeownership, out-of-wedlock births, voter participation, and unemployment are illustrative of features of our society that are given public visibility in the nation’s official statistics. This public visibility strengthens democratic practice.
The private sector is no less in need of detailed information on the American population. Citing only data from the decennial census, a recent National Research Council report (2004b:66) observed:
Retail establishments and restaurants, banks and other financial institutions, media and advertising firms, insurance companies, utility companies, health care providers, and many other segments of the business world use census long-form-sample data, as do non-profit organizations. An entire industry developed after the 1970 census to repackage census data, use the data to develop lifestyle cluster systems (neighborhood types that correlate with consumer behavior and provide a convenient way of condensing and applying census data), and supply address coding and computer mapping services to make the data more useful.
The most important sources of the information used for policy design and evaluation and the other purposes described above are the more than 70 federal agencies that carry out statistical activities of at least $500,000 per year (U.S. Office of Management and Budget, 2004). These agencies include statistical agencies, such as the Bureau of Labor Statistics (BLS), the National Centers for Education and Health Statistics (NCES and NCHS), and the U.S. Census Bureau. They also include research funding and analysis agencies, such as the Agency for Healthcare Research and Quality, the National Institute on Aging, the National Institute of Child Health and Human Development, and the National Science Foundation (NSF). In fiscal 2004 these agencies were authorized to spend an estimated $4.8 billion for statistical programs to serve the nation’s information needs (U.S. Office of Management and Budget, 2004).
Over a 10-year period, the 2000 census alone cost more than $6.5 billion. This seemingly very high cost pales in comparison with the value of the many uses to which census data are put. The Constitution requires that seats in the U.S. House of Representatives be allocated in proportion to population, for which it mandates an enumeration of the population every 10 years. Two other major uses are redistricting and fund allocations. Congressional and state and local legislative districts are drawn on the basis of census counts for small geographic areas. Currently, federal agencies allocate more than $200 billion in federal funds each year to states and other areas by formulas that, directly or indirectly, depend on census data (National Research Council, 2004b:Ch. 2). Across the decade, about $2 trillion in federal funds thus depend on census data, so that the investment in the census represents only about 0.3 percent of the federal funding based on census results. And this calculation does not take into account state and local expenditures, or the huge business investments in marketing, labor practices, and manufacturing and retail location decisions that rely on census data.
Similarly, other statistical programs provide data that serve many purposes and are collected at a fraction of the dollars at stake in the decisions made on the basis of analyses with those data. For example, the National Assessment of Educational Progress (NAEP), conducted by the NCES, has made and continues to make major contributions to analysis and evaluation of the effectiveness of the nation’s elementary and secondary education policies at the federal, state, and local levels (see nces.ed.gov/nationsreportcard).
Society’s increased needs for data and their intelligent analysis have been matched by a significant expansion of the analytic capacity found in the nation’s universities, policy organizations, corporations, and advocacy groups. The government frequently turns to this private-sector-based analytic capacity—especially to university researchers and analysts in specialized private research institutions—to carry out policy analyses and basic research using government-collected data, particularly microdata on individual units. This practice reflects a recognition that federal research and evaluation agencies are not funded to take full advantage of the public investment in the collection of the original data on their own. Statistical agencies, moreover, may avoid policy-oriented data analysis so as not to impair their credibility, which is based in part on the public’s perception of their objectivity (National Research Council, 2005).
An effective public-private partnership between data collection agencies and the research community is a critical element in bringing analyses of complex data, particularly microdata, to bear on policy design and assessment. The partnership is strengthened by continuous improvements in data access, both through public-use data sets and through restricted data access modalities (see below and Chapter 5).
CHANGES IN INFORMATION TECHNOLOGY
The information world now functions through extraordinarily complex networks of humans and computers, capturing enormous numbers of records of personal and organizational information, storing them in data warehouses with capacities in petabytes, analyzing them through sophisticated statistical and data mining tools, and disseminating the results through ever more capable communications media. This explosion in the capability of information technology is evident at each stage of the process of data capture, storage, integration, and dissemination by public and private entities (Duncan, 2004).
As recently as 1993, when Private Lives and Public Policies was published, easy access to complex computerized databases produced by federal statistical and other data collection agencies was only a design goal. Now the Web provides access to vast arrays of information from a desktop. For example, through www.census.gov, there is ready access to tables and maps of 2000 census data for all geographies to the block level, as well as access to complex microdata sets through extraction and downloading tools. Other statistical and research agencies also offer Web access to detailed aggregate and microdata.2
2. The Web address for access to all statistical agencies is Fedstats.gov.

One useful measure of the extent of this enhanced information technology is reduced cost. In each of the four stages of the process of data capture, storage, integration, and dissemination, advances in information technology have pushed costs lower. Although the picture is complicated by demographic and social factors that drive costs up (such as the declining willingness of the public to respond to telephone surveys), in many ways the cost of obtaining data is much less today than it was 10 years ago. Electronic data capture techniques—such as scanning and computer-assisted interviewing, surveillance by video cameras in buildings and on streets, and satellite imaging—have become commonplace and readily available at moderate cost. Similarly, terabytes of data storage can be purchased for very little: one terabyte of storage can hold the scanned contents of 2,000 file cabinets of documents. Ten years ago, such storage would have cost $1 million; about 5 years later the cost was less than $800 (Gilheany, 2000). More recently, Hayes (2002:Fig. 4) estimated the cost of a megabyte at a few tenths of a cent, and Rhea and his colleagues (2003) estimated that the cost of a terabyte will be $100 in 2006. Data integration—that is, consolidating information from heterogeneous databases—is no longer a horrendously complex task, but one that is facilitated by data standards (such as XML), the growth of the Web, and fast and inexpensive data transmission capability. Correspondingly, data dissemination through the Web and electronic mail is now free for all practical purposes.
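The per-unit prices quoted above are easier to compare after a unit conversion. The following back-of-envelope sketch assumes decimal storage units (1 terabyte = 1,000,000 megabytes); the figures are only the rough numbers cited in the text.

```python
# Back-of-envelope conversion between the storage prices quoted above.
# Decimal units are assumed: 1 terabyte = 1,000,000 megabytes.
MB_PER_TB = 1_000_000

def cost_per_tb(cents_per_mb):
    """Dollars per terabyte, given a price in cents per megabyte."""
    return cents_per_mb * MB_PER_TB / 100  # 100 cents per dollar

# Hayes's (2002) "few tenths of a cent" per megabyte, taking 0.3 cents:
print(cost_per_tb(0.3))        # roughly $3,000 per terabyte

# Conversely, the projected $100 per terabyte (Rhea et al., 2003)
# corresponds to about 0.01 cents per megabyte:
print(100 * 100 / MB_PER_TB)
```

The two estimates thus differ by more than an order of magnitude, which is consistent with the rapid price declines the paragraph describes.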
Lowered cost at each stage of the data process provides benefits to researchers. With lowered costs, researchers and policy analysts can enjoy the prospect of being able to work with new data sources, to use historical records that are maintained indefinitely in user-friendly formats, to create rich contextual databases with relevant attributes, and to disseminate their results quickly throughout the world.3
Lowered cost also provides opportunities for “data snoopers,” by which we mean individuals or organizations that attempt to identify respondents for purposes that range from curiosity to marketing to pinpointing individuals who may have committed a crime or who may constitute a terrorism risk. In contrast, researchers are not interested in individuals as such, but only in the answers to research or policy questions that can be obtained by analyzing aggregations of individuals’ attributes.
Yet the data that are most useful to legitimate researchers typically have characteristics that pose substantial risk of disclosure. Some data characteristics that create vulnerability include:
• detailed geographic information;
• repeated data collection from the same subjects;
• outliers, such as people with very high incomes;
• many attribute variables; and
• complete census data rather than a survey of a small sample of the population.
Data with geographic detail, such as census block data, may easily be linked to known characteristics of respondents, unless steps are taken to alter or mask the data. Longitudinal data obtained in panel surveys, which track entities over time, pose substantial disclosure risk—both because identifiers must be retained by the collection agency in order to recontact respondents and because longitudinal data typically include many more attributes than do one-time surveys. In general, data files containing many attribute variables permit easier linkage with known attributes of identified entities. This problem is magnified when social survey data are linked to uniquely identifying information, such as genetic data. Furthermore, data from a census or near census pose more disclosure risk for some kinds of data snooping than data from a survey having a small sampling fraction: there is little likelihood, for example, that a record from a small sample survey that has some attributes in common with an identified record from an administrative source is actually unique in the population.

3. One such historical source is the Integrated Public Use Microdata Series (IPUMS), which contains individual records from the U.S. censuses from 1850 through 2000 (see Ruggles, 2000). IPUMS is available on-line at the University of Minnesota (www.ipums.umn.edu) with funding from the NSF and the National Institutes of Health.
The risk of disclosure has been significantly increased because of the ready availability to data snoopers of external databases on the Web (see Sweeney, 2001). These databases identify persons or other entities by name and location and share with statistical data certain attribute variables that may permit matching the anonymized statistical data with identified data (as with marketing databases). Moreover, software for matching is widely available and easily used. Thus, would-be data snoopers now have a treasure trove of potential methods of infringing on the privacy and confidentiality of subjects of statistical surveys by matching information across databases (Winkler, 1998).
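The matching attack described above can be sketched in a few lines of Python. All names, attribute variables, and records below are hypothetical, and a real attack would typically use approximate rather than exact matching (Winkler, 1998).

```python
# Illustrative sketch (hypothetical data): how attribute variables shared
# between an anonymized statistical file and an identified external
# database allow a "data snooper" to re-identify a respondent.

# De-identified statistical file: no names, but quasi-identifiers remain.
statistical_file = [
    {"zip": "60607", "birth_year": 1948, "sex": "F", "diagnosis": "asthma"},
    {"zip": "60622", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
]

# Identified external database (e.g., a marketing or voter list).
external_db = [
    {"name": "A. Jones", "zip": "60607", "birth_year": 1948, "sex": "F"},
    {"name": "B. Smith", "zip": "60609", "birth_year": 1980, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def link(stat_records, identified_records):
    """Pair up records that agree on every quasi-identifier."""
    matches = []
    for s in stat_records:
        key = tuple(s[q] for q in QUASI_IDENTIFIERS)
        for e in identified_records:
            if tuple(e[q] for q in QUASI_IDENTIFIERS) == key:
                matches.append({"name": e["name"], "diagnosis": s["diagnosis"]})
    return matches

# A unique joint match attaches a name to an "anonymous" record.
print(link(statistical_file, external_db))
```

When only one record in the external database agrees with a statistical record on all quasi-identifiers, the join attaches a name to the supposedly anonymous record; the disclosure limitation techniques discussed later in this chapter are designed to defeat exactly this kind of matching.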
CHANGES IN THE LEGAL ENVIRONMENT
For the reasons outlined above, statistical agencies face increased tension as they try to respond to public policy and research needs for data while protecting the confidentiality of the underlying information. In addition, they are expected to maintain both the high quality of the data they produce and their role at the forefront of scientific data collection (see Groves and Lepkowski, 1985). At their disposal they have an array of legal as well as technical solutions. This legal environment has changed in important ways during the last decade.
Recently, federal statistical data have received broad new statutory protections against traditional threats to confidentiality, but they may be increasingly vulnerable to new threats from statutes intended to enhance national security and government accountability (see National Research Council, 2005:35, App. B). In 2002, Subtitle A of the Confidential Information Protection and Statistical Efficiency Act (CIPSEA) established minimum standards for protection of information gathered by a federal agency for a statistical research purpose under a promise of confidentiality. Such
information may not be disclosed in identifiable form for nonresearch purposes without the consent of the respondent: nonresearch purposes include administrative determinations, law enforcement investigations, and adjudicatory proceedings. CIPSEA thereby provides statutory protection to the many statistical agencies that previously had only custom or other nonstatutory authority to back up pledges of confidentiality. The obligation to protect research data extends beyond federal agency personnel to include those who contract with the agency to provide statistical research services, such as conducting survey interviews or preparing data products. The U.S. Office of Management and Budget (OMB) is developing guidance for the implementation of Subtitle A of CIPSEA, but such guidance has not yet been issued for public comment.
Subtitle B of CIPSEA permits identifiable business records to be shared for “statistical” purposes by the U.S. Census Bureau, Bureau of Economic Analysis (BEA), and Bureau of Labor Statistics (BLS), subject to written agreements that specify the nature of the records, the statistical purposes, and the procedures governing access and security. Such data sharing, which fully maintains confidentiality protection, can support significant improvements in the nation’s ability to obtain high-quality data on business formation, internationalization of employment, and other critically important issues for economic policy. A key element in the Census Bureau’s data is its business register, which is constructed with data from the Internal Revenue Service (IRS). However, without new legislation (to amend Title 26 of the U.S. Code, which governs access to IRS tax data), the business register and associated data cannot be shared with BEA and BLS.
Medical records, including those used for research, are subject to new confidentiality regulations under the Health Insurance Portability and Accountability Act (HIPAA) of 1996 (P.L. 104-191). Researchers who rely on information collected by health care providers must comply with strict requirements governing the use and disclosure of health care information (see National Research Council, 2003b:117-118). Identifiable medical information may be disclosed for research purposes only with the written consent of the person providing the information or in a limited set of circumstances in which an institutional review board determines that the identifiable medical information is essential to the conduct of the research and the disclosure presents minimal risk to the individual. The researchers must protect identifiable information from improper disclosure and destroy the identifiers at the earliest opportunity consistent with the conduct of the research (45 C.F.R. § 164.512).4
At the same time, new challenges to confidentiality of research records have arisen. As noted in Chapter 1, the USA Patriot Act of 2001 (P.L. 107-56) overturned the strict confidentiality protection of education records gathered and maintained by the NCES, a change in protection that was later reflected in corresponding amendments to the statute governing NCES. The USA Patriot Act allows the Attorney General to petition a court for access to identifiable education records, including those from research, for use in the investigation and prosecution of terrorist activities.
Access to federal research information for nonresearch purposes is also permitted by the Shelby Amendment to the Omnibus Consolidated and Emergency Supplemental Appropriations Act for Fiscal 1999 (P.L. 105-277), which requires OMB to set forth regulations to ensure that all data that are supported by a federal grant to colleges, universities, hospitals, and other nonprofit institutions “will be made available to the public through procedures established under the Freedom of Information Act.” The resulting OMB guidelines restricted access to data under the Shelby Amendment to published or cited research that has been used by the federal government to develop legally binding regulations and rulings, and noted the exemptions to access under the Freedom of Information Act for information that would result in a “clearly unwarranted invasion of personal privacy, such as records that could be used to identify a particular person in a research study” (for a detailed discussion of this issue, see National Research Council, 2003a). CIPSEA strengthens such an interpretation by prohibiting disclosure of confidential information under the Freedom of Information Act. The validity of the OMB guidelines and the effect of the CIPSEA restrictions are the subject of some dispute and have yet to be tested through litigation.
Federal statistical agencies also confront increased scrutiny about the quality of information that they disseminate to the public, even if the data have not been used as part of the regulatory process. The Information Quality Act (also known as the Data Quality Act, P.L. 106-554, § 515, enacted in 2000 as an amendment to an appropriations bill) directs OMB to issue guidelines for “ensuring and maximizing the quality, objectivity, utility, and integrity of information disseminated … by federal agencies” to the public, and requires all federal agencies to establish administrative procedures to allow affected parties to obtain correction of information disseminated by an agency that does not meet those standards. The resulting OMB guidelines (U.S. Office of Management and Budget, 2005) define “scientific information” to include “any communication or representation of knowledge such as facts or data, in any medium or form.” “Dissemination” is defined as “agency initiated or sponsored distribution of information to the public.” Taken together, these definitions would extend the regulations to include agency distribution of
public-use and restricted-use statistical data sets. Agencies must meet higher information quality standards for distribution of “influential scientific information,” which is defined as information reasonably expected to “have a clear and substantial impact on important public policies or important private sector decisions.” Agencies that disseminate influential scientific information must conduct a peer review prior to dissemination and reveal the data and methods used to generate the scientific information to the extent necessary to facilitate independent reanalysis, while taking into account the privacy, confidentiality, and intellectual property rights of those who are the focus of the data.5 The act has raised concern among some researchers that those opposed to certain agency policy initiatives may challenge the findings and quality of research data as a means of impeding agency regulatory activities. In 2003, some 19 agencies received requests for data correction under this act.6
Clearly, the courts can have difficulty in balancing individual privacy against a right to public access. For example, in a recent case (Southern Illinoisan v. Dept. of Publ. Health, N.E.2d, 2004 WL 1303656 [Ill. App. 5 Dist. June 9, 2004]), a newspaper sought release under the state freedom of information act of cancer registry records for people with a rare form of childhood cancer. The state department of public health opposed release of the information, which included zip code of residence, pointing to another state statute that prohibits the public inspection or dissemination of any group of facts that tends to lead to the identity of any person whose condition or treatment information is submitted to the registry. The agency then supported its claim that release would result in inadvertent disclosure of identifiable information by offering expert testimony in which the expert linked most of the records in a test sample to other data and identified the patients. The court dismissed this demonstration, responding that such identification by one expert was not proof that the records could be readily identified, and ordered their release.
DEVELOPMENTS IN LIMITING DATA IDENTIFIABILITY
The confidentiality of individual data may be protected either by restricting researcher access to such data (restricted access) or by various alterations that limit the identifiability of the data and hence permit them to be made publicly available (restricted data). Statistical agencies and their contractors currently use both methods (see Cohen and Hadden, 2004).

5. A National Academies workshop in 2003, “Peer Review Standards for Regulatory Science and Technical Information,” also explored this issue; for a transcript, see www7.nationalacademies.org/stl [June 2005].

6. See www.whitehouse.gov/omb/inforeg/2005_ch/draft_2005_cb_report.pdf [June 2005].
Historically, beginning with the development of public-use summary files and microdata samples from the 1960 and 1970 censuses, restricted data have represented the most widely used method for facilitating researcher access to complex data (see Dunton, 2000; Gaquin, 2000a, 2000b). Restricted data products play an especially important role in providing research access to data because such products are available to all researchers—both inside and outside the government—for critical assessment and alternative analyses. Restricted data may be in the form of microdata files, which contain transformed or imputed attribute values for a sample of individuals (such as age, sex, race, income, occupation, labor force history) or organizational entities. Restricted data may also be in the form of tabular arrays, such as cross-classifications of income and education for geographic areas.
At present, a wide array of summary and microdata sets are available in public-use form. Reflecting the advances in Internet capabilities, access to such data sets is increasingly provided on-line. Examples include:
• the Census Bureau’s American FactFinder for geographic area tabulations from the census and American Community Survey (at www.census.gov) (see Hawala, Zayatz, and Rowland, 2004);
• the NSF’s on-line tabulation systems for data on science and engineering personnel and resources (www.nsf.gov/statistics/databases.cfm);
• the DataFerrett System of the Census Bureau, which enables researchers to extract and analyze such complex public-use microdata sets as the Census Bureau’s Current Population Survey and Survey of Income and Program Participation, and microdata sets from the NCHS, including the National Health Interview Survey and the National Health and Nutrition Examination Survey (dataferrett.census.gov); and
• the Online Data Analysis System of the National Archive of Criminal Justice Data, which uses software developed by the Computer-Assisted Survey Methods Program at the University of California, Berkeley: the archive contains a wide range of federal, state, and local criminal justice data sets and is maintained by the University of Michigan Inter-university Consortium for Political and Social Research, with funding from the Bureau of Justice Statistics and the National Institute of Justice (www.icpsr.umich.edu/NACJD).
Although many types of public-use data continue to be available, statistical agencies (in response to the increased threats to confidentiality protection noted above) have curtailed the availability of some data that were previously included in public-use files. For example, public-use microdata samples (PUMS files) from the 2000 census were somewhat more restricted in data content than PUMS files from previous censuses, and, because of state laws, the NCHS no longer makes publicly available linked files of microdata from the National Health and Nutrition Examination Survey and mortality records. (Public-use files of business microdata have never been created because of the relative ease of identifying individual firms.)
Restricted data are created through disclosure limitation techniques, which involve either transforming the original data, a process called masking, or using the original data to guide the generation of synthetic or virtual data through the statistical technique of multiple imputation. Initially, relatively simple masking techniques, such as top coding income amounts (that is, assigning all income amounts above a certain value to a single category), were used to generate restricted data products. During the last decade, the increasing risks of confidentiality breaches have led researchers to develop increasingly sophisticated methodologies for restricted data products (see, e.g., Doyle et al., 2001; Singh, Yu, and Dunteman, 2003).
As developed in Duncan and Pearson (1991), masking may involve coarsening of the data through various forms of data recoding. Attributes may be deleted or combined, or attribute values may be grouped into categories (or bins), such as broad intervals for asset values. Masking may also involve perturbing the data, for example, by adding random noise or stochastically misclassifying certain entities in a table or by swapping selected data among individual records.
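These masking operations can be sketched as follows; the bin width, noise scale, and pairing rule are illustrative choices, not production parameters used by any agency:

```python
import random

random.seed(0)

def coarsen(value, width=25_000):
    """Recode a dollar amount into a broad interval (its bin's lower bound)."""
    return (value // width) * width

def add_noise(values, scale=1_000.0):
    """Perturb each value with zero-mean Gaussian noise (illustrative scale)."""
    return [v + random.gauss(0.0, scale) for v in values]

def swap_pairs(records, field):
    """Swap `field` between randomly paired records: a simple data swap.

    The marginal distribution of `field` is preserved exactly, but its
    linkage to the other attributes on each record is broken.
    """
    swapped = [dict(r) for r in records]
    idx = list(range(len(swapped)))
    random.shuffle(idx)
    for a, b in zip(idx[::2], idx[1::2]):
        swapped[a][field], swapped[b][field] = swapped[b][field], swapped[a][field]
    return swapped
```

Note the trade-off each function embodies: coarsening and noise degrade attribute precision, while swapping preserves univariate distributions at the cost of cross-attribute relationships.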
The Research Triangle Institute has developed a set of procedures, known as micro-agglomeration with substitution, subsampling, and calibration (MASSC), as a sophisticated approach to masking (see Singh, Yu, and Dunteman, 2003). Substitution refers to perturbation of data fields, and subsampling refers to suppression of individual records. MASSC was originally developed to create public-use files for the National Survey on Drug Use and Health, sponsored by the U.S. Substance Abuse and Mental Health Services Administration. An important design goal was to protect against breaches of confidentiality by individuals who know someone in the survey—such as a parent who provided consent for a child to participate. MASSC is able to provide measures of disclosure risk and information loss for a particular application.
The development of a methodology for generating synthetic or virtual data is a relatively recent activity (Rubin, 1993). A key objective of the method is to preserve faithful representations of the original data so that inferences from the synthetic data are as consistent as possible with the inferences that would be drawn from the original data. The method is akin to creating multiple samples from the true population. Estimates
from any one simulated data set are unlikely to equal those from the observed data, but by combining estimates from several such samples (hence, “multiple imputation”), it is possible to estimate the true value, as well as the amount of variation produced by three sources of error: sampling the collected data, sampling the synthetic units from the population, and generating values for those synthetic units (Reiter, 2003; Raghunathan, Reiter, and Rubin, 2003). Though promising, the value and usefulness of synthetic data for inferential analysis have not yet been fully studied or determined.
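The mechanics can be illustrated with a toy example, assuming a simple lognormal income model as the synthesizer (actual synthesizers are far richer than this sketch):

```python
import math
import random
import statistics

random.seed(0)

# "Confidential" data: a toy lognormal income sample (not real agency data).
observed = [random.lognormvariate(10.0, 0.5) for _ in range(1000)]

# Fit a simple parametric synthesizer to the observed data.
logs = [math.log(x) for x in observed]
mu_hat = statistics.fmean(logs)
sigma_hat = statistics.stdev(logs)

# Release m synthetic replicates drawn from the fitted model; an analyst
# computes the same statistic on each replicate and combines across them.
m = 5
estimates = []
for _ in range(m):
    synthetic = [random.lognormvariate(mu_hat, sigma_hat) for _ in range(1000)]
    estimates.append(statistics.median(synthetic))

q_bar = statistics.fmean(estimates)   # combined point estimate
b = statistics.variance(estimates)    # between-replicate variability
# The combining rules in Reiter (2003) and Raghunathan, Reiter, and Rubin
# (2003) use q_bar and b (plus within-replicate variances) to form valid
# inferences; those formulas are omitted from this sketch.
```

No single replicate's median matches the confidential data's median, but the average across replicates tracks it, while the spread of the estimates feeds the variance accounting described above.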
In developing restricted data, researchers are paying increasing attention to methods for systematically analyzing the joint impact of various disclosure limitation techniques on disclosure risk and data utility (e.g., Abowd and Woodcock, 2001). A general framework for such an analysis is the risk-utility (R-U) confidentiality map (Duncan, Keller-McNulty, and Stokes, 2003), which incorporates quantified measures of disclosure risk as well as measures of data utility (Duncan and Lambert, 1986, 1989; Lambert, 1993; Reiter, 2003). Nonetheless, more research is clearly needed to assess the relative ability of different masking methods, and of synthetic data, to reduce the risk of disclosure while preserving data utility.
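The intuition behind an R-U map can be sketched with a toy sweep over noise levels; the risk and utility measures below are illustrative stand-ins, not the formal measures developed in the cited papers:

```python
import random

random.seed(1)

# A toy confidential variable (hypothetical incomes).
true_vals = [random.gauss(50_000, 12_000) for _ in range(500)]

def risk(masked, original, tol=500.0):
    """Illustrative risk proxy: the share of records whose masked value
    remains within `tol` of the original (a stand-in for re-identification
    chance)."""
    return sum(abs(m - o) <= tol for m, o in zip(masked, original)) / len(original)

def utility_loss(masked, original):
    """Illustrative utility-loss proxy: absolute error in the released mean."""
    return abs(sum(masked) / len(masked) - sum(original) / len(original))

# One (risk, utility-loss) point per noise level traces out an R-U curve:
# more noise pushes disclosure risk down and utility loss up.
ru_points = []
for scale in (0.0, 1_000.0, 5_000.0):
    masked = [v + random.gauss(0.0, scale) for v in true_vals]
    ru_points.append((risk(masked, true_vals), utility_loss(masked, true_vals)))
```

Plotting such points for competing masking methods lets an agency compare them on both axes at once, which is precisely the purpose of the R-U confidentiality map.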
DEVELOPMENTS IN METHODS AND PROCEDURES FOR RESTRICTED ACCESS
Just as important developments have been under way since the early 1990s in methods for producing restricted data products that can preserve both data confidentiality and utility, so, too, has there been substantial expansion in the repertoire of modes for facilitating restricted access to confidential microdata. (Confidential information refers to any identifiable information, regardless of whether direct identifiers, such as name or address, have been removed from the record.) All such restricted access methods are designed to provide the researcher with data not subjected to the perturbations—variable suppression, top and bottom coding, rounding, swapping, random noise, etc.—found in microdata provided in public-use files.
For several decades, major statistical agencies, including BLS, the Science Resources Statistics Division of NSF, and the Census Bureau, have sponsored fellowship programs through the American Statistical Association and the NSF for researchers to work with confidential data at the agency’s site.7 These programs have been invaluable to the agencies in
obtaining significant commitments of researchers’ time to working with an agency’s key data sets; however, they provide such access to only a handful of researchers each year. The last decade has seen the development of distributed research data center programs, monitored remote access systems, and licensing to encourage many more researchers to work with confidential data.
Research Data Centers
Stimulated by researchers’ interests in analyzing detailed data on business organizations for which it is very difficult to create useful public-use microdata products, the Census Bureau’s Center for Economic Studies worked with researchers to set up two research data centers (RDCs) in 1994. The Census Bureau currently sponsors eight RDCs, with a ninth scheduled to open in late 2005. These RDCs are operated and funded in a variety of ways: one is run by the Census Bureau itself at its Washington-area headquarters; several are run by consortia of research institutions that include universities, not-for-profit organizations, and government agencies; and several are run by a single university with financial assistance from other sponsors, including the Census Bureau (see www.ces.census.gov/ces.php/rdc).8 The NCHS established an RDC at its headquarters in Hyattsville, Maryland, in 1998 (see www.cdc.gov/nchs/r&d/rdc.htm), and the Agency for Healthcare Research and Quality opened an RDC at its headquarters in Rockville, Maryland, in 2004 for analysis of confidential information from the Medical Expenditure Panel Survey (see meps.ahrq.gov/DataCenter.htm). BLS maintains three separate RDCs at its headquarters in Washington, D.C., for research using confidential data maintained by the Office of Employment and Unemployment, the Office of Compensation and Working Conditions, and the Office of Prices and Living Conditions (www.bls.gov/bls/blsresda.htm). The approval of the Census Bureau is required for access to some BLS data sets for which the Census Bureau is the data collection agent.
The Census Bureau states that the purpose of its RDCs is to increase the utility and quality of Census Bureau data products by providing confidential microdata to qualified researchers under conditions that do not pose unacceptable disclosure risks. The NCHS offers a similar rationale. To access data through an RDC, qualified researchers submit proposals that are reviewed for feasibility, scientific merit, consistency with the agency’s mission, and conformity with confidentiality protection protocols. Work is done on a secure site, with secured computers, and under the supervision of agency staff. All outputs are subject to disclosure review. In addition, researchers using the NCHS site sign a confidentiality protocol. To use a Census Bureau site, researchers obtain a special sworn status as a Census Bureau employee. Violation of the terms of that status subjects the researcher to the same legal penalties as Census Bureau employees: for disclosure of confidential data, a fine of up to $250,000, imprisonment for up to 5 years, or both.
The rules governing use of the research data centers are more constraining than those encountered in most research settings. Many researchers are willing to accept those constraints because the RDCs provide unique research opportunities that require access to confidential microdata records. At the Census Bureau sites, for example, researchers have access not only to microdata sets on business establishments, but also to demographic microdata sets (including versions of the Current Population Survey and the Survey of Income and Program Participation) with more geographic and socioeconomic detail than is made publicly available and to linkages of population and economic census, survey, and administrative records data (such as the data sets being assembled by the Longitudinal Employer-Household Dynamics Program). At the NCHS site, researchers can merge that agency’s data with their own data files.
The development of the RDC concept—particularly the concept underlying the Census Bureau’s program of distributing RDCs around the country—is an important step in providing access to microdata sets that pose particularly difficult challenges of data protection. Hildreth (2003:5) claims that the Census Bureau’s “RDC network has … led to some of the most innovative social science research currently being undertaken.” To establish and operate an RDC, however, is costly for a statistical agency and, in the case of a distributed RDC, for the host institution (see Hildreth, 2003). Indeed, the RDC once located at Carnegie Mellon University was closed because the university was unwilling to continue the required level of financial support.
To recover their operating costs, RDCs must attract a sizable clientele. However, the experience to date with the Census Bureau’s RDC network is that the long and arduous approval process may be deterring many researchers from applying, particularly graduate students and junior researchers who cannot afford to lose much time before beginning their work. The amount of time varies by project: an economics data project, for example, takes an average of 7 months for approval, not including proposals that require revision and resubmission (Hildreth, 2003:19). Part of the delay for many projects is the necessity to obtain the approval of another agency, such as the Internal Revenue Service (IRS), for the Census
Bureau’s business establishment files (which include information from tax returns).
In addition, in response to a 1999 IRS review of the Census Bureau’s protocols for confidentiality protection, the Census Bureau specified that the predominant purpose of research using data sets that fall under Title 13 (which governs the Census Bureau’s operations) must be to benefit Census Bureau programs. This strict interpretation of Title 13, which applies to most of the demographic and economic data provided through the Census Bureau’s RDCs, may deter research that would use Title 13 data in important ways but does not meet the strict criteria for approval.
Hildreth (2003) contends that the criteria and time for approval and the direct costs to the researcher associated with use of an RDC have led to their underutilization. He notes the common view among RDC directors and researchers that the use of restricted data sets, specifically the Longitudinal Research Database and other Census Bureau data, has declined. Recognizing these issues, the Census Bureau has recently indicated its intention to consider ways to streamline the RDC process and to explore the addition of data sets from other statistical agencies to its RDC network, which would increase their attractiveness to researchers.9
Monitored Remote Access
Monitored remote access to confidential data is currently implemented in four federal statistical agencies:
the NCES, which permits access to a range of education files containing confidential data using the NCES Data Access System (nces.ed.gov/das);
the NCHS, which permits access to almost all of the surveys sponsored by NCHS, including geographic and other detail not contained in public-use data products, through remote access to its research data center;10
the U.S. Census Bureau, which permits users to develop their own tabulations from the 2000 census basic records using the Advanced Data Query System (advancedquery.census.gov); and
the Economic Research Service (ERS) in the U.S. Department of Agriculture, which recently inaugurated a remote system for statistical
analysis of microdata from the Agricultural Resource Management Survey (ARMS).11
The pioneer of monitored remote access is the Luxembourg Income Study (LIS) in Luxembourg, which makes microdata from 66 household income surveys available to researchers from 25 participating countries. LIS began in 1983; its software allows users to submit their own programs using standard statistical software. Output is monitored both electronically and manually to protect confidentiality. The LIS software is also used to provide access to the Luxembourg Employment Study, the German Socio-Economic Panel, and EUROSTAT data (see Rowland, 2003). As described by Rowland (2003:4):
Remote access systems make it possible for users to analyze restricted microdata without visiting an RDC. The systems used for remote access to restricted microdata are monitored automatically and/or manually for disclosure avoidance. They employ automated and manual filters that block certain kinds of queries and results. The files available are usually edited for disclosure avoidance using the same techniques as those used for public use files. They provide more detail to researchers than public use files, but less detail than is usually available in an RDC. The files reside in the [federal statistical agencies] and extracts of microdata and direct access to the records are not permitted.
Unlike the NCES remote access system, the NCHS system allows users to submit SAS (statistical analysis system) programs by e-mail to produce most kinds of output supported by SAS. Output is returned within a few hours of submission (during work hours). Researchers must obtain approval from NCHS for the proposed analysis, sign an affidavit of confidentiality protection, and pay a minimum processing fee of $500 per month, or they can pay $500 per year for selected files that have been developed for repeat and multiple users (for the details of the NCHS policy, see www.cdc.gov/nchs/r&d/rdcfr.htm; Institute of Medicine, 2005). The most used data file is the National Survey of Family Growth, which contains geographic and other detail not available in the public-use format.
ARMS obtains information from farm households on income, assets, and selected crop practices. Access to the remote system requires a memorandum of understanding for research purposes between ERS and the research institution, an approved research project agreement, and a confidentiality agreement with the National Agricultural Statistics Service. The system software monitors data output for confidentiality protection (see arms.ers.usda.gov).
The Census Bureau’s Advanced Data Query System enables users to develop their own tables from the full 2000 census complete count and long-form sample records. Tables must be for standard census geographic areas: to protect confidentiality, data cannot be obtained for city blocks or block groups, unlike the prespecified tables that are available from the American FactFinder. Users must register and log in with a user identification and password; there are no processing costs. As of 2003, more than 500 users were registered to use the Advanced Query System; in comparison, the NCHS remote access system had about 45 users in the period 1998-2003 (Rowland, 2003:15,20).
Monitored remote access has the advantage that a researcher does not have to go to an RDC to make use of confidential data and, in the case of the NCES and Census Bureau systems, output is returned quickly. However, output in those systems is limited to tables. The NCHS system provides more output choices, but it has waiting periods to obtain output. With regard to the efficacy of the disclosure review systems, evaluation (see Rowland, 2003) suggests that they protect well against direct disclosure but not against complementary disclosure (that is, disclosure of the information in a table cell by manipulating other cells).
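Complementary disclosure is easy to illustrate with a toy table (the numbers are invented): suppressing a single sensitive cell offers no protection if the published margins allow an intruder to recover it by subtraction.

```python
# A published row of counts with one cell suppressed as sensitive,
# alongside its published row total (toy numbers, not real data).
published_cells = [37, None]   # second cell withheld as sensitive
row_total = 42                 # margin released as-is

# Complementary disclosure: the margin minus the released cell
# recovers the "suppressed" value exactly.
recovered = row_total - published_cells[0]
```

Guarding against this requires suppressing complementary cells (or perturbing the margins), which is why automated filters that check each query in isolation can miss disclosures assembled across several outputs.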
Licensing
Licensing, the third major mode for restricted data access, was first established in 1989 by the NCES. Other agencies and archives that have licensing procedures include the Bureau of Labor Statistics, the Division of Science Resources Statistics of NSF, the Health and Retirement Study data archive (hrsonline.isr.umich.edu/rda), the University of Michigan National Archive of Criminal Justice Data, and the Wisconsin Longitudinal Study (see www.ssc.wisc.edu/wlsresearch).12
The license allows researchers to use nonpublic microdata at their own work site, and thus is the most convenient of the three modalities that have emerged over the last decade, although, to date, it is the least used mode. Applicants submit a research plan that includes justification for the use of confidential data, identification of all persons who will have access to the microdata, and a computer security plan. For successful applicants, a license is signed by an official with authority to bind the university, research corporation, or other government agency to the conditions spelled out in the license. Persons with access to the data also sign affidavits of nondisclosure and agree to unannounced inspections to monitor
compliance with security procedures. Most license agreements include severe criminal penalties for confidentiality violations.
License agreements are not available in agencies with existing legislation that places more demanding restrictions on confidential data. The Census Bureau, for example, remains constrained by legislation that restricts access to individual records to sworn officers and employees. Licensing conditions vary from agency to agency, as does the duration of the license agreements. Penalties for violating license agreements are uniformly severe, but the procedures for monitoring the performance of licensees and detecting and taking appropriate action against violations are weak (Seastrom, Wright, and Melnicki, 2003). Audits of data protection protocols have found violations due largely to carelessness; they have not found any actual breaches of confidentiality.
MEETING THE CHALLENGES
The panel’s report points to a number of serious challenges at the interface of confidentiality and data access. It places a high value on protecting confidentiality. It also takes seriously the responsibility to assure that the nation’s robust research and policy analysis infrastructure has sufficient access to microdata so that it can provide intelligent analysis of social and economic conditions and of the effect of policies designed to improve them.
Although it is easy to agree with the Jeffersonian principle that absent an informed public there is no democracy, it is equally easy to agree with the late Senator Moynihan, who famously justified his vote rejecting Robert Bork for the Supreme Court: “I cannot vote for a jurist who simply cannot find in the Constitution a general right to privacy …” But the Jeffersonian public that needs to be informed is the same public that must supply answers to questions sometimes viewed as infringing on privacy and must be assured that answers given are confidential.
The panel finds in history the warrant for asserting that there are ways to move forward without sacrifice to either the value the nation places on privacy and confidentiality or the value it finds in a data-rich democracy. Statistical agencies, working closely with scholars, have for more than 40 years simultaneously improved the technologies that protect confidentiality and the modalities that provide appropriate access to microdata. Even as some methods are applied to decrease disclosure risk, others have been designed to improve access under carefully controlled conditions.
Responding to increased public concerns about privacy and confidentiality and to developments in information technology and data availability in the past decade, the statistical and research communities have moved quickly to devise new methods for restricted access modes and restricted data
products. The challenge at the present time is to evaluate how well the new methods are working, forthrightly assess problematic areas, and determine ways in which alternative methods can be improved. Nothing in the past suggests that increasing access to research data without damage to privacy and confidentiality rights is beyond scientific reach. This report offers recommendations that, if implemented, will continue the past record of simultaneous improvement along both dimensions. Such improvement will require strong partnership between the research community and statistical and research agencies in the design of innovative research on disclosure avoidance techniques and data access modalities and in the implementation of the advances that result from such research.