Ensuring Access and Confidentiality
The National Science Foundation (NSF) has largely completed its review of the options for reducing disclosure risk from the tabular data in the Race/Ethnicity, Gender, and Fine Field of Study Tables (called here the REG tables) of the Survey of Earned Doctorates (SED) and has announced its design decision for the next publication round. Issues of confidentiality protection and data access, however, are ongoing and require continual review, as capabilities for both protection of and intrusion into published tabulations are enhanced over time. This chapter summarizes information from the workshop on emerging models for ensuring access and confidentiality and on developments in risk management that can help guide thinking about how to improve both access and protection in the future.
EMERGING METHODS FOR ACCESS AND CONFIDENTIALITY
As mentioned in Chapter 2, the Confidentiality and Data Access Committee of the Federal Committee on Statistical Methodology has taken on most of the responsibility for tracking, sharing, and documenting work on emerging methods for providing access to statistical data while limiting the risk of disclosure of confidential information in the federal government. At the workshop, Jacob Bournazian—the confidentiality officer of the U.S. Department of Energy’s Energy Information Administration, the chair of the Confidentiality and Data Access Committee during 2001–2003, and
the lead author of the revised version of Working Paper 22—provided an assessment of the current status of research and development on methods to provide access and limit disclosure risk in the federal government. These emerging models include research data centers (RDCs), which permit onsite use of confidential files in a closely delimited area with specialized equipment and extreme security; systems of remote access over secure electronic lines to dedicated computers; fellowships and postdoctoral programs, in which researchers can be treated as agency employees, permitting a less restrictive form of access; and use of confidential data offsite but under highly restricted conditions, as spelled out in a legally binding agreement, such as a license. The emerging role of public query systems for accessing tabular and microdata was also discussed.
Bournazian began by stating his overall assessment that the data aggregation approach selected by NSF is compatible with both user needs and future growth in accessing data. However, he cautioned that any disclosure limitation approach adopted today must be designed with public database query systems in mind, and that NSF may need to develop a restricted access model to complement the application of the data aggregation approach for tabular data.
The issues surrounding the application of different disclosure limitation methods are appropriately considered at the design stage, rather than at the back end of the system, and the approach that is ultimately chosen may have ramifications for future data release strategies, he observed. To the extent that some users are not satisfied with the access afforded by the scheme selected to protect the data, NSF may need to offer restricted access to the microdata.
The selection of the appropriate disclosure limitation methods, Bournazian suggested, should be based on the results of a formal risk assessment.1 The appropriate basis for the assessment is the information that could be disclosed by the table design in combination with prior releases. It is important for NSF to look at the risk of reidentification of individuals in small cell counts by matching external files to the tabular data. If there are only a few identified disclosure risks, then the hierarchical structure of the field of study variable used as the identifier can be collapsed, using a scheme that follows the Classification of Instructional Programs (CIP) taxonomy, as already decided by NSF. If there are a large number of disclosure risks, branches of the hierarchical structure should be removed or collapsed.
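To make the collapsing step concrete, the following is a minimal sketch, not the NSF scheme itself, of rolling detailed CIP-style field codes up into their broader two-digit series whenever a cell falls below a publishability threshold. The codes shown and the threshold of 5 are illustrative assumptions.

```python
from collections import Counter

THRESHOLD = 5  # assumed minimum publishable cell count, for illustration only

def collapse_small_cells(counts):
    """counts maps a detailed CIP-style code (e.g. '27.0501') to a
    respondent count.  Detailed cells below THRESHOLD are rolled up
    into their 2-digit CIP series (e.g. '27')."""
    out = Counter()
    for code, n in counts.items():
        if n < THRESHOLD:
            out[code.split(".")[0]] += n  # collapse one level of the hierarchy
        else:
            out[code] += n
    return dict(out)
```

A production scheme would also re-test the rolled-up series and collapse further branches if they remain sparse, mirroring the "remove or collapse branches" step described above.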
When making the risk assessment, he continued, an important consideration is the unique nature of the information in the SED files. There is a high likelihood of knowing whether or not a specific person is in the survey; indeed, with a 92 percent response rate, an intruder can be virtually certain that any known doctorate recipient is in the file. For this reason, it is essential to consider all files in the public domain to which an intruder could potentially match a small cell, including the University of Michigan dissertation files by year, search engines, individual searches, mailing lists, public records such as state driver’s license files, and similar public and proprietary sources.
The results of this risk assessment will assist NSF in deciding on the best means to protect the tabular and public-use files.
Accessing Confidential Data
In addition to tabular aggregation, Bournazian observed that a means for NSF to satisfy the twin concerns of data access and confidentiality is to make data files available under highly controlled conditions. Recently, disclosure protection research and innovations in computing technology have significantly increased the range and depth of access to materials previously not released or available only to very few users. He named the four options that have emerged across the federal statistical agencies that afford increased access and protection: (1) licensing, (2) RDCs, (3) remote access through data enclaves, and (4) remote access through online query systems.
Licensing is an option that NSF already offers to institutions. Under licensing arrangements, researchers obtain access to restricted data when their institution signs an agreement, or license. Such agreements may be cumbersome: they require a demonstrated need for sensitive data; authorization for all users at the requesting institution; signature by a senior official and key staff; a data security plan; agreement by researchers not to identify individual research subjects or to link the data received with other microdata files; and review of all statistical output
before publication. They are for a specified period of time, with the stipulation that data files must be returned or destroyed. Some licensors require fees or approval by an institutional review board (or both). However, licensing has not been very effective for the SED data because very few licenses have been sought or awarded for data from this survey.
Research Data Centers
RDCs are a relatively expensive option for both a statistical agency and the data users, who have to pay for access. Access through an RDC usually takes place at the offices of the agency that holds the data or at a remote site under the control of the RDC, under highly restricted conditions. The essential characteristics of these centers are that a research proposal must be submitted; a formal agreement covers the research and analysis to be done, the data to be used, and the types of output; data files must be stripped of personal identifiers; and data processing equipment must be dedicated to the restricted use. The data holder usually conducts a thorough disclosure review of the output, and all materials removed from the site are inspected. Adherence to these strictures is ensured by the physical presence of and oversight by agency staff.
Remote Access and Online Query Systems
A growing number of government agencies have been developing remote access systems and online query systems that combine a database system with a spreadsheet program to allow users to request tabulations and correlation matrices from restricted microdata files. To avoid the risk of disclosure, the data produced are categorical, all counts are weighted, and estimates are produced only for cells with at least 30 respondents. Some agencies manage the system through research data centers, and others, such as the National Institute of Standards and Technology and the U.S. Department of Agriculture, use the secure facilities of third parties, one such being the National Opinion Research Center (NORC) data enclave.2
The NORC data enclave is a service that provides a confidential, protected environment in which authorized researchers can access sensitive microdata remotely. It is designed primarily to disseminate sensitive microdata that have not been fully deidentified for public use (see http://www.norc.org/DataEnclave/).

Like the RDC option, remote access is governed by procedures to evaluate and approve proposals, grant and monitor access, filter data queries through a set of primary disclosure rules, and generate tabular presentations of the results of accessing the records (never the records themselves). To illustrate the state of the art in remote access and online query systems, Bournazian described the systems of three federal statistical agencies:
The National Center for Health Statistics (NCHS) has developed six online public query systems for releasing aggregate statistics. NCHS also provides remote access to restricted microdata through a system that allows authorized users (those who have submitted a research plan for preapproval) to submit an analysis program, the specific query, and information on the purpose of the query. NCHS then sends back a dummy data set to allow the user to make sure the program is working. If so, the user returns the program, which is run by NCHS against the database, and only results that have passed the agency’s disclosure review are returned to the user.
The National Center for Education Statistics (NCES) system permits authorized users to go to the agency website, obtain authorization by completing an online, “one click” agreement, submit queries to the database that contains the confidential data, and get responses that are sanitized to ensure that there is no inadvertent release of confidential data.
The National Agricultural Statistics Service (NASS) uses the Quick Stats system. It combines an RDC-type center in 40 field offices around the country with online queries over a virtual private network. Access is limited to authorized users, and the results of queries are inspected by field office staff to ensure no release of confidential data. When a request is complicated, requiring direct interface with the database, it is also possible to purchase special tabulations at a cost of $500 per tabulation or $500 per day. The Economic Research Service, in conjunction with NASS, also developed a system for users to generate customized data tables by accessing microdata from the Agricultural Resource Management Survey (ARMS) Program. In the ARMS system, disclosure limitation has already been applied to the microdata.
Like these systems for accessing microdata, the use of online query systems for releasing tabular data has been growing in federal statistical
agencies. The main reasons for this growth, Bournazian suggested, are that these systems avoid hard-copy publication costs, allow customized table designs, and preserve intelligence on the most popular queries and reports in ways that agencies can use to further improve access over time. The agencies that have led in the development of this capability are NCES, the Census Bureau, NASS, the Energy Information Administration, NCHS, and the Bureau of Labor Statistics.
Bournazian called the online statistical database query systems the next step in the evolution of the release of statistical data by national statistical offices. He postulated that an online query system would permit NSF to control the data on race/ethnicity and gender from SED that are accessible without suppressing values. An online system addresses disclosure risk in advance, because risk has already been factored into the rules that govern the availability of data to the user, ensuring that no tabulation available to the user represents a disclosure.
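The filtering such a system applies can be pictured as a simple gate between the database and the user. The sketch below applies the 30-respondent minimum cited earlier; the function name and data layout are hypothetical, not drawn from any agency's implementation.

```python
MIN_CELL = 30  # minimum respondents per released cell, the rule cited in the text

def answer_query(cell_counts):
    """cell_counts maps a category to its respondent count.  Cells
    below the minimum are suppressed (returned as None) before any
    result reaches the user, so no released tabulation contains a
    small cell."""
    return {category: (count if count >= MIN_CELL else None)
            for category, count in cell_counts.items()}
```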
In the discussion period that followed Bournazian’s presentation, Lynda Carlson, director of the NSF Science Resources Statistics Division, reported that NSF is now considering a data enclave arrangement for accessing SED and other science and engineering statistical series, as well as an online query system. The agency intends to test the NORC enclave and is considering an online query system developed by Space Time Research. These capabilities are being developed in a manner that will ensure that the environment is controlled to prevent inadvertent disclosure of confidential data. The agency is also mindful of the cost of the system.
Mark Schneider of the American Institutes for Research and former director of NCES expressed concern about data loss in an online query system in which the raw values have been suppressed. This is especially troublesome when the data inquiry is designed to extract data from the database to feed into a model. Missing values will invalidate the model run, thus limiting the kind of statistical and trend analysis that the inquiry was designed to yield. For this reason, NCES selected, as its primary access option, a restricted access model (licensing) rather than an online query system for its users. However, NCES continues to support an online system, the Data Access System (DAS). DAS is a website that provides public access to education survey data collected by the U.S. Department of Education as well as to analytic reports about education policy issues. On this website, users can create their analysis tables and covariance analyses using the DAS application, view and download predesigned analysis tables and the DAS programming files used to create them, and view the
highlights of report findings, with figures and tables, for various topics written by researchers for NCES.3
Bournazian pointed out that the DAS system had limited value for NSF because it is designed for sample-based data files, whereas the SED data are nearly a census, so the issues over disclosure are of greater concern. In his view, a targeted online query system is most suitable for NSF.
ISSUES IN DISCLOSURE RISK FOR TABULAR DATA
According to the next presenter, Jerome Reiter of Duke University, when assessing disclosure risk it is important to first distinguish between the types of disclosure. He indicated that there are two main types of disclosure risk: (1) identification disclosure, in which an intruder, by matching a record in the released data, can learn that a particular person participated in the study, and (2) attribute disclosure, in which the released records could reveal the value of a sensitive variable for a targeted individual.
He contended that it is important to make this distinction between identification disclosure and attribute disclosure in the context of the SED tabular data, because an understanding of the two types will assist in thinking about the release strategy.
One measure of identification disclosure risk is the number of population “uniques,” that is, records that are unique in the population. This differs from sample uniqueness, because being a sample unique is not necessarily revealing: if there are many people in the population with the same characteristics as the person in the sample, there is not necessarily a risk of disclosure, but if there are few such persons, the risk is greater. To understand the amount of risk from population uniques, one needs to understand what an intruder already knows about the subject based on key variables. Other factors affecting the calculation of risk are whether the data are continuous and whether statistical disclosure limitation methods that alter the data have been applied.
A growing body of literature considers the issue of computing the risk from population uniques. Recent work by Skinner and Shlomo assessed the risk of identification of respondents in survey data, using applications from the United Kingdom’s Office for National Statistics. The authors set out to quantify the risk associated with matching categorical key variables between microdata records and external data sources, as well as the risk associated with the application of statistical models, Poisson regressions, and log-linear models to sample data in order to predict population estimates from sample counts. They found stability across the models and, as a result, were able to match records quite successfully (Skinner and Shlomo, 2008).

(DAS is available at http://nces.ed.gov/das/.)
In Reiter’s view, the SED data are virtually a census of the population of persons who earn doctoral degrees, and thus it is not necessary to use the methods of Skinner and Shlomo to estimate the number of population-unique records: NSF can simply determine uniqueness in the available data based on the REG variables.
If NSF decides to perturb REG data values, Reiter suggested it would also be important to estimate the risk of disclosure from probability-based methods that intruders may employ to reidentify a target individual regardless of whether the person is unique in the population. The agency could do this by using probability-based methods that mimic an intruder who attempts to make direct matches with an external database. To estimate the risk of reidentification from probability-based methods, it is necessary to make assumptions about intruder knowledge and behavior.
To illustrate the risk from these methods, Reiter set out a scenario in which an intruder gained knowledge about a particular person’s graduation year, field of study, presumed gender, and inferred other information from a source such as the University of Michigan’s doctoral dissertation database. The SED files could be searched for people with the characteristics of the targeted person, and, if a small number of people match those characteristics, the intruder can claim that the target of the intrusion participated in the SED.
To measure this type of risk, Reiter drew from work by Duncan and Lambert (1986). Reiter set up an example using notation in which Z is the released data on r records, and M is information about the disclosure protection the agency has applied. If the target (t) has the characteristics of a white man, a U.S. citizen, with a degree granted in 1999, then:
Let J = j when record j in Z matches t.
Let J = r + 1 when the target is not in Z.
For j = 1, …, r + 1, the intruder computes Pr(J = j | t, Z, M).
In this formulation, the intruder would select the j with the highest probability. If nonresponse is ignored, the probability reduces to a uniform distribution over the matching records:

Pr(J = j | t, Z, M) = 1/n_t if record j matches t on the known characteristics, and 0 otherwise,

where n_t is the number of records in Z that match t. If nonresponse is taken into account, only minor adjustments are required. Reiter suggested that statistical agencies could go through every record j in the database and compute these probabilities of reidentification.
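The uniform-matching special case (nonresponse ignored) is simple enough to sketch directly; the field names below are hypothetical, and this is only the simplest instance of the Duncan–Lambert formulation, not Reiter's full calculation.

```python
def match_probabilities(records, target, keys):
    """Returns {record index: Pr(J = j | t, Z, M)} for the records
    that agree with the target on the known key variables; with
    nonresponse ignored, the probability is uniform over the n_t
    matches and zero elsewhere."""
    hits = [j for j, r in enumerate(records)
            if all(r[k] == target[k] for k in keys)]
    return {j: 1.0 / len(hits) for j in hits}
```

An agency playing intruder could run this for every record in the file and flag records whose match probability is high, say above some policy threshold.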
Attribute disclosure risk measures address the question, “Given released data and information known by the public, how well can an intruder reproduce the original data?” Say the intruder knows the graduation year, gender, and field. Can he or she learn race or citizenship?
Released data: Z.
Information about disclosure protection: M.
Target: t (e.g., man, Ph.D. in statistics, 1999).
The intruder tries to learn the target’s race: he or she seeks Pr(race of t = k | t, Z, M), where k = 1 for white, k = 2 for black, etc.
In Reiter’s formulation, the intruder would select the k with the highest probability. Thus, for original tabular (census) data, the probability of an attribute disclosure equals the conditional percentage of each race category, given the target’s known characteristics. The risk measure could simply be the percentage of times the intruder gets it right.
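The attribute-disclosure calculation is easy to sketch: condition on the target's known characteristics and tabulate the sensitive attribute among the matching records. The record layout and field names below are hypothetical.

```python
from collections import Counter

def attribute_probs(records, target, known, sensitive):
    """Conditional distribution of the sensitive attribute among the
    records that share the target's known characteristics; the
    intruder's best guess is the most probable value."""
    matches = [r for r in records
               if all(r[k] == target[k] for k in known)]
    counts = Counter(r[sensitive] for r in matches)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}
```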
Reiter also addressed the issue of the risk associated with small cell counts. He concluded that small counts usually imply small conditional probabilities, but they are not necessarily indicative of the risk for attribute disclosure. From this, he suggested that it might be possible to use fewer types of alterations to the data or fewer aggregations of the data if the risk of concern is attribute disclosure.
He then discussed trends in disclosure risk assessment, highlighting an algebraic method and three computer science approaches. The algebraic method assumes that there is a full table with all variables that is not released to the public because of possible disclosure concerns. Using computational algebra, it is possible, for tables of modest size, to enumerate all tables that are consistent with the published margins and any fixed cells, and thus to infer the nondisclosed information (Fienberg and Slavkovic, 2004).
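For a 2×2 table the enumeration can be done by brute force: fix the row and column totals and list every nonnegative integer table consistent with them. The spread of a given cell across the list bounds what the margins alone reveal; this toy version stands in for the computational-algebra machinery.

```python
def consistent_2x2_tables(row_sums, col_sums):
    """Enumerate all 2x2 nonnegative integer tables with the given
    margins.  If a cell takes few distinct values across the list,
    the released margins nearly disclose it."""
    r1, r2 = row_sums
    c1, c2 = col_sums
    tables = []
    for a in range(min(r1, c1) + 1):           # a is the top-left cell
        b, c, d = r1 - a, c1 - a, r2 - (c1 - a)
        if min(b, c, d) >= 0 and b + d == c2:  # all cells nonnegative, margins met
            tables.append(((a, b), (c, d)))
    return tables
```

For example, with row sums (3, 2) and column sums (4, 1), only two tables are consistent, so an intruder who sees the margins knows the top-left cell is 2 or 3.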
The approach to risk assessment known as k-anonymity is a computer science method that requires every released combination of key variables to be shared by at least k records, so that each record is indistinguishable from at least k − 1 others (Sweeney, 2002). The k-anonymity approach is helpful in understanding the risk of disclosure, but Reiter suggested that one of its downsides is that it does not necessarily prevent disclosures when the intruder has external information, so it may not be adequately protective.
L-diversity is a protective measure that strengthens k-anonymity by requiring that each block of records sharing a combination of key variables contain at least L well-represented values of the sensitive attribute. With this methodology, every grouping contains a mix of at least L sensitive values, and fewer attribute disclosures are thus possible.
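Both properties are straightforward to check on a file. The sketch below computes, for a hypothetical record layout, the smallest group size (the k-anonymity level) and the smallest number of distinct sensitive values in any group, which is the simplest, distinct-values reading of "well-represented"; stricter entropy-based definitions of l-diversity also exist.

```python
from collections import defaultdict

def k_and_l(records, keys, sensitive):
    """Group records by their key-variable combination.  Returns
    (k, l): k is the size of the smallest group, l is the fewest
    distinct sensitive values in any group."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[k] for k in keys)].append(r[sensitive])
    k = min(len(g) for g in groups.values())
    l = min(len(set(g)) for g in groups.values())
    return k, l
```

In the test case below, one group is 2-anonymous yet contains a single race value, which is exactly why k-anonymity alone does not stop attribute disclosure.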
Differential privacy is a mathematical way of representing the idea that the incremental disclosure risk to any individual from joining a data set should be small: the released results must be nearly unchanged whether or not that individual’s record is included.
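The canonical way to achieve this guarantee for a released count is the Laplace mechanism. The sketch below is a textbook version for a single count query, offered only to illustrate the definition, not anything NSF proposed.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise by inverting the Laplace CDF."""
    u = rng.random()
    while u == 0.0:          # avoid log(0) at the endpoint
        u = rng.random()
    u -= 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count, epsilon, rng=random):
    """A count query has sensitivity 1 (adding or removing one person
    changes it by at most 1), so adding Laplace noise with scale
    1/epsilon makes this single release epsilon-differentially
    private."""
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

Smaller epsilon means more noise and stronger protection; repeated queries consume the privacy budget and need larger total noise.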
Reiter also discussed the possibility of releasing partially synthetic data, that is, data that have been modeled to have the statistical properties of the original data but that are not the same as the original data. Initially suggested by Little (1993), this method would create multiple, partially synthetic data sets for public release, so that the released data would comprise a mix of observed and synthetic values and would look like the actual data (Reiter, 2005). In this method, statistical procedures valid for the original data would be valid for the released data. The advantages for the SED tabular data are that it would be possible to publish fine field-level detail for the REG tables and to preserve the longitudinal character of the data. The method would be straightforward for analysts and not too difficult to implement, since there are only a small number of variables. It could also be applied to the microdata files.
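A toy sketch of the idea follows. The record layout is hypothetical, and a real implementation would draw the replacement values from a fitted statistical model, as in Reiter (2005), rather than resampling within groups as this stand-in does.

```python
import random
from collections import defaultdict

def partially_synthesize(records, keys, sensitive, m=3, seed=0):
    """Create m partially synthetic copies of the file: each record
    keeps its key variables but receives a sensitive value drawn from
    the observed distribution within its key group (a stand-in for
    draws from a fitted model)."""
    rng = random.Random(seed)
    pools = defaultdict(list)
    for r in records:
        pools[tuple(r[k] for k in keys)].append(r[sensitive])
    copies = []
    for _ in range(m):
        copies.append([dict(r, **{sensitive: rng.choice(pools[tuple(r[k] for k in keys)])})
                       for r in records])
    return copies
```

Analysts would then combine estimates across the m released copies using the combining rules for partially synthetic data.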