The implementing guidance for the Confidential Information Protection and Statistical Efficiency Act, issued by the U.S. Office of Management and Budget in June 2007, noted survey respondents' growing concerns about confidentiality and privacy and outlined the requirements that federal agencies must meet to honor a pledge of confidentiality protection. In response, the Division of Science Resources Statistics (SRS) of the National Science Foundation (NSF) undertook a detailed review of its existing rules and procedures for protecting the data collected under a pledge of confidentiality in all its surveys.
The Survey of Earned Doctorates (SED), the focus of this report, collects data on the number and characteristics of individuals receiving research doctoral degrees from all accredited U.S. institutions (see Box 1-1). The results of this annual survey are used to assess characteristics and trends in doctorate education and degrees. This information is vital for education and labor force planners and researchers in the federal government and in academia.
BOX 1-1
Survey of Earned Doctorates
The Survey of Earned Doctorates (SED) surveys all individuals who earned research doctorates from accredited U.S. institutions between July 1 and June 30 of the preceding year. A research doctorate is a doctoral degree that (1) requires the completion of an original intellectual contribution in the form of a dissertation or an equivalent project of work (e.g., a musical composition) and (2) is not primarily intended as a degree for the practice of a profession. The most common research doctorate degree is the Ph.D. The total universe in the 2007 survey comprised more than 48,000 research doctorate recipients from over 420 accredited U.S. doctorate-granting institutions. The response rate is usually about 92 percent, but every recipient of a research doctorate degree in the reporting year is included in the survey, whether or not they responded to it. For nonrespondents, limited records (containing field of study, doctorate institution, sex, and baccalaureate degree) are constructed on the basis of information collected from commencement programs, graduation lists, and other similar public records.
As a result of its review, SRS implemented more stringent procedures to protect the confidentiality of data provided by respondents to SED. These new procedures, which were implemented for the 2006 SED data released in 2007, suppressed many previously published data elements. The suppressed elements were mostly in fields in which very small numbers of doctoral degrees had been awarded. Nonetheless, the data about these fields in which few degrees had been awarded were often closely watched by users, in part precisely because of their rarity. The data items that were suppressed pertained to race/ethnicity, gender, and subfields, all of which were of interest to policy makers, researchers, and educational institutions. The organizations and institutions that had previously relied on these data to assess progress in this most important measure of achievement and equality suddenly found themselves without a yardstick with which to measure progress. Since the elimination of the data came without warning and for reasons that were not made clear to data users, their reaction was negative.
In response to these user concerns, NSF took a number of steps: gathering information from users, reconsidering the means by which confidential data can be protected from disclosure, and securing outside review of its decisions. The workshop that is summarized in this report is one of those initiatives. The goal of the workshop was to address the appropriateness of the decisions that SRS made and to help the agency and data users consider future actions that might permit release of useful data while protecting the confidentiality of the survey responses.
DECISION TO SUPPRESS DATA
Confidentiality is an issue for SED mainly because individuals who earn doctorate degrees from institutions of higher education supply information about themselves when they respond to the survey. This information includes sensitive matters that many individuals want to keep private, such as future plans, money owed as a result of their schooling, and expected income, among others.
These individual-level data are collected by SRS under a pledge of confidentiality to the individual respondent. The importance of protecting personally identifiable data supplied by respondents is heightened because SED is a virtual census of everyone receiving a research doctorate in a given year (see Box 1-1). Indeed, if a person received a research doctorate in a given year, it is almost certain that the individual’s information is contained in the survey.
In the revised procedures to protect the confidentiality of data, SRS decided to suppress from publication a number of data cells with very small counts. These occurred in tables in an annual publication series, Doctorate Recipients from United States Universities: Summary Report (also known as the Interagency Summary Report), as well as in a number of additional standard tabulations of SED data. These tabulations, which have been produced when ordered by interested parties, include the Race/Ethnicity, Gender, and Fine Field of Study Tables (called here the REG tables), which report national-level counts of doctorate recipients by detailed or fine field of doctorate, gender, race/ethnicity, and citizenship. The cells primarily affected are certain categories for the variables of race/ethnicity, citizenship, and gender, because the numbers of people in those groups who obtain doctorates in any given year are quite small, particularly when the data are arrayed by fine field of degree.
NSF received many complaints from the user community about these changes, which made less information from the survey available than before, particularly for underrepresented minorities. A great deal of the concern related to the fact that SRS had implemented the changes without prior input from the user community and without much warning to sponsoring agencies and others who closely follow trends in these data series. Users strongly suggested that SRS solicit user input as to how best to design new tables to meet a broad spectrum of user needs.
RESPONSE TO USER CONCERNS
SRS reacted to these user concerns on multiple levels. The agency, in essence, retracted the decision to suppress the 2006 data and made them available in published form. Then SRS began a deliberate process of seeking input from stakeholders through a series of meetings and other outreach activities with users to solicit their views about the presentation of SED data. It also initiated an internal research effort to develop alternative formats for new tables presenting SED data by race/ethnicity, citizenship, and gender.
In doing so, SRS considered best practices used by federal statistical agencies to protect individually identifiable data and, specifically, the principles documented in the Federal Committee on Statistical Methodology’s Statistical Policy Working Paper 22 (Office of Management and Budget, 2005). It also opened for consideration new disclosure methodologies published in the research literature. SRS published the results of this work in a paper that discusses the issues; the activities undertaken to determine the types of, needs for, and uses of the data by a variety of users; the alternative approaches considered for presenting the data; and the rationale for choosing the approaches used in the proposed new tables (National Science Foundation, 2009).
The workshop summarized in this report is the culmination of the outreach activities initiated by SRS. At the request of SRS, the Committee on National Statistics of the National Research Council formed an ad hoc steering committee to plan for and conduct a workshop for the purpose of reviewing the proposed confidentiality criteria established for the SED. The major purpose of the workshop was to convene experts to address the decisions that SRS made on how best to present SED data so as to maximize the amount of data that can be released while maintaining the pledge of confidentiality made to respondents. The event was intended to give experts in the field an opportunity to comment on the procedures SRS used and on the tables themselves.
The exchange of information and the publication of this report were the sole goals of the workshop. This report is intended as a record of the discussion of key issues identified by the steering committee and discussed by the subject-matter experts who attended the workshop. It draws no conclusions, nor does it make any recommendations.
Following this introduction, Chapter 2 continues with a discussion of the context for the protection of confidential data in the federal government. It takes a specific look at methods for protecting confidential data as propounded throughout the federal statistical agencies and in the statistical profession. The basis for this discussion consists of three major pieces of legislation: (1) the congressional mandate to NSF, the National Science Foundation Act of 1950, as amended; (2) the Privacy Act of 1974; and (3) the Confidential Information Protection and Statistical Efficiency Act of 2002. Federal agency practices are outlined in Statistical Policy Working Paper 22 (Office of Management and Budget, 2005). The perspective of the statistical profession, as contained in a 2008 statement on data access and personal privacy and appropriate methods of disclosure control (American Statistical Association, 2008), is also described in this chapter.
Chapter 3 presents the highlights of the NSF decision paper that lays out the options it considered for publishing future editions of the race/ethnicity and gender tables for SED. The preferred method would display fine fields of degree in which 25 or more doctorates are awarded in a given year and would aggregate the fine fields in which fewer than 25 doctorates are awarded into fields defined by the Classification of Instructional Programs taxonomy. The chapter summarizes the workshop’s lively discussion of the pros and cons of this proposed new strategy.
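The threshold rule described above can be illustrated with a minimal sketch. It is not NSF's implementation: the function name, the field names, the counts, and the mapping from fine fields to broader fields are all hypothetical, and only the cutoff of 25 doctorates comes from the text. The sketch also leaves open what happens when an aggregated field still totals fewer than 25, since this chapter does not say.

```python
from collections import defaultdict

# From the text: fine fields with 25 or more doctorates in a year are shown as-is.
THRESHOLD = 25


def aggregate_small_fields(counts, parent_of, threshold=THRESHOLD):
    """Split fine-field counts into displayed fields and aggregated broader fields.

    counts    -- dict mapping fine field name to number of doctorates (hypothetical data)
    parent_of -- dict mapping fine field name to its broader field (hypothetical
                 stand-in for a taxonomy such as the Classification of
                 Instructional Programs)
    """
    shown = {}
    rolled_up = defaultdict(int)
    for field, n in counts.items():
        if n >= threshold:
            # Large enough to publish at the fine-field level.
            shown[field] = n
        else:
            # Too small: add the count to the broader field instead of suppressing it.
            rolled_up[parent_of[field]] += n
    return shown, dict(rolled_up)


# Hypothetical example data, not SED figures.
example_counts = {"Field A": 40, "Field B": 12, "Field C": 7}
example_parents = {"Field A": "Group X", "Field B": "Group X", "Field C": "Group X"}

shown, rolled_up = aggregate_small_fields(example_counts, example_parents)
print(shown)
print(rolled_up)
```

In this sketch no count is dropped from the totals: small fine fields lose their detail but still contribute to a published broader field, which is the contrast with cell suppression that the workshop discussed.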
Chapter 4 steps back and takes a look at the requirements for the data from SED, particularly for the race and ethnicity categories. For this chapter, the report of the series of outreach meetings on the impact of the suppression of small cells on the survey is the main point of reference (Quality Education for Minorities Network, 2009).
Chapter 5 takes a broader view of the issue, looking at emerging models for ensuring confidentiality and data access and an emerging framework for assessing the risk of redisclosure of confidential information from published sources. The stock of optional methods for protecting data is growing rapidly in the federal statistical system, and academic research is developing increasingly sophisticated techniques for assessing the risk of redisclosure. The discussion of these new methods is included to assist NSF as it considers how to refine the decision it has reached to protect confidential data and make them accessible. The information in this chapter may well inform other federal statistical agencies facing issues similar to those confronted by NSF over the past 2 years.
In the final chapter, the views of participants on the appropriateness of the NSF decision to aggregate rather than suppress data are summarized. The chapter includes a series of issues that were raised in the general discussion period but not resolved during the workshop. They are presented as topics that could be further investigated by the NSF staff.