Context for the Protection of Confidential Data
This chapter sets out the legal and policy context in which the Division of Science Resources Statistics (SRS) of the National Science Foundation (NSF) operated when confronting the issue of publication of data from the Survey of Earned Doctorates (SED). This context also sets the stage for deciding on the new parameters for publishing data based on responses from small population groups.
The chapter begins with a description of the NSF legislative mandate pertaining to confidentiality of respondents and their responses. It then discusses the legislation and implementing guidance for the original decision to withhold some data from publication: the Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA). Finally, the chapter addresses the several rules, guidelines, and practices that have emerged in the federal statistical system and the statistical profession, which also serve to establish the context for decisions on the balance between confidentiality and access by federal statistical agencies.
NSF LEGISLATION AND PRACTICE
Workshop presenter Stephen Cohen (National Science Foundation) began by saying that the SRS is both a part of the National Science Foundation and one of the major agencies in the federal statistical system. It is the primary organization in NSF that carries out its congressional mandate “to provide a central clearinghouse for the collection, interpretation, and analysis of data on scientific and engineering resources and to provide a source of information for policy formulation by other agencies of the Federal Government” (National Science Foundation Act of 1950, as amended; 42 U.S.C. 1862).1 The National Science Foundation Act of 1950, as amended, in conjunction with the Privacy Act of 1974, as amended,2 defines the requirement to protect the confidentiality of respondent data. The NSF act pertains to SED because, although it is sponsored and funded by a consortium of federal agencies, NSF is the lead agency in the consortium, provides the bulk of funding, and is responsible for implementing the survey and disseminating the findings. The act provides the legal authority to collect the SED.
Section 14(i) of the NSF act, as amended, provides that survey responses “shall not be disclosed to the public unless the information has been transformed into statistical or abstract formats that do not allow for the identification of the supplier.” In addition to the restriction on public disclosure of identifiable statistical data, section 14(i) also specifically prohibits public disclosure of the “identities of individuals, organizations, and institutions supplying information” in responses to NSF surveys (see Box 2-1).
The legislative imperatives pertaining to SED are reflected in the confidentiality pledge statements included with both the paper and the web-based versions of the questionnaire. The confidentiality pledge statement on the survey forms assures potential respondents that their answers will not be disclosed to the public in identifiable form and specifically refers to the NSF act and the Privacy Act.
The reasons for NSF’s concern for the confidentiality of survey responses, Cohen stated, are based on these legal dictates, as well as on practical concerns that confront all statistical agencies that collect data from the public. In the face of increasing resistance to responding to statistical surveys generally, the assurance of protection of respondents’ answers from public disclosure is considered essential to the continued ability of federal agencies to collect survey data. This is particularly the case with SED, for which the danger of revealing confidential information provided by individual respondents is perceived to be greater than in many other sample surveys because there is a very high likelihood that information about a member of a relatively small group (doctoral graduates in a given year) will be included in the published data.
2 Available at http://www.usdoj.gov/oip/privstat.htm.
The data are most useful when they are tabulated by multiple classification variables within a single table (e.g., field of doctoral degree by gender of doctorate recipient for particular years). The tabulation of data for small groups creates the possibility of small counts in individual data cells. Tables reporting counts of doctorate recipients by race/ethnicity and by field of degree are particularly likely to yield data cells with small counts, because relatively few doctoral degrees may be awarded in a single year in a given field or relatively few doctoral degrees may be awarded to members of a particular demographic group in a year.
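As a rough illustration of how cross-classification produces small cells, the following Python sketch tabulates a handful of invented records by field of degree and race/ethnicity and lists the cells that fall below a threshold. The records, the category labels, and the threshold of 5 are assumptions made for illustration only; they are not SED data, and the threshold is not a rule stated by NSF.

```python
from collections import Counter

# Hypothetical records: (field of degree, race/ethnicity category).
# All values are invented for illustration; none are SED data.
RECORDS = [
    ("Physics", "Group A"), ("Physics", "Group A"), ("Physics", "Group A"),
    ("Physics", "Group A"), ("Physics", "Group A"), ("Physics", "Group A"),
    ("Physics", "Group B"),
    ("Linguistics", "Group A"), ("Linguistics", "Group A"),
    ("Linguistics", "Group B"), ("Linguistics", "Group B"),
    ("Linguistics", "Group B"), ("Linguistics", "Group B"),
    ("Linguistics", "Group B"),
]


def cross_tabulate(records):
    """Count the records that fall in each (field, group) cell."""
    return Counter(records)


def small_cells(table, threshold):
    """Return the nonzero cells whose count is below the threshold, sorted."""
    return sorted(cell for cell, count in table.items() if 0 < count < threshold)


table = cross_tabulate(RECORDS)
# The threshold of 5 is an assumed value chosen for this sketch.
for cell in small_cells(table, threshold=5):
    print(cell, table[cell])
```

With these invented records, two of the four cells fall below the assumed threshold, even though each field as a whole has seven recipients.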
There is a perceived danger that doctorate recipients are especially vulnerable to statistical disclosure because a number of other sources of information about them could be linked to the published data that would permit their identification. For example, the University of Michigan’s database of dissertation abstracts and other data sets on the web are susceptible to yielding information on the members of small groups of doctoral recipients. The ability to match records between SED and the other data sources is facilitated by the growth of data-mining techniques. Finally, the close-knit character of some academic fields in which, as one workshop participant suggested, “everybody knows everybody,” simplifies the task of linking small cells in SED tables to the names of particular individuals.
ADDITIONAL STATUTORY REQUIREMENTS FOR DATA PROTECTION
Concern over the possibility of reidentifying the respondents in SED had existed for a long time. However, prior to 2006, suppression had not been applied to the counts of fine field of doctoral degrees by race/ethnicity, gender, or citizenship. The immediate impetus for introducing suppression was the publication of implementation guidance for Title V of the E-Government Act: the Confidential Information Protection and Statistical Efficiency Act of 2002. In his presentation to the workshop, Brian Harris-Kojetin (Office of Management and Budget) described the purposes, issues, and requirements of CIPSEA and the reach of the implementation instructions.
As part of the E-Government Act, the confidential information protection legislation was designed to fill long-standing gaps in the ability of the federal statistical agencies to prohibit disclosure of information in identifiable form, control access to and uses made of statistical information, ensure that information is used exclusively for statistical purposes, and, by doing so, strengthen and foster public trust in pledges of confidentiality. Harris-Kojetin described the benefits of CIPSEA, including the application of uniform protection across agencies, coverage of all data collected for statistical purposes under a pledge of confidentiality, strong penalties for disclosure ($250,000 fine and/or 5 years in prison), and exemption from Freedom of Information Act requests. Under the authority of CIPSEA, the director of the Office of Management and Budget (OMB) coordinates and oversees the confidentiality and disclosure policies and promulgates rules and guidance for implementing the act. The implementing guidance for CIPSEA was published in draft form in October 2006 (Office of Management and Budget, 2006) and in final form in June 2007. This guidance defined “statistical purpose,” identified statistical agencies covered by the act, and outlined CIPSEA requirements.
BOX 2-1
NSF Legislation Regarding Public Disclosure of Information
(1)(A) Information supplied to the Foundation or a contractor of the Foundation in survey forms, questionnaires, or similar instruments for purposes of section 1862 (a)(5) or (6) of this title by an individual, an industrial or commercial organization, or an educational, academic, or other nonprofit institution when the institution has received a pledge of confidentiality from the Foundation, shall not be disclosed to the public unless the information has been transformed into statistical or abstract formats that do not allow for the identification of the supplier.
(B) Information that has not been transformed into formats described in subparagraph (A) may be used only for statistical or research purposes.
(C) The identities of individuals, organizations, and institutions supplying information described in subparagraph (A) may not be disclosed to the public.
(2) In support of functions authorized by section 1862 (a)(5) or (6) of this title, the Foundation may designate, at its discretion, authorized persons, including employees of Federal, State, or local agencies or instrumentalities (including local educational agencies) and employees of private organizations, to have access, for statistical or research purposes only, to information collected pursuant to section 1862 (a)(5) or (6) of this title that allows for the identification of the supplier. No such person may—
(A) publish information collected pursuant to section 1862 (a)(5) or (6) of this title in such a manner that either an individual, an industrial or commercial organization, or an educational, academic, or other nonprofit institution that has received a pledge of confidentiality from the Foundation can be specifically identified;
(B) permit anyone other than individuals authorized by the Foundation to examine data that allows for such identification relating to an individual, an industrial or commercial organization, or an academic, educational, or other nonprofit institution that has received a pledge of confidentiality from the Foundation; or
(C) knowingly and willfully request or obtain any nondisclosable information described in paragraph (1) from the Foundation under false pretenses.
(3) Violation of this subsection is punishable by a fine of not more than $10,000, imprisonment for not more than 5 years, or both.
SOURCE: Section 14(i) of the NSF Act (42 U.S.C. 1873(i)), in conjunction with the Privacy Act of 1974, as amended (http://www.usdoj.gov/oip/privstat.htm).
It is important to define statistical purpose, because CIPSEA protection applies only to information acquired under a pledge of confidentiality for exclusively statistical purposes.3 Statistical purpose includes the description, estimation, or analysis of characteristics of groups, without identifying the individuals or organizations that make up such groups; it also includes the methods and procedures to support these purposes. The identification of statistical agencies matters as well: CIPSEA applies throughout the federal government but grants statistical agencies special authority, which allows them to empower special sworn agents to analyze, collect, and process data protected under CIPSEA. Statistical agencies are defined as agencies or organizational units of the executive branch whose activities are predominantly the collection, compilation, processing, or analysis of information for statistical purposes. OMB has determined that the Division of SRS of NSF is a statistical agency for the purposes of CIPSEA.
According to Harris-Kojetin, CIPSEA is prescriptive, in that it requires statistical agencies to inform respondents about confidentiality protection (using a pledge on the collection instrument) and the use of the information, to collect and handle confidential information in ways that minimize the risk of disclosure, to ensure the information is used only for statistical purposes, and to review information to be disseminated to prevent identifiable information from being reasonably inferred by either direct or indirect means. The guidance does not specify the means that agencies should use to review and protect information. For that, it refers agencies to Statistical Policy Working Paper 22 (Office of Management and Budget, 2005).
FURTHER GUIDELINES FOR STATISTICAL DATA PROTECTION
In addition to the legislation covering the federal statistical system and that pertaining specifically to NSF, several other sources can be used as guidance for federal agencies in resolving issues of confidentiality and access. These include Statistical Policy Working Paper 22 and the recently issued American Statistical Association statement, “Data Access and Personal Privacy: Appropriate Methods of Disclosure Control.”4
Alvan Zarate, former confidentiality officer for the National Center for Health Statistics (NCHS), described these sources and outlined principles traditionally used by federal statistical agencies to protect tabular data from disclosure. Although not directly supported by legislation, Working Paper 22 has been cited in rules relating to the Health Insurance Portability and Accountability Act of 1996, the American Recovery and Reinvestment Act of 2009, and the Health Information Technology for Economic and Clinical Health Act of 2009. Its purposes are straightforward: to detail existing methods of statistical disclosure limitation for tables and microdata files; to provide recommendations and guidance for selection and use of appropriate techniques; to promote the development, sharing, and use of statistical disclosure limitation software; and to encourage research to develop improved methods.
4 Available at http://www.amstat.org/news/statementondataaccess.cfm.
In general, Working Paper 22 notes that disclosure protection techniques are applied to data cells containing small counts of demographic variables because “if a cell has only a few respondents and the characteristics are sufficiently distinctive, then it may be possible for a knowledgeable user to identify the individuals in the population” (Office of Management and Budget, 2005, p. 57). The most common disclosure limitation techniques applied to small cells are cell suppression, data aggregation, and data perturbation (i.e., “adding noise” to data). These are some of the options that were considered by NSF in the process of coming to the current decision on disclosure limitation.
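The three techniques named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure NSF adopted or one prescribed by Working Paper 22: the table values, the category labels, the threshold of 5, the field-to-broad-field mapping, and the noise range are all hypothetical.

```python
import random
from collections import Counter


def suppress(table, threshold):
    """Cell suppression: withhold (None) any count from 1 to threshold - 1."""
    return {cell: (None if 0 < n < threshold else n) for cell, n in table.items()}


def aggregate(table, broad_field):
    """Data aggregation: merge fine fields into broader ones and re-sum counts."""
    merged = Counter()
    for (field, group), n in table.items():
        merged[(broad_field.get(field, field), group)] += n
    return dict(merged)


def perturb(table, max_noise, rng):
    """Data perturbation: add integer noise in [-max_noise, max_noise], floor at 0."""
    return {cell: max(0, n + rng.randint(-max_noise, max_noise))
            for cell, n in table.items()}


# Hypothetical counts of doctorate recipients; not SED data.
TABLE = {
    ("Astronomy", "Group A"): 2,
    ("Astronomy", "Group B"): 9,
    ("Physics", "Group A"): 4,
    ("Physics", "Group B"): 31,
}
# Assumed mapping from fine field to broad field, for illustration only.
BROAD_FIELD = {"Astronomy": "Physical sciences", "Physics": "Physical sciences"}

suppressed = suppress(TABLE, threshold=5)       # threshold is an assumed value
aggregated = aggregate(TABLE, BROAD_FIELD)
noisy = perturb(TABLE, max_noise=2, rng=random.Random(0))  # seeded for repeatability
```

Each technique trades detail for protection in a different way: suppression leaves gaps in the table, aggregation loses the fine field breakdown, and perturbation publishes every cell but none exactly. The sketch covers only the primary step; in practice, a suppressed cell can often be recovered by subtraction when row and column totals are also published, so additional (complementary) suppressions are usually required.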
Zarate reported that in developing Working Paper 22, the Federal Committee on Statistical Methodology recommended creation of a group to communicate and share information on confidentiality, statistical disclosure limitation, and restricted access. The Confidentiality and Data Access Committee consists of representatives of federal statistical agencies who deal with confidentiality, data access, and disclosure review techniques. It developed a checklist on disclosure potential that agencies can use to identify tabular data that are at risk of disclosure; the committee also shares information on emerging mechanisms for restricted access (see Chapter 4). The recommendations of Working Paper 22 are heeded in the federal statistical system as indications of good practices when it comes to making decisions on the use of data suppression or other techniques to protect the confidentiality of tabular data.
A more recent source of information on good practices is the American Statistical Association’s statement, “Data Access and Personal Privacy: Appropriate Methods of Disclosure Control.”5 This statement gives the association’s perspective on the assessment of the risk associated with data dissemination and an overview of the way in which statisticians can help limit that risk. The statement recognizes that tabular data are subject to risk because, “although tables are intended to protect individual information by presenting grouped figures, there are situations in which the size and/or the distribution of those groups can reveal more information about individuals … than had been publicly known.” The statement makes the case that the context of privacy protection has changed. The same powerful and sophisticated electronic technologies that have made data readily accessible to the public “pose a distinct threat—in perception if not in reality—to privacy, as well as a potential for inflicting great harm on persons.”
5 Available at http://www.amstat.org/news/statementondataaccess.cfm.
Zarate shared several illustrations of the risk of disclosure and techniques for overcoming those risks based on his experiences at NCHS. Like the NSF act of 1950, as amended, the Public Health Service Act of 1974 contains language specifying that no identifiable information may be used for any purpose other than that for which it was provided, nor may it be released to any party not agreed to by the supplier (Section 308(d)). The protection of data at the agency engages a confidentiality officer, a disclosure review board, and the data division director. As part of the review process, the agency uses a disclosure checklist, which contains a series of questions to determine the geographic detail, the presence of sensitive variables (such as age, race, occupation, income, and household type), and whether other databases contain similar data. It is recognized that risk of disclosure remains even if all possible protections are applied, so the rule is that protection is paramount. Data are released for research “only when the risk is judged to be extremely low” (National Center for Health Statistics, 2004, p. 14).
In response to a question, Zarate stated that NCHS reviews its decisions on risk of disclosure every 4 years or so because threats change as computational power and techniques evolve.