2
Context for the Protection of Confidential Data

This chapter sets out the legal and policy context in which the Division of Science Resources Statistics (SRS) of the National Science Foundation (NSF) operated when confronting the issue of publication of data from the Survey of Earned Doctorates (SED). This context also sets the stage for deciding on the new parameters for publishing data based on responses from small population groups.

The chapter begins with a description of the NSF legislative mandate pertaining to confidentiality of respondents and their responses. It then discusses the legislation and implementing guidance for the original decision to withhold some data from publication: the Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA). Finally, the chapter addresses the several rules, guidelines, and practices that have emerged in the federal statistical system and the statistical profession, which also serve to establish the context for decisions on the balance between confidentiality and access by federal statistical agencies.

NSF LEGISLATION AND PRACTICE

Workshop presenter Stephen Cohen (National Science Foundation) began by saying that the SRS is both a part of the National Science Foundation and one of the major agencies in the federal statistical system. It is the primary organization in NSF that carries out its congressional mandate “to provide a central clearinghouse for the collection, interpretation, and



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.

analysis of data on scientific and engineering resources and to provide a source of information for policy formulation by other agencies of the Federal Government” (National Science Foundation Act of 1950, as amended; 42 U.S.C. 1862).[1] The National Science Foundation Act of 1950, as amended, in conjunction with the Privacy Act of 1974, as amended,[2] defines the requirement to protect the confidentiality of respondent data. The NSF act pertains to SED because, although it is sponsored and funded by a consortium of federal agencies, NSF is the lead agency in the consortium, provides the bulk of funding, and is responsible for implementing the survey and disseminating the findings. The act provides the legal authority to collect the SED.

[1] Available at http://www4.law.cornell.edu/uscode/42/usc_sec_42_00001862----000-.html.
[2] Available at http://www.usdoj.gov/oip/privstat.htm.

Section 14(i) of the NSF act, as amended, provides that survey responses “shall not be disclosed to the public unless the information has been transformed into statistical or abstract formats that do not allow for the identification of the supplier.” In addition to the restriction on public disclosure of identifiable statistical data, section 14(i) also specifically prohibits public disclosure of the “identities of individuals, organizations, and institutions supplying information” in responses to NSF surveys (see Box 2-1).

The legislative imperatives pertaining to SED are reflected in the confidentiality pledge statements included with both the paper and the web-based versions of the questionnaire. The confidentiality pledge statement on the survey forms assures potential respondents that their answers will not be disclosed to the public in identifiable form and specifically refers to the NSF act and the Privacy Act.

The reasons for NSF’s concern for the confidentiality of survey responses, Cohen stated, are based on these legal dictates, as well as on practical concerns that confront all statistical agencies that collect data from the public. In the face of increasing resistance to responding to statistical surveys generally, the assurance of protection of respondents’ answers from public disclosure is considered essential to the continued ability of federal agencies to collect survey data. This is particularly the case with SED, for which the danger of revealing confidential information provided by individual respondents is perceived to be greater than in many other sample surveys because there is a very high likelihood that information about a member of a relatively small group (doctoral graduates in a given year) will be included in the published data.

The data are most useful when they are tabulated by multiple classification variables within a single table (e.g., field of doctoral degree by gender of doctorate recipient for particular years). The tabulation of data for small groups creates the possibility of small counts in individual data cells. Tables reporting counts of doctorate recipients by race/ethnicity and by field of degree are particularly likely to yield data cells with small counts, because relatively few doctoral degrees may be awarded in a single year in a given field or relatively few doctoral degrees may be awarded to members of a particular demographic group in a year.

There is a perceived danger that doctorate recipients are especially vulnerable to statistical disclosure because a number of other sources of information about them could be linked to the published data that would permit their identification. For example, the University of Michigan’s database of dissertation abstracts and other data sets on the web are susceptible to yielding information on the members of small groups of doctoral recipients. The ability to match records between SED and the other data sources is facilitated by the growth of data-mining techniques. Finally, the close-knit character of some academic fields in which, as one workshop participant suggested, “everybody knows everybody,” simplifies the task of linking small cells in SED tables to the names of particular individuals.

ADDITIONAL STATUTORY REQUIREMENTS FOR DATA PROTECTION

Concern over the possibility of reidentifying the respondents in SED had existed for a long time. However, prior to 2006, suppression had not been applied to the counts of fine field of doctoral degrees by race/ethnicity, gender, or citizenship.
The immediate impetus for the decision to suppress these counts was the publication of implementation guidance for Title V of the E-Government Act: the Confidential Information Protection and Statistical Efficiency Act of 2002. In his presentation to the workshop, Brian Harris-Kojetin (Office of Management and Budget) described the purposes, issues, and requirements of CIPSEA and the reach of the implementation instructions.

BOX 2-1
NSF Legislation Regarding Public Disclosure of Information

(A) Information supplied to the Foundation or a contractor of the Foundation in survey forms, questionnaires, or similar instruments for purposes of section 1862 (a)(5) or (6) of this title by an individual, an industrial or commercial organization, or an educational, academic, or other nonprofit institution when the institution has received a pledge of confidentiality from the Foundation, shall not be disclosed to the public unless the information has been transformed into statistical or abstract formats that do not allow for the identification of the supplier.

(B) Information that has not been transformed into formats described in subparagraph (A) may be used only for statistical or research purposes.

(C) The identities of individuals, organizations, and institutions supplying information described in subparagraph (A) may not be disclosed to the public.

(2) In support of functions authorized by section 1862 (a)(5) or (6) of this title, the Foundation may designate, at its discretion, authorized persons, including employees of Federal, State, or local agencies or instrumentalities (including local educational agencies) and employees of private organizations, to have access, for statistical or research purposes only, to information collected pursuant to section 1862 (a)(5) or (6) of this title that allows for the identification of the supplier. No such person may—

(A) publish information collected pursuant to section 1862 (a)(5) or (6) of this title in such a manner that either an individual, an industrial or commercial organization, or an educational, academic, or other nonprofit institution that has received a pledge of confidentiality from the Foundation can be specifically identified;

(B) permit anyone other than individuals authorized by the Foundation to examine data that allows for such identification relating to an individual, an industrial or commercial organization, or an academic, educational, or other nonprofit institution that has received a pledge of confidentiality from the Foundation; or

(C) knowingly and willfully request or obtain any nondisclosable information described in paragraph (1) from the Foundation under false pretenses.

(3) Violation of this subsection is punishable by a fine of not more than $10,000, imprisonment for not more than 5 years, or both.

SOURCE: Section 14(i) of the NSF Act (42 U.S.C. 1873(i)), in conjunction with the Privacy Act of 1974, as amended (http://www.usdoj.gov/oip/privstat.htm).

As part of the E-Government Act, the confidential information protection legislation was designed to fill long-standing gaps in the ability of the federal statistical agencies to prohibit disclosure of information in identifiable form, control access to and uses made of statistical information, ensure that information is used exclusively for statistical purposes, and, by doing so, strengthen and foster public trust in pledges of confidentiality. Harris-Kojetin described the benefits of CIPSEA, including the application of uniform protection across agencies, coverage of all data collected for statistical purposes under a pledge of confidentiality, strong penalties for disclosure ($250,000 fine and/or 5 years in prison), and exemption from Freedom of Information Act requests.

Under the authority of CIPSEA, the director of the Office of Management and Budget (OMB) coordinates and oversees the confidentiality and disclosure policies and promulgates rules and guidance for implementing the act. The implementing guidance for CIPSEA was published in draft form in October 2006 (Office of Management and Budget, 2006) and in final form in June 2007. This guidance defined “statistical purpose,” identified statistical agencies covered by the act, and outlined CIPSEA requirements.

It is important to define statistical purpose, because CIPSEA protection applies only to information acquired under a pledge of confidentiality for exclusively statistical purposes.[3] Statistical purpose includes the description, estimation, or analysis of characteristics of groups, without identifying the individuals or organizations that make up such groups; it also includes the methods and procedures to support these purposes. Likewise, CIPSEA applies throughout the federal government but grants statistical agencies special authority, which allows them to empower special sworn agents to analyze, collect, and process data protected under CIPSEA. Statistical agencies are defined as agencies or organizational units of the executive branch whose activities are predominately the collection, compilation, processing, or analysis of information for statistical purposes. OMB has determined that NSF’s Division of SRS is a statistical agency for the purposes of CIPSEA.

[3] This provision varies somewhat from the provisions in the National Science Foundation Act of 1950 as amended, in that the NSF act applies to information collected for both statistical and research purposes.

According to Harris-Kojetin, CIPSEA is prescriptive, in that it requires statistical agencies to inform respondents about confidentiality protection (using a pledge on the collection instrument) and the use of the information, to collect and handle confidential information in ways that minimize the risk of disclosure, to ensure the information is used only for statistical purposes, and to review information to be disseminated to prevent identifiable information from being reasonably inferred by either direct or indirect means. The guidelines do not go into the means that agencies should use to review and protect information. For that, the OMB guidance refers agencies to Statistical Policy Working Paper 22 (Office of Management and Budget, 2005).

FURTHER GUIDELINES FOR STATISTICAL DATA PROTECTION

In addition to the legislation covering the federal statistical system and that pertaining specifically to NSF, several other sources can be used as guidance for federal agencies in resolving issues of confidentiality and access.
These include Statistical Policy Working Paper 22 and the recently issued American Statistical Association statement, “Data Access and Personal Privacy: Appropriate Methods of Disclosure Control.”[4] Alvan Zarate, former confidentiality officer for the National Center for Health Statistics (NCHS), described these sources and outlined principles traditionally used by federal statistical agencies to protect tabular data from disclosure.

[4] Available at http://www.amstat.org/news/statementondataaccess.cfm.

Although not directly supported by legislation, Working Paper 22 has been cited in rules relating to the Health Insurance Portability and Accountability Act of 1996, the American Recovery and Reinvestment Act of 2009, and the Health Information Technology for Economic and Clinical Health Act of 2009. Its purposes are straightforward: to detail existing methods of statistical disclosure limitation for tables and microdata files; to provide recommendations and guidance for selection and use of appropriate techniques; to promote the development, sharing, and use of statistical disclosure limitation software; and to encourage research to develop improved methods.

In general, Working Paper 22 notes that disclosure protection techniques are applied to data cells containing small counts of demographic variables because “if a cell has only a few respondents and the characteristics are sufficiently distinctive, then it may be possible for a knowledgeable user to identify the individuals in the population” (Office of Management and Budget, 2005, p. 57). The most common disclosure limitation techniques applied to small cells are cell suppression, data aggregation, and data perturbation (i.e., “adding noise” to data). These are some of the options that were considered by NSF in the process of coming to the current decision on disclosure limitation.

Zarate reported that in developing Working Paper 22, the Federal Committee on Statistical Methodology recommended creation of a group to communicate and share information on confidentiality, statistical disclosure limitation, and restricted access. The Confidentiality and Data Access Committee consists of representatives of federal statistical agencies who deal with confidentiality, data access, and disclosure review techniques. It developed a checklist on disclosure potential that agencies can use to identify tabular data that are at risk of disclosure; the committee also shares information on emerging mechanisms for restricted access (see Chapter 4).
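
The three techniques named above (cell suppression, data aggregation, and data perturbation) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the records, the field and group labels, the broad-field mapping, the threshold of 5, and the noise range are hypothetical and are not taken from SED, NSF practice, or Working Paper 22.

```python
import random
from collections import Counter

# Hypothetical respondent records: (fine field of degree, demographic group).
RECORDS = (
    [("Acoustics", "Group A")] * 12
    + [("Acoustics", "Group B")] * 2
    + [("Optics", "Group A")] * 30
    + [("Optics", "Group B")] * 4
)

# Hypothetical mapping from fine fields to a broader field.
BROAD_FIELD = {"Acoustics": "Physics", "Optics": "Physics"}

THRESHOLD = 5  # assumed minimum publishable cell count, not an official rule


def tabulate(records):
    """Count the records that fall in each (field, group) cell."""
    return Counter(records)


def suppress(table, threshold=THRESHOLD):
    """Cell suppression: withhold (None) every count below the threshold."""
    return {cell: (n if n >= threshold else None) for cell, n in table.items()}


def aggregate(table, mapping=BROAD_FIELD):
    """Data aggregation: combine fine fields into broader fields."""
    totals = Counter()
    for (field, group), n in table.items():
        totals[(mapping[field], group)] += n
    return totals


def perturb(table, max_noise=2, seed=0):
    """Data perturbation: add small random integer noise, floored at zero."""
    rng = random.Random(seed)
    return {
        cell: max(0, n + rng.randint(-max_noise, max_noise))
        for cell, n in table.items()
    }


if __name__ == "__main__":
    table = tabulate(RECORDS)
    print("suppressed:", suppress(table))
    print("aggregated:", dict(aggregate(table)))
    print("perturbed: ", perturb(table))
```

In this hypothetical table both Group B cells (counts of 2 and 4) fall below the assumed threshold and are suppressed, while aggregating the two fine fields yields a combined count of 6 that could be shown. The sketch covers primary suppression only: when row or column totals are also published, a suppressed cell can often be recovered by subtraction, so complementary suppression of additional cells is generally needed as well.
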
The recommendations of Working Paper 22 are heeded in the federal statistical system as indications of good practices when it comes to making decisions on the use of data suppression or other techniques to protect the confidentiality of tabular data.

A more recent source of information on good practices is the American Statistical Association’s statement, “Data Access and Personal Privacy: Appropriate Methods of Disclosure Control.”[5] This statement gives the association’s perspective on the assessment of the risk associated with data dissemination and an overview of the way in which statisticians can help limit that risk. The statement recognizes that tabular data are subject to risk because, “although tables are intended to protect individual information by presenting grouped figures, there are situations in which the size and/or the distribution of those groups can reveal more information about individuals . . . than had been publicly known.” The statement makes the case that the context of privacy protection has changed. The same powerful and sophisticated electronic technologies that have made data readily accessible to the public “pose a distinct threat—in perception if not in reality—to privacy, as well as a potential for inflicting great harm on persons.”

[5] Available at http://www.amstat.org/news/statementondataaccess.cfm.

Zarate shared several illustrations of the risk of disclosure and techniques for overcoming those risks based on his experiences at NCHS. Like the NSF act of 1950, as amended, the Public Health Service Act of 1974 contains language specifying that no identifiable information may be used for any purpose other than that for which it was provided, nor may it be released to any party not agreed to by the supplier (Section 308(d)). The protection of data at the agency engages a confidentiality officer, a disclosure review board, and the data division director. As part of the review process, the agency uses a disclosure checklist, which contains a series of questions to determine the geographic detail, the presence of sensitive variables (such as age, race, occupation, income, and household type), and whether other databases contain similar data. It is recognized that risk of disclosure remains even if all possible protections are applied, so the rule is that protection is paramount. Data are released for research “only when the risk is judged to be extremely low” (National Center for Health Statistics, 2004, p. 14).

In response to a question, Zarate stated that NCHS reviews its decisions on risk of disclosure every 4 years or so because threats change as computational power and techniques evolve.