Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics

6
Technical and Administrative Procedures

The federal government has led the way in the development of statistical methodology … to protect the confidentiality of respondents, usually called disclosure-avoidance techniques.

Barbara Bailar, 1990

The disclosure control to be applied to the release should ideally be based on information about … i. the degree of protection provided by a specific scheme; and ii. the amount of loss of information introduced by that scheme.

Tore Dalenius, 1991

VA officials acknowledged their Income Verification Match project had gone awry. A computer malfunction in Chicago caused more than 650 veterans to receive another veteran's income tax information.

Washington Post, May 27, 1992

INTRODUCTION

The general and agency-specific information statutes discussed in Chapter 5 provide the framework for the confidentiality and data access policies and practices of federal statistical agencies. Within this broad and complex framework, the agencies have substantial latitude to develop and apply specific technical and administrative techniques in order to achieve wide dissemination and use of publicly collected data while protecting the confidentiality of individual information.

Statistical agencies have two main options for protecting the confidentiality of released data: providing restricted data and providing restricted access. The first option entails restricting the content of data sets or files to be released. Before releasing a microdata file, for example, an agency would usually remove explicit identifiers (e.g., name, address, and Social Security number) and might further curtail the information in the file (e.g., by giving people's ages in five-year intervals rather than by exact date of birth). The second option entails imposing conditions on who may have access to agency data, for what purpose, at what locations, and so forth. (A similar distinction was made by Marsh et al. (1991b) in a paper about data dissemination practices in the United Kingdom. Instead of restricted data and restricted access, they used the terms safe data and safe setting.) Microdata sets that are released with no restrictions on access (but typically with many restrictions on content) are commonly referred to as public-use data sets.

There is an inverse relationship between restrictions on data and restrictions on access: as data restrictions increase, fewer restrictions on access are needed and vice versa. Some user needs cannot be met with restricted data, however, because the data transformation required to ensure data confidentiality is so extreme that the restricted data are useless for inference purposes. In response, an agency may allow more restricted access to less restricted data. Conversely, to ensure confidentiality, access may have to be so restricted that a legitimate user cannot, as a practical matter, obtain the data. Again in response, an agency may allow less restricted access to more restricted data. Neither restricted data nor restricted access alone is a panacea. To make effective use of data while protecting confidentiality, both options are needed, often in combination.

On the other hand, without assurances of adequate confidentiality shields, data collection may be stymied.
A study of draft evaders who fled to a neutral country, for example, was never conducted because researchers were unable to convince the potential respondents that their anonymity could be ensured (Sagarin, 1973). Thus, the comment by Dalenius quoted at the beginning of this chapter is to the point. Confidentiality protection shields must be both effective and faithful to the original data. The goal of technical and administrative shields is to protect confidentiality adequately while leaving the statistical agency sufficiently unencumbered that it can furnish faithful data. The criterion for faithful data is simple and compelling: "The user expects to be the peer of the data collector in answering a research question" (David, 1991:94).

This chapter has two main sections. In the first section we discuss statistical techniques for protecting the confidentiality of data on persons and other data subjects included in released data sets. It is widely accepted that it is virtually impossible to release data for statistical use without incurring some risk that one or more persons can be identified and information about them disclosed, especially if substantial resources are used in a deliberate attempt to do so. The goal of using statistical disclosure limitation techniques (mathematical methods that depend on statistical characteristics of the data) prior to releasing data is to reduce the magnitude of that risk.

In the second section of the chapter, we discuss agency policies and procedures for providing restricted access to data on persons. We also present examples of several kinds of restricted access, including data sharing between agencies for statistical purposes and access by end users outside the government. One of the examples includes the release of microdata in encrypted form. Encryption is a technical procedure, but it is not generally thought of in the context of statistical disclosure limitation techniques, and so we include it in our discussion of restricted access procedures.

RESTRICTED DATA: STATISTICAL TECHNIQUES FOR PROTECTING CONFIDENTIALITY

In this section we focus on the search for statistical techniques that restrict statistical data so as to protect the confidentiality of the data while maintaining utility to the legitimate data user. Although the topic is technical, our treatment is not. Nor do we provide a primer on available techniques. That would only duplicate concurrent work of the Federal Committee on Statistical Methodology (1993), Subcommittee on Disclosure Limitation Methodology, which is chaired by Nancy Kirkendall. We do describe the key concepts of the approach, evaluate its value for confidentiality protection and data access, and present our recommendations.
In so doing, we address four topics: the nature of disclosure risk and statistical procedures for disclosure limitation; current statistical disclosure limitation practices of federal statistical agencies; the impact of increased computer and communications capability on disclosure risk; and current statistical disclosure limitation research.

DISCLOSURE RISK AND STATISTICAL DISCLOSURE LIMITATION TECHNIQUES

As defined in Chapter 1, a disclosure occurs when a data subject is identified from a released file (identity disclosure), sensitive information about a data subject is revealed through the released file (attribute disclosure), or released data make it possible to infer the value of an attribute of a data subject more accurately than otherwise would have been possible (inferential disclosure). Inferential disclosure may involve either identity disclosure or attribute disclosure.

The most general of these concepts is inferential disclosure, which was defined by Dalenius (1977), supported by the Federal Committee on Statistical Methodology (1978), and summed up by Jabine et al. (1977:6) in the following statement: "If the release of the statistic S makes it possible to determine the (microdata) value more closely than is possible without access to S, a[n inferential] disclosure has taken place."

As discussed in Chapter 5, the only way to have zero risk of inferential disclosure is not to release any data. In practice, the extent of disclosure can only be limited to below some acceptable level. Indeed, in Recommendation 5–2 the panel suggested that new legislation should recognize this fact and allow for release of information for legitimate statistical purposes when it entails a reasonably low risk of disclosure.

Fellegi's (1972) view that disclosure requires identity disclosure and attribute disclosure is closest to what is commonly understood by disclosure and provides a reasonable basis for legislative language. On the other hand, the concept of inferential disclosure is useful to statistical agencies developing and analyzing statistical disclosure limitation techniques. Also, inferential disclosure encompasses a broader range of confidentiality risks that an agency should examine.

Statistical disclosure limitation techniques involve transformations of data to limit the risk of disclosure.
Use of such a technique is often called masking the data, because it is intended to hide characteristics of data subjects. Some statistical disclosure limitation techniques are designed for data accessed as tables (tabular data), some are designed for data accessed as records of individual data subjects (microdata), and some are designed for data accessed as computer data bases. Common methods of masking tabular data are deleting table entries (cell suppression) and altering table entries (random error, or noise introduction). Common methods of masking microdata are deleting identifiers, dropping sensitive variables, releasing only a small fraction of the data records, and grouping data values into categories (as in topcoding, whereby data values exceeding a certain level are assigned to the top category). As discussed below, direct access of computer data bases, a recent phenomenon, may involve either tabular data or microdata, but it raises new statistical disclosure limitation issues.

Prior to releasing statistical data, a statistical agency removes from the data records any explicit identifiers (such as name, address, Social Security number, telephone number) of data subjects that are not needed for statistical purposes. In many situations, however, this obvious step of deidentification or anonymization is not adequate to make the risk of disclosure reasonably low.

To go beyond ad hoc measures to reduce the risk of disclosure, it is necessary to have ways of measuring the nature and extent of disclosure possible in specified circumstances. In the context of inferential disclosure, Duncan and Lambert (1986) measure the extent of disclosure in terms of the change in uncertainty about a shielded value prior to data release and after data release. Common disclosure limitation policies, such as requiring the relative frequencies of released cells to be bounded away from both one and zero, are equivalent to disclosure rules that allow data release only if specific uncertainty functions at particular predictive distributions exceed a limit. This effort generalizes work of Cassel (1976) and Frank (1976, 1979) and demonstrates the analytic power of the inferential disclosure formulation.

A seemingly different approach was taken by Bethlehem et al. (1990), who focus on the number of unique individuals in a population, and so use identity disclosure limitation. To illustrate how surprisingly often individuals prove to be unique based on just a few common variables, they noted that a certain region of the Netherlands contained 83,799 households. Of those households, 23,485 were composed of a father, mother, and two children. Looking only at ages of the father and mother and ages and sexes of the two children (all ages in years), 16,008 of the 23,485 households (68 percent) were unique.
If a microdata file were released containing this key information on ages and gender plus other sensitive information, confidentiality could easily be compromised in most cases by any individual having only basic background information about a household. At least that would be the case if confidentiality were attacked broadly, for example, by an individual who simply sought to identify any one household among all households rather than some specific household. Such "fishing expeditions," although presumably rare in practice, are a concern to agencies worried that someone might seek to discredit their confidentiality policies. A disclosure risk analysis based on the number of unique entities can be one element of the Duncan and Lambert (1986) framework.
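The uniqueness count described above can be illustrated with a minimal sketch. The household records, the field names, and the helper `count_unique` are hypothetical and invented for illustration; they are not data or code from Bethlehem et al. (1990). The sketch only shows the idea of counting records whose combination of key variables occurs exactly once.

```python
from collections import Counter

# Hypothetical key variables: parents' ages plus (age, sex) of each child.
KEY_VARS = ("father_age", "mother_age", "child1", "child2")

# Toy population of four-person households (invented values).
households = [
    {"father_age": 41, "mother_age": 39, "child1": (12, "F"), "child2": (9, "M")},
    {"father_age": 41, "mother_age": 39, "child1": (12, "F"), "child2": (9, "M")},
    {"father_age": 35, "mother_age": 36, "child1": (7, "M"), "child2": (4, "M")},
    {"father_age": 52, "mother_age": 47, "child1": (17, "F"), "child2": (15, "F")},
    {"father_age": 38, "mother_age": 33, "child1": (6, "F"), "child2": (2, "M")},
]


def count_unique(records, key_vars):
    """Return (number of records unique on key_vars, total number of records)."""
    keys = [tuple(r[v] for v in key_vars) for r in records]
    freq = Counter(keys)
    # A record is "unique" when no other record shares its key combination.
    unique = sum(1 for k in keys if freq[k] == 1)
    return unique, len(keys)


unique, total = count_unique(households, KEY_VARS)
print(f"{unique} of {total} households are unique on the key variables")
```

Applied to the figures quoted in the text, the same ratio is 16,008 / 23,485, or about 68 percent.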

Release of a microdata file in which key variables are generally widely available in the external world could permit linking data for all of the unique households. This would raise the disclosure risk drastically for the shielded sensitive information.

When disclosure risk is too high, data can be masked prior to release. As noted above, different statistical masking techniques are used for public-use microdata files and tables. In the case of a public-use microdata file, statistical disclosure limitation techniques can be classified into five broad categories (Duncan and Pearson, 1991):

Collecting or releasing only a sample of the data: For example, the Bureau of the Census first released a public-use microdata file with a 1-in-1,000 sample from the 1960 Census of Population and Housing; microdata products from the 1980 census included one based on a 1 percent sample and another based on a 5 percent sample.

Including simulated data: This technique has not been implemented, but it is conceptually akin to including several identical limousines in a motorcade under threat of terrorist attack.

"Blurring" of the data by grouping or adding random error to individual values: Presenting subjects' ages in 10-year intervals is an example of grouping. For an example of addition of random error, the Census Bureau prepared a microdata file for researchers at the National Opinion Research Center from the 1980 census that contained census tract characteristics (e.g., percentage of blacks and Hispanics, unemployment rate, median house value). Because the tract characteristics had unique combinations and those characteristics could readily be learned from Census Bureau publications, the records could be linked to the actual tract of residence. That would have violated the Census Bureau policy of not identifying a geographic area with fewer than 100,000 residents.
To reduce this risk, tract characteristics were masked by adding random error, or noise (see Kim, 1990).

Excluding certain attributes: An agency might provide a subject's year of birth but not the month and day, or quarterly employment data could be replaced with yearly summaries.

Swapping of data by exchanging the values of just certain variables between data subjects: For example, the value for some sensitive variable in a record could be exchanged for that in an adjacent record.

For data released as tables, the blurring and swapping techniques described above have been used. Three other statistical disclosure limitation techniques are unique to tables (see Cox, 1980, 1987; Sande, 1984):

Requiring each marginal total of the table to have a minimum count of data subjects.

Using a "concentration" rule, also known as the (N, K)-rule, where N entities do not dominate K percent of a cell; for example, requiring that the reported aspects of the two dominant businesses in a cell comprise no more than a certain percentage of a cell.

Using controlled rounding of table entries to perturb entries while maintaining various marginal totals.

The more sophisticated the masking technique, the less accessible the data and their analysis will be to many social scientists and policymakers. The statistical disclosure limitation approach that creates the fewest problems for analysts is masking through sampling because (at least with simple random sampling) standard statistical inference tools readily apply. However, for public-use microdata files, other statistical disclosure limitation techniques, such as the addition of noise, present measurement error or errors-in-variables problems for the user of the masked data (see, e.g., Sullivan and Fuller, 1989). This can introduce bias in the inferences that are drawn. In general, the choice of an appropriate statistical disclosure limitation method depends on the statistical procedure to be used to analyze the data.

SELECTED STATISTICAL DISCLOSURE LIMITATION PRACTICES OF FEDERAL STATISTICAL AGENCIES

Some federal statistical agencies, notably the Census Bureau, have devoted considerable attention to the development and implementation of statistical disclosure limitation techniques. In an enduring contribution, the Federal Committee on Statistical Methodology (1978) issued Statistical Policy Working Paper 2, Report on Statistical Disclosure and Disclosure-Avoidance Techniques.
The report summarized the practices of seven federal agencies, presented recommendations regarding statistical disclosure limitation techniques, and discussed the effects of disclosure on data subjects and users and the need for research and development in this area. Because of their continued relevance, the panel endorses certain key recommendations from Statistical Policy Working Paper 2 in presenting its own recommendations below. Attention to this area is perhaps even more justifiable today because most statistical disclosure limitation methods that are employed in practice lack theoretical roots (see Duncan and Lambert, 1986; Greenberg and Zayatz, 1992). Their ad hoc nature leads to criticism by potential users that they are excessively conservative. For decades, for example, the Census Bureau used an area size cutoff of 250,000 residents to limit the disclosure risk based on geographic specification, a cutoff that was lowered to 100,000 residents for most surveys and censuses after systematic studies were conducted (see Greenberg and Voshell, 1990a,b).

Procedures Used for Tables

Several statistical agencies have applied disclosure limitation to tabular data. In practice, most agency guidelines provide for suppressing the values in the cells of a table based on minimum cell sizes and an (N, K) concentration rule. Minimum cell sizes of three are almost invariably used, because each member of a cell of size two could derive a specific value from the other member. A typical concentration rule specifies that no more than K = 70 percent of a cell value is attributable to N = 1 entity. Often, the particular choice of N and K is not revealed to data users. Cell suppression in tables always results in a loss to the data user of some detailed information. Because cell sizes are typically small in geographic areas with small populations, geographic detail is often lost (see below).

Procedures for Microdata Files

Only about half of the federal statistical agencies that replied to the panel's request for information included materials that documented their statistical disclosure limitation techniques for microdata. Some that did merely indicated that the statistical disclosure limitation techniques for surveys they sponsored were set by the Census Bureau's Microdata Review Panel because the surveys had been conducted for them by the Census Bureau.
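The two tabular suppression rules described under Procedures Used for Tables (a minimum cell size and an (N, K) concentration rule) can be sketched as a single check. This is a minimal illustration, not any agency's actual procedure: the function name `needs_suppression`, the treatment of empty or all-zero cells, and the example values are assumptions, and the defaults simply reuse the typical figures quoted in the text (minimum size three, N = 1, K = 70).

```python
def needs_suppression(contributions, min_size=3, n=1, k=70.0):
    """Flag a table cell for suppression.

    contributions: nonnegative values, one per respondent in the cell.
    A cell is flagged when it has fewer than min_size respondents, or when
    its n largest contributors account for more than k percent of the total.
    """
    # Minimum cell size rule. Flagging empty cells too is an assumption here.
    if len(contributions) < min_size:
        return True
    total = sum(contributions)
    if total <= 0:
        # Assumption: an all-zero cell has no dominant contributor.
        return False
    # (N, K) concentration rule: share of the n largest contributors.
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / total > k


# Hypothetical cells, each a list of per-business values.
print(needs_suppression([500, 300]))       # only two respondents
print(needs_suppression([900, 50, 50]))    # one business holds 90 percent
print(needs_suppression([400, 350, 250]))  # largest share is 40 percent
```

The sketch covers only the decision about a single cell. In a table published with marginal totals, a suppressed cell can often be recovered by subtraction, so additional cells usually have to be withheld as well; that step is not shown here.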
Major releasers of public-use microdata (that is, the Census Bureau, the National Center for Health Statistics (NCHS), and more recently the National Center for Education Statistics, or NCES) have all established formal procedures for review and approval of new microdata sets. In general those procedures do not rely on parameter-driven rules like those used for tabulations. Instead, they require judgments by reviewers that take into account such factors as the availability of external files with comparable data, the resources that might be needed by a "snooper" to identify individual units, the sensitivity of individual data items, the expected number of unique records in the file, the proportion of the study population included in the sample, and the expected amount of error in the data.

Since locating a data subject geographically increases disclosure risk, a common disclosure limitation method is to coarsen geographic detail. The Census Bureau and NCHS specify that no geographic codes for areas with a population of less than 100,000 can be included in public-use microdata files. If a file contains a large number of variables, a higher cutoff may be used. The inclusion of local-area characteristics, such as the mean income, population density, and percentage of minority population of a census tract, is also limited by this requirement because if enough variables of this type are included, the local area can be uniquely identified. For example, in the Energy Information Administration's Residential Energy Consumption Surveys, local weather information has had to be masked to prevent disclosure of the geographic location of surveyed households. Lack of geographical coding may have limited effect on certain analyses of national programs, like Food Stamps.
Yet expunging state of residence of a sampled household, as is done in the public-use microdata files of the Survey of Income and Program Participation (SIPP), precludes many useful analyses of programs (e.g., Aid to Families with Dependent Children) for which eligibility criteria vary by state.

THE IMPACT OF IMPROVED COMPUTER AND COMMUNICATIONS TECHNOLOGY

The President's Commission on Federal Statistics reported in 1971 that the development of new methods of storing and recalling information by computer has generated considerable confusion in the public mind about the government's need for personal information about individuals, and apprehension about the use of such information (pp. 198–199). This comment remains emphatically legitimate today, over 20 years after the publication of the report of the presidential commission. The public fears that the government's amassing of large data bases will reach into the private lives of citizens (see, e.g., Burnham, 1983; Flaherty, 1989).

The computer revolution that caused concern in 1971 is still ongoing. In the 1960s and 1970s mainframe computers were available to comparatively few individuals. Beginning with the mass introduction of personal computers by International Business Machines (IBM) in 1981, significant computing power became available on the desks of most researchers. Because of improvements in computer and communications technology, the prevailing mode for data storage today is a computer data base, and a frequent mode of access is through remote telecommunications. The latter permits large data bases to be developed and maintained according to strict quality control standards while allowing easy access by a widely dispersed group of data users. In the commercial field the growth of airline reservation systems is a prototypical example of this phenomenon. In the area of research and policy analysis, this technology allows rich data bases to be assembled, provides access without a special trip to the site where the data are stored, and in some cases enables researchers to use their own software in analyzing the data. An often-cited example of such an arrangement is the Luxembourg Income Study (Rainwater and Smeeding, 1988), which maintains microdata sets containing measures of economic well-being for many developed countries.

The key consideration for custodians of remotely accessed data bases is to provide the benefits of this easy access while ensuring confidentiality. Only statistical aggregates, such as tabulations, should be obtainable. The ability to download individual records should be precluded. Further, and this is more difficult, the data user should not be able to infer the information contained in individual records from permitted queries about aggregates.

Data base security has both administrative and technical aspects.
Administratively and at a systemwide level, the institution maintaining the data base must create an environment in which passwords are not shared and the use that individuals can make of secure data is subject to periodic audit and review. Technically and at the data base level, reliance on access control procedures, such as the use of passwords, is not fully adequate. All modern data bases are designed to provide security against direct query of certain attributes. In this way, any user can be allowed to access only a restricted part of a data base. Such a multilevel data base basically stores data according to different security classifications and allows users access to data only if their security level is greater than or equal to the security classification of the data.

Unfortunately, this simple device does not preclude the existence of what is called an inference channel (see Denning and Lunt, 1988). An inference channel is said to exist in a multilevel data base when a user can infer information classified at a high level (to which the user does not have access) based on information classified at a lower level to which the user does have access. Such a channel may be hard to detect because it may involve a lengthy chain of inference and combine information that is explicitly stored in the data base and other, external information.

Further, a data user may have legitimate access to statistical aggregates, such as averages and regression coefficients for such sensitive attributes as medical or salary information, while not being permitted access to information that is identifiable to a specified individual. In such situations, inferential disclosure control is usually implemented through query restriction (Fellegi, 1972) or response modification (Denning, 1980). In query restriction, certain queries, such as those pertaining to a few entities in the data base, are not answered. In response modification, the answers to certain queries are modified, for example, by adding random error or rounding. In either instance, a problem stems from the fact that the data base can be repeatedly queried. Each individual query may be innocuous, but the sequential queries may be sufficient to compromise the data base. This particular problem is an area of ongoing research (see Adam and Wortman, 1989; Duncan and Mukherjee, 1991, 1992; Keller-McNulty and Unger, 1993; Matloff, 1986; Tendick, 1991; and Turn, 1990).

To date, research suggests that (1) user demand for data access through remote querying of relational data bases is inevitable and (2) modern data bases give rise to special problems in protecting confidentiality that require new disclosure limitation techniques.
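The repeated-query problem described above can be made concrete with a small sketch. Everything in it is hypothetical: the toy salary table, the threshold `MIN_QUERY_SET`, and the function `sum_salary` are invented for illustration and do not represent any real data base system. The sketch applies a simple form of query restriction (refusing sums over fewer than three records) and then shows how two permitted queries can be differenced to recover one person's value.

```python
MIN_QUERY_SET = 3  # assumed threshold: refuse queries covering fewer records

# Toy data base (invented values). Only one person works in "legal".
people = [
    {"dept": "sales", "salary": 50_000},
    {"dept": "sales", "salary": 52_000},
    {"dept": "eng", "salary": 61_000},
    {"dept": "eng", "salary": 64_000},
    {"dept": "legal", "salary": 90_000},
]


def sum_salary(predicate):
    """Answer an aggregate query, refusing it if too few records match."""
    rows = [p for p in people if predicate(p)]
    if len(rows) < MIN_QUERY_SET:
        raise PermissionError("query set too small; query refused")
    return sum(p["salary"] for p in rows)


# A direct query about the single legal employee is refused.
try:
    sum_salary(lambda p: p["dept"] == "legal")
except PermissionError as err:
    print("refused:", err)

# Two individually permitted queries (5 records and 4 records).
total_all = sum_salary(lambda p: True)
total_not_legal = sum_salary(lambda p: p["dept"] != "legal")

# Their difference isolates the one record the restriction was meant to protect.
inferred = total_all - total_not_legal
print("inferred salary of the legal employee:", inferred)
```

Each query on its own passes the size check, which is why a rule applied to queries one at a time is not sufficient. A custodian would need to consider the history of queries or modify responses, and the text notes that this remains an area of ongoing research.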
RECENT RESEARCH ON DISCLOSURE LIMITATION

Until recently agencies had little theoretical justification for the statistical disclosure limitation tools they employed (see Cox, 1986; Duncan and Pearson, 1991). The tools of this field are not only statistical but also come from the fields of mathematics, computer science, numerical analysis, linear and integer programming, and operations research (see Brackstone, 1990; Greenberg 1990, 1991). Much of the research on disclosure limitation has taken place in federal agencies, especially the Census Bureau (Bailar, 1990). Recently, statistical disclosure limitation has begun to attract the attention of university researchers as a field of inquiry. Below we describe a few recent studies of statistical disclosure limitation techniques.

information for labor market and related research. The proceedings of a 1978 workshop on the uses of Social Security research files, for example, included more than 10 research papers based on CWHS data sets, mostly from non-federal researchers (Social Security Administration, 1978). The Bureau of Economic Analysis also used CWHS data for its regional national income accounts and analyses of regional labor force characteristics. To enable users to update their longitudinal CWHS files on a quarterly or annual basis, a numerical identifier was included with each record. Initially, the identifier was the actual Social Security number, but after a short time, the actual number was replaced by an encrypted number, based on a simple substitution cipher. Later, a more sophisticated transformation was introduced. It is also worth noting that the specific combinations of Social Security number ending digits used to select the 1 percent sample were published in a journal article many years ago, at a time when the effects of such an action on disclosure risks were not generally understood. Prior to 1974, CWHS files were released without restrictions. However, concerns arose that in files containing county and detailed industry codes it would be possible to identify some employers, especially large ones. Further, employers with access to the file, making use of knowledge of the ending digits that had been used to select the sample, would be able to identify records for some of their own employees and to obtain information, for example, about their previous and other current employment. Starting in 1974, because of these concerns, all recipients were required to agree in writing not to redisclose their files without permission or to attempt to identify individual employers or employees from the file.
As explained in Chapter 5, the Tax Reform Act of 1976 severely curtailed the use of identifiable tax return information for statistical purposes by users outside the IRS. Both the employer information from Employer Identification number applications and the earnings data were considered to be tax return information, and thus, the release of CWHS files containing such data came under the new provisions of the Internal Revenue Code. In 1977, the IRS concluded that CWHS files could not be released in the same detail as previously to non-federal users or to most of the federal agencies that had been using them, including the Bureau of Economic Analysis, which had, until that point, played a major role as a user and disseminator of CWHS files in convenient formats to other users (Carroll, 1985; Smith, 1989).

Since the 1977 termination of releases under the arrangements that had been used prior to the 1976 Tax Reform Act, there have been a few releases of CWHS files, under restricted conditions, to other federal agencies. Files with identifiers removed have recently been released to the Treasury Department's Office of Economic Policy and the Congressional Budget Office. Provisions of the 1976 Tax Reform Act permit both of these agencies to use tax return information under certain conditions, so one cannot anticipate similar releases to the many potential users who do not have this authority.

One possibility for wider release would be to develop one or more versions of the CWHS containing less detailed information for characteristics like industry classification and geographic location of employment. However, SSA and the IRS have not been able to agree on a formula for doing this. Thus, agencies like the Bureau of Economic Analysis no longer have access to the microdata files. It seems fair to say that while the CWHS continues to be used for policy analysis by components of the SSA and a handful of other agencies, its availability to the broader community of users is only a small fraction of what it was prior to passage of the 1976 Tax Reform Act.

Access to Address Information in Federal Files for Medical Follow-up and Epidemiologic Studies

Access to data for use in epidemiologic studies and user access to the results of such studies raise many complex issues. This example focuses on a single question: access to federal address information for use in locating and tracking respondents in long-term follow-up studies. Although a few federal agencies do have access to such information, access is so limited that it seems reasonable to put this example in the failure category.

The importance of epidemiologic follow-up studies is hard to exaggerate.
Most people are exposed, in their work and other aspects of their lives, to a host of substances and environmental factors that may lead to adverse health effects. Some of these effects, such as cancer, may not show up until long after initial exposure. Thus, to determine relationships between exposures and effects and to identify the most serious environmental risks, it is necessary to follow groups of exposed persons for long periods, say 20 years or more, and to determine periodically the state of their health, and if they have died, the cause of their death. For many such studies the group of persons to be followed is not identified until well after the period of exposure, so that finding them is not a simple matter.

Within the federal system, the best potential source of current addresses for most of the population is the IRS, which has recent addresses and Social Security numbers for all tax filers and their dependents. The Social Security Administration and the Health Care Financing Administration have the current address and Social Security number for beneficiaries, who include virtually all persons aged 65 and over, but their coverage of persons under 65 is limited.

For death information, the National Death Index provides access to information on all registered deaths in the United States for 1979 and subsequent years. Because of restrictions imposed by the states on uses of their vital statistics data, NCHS, which is the custodian of the National Death Index, can tell researchers only which members of their study populations have died and the states where those deaths occurred. Researchers who want cause of death and other information appearing on the death certificate must contact the state where the death occurred to purchase a copy of the certificate. Another limitation of the National Death Index is that deaths prior to 1979 are not included. There was some discussion of extending the coverage of the National Death Index back to about 1965, but that was found to be infeasible.

The Social Security Administration releases information about date and place of death to the public from its own files, and epidemiologists can obtain this information about members of study populations they are tracking. However, section 205(r) of the Social Security Act, added in 1983 (P.L. 98–21), required the agency to establish a program of voluntary contracts with the states to obtain death certificate information. The purpose was to correct SSA records and remove decedents' names from its benefit rolls.
With some exceptions (see Aziz and Buckler, 1992), SSA cannot provide epidemiologists and other researchers with the death information it obtains from the states through this program. It regularly releases death information it obtains through its own sources.

Prior to 1976, it was possible, at least in some instances, to obtain current name and address information from the IRS for use in follow-up studies. Such access was completely cut off by the 1976 Tax Reform Act; subsequent amendments made the information available again to the National Institute for Occupational Safety and Health on a limited basis and also for follow-up studies of veterans of military service. However, numerous other government and private organizations conducting follow-up studies do not have access to this relatively low-cost and effective means of tracking their study populations.

On a slightly different, but related subject, SSA cannot disclose, for use in epidemiologic follow-up studies, rosters of persons who have worked in a particular industry during a specified period. Such lists would be based on information from earnings reports and Employer Identification number applications, both of which are classified as tax return information and are therefore prohibited from disclosure by the 1976 amendments to the Internal Revenue Code. For periods of employment prior to 1978 there would also be difficulties because the manner in which the SSA's files are organized would make the development of such lists a costly process unless a list based on the 1 percent Continuous Work History Sample could serve the purpose of the study. Finally, SSA has little if any information on current addresses for nonbeneficiaries except on tax records that are subject to the Internal Revenue Code's restrictions on disclosure.

Discussion

These examples illustrate the wide spectrum of modes of external user access to federal statistical data, ranging from no access at all to completely unrestricted access. Legal requirements for confidentiality are not the only factors that influence statistical agencies' decisions on what modes of access are acceptable for particular classes of data and users. When the probability that individual records can be identified and the perceived sensitivity of the data items are high, the agencies are likely to impose greater restrictions on access by external users. An underlying consideration in all decisions is the possibility that a well-publicized violation of confidentiality might lead to widespread public resistance to participation in voluntary or even mandatory statistical data collection programs.
Users look for modes of access that are low cost and enable them to work at their own sites with a minimum of restriction and formality. When their needs cannot be met with public-use data sets, they will generally prefer to work under licensing agreements (Smith, 1991) or with encrypted CD-ROM diskettes, provided those modes will give them timely access to the kinds of data they want, in sufficient detail for their research. Modes of access that require working at agency sites or controlled remote access are not likely to be considered unless there are no alternatives and there are strong incentives to undertake the research.

ARCHIVING FEDERAL STATISTICAL DATA SETS

Our concern in this subsection is with arrangements whereby public-use and other versions of significant statistical data sets, largely in electronic form, are either maintained by statistical agencies or transferred by them to the National Archives and Records Administration (NARA) for preservation and future access for statistical and research uses. The potential interest in and value of secondary analyses of data files long after their original creation is much greater today than it was even 10 years ago given the expanded availability of computer power and new methods of analysis. Some data sets may have the potential for illuminating issues that were not even thought of when they were created. (An analogy would be the practice of freezing genetic material for future breeding and research uses.)

Unfortunately, researchers who have explored the possibility of secondary analyses of old data sets have often encountered serious obstacles. In many instances, the desired files have not been preserved in any form (David, 1980; David and Robbin, 1981). If the files do exist and are accessible, serious difficulties may still result from the lack of supporting documentation, outmoded storage media, and failure to retain the entire data content of files that were used for the original analyses. The question, then, is what can be done to make things easier for future secondary analysts and historical researchers who will want to work with data sets that are now in existence or about to be created?

Current Federal Archiving Procedures

All federal agencies, including statistical agencies, operate under statutorily prescribed information management procedures that oblige them to notify the NARA of proposed schedules for disposition of their records.
In turn, NARA reviews the proposed schedules and, when it considers the records to "have sufficient administrative, legal, research, or other value to warrant their further preservation by the United States Government" (Title 44 U.S.C. § 3303a(a)) may disallow the proposed destruction of the records. For records that have been in existence for more than 30 years, the archivist (director of the agency) may direct the transfer of such records to the National Archives. In general, statutory restrictions on access to and use of records transferred to the National Archives expire after a period of 30 years. However, under certain conditions, by agreement between the archivist and the agency that transferred the records, such restrictions can remain in force for a longer period (Title 44 U.S.C. § 2108(a)). In addition, it is the usual policy of the National Archives to maintain access restrictions for 75 years when the data are about individuals and their earlier release by the agency that transferred the records could have been denied under Freedom of Information Act exemption 6, which covers "personnel and medical and similar files the disclosure of which would constitute a clearly unwarranted invasion of personal privacy" (Title 5 U.S.C. § 552(b)(6)).

The National Archives only accepts electronic files that are physically intact and have adequate documentation to enable researchers to read and make valid use of the data. The agency takes necessary precautions, such as reading files periodically and recopying them to new media, to ensure that the data remain accessible on current computers.

Archiving Census Bureau Records

Arrangements for the archiving of Census Bureau records are fairly well developed. Economic census records, which identify the responding firms, can be used at the National Archives without any restrictions after the statutory period of 30 years has elapsed. Under a special 1952 agreement between the archivist and the Census Bureau, microfilm copies of the original population census records (which are of great interest to genealogists) that are transferred to the National Archives are kept confidential for 72 years, after which time they are made available to users. About two years ago, another agreement between the archivist and the Census Bureau extended provisions of the 1952 agreement to records from the Current Population Survey and other demographic surveys conducted by the Census Bureau.
The 72-year period of confidentiality does not apply to any public-use microdata files that are transferred to the National Archives, but it would apply to internal Census Bureau survey data files whose content had not been restricted to make them suitable for unrestricted public access. To date, no such internal survey files have been transferred to the National Archives, but the archivist has requested files of this type and it is likely that they will be transferred once appropriate security arrangements have been agreed on. To illustrate the kinds of issues that arise, archived electronic files must be copied periodically to ensure their preservation in usable form, and the Census Bureau believes that such copying should be done by special sworn employees who are subject to criminal penalties for improper disclosure of Census Bureau information.

The rapid replacement of paper records by electronic files as the primary storage medium and the prospect of further changes in data base technology pose many challenges and opportunities for NARA in its efforts to meet future research needs. Statutes and procedures developed for archiving paper records will require significant alterations. A panel established by the National Academy of Public Administration to review these questions issued a report in 1991, The Archives of the Future: Archival Strategies for the Treatment of Electronic Databases. The tenor of that panel's recommendations is clearly expressed by the following excerpt from the report:

NARA itself must take an active stance in seeking candidate databases to evaluate for inclusion as part of the National Archives [emphasis in original]. Persuasive and aggressive oversight should be the ringing quality of NARA's role in guiding the preservation policies of all federal agencies. And time is of the essence. An outstanding characteristic of electronic records … is that there is a much briefer span of time in which to bring them under active preservation. NARA's authority prevents it from taking forceful action to guarantee preservation of records until 30 years have passed. Electronic records not brought under the control of a comprehensive and active preservation program are unlikely to survive more than a few years. NARA must seek additional authorities to sustain a viable program for bringing electronic databases into the National Archives (pp. 1–2).

The report recommended that the National Archives preserve data from 430 federal data bases in addition to more than 600 that the agency had already designated as archival. The National Archives is actively pursuing transfers from these data bases.
RESTRICTED ACCESS: FINDINGS AND RECOMMENDATIONS

Recommendation 6.5 Federal statistical agencies should strive for a greater return on public investment in statistical programs through carefully controlled increases in interagency data sharing for statistical purposes and expanded availability of federal data sets to external users. Full realization of this goal will require legislative changes, as discussed in Chapter 5, but much can be accomplished within the framework of existing legislation.

The panel believes that some of the newer and more user-friendly restricted access techniques, such as the release of encrypted CD-ROM diskettes with built-in software and licensing agreements that allow users to use data sets at their own work sites, have considerable promise, and it commends the agencies and organizations that have pioneered the use of such procedures.

Recommendation 6.6 Statistical agencies, in their efforts to expand access for external data users, should follow a policy of responsible innovation. Whenever feasible, they should experiment with some of the newer restricted access techniques, with appropriate confidentiality safeguards and periodic reviews of the costs and benefits of each procedure.

Recommendation 6.7 In those instances in which controlled access at agency sites remains the only feasible alternative, statistical agencies should do all they can to make access conditions more affordable and acceptable to users, for example, by providing access at dispersed agency locations and providing adequate user support and access to computing facilities at reasonable cost.

The panel agrees with the views expressed in the excerpt (above) from the National Academy of Public Administration's report on the archiving of electronic data bases.

Recommendation 6.8 Significant statistical data files, in their unrestricted form, should be deposited at the National Archives and eventually made available for historical research uses.

This recommendation is intended to cover statistical data bases from censuses and surveys and those, like the Statistics of Income and Continuous Work History Sample data bases, that are derived from administrative records. We have purposely not been specific as to the content of such archived data bases and the length of time for which confidentiality restrictions should continue to apply.
Some data bases, like the economic and population censuses, might include explicit identification of data providers. Others, especially those based on samples, might not include names and addresses, but would not be subject to statistical disclosure limitation procedures of the kind that are applied to produce public-use microdata sets for contemporary use.

NOTES

1. Statistical disclosure limitation techniques are often referred to as disclosure control or disclosure avoidance. We prefer disclosure limitation because it expresses more clearly the fact that zero disclosure risk is usually unattainable.

2. These and other conceptualizations of disclosure are explored in Duncan and Lambert (1989), Duncan and Pearson (1991), and Lambert (1993).

3. Called samples of anonymized records. See Collins (1992) and Marsh et al. (1991a, 1992).

4. For example, as recognized by the National Center for Health Statistics,

Of course, all direct identifiers of study subjects, such as name, address, and social security number, are deleted from the public use files. Still, there are so many different items of information about any subject individual or establishment in our typical surveys that the set of information could serve as a unique identifier for each subject, if there were some other public source for many of the survey items. Fortunately there is not. But to minimize the chance of disclosure we take additional precautions; we make sure there are no rare characteristics shown on any case in the files, such as the exact bed-size of a large nursing home, or the exact date of birth of a subject, or the presence of a rare disease, or the exact number of children in a very large family. We either delete or encrypt the code identifying smaller geographical areas—places smaller than 100,000 in population—because anyone trying to identify a respondent will have his task greatly simplified if he knows the respondent's local area (Mugge, 1984:291).

5. Collins (1992) and Marsh et al. (1992) provide further discussion of the relationship between disclosure risk and uniqueness.

6. There are situations in which adding noise has no effect on certain important kinds of analyses. With, for example, the Health Care Financing Administration's Medicare data, releasing values like exact date of death can pose substantial disclosure risk. The common practice of coarsening the data to, say, year of death may limit the usefulness of survival analyses based on the data and make it impossible to draw good inferences about the efficacy of certain medical treatments. However, shifting all dates by a fixed amount, unknown to the data user, will substantially lessen disclosure risk while leaving most survival analyses unharmed, because survival analysis generally depends only on the elapsed time between various events.

7. Some researchers have begun to address these issues. Kamlet et al. (1985), for example, analyze the 1980 National Health Interview Survey, in which several averages are reported rather than individual-level data because of confidentiality restrictions. Typically, analysts of such data simply use the associated group-level information instead of the (unavailable) individual-level data. As Kamlet et al. note, however, this practice can produce inconsistent estimates and regression coefficients of the wrong sign. Kamlet and Klepper (1985) demonstrate how consistent estimators can be computed in special cases. Hwang (1986) deals with the errors-in-variables nature of data masked by adding noise. For such a case, Fuller (1993) illustrates how the data can be analyzed.

8. The material in this section is based largely on Jabine (1993b), a paper commissioned by the panel. Brackstone (1990) notes that the development and application of statistical disclosure limitation techniques to protect the confidentiality of data is a relatively new area of statistics that arose out of the practical problems statistical agencies faced. Early discussions include Bachi and Baron (1969) and Steinberg and Pritzker (1967). Bailar (1990) identifies the development of statistical disclosure limitation techniques to be one of the five major contributions of the Census Bureau. Also see Barabba and Kaplan (1975), Cox et al. (1985), and Greenberg (1990, 1991).

9. The seven agencies are the Bureau of the Census, Bureau of Labor Statistics, Internal Revenue Service, National Center for Education Statistics, National Center for Health Statistics, Social Security Administration, and the Statistical Reporting Service (now the National Agricultural Statistics Service).

10. Based on conversation between Thomas Jabine, a consultant to the panel, and Patricia Doyle of Mathematica at a meeting of the American Statistical Association/Survey Research Methods Working Group on the Technical Aspects of SIPP on May 21, 1992.

11. Published along with other discussion and commissioned papers in a special issue of the Journal of Official Statistics 1993(2). See Appendix A for a list of the papers.

12. For univariate dimensions, he checks percentiles, the mean, standard deviation, and skewness. For multivariate dimensions, he checks correlation coefficients and covariance matrices.

13. The memorandum was signed on February 2, 1990, by Barbara E. Bryant (Census Bureau) and Janet L. Norwood (BLS).

14. The letters were signed on October 20, 1988, by John A. Ferris (DMDC) and by Kenneth C. Scheflen (OIG) on November 22, 1988.
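
The point made in note 6, that shifting all dates by one fixed, undisclosed amount leaves elapsed times intact, can be checked with a minimal sketch. The record layout, the field names, and the offset of 137 days are hypothetical choices made for illustration; the source does not describe how any agency implements such a shift.

```python
from datetime import date, timedelta


def shift_dates(record, offset_days):
    """Return a copy of the record with every date moved by the same offset.

    The offset is the fixed amount that would be kept secret from data users.
    """
    delta = timedelta(days=offset_days)
    return {field: value + delta for field, value in record.items()}


# Hypothetical record with two event dates.
record = {"treatment": date(1990, 3, 1), "death": date(1991, 7, 15)}

# Hypothetical secret offset; the same value is applied to every date.
masked = shift_dates(record, offset_days=-137)

# Survival time is a difference of dates, so the common offset cancels out.
survival_original = (record["death"] - record["treatment"]).days
survival_masked = (masked["death"] - masked["treatment"]).days
```

In this sketch the released dates no longer match any public record of the actual events, yet `survival_original` and `survival_masked` are equal, so an analysis that depends only on elapsed time gives the same result on the masked data. Coarsening to year of death, by contrast, would discard the within-year information those intervals rely on.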
