Technical and Administrative Procedures
The federal government has led the way in the development of statistical methodology … to protect the confidentiality of respondents, usually called disclosure-avoidance techniques.
Barbara Bailar, 1990
The disclosure control to be applied to the release should ideally be based on information about … i. the degree of protection provided by a specific scheme; and ii. the amount of loss of information introduced by that scheme.
Tore Dalenius, 1991
VA officials acknowledged their Income Verification Match project had gone awry. A computer malfunction in Chicago caused more than 650 veterans to receive another veteran's income tax information.
Washington Post, May 27, 1992
The general and agency-specific information statutes discussed in Chapter 5 provide the framework for the confidentiality and data access policies and practices of federal statistical agencies. Within this broad and complex framework, the agencies have substantial latitude to develop and apply specific technical and administrative techniques in order to achieve wide dissemination and use of publicly collected data while protecting the confidentiality of individual information.
Statistical agencies have two main options for protecting the confidentiality of released data—providing restricted data and providing restricted access. The first option entails restricting the content of data sets or files to be released. Before releasing a microdata
file, for example, an agency would usually remove explicit identifiers (e.g., name, address, and Social Security number) and might further curtail the information in the file (e.g., by giving people's ages in five-year intervals rather than by exact date of birth). The second option entails imposing conditions on who may have access to agency data, for what purpose, at what locations, and so forth. (A similar distinction was made by Marsh et al. (1991b) in a paper about data dissemination practices in the United Kingdom. Instead of restricted data and restricted access, they used the terms safe data and safe setting.) Microdata sets that are released with no restrictions on access (but typically with many restrictions on content) are commonly referred to as public-use data sets.
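The restricted-data option described above can be sketched in code. The following is a minimal illustration, not an agency procedure; the field names and the five-year age grouping are hypothetical choices consistent with the example in the text.

```python
# Sketch of the "restricted data" option: strip explicit identifiers
# and coarsen a remaining variable before release. Field names are
# hypothetical, not taken from any actual survey file.

def restrict_record(record):
    """Return a release-ready copy of one survey record."""
    public = dict(record)
    # Remove explicit identifiers outright.
    for field in ("name", "address", "ssn"):
        public.pop(field, None)
    # Replace exact age with a five-year interval, e.g. 37 -> "35-39".
    age = public.pop("age")
    lower = (age // 5) * 5
    public["age_group"] = f"{lower}-{lower + 4}"
    return public

record = {"name": "J. Doe", "ssn": "000-00-0000", "address": "...",
          "age": 37, "income": 41_000}
print(restrict_record(record))  # {'income': 41000, 'age_group': '35-39'}
```

Note that even after this step the record may remain identifiable in combination with other released variables, which is why further masking (discussed later in the chapter) is often required.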
There is an inverse relationship between restrictions on data and restrictions on access: as data restrictions increase, fewer restrictions on access are needed and vice versa. Some user needs cannot be met with restricted data, however, because the data transformation required to ensure data confidentiality is so extreme that the restricted data are useless for inference purposes. In response, an agency may allow more restricted access to less restricted data. Conversely, to ensure confidentiality, access may have to be so restricted that a legitimate user cannot, as a practical matter, obtain the data. Again in response, an agency may allow less restricted access to more restricted data. Neither restricted data nor restricted access alone is a panacea. To make effective use of data while protecting confidentiality, both options are needed, often in combination. On the other hand, without assurances of adequate confidentiality shields, data collection may be stymied. A study of draft evaders who fled to a neutral country, for example, was never conducted because researchers were unable to convince the potential respondents that their anonymity could be ensured (Sagarin, 1973). Thus, the comment by Dalenius quoted at the beginning of this chapter is to the point. Confidentiality protection shields must be both effective and faithful to the original data. The goal of technical and administrative shields is to protect confidentiality adequately while leaving the statistical agency sufficiently unencumbered that it can furnish faithful data. The criterion for faithful data is simple and compelling: "The user expects to be the peer of the data collector in answering a research question" (David, 1991:94).
This chapter has two main sections. In the first section we discuss statistical techniques for protecting the confidentiality of data on persons and other data subjects included in released data sets. It is widely accepted that it is virtually impossible to release
data for statistical use without incurring some risk that one or more persons can be identified and information about them disclosed, especially if substantial resources are used in a deliberate attempt to do so. The goal of using statistical disclosure limitation techniques (mathematical methods that depend on statistical characteristics of the data) prior to releasing data is to reduce the magnitude of that risk.1
In the second section of the chapter, we discuss agency policies and procedures for providing restricted access to data on persons. We also present examples of several kinds of restricted access, including data sharing between agencies for statistical purposes and access by end users outside the government. One of the examples includes the release of microdata in encrypted form. Encryption is a technical procedure, but it is not generally thought of in the context of statistical disclosure limitation techniques, and so we include it in our discussion of restricted access procedures.
RESTRICTED DATA: STATISTICAL TECHNIQUES FOR PROTECTING CONFIDENTIALITY
In this section we focus on the search for statistical techniques that restrict statistical data so as to protect the confidentiality of the data while maintaining utility to the legitimate data user. Although the topic is technical, our treatment is not. Nor do we provide a primer on available techniques. That would only duplicate concurrent work of the Federal Committee on Statistical Methodology (1993), Subcommittee on Disclosure Limitation Methodology, which is chaired by Nancy Kirkendall. We do describe the key concepts of the approach, evaluate its value for confidentiality protection and data access, and present our recommendations. In so doing, we address four topics: the nature of disclosure risk and statistical procedures for disclosure limitation; current statistical disclosure limitation practices of federal statistical agencies; the impact of increased computer and communications capability on disclosure risk; and current statistical disclosure limitation research.
DISCLOSURE RISK AND STATISTICAL DISCLOSURE LIMITATION TECHNIQUES
As defined in Chapter 1, a disclosure occurs when a data subject is identified from a released file (identity disclosure), sensitive
information about a data subject is revealed through the released file (attribute disclosure), or released data make it possible to infer the value of an attribute of a data subject more accurately than otherwise would have been possible (inferential disclosure). Inferential disclosure may involve either identity disclosure or attribute disclosure.2 The most general of these concepts is inferential disclosure, which was defined by Dalenius (1977), supported by the Federal Committee on Statistical Methodology (1978), and summed up by Jabine et al. (1977:6) in the following statement: "If the release of the statistic S makes it possible to determine the (microdata) value more closely than is possible without access to S, a[n inferential] disclosure has taken place." As discussed in Chapter 5, the only way to have zero risk of inferential disclosure is not to release any data. In practice, the extent of disclosure can only be limited to below some acceptable level. Indeed, in Recommendation 5–2 the panel suggested that new legislation should recognize this fact and allow for release of information for legitimate statistical purposes when it entails a reasonably low risk of disclosure.
Fellegi's (1972) view that disclosure requires identity disclosure and attribute disclosure is closest to what is commonly understood by disclosure and provides a reasonable basis for legislative language. On the other hand, the concept of inferential disclosure is useful to statistical agencies developing and analyzing statistical disclosure limitation techniques. Also, inferential disclosure encompasses a broader range of confidentiality risks that an agency should examine.
Statistical disclosure limitation techniques involve transformations of data to limit the risk of disclosure. Use of such a technique is often called masking the data, because it is intended to hide characteristics of data subjects. Some statistical disclosure limitation techniques are designed for data accessed as tables (tabular data), some are designed for data accessed as records of individual data subjects (microdata), and some are designed for data accessed as computer data bases. Common methods of masking tabular data are deleting table entries (cell suppression) and altering table entries (random error, or noise introduction). Common methods of masking microdata are deleting identifiers, dropping sensitive variables, releasing only a small fraction of the data records,3 and grouping data values into categories (as in topcoding, whereby data values exceeding a certain level are assigned to the top category). As discussed below, direct access of computer data bases, a recent phenomenon, may involve either tabular data or
microdata, but it raises new statistical disclosure limitation issues.
Prior to releasing statistical data, a statistical agency removes from the data records any explicit identifiers (such as name, address, Social Security number, telephone number) of data subjects that are not needed for statistical purposes. In many situations, however, this obvious step of deidentification or anonymization is not adequate to make the risk of disclosure reasonably low.4 To go beyond ad hoc measures to reduce the risk of disclosure, it is necessary to have ways of measuring the nature and extent of disclosure possible in specified circumstances. In the context of inferential disclosure, Duncan and Lambert (1986) measure the extent of disclosure in terms of the change in uncertainty about a shielded value prior to data release and after data release. Common disclosure limitation policies, such as requiring the relative frequencies of released cells to be bounded away from both one and zero, are equivalent to disclosure rules that allow data release only if specific uncertainty functions at particular predictive distributions exceed a limit. This effort generalizes work of Cassel (1976) and Frank (1976, 1979) and demonstrates the analytic power of the inferential disclosure formulation.
A seemingly different approach was taken by Bethlehem et al. (1990), who focus on the number of unique individuals in a population, and so use identity disclosure limitation. To illustrate how surprisingly often individuals prove to be unique based on just a few common variables, they noted that a certain region of the Netherlands contained 83,799 households. Of those households, 23,485 were composed of a father, mother, and two children. Looking only at the ages of the father and mother and the ages and sexes of the two children (all ages in years), 16,008 of the 23,485 households (68 percent) were unique. If a microdata file were released containing this key information on ages and sexes plus other sensitive information, confidentiality could easily be compromised in most cases by any individual having only basic background information about a household. At least that would be the case if confidentiality were attacked broadly, for example, by an individual who simply sought to identify any one household among all households rather than some specific household. Such "fishing expeditions," although presumably rare in practice, are a concern to agencies worried that someone might seek to discredit their confidentiality policies.
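The uniqueness count underlying the Bethlehem et al. example is straightforward to compute: treat the key variables as a combined key and count how many keys occur exactly once. The household data below are invented for illustration.

```python
# Counting population uniques on a key, in the spirit of the
# Bethlehem et al. (1990) household example. The data are invented.
from collections import Counter

# Each tuple: (father_age, mother_age, child1_sex, child1_age,
#              child2_sex, child2_age)
households = [
    (34, 32, "M", 6, "F", 4),
    (34, 32, "M", 6, "F", 4),   # duplicate key: not unique
    (41, 39, "F", 10, "M", 8),
    (52, 50, "M", 17, "M", 15),
]
counts = Counter(households)
uniques = [key for key, n in counts.items() if n == 1]
print(len(uniques))  # 2: the last two key combinations occur once each
```

A record whose key is unique in the population can be linked with certainty by anyone who knows those key values, which is why the proportion of uniques is a basic measure of identity disclosure risk.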
A disclosure risk analysis based on the number of unique entities can be one element of the Duncan and Lambert (1986) framework.
Release of a microdata file in which key variables are generally widely available in the external world could permit linking data for all of the unique households. This would raise the disclosure risk drastically for the shielded sensitive information.5
When disclosure risk is too high, data can be masked prior to release. As noted above, different statistical masking techniques are used for public-use microdata files and tables. In the case of a public-use microdata file, statistical disclosure limitation techniques can be classified into five broad categories (Duncan and Pearson, 1991):
Collecting or releasing only a sample of the data: For example, the Bureau of the Census first released a public-use microdata file with a 1-in-1,000 sample from the 1960 Census of Population and Housing; microdata products from the 1980 census included one based on a 1 percent sample and another based on a 5 percent sample.
Including simulated data: This technique has not been implemented, but it is conceptually akin to including several identical limousines in a motorcade under threat of terrorist attack.
"Blurring" of the data by grouping or adding random error to individual values: Presenting subjects' ages in 10-year intervals is an example of grouping. For an example of addition of random error, the Census Bureau prepared a microdata file for researchers at the National Opinion Research Center from the 1980 census that contained census tract characteristics (e.g., percentage of blacks and Hispanics, unemployment rate, median house value). Because the tract characteristics had unique combinations and those characteristics could readily be learned from Census Bureau publications, the records could be linked to the actual tract of residence. That would have violated the Census Bureau policy of not identifying a geographic area with fewer than 100,000 residents. To reduce this risk, tract characteristics were masked by adding random error, or noise (see Kim, 1990).
Excluding certain attributes: An agency might provide a subject's year of birth but not the month and day, or quarterly employment data could be replaced with yearly summaries.
Swapping of data by exchanging the values of just certain variables between data subjects: For example, the value for some sensitive variable in a record could be exchanged for that in an adjacent record.
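Two of the methods in the list above, blurring by random error and swapping between adjacent records, can be sketched as follows. The noise level and the pairing scheme are illustrative parameters, not agency practice.

```python
# Sketches of two masking methods from the list above: "blurring" by
# adding random error, and swapping a sensitive value between adjacent
# records. Parameters are illustrative, not agency practice.
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def add_noise(values, sd=1_000):
    """Blur each value with zero-mean Gaussian noise."""
    return [v + random.gauss(0, sd) for v in values]

def swap_adjacent(values):
    """Exchange the sensitive value within each adjacent pair of records."""
    out = list(values)
    for i in range(0, len(out) - 1, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return out

incomes = [20_000, 35_000, 50_000, 65_000]
print(swap_adjacent(incomes))  # [35000, 20000, 65000, 50000]
```

Both methods leave aggregate statistics approximately intact (noise is zero-mean; swapping preserves the marginal distribution exactly) while breaking the link between a record's identifying variables and its sensitive value.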
For data released as tables, the blurring and swapping techniques described above have been used. Three other statistical
disclosure limitation techniques are unique to tables (see Cox, 1980, 1987; Sande, 1984):
Requiring each marginal total of the table to have a minimum count of data subjects.
Using a "concentration" rule, also known as the (N,K)-rule, under which a cell is suppressed if its N largest contributors account for more than K percent of the cell value; for example, requiring that the reported values of the two dominant businesses in a cell comprise no more than a certain percentage of the cell total.
Using controlled rounding of table entries to perturb entries while maintaining various marginal totals.
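The first two table rules above can be expressed as a simple suppression test. The parameter choices N = 2 and K = 70 below are illustrative; as noted later in the chapter, agencies often do not reveal their actual values.

```python
# Sketch of a minimum cell count plus an (N, K) concentration rule.
# The parameters (min_count=3, N=2, K=70) are illustrative choices.

def cell_is_safe(contributions, min_count=3, n=2, k=70.0):
    """Publish a cell only if it has at least min_count contributors
    and its n largest contributors hold no more than k percent of
    the cell total; otherwise the cell must be suppressed."""
    if len(contributions) < min_count:
        return False
    total = sum(contributions)
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / total <= k

print(cell_is_safe([5, 6]))            # False: fewer than 3 contributors
print(cell_is_safe([90, 40, 5, 5]))    # False: top two hold ~93 percent
print(cell_is_safe([40, 35, 30, 25]))  # True: no small or dominated cell
```

In practice a primary suppression of this kind must be followed by complementary suppressions of other cells, since otherwise the suppressed value could be recovered from the table's marginal totals.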
The more sophisticated the masking technique, the less accessible the data and their analysis will be to many social scientists and policymakers. The statistical disclosure limitation approach that creates the fewest problems for analysts is masking through sampling because (at least with simple random sampling) standard statistical inference tools readily apply. However, for public-use microdata files, other statistical disclosure limitation techniques, such as the addition of noise, present measurement error or errors-in-variables problems for the user of the masked data (see, e.g., Sullivan and Fuller, 1989).6 This can introduce bias in the inferences that are drawn.7 In general, the choice of an appropriate statistical disclosure limitation method depends on the statistical procedure to be used to analyze the data.
SELECTED STATISTICAL DISCLOSURE LIMITATION PRACTICES OF FEDERAL STATISTICAL AGENCIES
Some federal statistical agencies—notably the Census Bureau—have devoted considerable attention to the development and implementation of statistical disclosure limitation techniques.8 In an enduring contribution, the Federal Committee on Statistical Methodology (1978) issued Statistical Policy Working Paper 2, Report on Statistical Disclosure and Disclosure-Avoidance Techniques. The report summarized the practices of seven federal agencies,9 presented recommendations regarding statistical disclosure limitation techniques, and discussed the effects of disclosure on data subjects and users and the need for research and development in this area. Because of their continued relevance, the panel endorses certain key recommendations from Statistical Policy Working Paper 2 in presenting its own recommendations below. Attention to this area is perhaps even more justifiable today because most
statistical disclosure limitation methods that are employed in practice lack theoretical roots (see Duncan and Lambert, 1986; Greenberg and Zayatz, 1992). Their ad hoc nature leads to criticism by potential users that they are excessively conservative. For decades, for example, the Census Bureau used an area size cutoff of 250,000 residents to limit the disclosure risk based on geographic specification, a cutoff that was lowered to 100,000 residents for most surveys and censuses after systematic studies were conducted (see Greenberg and Voshell, 1990a,b).
Procedures Used for Tables
Several statistical agencies have applied disclosure limitation to tabular data. In practice, most agency guidelines provide for suppressing the values in the cells of a table based on minimum cell sizes and an (N, K) concentration rule. Minimum cell sizes of three are almost invariably used, because in a cell of size two, each member could subtract its own value from the cell total to derive the other member's value. A typical concentration rule specifies that no more than K = 70 percent of a cell value is attributable to N = 1 entity. Often, the particular choice of N and K is not revealed to data users. Cell suppression in tables always results in a loss to the data user of some detailed information. Because cell sizes are typically small in geographic areas with small populations, geographic detail is often lost (see below).
Procedures for Microdata Files
Only about half of the federal statistical agencies that replied to the panel's request for information included materials that documented their statistical disclosure limitation techniques for microdata. Some that did merely indicated that the statistical disclosure limitation techniques for surveys they sponsored were set by the Census Bureau's Microdata Review Panel because the surveys had been conducted for them by the Census Bureau.
Major releasers of public-use microdata (that is, the Census Bureau, the National Center for Health Statistics (NCHS), and more recently the National Center for Education Statistics, or NCES) have all established formal procedures for review and approval of new microdata sets. In general those procedures do not rely on parameter-driven rules like those used for tabulations. Instead, they require judgments by reviewers that take into account such factors as the availability of external files with comparable
data, the resources that might be needed by a "snooper" to identify individual units, the sensitivity of individual data items, the expected number of unique records in the file, the proportion of the study population included in the sample, and the expected amount of error in the data.
Since locating a data subject geographically increases disclosure risk, a common disclosure limitation method is to coarsen geographic detail. The Census Bureau and NCHS specify that no geographic codes for areas with a population of less than 100,000 can be included in public-use microdata files. If a file contains a large number of variables, a higher cutoff may be used. The inclusion of local-area characteristics, such as the mean income, population density, and percentage of minority population of a census tract, is also limited by this requirement because if enough variables of this type are included, the local area can be uniquely identified. For example, in the Energy Information Administration's Residential Energy Consumption Surveys, local weather information has had to be masked to prevent disclosure of the geographic location of surveyed households. Lack of geographical coding may have limited effect on certain analyses of national programs, like Food Stamps. Yet expunging the state of residence of a sampled household, as is done in the public-use microdata files of the Survey of Income and Program Participation (SIPP), precludes many useful analyses of programs (e.g., Aid to Families with Dependent Children) for which eligibility criteria vary by state.10
THE IMPACT OF IMPROVED COMPUTER AND COMMUNICATIONS TECHNOLOGY
The President's Commission on Federal Statistics reported in 1971 that
the development of new methods of storing and recalling information by computer has generated considerable confusion in the public mind about the government's need for personal information about individuals, and apprehension about the use of such information (pp. 198–199).
This comment remains emphatically legitimate today, over 20 years after the publication of the report of the presidential commission. The public fears that the government's amassing of large data bases will reach into the private lives of citizens (see, e.g., Burnham, 1983; Flaherty, 1989).
The computer revolution that caused concern in 1971 is still
ongoing. In the 1960s and 1970s mainframe computers were available to comparatively few individuals. Beginning with the mass introduction of personal computers by International Business Machines (IBM) in 1981, significant computing power became available on the desks of most researchers. Because of improvements in computer and communications technology, the prevailing mode for data storage today is a computer data base, and a frequent mode of access is through remote telecommunications. The latter permits large data bases to be developed and maintained according to strict quality control standards while allowing easy access by a widely dispersed group of data users. In the commercial field the growth of airline reservation systems is a prototypical example of this phenomenon. In the area of research and policy analysis, this technology allows rich data bases to be assembled, provides access without a special trip to the site where the data are stored, and in some cases enables researchers to use their own software in analyzing the data. An often-cited example of such an arrangement is the Luxembourg Income Study (Rainwater and Smeeding, 1988), which maintains microdata sets containing measures of economic well-being for many developed countries. The key consideration for custodians of remotely accessed data bases is to provide the benefits of this easy access while ensuring confidentiality. Only statistical aggregates, such as tabulations, should be obtainable. The ability to download individual records should be precluded. Further, and this is more difficult, the data user should not be able to infer the information contained in individual records from permitted queries about aggregates.
Data base security has both administrative and technical aspects. Administratively and at a systemwide level, the institution maintaining the data base must create an environment in which passwords are not shared and the use that individuals can make of secure data is subject to periodic audit and review. Technically and at the data base level, reliance on access control procedures, such as the use of passwords, is not fully adequate. All modern data bases are designed to provide security against direct query of certain attributes. In this way, any user can be allowed to access only a restricted part of a data base. Such a multilevel data base basically stores data according to different security classifications and allows users access to data only if their security level is greater than or equal to the security classification of the data. Unfortunately, this simple device does not preclude the existence of what is called an inference channel (see Denning and
Lunt, 1988). An inference channel is said to exist in a multilevel data base when a user can infer information classified at a high level (to which the user does not have access) based on information classified at a lower level to which the user does have access. Such a channel may be hard to detect because it may involve a lengthy chain of inference and combine information that is explicitly stored in the data base and other, external information.
Further, a data user may have legitimate access to statistical aggregates, such as averages and regression coefficients for such sensitive attributes as medical or salary information, while not being permitted access to information that is identifiable to a specified individual. In such situations, inferential disclosure control is usually implemented through query restriction (Fellegi, 1972) or response modification (Denning, 1980). In query restriction, certain queries, such as those pertaining to a few entities in the data base, are not answered. In response modification, the answers to certain queries are modified, for example, by adding random error or rounding. In either instance, a problem stems from the fact that the data base can be repeatedly queried. Each individual query may be innocuous, but the sequential queries may be sufficient to compromise the data base. This particular problem is an area of ongoing research (see Adam and Wortman, 1989; Duncan and Mukherjee, 1991, 1992; Keller-McNulty and Unger, 1993; Matloff, 1986; Tendick, 1991; and Turn, 1990). To date, research suggests that (1) user demand for data access through remote querying of relational data bases is inevitable and (2) modern data bases give rise to special problems in protecting confidentiality that require new disclosure limitation techniques.
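The compromise-by-repeated-queries problem described above can be made concrete with a small example. The salary data and the minimum query set size are invented; the attack shown is the classic "difference" (tracker) attack studied in the inference control literature.

```python
# Sketch of why individually innocuous aggregate queries can combine
# into a disclosure: two permitted totals differ in exactly one
# record, so subtraction recovers that record's value. Data invented.
salaries = {"A": 50_000, "B": 60_000, "C": 70_000, "D": 120_000}

def query_total(names, min_size=3):
    """Query restriction: refuse totals over fewer than min_size records."""
    if len(names) < min_size:
        raise ValueError("query refused: too few records")
    return sum(salaries[n] for n in names)

big = query_total(["A", "B", "C", "D"])   # permitted: 4 records
small = query_total(["A", "B", "C"])      # permitted: 3 records
print(big - small)                        # 120000 -- D's exact salary
```

Each query satisfies the restriction rule, yet their difference isolates a single record; defeating such attacks requires auditing overlapping query sets or perturbing responses, which motivates the response-modification techniques cited above.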
RECENT RESEARCH ON DISCLOSURE LIMITATION
Until recently agencies had little theoretical justification for the statistical disclosure limitation tools they employed (see Cox, 1986; Duncan and Pearson, 1991). The tools of this field are not only statistical but also come from mathematics, computer science, numerical analysis, linear and integer programming, and operations research (see Brackstone, 1990; Greenberg, 1990, 1991). Much of the research on disclosure limitation has taken place in federal agencies, especially the Census Bureau (Bailar, 1990). Recently, statistical disclosure limitation has begun to attract the attention of university researchers as a field of inquiry. Below we describe a few recent studies of statistical disclosure limitation techniques.
Duncan and Pearson (1991) describe some general techniques for limiting disclosures from microdata through a process of matrix masking. A statistical agency would disseminate a transformed file rather than the original file. By a process of matrix multiplication, the records can be transformed, the attributes can be transformed, or both. By a process of matrix addition, the individual data values can be displaced. Such matrix masking makes use of well-known disclosure limitation methods. Among record-transforming masking techniques are aggregation across records, suppression of certain records, release of statistics for ordinary least squares regression, release of a sample of records, and multiplication of records by random noise. Among displacing masking techniques are addition of random noise and addition of deterministic noise. In his discussion of the Duncan and Pearson paper, Cox (1991) suggests how certain generalizations of matrix masks can also make use of other disclosure limitation methods, such as random rounding, grouping, and truncation.
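A small numerical sketch may help fix ideas: releasing a mask of the form AXB + C, where X is the data matrix, A transforms records, B transforms attributes, and C displaces values. Here A averages adjacent pairs of records, B is the identity, and C is small random noise; the data and parameters are invented.

```python
# Sketch of matrix masking in the Duncan-Pearson sense: release
# A X B + C instead of the data matrix X. Here A aggregates (averages)
# adjacent pairs of records, B is the identity, and C adds small
# displacement noise. Data and parameters are invented.
import random

random.seed(1)
X = [[30, 20_000], [40, 30_000], [50, 40_000], [60, 50_000]]

masked = []
for i in range(0, len(X), 2):
    # Record-transforming step (the A matrix): average two records.
    row = [(X[i][j] + X[i + 1][j]) / 2 for j in range(len(X[0]))]
    # Displacing step (the C matrix): perturb each entry slightly.
    masked.append([v + random.gauss(0, 1) for v in row])
print(len(masked))  # 2 masked records, neither matching any one subject
```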
Papers Commissioned by the Panel
To aid its deliberations, the panel commissioned a series of papers dealing with a range of issues bearing on confidentiality and data access (see Appendix A for a list of the papers). Below we briefly identify the contributions of the commissioned papers that dealt most directly with statistical disclosure limitation.
Fuller (1993), in his paper Use of Masking Procedures for Disclosure Limitation, examines the confidentiality protection provided by adding random error to data elements. Noting the importance of researchers' being able to analyze the resulting masked data, he examines the costs of data masking as a function of type of masking, degree of masking, and type of analysis. For a range of situations, Fuller provides appropriate statistical estimation procedures based on measurement error methodology.
Jabine (1993b), in his paper Statistical Disclosure Limitation: Current Federal Practices and Research, summarizes the statistical disclosure limitation practices of 18 agencies based primarily on their responses to a 1990 request from the panel for information about agency confidentiality and data access policies and practices. His paper also provides an update on the statistical procedures used by the agencies that were included in the 1978 Statistical Policy Working Paper 2.
Lambert (1993), in her paper Measures of Disclosure Risk and Harm, develops a decision-theoretic model for the actions of a data snooper in identifying a target record. She also examines a variety of measures of disclosure harm and relates her discussion to a larger literature.
Little (1993), in his paper Statistical Analysis of Masked Data, discusses a broad range of masking methods, including randomized response, release of subsamples of records, suppression of cells in a cross-tabulation, deletion of sensitive values, deletion followed by imputation of values, addition of random noise, rounding, grouping or truncation, transformation, slicing files into subsets of variables, slicing and recombination to form synthetic records, reduction to aggregate sufficient statistics, and microaggregation. He develops a likelihood theory for masked data files that combines elements of Rubin's (1976) theories for treatment assignment and missing data and Heitjan and Rubin's (1991) theory for coarsened data.
Keller-McNulty and Unger (1993), in their paper Database Systems: Inferential Security, synthesize research that has been conducted by computer scientists and statisticians on problems of data security and confidentiality. Computer scientists have focused on data release through sequential queries to a data base; statisticians have focused on the aggregate release of data.
In his discussion of the commissioned papers at the panel's March 1991 Conference on Disclosure Limitation Approaches and Data Access, Rubin broached the idea of applying multiple imputation methodology (see Rubin, 1987; Little and Rubin, 1987) to create artificial (synthetic) microdata files.11 In this mode, federal statistical agencies would not release any actual data from subjects. The clear appeal of this notion is that there is no possibility of identity disclosure in disseminating synthetic data. However, the notion gives rise to important questions: To what extent can synthetic data sets created by using techniques such as multiple imputation and data swapping satisfy user needs? What would be the legal and research standing of inferences drawn from them?
Shielding Organizational Data
For the most part, statistical disclosure limitation techniques have been developed for data on persons and households. The task of protecting the confidentiality of organizational data—for example, economic data on establishments—is more difficult because
organizational data are highly skewed on key dimensions. Federal statistical agencies are thus reluctant to release public-use microdata files on organizations. Some research in this area has begun. Wolf (1988), for example, examines a technique of microaggregation that creates pseudo-records by combining information from similar records. He applies it to the Census Bureau's Longitudinal Research Development file (an earlier version of the Longitudinal Research Database), which consists of a longitudinal file of manufacturing establishment microdata records collected through the Annual Survey of Manufactures and the Census of Manufactures for the years 1972–1981. He examines the utility of the masked data by comparing characteristics of the distribution of the masked data with the corresponding characteristics of the original data.12 Other work in this area includes Govoni and Waite (1985), McGuckin and Nguyen (1990), and Spruill (1983).
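The basic mechanics of microaggregation can be sketched briefly: sort the records on a variable, group them in small fixed-size clusters, and replace each value with its group mean. The establishment data and group size of three are invented; Wolf's actual procedure is more elaborate.

```python
# Sketch of univariate microaggregation: similar records are grouped
# and each value is replaced by its group mean, producing
# pseudo-records. Data and group size are invented for illustration.

def microaggregate(values, group_size=3):
    ordered = sorted(values)
    out = []
    for i in range(0, len(ordered), group_size):
        group = ordered[i:i + group_size]
        mean = sum(group) / len(group)
        # Every member of the group reports the same averaged value.
        out.extend([mean] * len(group))
    return out

shipments = [12, 95, 7, 20, 310, 88]
print(microaggregate(shipments))
# Small establishments collapse to 13.0; large ones to about 164.3,
# so no released value belongs to any single establishment.
```

Because the dominant establishment's value is averaged away, microaggregation addresses precisely the skewness that makes organizational microdata so hard to release.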
FINDINGS AND RECOMMENDATIONS
Many of the statistical agencies included in the panel's review have standards, guidelines, or formal review mechanisms that are designed to ensure that (1) adequate analyses of disclosure risk are performed and (2) appropriate statistical disclosure limitation techniques are applied prior to release of tables and public-use microdata files. The standards and guidelines, however, exhibit a wide range of specificity: Some contain only one or two simple rules; others are much more detailed. Examples of more detailed formal documentation or procedures include those of the Census Bureau (for microdata), the Energy Information Administration (for tabulations only), the National Center for Health Statistics, and the 1977 Social Security Administration guidelines, which are still in effect. Other statistical agencies have far less formal procedures. This variation across agencies in the comprehensiveness of disclosure review has little justification in terms of agency mission. Further, unfulfilled opportunities exist for agencies to work together and learn from one another, perhaps pooling resources to investigate the strengths and weaknesses of various statistical disclosure limitation techniques.
Based on its findings, the panel endorses the following recommendations from Statistical Policy Working Paper 2 (Federal Committee on Statistical Methodology, 1978) and then makes Recommendation 6.1:
Recommendations from Statistical Policy Working Paper 2
All federal agencies releasing statistical information, whether in tabular or microdata form, should formulate and apply policies and procedures designed to avoid unacceptable disclosures (pp. 41–42:part of Recommendation B1).
To insure compliance with its disclosure-avoidance policies and procedures, each agency that releases statistical information should establish appropriate internal clearance procedures. There should be a clear assignment of individual responsibilities for compliance (p. 43:Recommendation B6).
The … [Statistical Policy Office, OMB] should encourage agencies that release tabulations and microdata to develop appropriate policies and guidelines for avoiding disclosure, and to review these policies periodically. To the extent feasible, … [this office] should help agencies to obtain technical assistance in the development of disclosure-avoidance techniques (p. 43:Recommendation B8).
The … [Statistical Policy Office, OMB] should conduct periodic training seminars for Federal agency personnel who are responsible for developing and applying statistical disclosure-avoidance procedures (p. 43:Recommendation C2).
Since the panel convened, the federal statistical agencies have initiated some activities consonant with the above recommendations. In particular, in 1991 OMB's Statistical Policy Office took the lead in organizing an interagency committee to coordinate research on statistical disclosure analysis. This Subcommittee on Disclosure Limitation Methodology was begun as an initiative of the directors of the major statistical agencies.
Recommendation 6.1 The Office of Management and Budget's Statistical Policy Office should continue to coordinate research work on statistical disclosure analysis and should disseminate the results of this work broadly among statistical agencies. Major statistical agencies should actively encourage and participate in scholarly statistical research in this area. Other agencies should keep abreast of current developments in the application of statistical disclosure limitation techniques.
As discussed above, statistical disclosure limitation methods can hide or distort relations among study variables and result in
analyses that are incomplete or misleading. Because of this possibility, policy researchers have expressed serious reservations about the use of statistical disclosure limitation techniques (e.g., Smith, 1991). Further, data masked by some disclosure limitation methods can only be analyzed accurately by researchers who are highly sophisticated methodologically. Based on these findings, the panel makes the following recommendation:
Recommendation 6.2 Statistical agencies should determine the impact on statistical analyses of the techniques they use to mask data. They should be sure that the masked data can be accurately analyzed by a range of typical researchers. If the data cannot be accurately analyzed using standard statistical software, the agency should make appropriate consulting and software available.
The panel believes that no one procedure can be developed for all statistical agencies. Further, confidentiality laws governing particular agencies differ, as do the types of data collected and the needs of data users. In light of these findings, the panel endorses the following recommendations contained in Statistical Policy Working Paper 2:
Because there are wide variations in the content and format of information released, the Subcommittee does not feel that it is feasible to develop a uniform set of rules, applicable to all agencies, for distinguishing acceptable from unacceptable disclosures. In formulating disclosure-avoidance policies, agencies should give particular attention to the sensitivity of different data items. Financial data, such as salaries and wages, benefits, and assets, and data on illegal activities and on activities generally considered to be socially sensitive or undesirable require disclosure-avoidance policies that make the risk of statistical disclosure negligible….
Agencies should avoid framing regulations and policies which define unacceptable statistical disclosure in unnecessarily broad or absolute terms. Agencies should apply a test of reasonableness, i.e., releases should be made in such a way that it is reasonably certain that no information about a specific individual will be disclosed in a manner that can harm that individual (p. 42:part of Recommendation B1).
Special care should be taken to protect individual data when releases are based on complete (as opposed to sample) files and when data are presented for small areas (p. 42:Recommendation B2).
Given the potential difficulties that certain statistical disclosure limitation techniques can cause for analysts, it is important that federal statistical agencies involve data users in selecting such procedures. As Greenberg (1991:375) notes, "survey sponsors and data users must contribute to the decision making process in identifying areas in which some completeness and/or accuracy can be sacrificed while attempting to maintain as much data quality as possible." In the past, agency staffs have been essentially the sole determiners of which statistical disclosure limitation techniques are to be employed prior to releasing tables and microdata files.
Recommendation 6.3 Each statistical agency should actively involve data users from outside the agency as statistical disclosure limitation techniques are developed and applied to data.
Finally, over the past 30 years various agencies have released public-use microdata files successfully. Marsh et al. (1991a) make a compelling case for the release of public-use microdata files. (Note also Collins, 1992, for a cautionary viewpoint and Marsh et al., 1992, for a rejoinder.) Based on experience, such data dissemination has met a two-pronged test: (1) the microdata files have been useful to researchers and policy analysts and (2) confidentiality has been protected. Based on this finding, the panel makes the following recommendation:
Recommendation 6.4 Statistical agencies should continue widespread release, with minimal restrictions on use, of microdata sets with no less detail than currently provided.
We note that expansion of the number and richness of public-use microdata files to be disseminated would be better justified if all users were subject to and made aware of sanctions for disclosure of information about individually identifiable data providers (see Recommendation 5.3).
RESTRICTED ACCESS: ADMINISTRATIVE PROCEDURES TO PROTECT CONFIDENTIALITY
Procedures for providing restricted access to data typically establish eligibility requirements for access and impose a variety of
conditions covering the purposes for which the data can be used, which organizations and individuals can have access, the location of access, physical security measures, and the retention and disposition of initial and secondary data files. Written agreements are usually required, and criminal or contractual penalties are often attached to noncompliance with the conditions of access.
Arrangements for providing restricted access to federal data for statistical purposes are not uncommon. In a paper prepared for the panel, Jabine (1993a) provides 19 examples, most of them current; the list is intended to illustrate the various types of access that are allowed, not to be exhaustive. He also gives six examples of instances in which access could not be obtained to data that would have had important statistical uses.
The characteristics of interagency data sharing agreements tend to be different from those associated with arrangements for restricted access by external users. The former are more likely to involve sharing of individual records with explicit identifiers, for purposes such as developing sampling frames and enhancing data bases. The latter typically are designed to permit access to potentially identifiable records for statistical analysis. These two types of access are discussed in the next two subsections.
INTERAGENCY DATA SHARING
We begin this subsection by presenting examples of agreements that have been developed to permit interagency sharing of identifiable, or potentially identifiable, personal records for statistical purposes. Some of the examples involve transfers of administrative records; others involve transfers of data collected in statistical surveys. Most of the examples are taken from Jabine (1993a). Following the examples, we discuss some of the general issues they illustrate.
This subsection does not include any examples of unsuccessful attempts to develop interagency data sharing arrangements for statistical purposes. The absence of such examples does not mean that all or even most proposals for interagency sharing are successful. For some kinds of data, statutory restrictions make sharing impossible; in other instances, agencies' policies and their interpretations of confidentiality statutes have thwarted requests by other agencies for access to their data. The very limited ability of federal agencies to share business lists for statistical purposes (see Chapter 8) illustrates the kinds of barriers that exist.
Examples of Interagency Data Sharing
Bureau of Labor Statistics Access to Nonpublic Microdata from the Current Population Survey The main sponsor and funding agency for the monthly Current Population Survey (CPS) is the Bureau of Labor Statistics (BLS). The survey data are collected by the Census Bureau. Because the sample of households is based in part on address listings from the decennial censuses, all CPS data are subject to the confidentiality provisions of Title 13 of the U.S. Code, which means that only Census Bureau employees (including special sworn employees) can have access to individually identifiable information.
Such limitations on access have long been a source of discontent on the part of agencies sponsoring household surveys conducted under the provisions of Title 13. Sponsors believe they are unduly restricted in their ability to perform detailed analyses of the survey data and to use the survey materials for intermediate purposes that might involve data linkages or follow-up contacts with survey respondents to collect additional data. Nonetheless, BLS has opted to continue to take advantage of the sampling efficiency that results from the use of decennial census address lists for its CPS sample. In February 1990, the Census Bureau and BLS executed a formal five-year "Memorandum of Understanding on the Use of Nonpublic Current Population Survey Microdata by the Bureau of Labor Statistics."13 Under the agreement, BLS users of nonpublic CPS microdata (which consist of microdata with geographic area identifiers and no topcoding of income items, but without explicit identifiers) must be special sworn employees of the Census Bureau. The data may be used for "longitudinal matching of Current Population Survey records, general statistical research [methodological research related to the survey design and operations], and improvement and expansion of general tabulations." No provisions are made, however, for linking CPS records with records from other sources or for follow-up contacts of any kind with CPS respondents. The agreement and attached statement of policies provide for strict security measures by BLS staff, periodic on-site inspections by the Census Bureau, and regular reviews of the benefits of the sharing arrangement.
Linkage of Records from the Departments of Defense and Health and Human Services This example also covers restricted access to data for statistical purposes. However, it is somewhat unusual
in that the purpose of the record linkage was to provide statistical tabulations needed to determine the feasibility of a subsequent linkage for nonstatistical, compliance purposes. The agencies that participated were the Office of the Inspector General (OIG) in the Department of Health and Human Services and the Defense Manpower Data Center (DMDC) in the Department of Defense.
The purpose of the match was to determine how many military and civilian employees of the Defense Department might be in arrears on court-ordered child support payments. Based on the findings, OIG would determine whether to proceed with a full-scale records match for compliance purposes. The files to be linked were 1987 and 1988 Tax Intercept Files in the custody of the Family Support Administration of the Department of Health and Human Services and DMDC's personnel files for military and civilian employees of the Defense Department. To accomplish the linkage, OIG sent a tape to DMDC containing the Social Security numbers and names of persons potentially delinquent on child support payments. The latter agency matched this tape against its personnel files and provided OIG with a tabulation of the number of matches by category of Defense Department employment.
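The core of such a statistical match is simple: records are linked on Social Security number, and only an aggregate tabulation, rather than any linked record, is returned to the requesting agency. The sketch below illustrates this with fabricated SSNs and a hypothetical record layout; it is not DMDC's actual matching software:

```python
from collections import Counter

def match_and_tabulate(arrears_ssns, personnel):
    """Match a list of SSNs against personnel records and return only
    a count of matches per employment category -- the statistical
    product DMDC provided to OIG, with no individual records."""
    arrears = set(arrears_ssns)
    return Counter(rec["category"] for rec in personnel
                   if rec["ssn"] in arrears)

# Fabricated placeholder records.
personnel_file = [
    {"ssn": "000-00-0001", "category": "military"},
    {"ssn": "000-00-0002", "category": "civilian"},
    {"ssn": "000-00-0003", "category": "military"},
]
tally = match_and_tabulate(["000-00-0001", "000-00-0003"], personnel_file)
# tally == Counter({"military": 2})
```

Because only the counts leave the matching agency, the operation is statistical; it is the later full-scale match for compliance purposes that returns identified individuals.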
The arrangements were formalized by an exchange of letters between officials of the two agencies.14 The main conditions agreed to were that DMDC would not use the OIG records for any purpose other than the specified match and that it would return the OIG data file when the match was completed. One of the letters noted that the match would not be subject to federal matching standards established by the Computer Matching and Privacy Protection Act of 1988 (P.L. 100–503) because that act does not cover matches performed solely for statistical purposes.
The number of matches found in the OIG-DMDC matching study was sufficient for the agencies to reach a decision to proceed with an ongoing matching operation for compliance purposes. No Defense Department employees whose child support payments were in arrears were immediately affected because their records had been matched in the statistical study. However, many of them may have been subject to disciplinary action as a result of the subsequent matching operations for compliance purposes.
The 1973 Exact Match Study Perhaps the most ambitious and complex statistical record linkage study yet undertaken, involving both survey and administrative records, took place in the 1970s. The 1973 Exact Match Study was a
joint undertaking of the Social Security Administration (SSA) and the Census Bureau. As part of the study, the Internal Revenue Service (IRS) furnished selected tax information from 1972 individual income tax returns to the Census Bureau for matching to CPS records (see Kilss and Scheuren, 1978). The primary goal of the study was to provide a broad data base for addressing such policy issues as the redistributive effects of changes in income and payroll taxes and alternative Social Security benefit structures. Specific uses included evaluating the quality of CPS income reporting, conducting labor force participation and earnings analyses using CPS income and demographic data and SSA earnings histories, and developing lifetime earnings models.
The study linked survey records for persons in the March 1973 Current Population Survey, including the income supplement, with their earnings and benefit information for several years from SSA data files and IRS information from their 1972 tax returns. There had been earlier administrative data linkages to the CPS data, starting with the March 1964 round, but the scope of the 1973 study exceeded that of any previous one. The study was designed to take advantage of the possibilities that had been opened up by advances in computer matching techniques and by the achievement of close-to-universal coverage by the IRS and SSA record systems (Kilss and Scheuren, 1984).
The data products from the 1973 Exact Match Study included several public-use (unrestricted access) microdata files containing linked survey and administrative record data. The files were widely used by researchers and provided the basis for numerous published analytic and methodologic reports and papers (Kilss and Scheuren, 1978, list 41 reports and papers).
Although the end products of the study were available to all users, the intermediate stages in their development required restricted access arrangements among the three agencies involved. Kilss and Scheuren (1984) summarize the manner in which the confidentiality requirements of the three agencies were met. None of the linkages was performed at the IRS. For linkages performed at SSA (to extract earnings and benefit data for sample persons), limited extracts of CPS records were used and those records were processed only by SSA employees who had been given appointments as special sworn employees of the Census Bureau.
The primary personal identifier used to link records from the different sources was the Social Security number, which respondents to the CPS had been asked to provide—with assurance that their numbers would be used only for statistical purposes. The
SSA did not provide administrative data to the Census Bureau for the small number of CPS respondents who declined to provide their Social Security number.
One concern of those responsible for the study was the possibility that an SSA or IRS employee could, in theory, use the linked data to identify one or more individuals in the public-use file and then use the survey information in those individuals' records for administrative purposes. To minimize this possibility (often called the reidentification problem), income items were topcoded (that is, values above specified levels were replaced by codes indicating that they exceeded those levels) and the release of the most detailed public-use file, which included the tax return data, was delayed until the end of the period for which the IRS retained the same tax return information for all persons in its electronically accessible administrative files.
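Topcoding, as the parenthetical above defines it, amounts to a simple transformation of each released value. The ceiling and income figures below are invented for illustration, not the thresholds actually used in the study:

```python
def topcode(value, ceiling, code="{}+ (topcoded)"):
    """Release a value as-is if it is at or below the ceiling;
    otherwise replace it with a code indicating only that the
    ceiling was exceeded."""
    return value if value <= ceiling else code.format(ceiling)

incomes = [28_000, 45_000, 310_000]  # fabricated
released = [topcode(v, 99_999) for v in incomes]
# released == [28000, 45000, "99999+ (topcoded)"]
```

High values are the ones most likely to be unique in administrative files, so collapsing them into a single category removes the most re-identifiable records at the cost of the upper tail of the distribution.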
Could the same agencies collaborate on a study of this scope and utility for research on income distribution today? Probably not. Although the period for which the IRS maintains computerized files of individual income tax data is still about the same, the returns are kept indefinitely on microfilm or microfiche, so that the reidentification problem would not disappear completely. Moreover, the Census Bureau, as a matter of policy, no longer releases public-use microdata files that link census or survey and administrative data for individuals. Even if these difficulties could be overcome, it would probably be difficult to marshal the resources necessary to carry out an interagency linkage study of this magnitude.
Census Bureau Use of IRS and SSA Personal Records To carry out their respective responsibilities for the collection of federal taxes and the administration of Social Security benefit programs, the IRS and SSA have developed extensive personal record systems that now cover large segments of the U.S. population. Over the years, the Census Bureau has used those records in several ways to enhance its demographic censuses and surveys and to evaluate the quality of census and survey data.
Census Bureau uses of IRS and SSA administrative data can perhaps be best illustrated by explaining how the data serve as basic inputs to the Census Bureau's program of intercensal population estimates for small areas. The Census Bureau tracks trends in internal migration between censuses by matching individual taxpayer record extracts for successive years from the IRS Individual Master File. The migration data from the matched extracts
serve as inputs to the estimates of the total population of states, counties, and other government units. The tax files do not include information on the age, sex, or race-ethnic classification of individual taxpayers, but this information can be obtained from the SSA's NUMIDENT file, which contains the basic demographic data obtained in connection with the issuance of Social Security numbers. The Census Bureau has used these and other subsidiary inputs to produce estimates of population for states and metropolitan areas by race and age.
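The migration step described above can be sketched as matching successive-year extracts on a common taxpayer identifier and tabulating changes of county. The record layout, identifiers, and county names below are hypothetical, not the actual IRS Individual Master File format:

```python
from collections import Counter

def migration_flows(year1, year2):
    """Match (identifier, county) extracts for two successive years
    and tabulate county-to-county moves among matched taxpayers."""
    county1 = {tin: county for tin, county in year1}
    flows = Counter()
    for tin, county in year2:
        if tin in county1 and county1[tin] != county:
            flows[(county1[tin], county)] += 1
    return flows

# Fabricated two-year extracts: taxpayer A2 moves from Cook to Lake.
y1 = [("A1", "Cook"), ("A2", "Cook"), ("A3", "Lake")]
y2 = [("A1", "Cook"), ("A2", "Lake"), ("A3", "Lake")]
flows = migration_flows(y1, y2)
# flows == Counter({("Cook", "Lake"): 1})
```

Aggregated flows of this kind, not the matched individual records, are what feed the intercensal population estimates.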
As explained in Chapter 5, the Internal Revenue Code allows certain statistical uses of tax return information by the Census Bureau. Specific uses are spelled out in some detail in a series of Treasury Department regulations that identify the content of the records to be transferred, the uses to which they can be put, and the security measures that must be adopted to protect the confidentiality of information turned over to the Census Bureau. The information from the SSA's NUMIDENT file is transferred to the Census Bureau under a general agreement between the two agencies that provides for limited sharing of information in both directions for joint statistical projects (see Jabine, 1993a:example 6).
The future of the Census Bureau's program to develop population estimates for states and counties by age, race-ethnicity, and sex has been clouded by a recent development. Under arrangements negotiated between SSA and state registrars of vital statistics, birth certificates and Social Security numbers are now being issued jointly at the time of birth in most states. For the births covered by these arrangements, SSA no longer obtains data on race-ethnic classification. Unless SSA finds another way to get such information, which it has used only for statistical analyses of its programs, the gradual increase in the proportion of persons for whom it lacks race-ethnic information will degrade the Census Bureau's ability to include race-ethnic status in its small-area population estimates program.
An obvious requirement for interagency data sharing is that the statutory confidentiality requirements of all of the agencies involved must be observed. A second requirement is that the transfer of data among agencies must be consistent with statements made to data providers when the data were obtained from them. When these requirements are not the same for all of the agencies involved, those that are most stringent govern the arrangements.
In the case of the Census Bureau, the law says that only sworn Census Bureau employees, who are subject to criminal penalties for improper disclosure, may have access to the data. Thus, in any sharing arrangement involving Census Bureau data, employees of other agencies can have access to the data only after taking an oath as special sworn employees of the Census Bureau. The use of the special sworn employee provision is further circumscribed by the requirement that it only be used in connection with activities that are part of the bureau's statutorily defined mission.
Statistical uses of IRS administrative records for purposes not related to tax administration are limited to the agencies and purposes specified in the Internal Revenue Code. The Tax Reform Act of 1976 (P.L. 94–455) restricted such uses to other units of the Treasury Department, the Census Bureau, and the Bureau of Economic Analysis, in each case for particular purposes. A subsequent amendment (P.L. 95–210) added the National Institute for Occupational Safety and Health to the list of eligible recipients (P.L. 95–210, § 6103(j) and 6103(m)(3)). The IRS has engaged in some sharing of records with agencies other than those mentioned specifically in the code, but only for studies it considers to be related to tax administration.
In general, data sharing among federal agencies other than the Census Bureau and IRS for statistical purposes is less constrained by legislation. However, statistical agencies that receive some of their data from the states under federal-state cooperative programs (e.g., BLS and NCHS) must comply with confidentiality provisions of state law, which vary among states. In some instances, potential users of identifiable data obtained by a federal agency under such cooperative arrangements have had to apply separately to each state for permission to have access to its data. A federal-state cooperative program is also central to the data collection activities of the National Agricultural Statistics Service (NASS). However, NASS uses of the data are controlled by federal rather than state law because the data collection, while jointly financed, is federally controlled (specified statistical products are provided to the states).
Developing arrangements for interagency data sharing can be a complex and time-consuming process, especially if more than two agencies are involved or if novel applications of the data are planned. New initiatives are likely to pose new legal, ethical, administrative, and policy questions. The expected benefits in terms of cost savings or better quality data must be substantial to
justify the level of effort and perseverance needed to find acceptable answers. It helps if the proposed data sharing arrangements offer benefits to all of the parties concerned.
EXTERNAL DATA USERS
The availability of high-speed computers and sophisticated analytic techniques and software has generated vastly increased appetites for federal statistical data. The statistical agencies have tried to satisfy the demand by issuing more detailed tabulations and public-use microdata sets, but not surprisingly, they have succeeded only in whetting user appetites further. Continued efforts to meet increasing demands from users are important; many users have the capacity to conduct sophisticated research on important matters of public interest.
For many of the data sets that users want, the risk of disclosure is great enough that some form of restricted access is the only option for release. Several modes of restricted access for external users have been developed by statistical agencies. Some of the important features of these access modes are eligibility criteria, location of access, cost and convenience for agencies and users, and methods of protecting confidentiality. As in the previous subsection, we present examples of different modes of restricted access by external users. We also present two examples of failures to gain access to data sets. We follow the examples with a discussion of their key features and the comparative advantages of different modes of access.
Examples of Restricted Access by External Data Users
ASA/NSF Fellows Since 1978, the National Science Foundation (NSF) has funded and the American Statistical Association (ASA) has administered a program designed to promote the exchange of ideas and techniques between federal government statisticians and academic users of federal statistical data. Five agencies, the Bureau of Labor Statistics, the Census Bureau, the National Center for Education Statistics, the National Institute of Standards and Technology, and most recently, the National Science Foundation (as a host agency), have participated in the program, which enables senior research fellows and associates from universities to undertake research studies at one of the host agencies for a period of up to one year. The National Agricultural Statistics Service
has a similar fellows program, administered by ASA, but paid for entirely with agency funds rather than jointly by the National Science Foundation and the agency.
The fellows work on research topics of joint interest to themselves and their host agencies. They have essentially the same access to agency data bases and computer facilities as regular employees doing similar work, and they are subject to the same confidentiality requirements and penalties for improper disclosure of individually identifiable information. At the Census Bureau, for example, the ASA/NSF/Census fellows receive appointments, take oaths as special sworn employees, and are subject to the same Title 13 confidentiality provisions, including penalties for violations, as regular employees.
The ASA/NSF fellows program has provided user access to data not available in unrestricted public-use files for a substantial number of academic researchers. From the start of the program through the end of 1991, 72 fellows and 54 associates had worked at the first four agencies listed above, about three-fourths of them at the Census Bureau. The primary restrictions are that access is available only at the agency's central facility, for a limited term (although fellows sometimes revisit agencies on a less formal basis), and only for projects that the host agency deems to be of interest. A more serious problem is that NSF funding for large numbers of fellows is expected to end soon, and it is unlikely that the other agencies will be able to support similar numbers out of their own operating funds.
Remote On-line Access: The Luxembourg Income Study The Luxembourg Income Study is an international cooperative research project providing remote access to microdata sets from household surveys conducted by member countries. The project is designed "to promote research on the distribution of income … and the general economic situation of households and families in an international context" (Cigrang and Rainwater, 1990:1). The project is supported by the Government of Luxembourg and the Center for Population, Poverty, and Policy Studies, also in Luxembourg.
At present, microdata sets from 14 countries, with explicit identifiers removed, reside in the Luxembourg Income Study data base system maintained at the Government of Luxembourg's computer center. Microdata sets from the United States, Canada, and Australia had previously been issued by those countries as public-use files, but the data sets for the remaining countries had either been available with restrictions or not at all.
The data base system is designed mainly to provide remote access to registered users through an EARN/BITNET computer mail network. Use of the data is restricted to academic and policy analysis research. Remote users submit job requests in the SPSSX statistical language, specifying variable names used in the system. Dedicated batch-processing computers execute the requests and the outputs are forwarded to users over the computer mail network. To prevent disclosure of individual records, incoming job requests and statistical outputs are subjected to automated reviews and those that fail the review are given to the Luxembourg Income Study staff for further review. For all of the microdata sets, exact income amounts, some of which may have been taken from administrative sources, are rounded.
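An automated output review of the kind described above can be sketched as a screen applied to each tabulation before it is returned to the remote user. The suppression threshold and flagging rule below are illustrative assumptions, not the Luxembourg Income Study's actual rules:

```python
def review_output(cell_counts, threshold=5):
    """Screen a tabulation before release: cells based on fewer than
    `threshold` records are suppressed, and the job is flagged for
    staff review (threshold chosen for illustration only)."""
    screened, flagged = {}, False
    for cell, n in cell_counts.items():
        if n < threshold:
            screened[cell] = "suppressed"
            flagged = True
        else:
            screened[cell] = n
    return screened, flagged

out, needs_staff_review = review_output({"low income": 412, "top bracket": 3})
# out == {"low income": 412, "top bracket": "suppressed"}
```

Outputs that pass the automated screen go straight back over the network; flagged jobs go to staff for the further review the text describes.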
Once users receive the statistical outputs, they are not subject to further restrictions on use or publication, other than the general requirement that use for commercial purposes is prohibited. As of mid-1990, the data base contained 22 data sets from the 14 member countries, had about 150 registered users, and was averaging about 50 job requests a day.
Release in Encrypted CD-ROM Format The passage in 1988 of Public Law 100–297 (the Hawkins-Stafford amendments to the General Education Provisions Act) led the NCES to undertake a comprehensive review of its policies for protecting the confidentiality of individual information included in releases of tabulations and microdata sets. It became clear that for some of the microdata sets that had been released previously without restrictions the risk of disclosing individually identifiable data was too high. Criteria and procedures were established for deciding which files could continue to be released with no restrictions on access and which could not. For the latter group, other modes of release were sought. One of these modes, the encrypted CD-ROM format, is the subject of this example (Wright and Ahmed, 1990). Another, licensing agreements for use at user sites, is the subject of the next example.
Under the encrypted CD-ROM format, users purchase a diskette containing an encrypted microdata set and software that can produce descriptive statistics from the encrypted data. Prior to release, agency staff evaluate the disclosure risks associated with the outputs that can be produced with the software provided on the diskette. Restrictions are built into the software to prevent the user from printing out unencrypted individual records or statistics that would tend to disclose individual information. It is
unlikely that any users would attempt to decrypt the records because of the high cost associated with such an effort.
In general terms, this mode of release can be considered either public-use or restricted access, depending on what conditions are imposed on recipients. Significant advantages for users are the relatively low cost of access and the ability to use the data at their own work sites. However, the only analyses they can perform are those for which the necessary programs are included on the diskette. At present, only elementary analytic techniques are supported, but work is proceeding to incorporate more complex types of analyses into the system.
This system has the potential for solving, at least in part, the reidentification problem described earlier in this chapter. Microdata sets containing linked survey and administrative data could then be released with very little risk that the custodians of the administrative records could reidentify individual records.
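The encrypted-microdata model described above can be sketched as follows. Records are stored only in encrypted form, and the bundled software decrypts them solely in memory, exposing nothing but aggregate statistics. The XOR cipher below is a toy stand-in for real encryption, and the record layout is invented for illustration; neither reflects the actual NCES implementation.

```python
# Sketch of the encrypted-CD-ROM release model: the data on the diskette
# are encrypted, and the only access path the bundled software provides
# is a descriptive statistic. The XOR "cipher" and record layout are
# illustrative stand-ins, not the actual NCES scheme.

KEY = b"not-a-real-key"  # hypothetical key embedded in the software

def xor_crypt(data: bytes, key: bytes = KEY) -> bytes:
    # XOR is its own inverse, so the same routine encrypts and decrypts.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def encrypt_records(records):
    """Store each record (a comma-separated line) in encrypted form."""
    return [xor_crypt(r.encode()) for r in records]

def mean_of_field(encrypted, field_index):
    """The only exposed operation: an aggregate over all records.
    Decrypted individual records never leave this function."""
    values = []
    for blob in encrypted:
        fields = xor_crypt(blob).decode().split(",")
        values.append(float(fields[field_index]))
    return sum(values) / len(values)

disk = encrypt_records(["id1,34200", "id2,51800", "id3,27700"])
avg_income = mean_of_field(disk, 1)  # aggregate only; no record printout
```

The design point is that disclosure limitation is enforced by the software layer, not by trimming the data themselves, which is why the agency must evaluate the disclosure risks of every output the software can produce before release.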
Release of Microdata under Licensing Agreements To meet the requirements of users whose needs cannot be satisfied with public-use or encrypted CD-ROM microdata sets, the NCES has developed a licensing procedure that allows eligible organizations or agencies to use its confidential data for research and statistical purposes at their own sites (see Wright and Ahmed, 1990). Some of the key features of the standard agreement, which was approved for use on a trial basis in November 1990, follow:
All employees of the licensee who will have access to the data become special sworn employees of NCES. They are required to sign affidavits of nondisclosure and are subject to severe penalties for violation of their oath.
The physical security of the identifiable data must be ensured by use of specified procedures. Representatives of NCES have the right to make unannounced and unscheduled inspections of the licensee's facility to evaluate compliance.
All publications and other data products must be submitted to NCES for review. If there is evident reason for concern about possible disclosure, the products must also be reviewed and approved prior to any release.
The research conducted under the license must be consistent with the statistical purposes for which the data were initially supplied to NCES and the data must be returned to the center or destroyed when the research is completed.
If the licensee is a state agency or contractor thereof, the attorney general of the state must certify that the licensee cannot be required to make individually identifiable data available to a state agency or employee not covered by the agreement.
These licensing procedures permit somewhat wider restricted access than that currently provided by the Census Bureau under its special sworn employee provision. The statutory provisions for the appointment of special sworn employees are identical for the two agencies, but the statutory definition of the NCES mission (in the Hawkins-Stafford amendments) places greater emphasis on the dissemination of data. Much of the research on educational policy issues takes place outside the federal government, and NCES has taken the position that a licensing system that allows researchers to work with restricted access files at their own sites is an important element of its compliance with its mission statement.
By early 1992, NCES had issued about 30 licenses to users. The Statistical Policy Office, OMB, has announced that the licensing procedures and agreement will be submitted to a formal regulatory review, including publication in the Federal Register and comment by all interested groups.
Examples of Failure to Gain Access
SSA's Continuous Work History Sample The Social Security Administration established its Continuous Work History Sample (CWHS) system about 50 years ago to serve as a multipurpose longitudinal data base for program analysis and research on earnings, Social Security benefits, labor force behavior, internal migration, and other characteristics of the U.S. population. For a 1 percent sample of all persons who have been issued a Social Security number, the system contains their date of birth, sex, and race-ethnic classification (as stated earlier in this chapter, race-ethnic information is no longer being captured for most new births) from the Social Security number application form and longitudinal information on earnings and type and location of employment. The earnings information comes from employer annual wage reports (quarterly prior to 1978) submitted to the IRS, and the information on type and location of employment comes primarily from applications to the IRS for an Employer Identification number.
Prior to 1976, CWHS files were widely available and were used for research by other agencies and organizations on the subjects mentioned above. The CWHS system was a prime source of
information for labor market and related research. The proceedings of a 1978 workshop on the uses of Social Security research files, for example, included more than 10 research papers based on CWHS data sets, mostly from non-federal researchers (Social Security Administration, 1978). The Bureau of Economic Analysis also used CWHS data for its regional national income accounts and analyses of regional labor force characteristics.
To enable users to update their longitudinal CWHS files on a quarterly or annual basis, a numerical identifier was included with each record. Initially, the identifier was the actual Social Security number, but after a short time, the actual number was replaced by an encrypted number, based on a simple substitution cipher. Later, a more sophisticated transformation was introduced. It is also worth noting that the specific combinations of Social Security number ending digits used to select the 1 percent sample were published in a journal article many years ago, at a time when the effects of such an action on disclosure risks were not generally understood.
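The report does not specify what the "more sophisticated transformation" was. A modern analogue, sketched here as an assumption rather than a description of the CWHS scheme, is a keyed one-way hash (HMAC): the same secret key produces the same pseudonym in every release, so users can link quarterly updates, but without the agency-held key the original number cannot be recovered, unlike a simple substitution cipher, which an intruder can often invert from digit-frequency patterns.

```python
# Sketch of a keyed one-way identifier transformation for longitudinal
# linkage. This is a modern analogue, not the actual CWHS transformation,
# and the key name is hypothetical. Note the key must stay with the
# agency: if it leaked, the small space of possible SSNs could be
# exhaustively hashed to reverse the pseudonyms.

import hmac
import hashlib

SECRET_KEY = b"agency-held-key"  # hypothetical; never released with the file

def pseudonym(ssn: str) -> str:
    """Deterministic, keyed, one-way pseudonym for a record identifier."""
    return hmac.new(SECRET_KEY, ssn.encode(), hashlib.sha256).hexdigest()[:12]

# The same person receives the same pseudonym in every file update,
# so longitudinal records can be linked...
assert pseudonym("123-45-6789") == pseudonym("123-45-6789")
# ...while distinct persons receive distinct pseudonyms.
assert pseudonym("123-45-6789") != pseudonym("987-65-4321")
```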
Prior to 1974, CWHS files were released without restrictions. However, concerns arose that in files containing county and detailed industry codes it would be possible to identify some employers, especially large ones. Further, employers with access to the file, making use of knowledge of the ending digits that had been used to select the sample, would be able to identify records for some of their own employees and to obtain information, for example, about their previous and other current employment. Starting in 1974, because of these concerns, all recipients were required to agree in writing not to redisclose their files without permission or to attempt to identify individual employers or employees from the file.
As explained in Chapter 5, the Tax Reform Act of 1976 severely curtailed the use of identifiable tax return information for statistical purposes by users outside the IRS. Both the employer information from Employer Identification number applications and the earnings data were considered to be tax return information, and thus, the release of CWHS files containing such data came under the new provisions of the Internal Revenue Code. In 1977, the IRS concluded that CWHS files could not be released in the same detail as previously to non-federal users or to most of the federal agencies that had been using them, including the Bureau of Economic Analysis, which had, until that point, played a major role as a user and disseminator of CWHS files in convenient formats to other users (Carroll, 1985; Smith, 1989).
Since the 1977 termination of releases under the arrangements that had been used prior to the 1976 Tax Reform Act, there have been a few releases of CWHS files, under restricted conditions, to other federal agencies. Files with identifiers removed have recently been released to the Treasury Department's Office of Economic Policy and the Congressional Budget Office. Provisions of the 1976 Tax Reform Act permit both of these agencies to use tax return information under certain conditions, so one cannot anticipate similar releases to the many potential users who do not have this authority. One possibility for wider release would be to develop one or more versions of the CWHS containing less detailed information for characteristics like industry classification and geographic location of employment. However, SSA and the IRS have not been able to agree on a formula for doing this. Thus, agencies like the Bureau of Economic Analysis no longer have access to the microdata files. It seems fair to say that while the CWHS continues to be used for policy analysis by components of the SSA and a handful of other agencies, its availability to the broader community of users is only a small fraction of what it was prior to passage of the 1976 Tax Reform Act.
Access to Address Information in Federal Files for Medical Follow-up and Epidemiologic Studies Access to data for use in epidemiologic studies and user access to the results of such studies raise many complex issues. This example focuses on a single question: access to federal address information for use in locating and tracking respondents in long-term follow-up studies. Although a few federal agencies do have access to such information, access is so limited that it seems reasonable to put this example in the failure category.
The importance of epidemiologic follow-up studies is hard to exaggerate. Most people are exposed, in their work and other aspects of their lives, to a host of substances and environmental factors that may lead to adverse health effects. Some of these effects, such as cancer, may not show up until long after initial exposure. Thus, to determine relationships between exposures and effects and to identify the most serious environmental risks, it is necessary to follow groups of exposed persons for long periods, say 20 years or more, and to determine periodically the state of their health, and if they have died, the cause of their death. For many such studies the group of persons to be followed is not identified until well after the period of exposure, so that finding them is not a simple matter.
Within the federal system, the best potential source of current addresses for most of the population is the IRS, which has recent addresses and Social Security numbers for all tax filers and their dependents. The Social Security Administration and the Health Care Financing Administration have the current address and Social Security number for beneficiaries, who include virtually all persons aged 65 and over, but their coverage of persons under 65 is limited.
For death information, the National Death Index provides access to information on all registered deaths in the United States for 1979 and subsequent years. Because of restrictions imposed by the states on uses of their vital statistics data, NCHS, which is the custodian of the National Death Index, can tell researchers only which members of their study populations have died and the states where those deaths occurred. Researchers who want cause of death and other information appearing on the death certificate must contact the state where the death occurred to purchase a copy of the certificate. Another limitation of the National Death Index is that deaths prior to 1979 are not included. There was some discussion of extending the coverage of the National Death Index back to about 1965, but that was found to be infeasible.
The Social Security Administration releases information about date and place of death to the public from its own files, and epidemiologists can obtain this information about members of study populations they are tracking. However, section 205(r) of the Social Security Act, added in 1983 (P.L. 98–21), required the agency to establish a program of voluntary contracts with the states to obtain death certificate information, with the purpose of correcting SSA records and removing decedents' names from its benefit rolls. With some exceptions (see Aziz and Buckler, 1992), SSA cannot provide epidemiologists and other researchers with the death information it obtains from the states through this program, even though it regularly releases the death information it obtains through its own sources.
Prior to 1976, it was possible, at least in some instances, to obtain current name and address information from the IRS for use in follow-up studies. Such access was completely cut off by the 1976 Tax Reform Act; subsequent amendments made the information available again to the National Institute for Occupational Safety and Health on a limited basis and also for follow-up studies of veterans of military service. However, numerous other government and private organizations conducting follow-up studies do
not have access to this relatively low-cost and effective means of tracking their study populations.
On a slightly different, but related subject, SSA cannot disclose, for use in epidemiologic follow-up studies, rosters of persons who have worked in a particular industry during a specified period. Such lists would be based on information from earnings reports and Employer Identification number applications, both of which are classified as tax return information and are therefore prohibited from disclosure by the 1976 amendments to the Internal Revenue Code. For periods of employment prior to 1978 there would also be difficulties because the manner in which the SSA's files are organized would make the development of such lists a costly process unless a list based on the 1 percent Continuous Work History Sample could serve the purpose of the study. Finally, SSA has little if any information on current addresses for nonbeneficiaries except on tax records that are subject to the Internal Revenue Code's restrictions on disclosure.
These examples illustrate the wide spectrum of modes of external user access to federal statistical data, ranging from no access at all to completely unrestricted access. Legal requirements for confidentiality are not the only factors that influence statistical agencies' decisions on what modes of access are acceptable for particular classes of data and users. When the probability that individual records can be identified and the perceived sensitivity of the data items are high, the agencies are likely to impose greater restrictions on access by external users. An underlying consideration in all decisions is the possibility that a well-publicized violation of confidentiality might lead to widespread public resistance to participation in voluntary or even mandatory statistical data collection programs.
Users look for modes of access that are low cost and enable them to work at their own sites with a minimum of restriction and formality. When their needs cannot be met with public-use data sets, they will generally prefer to work under licensing agreements (Smith, 1991) or with encrypted CD-ROM diskettes, provided those modes will give them timely access to the kinds of data they want, in sufficient detail for their research. Modes of access that require working at agency sites or controlled remote access are not likely to be considered unless there are no alternatives and there are strong incentives to undertake the research.
ARCHIVING FEDERAL STATISTICAL DATA SETS
Our concern in this subsection is with arrangements whereby public-use and other versions of significant statistical data sets, largely in electronic form, are either maintained by statistical agencies or transferred by them to the National Archives and Records Administration (NARA) for preservation and future access for statistical and research uses. The potential interest in and value of secondary analyses of data files long after their original creation is much greater today than it was even 10 years ago given the expanded availability of computer power and new methods of analysis. Some data sets may have the potential for illuminating issues that were not even thought of when they were created. (An analogy would be the practice of freezing genetic material for future breeding and research uses.)
Unfortunately, researchers who have explored the possibility of secondary analyses of old data sets have often encountered serious obstacles. In many instances, the desired files have not been preserved in any form (David, 1980; David and Robbin, 1981). If the files do exist and are accessible, serious difficulties may still result from the lack of supporting documentation, outmoded storage media, and failure to retain the entire data content of files that were used for the original analyses. The question, then, is what can be done to make things easier for future secondary analysts and historical researchers who will want to work with data sets that are now in existence or about to be created?
Current Federal Archiving Procedures All federal agencies, including statistical agencies, operate under statutorily prescribed information management procedures that oblige them to notify the NARA of proposed schedules for disposition of their records. In turn, NARA reviews the proposed schedules and, when it considers the records to ''have sufficient administrative, legal, research, or other value to warrant their further preservation by the United States Government" (Title 44 U.S.C. § 3303a (a)) may disallow the proposed destruction of the records. For records that have been in existence for more than 30 years, the archivist (director of the agency) may direct the transfer of such records to the National Archives.
In general, statutory restrictions on access to and use of records transferred to the National Archives expire after a period of 30 years. However, under certain conditions, by agreement between the archivist and the agency that transferred the records, such
restrictions can remain in force for a longer period (Title 44 U.S.C. § 2108(a)). In addition, it is the usual policy of the National Archives to maintain access restrictions for 75 years when the data are about individuals and their earlier release by the agency that transferred the records could have been denied under Freedom of Information Act exemption 6, which covers "personnel and medical and similar files the disclosure of which would constitute a clearly unwarranted invasion of personal privacy" (Title 5 U.S.C. § 552(b)(6)).
The National Archives only accepts electronic files that are physically intact and have adequate documentation to enable researchers to read and make valid use of the data. The agency takes necessary precautions, such as reading files periodically and recopying them to new media, to ensure that the data remain accessible on current computers.
Archiving Census Bureau Records Arrangements for the archiving of Census Bureau records are fairly well developed. Economic census records, which identify the responding firms, can be used at the National Archives without any restrictions after the statutory period of 30 years has elapsed. Under a special 1952 agreement between the archivist and the Census Bureau, microfilm copies of the original population census records (which are of great interest to genealogists) that are transferred to the National Archives are kept confidential for 72 years, after which time they are made available to users.
About two years ago, another agreement between the archivist and the Census Bureau extended provisions of the 1952 agreement to records from the Current Population Survey and other demographic surveys conducted by the Census Bureau. The 72-year period of confidentiality does not apply to any public-use microdata files that are transferred to the National Archives, but it would apply to internal Census Bureau survey data files whose content had not been restricted to make them suitable for unrestricted public access. To date, no such internal survey files have been transferred to the National Archives, but the archivist has requested files of this type and it is likely that they will be transferred once appropriate security arrangements have been agreed on. To illustrate the kinds of issues that arise, archived electronic files must be copied periodically to ensure their preservation in usable form, and the Census Bureau believes that such copying should be done by special sworn employees who are subject
to criminal penalties for improper disclosure of Census Bureau information.
The rapid replacement of paper records by electronic files as the primary storage medium and the prospect of further changes in data base technology pose many challenges and opportunities for NARA in its efforts to meet future research needs. Statutes and procedures developed for archiving paper records will require significant alterations. A panel established by the National Academy of Public Administration to review these questions issued a report in 1991, The Archives of the Future: Archival Strategies for the Treatment of Electronic Databases. The tenor of that panel's recommendations is clearly expressed by the following excerpt from the report:
NARA itself must take an active stance in seeking candidate databases to evaluate for inclusion as part of the National Archives [emphasis in original]. Persuasive and aggressive oversight should be the ringing quality of NARA's role in guiding the preservation policies of all federal agencies. And time is of the essence. An outstanding characteristic of electronic records … is that there is a much briefer span of time in which to bring them under active preservation. NARA's authority prevents it from taking forceful action to guarantee preservation of records until 30 years have passed. Electronic records not brought under the control of a comprehensive and active preservation program are unlikely to survive more than a few years. NARA must seek additional authorities to sustain a viable program for bringing electronic databases into the National Archives (pp. 1–2).
The report recommended that the National Archives preserve data from 430 federal data bases in addition to more than 600 that the agency had already designated as archival. The National Archives is actively pursuing transfers from these data bases.
RESTRICTED ACCESS: FINDINGS AND RECOMMENDATIONS
Recommendation 6.5 Federal statistical agencies should strive for a greater return on public investment in statistical programs through carefully controlled increases in interagency data sharing for statistical purposes and expanded availability of federal data sets to external users.
Full realization of this goal will require legislative changes, as discussed in Chapter 5, but much can be accomplished within the framework of existing legislation.
The panel believes that some of the newer and more user-friendly restricted access techniques, such as the release of encrypted CD-ROM diskettes with built-in software and licensing agreements that allow researchers to analyze data sets at their own work sites, have considerable promise, and it commends the agencies and organizations that have pioneered the use of such procedures.
Recommendation 6.6 Statistical agencies, in their efforts to expand access for external data users, should follow a policy of responsible innovation. Whenever feasible, they should experiment with some of the newer restricted access techniques, with appropriate confidentiality safeguards and periodic reviews of the costs and benefits of each procedure.
Recommendation 6.7 In those instances in which controlled access at agency sites remains the only feasible alternative, statistical agencies should do all they can to make access conditions more affordable and acceptable to users, for example, by providing access at dispersed agency locations and providing adequate user support and access to computing facilities at reasonable cost.
The panel agrees with the views expressed in the excerpt (above) from the National Academy of Public Administration's report on the archiving of electronic data bases.
Recommendation 6.8 Significant statistical data files, in their unrestricted form, should be deposited at the National Archives and eventually made available for historical research uses.
This recommendation is intended to cover statistical data bases from censuses and surveys and those, like the Statistics of Income and Continuous Work History Sample data bases, that are derived from administrative records. We have purposely not been specific as to the content of such archived data bases and the length of time for which confidentiality restrictions should continue to apply. Some data bases, like the economic and population censuses, might include explicit identification of data providers. Others, especially those based on samples, might not include names and addresses, but would not be subject to statistical disclosure limitation
procedures of the kind that are applied to produce public-use microdata sets for contemporary use.
Some researchers have begun to address these issues. Kamlet et al. (1985), for example, analyze the 1980 National Health Interview Survey, in which several averages are reported rather than individual-level data because of confidentiality restrictions. Typically, analysts of such data simply use the associated group-level information instead of the (unavailable) individual-level data. As Kamlet et al. note, however, this practice can produce inconsistent estimates and regression coefficients of the wrong sign. Kamlet and Klepper (1985) demonstrate how consistent estimators can be computed in special cases. Hwang (1986) deals with the errors-in-variables nature of data masked by adding noise; for such a case, Fuller (1993) illustrates how the data can be analyzed.
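The errors-in-variables problem can be shown with a small simulation. Adding noise to a regressor to mask it attenuates the ordinary least squares slope toward zero; when the noise variance is known, a classical correction recovers a consistent estimate. The example below is a generic illustration of this well-known effect under simulated data, not a reproduction of the analyses in Hwang (1986) or Fuller (1993).

```python
# Sketch of attenuation bias from noise masking, and its classical
# correction when the masking noise variance is known. All numbers are
# simulated; this illustrates the general errors-in-variables effect,
# not any specific published analysis.

import random
random.seed(7)

n, true_slope, noise_var = 5000, 2.0, 1.0
x = [random.gauss(0, 1) for _ in range(n)]                     # true regressor
y = [true_slope * xi + random.gauss(0, 0.5) for xi in x]       # outcome
x_masked = [xi + random.gauss(0, noise_var ** 0.5) for xi in x]  # released data

def ols_slope(xs, ys):
    """Ordinary least squares slope of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

naive = ols_slope(x_masked, y)  # attenuated toward zero

# Classical correction: rescale by var(x_masked) / (var(x_masked) - noise_var).
mean_m = sum(x_masked) / n
var_masked = sum((a - mean_m) ** 2 for a in x_masked) / n
corrected = naive * var_masked / (var_masked - noise_var)
```

With unit-variance x and unit-variance masking noise, the attenuation factor is roughly one half, so the naive slope lands near 1.0 while the corrected estimate returns close to the true value of 2.0.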
The material in this section is based largely on Jabine (1993b), a paper commissioned by the panel. Brackstone (1990) notes that the development and application of statistical disclosure limitation techniques to protect the confidentiality of data is a relatively new area of statistics that arose out of the practical problems statistical agencies faced. Early discussions include Bachi and Baron (1969) and Steinberg and Pritzker (1967). Bailar (1990) identifies the development of statistical disclosure limitation techniques as one of the five major contributions of the Census Bureau. Also see Barabba and Kaplan (1975), Cox et al. (1985), and Greenberg (1990, 1991).
The seven agencies are the Bureau of the Census, Bureau of Labor Statistics, Internal Revenue Service, National Center for Education Statistics, National Center for Health Statistics, Social Security Administration, and the Statistical Reporting Service (now the National Agricultural Statistics Service).
Based on a conversation between Thomas Jabine, a consultant to the panel, and Patricia Doyle of Mathematica at a meeting of the American Statistical Association/Survey Research Methods Working Group on the Technical Aspects of SIPP on May 21, 1992.
Published along with other discussion and commissioned papers in a special issue of the Journal of Official Statistics 1993(2). See Appendix A for a list of the papers.
For univariate dimensions, he checks percentiles, the mean, standard deviation, and skewness. For multivariate dimensions, he checks correlation coefficients and covariance matrices.
The memorandum was signed on February 2, 1990, by Barbara E. Bryant (Census Bureau) and Janet L. Norwood (BLS).
The letters were signed on October 20, 1988, by John A. Ferris (DMDC) and by Kenneth C. Scheflen (OIG) on November 22, 1988.