The Framework of Study
You can't have a democratic society without having a good data base.
Janet Norwood, former commissioner, Bureau of Labor Statistics, 1991
In Chapter 1, we laid out ethical principles for statistical agencies as they struggle to broker society's insistence that citizens be allowed to lead private lives and that public policies be based on the dissemination of relevant personal information. In this chapter, we first put this struggle in historical context by tracing the evolution of the federal statistical system's response to issues of confidentiality and data access. Because of their importance, these issues have been the subject of examination by various commissions, panels, and committees. Thus, we next briefly review this earlier work and the recommendations made by some of the key groups. We then argue that recent changes in the composition of society and in computer and communications technology make reexamination of these issues a pressing concern. Finally, we identify and describe the responsibilities that federal statistical agencies have to their various constituencies.
EVOLUTION OF THE FEDERAL STATISTICAL SYSTEM
The federal statistical system has evolved apace with the country, and at each step it has had to address confidentiality and data access issues. The Constitutional Convention of 1787 called for a count of Americans every 10 years beginning in 1790. By the second decennial census, Vice-President Thomas Jefferson had successfully urged more detailed collection of data on people's ages so policies could be designed to raise longevity. The seventh federal census (1850) was greatly enlarged to report by individuals rather than families. In addition, the practice of having local marshals collect and tabulate the results was stopped. Instead, local census takers filled out the forms, which were then sent to the census office in Washington, D.C., for uniform classification and tabulation. Some 640,000 pages of census schedules were bound in 800 volumes to provide, for the first time, a comprehensive statistical picture of the social and economic life of the nation. To organize the 1850 census, the new superintendent of the census, Professor of Political Economy James D.B. De Bow, teamed with Lemuel Shattuck, a founder in 1839 of the American Statistical Association.
The first permanent census office was founded following adoption of the Permanent Census Act of March 1902. According to Boorstin (1973:172), Dr. S.N.D. North, the first head of the permanent census office, was ready to divide all modern history "into two periods, the nonstatistical and the statistical; one the period of superstition, the other the period of ascertained facts expressed in numerical terms." Further, North continued, "the science of statistics is the chief instrumentality through which the progress of civilization is now measured, and by which its development hereafter will be controlled." The census office of S.N.D. North evolved into the modern U.S. Bureau of the Census, whose responsibilities go well beyond the decennial census.
As the need for information has grown in such areas as labor, education, and health, the nation has created specialized statistical agencies to collect and disseminate data. In other cases, states have assumed data collection responsibilities, especially in the area of vital statistics, such as birth, marriage, and death records. Unlike its northern neighbor, with its Statistics Canada, the United States has a decentralized statistical system, in which there are numerous federal statistical agencies, each with separate enabling legislation and distinct data provider and data user constituencies. Because of this, substantial variations exist in the way agencies and programs seek to protect confidentiality and provide data access. Some variations are justifiable given the agencies' differing mandates; however, some appear more an accident of history and reflect a lack of coordination and systemwide thinking. In either case, the existing decentralized system provides a natural experiment for examining what works and what does not work for data protection and data access.
Over the past two centuries, federal statistical agencies have responded commendably to growing public concerns about protecting individual autonomy. Courtland (1985) reviews this historical progression for the Census Bureau. Through the first six decennial censuses (1790–1840), for the limited data to be collected on each household, each census taker was "to cause a correct copy, signed by himself, of the schedule containing the number of inhabitants within his division to be set up at two of the most public places within the same, there to remain for the inspection of all concerned." By the 1850 census, this practice of public posting had been abandoned and census takers were making assurances of confidentiality. Still, copies of the census returns were filed with state officials and county courts, a practice about which Francis Amasa Walker, the superintendent of the 1870 census, expressed concerns (quoted in Courtland, 1985:409):
The whole expenditure has been worse than useless. It has been positively mischievous. The knowledge on the part of the people that the original sheets of the census were to be deposited among the records of the counties to which they relate, has added almost incalculably to the resistance which the inquiries of the census have encountered. It is useless to attempt to maintain the confidential character of a census under such circumstances. The deposit of the returns at the county seat of every county constitutes a direct invitation to impertinent or malicious examination…. At every step the work of the assistant marshal has been made more difficult by the fear that the information would be … divulged for impertinent and malicious criticism. No one feature of the present method of enumeration has done so much to excite and justify this fear as the provision of the law which requires that the original returns for each county shall be deposited in the office of the county clerk.
By 1890, census legislation required census workers to swear under oath not to disclose census data except to their superiors and eliminated the requirement to file copies with the county court.
During the nineteenth century there was steady growth in the collection of labor, health, agriculture, and education statistics (see, e.g., Duncan and Shelton, 1978). In the early twentieth century, a wealth of federal administrative data began to be assembled with the ratification of the Sixteenth Amendment to the Constitution in 1913, which enabled a federal income tax, and the Income Tax Act of 1913, which implemented the income tax.1 Statistics based on the data were first published pursuant to the Revenue Act of 1916. By 1924 legislation had loosened access to tax data to include public listing of taxpayers and their incomes and access to tax returns by two congressional revenue committees. A reaction to this openness led to the Revenue Act of 1926, which rescinded public access to income data.
Major growth in the federal statistical system began in the 1930s with the implementation of an array of government programs to bring the country out of the Great Depression.2 Growth in personnel, budget, and responsibilities accelerated in the postwar period with the passage of the Employment Act of 1946 and the accompanying establishment of the Council of Economic Advisers and the Joint Economic Committee of the Congress. The need for social, as well as economic, data was further highlighted by the civil rights movement and Great Society programs of the 1960s.
Paralleling the growth in the scope of its responsibilities, the Census Bureau has paid increasing attention to protecting confidentiality. Starting in 1940, it ceased the release of certain aggregate data (such as tables displaying counts of the number of data subjects in various categories) from its demographic census publications: tables that had small cell counts were not released. In the 1960s, the Census Bureau began to release some computer files of records about individuals (i.e., public-use microdata files), but under the oversight of the Census Bureau Microdata Review Panel, it deletes or modifies potentially identifying information in the files.
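The idea of withholding small cell counts can be illustrated with a minimal sketch in Python. It is an illustration under stated assumptions, not a description of Census Bureau procedure: the function name suppress_small_cells, the cutoff of 3, and the sample counts are all hypothetical.

```python
def suppress_small_cells(table, threshold=3):
    """Return a copy of `table` in which counts below `threshold` are withheld.

    `threshold` is a hypothetical cutoff chosen for illustration only.
    """
    # None marks a cell that is withheld from publication.
    return {
        category: (count if count >= threshold else None)
        for category, count in table.items()
    }


# Hypothetical counts of data subjects by occupation in one small area.
counts = {"farmers": 120, "physicians": 2, "teachers": 45}
published = suppress_small_cells(counts)
```

Withholding a cell in this way is only a first step. If row or column totals are also published, a withheld count can sometimes be recovered by subtraction, so additional cells may have to be withheld as well.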
Confidentiality concerns have arisen not only in regard to the Census Bureau, but, peaking some 20 years ago, more generally in regard to the federal government. During the Watergate period of the early 1970s, for example, the public and Congress were alarmed by the disclosure to the White House of tax information on a number of political opponents. Legislative remedies were then developed. The Privacy Act of 1974 was enacted to provide greater control over the government's use of personal records, and the Tax Reform Act of 1976 curtailed presidential authority to access tax records and to make them available to other agencies and organizations for nontax uses.
Concern about the confidentiality of personal records hampered ordinary statistical uses of federal data, for example, when the Department of Agriculture was blocked in 1973 from using tax return information to construct a directory of names for use in its surveys of farmers. During the 1970s many ambitious proposals, like the President's Reorganization Project for the Federal Statistical System, were developed to coordinate the federal statistical effort, and they generally gave careful attention to confidentiality. As Statistical Policy Working Paper 2, Report on Statistical Disclosure and Disclosure-Avoidance Techniques, indicated,
Most agencies that release statistical information are becoming increasingly sensitive to the disclosure issue, and … have adopted or are in the process of adopting policies and procedures designed to avoid unacceptable disclosure (Federal Committee on Statistical Methodology, 1978:41).
The decade of the 1980s is widely viewed as a period of retrenchment for the federal statistical system; most of the agencies were on the defensive in an effort to preserve programs and maintain budgets. The 1990s have begun with renewed recognition of the importance of federal statistics and a commitment to improve the quality of the system. For example, in 1991 the Economic Policy Council Working Group of the Council of Economic Advisers (1991), under chair Michael Boskin, proposed several initiatives to improve the quality of economic statistics. Quality improvement embraces renewed attention to confidentiality and data access concerns. Robert M. Groves, while at the Bureau of the Census, identified "analysis of risk of disclosure of confidential data" (quoted in National Research Council, 1992a:23) as a key interdisciplinary need closely connected to quality improvement. The Clinton administration now has an opportunity for further changes to improve the federal statistical system.
EARLIER STUDIES OF PRIVACY, CONFIDENTIALITY, AND DATA ACCESS
History suggests that privacy, confidentiality, and data access are ongoing concerns. For the federal government, these issues have been addressed by several groups and organizations, especially during a period of intensive activity in the 1970s.3 The reports prepared by the various groups chronicle their ideas, many of which remain valid today. The scope of each study was different from that of this study, however. For example, the Privacy Protection Study Commission (1977a, b, c) was concerned with all uses of personal records, not just statistical records, and the Office of Federal Statistical Policy and Standards (1978) examined all aspects of federal statistical programs, going far beyond confidentiality and data access issues. Below, we briefly review three of the studies, which were chosen because they relate most closely to our mission, and we recommend their reports to all who want to probe these subjects. Additionally, we refer the reader to the discussions in Boruch and Cecil (1979) and Flaherty (1979, 1989).
AMERICAN STATISTICAL ASSOCIATION AD HOC COMMITTEE ON PRIVACY AND CONFIDENTIALITY
In 1975, Lester R. Frankel, then president of the American Statistical Association, appointed the Ad Hoc Committee on Privacy and Confidentiality to deal with information reporting, privacy, and confidentiality issues. After two years of work, the committee, chaired by Joseph L. Gastwirth, produced a final report, which made the following key recommendations regarding confidentiality:
Confidentiality statutes providing full and overriding protection against compulsory disclosure of identifiable records from statistical data systems derived either from surveys or from administrative records should be enacted to cover each federal statistical agency and designated units of other agencies.
Disclosure for nonstatistical purposes of data about identifiable individuals collected or derived from administrative records by federal agencies solely for statistical purposes should be prohibited by statute….
The Committee urges the Congress to avoid the passage of legislation which has the effect of revoking proper guarantees of confidentiality already given by agencies collecting data to be used solely for statistical and research purposes. Statutes which have already had this effect should be amended to exempt data collected or compiled solely for statistical and research purposes (American Statistical Association, 1977:75–76).
PRIVACY PROTECTION STUDY COMMISSION
Commissioned by the Privacy Act of 1974, the Privacy Protection Study Commission (1977a), chaired by David F. Linowes, prepared a report on Personal Privacy in an Information Society. Although the scope of its report is much broader than just research and statistical uses of data, the commission emphasized that such activities (1) benefit society as a whole and (2) depend on voluntary cooperation for accurate information. Voluntary cooperation requires assurances of confidentiality. It also emphasized that the rich lode of administrative data built up by the federal government had barely been tapped for research and statistical purposes. From our standpoint, a key recommendation of the commission in the area of confidentiality and data access was to establish a clear functional separation between the use of information for research and statistical purposes and its use for administrative purposes. It further recommended the establishment of an independent entity within the federal government to monitor and research privacy-related issues and to issue interpretative rulings regarding implementation of the Privacy Act of 1974.
OFFICE OF FEDERAL STATISTICAL POLICY AND STANDARDS
In 1978 the Office of Federal Statistical Policy and Standards (the equivalent at that time of OMB's Statistical Policy Office) issued a report called A Framework for Planning U.S. Federal Statistics for the 1980's. Some basic recommendations made in the report regarding confidentiality and data access were as follows:
All agencies involved in the collection of statistical or research data should have mandated legislative protection for the confidentiality of information collected or otherwise obtained to be used solely for statistical or research purposes. This should apply to both commercial and personal data.
The uses of statistical data must be restricted to prevent their use in identifiable form for making determinations which would affect the rights, benefits or privileges of the individuals.
Exchange of data among the "protected enclaves" [see below] should be feasible under controlled conditions.
Administrative data sets should be accessible to statisticians and researchers in "protected enclaves" for some statistical uses unrelated to the purposes of the original data collection (pp. 280–281).
The report also argued for
enactment of a clear legal status as "protected enclaves" for selected statistical and research agencies in the major departments, and for other clearly identified statistical and research units within other agencies. The enclave must be insulated from intervention and from unauthorized access to data. Employees must be subject to strict ethical standards established with respect to data handling and to penalties for voluntarily releasing identifiable data contrary to law (p. 281).
Such "protected enclaves" have not been established.
WHAT HAS CHANGED TO WARRANT A NEW STUDY?
Since 1980 many changes in the social and technical environment have affected the federal statistical system. Those changes have caused increased concern about the confidentiality of and access to statistical data and led to our reexamination of these issues. Pertinent developments in the past decade include the following:
advances in computer and communications technology,
an expanding role for outside researchers in the use of federal data bases for policy analysis,
expanded use of matching (record linkage) for statistical and nonstatistical purposes,
increases in the variety, number, and influence of organizations that have a stake in confidentiality and data access issues,
increasing difficulties in persuading data providers to participate in censuses and surveys,
initiation of cognitive research aimed at the improvement of informed consent and notification procedures for surveys, and
new developments in research on statistical disclosure limitation.
Below, we briefly describe each of these developments and indicate areas of concern for the federal statistical system.
ADVANCES IN COMPUTER AND COMMUNICATIONS TECHNOLOGY
Rapid advances in computer and communications technology have resulted in increased demands by data users for microdata (i.e., data on individual subjects). In 1978, Statistical Policy Working Paper 2 (Federal Committee on Statistical Methodology, 1978:41) noted the increased use of microdata files since 1960 and affirmed, "This development has significantly increased the utility of statistical data bases created by Federal agencies from censuses, surveys and administrative records and promises to do so even more." Beginning with the introduction of personal computers in about 1981, the rapid proliferation of computing power has radically altered the mainframe environment of the 1970s.
The increase in computational capability, the rapid decline in the cost of computer data storage, the development of more sophisticated data base software, the improvement of data transmission capabilities, and the development of computerized data entry—all have made it possible to develop and access with ease large data bases of personal records and thus made confidentiality issues more salient. And although sophisticated techniques for disclosure limitation can now be applied, analytic tools also exist that make it easier for a potential data snooper to identify individual records.
The role of outside researchers in the use of federal data bases for policy analysis has been expanding. These outside researchers include those who work for other agencies, Congress, academic institutions, businesses, labor organizations, and various other organizations. This expanded role is appropriate. The task of analyzing the collected data in order to obtain maximum information from them is enormous, and yet the statistical agencies necessarily face fiscal and institutional constraints that limit the amount of analysis they can perform.
Outside researchers possess substantial capability for data analysis because of their subject-matter knowledge, computing capability, and professional motivation to obtain new insights from the data. Further, their independent analysis of federal data can provide not only new research and policy interpretations, but also uses that the various government agencies had not envisioned. At the same time, the demands of outside researchers for access to federal data raise additional confidentiality concerns for federal statistical agencies.
In order to enhance their ability to answer complex policy-relevant questions, researchers seek to match records in one statistical data base with records from the same data provider in another data base. By matching records, they can avoid having to ask data providers for information they have already given to another data gatherer. The existence of abundant and available computing capability makes matching a more viable option than in prior decades.
In general, record linkage can be an effective tool for statistical and administrative purposes. In administrative procedures, records of federal student loans, for example, can be linked to federal employment records in order to identify federal employees who are delinquent in repaying their loans. In statistical studies, data on federal student loans can be linked to federal employment records to research the value added in human capital of federal loan programs.
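To make the idea of matching concrete, the following is a minimal sketch in Python of exact-match record linkage, assuming that both files carry a common identifier. The function name link_records, the field names (person_id, loan_balance, salary), and the sample records are hypothetical and are not drawn from any agency's files or procedures. Linkage in practice often has to rely on imperfect identifiers such as names and addresses, which this sketch does not attempt to handle.

```python
def link_records(file_a, file_b, key="person_id"):
    """Exact-match linkage: pair each record in file_a with the record
    in file_b that has the same value of `key`, if one exists."""
    # Index the second file by the shared identifier for fast lookup.
    index_b = {rec[key]: rec for rec in file_b}
    linked = []
    for rec in file_a:
        match = index_b.get(rec[key])
        if match is not None:
            # Combine the fields of both records; file_a wins on name clashes.
            linked.append({**match, **rec})
    return linked


# Hypothetical records from two separate data bases.
loans = [
    {"person_id": 1, "loan_balance": 5000},
    {"person_id": 2, "loan_balance": 0},
]
employment = [
    {"person_id": 1, "salary": 30000},
    {"person_id": 3, "salary": 41000},
]
linked = link_records(loans, employment)
```

Only the one person who appears in both files yields a linked record. The same combined record could serve a statistical study or an administrative action, which is why the purpose of a linkage matters as much as the technique.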
For ethical and pragmatic reasons, as we argue in Chapter 1 and elsewhere in this report, data collection for statistical purposes should be protected from administrative uses. Thus, a survey of college graduates that purports to be examining the value of various sources of college financial aid should not be used to locate those who are delinquent on federal student loans. This concept of functional separation is fundamental to the integrity and effectiveness of a federal statistical agency.
ORGANIZATIONS CONCERNED WITH CONFIDENTIALITY AND DATA ACCESSIBILITY
Responding to the demands of an information-rich age, the variety, number, and influence of organizations claiming a stake in confidentiality and data access issues have mushroomed. As an illustration of the scope of this growth in just one area, the Second Conference on Computers, Freedom, and Privacy was held in March 1992, at George Washington University, one year after the first conference. Sponsored by the Association for Computing Machinery, the 1992 conference had 12 co-sponsors, ranging from the American Civil Liberties Union to the Association of Research Libraries to the Committee on Communications and Information Policy of the Institute of Electrical and Electronics Engineers-USA. In addition, it had 10 patrons, including Bell Atlantic and the Computer Security Institute. Given the evident range and depth of concern, federal statistical agencies should be sensitive to the views of different stakeholders, accommodate their conflicting needs to the extent possible, and help to inform the public debate by making the various parties aware of each other's views.
PERSUADING DATA PROVIDERS
Many data collectors face mounting difficulties in persuading data providers to participate in censuses and surveys. There is a consensus that, generally, respondent cooperation with survey activities is declining. Dalenius (1988) describes a willingness to provide "self-disclosure" and a "survey-mindedness" during the 1950s and the early 1960s that have since become weaker. Although no U.S. examples of serious consequence have yet emerged, Dalenius cites a European example that suggests caution. According to Dalenius, a 1986 debate over Stockholm University's Project Metropolitan, a longitudinal study of some 15,000 people born in 1953, may have doubled the nonresponse rate to Statistics Sweden's labor force survey. The key issue was the accumulation and linkage, from several administrative and statistical sources, of highly sensitive data for individuals, without their apparent knowledge. Although there was some notification of parents at the start of the study, the subjects themselves were not contacted when they reached the age at which they could make their own decisions (p. 5).
In recognition of the personal, and often sensitive, nature of the information that surveys seek from individuals, federal statistical agencies have for many years conducted a variety of survey experiments. More recently, they have conducted cognitive and public opinion research. In the late 1970s, three major studies of informed consent assurances were conducted with survey respondents. In a paper prepared for the panel, Singer (1993) reviews these and other related studies.
The Bureau of Labor Statistics, the Census Bureau, and the National Center for Health Statistics have recently set up small units to conduct cognitive research. Within a larger mandate of improving the design of questionnaires, the units have addressed some issues related to confidentiality and data access. Specifically, research studies have been conducted to gain a better understanding of how survey respondents react to personal questions under differing confidentiality pledges and informed consent and notification procedures. In addition, the Internal Revenue Service has sought to elicit respondents' views on the sharing of personal data among selected agencies.
Several public opinion surveys have also addressed privacy issues. For example, a 1990 Louis Harris survey addressed the information practices of business and government agencies (see Equifax Inc., 1990). The Internal Revenue Service has sponsored a series of surveys on such topics as the use of information from mandatory data sets (like income tax returns) for statistical purposes not related to the purposes for which the data were collected (e.g., Internal Revenue Service, 1984, 1987). We explore the impact of the various research activities in Chapter 3.
NEW DEVELOPMENTS IN RESEARCH ON STATISTICAL DISCLOSURE LIMITATION
Various researchers in statistical agencies and in universities have developed better theoretical frameworks for the construction of statistical disclosure limitation techniques. New disclosure limitation techniques have been developed for tabular data, public-use files of individual data, and data accessed from computer data bases. These developments, which are explored in Chapter 6, are designed to make it possible for statistical agencies to permit data access under conditions that protect confidentiality and maintain the utility of the data for legitimate users.
RESPONSIBILITIES OF FEDERAL STATISTICAL AGENCIES
In Chapter 1 we explored three guiding principles for federal statistical agencies: democratic accountability, individual autonomy, and constitutional empowerment. Each of these principles strongly relates to the responsibilities agencies have to the public, data subjects and providers, data users, other statistical agencies, and the custodians of administrative records regarding confidentiality and data access.
RESPONSIBILITIES TO THE PUBLIC
The principle of democratic accountability highlights the responsibilities of the federal statistical agencies to the public. On confidentiality and data access issues, agencies should establish and maintain a reputation for trustworthy stewardship of data. As an example, in a free enterprise economic system, trustworthy stewardship ensures common availability of accurate economic information. A statistical agency should have sufficient independence to be insulated from political interference so that it can provide facts to public policy debates, fairly and impartially. It should also respond promptly, efficiently, and effectively to data needs in areas that are priorities on the public agenda. In particular, it should provide data needed to evaluate the results of government activity or lack of activity. Further, it should maintain a sufficiently high public profile that its work is known to the public.
RESPONSIBILITIES TO DATA PROVIDERS AND DATA SUBJECTS
Statistical agencies have responsibilities to data providers and data subjects that are congruous with the principle of individual autonomy. They should observe fair statistical information practices. Those practices include (1) protection of confidentiality, (2) nondisclosure of identifiable information for administrative, regulatory, or enforcement purposes, (3) use of informed consent in voluntary surveys, and (4) notification of the conditions of participation in mandatory surveys. History suggests that statistical agencies should anticipate and be prepared to contend with requests by government, the courts, and citizens for individually identifiable data for nonstatistical purposes. Yet, some federal statistical agencies lack the legislative protection they need to ensure the confidentiality they promise. Additionally, congruous with the principle of constitutional empowerment, statistical agencies should respect the willingness of data providers to contribute to society by fairly representing the information they provide and by making it available for appropriate uses (National Research Council, 1992b:5).
In the United States, beyond the decennial census, an individual's cooperation with federal surveys is largely voluntary. Cooperation involves a willingness to provide data and a good-faith effort to provide accurate data. Cooperation is dependent on data providers sensing the value of providing accurate data, having the time to respond, and believing that cooperation will not harm them. These perceptions may be influenced by the particular assurances they receive from the agency collecting the data (see Boruch and Cecil, 1979; National Research Council, 1979; Singer, 1978, 1979; and Singer et al., 1990).
The panel believes that government has different ethical responsibilities to data providers who are individuals or households than to data providers that are organizations. This view derives from the fundamental role of the individual in society: whatever rights organizations have derive from the individuals they represent and to whom they are accountable. A number of questions about appropriate confidentiality and data access policies for organizations cannot be answered by immediate extrapolation from policies for individuals. What confidentiality protection should organizations enjoy? Should tax-exempt institutions, for example, merit less protection than privately held firms? For establishments, is the basic concern access to proprietary and financial data by competitors and regulatory agencies?
In exercising their responsibilities, federal statistical agencies need to be confident that users of their data will not act irresponsibly toward the data providers. Thus, they should subject users seeking access to mechanisms that ensure their accountability about confidentiality. They might well make users bear the liability for violation of confidentiality requirements. If data providers can demonstrate harm resulting from such violations, they should have accessible legal remedies.
RESPONSIBILITIES TO DATA USERS
In accord with the principle of constitutional empowerment, statistical agencies have responsibilities to a wide range of data users. They should strive to maximize the delivery of timely, accurate, and complete information, subject to constraints on confidentiality and budget.
RESPONSIBILITIES TO OTHER STATISTICAL AGENCIES
Sharing of data with other statistical agencies can enhance efficiency and the quality of information that is available for research and policy purposes. Data sharing is consistent with the principles of democratic accountability and constitutional empowerment. In developing policies for data sharing there is an insistent tension with the principle of individual autonomy, and agencies must be mindful of the expectations that data providers have of the uses to which their data might be put. Additionally, as users of data collected by others, statistical agencies are obligated to use those data responsibly. This issue is difficult because different agencies operate under different legislative authority to protect the confidentiality of their data. Thus, an agency with strong confidentiality protection cannot be expected to be forthcoming with data to an agency that lacks such confidentiality protection.
RESPONSIBILITIES TO CUSTODIANS OF ADMINISTRATIVE RECORDS
In using administrative records, statistical agencies have an obligation to maintain confidentiality. Further, based on the notion that personal records should be accurate and complete, agencies using such records ought to provide feedback that can improve the quality of administrative data bases.
Federal Paperwork (1977a, b), Federal Committee on Statistical Methodology (various Statistical Policy Working Papers, especially number 2, which was prepared by the Subcommittee on Disclosure-Avoidance Techniques, 1978), Office of Federal Statistical Policy and Standards (1978), President's Commission on Federal Statistics (1971), President's Reorganization Project for the Federal Statistical System (1981), and Privacy Protection Study Commission (1977a, b, c).