2 Legal, Ethical, and Statistical Issues in Protecting Confidentiality

PAST AND CURRENT PRACTICE

There is a long tradition in government agencies and research institutions of maintaining the confidentiality of human research participants (e.g., de Wolf, 2003; National Research Council, 1993, 2000, 2005a). Most U.S. research organizations, whether in universities, commercial firms, or government agencies, have internal safeguards to help guide data collectors and data users in ethical and legal research practices. Some also have guidelines for the organizations responsible for preserving and disseminating data, data tables, or other compilations.

Government data stewardship agencies use a suite of tools to construct public-use datasets (micro and aggregates) and are guided by federal standards (Doyle et al., 2002; Confidentiality and Data Access Committee, 2000, 2002). For example, current practices that guide the U.S. Census Bureau require that geographic regions must contain at least 100,000 persons for micro data about them to be released (National Center for Health Statistics and Centers for Disease Control and Prevention, 2003). Most federal agencies that produce data for public use maintain disclosure review boards that are charged with the task of ensuring that the data made available to the public have minimal risk of identification and disclosure. Federal guidelines for data collected under the Health Insurance Portability and Accountability Act of 1996 (HIPAA) are less stringent: they prohibit release of data for regions with fewer than 20,000 persons. Table 2-1 shows the approaches of various federal agencies that regularly collect social data to maintaining confidentiality, including cell size restrictions and various procedural methods.
Fewer guidelines exist for nongovernmental data stewardship organizations. Many large organizations have their own internal standards and procedures for ensuring that confidentiality is not breached. Those procedures are designed to ensure that staff members are well trained to avoid disclosure risk and that data in their possession are subject to appropriate handling at every stage in the research, preservation, and dissemination cycle. The Inter-university Consortium for Political and Social Research (ICPSR) requires staff to certify annually that they will preserve confidentiality. It also has a continual process of reviewing and enhancing the training that its staff receives. Moreover, ICPSR requires that all data it acquires be subject to a careful examination that measures and, if necessary, reduces disclosure risk. ICPSR also stipulates that data that cannot be distributed publicly over the Internet be made available using a restricted approach (see Chapter 3). Other nongovernmental data stewardship organizations, such as the Roper Center (University of Connecticut), the Odum Institute (University of North Carolina), the Center for International Earth Science Information Network (CIESIN, at Columbia University), and the Murray Research Archive (Harvard University), have their own training and disclosure analysis procedures, which over time have been very effective; there have been no publicly acknowledged breaches of confidentiality involving the data handled by these organizations, and in private discussions with archive managers, we have learned of none that led to any known harm to research participants or legal action against data stewards.

Universities and other organizations that handle social data have guidelines and procedures for collecting and using data that are intended to protect confidentiality. Institutional review boards (IRBs) are central in specifying these rules.
They can be effective partners with data stewardship organizations in creating approaches that reduce the likelihood of confidentiality breaches. The main activities of IRBs in the consideration of research occur before the research is conducted, to ensure that it follows ethical and legal standards. Although IRBs are mandated to do periodic continuing review of ongoing research, they generally get involved in any major way only reactively, when transgressions occur and are reported. Few IRBs are actively involved in questions about data sharing over the life of a research project, and fewer still have expertise in the new areas of linked social-spatial data discussed in this report.

Although not all research is explicitly subject to the regulations that require IRB review, most academic institutions now require IRB review for all human subjects research undertaken by their students, faculty, and staff. In the few cases for which IRB review is not required for research that links location to other human characteristics and survey responses, researchers undertaking such studies are still subject to standard codes of research ethics. In addition, many institutions require that their researchers, regardless of their funding sources, undergo general human subjects protection training when such issues are pertinent to their work or their supervisory roles. IRBs are also taking a more public role; for example, making resources available for investigators and study subjects.[1] Educating IRBs and having IRBs do more to educate investigators may be important to increased awareness of confidentiality issues, but education alone does not address two challenges presented by the analysis of linked spatial and social data.

[1] For example, see the website for Columbia University's IRB: http://www.columbia.edu/cu/irb/ [accessed April 2006].

PUTTING PEOPLE ON THE MAP

TABLE 2-1 Agency-Specific Features of Data Use Agreements and Licenses

Mechanisms for Data Approval*

Columns (first page): Agency | IRB Approval Required | Institutional Concurrence | Security Pledges All Users | Report Disclosures

National Center for Education Statistics | X X X
National Science Foundation | X X X
Department of Justice | X X X
Health Care Financing Administration |
Social Security Administration | X X X X
Health Care Financing Administration-National Cancer Institute |
Bureau of Labor Statistics-National Longitudinal Survey of Youth | X X
Bureau of Labor Statistics-Census of Fatal Occupational Injuries | X X
National Institute of Child Health and Human Development | X X X
National Heart, Lung, and Blood Institute | X
National Institute of Mental Health | X X
National Institute on Drug Abuse | X X
National Institute on Alcohol Abuse and Alcoholism | X

Columns (continued): Security Plan | Security Inspections | Cell Size Restrictions | Prior-approval Reports | Notification of Reports

X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X

*The agreement mechanisms for data use range from those believed to be most stringent (IRB approval) on the left to the least stringent (notification of reports) on the right. In practice, policies for human subjects protection often comprise several mechanisms or facets of them. IRB approval and "institutional concurrence" are similar, though the latter often encompasses financial and legal requirements of grants not generally covered by IRBs.

NOTE: Security plans may be quite broad, including safeguards on the computing environment as well as the physical security of computers on which confidential data are stored. Security inspections are randomly timed inspections to assess compliance of the security plan.

SOURCE: Seastrom (2002:290).

One of these challenges is that major sources of fine-grained spatial data, such as commercial firms and such government agencies as the National Aeronautics and Space Administration (NASA) and National Oceanic and Atmospheric Administration (NOAA), do not have the same history and tradition of the protection of human research subjects that are common in social science, social policy, and health agencies, particularly in relation to spatial data. As a result, they may be less sensitive than the National Institutes of Health (NIH) or the National Science Foundation (NSF) to the risks to research participants associated with spatial data and identification. Neither NASA nor NOAA has large-scale grant or research programs in the social sciences, where confidentiality issues usually arise. However, NASA and NOAA policies do comply with the U.S. Privacy Act of 1974, and in some research activities that involve human specimens or individuals (e.g., biomedical research in space flight) or establishments (such as research on the productivity of fisheries).[2] NASA and NOAA also provide clear guidance to their investigators on the protection of human subjects, including seeking IRB approval, obtaining consent from study subjects, and principal investigator education. For example, NASA's policy directive on the protection of human research subjects offers useful guidance for producers and users of linked spatial-social data, although it is clearly targeted at biomedical research associated with space flight.[3]

The difference in traditions between NASA and NOAA and other research agencies may be due in part to the fact that spatial data in and of themselves are not usually considered private. Although aerial photography can reveal potentially identifiable features of individuals and lead to harm, legal privacy protections do not apply to observations from navigable airspace (see Appendix A). Thus, agencies have not generally established human subjects protection policies for remotely sensed data.
Privacy and confidentiality issues arise with these data mainly when they are linked to social data, a kind of linkage that has been regularly done only recently. These linkages, combined with dramatic increases in the resolution of images from earth-observing satellites and airborne cameras in the past decade, now make privacy and confidentiality serious issues for remote data providers. Thus, it is not surprising that NASA and NOAA are absent from the list of agencies in Table 2-1 that have been engaged in specifying data use agreements and licenses, another context in which confidentiality issues may arise. Agencies that already have such procedures established for social databases may be better prepared to adopt such procedures for spatial data than agencies that do not have established procedures for human subjects protection.

[2] For details, see http://www.pifsc.noaa.gov/wpacfin/confident.php [accessed January 2007].
[3] See http://nodis3.gsfc.nasa.gov/npg_img/N_PD_7100_008D_/N_PD_7100_008D__main.pdf [accessed January 2007].

The other challenge is that, absent the availability of other information or expertise, IRBs have, for the most part, treated spatially linked or spatially explicit data no differently from other self-identifying data. There are no current standards or guidelines for methods to perturb or aggregate spatially explicit data other than those that exist for other types of self-identifying data. Current practice primarily includes techniques such as data aggregation, adding random noise to alter precise locations, and restricting data access. Without specialized approaches and specialized knowledge provided to IRBs, they can either be overly cautious and prevent valuable data from being made available for secondary use or insufficiently cautious and allow identifiable data to be released. Neither option addresses the fundamental issues.

The need for effective training in confidentiality-related research and ethics issues goes beyond the IRBs and investigators, and extends to data collectors, stewards, and users. Many professional organizations in the social sciences have ethics statements and programs (see Chapter 1 and Appendix B), and these statements generally require that students be trained explicitly in ethical research methods. Training programs funded by the NIH also require ethics components, but it is not at all certain that the coverage provided or required by these programs goes beyond general ethical issues to deeper consideration of ethics in social science research, let alone in the context of social-spatial linkages.[4] Professional data collection and stewardship organizations, as noted above, typically have mandatory standards and training. Nonetheless, there is no evidence that any of these organizations are systematically considering the issue of spatial data linked to survey or other social survey data in their training and certification processes. We offer some recommendations for improving this situation in Chapter 4.
[4] For example, the Program Announcement for National Research Service Award Institutional Research Grants (T32) specifies: Although the NIH does not establish specific curricula or formal requirements, all programs are encouraged to consider instruction in the following areas: conflict of interest, responsible authorship, policies for handling misconduct, data management, data sharing, and policies regarding the use of human and animal subjects. Within the context of training in scientific integrity, it is also beneficial to discuss the relationship and the specific responsibilities of the institution and the graduate students or postdoctorates appointed to the program (see http://grants1.nih.gov/grants/guide/pa-files/PA-02-109.html [accessed April 2006]).

LEGAL ISSUES

Researchers in college or university settings or supported by federal agencies are subject to the rules of those institutions, in particular, their Federalwide Assurances (FWAs) for the Protection of Human Subjects and the institutional review boards (IRBs) designated under their Assurances.
Also, researchers may find guidance in the federal statutes and codes that govern research confidentiality for various agencies.[5] Rules may also be defined legally through employer-employee or sponsor-recipient contracts. Obligations to follow IRB rules, policies, and procedures may be incorporated in the terms of such contracts in addition to any explicit language that may refer to the protection of human subjects.

[5] An illustrative compendium of federal confidentiality statutes and codes can be found at http://www.hhs.gov/ohrp/nhrpac/documentsnhrpac15.pdf [accessed April 2006].

Researchers who are not working in a college or university or who are not supported with federal funds may be bound, from a practical legal perspective, only by the privacy and confidentiality laws that are generally applicable in society. Such researchers in the United States usually include those working for private companies or consortia. In an international context, research may be done using human subjects data gathered in nations where different legal obligations apply to protecting privacy and confidentiality and where the social, legal, and institutional contexts are quite different. As a general rule, U.S. researchers are obligated to adhere to the laws of countries in which the data are collected, as well as those of the United States.

The notion of confidentiality is not highly developed in U.S. law.[6] Privacy, in contrast with confidentiality, is partly protected both by tort law concepts and by specific legislative protections. Appendix A provides a detailed review of U.S. privacy law as it applies to issues of privacy, confidentiality, and harm in relation to human research subjects. The appendix summarizes when information is sufficiently identifiable so that privacy rules apply, when the collection of personal information does and does not fall under privacy regulations, and what legal rules govern the disclosure of personal information. As Appendix A shows, the legal status of confidentiality is less well defined than that of privacy.

[6] For some references to federal laws on confidentiality, see http://www.hhs.gov/ohrp/nhrpac/documents/nhrpac15.pdf [accessed January 2007].

U.S. law provides little guidance for researchers and the holders of datasets except for the rules imposed by universities and research sponsors regarding methods by which researchers may gain access to enhanced and detailed social data linked to location data in ways that both meet their research needs and protect the rights of human subjects. Neither does current U.S. privacy law significantly proscribe or limit methods that might be used for data access or data mining. The most detailed provisions are in the Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA).[7] This situation makes it possible for researchers and organizations that are unconstrained by the rules and requirements of universities and federal agencies to legally access vast depositories of commercial data on the everyday happenings, transactions, and movements of individuals and to use increasingly sophisticated data mining technology to conduct detailed analyses on millions of individuals and households without their knowledge or explicit consent.

[7] E-Government Act of 2002, Pub. L. 107-347, Dec. 17, 2002, 116 Stat. 2899, 44 U.S.C. § 3501 note § 502(4).

These privacy issues are not directly relevant to the conduct of social science research under conventional guarantees of confidentiality. However, they may become linked in the future, either because researchers may begin to use these data sources or because privacy concerns raised by uses of large commercial databases may lead to pressures to constrain research uses of linked social and spatial data. Solutions to the tradeoffs among data quality, access, and confidentiality must be considered in the context of the legal vagueness surrounding the confidentiality concept and the effects it may have on individuals' willingness to provide information to researchers under promises of confidentiality.

ETHICAL ISSUES

The topics of study, the populations being examined, and the method or methods involved in an inquiry interact to shape ethical considerations in the conduct of all research involving human participants (Levine and Skedsvold, 2006). Linked social-spatial research raises many of the typical issues of sound science and sound ethics, for which the basic ethical principles have been well articulated in the codes of ethics of scientific societies,[8] in research and writing on research ethics, in the evolution of the Code of Federal Regulations for the Protection of Human Subjects (45 CFR 46) and the literature that surrounds it, and in a succession of important reports and recommendations typically led by the research community (see Appendix B).
Much useful ethical guidance can also be extrapolated from past National Research Council reports (e.g., 1985, 1993, 2004b, 2005a).

[8] For example, see those of the American Statistical Association, at http://www.amstat.org/profession/index.cfm?fuseaction=ethicalstatistics [accessed January 2007].

In addition, as noted above, linked social and spatial data raise particularly challenging ethical issues because the very spatial precision of these data is their virtue, and, thus, aggregating or masking spatial identifiers to protect confidentiality can greatly reduce their scientific value and utility. Therefore, if precise spatial coordinates are to be used as research data, primary researchers and data stewards need to address how ethically to store, use, analyze, and share those data. Appendix B provides a detailed discussion of ethical issues at each stage of the research process, from primary data collection to secondary use.
The process of linking micro-level social and spatial data is usually considered to fall in the domain of human subjects research because it involves interaction or intervention with individuals or the use of identifiable private information.[9] Typically, such research is held to ethical guidelines and review processes associated with IRBs at colleges, universities, and other research institutions. This is the case whether or not the research is funded by one of the federal agencies that are signatories to the federal regulations on human subjects research.[10] Thus, generic legal and ethical principles for data collection and access apply. Also, secondary analysts of data, including those engaged in data linking, have the ethical obligation to honor agreements made to research participants as part of the initial data collection. However, the practices of IRBs for reviewing proposed secondary data analyses vary across institutions, which may require review of proposals for secondary data analysis or defer authority to third-party data providers that have protocols for approving use.[11] Data stewardship (the practices of providing or restricting the access of secondary analysts to original or transformed data) entails similar ethical obligations.

Planning for ethically responsible research is a matter of professional obligation for researchers and other professionals, covered in part by IRBs under the framework of a national regulatory regime. This regime provides for a distributed human subjects protection system that allows each IRB to tailor its work with considerable discretion to meet the needs of researchers and the research communities in which the work is taking place. The linking of social and spatial data raises new and difficult issues for researchers and IRBs to consider: because the uses of linked data are to some extent unpredictable, decisions about data access are rarely guided by an explicit set of instructions.

[9] These are the elements of human subject research as defined in the Code of Federal Regulations at 45 CFR 46.102(f).
[10] Academic and research institutions typically have in place federally approved Federalwide Assurances that extend the Federal Regulations for the Protection of Human Subjects to all human subjects research undertaken at the institution, not just to research funded by the 17 agencies that have adopted the Federal Regulations.
[11] IRBs even vary in whether research using public-use data files is reviewed, although increasingly the use of such data, if not linked to or supplemented by other data, is viewed as exempt once vetted for public use. See http://www.hhs.gov/ohrp/nhrpac/documents/dataltr.pdf for general guidelines and http://info.gradsch.wisc.edu/research/compliance/humansubjects/7.existingdata.htm for a specific example. [Web pages accessed January 2007].

The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research (1979) concisely conveyed the essential ethical principles for research:
Beneficence: maximizing good outcomes for society, science, and individual research participants while avoiding or minimizing unnecessary risk or harm;
respect for persons: protecting the autonomy of research participants through voluntary, informed consent and by assuring privacy and confidentiality; and
justice: ensuring reasonable, carefully considered procedures and a fair distribution of costs and benefits.

These three principles together provide a framework for both facilitating social and spatial research and doing so in an ethically responsible and sensitive way.

For primary researchers, secondary analysts, and data stewards, the major ethical issues concern the sensitivity of the topics of research; maintaining confidentiality and obtaining informed consent; considerations of benefits to society and to research participants; and risk and risk reduction, particularly the obligation to reduce disclosure risk. Linking spatial data to social data does not alter ethical obligations, but it may pose additional challenges.

Data collectors, stewards, and analysts have a high burden with regard to linked social and spatially precise data to ensure that the probability of disclosure approaches zero and that the data are very securely protected. They also need to avoid inadvertent disclosure through the ways findings are presented, discussed, or displayed. To meet this burden, they need to consider all available technical methods and data management strategies. We examine these methods and strategies in Chapter 3 in relation to their ability to meet the serious challenges of data protection for linked social-spatial data.

STATISTICAL ISSUES

All policies about access to linked social-spatial data implicitly involve tradeoffs between the costs and benefits of providing some form of access to the data, or modified versions of the data, by secondary data users.
The risk of disclosures of sensitive information constitutes the primary cost, and the knowledge generated from the data represents the primary benefit. At one extreme, data can be released as is, with identifiers such as precise geocodes intact. This policy offers maximum benefit at a maximum cost (i.e., minimum confidentiality protection). At the other extreme, data can be completely restricted for secondary use, a policy that provides minimal benefit and minimal cost (i.e., maximum confidentiality protection). Most current methods of data release, such as altering or restricting access to the original data, have costs and benefits between these two extremes.

Well-informed data access policies reflect wise decisions about the tradeoffs, such as whether the data usefulness is high enough for the disclosure risks associated with a particular policy. However, most data stewards do not directly measure the inputs to these cost-benefit analyses. This is not negligence on the part of data stewards; indeed, the broader research community has not yet developed the tools needed to make such assessments. Yet, data stewards could quantify some aspects of the cost-benefit tradeoff, namely, disclosure risks and data quality. Evaluating these measures can enable data stewards to choose policies with better risk-quality profiles (e.g., between two policies with the same disclosure risk, to select the one with higher data quality). There have been a few efforts to formalize the task of assessing data quality and disclosure risk together for the purpose of evaluating the tradeoffs (Duncan et al., 2001; Gomatam et al., 2005). This section briefly reviews statistical approaches to gauging the risk-quality tradeoffs both generally and for spatial data. (For further discussion about the cost-benefit approach to data dissemination, see Abowd and Lane, 2003.)

Most data stewards seeking to protect data confidentiality are concerned with two types of disclosures. One is identity disclosure, which occurs when a user of the data correctly identifies individual records using the released data. The other is attribute disclosure, which occurs when a data user learns the values of sensitive variables for individual records in the dataset. Attribute disclosures typically require identification disclosures (Duncan and Lambert, 1986a).
Other types of disclosures include perceived identity disclosure, which occurs when a data user incorrectly identifies individual records in the database, and inferential disclosure, which occurs when a data user can accurately predict sensitive attributes in the dataset using the released data that may have been altered (for example, by adding statistical noise) to prevent disclosure. (For introductions to disclosure risks, see Federal Committee on Statistical Methodology, 1994; Duncan and Lambert, 1986a, 1986b; Lambert, 1993; Willenborg and de Waal, 1996, 2001.)

Efforts to quantify identity disclosure risk generally fall in two broad categories: (1) estimating the number of records in the released data that are unique records in the population and can therefore be at high risk of identification, and (2) estimating the probabilities that users of the released data can determine the identities of the records in the released data by using the information in those data. Although these approaches are appropriate for many varieties of data, in cases where there are exact spatial identifiers, virtually every individual is unique, so the disclosure risk is very great.

Quantifying Disclosure Risks

Methods of estimating the risk of identification disclosure involve estimating population uniqueness and probabilities of identification. Estimates of attribute disclosures involve measuring the difference between estimates of sensitive attributes made by secondary data users and the actual values. This section describes methods that are generally applicable for geographic identification at scales larger than that characterized by exact latitude and longitude (such as census blocks or tracts, minor civil divisions, or counties). In many cases, exact latitude and longitude uniquely identifies respondents, although there are exceptions (e.g., when spatial identifiers locate a residence in a large, high-rise apartment building).

Population uniqueness is relevant for identity disclosures because unique records are at higher risk of identification than non-unique records. For any unperturbed, released record that is unique in the population, a secondary user who knows that target record's correct identifying variables can identify it with probability 1.0. For any unperturbed released population non-unique target record, secondary users who know its correct identifying variables can identify that record only with probability 1/K, where K is the number of records in the population whose characteristics match the target record. For purposes of disclosure risk assessment, population uniqueness is not a fixed quality; it depends on what released information is known by the secondary data user. For example, most individuals are uniquely identified in populations by the combination of their age, sex, and street address. When a data user knows these identifying variables and they are released on a file, most records are population unique records. However, when the secondary user knows only age, sex, and state of residence, most records will not be unique records. Hence, all methods based on population uniqueness depend on assumptions about what information is available to secondary data users. The number of population unique records in a sample typically is not known and must be estimated by the data disseminator.
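As a rough illustration of the 1/K rule, the following Python sketch counts how many population records share each released record's identifying variables and returns 1/K as the identification probability. The function name, the toy records, and the choice of identifying variables are hypothetical and serve only to show how uniqueness changes with what the secondary user is assumed to know; it is not a procedure taken from the studies cited in this chapter.

```python
from collections import Counter

def identification_probabilities(population, released, key_vars):
    """Return 1/K for each released record, where K is the number of
    population records that share its values on key_vars.
    Assumes every released record also appears in the population."""
    def key(record):
        return tuple(record[v] for v in key_vars)

    counts = Counter(key(record) for record in population)
    return [1.0 / counts[key(record)] for record in released]

# Hypothetical toy population, for illustration only.
population = [
    {"age": 34, "sex": "F", "state": "MI", "street": "12 Elm St"},
    {"age": 34, "sex": "F", "state": "MI", "street": "90 Oak Ave"},
    {"age": 34, "sex": "F", "state": "MI", "street": "7 Pine Rd"},
    {"age": 61, "sex": "M", "state": "MI", "street": "12 Elm St"},
]
released = population[:2]

# User knows only age, sex, and state: three population records match each target.
coarse = identification_probabilities(population, released, ["age", "sex", "state"])

# User knows age, sex, and street address: each target is population unique.
fine = identification_probabilities(population, released, ["age", "sex", "street"])
```

The sketch assumes the full population can be counted directly. A data disseminator who holds only a sample does not observe K, so K and the number of population unique records have to be estimated.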
Methods for making such estimates have been reported by several researchers (see, e.g., Bethlehem et al., 1990; Greenberg and Zayatz, 1992; Skinner, 1992; Skinner et al., 1994; Chen and Keller-McNulty, 1998; Fienberg and Makov, 1998; Samuels, 1998; Pannekoek, 1999; Dale and Elliot, 2001.) These methods involve sophisticated statistical modeling. Probabilities of identification are readily interpreted as measures of identity disclosure risk: the larger the probability, the greater the risk. Data disseminators determine their own thresholds for probabilities considered unsafe. There are two main approaches to estimating these probabilities. The first is to match records in the file being considered for release with records from external databases that a secondary user plausibly would use to attempt an identification (Paass, 1988; Blien et al., 1992; Federal Com- mittee on Statistical Methodology, 1994; Yancey et al., 2002). The match- ing is done using record linkage software, which (1) searches for the records in the external data file that look as similar as possible to the records in the file being considered for release; (2) computes the probabilities that these matching records correspond to records in the file being considered for
release, based on the degrees of similarity between the matches and their targets; and (3) declares the matches with probabilities exceeding a specified threshold as identifications.

The second approach is to match records in a file being considered for release with the records from the original, unperturbed data file (Spruill, 1982; Duncan and Lambert, 1986a, 1986b; Lambert, 1993; Fienberg et al., 1997; Skinner and Elliot, 2002; Reiter, 2005a). This approach can be easier and less expensive to implement than obtaining external data files and record linkage software. It allows a data disseminator to evaluate the identification risks when a secondary user knows the identities of some or all of the sampled records but does not know the location of those records in the file being considered for release. This approach can be modified to work under the assumption that the secondary user does not know the identities of the sampled records.

Many data disseminators focus on identity disclosures and pay less attention to attribute disclosures. In part, this is because attribute disclosures are usually preceded by identity disclosures. For example, when original values of attributes are released, a secondary data user who correctly identifies a record learns the attribute values. Many data disseminators therefore fold the quantification of attribute disclosure risks into the measurement of identification disclosure risks. When attribute values are altered before release, attribute risks change to inferential disclosure risks. There are no standard approaches to quantifying inferential disclosure risks.
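Returning to the second matching approach described above, the following sketch links each released record back to the original file and reports the share linked correctly. Nearest-neighbor matching on Euclidean distance is a simplification swapped in here for full record linkage software; the function name, the `threshold` parameter, and the toy data are assumptions made for illustration and are not taken from the cited methods.

```python
import math

def reidentification_rate(original, released, threshold=0.5):
    """Fraction of released records linked back to their true source record.

    original[i] and released[i] are numeric tuples describing the same
    individual (the disseminator knows this pairing; the intruder does not).
    """
    identified = 0
    for i, rel in enumerate(released):
        dists = [math.dist(rel, orig) for orig in original]
        best = min(dists)
        # All original records tied for closest are equally plausible matches.
        candidates = [j for j, d in enumerate(dists) if math.isclose(d, best)]
        match_prob = 1.0 / len(candidates)
        # Declare an identification when the true record is among the
        # candidates and the match probability reaches the threshold.
        if i in candidates and match_prob >= threshold:
            identified += 1
    return identified / len(released)

# Hypothetical (age, income in thousands) records.
original = [(30.0, 50.0), (31.0, 52.0), (60.0, 20.0)]
lightly_perturbed = [(30.4, 50.5), (30.6, 51.5), (59.0, 21.0)]
swapped = [(31.0, 52.0), (30.0, 50.0), (59.0, 21.0)]  # first two records swapped

light_rate = reidentification_rate(original, lightly_perturbed)
swap_rate = reidentification_rate(original, swapped)
print(light_rate, swap_rate)
```

With the light perturbation every released record is still closest to its own source record, whereas swapping the first two records misdirects both of them.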
Lambert (1993) provides a useful framework that involves specifying a secondary user’s estimator(s) of the unknown attribute values (such as an average of plausible matches’ released attribute values) and a loss function for incorrect guesses, such as the Euclidean or statistical distance between the estimate and the true value of the attribute. A data disseminator can then evaluate whether the overall value of the loss function (the distance between the secondary user’s proposed estimates and the actual values) is large enough to be deemed safe. (For examples of the assessment of attribute and inferential disclosure risks, see Gomatam et al., 2005; Reiter, 2005d.)

The loss-function approach extends to quantifying overall potential harm in a data release (Lambert, 1993). Specifically, data disseminators can specify cost functions for all types of disclosures, including perceived identification and inferential disclosures, and combine them with the appropriate probabilities of each to determine the expected cost of releasing the data. When coupled with measurements of data quality, this approach provides a decision-theoretic framework for selecting disclosure limitation policies. Lambert’s total harm model is primarily theoretical and has not been implemented in practice.
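The loss-function idea can be made concrete with a minimal sketch. It assumes a single numeric attribute, an intruder who averages the released values of plausible matches, and absolute distance as the loss; the function names, the `min_safe_distance` cutoff, and all numbers are hypothetical and do not reproduce Lambert's model.

```python
def intruder_estimate(plausible_values):
    """Intruder's guess: the average of the released attribute values
    among the records the intruder considers plausible matches."""
    return sum(plausible_values) / len(plausible_values)

def inferential_loss(estimate, true_value):
    """Distance between the intruder's guess and the true value
    (one-dimensional Euclidean distance)."""
    return abs(estimate - true_value)

def expected_cost(disclosures):
    """Sum of probability * cost over the disclosure types considered."""
    return sum(prob * cost for prob, cost in disclosures)

# Hypothetical released incomes of three plausible matches for one target.
estimate = intruder_estimate([40000.0, 50000.0, 60000.0])
loss = inferential_loss(estimate, true_value=80000.0)

# Assumed disseminator rule: the guess must miss by at least this much.
min_safe_distance = 10000.0
deemed_safe = loss >= min_safe_distance

# Hypothetical (probability, cost) pairs for two disclosure types.
total = expected_cost([(0.02, 1000.0), (0.10, 200.0)])
print(estimate, loss, deemed_safe, total)
```

A larger loss means the intruder's inference is worse, which is why the disseminator looks for the loss to exceed, rather than fall below, a chosen cutoff.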
Quantifying Data Quality

Compared with the effort that has gone into developing measures of disclosure risks, there has been less work on developing measures of data quality. Existing quality measures are of two types: (1) comparisons of broad differences between the original and released data, and (2) comparisons of differences in specific models between the original and released data. The former measures suffer from not being tied to how users analyze the data; the latter measures suffer from capturing only certain dimensions of data quality.

Broad difference measures essentially quantify differences between the distributions of the data values on the original and released files. As the differences between the distributions grow, the overall quality of the released data drops. Computing differences in distributions is a nontrivial statistical problem, particularly when there are many variables and records with unknown distributional shapes. Most approaches are therefore ad hoc. For example, some researchers suggest computing a weighted average of the differences in the means, variances, and correlations in the original and released data, where the weights indicate the relative importance of keeping each of those quantities similar in the released and observed files (Domingo-Ferrer and Torra, 2001; Yancey et al., 2002). Such ad hoc methods are only tangentially tied to the statistical analyses being done by data users. For example, a user interested in analyzing incomes may not care that means are preserved when the tails of the distribution are distorted, because the researcher’s question concerns only the extremely rich. In environmental research, the main concern may be with the few people with the greatest exposure to an environmental hazard. These measures also have limited interpretability and little theoretical basis.

Comparison of specific models is often done informally.
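As an aside on the broad difference measures described above, the sketch below computes one possible weighted average of differences in means, variances, and pairwise correlations. The function names, the equal default weights, the use of unscaled absolute differences, and the toy data are all assumptions made for illustration; the cited papers define their own measures.

```python
from itertools import combinations
from statistics import mean, variance

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def broad_difference(original, released, weights=(1.0, 1.0, 1.0)):
    """Weighted average of mean absolute differences in column means,
    column variances, and pairwise correlations.

    original, released: dicts mapping the same column names to lists
    of numbers (at least two non-constant columns are assumed).
    """
    cols = sorted(original)
    d_mean = mean(abs(mean(original[c]) - mean(released[c])) for c in cols)
    d_var = mean(abs(variance(original[c]) - variance(released[c])) for c in cols)
    d_corr = mean(
        abs(pearson(original[a], original[b]) - pearson(released[a], released[b]))
        for a, b in combinations(cols, 2)
    )
    w_mean, w_var, w_corr = weights
    return (w_mean * d_mean + w_var * d_var + w_corr * d_corr) / sum(weights)

# Hypothetical data: income is perturbed, age is released unchanged.
original = {"income": [10.0, 20.0, 30.0, 40.0], "age": [25.0, 35.0, 45.0, 55.0]}
released = {"income": [12.0, 18.0, 33.0, 37.0], "age": [25.0, 35.0, 45.0, 55.0]}

score_same = broad_difference(original, original)
score_perturbed = broad_difference(original, released)
print(score_same, score_perturbed)
```

In this toy case the perturbation leaves the income mean unchanged, so the score is driven entirely by the variance and correlation terms, which illustrates how a single summary can hide which features of the distribution were distorted.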
For example, data disseminators look at the similarity of point estimates and standard errors of regression coefficients after fitting the same regression on the original data and on the data proposed for release. If the results are considered close (for example, the confidence intervals for the coefficients obtained from the models largely overlap), the released data have high quality for that particular analysis. Such measures are closely tied to how the data are used, but they only reflect certain dimensions of the overall quality of the released data. It is prudent to examine models that represent the wide range of expected uses of the released data, even though unexpected uses may arise to which the conclusions from such models do not apply.

A significant issue for assessing data quality with linked spatial-social data is the need at times to simultaneously preserve several characteristics or spatial relationships. Consider, for example, a collection of observations represented as points that define nodes in a transportation network, where a
node is defined as a street intersection. Although it is possible to create a synthetic or transformed network that has the same mean (global) link length as the original one, it is difficult to maintain, in addition, actual variation in the local topology of links (the number of links that connect at a node), as well as the geographical variability in link lengths that might be present in the original data. Consequently, some types of analyses done with transformed or synthetic data may yield results similar to those that would result with the original data, while others may create substantial risks of inferential error. The results may include both Type I errors, in which a null hypothesis is incorrectly rejected, and Type II errors, in which a null hypothesis is incorrectly accepted. Data users may be tempted to treat transformed data as equal in quality to the original data unless they are informed otherwise.

Effects of Spatial Identifiers

The presence of precise spatial identifiers can have large effects on the risk-quality tradeoffs. Releasing these identifiers can raise the risks of identification to extremely high levels. To reduce these risks, data stewards may perturb the spatial identifiers if they plan to release some version of the original data for open access, but doing this can very seriously degrade the quality of the data for analyses that use the spatial information, and particularly for analyses that depend on patterns of spatial covariance, such as distances or topological relationships between research participants and locations important to the analysis (Armstrong et al., 1999). For example, some analyses may be impossible to do with coarsened identifiers, and others may produce misleading results due to altered relationships between the attributes and spatial variables.
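A small sketch can show how coarsening locations distorts the distances on which spatial analyses depend. Snapping points to grid-cell centers is only one hypothetical form of coarsening, chosen here because it is deterministic; the function names, cell sizes, and coordinates are illustrative assumptions.

```python
import math
from itertools import combinations

def coarsen(point, cell_size):
    """Replace an (x, y) location with the center of its grid cell."""
    x, y = point
    return (
        (math.floor(x / cell_size) + 0.5) * cell_size,
        (math.floor(y / cell_size) + 0.5) * cell_size,
    )

def mean_distance_error(points, cell_size):
    """Average absolute change in pairwise distance after coarsening."""
    coarse = [coarsen(p, cell_size) for p in points]
    errors = [
        abs(math.dist(points[i], points[j]) - math.dist(coarse[i], coarse[j]))
        for i, j in combinations(range(len(points)), 2)
    ]
    return sum(errors) / len(errors)

# Hypothetical participant locations in arbitrary map units.
points = [(0.2, 0.3), (0.9, 0.8), (3.1, 4.2)]

error_fine_grid = mean_distance_error(points, cell_size=0.01)
error_coarse_grid = mean_distance_error(points, cell_size=1.0)
print(error_fine_grid, error_coarse_grid)
```

With the coarse grid the first two participants fall in the same cell, so the distance between them collapses to zero, which is the kind of altered spatial relationship the text warns about.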
Furthermore, if spatial identifiers are used as matching variables for linking datasets, altering them can lead to matching errors, which, when numerous, may seriously degrade analyses.

Perturbing the spatial information may not reduce disclosure risks sufficiently to maintain confidentiality, especially when the released data include other information that is known by a secondary data user. For example, there may be only one person of a certain sex, age, race, and marital status in a particular county, and this information may be readily available for the county, so that coarsening geographies to the county level would provide no greater protection for that person than releasing the exact address.

Identity disclosure risks are complicated to measure when the data are set up to be meaningfully linked to other datasets for research purposes. Altering spatial identifiers will reduce disclosure risks in the set of data originally collected, but the risks may increase when this dataset is linked to datasets with other attributes. For example, unaltered attributes in File A may be insufficient to identify individuals if the spatial identifiers are altered, but when File A is linked to File B, the combined attributes and altered spatial identifiers may uniquely identify many individuals. The complication arises because the steward of the collected data may not know which attributes are in the files to be linked to those data, so that it is difficult to evaluate the degree of disclosure risk.

Even when safeguards have been established for data sharing, publication of research papers using linked social-spatial data may pose other problems such as those associated with the visualization of data. VanWey et al. (2005) present a means for evaluating the risks associated with displaying data through maps that may be presented orally or in writing to communicate research results. The method involves identifying data with a spatial area of a radius sufficient to include, on average, enough research participants to reduce the identity disclosure risk to a target value. Methods for limiting disclosure risk from maps are only beginning to be developed. No guidelines currently exist for visualizing linked social-spatial data, in published papers or even presentations; but future standards for training and publication contexts should be based on systematic assessment of such risks.

In principle, policies for access to data that include spatial identifiers can be improved by evaluating the tradeoff between disclosure risks and data quality. In practice, though, such an evaluation will be challenging for many data stewards and for IRBs that are considering proposals to use linked data. Existing approaches to quantifying risk and quality are technically demanding and may be beyond the capabilities of some data stewards. Low-cost, readily available methods for estimating risks and quality do not yet exist, whether or not the data include spatial identifiers. And existing techniques do not account for the additional risks associated with linked datasets.
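The File A and File B complication described above can be illustrated with a toy sketch. It counts records that are unique within a small file, which only approximates population uniqueness; the function name, the record identifiers used for linking, and all attribute values are hypothetical.

```python
from collections import Counter

def count_unique(records, key_vars):
    """Number of records whose combination of key values appears only once."""
    keys = [tuple(r[v] for v in key_vars) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1)

# File A: altered (coarsened) spatial identifier plus one attribute.
file_a = {
    1: {"region": "R1", "sex": "F"},
    2: {"region": "R1", "sex": "F"},
    3: {"region": "R2", "sex": "M"},
    4: {"region": "R2", "sex": "M"},
}
# File B: a further attribute for the same individuals, held elsewhere.
file_b = {
    1: {"occupation": "nurse"},
    2: {"occupation": "teacher"},
    3: {"occupation": "farmer"},
    4: {"occupation": "farmer"},
}

# Link the two files on the shared record identifier.
linked = [{**file_a[i], **file_b[i]} for i in file_a]

unique_before = count_unique(list(file_a.values()), ["region", "sex"])
unique_after = count_unique(linked, ["region", "sex", "occupation"])
print(unique_before, unique_after)  # 0 2
```

No record in File A is unique on its own, yet two become unique after linking, and the steward of File A could not have computed this without knowing the contents of File B.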
This challenge would be significantly lessened, and data dissemination practice improved, if data stewards had access to reliable, valid, off-the-shelf software and protocols for assessing the tradeoffs between disclosure risk and data quality and for undertaking broad cost-benefit analyses.

The next chapter addresses the issue of evaluating and addressing the tradeoffs involving disclosure risk and data quality.