3 Meeting the Challenges

Although the challenges described in Chapter 2 are substantial, a number of possible approaches exist for preserving respondent confidentiality when links to geospatial information could engender breaches. They fall in two main categories: institutional approaches, which involve restricting access to sensitive data; and technical and statistical approaches, which involve transforming the data in various ways to enhance the protection of confidentiality. This chapter describes these two broad categories of approaches.

INSTITUTIONAL APPROACHES

Institutions that have responsibility for preserving the confidentiality of respondents employ a number of strategies. These strategies are very important for protecting respondent confidentiality in survey data under all circumstances, and especially when there is a high risk of identification due to the existence of precise geospatial attributes. At their heart, many of these strategies protect confidentiality by restricting access to the data, either by limiting access to those data users who explicitly guarantee not to reveal respondent identities or attributes or by requiring that data users work in a restricted environment so they cannot remove information that might reveal identities or attributes. Restricting data access is a strategy that can be used with original data or with data that have been deidentified, buffered, or synthesized.

In addition to restricting access, institutional approaches require that researchers (students and faculty or staff at universities or other institutions) be educated in appropriate and ethical use of data. Many data stewards provide guidelines about how data should be used, what the risks of disclosure are, and why researchers should be attentive to disclosure risk and its limitation.¹

User education at a more fundamental level, in the general training of active and future researchers, should be based on sound theoretical principles and empirical research. Such studies, however, are few: there are only a few examples of good materials or curricula for ensuring education in proper data use that minimizes the risk of confidentiality breaches. For instance, the disclosure limitation program project, "Human Subject Protection and Disclosure Risk Analysis," at the Inter-university Consortium for Political and Social Research (ICPSR) has resources available for teaching about its findings and the best practices it has developed (see http://www.icpsr.umich.edu/HSP [April 2006]). ICPSR is also working on a set of education and certification materials on handling restricted data for its own staff, which will probably evolve into formal training materials. The Carolina Population Center also has a set of practices for training students who work on its National Longitudinal Study of Adolescent Health (Add Health: see http://www.cpc.unc.edu/projects/addhealth [April 2006]) and other projects, and for teaching its demography trainees about ethics (see http://www.cpc.unc.edu/training/meth.html [April 2006]). However, few other training programs have equivalent practices.

Fundamental to most institutional approaches is the idea that the greater the risk of disclosure or harm, the more restricted access should be. For every tier of disclosure risk, there is an equivalent tier of access restriction. The tiers of risk are partly a function of the ability of the data distributor to make use of identity masking techniques to limit the risk of disclosure.

On a low tier of risk are data with few identifiable variables, such as the public-use microdata sets from the U.S. Census Bureau and many small sample surveys. Because there is little or no geographic detail in these data, when they are anonymized there is very little risk of disclosure, although if a secondary user knows that an individual is a respondent in a survey (e.g., because it is a family member), identification is much easier. Dissemination of these data over the Web has not been problematic from the standpoint of confidentiality breaches. On the highest tier of risk are absolutely identifiable data, such as surveys of business establishments and data that include the exact locations of respondents' homes or workplaces.² The use of these data must be tightly restricted to preserve confidentiality. Methods and procedures for restricted data access are well described in a National Research Council report (2005a:28-34).

¹ For example, see the Inter-university Consortium for Political and Social Research (ICPSR), 2005; also http://www.icpsr.umich.edu/access/deposit/index.html [accessed April 2006].
² Business establishments are generally considered to be among the most easily identifiable because data about them are frequently unique: in any given market, there are usually only a small number of business establishments engaged in any given area of activity, and each has unique characteristics such as relative size or specialization.

PUTTING PEOPLE ON THE MAP

The number of tiers of access can vary from one study to another and from one data archive to another. A simple model might have four levels of access: full public access, limited licensing, strong licensing, and data enclaves.³

Full Public Access

Full access is provided through Web-based public-use files that are available to the general public or to a limited public (for example, those who subscribe to a data service, such as ICPSR). Access is available to all users who accept a data use agreement through a Web-based form that requires them to avoid disclosure. This tier of access is typically reserved for data files with little risk of disclosure and harm, such as those that include very small subsamples of a larger sample, that represent a tiny fraction of the population in a geographic area, that contain little or no geographic information, or that do not include any sensitive information.

We are unaware of any cases for which this form of public access is allowed to files that combine social data with highly specific locational data, such as addresses or exact latitude and longitude. Public-use, full-access datasets may include some locational data, such as neighborhood or census tract, if it is believed that such units are too broad to allow identification of particular individuals. However, when datasets are linked, it is often possible to identify individuals with high probability even when the linked data provide only neighborhood-level information. Because of this probability, the U.S. Census Bureau uses data swapping and other techniques in its full-access public-use data files (see http://factfinder.census.gov/jsp/saff/SAFFInfo.jsp?_pageId=su5_confidentiality).

Full public access is extremely popular with data users, for whom it provides a very high level of flexibility and opportunity.
Their main complaint is that the datasets made available by this mechanism often include fewer cases and variables than they would like, so that certain types of analysis are impossible. Although data stewards appear generally satisfied with this form of data distribution, they have in recent years begun to express concern about whether data can be shared this way without risk of disclosure, and so have increasingly restricted the number of data collections available in this format. For example, the National Center for Health Statistics (NCHS) linked the National Health Interview Survey to the National Death Index and made the first two releases of the linked file available publicly. The third release, which follows both respondents from the earlier survey years and adds new survey years, is no longer available publicly; it is available for restricted use in the NCHS Research Data Center.

³ For other models, see the practices of the Carolina Population Center at the University of North Carolina at Chapel Hill for use of data from the National Longitudinal Study of Adolescent Health (http://www.cpc.unc.edu/projects/addhealth/data [accessed April 2006]) and the Nang Rong study of social and environmental change, among others. As part of ICPSR's Data Sharing for Demographic Research project (http://www.icpsr.umich.edu/dsdr [accessed April 2006]), researchers there have published a detailed review of contract terms used in restricted use agreements, with recommendations about how to construct such agreements. Those documents are available at http://www.icpsr.umich.edu/dsdr/rduc [accessed April 2006].

Limited Licensing

Limited licensing provides a second tier of access for data that present some risk of disclosure or harm, but for which the risk is limited because there is little geographic precision: the geographic information has been systematically masked (Armstrong et al., 1999) or sensitive variables have been deleted or heavily masked. Limited licensing allows data to be distributed to responsible scientific data users (generally those affiliated with known academic institutions) under terms of a license that requires the data user and his or her employer to certify that the data will be used responsibly. Data stewards release data in this fashion when they believe that there is little risk of identification and that responsible professionals are able to understand the risk and prevent it in their research activities.

For example, the Demographic and Health Surveys (DHS) (see http://www.measuredhs.com/ [April 2006]) distributes its large representative survey data collection under a limited licensing model. It makes geocoded data available under a more restricted type of limited licensing arrangement. The DHS collects the geographic coordinates of its survey clusters, or enumeration areas, but the boundaries or areas of those regions are not made available.
These geocodes can be attached to individual or household records in the survey data, for which identifying information has been removed. When particularly sensitive information has been collected in the survey (e.g., HIV testing), the current policy is to introduce error into the data, destroy the original data, and release only the data that have been transformed.

Data users consider it a burden to obtain limited licensing agreements, but both data stewards and users generally perceive them as successful because they combine an obligation for education and certification with relatively flexible access for datasets that present little risk of disclosure or harm. Nevertheless, the limitations on the utility of data that may be altered (see Armstrong et al., 1999) for release in this manner are still largely unknown, in part because careful tests with the original data cannot be conducted after the data have been transformed.

Strong Licensing

Strong licensing is a third tier of data access used for data that present a substantial risk of disclosure and for which the data steward decides that this risk cannot be adequately controlled within the framework of responsible research practice. Datasets are typically placed at this tier if they present a substantial risk of disclosure but are not fully identified or if they include attribute data that are highly sensitive if disclosed, such as responses about sexual practices, drug use, or criminal activity. Most often, these data are shared through a license that requires special handling: for example, they may be kept on a single computer not connected to a network, with specific technical requirements.

Virtually all strong licenses require that the data user obtain institutional review board (IRB) approval at his or her home institution. Many of these strong licenses also include physical monitoring, such as unannounced visits from the data steward's staff to make sure that conditions are followed. These licenses may also require very strong institutional assurances from the researcher's employer, or may provide for sanctions if the conditions are not followed. For example, the license to use data from the Health and Retirement Survey of the National Institutes of Health (NIH) includes language that says the data user may be prevented from obtaining NIH grants in the future if he or she does not adhere to the restrictions. Some data stewards also require the payment of a fee, usually in the range of $500 to $1,000, to support the expenses associated with document preparation and review and the cost of site visits.

Although some researchers and universities are wary of these agreements, in recent years they have been seen as successful by most data users.
Data distributors, however, continue to be fearful that their rules about data access are not being followed sufficiently closely or that sensitive data are under inadequate control.

Data Enclaves

For data that present the greatest risk of disclosure or harm, or those that are collected under tight legal restrictions, such as geospatial data that are absolutely identifiable, access is usually limited to use within a research enclave. For example, this will be the case when the fully geocoded Nang Rong data are made available at the data enclave at ICPSR. The most visible example of this practice in the United States today is the network of nine Research Data Centers (RDCs) created by the Bureau of the Census: Washington, DC; Durham, NC; New York City and Ithaca, NY; Boston, MA; Ann Arbor, MI; Chicago; and Los Angeles and Berkeley, CA.⁴ The Bureau makes its most restricted data, including the full count of the Census of Population and the Census of Business Enterprises, available only in these centers.

⁴ See http://webserver01.ces.census.gov/index.php/ces/1.00/researchlocations [accessed April 2006].

The principle behind data enclaves is that researchers are not able to leave the premises with any information that might identify an individual. In practice, a trained professional reviews all the materials that each researcher prints. For data analyses, researchers are typically allowed to remove only coefficients from regression-type analyses and tabulations that have a large cell size (because small cell sizes may lead to identification). Although many data stewards limit users to working within a single, supervised room described as a data center or enclave, alternatives also exist. For example, in addition to its data enclaves, NCHS maintains a system that allows data users to submit data analytic programs from a remote location, have them run against the data in the enclave, and then receive the results by e-mail. This procedure is sometimes performed with an automated disclosure review and sometimes with a manual, expert review.

There are considerable barriers of inconvenience and cost to use of the data centers, which means that they are not used as much as they might or should be. Most centers hold data from only a single data provider (for example, the census, NCHS data, or Add Health), and the division of work leads to inefficiencies that might be overcome if a single center held data from more than one data provider. For the use of its data, the Census Bureau centers require a lengthy approval process that can take a full year from the time a researcher is ready to begin work, as well as a "benefit statement" on the part of the researcher that demonstrates the work undertaken in the RDC will not only contribute to science, but will also deliver a benefit to the Census Bureau, something required by the Bureau's statutory authority.
Although other data centers and enclaves do not require such lengthy approval processes, many require a substantial financial payment from the researcher (often calculated as a per day or per month cost of research center use), in addition to travel and lodging costs. Personal scheduling to enable a researcher to travel to a remote site competes with teaching, institutional service, and personal obligations and can be a serious barrier to use of data in enclaves. The challenge of scheduling becomes even more severe in the context of the large, interdisciplinary teams often involved in the analysis of spatial social science data and the need to use specialized technology and software. In addition to the cost passed on to users, the data stewards who maintain data enclaves bear considerable cost and space requirements.

In sum, data enclaves are effective but inefficient and inequitable. Social science research is faced with the prospect of full and equal access to data when risk is low, but highly differential and unequal access when risks are high. Considerable improvements in data access regimes will be required so that price will not be the mediating factor that determines who has access to linked social science and geospatial data.

TECHNICAL APPROACHES

Data stewards and statistical researchers have developed a variety of techniques for limiting disclosure risks (for a summary of many of them, see National Research Council, 2005a). This section briefly reviews some of these methods and discusses their strengths and weaknesses in the context of spatial data. Generally, we classify the solutions as data limitation (releasing only some of the data), data alteration (releasing perturbed versions of the data), and data simulation (releasing data that were not collected from respondents but that are intended to perform as the original data when analyzed).

The approaches described here are designed to preserve as much spatial information as possible because that information is necessary for important research questions. In this way, they represent advances over older approaches to creating public-use social science data, in which the near-universal strategy was to delete all precise spatial information from the data, usually through aggregation to large areas.

Data Limitation

Data limitation involves manipulations that restrict the number of variables, the number of values for responses, or the number of cases that are made available to researchers. The purpose of data limitation is to reduce the number of unique values in a dataset (reducing the risk of identification) or to reduce the certainty of identification of a specific respondent by a secondary user. A very simple approach sometimes taken with public-use data is to release only a small fraction of the data originally collected, effectively deleting half or more of all cases.
This approach makes it difficult, even impossible, for a secondary user who knows that an individual is in the sample to be sure that she or he has identified the right person: the target individual may have been among those deleted from the public dataset.

For tabular data, as well as some microdata, one data limitation approach is cell suppression. The data steward essentially blanks out cells with small counts in tabular data or blanks out the values of identifiers or sensitive attributes in microdata. The definition of "small counts" is selected by the data steward. Frequently, cells in tables are not released unless they have at least three members. When marginal totals are preserved, as is often planned in tabular data, other values besides those at risk may need to be suppressed; otherwise, the data analyst can subtract the sum of the available values from the total to obtain the value of the suppressed data.
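The subtraction problem just described can be made concrete with a small sketch. The code below is illustrative only: the function names, the toy counts, and the choice to suppress counts below three are assumptions made for this example, not a description of any agency's procedure. It suppresses small cells and then shows how a published row total lets an analyst recover a cell that was suppressed on its own.

```python
from typing import List, Optional

Table = List[List[Optional[int]]]


def suppress_small_cells(table: List[List[int]], threshold: int = 3) -> Table:
    """Replace every count below `threshold` with None (a blanked-out cell)."""
    return [[c if c >= threshold else None for c in row] for row in table]


def recover_from_row_totals(suppressed: Table, row_totals: List[int]) -> Table:
    """Fill in any row with exactly one blank cell by subtracting from its total."""
    recovered = [row[:] for row in suppressed]
    for row, total in zip(recovered, row_totals):
        missing = [j for j, c in enumerate(row) if c is None]
        if len(missing) == 1:
            known = sum(c for c in row if c is not None)
            row[missing[0]] = total - known
    return recovered


# Hypothetical 2 x 3 table of counts; the cell holding 2 is "at risk".
counts = [[12, 2, 9],
          [7, 15, 4]]
row_totals = [sum(r) for r in counts]   # marginal totals are published

published = suppress_small_cells(counts, threshold=3)
attack = recover_from_row_totals(published, row_totals)

print(published)  # the small cell is blank
print(attack)     # the blank is recovered from the row total
```

Because only one cell in the first row is blank, the row total gives it away. Suppressing a second, complementary cell in that row (and protecting the column totals in the same way) is what closes this gap.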
Complementary cells are selected to optimize (at least approximately) various mathematical criteria. (For discussions of cell suppression, see Cox, 1980, 1995; Willenborg and de Waal, 1996, 2001.)

Cell suppression has drawbacks. It creates missing data, which complicates analyses because the suppressed cells are chosen for their values and are not randomly distributed throughout the dataset. When there are many records at risk, as is likely to be the case for spatial data with identifiers, data disseminators may need to suppress so many values to achieve satisfactory levels of protection that the released data have limited analytical utility. Cell suppression is not necessarily helpful for preserving confidentiality in survey data that include precise geospatial locations. It is possible, even if some or many cells are suppressed, for confidentiality to be breached if locational data remain. Cell suppression also does not guarantee protection in tabular data: it may be possible to determine accurate bounds for values of the suppressed cells using statistical techniques (Cox, 2004; Fienberg and Slavkovic, 2004, 2005). An alternative to cell suppression in tabular data is controlled tabular adjustment, which adds noise to cell counts in ways that preserve certain analyses (Cox et al., 2004).

Data can also be limited by aggregation. For tabular data, aggregation corresponds to collapsing levels of categorical variables to increase the cell size for each level. For microdata, aggregation corresponds to coarsening variables; for example, releasing ages in 5-year intervals or locations at the state level in the United States. Aggregation reduces disclosure risks by turning unique records into replicated records. It preserves analyses at the level of aggregation but creates ecological inference problems for lower levels of aggregation.
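As a rough illustration of how coarsening turns unique records into replicated ones, the following sketch bins ages into 5-year intervals and counts how many records remain unique. The records, the helper names, and the use of age and sex as the only identifying variables are assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Iterable


def coarsen_age(age: int, width: int = 5) -> str:
    """Map an exact age to a fixed-width interval label such as '30-34'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def count_unique(records: Iterable[Hashable]) -> int:
    """Number of records whose combination of values appears exactly once."""
    counts = Counter(records)
    return sum(1 for n in counts.values() if n == 1)


# Hypothetical microdata: (age, sex) for six respondents.
records = [(31, "F"), (33, "F"), (34, "F"), (52, "M"), (54, "M"), (58, "M")]
coarse = [(coarsen_age(age), sex) for age, sex in records]

print(count_unique(records))  # every exact-age record is unique
print(count_unique(coarse))   # far fewer unique records after coarsening
```

The same counting function could be applied after coarsening a location variable (for example, replacing a tract code with a county code); the trade-off is that analyses below the released level of aggregation are no longer possible.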

For spatial data, stewards can aggregate spatial identifiers or attribute values or both, but the aggregation of spatial identifiers is especially important. Aggregating spatial attributes puts more than one respondent into a single spatial location, which may be a point (latitude-longitude), a line (e.g., along a highway), or an area of various shapes (e.g., a census tract or other geographic division or a geometrically defined area, such as a circle). This aggregation has the effect of eliminating unique cases within the dataset or eliminating the possibility that a location in the data refers to only a single individual in some other data source, such as a map or list of addresses. In essence, this approach coarsens the geographic data.

Some disclosure limitation policies prohibit the release of information at any level of aggregation smaller than a county. Use of a fixed level of geography, however, introduces variability in the degree of masking provided. Many rural counties in the United States contain very small total populations, on the order of 1,000, while urban counties may contain more than 1 million people. The same problem arises with geographic areas defined by spatial coverage: 1 urban square kilometer holds many more people than 1 rural square kilometer. The more social identifiers, such as gender, race, or age, are provided for an area, the greater the risk of disclosure.

The use of aggregation to guard against accidental release of confidential information introduces side effects into analyses. When point data are aggregated to areas that are sufficiently large to maintain confidentiality, the ability of researchers to analyze data for spatial patterns is attenuated. Clusters of disease that may be visually evident or statistically significant at the individual level, for example, will often become undetectable at the county level of aggregation. Other effects arise as a consequence of the well-known relationship between variance and aggregation: variance tends to decrease as the size of aggregated units increases (see Robinson, 1950; Clark and Avery, 1976). The suppression of variance with increasing levels of aggregation introduces uncertainty (sometimes called the ecological inference problem) into the process of making inferences based on statistical analyses and is a component of the more general modifiable areal unit problem in spatial data analysis (see Openshaw and Taylor, 1979).

For tabular data, another data limitation alternative is to release a selection of subtables or collapsed tables of marginal totals for some properties to ensure that the cells for the full joint table are large (Fienberg and Slavkovic, 2004, 2005). This approach preserves the possibility of analysis when counts from the released subtables are sufficient for the analysis. For spatial data, this approach could be used with aggregated spatial identifiers, perhaps enabling smaller amounts of aggregation. This approach is computationally expensive, especially for high-dimensional tables, and requires additional research before a more complete assessment can be made of its effectiveness.

Data Alteration

Spatial attributes are useful in linked social-spatial data because they precisely record where an aspect of a respondent's life takes place. Sometimes these spatial data are collected at the moment that the original social survey data are collected. In the Nang Rong (see Box 1-1) and other similar studies, researchers use a portable global positioning system (GPS) device to record the latitude and longitude of the location of the interview or of multiple locations (farm fields, daily itineraries) during the interview process. It is also possible for researchers to follow the daily itineraries of study participants by use of GPS devices or RFID (radio frequency identification) tags. In the United States and other developed countries, however, locations are frequently collected not as latitude and longitude from a GPS device, but by asking an individual to supply a street address. Street addresses require some transformation (e.g., to latitude and longitude) to be made specific and comparable. This transformation, called geocoding, consists of the processes through which physical locations are added to records. There are several types of geocoding that vary in their level of specificity; each approach uses different materials to support the assignment of coordinates to records (see Box 3-1).

Areal geocoding can reduce the likelihood of identification, but most other forms of geocoding have the potential to maintain or increase the risk of disclosure because they provide the data user with one or more precise locations (identifiers) for a survey respondent. The improvements in accuracy associated with new data sources and new technologies, such as parcel geocoding, only heighten the risk. As a consequence, a new set of techniques has been devised to distort locations, and hence to inhibit disclosure. Two of the general methods available are swapping and masking.

Swapping

It is sometimes possible to limit disclosure risk by swapping data. For example, a data steward can swap the attributes of a person in one area for those of a person in another area, especially if some of those attributes are the same (such as two 50-year-old white males with different responses on other questions), in order to reduce a secondary user's confidence in correctly identifying an individual. Swapping can be done on spatial identifiers or nonspatial attributes, and it can be done within or across defined geographic locations. Swapping small fractions of data generally attenuates associations between the swapped and unswapped variables, and swapping large fractions of data can completely destroy those associations. Swapping data will make spatial analyses meaningless unless the spatial relationships have been carried into the swapped data. It is generally difficult for analysts of swapped data to know how much the swapping affects the quality of analyses.

When data stewards swap cases from different locations but leave (genuine) exact spatial identifiers on the file, the identity of participants may be disclosed, even if attributes cannot be meaningfully linked to the participant. For example, if the data reveal that a respondent lived at a particular address, even if that person's data are swapped with someone else's data, a secondary user would still know that a person living at that address was included in the study. Swapping spatial identifiers thus is better suited for limiting disclosures of respondents' attributes than their identities. Swapping may not reduce, and probably increases, the risk of mistaken attribute disclosures from incorrect identifications.

Swapping may be more successful at protecting participants' identities when locations are aggregated. However, swapping may not provide much additional protection beyond the aggregation of locations, and it may decrease data quality relative to analyzing the unswapped aggregated data.
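A minimal sketch of attribute swapping, under stated assumptions, may help fix ideas. The function below pairs records that agree on a set of matching variables but sit in different areas, and exchanges one attribute between them with a chosen probability. The field names, the pairing rule, and the `fraction` parameter are all hypothetical choices for this illustration; operational swapping procedures are more elaborate and are generally not published in detail.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def swap_attribute(records: List[Dict], match_keys: Sequence[str], area_key: str,
                   swap_key: str, fraction: float, seed: int = 0) -> List[Dict]:
    """Return a copy of `records` in which `swap_key` has been exchanged between
    some pairs of records that match on `match_keys` but differ on `area_key`."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]           # never modify the original file
    groups = defaultdict(list)
    for i, rec in enumerate(out):
        groups[tuple(rec[k] for k in match_keys)].append(i)
    for idxs in groups.values():
        rng.shuffle(idxs)
        # Pair neighbours in the shuffled order: (0, 1), (2, 3), ...
        for a, b in zip(idxs[::2], idxs[1::2]):
            if out[a][area_key] != out[b][area_key] and rng.random() < fraction:
                out[a][swap_key], out[b][swap_key] = out[b][swap_key], out[a][swap_key]
    return out


# Hypothetical file: two 50-year-old men in different areas, plus one other record.
records = [
    {"area": "A", "age": 50, "sex": "M", "income": 40},
    {"area": "B", "age": 50, "sex": "M", "income": 90},
    {"area": "A", "age": 23, "sex": "F", "income": 55},
]
swapped = swap_attribute(records, ("age", "sex"), "area", "income", fraction=1.0)
print(swapped)
```

Note what the sketch preserves and what it does not: the overall distribution of the swapped attribute is unchanged, but its association with area is altered, which is the attenuation of associations discussed in this section.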

BOX 3-1 Geocoding Methods

Areal Geocoding

Areal geocoding assigns observations to geographic areas. If all a researcher needs is to assign a respondent to a political jurisdiction, census unit, or administrative or other areas in order to match attributes of those larger areas to the individual and perform hierarchical analyses, areal geocoding resolution is a valuable tool. Areal geocoding can be implemented when the database has either addresses or latitude and longitude data, either through the use of a list of addresses that are contained in an area or through the use of an algorithm that determines whether a point is contained within a particular polygon in space. In the latter case, a digital file of polygon geometry is needed to support the areal geocoding process.

Interpolated Geocoding

Interpolated geocoding estimates the precise location of an address along a street segment, typically defined between street intersections, on a proportional basis. This approach relies on the use of a geographic base file (GBF) that contains street centerline descriptions and address ranges for each side of each street segment in the coverage area. An example is the U.S. Census Bureau's TIGER (topologically integrated geographic encoding and referencing) files. For any specific address, an algorithm assigns coordinates to records by finding the street segment (typically, one side of a block along a street) that contains the address and interpolating. Thus, the address 1225 Maple Street would be placed one-quarter of the way along the block that contains the odd-numbered addresses 1201-1299 and assigned the latitude and longitude appropriate to that precise point. Interpolated geocoding can produce digital artifacts, such as addresses placed in the middle of a curving street, or errors, such as can occur if, for example, 1225 is the last house on the 1201 block of Maple Street. Some of these problems can be minimized with software (e.g., by setting houses back from streets). The extent to which such data transformations change the results of data analyses from what they would have been with untransformed data has not been carefully studied. This approach may reduce disclosure risks.

Parcel Geocoding

Parcel geocoding makes use of new cadastral information systems that have been implemented in many communities. When this approach is used, coordinates are often transferred from registered digital orthophotographs (images that have been processed to remove distortion that arises as a consequence of sensor geometry and variability in local elevation, for example). These coordinates typically represent such features as street curbs and centerlines, sidewalks, and most importantly for geocoding, the locations of parcel and building footprint polygons and either parcel centroids or building footprint centroids. Thus, a one-to-one correspondence between each address and an accurate coordinate (representing the building or parcel centroid) can be established during geocoding. This approach typically yields more accurate positional information than interpolated geocoding methods.

GNSS-based Geocoding

The low cost and widespread availability of devices used to measure location based on signals provided by Global Navigation Satellite Systems (GNSS), such as the Global Positioning System deployed by the U.S. Department of Defense, GLONASS (Russia), and Galileo (European Union), have encouraged some practitioners to record coordinate locations for residence locations through field observations. As in the parcel approach, a one-to-one correspondence can be established between each residence and an associated coordinate. Though this process is somewhat labor intensive, the results are typically accurate since trained field workers can make policy-driven judgments about how to record particular kinds of information.

Masking

Masking involves perturbations or transformations of some data. Observations, in some cases, may be represented as points, but have their locations altered in such a way as to minimize accurate recovery of personal-level information. One of the easiest masking approaches to implement involves the addition of a stochastic component to each observation, which can be visualized as moving the point by a fixed or random amount so that the information about a respondent is associated not with that person's true location but with another location (see Chakraborty and Armstrong, 2001). That is, one can replace an accurately located point with another point drawn from a uniform distribution within a circle of radius r centered on that location. The radius parameter may be constant or allowed to vary as a function of density or some other factor important to a particular application. If density is used, r will be large in low-density areas (rural) and would be adjusted downward in areas with higher densities.
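The simplest form of masking discussed in this section, replacing a point with one drawn at random within a circle of radius r around it, can be sketched as follows. This is an illustration under assumptions rather than any published masking routine: coordinates are taken to be planar and in meters, "uniform" is read as uniform over the area of the circle, and the rule in `radius_for_density` (including all of its default values) is invented here solely to show a radius that shrinks as population density rises.

```python
import math
import random
from typing import Tuple


def mask_point(x: float, y: float, r: float, rng: random.Random) -> Tuple[float, float]:
    """Displace (x, y) to a point drawn uniformly from the disc of radius r."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    # The square root makes the draw uniform over area rather than over distance;
    # without it, displaced points would pile up near the true location.
    dist = r * math.sqrt(rng.random())
    return x + dist * math.cos(theta), y + dist * math.sin(theta)


def radius_for_density(density: float, r_max: float = 5000.0, r_min: float = 100.0,
                       ref_density: float = 50.0) -> float:
    """Hypothetical rule: large radius in sparse areas, smaller in dense ones."""
    if density <= 0:
        return r_max
    return max(r_min, min(r_max, r_max * ref_density / density))


rng = random.Random(42)
true_x, true_y = 1000.0, 2000.0               # assumed planar coordinates (meters)
rural_r = radius_for_density(10.0)            # sparse area: wide displacement
urban_r = radius_for_density(5000.0)          # dense area: narrow displacement
print(mask_point(true_x, true_y, rural_r, rng))
print(mask_point(true_x, true_y, urban_r, rng))
```

Nothing in this sketch checks where the displaced point lands, so it can fall on another household or in an implausible place such as a lake.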
Though masking can be performed easily, it has a negative side effect: the displaced points can be assigned to locations that contain real observa- tions, thus creating the possibility of false identification and harm to indi- viduals who may not even be respondents in the research. Moreover, re- search on spatial data transformation that involve moving the location of data points (Armstrong et al., 1999; Rushton et al., 2006) shows that these transformations may have a significant deleterious effect on the analysis of data. Not only is there still risk of false identification, but sometimes the points are placed in locations where they cannot beâresidences in lakes that do not permit houseboats, for example. Moreover, no single transfor- mation process provides data that are valuable for every possible form of analysis. These limitations have major consequences both for successful analysis and for reduction of the disclosure risk.
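The random displacement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the coordinate units (metres in a projected system), and in particular the rule in `radius_for_density` that ties the radius to population density are hypothetical and are not taken from Chakraborty and Armstrong (2001) or any other cited source.

```python
import math
import random


def mask_point(x, y, radius, rng):
    """Displace (x, y) to a point drawn uniformly from the disk of the given radius."""
    # Taking the square root of a uniform draw makes the density uniform over
    # the disk's area rather than concentrated near the true location.
    distance = radius * math.sqrt(rng.random())
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return x + distance * math.cos(angle), y + distance * math.sin(angle)


def radius_for_density(density, base_radius=500.0, reference_density=1000.0,
                       min_radius=50.0, max_radius=5000.0):
    """Hypothetical rule: use a larger radius where population density is lower."""
    if density <= 0:
        return max_radius
    radius = base_radius * math.sqrt(reference_density / density)
    return max(min_radius, min(max_radius, radius))


rng = random.Random(42)
# (x, y) in metres, plus local density in persons per square kilometre.
# All values are made up for illustration.
respondents = [(1000.0, 2000.0, 4000.0), (52000.0, 8000.0, 25.0)]
for x, y, density in respondents:
    r = radius_for_density(density)
    mx, my = mask_point(x, y, r, rng)
    print(f"r = {r:7.1f} m, displaced by {math.hypot(mx - x, my - y):7.1f} m")
```

The sketch does nothing about the side effects noted above: a displaced point may land on another real residence or in an uninhabitable place, so a practical implementation would need additional checks against parcel or land-cover data.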
Adding noise generally inflates uncertainties in data analyses. For some attributes being estimated, the effect is to increase the width of confidence intervals. Adding noise can also attenuate associations: in a simple linear regression model, for example, the estimated regression coefficients get closer to zero when the predictors have extra noise. There are techniques for accounting for the extra noise, called measurement error models (e.g., Fuller, 1993), but they are not easy to use except in such standard analyses as regressions.

Some research by computer scientists and cryptographers under the rubric of "privacy-preserving data mining" (e.g., Agrawal and Srikant, 2000; Chawla et al., 2005) also follows the strategy of adding specially constructed random noise to the data, either to individual values or to the results of the computations desired by the analyst. Privacy-preserving data mining approaches have been developed for regression analysis, for clustering algorithms, for discrimination, and for association rules. Like other approaches that add noise, these approaches generally sacrifice data quality for protection against disclosure. The nature of that tradeoff has not been thoroughly evaluated for social and spatial data.

Secure Access

An emerging set of techniques aims to provide users with the results of computations on data without allowing them to see individual data values. Some of these are based on variants of secure summation (Benaloh, 1987), which allows different data stewards to compute the exact values of sums without sharing their individual values. One variant, used at the National Center for Education Statistics, provides public data on a diskette or CD-ROM that is encoded to allow users to construct special tabulations while preventing them from seeing the individual-level data or from calculating totals when there are fewer than 30 respondents in a cell.
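To make the idea of secure summation concrete, the following Python sketch simulates one simple textbook variant in which stewards pass a masked running total around a ring. It is not a rendering of the scheme in Benaloh (1987) or of the NCES software; the function name, the modulus, and the example counts are assumptions chosen for illustration, and the protocol is simulated in a single process rather than across separate parties.

```python
import secrets

MODULUS = 2**64  # public modulus, assumed larger than any possible true total


def secure_sum(private_values, modulus=MODULUS):
    """Simulate ring-based secure summation among several data stewards.

    Each entry of private_values is a non-negative integer held by one steward.
    """
    # Steward 0 draws a secret random mask and starts the running total with it,
    # so the value it passes on reveals nothing about its own contribution.
    mask = secrets.randbelow(modulus)
    running = (mask + private_values[0]) % modulus
    # Every other steward sees only the masked running total, adds its own
    # value, and passes the result to the next steward.
    for value in private_values[1:]:
        running = (running + value) % modulus
    # The total returns to steward 0, which removes the mask to get the sum.
    return (running - mask) % modulus


counts = [120, 45, 310]  # made-up counts held by three stewards
print(secure_sum(counts))
```

The sum is exact, which is why this family of methods involves no loss of data quality for analyses based on sums. The protection depends on the stewards following the protocol: in this simple variant, two stewards that collude can recover the value of the steward between them.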
Secure summation variants entail no sacrifice in data quality for analyses based on sums. They provide excellent confidentiality protection, as long as the database stewards follow specified protocols. This approach is computationally intensive and challenging to set up (for a review of these methods, see Karr et al., 2005).

Another approach involves remote access model servers, to which users submit requests for analyses and, in return, receive only the results of statistical analyses, such as estimated model parameters and standard errors. Confidentiality can be protected because the remote server never allows users to see the actual data (see Boulos et al., 2006). Remote access servers do not protect perfectly, however, as the user may be able to learn identities or sensitive attributes through judicious queries of the system (for examples, see Gomatam et al., 2005). Computer scientists also have developed methods for secure record linkage, which enable two or more data stewards to determine which records in their databases have the same
values of unique identifiers without revealing the values of identifiers for the other records in their databases (Churches and Christen, 2004; O'Keefe et al., 2004).

Secure access approaches have not generally been used by stewards of social science data, and the risks and benefits for spatial-social data dissemination and sharing are largely unevaluated. However, the concept underpinning these techniques, allowing users to perform computations with the data without actually seeing the data, may point to solutions for sharing social and spatial data.

Data Simulation

Data providers may also release synthetic (i.e., simulated) data that have characteristics similar to those of the genuine data as a way to preserve both confidentiality and the possibility of meaningful data analysis, an approach first proposed by Rubin (1993) in the statistical literature. The basic idea is to fit probability models to the original data, then simulate and release new data that fit the same models. Because the data are simulated, the released records do not correspond to individuals from the original file and cannot be directly linked to records in other datasets. These features greatly reduce identity and attribute disclosure risks. However, synthetic data are subject to inferential disclosure risk when the models used to generate data are too accurate. For example, when data are simulated from a regression model with a very small mean square error, analysts can use the model to estimate outcomes precisely and can infer the identities of respondents with high accuracy.

When the probability models closely approximate the true joint probability distributions of the actual data, the synthetic data should have similar characteristics, on average. The "on average" caveat is important: parameter estimates from any one synthetic dataset are unlikely to equal exactly those from the actual data.
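The fit-then-simulate idea, and the "on average" caveat, can be illustrated with a deliberately simplified Python sketch. Everything in it is an assumption made for illustration: the "confidential" data are themselves simulated, the model is a single-predictor linear regression with normal errors, and the synthesizer plugs in the fitted parameter values rather than drawing them from a posterior distribution, as a careful multiple-imputation synthesizer would.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (len(xs) - 2)) ** 0.5


rng = random.Random(0)

# Stand-in for the confidential file (entirely made up here).
true_x = [rng.gauss(50, 10) for _ in range(500)]
true_y = [2.0 + 0.5 * x + rng.gauss(0, 5) for x in true_x]
a, b, sd = fit_line(true_x, true_y)


def synthesize(n):
    """Simulate n new records from models fitted to the confidential data."""
    mean_x, sd_x = statistics.fmean(true_x), statistics.stdev(true_x)
    xs = [rng.gauss(mean_x, sd_x) for _ in range(n)]
    ys = [a + b * x + rng.gauss(0, sd) for x in xs]
    return xs, ys


# Five independent synthetic datasets, each analyzed the same way.
synthetic_slopes = [fit_line(*synthesize(500))[1] for _ in range(5)]
print(f"slope from original data: {b:.3f}")
print("slopes from synthetic data:", [round(s, 3) for s in synthetic_slopes])
```

No synthetic record corresponds to an original record, yet each synthetic slope is close to the original one without matching it exactly, and the slopes differ from one synthetic dataset to the next.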
The synthetic parameter estimates are subject to variation from sampling the collected data and from simulating new values. It is not possible to estimate all sources of variation from only one synthetic dataset, because an analyst cannot measure the amount of variability from the synthesis. Rubin's (1993) suggestion is to simulate and release multiple, independent synthetic datasets from the same original data. An analyst can then estimate parameters and their variances in each of the synthetic datasets and combine the results with simple formulas (see description by Raghunathan et al., 2003).

Synthetic datasets can have many positive data utility features (see Rubin, 1993; Raghunathan et al., 2003; Reiter, 2002, 2004, 2005b). When the data generation models are accurate, valid inferences can be obtained from multiple synthetic datasets by combining standard likelihood-based or
survey-weighted estimates. An analyst need not learn new statistical methods or software programs to unwind the effects of the disclosure limitation method. Synthetic datasets can be generated as simple random samples, so that analysts can ignore the original complex sampling design for inferences. The data generation models can adjust for nonsampling errors and can borrow strength from other data sources, thereby making high-quality inferences possible. Finally, because all units are simulated, geographic identifiers can be included in synthetic datasets.

Synthetic data reflect only those relationships included in the models used to generate them. When the models fail to reflect certain relationships, analysts' inferences also do not reflect those relationships. For example, if the data generation model for an attribute does not take into account relationships between location and that attribute, the synthetic data will contain zero association between the spatial data and that attribute. Similarly, incorrect distributional assumptions built into the models are passed on to the users' analyses. For example, if the data generation model for an attribute is a normal distribution when the actual distribution is skewed, the synthetic data will fail to reflect the shape of the actual distribution. If a model does fail to include such relationships, it is a potentially serious limitation to releasing fully synthetic data. Practically, it means that some analyses cannot be performed accurately and that data disseminators need to release information that helps analysts decide whether or not the synthetic data are reliable for their analyses.

To reduce dependency on data generation models, Little (1993) suggests a variant of the fully synthetic data approach called partially synthetic data.
Imagine a dataset with three kinds of information: information that, when combined, is a potential indirect identifier of the respondent (age, sex, race, occupation, and spatial location); information that is potentially highly sensitive (responses about antisocial or criminal behavior, for example); and a residual body of information that is less sensitive and less likely to lead to identification (responses about personal values or nonsensitive behaviors). Partially synthetic data might synthesize the first two categories of data, while retaining the actual data of the third category. For example, the U.S. Federal Reserve Board protects data in the U.S. Survey of Consumer Finances by replacing monetary values at high disclosure risk with multiple imputations, releasing a mixture of these imputed values and the unreplaced, actual values (Kennickell, 1997). The U.S. Bureau of the Census protects data in longitudinal linked datasets by replacing all values of some sensitive variables with multiple imputations and leaving other variables at their actual values (Abowd and Woodcock, 2001). Partially synthetic approaches promise to maintain the primary benefits of fully synthetic data (protecting confidentiality while allowing users to make inferences without learning complicated statistical methods or software) with decreased sensitivity to the specification of the data generation models (Reiter, 2003).

The protection afforded by partially synthetic data depends on the nature of the synthesis. Replacing key identifiers with imputations obscures the original values of those identifiers, which reduces the chance of identifications. Replacing values of sensitive variables obscures the exact values of those variables, which can prevent attribute disclosures. Partially synthetic datasets present greater disclosure risks than fully synthetic ones: the originally sampled units remain in the released files, albeit with some values changed, leaving values that analysts can use for record linkages.

Currently, for either fully or partially synthetic data, there are no semiautomatic data synthesizers. Data generation models are tailored to individual variables, using sequential regression modeling strategies (Raghunathan et al., 2001) and modifications of bootstrapping, among others. Substantial modeling expertise is required to develop valid synthesizers, as well as to evaluate the disclosure risks and data utility of the resulting datasets. Modeling thus poses an operational challenge to generating synthetic datasets. A few evaluations of the disclosure risk and data utility issues have been done with social surveys, but none with linked spatial-social data.

For spatially identifiable data, a fully synthetic approach simulates all spatial identifiers and all attributes. Such an approach can be achieved either by first generating new values of spatial identifiers (for example, sampling addresses randomly from the population list) and then simulating attribute values tied to those new values of identifiers, or by first generating new attribute values and then simulating new spatial identifiers tied to those new attribute values.
In generating new identifiers, however, care should be taken to avoid implausible or impossible results (e.g., private property on public lands, residences in uninhabitable areas). Either way, the synthesis requires models relating the geographic identifiers to the attributes. Contextual variables can provide information for modeling. The implications of these methods for data utility, and particularly for the validity of inferences drawn from linked social-spatial data synthesized by different methods, have not yet been studied empirically.

Fully synthetic records cannot be directly linked to records in other datasets, which reduces data utility when linkage is desired. One possibility for linkage is to make linkages informed by statistical analyses that attempt to match synthetic records in one dataset with appropriate nonsynthesized records in another dataset. Research has not been conducted to determine how well such matching preserves data utility.

Partially synthetic approaches can be used to simulate spatial identifiers or attributes. Simulating only the identifiers reduces disclosure risks without distorting relationships among the attribute variables. Its effect on the
relationships between spatial and nonspatial variables depends on the quality of the synthesis model. At present, not much is known about the utility of this approach.

Linking datasets on synthetic identifiers or on attributes creates matching errors, and relationships between spatial identifiers and the linked variables may be attenuated. Analyses involving the synthetic identifiers reflect the assumptions in the model used to generate new identifier values on the basis of attribute values. This approach introduces error into matches obtained by linking the partially synthetic records to records in other datasets. Alternatively, simulating selected attributes reduces attribute risks without disturbing the identifiers: this enables linking, but it does not prevent identity disclosures. Relationships between the synthetic attributes and the linked attributes are attenuated (although to an as yet unknown degree) when the synthesizing models are not conditional on the linked attributes. This limitation also holds true when linking to fully synthetic data.

The release of partially synthetic data can be combined with other disclosure limitation methods. For example, the Census Bureau has an application, On the Map (http://lehdmap.dsd.census.gov/), that combines synthetic data and the addition of noise. Details of the procedure, which coarsens some workplace characteristics and generates synthetic travel origins conditional on travel destinations and workplace characteristics, have not yet been published.
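The "simple formulas" for combining results across multiple synthetic datasets, mentioned in the discussion of Rubin's proposal, can be sketched as follows. The variance expressions are the forms usually attributed to Raghunathan et al. (2003) for fully synthetic data and Reiter (2003) for partially synthetic data; they are reproduced here from memory as an illustration and should be checked against those sources before use. The function name and the example numbers are hypothetical.

```python
import statistics


def combine_synthetic(estimates, variances, fully_synthetic=True):
    """Combine results from m synthetic datasets.

    estimates: the point estimate of one quantity from each dataset
    variances: the estimated variance of that estimate from each dataset
    Returns (combined point estimate, estimated variance of that estimate).
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)    # average of the point estimates
    b = statistics.variance(estimates)     # between-dataset variance
    v_bar = statistics.fmean(variances)    # average within-dataset variance
    if fully_synthetic:
        # This form can be negative when m is small; the literature
        # discusses adjustments for that case.
        total = (1 + 1 / m) * b - v_bar
    else:
        total = v_bar + b / m
    return q_bar, total


# Made-up results for one regression coefficient from five synthetic datasets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
variances = [0.0004] * 5
print(combine_synthetic(estimates, variances, fully_synthetic=True))
print(combine_synthetic(estimates, variances, fully_synthetic=False))
```

In both cases the analyst runs an ordinary analysis on each released dataset and then applies a few lines of arithmetic, which is the sense in which no new statistical methods or software are needed.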