PUTTING PEOPLE ON THE MAP

1 Linked Social-Spatial Data: Promises and Challenges

Precise, accurate spatial data are contributing to a revolution in some fields of social science. Improved access to such data, combined with improved methods of analysis, is making possible deeper understanding of the relationships between people and their physical and social environments. Researchers are no longer limited to analyzing data provided by research participants about their personal characteristics and their views of the world; rather, it has become possible to link personal information to the exact locations of homes, workplaces, daily activities, and characteristics of the environment (e.g., water supplies). Those links allow researchers to understand much more about individual behavior and social interactions than previously, just as linking biomedical data (on genes, proteins, blood chemistry) to social data has helped researchers understand the progress of illness and health in relation to aspects of people's behavior. The potential for improved understanding of human activities at the individual, group, and higher levels by incorporating spatial information is only beginning to be unlocked.

Yet even as researchers are learning from new opportunities offered by precise spatial information, these data raise new challenges because they allow research participants to be identified and therefore threaten the promise of confidentiality made when collecting the social data to which spatial data are linked. Although the difficulties of ensuring access to data while preserving confidentiality have been addressed by previous National Research Council reports (1993, 2000, 2003, 2005a), those reports did not consider in detail the risks posed by data that link the information in social science research with spatial locations. This report directly addresses the tradeoffs between providing greater access to data and protecting research participants from breaches of confidentiality in the context of the unique capacity of spatial data to lead to the identification of individuals.

THE NEW WORLD OF LOCATIONAL DATA

The development of new data, approaches, spatial analysis tools, and data collection methods over the past several decades has revolutionized how researchers approach many questions. The availability of high-resolution satellite images of Earth, collected repeatedly over time, and of software for converting those images into digital information about specific locations, has made new methods of analysis possible. Along with more and improved satellite images, there are aerial images, global positioning systems (GPS), and other types of sensors, especially radio frequency identification (RFID) tags that can be used to track people worldwide, that allow the possibility of ubiquitous tracking of individuals and groups. The same technologies also permit enhanced research about business enterprises, for example, by providing tracking information for commercial vehicles or shipments of goods.

With the advent of GPS, the goal of real-time, continuous global coverage with an accuracy finer than 1 meter has been achieved, though some caveats, such as difficulty with indoor coverage, apply. Triangulation based on cellular telephone signal strength can be used to establish location on the order of 100 meters in many locations, and researchers are now developing techniques for mapping mobile locations at much higher resolutions (Borriello et al., 2005). Satellite remote sensing instruments have improved by more than an order of magnitude during the past two decades in several dimensions of resolution. Commercial remote sensing firms provide data with a sub-meter ground resolution.
With the increasing availability of hyperspectral sensor systems (those that sense in hundreds of discrete spectral bands along the electromagnetic spectrum), the amount of geographic information being collected from satellites has increased at a staggering pace.

Terrestrial sensing systems are also increasing in quantity and capability. Low-cost solid-state imagers with GPS control are now widely deployed by private companies and scientific investigators. In addition, fixed sensor arrays (e.g., closed circuit television) are now used routinely in many locations to provide continuous coverage of events in their field of view. As computers continue to decrease in size and power consumption while also increasing in computing and storage capacity, inexpensive in situ sensor networks are able to record information that is transmitted over peer-to-peer networks and other types of radio communication technologies (Culler,
Estrin and Srivastava, 2004; Martinez, Hart, and Ong, 2004). These devices are now rather primitive, often sensing single types of information such as temperature or pressure, but their capabilities are increasing rapidly. Moreover, their space requirements are decreasing; some researchers now describe nanoscale computing and sensing devices (Geer, 2006).

These emerging technologies are being integrated with other developing streams of technology that are location and context aware, such as RFID tags (Want, 2006) and wearable computers (Smailagic and Siewiorek, 2002). Indeed, the ubiquity of these devices has caused some to assert that traditional sensing and processing systems will, in essence, disappear (Streitz and Nixon, 2005; Weiser, 1991). These technologies are creating significant concerns about threats to privacy, although few, if any, of these concerns relate to research uses of the technologies. Nevertheless, emerging technological capabilities are an important part of the context for the research use of locational data.

As these new tools and methods have become more widely available, researchers have begun to pursue a variety of studies that were previously difficult to accomplish. For example, analysis of health services once focused on access as a function of age, sex, race, income, occupation, education, and employment. It is now possible to examine how access and its effects on health are influenced by distances from home and work to health care providers, as well as the quality of the available transportation routes and modes (Williams, 1983; Entwisle et al., 1997; Parker, 1998; Kwan, 2003; Balk et al., 2004). Improved understanding of how these spatial phenomena interact with social ones can give a much clearer picture of the nature of access to health care than was previously possible.
Critical to research linking social and spatial data are the development and use of geographical information systems (GIS) that make it possible to tie data from different sources to points on the surface of the Earth. This connection has great importance because geographic coordinates are a unique and unchanging identification system. With GIS, data collected from participants in a social survey can be linked to the location of the respondents' residences, workplaces, or land holdings and thus can be analyzed in connection with data from other sources, such as satellite observations or administrative records that are tied to the same physical location. Such data linkage can reveal more information about research participants than can be known from either source alone. Such revelations can increase the fund of human knowledge, but they can also be seen by the individuals whose data are linked as an invasion of privacy or a violation of a pledge of confidentiality.

Increasingly sophisticated tools for spatial analysis involving, but going far beyond, the simple digitized maps of the early geographical information systems have also contributed to this revolution. Not only has
commercial software made spatial data processing, visualization, and integration relatively accessible, but several packages (including freeware; e.g., Anselin, 2005; Anselin et al., 2006; Bivand, 2006; also see http://www.r-project.org/) also make multivariate spatial regression analysis much easier (e.g., Fotheringham et al., 2002). Moreover, standard statistical software packages, such as Stata and Matlab, now have much greater functionality to accommodate spatial analytic models, and SAS (another software package) and Stata have increased flexibility to accommodate complex design effects often associated with spatially linked data.

SCOPE OF WORK

In response to such challenges of providing wider access to data used for social-spatial analysis while maintaining confidentiality, the sponsors of this study asked the National Academies to address the scientific value of linking remotely sensed and "self-identifying" social science data that are often collected in social surveys, that is, data that allow specific individuals and their attributes to be identified. The Academies were further asked to discuss and evaluate tradeoffs involving data accessibility, confidentiality, and data quality; consider the legal issues raised by releasing remotely sensed data in forms linked to self-identifying data; assess the costs and benefits of different methods for addressing confidentiality in the dissemination of such data; and suggest appropriate models for addressing the issues raised by the combined needs for confidentiality and data access.

In carrying out our study, it became clear that limiting the study to remotely sensed data unnecessarily restricted the problem domain.
When social science research data are linked with spatially precise and accurate information, it does not matter in terms of confidentiality issues whether the geospatial locations are derived from remotely sensed imagery or from other means of determining location, such as GPS devices or address-matching using GIS technology. The issues raised by linking remotely sensed information are a special case within the larger category of spatially precise and accurate information. For that reason, the committee considered as part of its mandate all forms of spatial information. We also considered all forms of data collected from research participants that might allow them to be identified, including personal information about individuals, which may or may not be sensitive if revealed to others, and information about specific business enterprises. For purposes of simplicity we call all this personal and enterprise information used for the research considered here "social data," and their merger with spatial information "social-spatial data."
This report focuses mainly on microdata, specifically, information about individuals, households, or businesses that participate in research studies or supply data for administrative records that have the potential to be shared with researchers outside the original group that produced the data. This focus is the result of the fact that such individual-, household-, or enterprise-level data are easily associated with precise locations. Microdata are especially important because spatial data can compromise confidentiality both by identifying respondents directly and by providing sensitive information that creates risk of harm if linked to identifying data. In addition, spatially precise information may sometimes be associated with small aggregates of individuals or businesses; and care is always needed when sharing data that have exact locations, for example, a cluster of persons or families living near each other.

This report provides guidance to agencies that sponsor data collection and research, to academic and nonacademic institutions and their institutional review boards (IRBs), to researchers who are collecting data, to institutions and individuals involved in the research enterprise (such as firms that contract to conduct surveys), and to those organizations charged with the long-term stewardship of data. It discusses the challenges they face in preserving confidentiality for linked social and spatial data, as well as ways that they can simultaneously honor their commitment to share their wealth of data and their commitment to preserve participant confidentiality. Although all these individuals and organizations involved in the research enterprise have somewhat different roles to play and somewhat different interests and concerns, we refer to them throughout this report as data stewards.
This focus on the responsibilities of those who share data for analysis does not absolve others who have responsibility for the collected information from thinking about the risks associated with spatially explicit data. The report therefore also speaks to those who use linked social-spatial data, including researchers who analyze the data and editors who publish maps or other spatially explicit information that may reveal information that is problematic from a privacy perspective (e.g., Monmonier, 2002; Armstrong and Ruggles, 2005; Rushton et al., 2006).

This study follows and builds on a series of previous National Research Council reports that address closely related issues, including: issues of data access (1985); the challenges of protecting privacy and reducing disclosure risk while maximizing access to quality, detailed data for informed analyses (1993, 2000, 2003, 2004b); and ethical considerations in using micro-level data, including linked data (2005a). The conclusions and recommendations of several of these earlier studies inform this report. These earlier reports and other studies (e.g., National Research Council, 1998; Jabine 1993; Melichar et al., 2002) have generally developed two themes, one emphasizing the need for data, especially microdata, to be shared among researchers, and the other the need to protect research participants. While the theme of expanding access to data has included data produced by both individual researchers and government agencies, it has generally emphasized the latter. In the closely related area of environmental data, the National Research Council (2001) emphasizes that publicly funded data are a public good and that the public is entitled to full and open access. The consensus of this work is that secondary use of data for replication and new research is valuable and that both privately and publicly produced data should be shared. The most recent report on the subject (National Research Council, 2005a) presents a concise set of recommendations that encourage increased access to publicly produced data. At the same time, these reports and studies have also insisted on the protection of research participants, mostly in the broader context of protecting all human research subjects.

This report supports the conclusions of the prior work while exploring new ground. None of the earlier reports considered the potential for breaches of confidentiality posed by the increase in research using linked social-spatial data. The analyses and recommendations included in this report strive to expand the field to the new world of locational data.

The concerns addressed in this report are raised in the context of a broader recognition that vast amounts of data are available about most residents of the United States, that these data have been collected and collated without the explicit permission of their subjects, and that invasions of privacy take place frequently (O'Harrow 2005; Dobson and Fisher 2003; Goss 1995; Fisher and Dobson 2003; Sui 2005; Electronic Privacy Information Center [http://www.epic.org/pivacy/census], 2003).
Huge commercial databases of financial transactions, court records, telephone records, health information, and other personal information have been established, in many cases without any meaningful request to the relevant individuals for release of that information. These databases are often linked and the results made available for a fee to purchasers in a system that has greatly diminished individuals' and businesses' control over information about themselves. These invasions or perceived invasions of privacy, however, are not a subject of this report. All datasets that include personal information, whether created for commercial or research purposes and whether or not they include spatial information, need comprehensive care to prevent breaches of confidentiality and invasions of privacy. Neither this report nor earlier reports deal with the kinds of information technology security required to prevent breaches or invasions; in the case of this report, that is because the need for such security is no different for spatial data than for other data.
PRIVACY, CONFIDENTIALITY, IDENTIFICATION, AND HARM

To understand the dimensions of the confidentiality problem, it is important first to distinguish the concepts of privacy, confidentiality, identification, and harm (see Box 1-1). Privacy concerns the ability of individuals to control personal information that is not knowable from their public presentations of themselves (see Appendix A for a more detailed discussion of privacy and U.S. privacy law). When someone willingly provides information about himself or herself, it is not an invasion of privacy, especially if the person has been informed that it is acceptable to terminate the disclosure at any time. An invasion of privacy occurs when an agent obtains such information about a person without that person's agreement. An invasion of privacy is especially egregious when the person does not want the agent to have the information. An example is the acquisition and sale of the mobile telephone records of individuals without their permission (New York Times, 2006).

BOX 1-1 Brief Definitions of Some Key Terms

Privacy concerns the ability of individuals to control personal information that is not knowable from their public presentations of themselves. An invasion of privacy occurs when an agent obtains such information about a person without that person's agreement.

Confidentiality in the research context involves an agreement in which a research participant makes personal information available to a researcher in exchange for a promise to use that information only for specified purposes and not to reveal the participant's identity or any identifiable information to unauthorized third parties.

Identification of an individual in a database occurs when a third party learns the identity of the person whose attributes are described there. Identification disclosure risk is the likelihood of identification.

Harm is a negative consequence that affects a research participant because of a breach of confidentiality.

Confidentiality involves a promise given by an agent (a researcher, in the cases of interest in this report) in exchange for information. Before a research activity begins, the researcher explains the purposes of the project, describes the benefits and harms that may affect the research participant and society more broadly, and obtains the consent of the participant to continue. This process is called "informed consent" (see National Research Council, 1993). The researcher then collects the information through interview, behavioral observation, physical examination, collection of biological sample specimens, or requests for the information from a third party, such as a hospital or a government agency. In exchange, the researcher promises to use that information only for specified purposes (often limited to statistical analysis) and not to reveal the participant's identity or any identifiable information to unauthorized third parties. If promises of confidentiality are kept, a participant's privacy is protected in relation to the information given to the researchers. In academic and other research organizations, the process of obtaining informed consent and making confidentiality promises is part of normal research protocol: institutional review boards have guidelines that require agreements and protection of confidentiality, and the ethical standards of research communities provide further support for confidentiality.

Identification is a key element in confidentiality promises. Confidentiality means that when researchers release any information (analyses, descriptions of the project, or databases that might be used by third parties), they promise that the identity of the participants will not be publicly revealed and cannot be inferred. Identification of an individual in a database occurs when a third party learns the identity of the person whose attributes are described there. Identification obviously increases the risk of breaches of confidentiality. Identification disclosure risk is sometimes quantified in terms of the likelihood of identification. In the context of this study, precise spatial information increases the risk of disclosure and thus the likelihood of identification.
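One simple way to make the idea of a likelihood of identification concrete is sketched below. It is an illustration only, not a method taken from this report: it assumes that a third party who knows certain attributes of a person can narrow the search to the records sharing those attributes, so that a record in a group of k matching records carries a risk of roughly 1/k. The records, the field names, and the function `identification_risk` are all hypothetical.

```python
from collections import Counter

def identification_risk(records, quasi_identifiers):
    """Return a per-record risk of 1/k, where k is the number of records
    that share the same values on the given quasi-identifier fields.
    (Assumed, simplified risk measure for illustration.)"""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    group_sizes = Counter(keys)
    return [1.0 / group_sizes[k] for k in keys]

# Hypothetical survey records; "tract" stands in for a location attribute.
records = [
    {"age_group": "30-39", "sex": "F", "tract": "A"},
    {"age_group": "30-39", "sex": "F", "tract": "A"},
    {"age_group": "30-39", "sex": "F", "tract": "B"},
    {"age_group": "40-49", "sex": "M", "tract": "B"},
]

# Risk without and with the location attribute.
coarse = identification_risk(records, ["age_group", "sex"])
fine = identification_risk(records, ["age_group", "sex", "tract"])

print("mean risk without location:", sum(coarse) / len(coarse))
print("mean risk with location:   ", sum(fine) / len(fine))
```

In this toy dataset, adding the location field splits a group of three matching records into smaller groups, so the average risk rises. The more precise the location, the smaller those groups tend to become.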
It is important to note that it is not so much the information that is being protected, but the link of the information to the individual. For example, it is acceptable to describe a person's survey answers or characteristics so long as the identity of the participant is not revealed. The danger inherent in a breach of confidentiality is not only that private information about an individual might be revealed, but also that the successful conduct of research requires that there be no breaches of confidentiality: any such breach may significantly endanger future research by making potential research participants wary of sharing personal information. Including spatial data in a dataset with social data greatly increases the possibility of identification while at the same time being necessary for certain kinds of analysis.

Harm is a negative consequence that affects a survey respondent or other research participant, in the instances of interest in this study, because of a breach of confidentiality. Social science research can cause various kinds of harm (for example, legal, reputational, financial, or psychological) because information is revealed about a person that she or he does not wish others to know, such as financial liabilities or a criminal record. In exceptional cases, identification of a participant in social science research could put the person at risk of physical harm from a third party. In linking social and spatial data, the need to prevent breaches of confidentiality remains serious, even if no discernable harm is done to respondents, because even apparently harmless breaches violate the expectations of a trusting relationship and can also damage the reputation of the research enterprise.1

Thus, the challenge to the research community is to preserve confidentiality (and also to protect private information to the extent possible). This means that research participants must be protected from identification especially, but not only, when identification can harm them. Though the chance of a confidentiality breach is never zero, the risk of disclosure depends on the nature of the data. The separate risk of harm also depends on the nature of the data. In some instances, confidentiality is difficult to protect but the risk of harm to respondents is low (e.g., when the data include only information that is publicly available); in others, confidentiality may be easy to protect (e.g., because the data include few characteristics that might be used to identify someone), but the risk of harm may be high if identification occurs (because some of the recorded characteristics could, if known, endanger the well-being of the respondent). When precise locational data are included in or can be determined from a dataset, researchers face tougher challenges of protecting confidentiality and preventing identification.
OPPORTUNITIES AND CHALLENGES FOR RESEARCHERS

In response to the growing opportunities for researchers and policy makers to gain knowledge about relationships between social and spatial phenomena, research funders, especially the sponsors of this study (the National Institute of Child Health and Human Development [National Institutes of Health], the National Science Foundation, and the National Aeronautics and Space Administration), have contributed substantial resources to the creation of linked social-spatial datasets (see Box 1-2). Such datasets cover parts of the United States (Arizona State University, 2006; University of Michigan, 2005a), Brazil (Moran, Brondizio, and VanWey, 2005; Indiana University, 2006), Ecuador (University of North Carolina, 2005), Thailand (Walsh et al., 2005; University of North Carolina, 2006), Nepal (University of Michigan, 2005b), and other countries. One outstanding example is research on the relationship among population, land use, and environment in the Nang Rong district of Thailand, described in Figure 1-1.

1 For more on the distinction between risk and harm, see the Risk and Harm Report of the Social and Behavioral Sciences Working Group on Human Research Protections (http://www.aera.net/aera.old/humansubjects/risk-harm.pdf, accessed January 2007).
FIGURE 1-1 Confidentiality in Nang Rong, Thailand. The image is an aerial photo with simulated households identified and linked to their farm plots. At this resolution, it is impossible to prevent identification of households.

FIGURE 1-2 Confidentiality issues in Bangkok, Thailand. The background is an Ikonos satellite image, with simulated household data overlaid. The figure shows that migrants from the same village cluster at their destination, forming a village enclave (upper insert), or cluster with migrants from other Nang Rong villages, forming a Nang Rong cluster (lower insert). Released in this fashion, the data can give away the identity of the migrants (unless circles are enlarged to cover more area, in which case the quality of the data is degraded).

BOX 1-2 An Example of Social-Spatial Data

A good example of a social-spatial dataset comes from the Nang Rong study, begun in 1984. This project covers 51 villages in Nang Rong district, Northeast Thailand, an agricultural setting in the country's poorest region. The researchers who work on this project have collected data from all households in each village, including precise locations of dwelling units and agricultural parcels. Social network data link households along lines of kinship as well as economic assistance (who helps whom with agricultural tasks). The project team also follows migrants out to their most common destinations, including Bangkok and the country's Eastern Seaboard, a government-sponsored development zone. The project's social data have been merged with the locations of homes, fields, and migration destinations, and then linked to a variety of other types of geographic information including satellite data, aerial photographs, elevation data, road networks, and hydrological features. These linked data have been used for many types of analysis (see University of North Carolina, 2006).

Figures 1-1 and 1-2 are simulated data of the type created for the Nang Rong project. They show just how clearly individuals and households can be located in these data and therefore how easy it would be for anyone who has the spatial information for actual respondents to identify them.
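The tradeoff noted in the caption to Figure 1-2, that covering more area protects identities but degrades the data, can be illustrated with a small sketch. It is not the procedure used in the Nang Rong project. It assumes a simple form of spatial coarsening in which each household location, given in hypothetical planar coordinates in meters, is replaced by the center of the square grid cell that contains it. The coordinates and the functions `snap_to_grid` and `summarize` are invented for illustration.

```python
import math
from collections import Counter

def snap_to_grid(points, cell_size):
    """Replace each (x, y) point with the center of the square grid cell
    of side cell_size that contains it."""
    return [
        ((math.floor(x / cell_size) + 0.5) * cell_size,
         (math.floor(y / cell_size) + 0.5) * cell_size)
        for x, y in points
    ]

def summarize(points, cell_size):
    """Return (smallest number of households sharing one released location,
    mean displacement in meters introduced by the coarsening)."""
    masked = snap_to_grid(points, cell_size)
    smallest_group = min(Counter(masked).values())
    mean_shift = sum(math.dist(p, m) for p, m in zip(points, masked)) / len(points)
    return smallest_group, mean_shift

# Hypothetical household locations (meters, arbitrary local origin).
households = [(10.0, 10.0), (40.0, 20.0), (90.0, 80.0), (160.0, 30.0)]

fine = summarize(households, 50.0)     # small cells: little distortion
coarse = summarize(households, 200.0)  # large cells: more households per cell

print("50 m cells:  smallest group = %d, mean shift = %.1f m" % fine)
print("200 m cells: smallest group = %d, mean shift = %.1f m" % coarse)
```

With small cells, some households remain alone in their cell and could still be singled out. With large cells, every household shares a released location with others, but each point has moved much farther from its true position, which is the loss of data quality the caption describes.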
Linking social data that are collected from individuals and households with spatial data about them, collected in place or by remote sensing, creates potential for improved understanding of a variety of social phenomena (see Butz and Torrey, 2006). Much has already been learned about the effects of context on social outcomes by analyzing social data at relatively imprecise geographic levels, such as census blocks and tracts or other primary sampling units (e.g., Gephart, 1997; Smith and Waitzman 1997; Le Clere et al. 1998; Ross et al., 2000; Sampson et al., 2002). Advances in geographic information science and in remote sensing make it possible to connect individuals and households to their geographic and biophysical environments, and changes in them, at much finer scales.

Because concerns about confidentiality have limited the use of linked social and fine-scale spatial data, the potential for advancing knowledge through such linkages is only beginning to be explored. There are some early hints of exciting work, and we can speculate about future progress. Some of the progress involves studies of human interactions with the natural environment, a field that has been supported by the agencies that have requested the present study (e.g., National Research Council, 1998, 2005b). Researchers have combined household surveys with remotely sensed data on changes in land use to gain deeper understanding of the processes driving those changes and their economic consequences (e.g., conversion of agricultural land to urban uses, Seto, 2005; changes in cropping patterns, Walsh et al., 2005; changes in forest cover, Foster, 2005; Moran et al., 2005).

Another area of research and opportunity involves global population patterns.
Global gridded population data demonstrate that people tend to live at low elevation and near sea coasts and rivers (Small and Cohen, 2004; Small and Nicholls, 2003) and that people living in coastal regions are disproportionately residents of urban areas. Moreover, coastal regions, whether urban or rural, are much more densely populated than other types of ecosystems (McGranahan et al., 2005). About one of every ten people on Earth lives in a low elevation coastal zone at risk of storm surges associated with expected increases in sea levels (McGranahan et al., 2006).

Interesting examples come from health research. For example, the availability of exercise options near where people live, including features as simple as a sidewalk, affects people's health and physical fitness (Gordon-Larsen et al., 2006). Other research shows how migration responds to local environmental conditions, with recurrent droughts perhaps providing the best example (Deane and Gutmann, 2003; Gutmann et al., 2006). There are opportunities for improving estimates of vulnerability to famine by combining data on food availability with data on household coping capabilities and strategies (Hutchinson, 1998). In one example, combining demographic survey data with environmental variables showed that household factors (composition, size, assets), maternal education, and soil fertility were all significant determinants of child hunger in Africa (Balk et al., 2005).

The future of health research offers myriad opportunities. For example, environmental factors (e.g., air and water quality) have been linked to people's health: as social and biophysical datasets become better integrated at finer scales, it will be possible to examine a variety of environmental factors and link them to people's health with greater precision and so develop better understanding of those environmental factors.

Another example of the future of research concerns understanding travel behavior by linking personal data with fine-scale spatial information on actual travel patterns. Researchers could evaluate simultaneously the individual attributes of the research participants, the environmental attributes of the places they live, work, or otherwise frequent, and the detailed travel patterns that lead from one to another. Beyond knowing whether a route to school has a sidewalk and whether a child walks to school, one can ask whether that route also has a candy store or a community exercise facility and whether the actual trip to school allows the child to stop there. Yet combining all that information (location of home and school, route taken, and attributes of child and family) and publishing it would reveal the actual identities of research participants and so breach the promise of confidentiality made when data were collected from them.

As research combining spatial data with social data collected from individuals has expanded, both researchers and their sponsors have been forced to confront questions about the release of the massive amounts of data they have accumulated. The opportunities for research offer the potential for great benefits, but there is also some risk of harm.
Moreover, both professional ethics and agency policies require that researchers share their data with others.2 At the same time, researchers who collect social and behavioral data customarily promise the participants who provide the data confidentiality, and the same professional ethics and agency policies that require data sharing also require that pledges of confidentiality be honored. These requirements combine to produce the central dilemma that this report addresses.

2 See, for example, the codes of ethics of the Urban and Regional Information Systems Association (http://www.urisa.org/about/ethics); the American Society of Photogrammetry and Remote Sensing (http://www.asprs.org/membership/certification/appendix_a.html); the American Sociological Association (http://www.asanet.org/galleries/default-file/Code%20of%20Ethics.pdf) and the Association of American Geographers (http://www.aag.org/Publications/EthicsStatement.html). Also see, for example, the policies of the National Institutes of Health (http://grants1.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html) and the National Science Foundation (article 36 at http://nsf.gov/pubs/2001/gc101/gc101rev1.pdf). [All above-cited web pages accessed January 2007.]
To understand the challenges and opportunities, consider a recent finding and two hypothetical examples. The finding concerns the rapidly growing use of maps in medical research. Brownstein and colleagues (2006) identified 19 articles in five major medical journals in 2004 and 2005 that plotted the addresses of patients as dots or symbols on maps. To determine how easy it might be to identify individual patients from these maps, they created a simulated map of 550 geographically coded addresses of patients in Boston, using the minimum figure resolution required for publication in the New England Journal of Medicine, and attempted to reidentify the addresses using standard GIS technology. They precisely identified 79 percent of the addresses from the map and located the rest to within 14 meters. The authors' point was that improved ability to visualize disease patterns in space comes at a cost to patients' privacy.

The first hypothetical example concerns a researcher who (expanding on the insights in Gordon-Larsen et al., 2006) undertakes a project that includes a survey of adolescent behavior, including exercise and eating habits, in order to understand the causes of obesity in the teenage population. In addition to asking about how the research subjects get to school and the availability of places to walk and exercise, the researcher takes GPS readings of their homes and schools, and asks them to wear a device that tracks their location during waking hours for 1 week. Because of the complexity of the problem, the researcher asks about drug and alcohol consumption in addition to food consumption. Finally, the information obtained from the participants is merged with detailed maps of the communities in which they live in order to know the location of specific kinds of places and the routes between them.
In the second example, a researcher interested in the effects of family size on land use and resource consumption in south Asia conducts a survey that asks each family about their reproductive and health history, as well as detailed questions about the ways that they obtain food and fuel. Then, walking in the community with family representatives, the researcher takes GPS readings of the locations of the families' farm plots and the areas where they gather wood for heat and cooking. Finally, the researcher spends a day with the women and children in the families as they go about gathering fuelwood, wearing a GPS-based tracking device so that the location and timing of their activities can be recorded. Some of these locations are outside the sanctioned areas in which the family is legally permitted to gather fuel.

In both hypothetical examples, the linking of the social data gathered from the participants and the spatial data will permit identification of some or all of the participants. Yet the researchers have made promises of confidentiality, which state that the data will only be analyzed by qualified researchers and that the participants will never be identified in any publication or presentation. At the same time, both the sponsor of the research and research ethics require that the researchers make their data available to other researchers for
replication and for new research. Both surveys include questions about activities outside officially sanctioned behavior, and the answers, if linked to an individual respondent and revealed, might cause that respondent harm.

In both hypothetical examples, the locational information is essential to the value of the data, so the researchers may not simply discard or modify data items that could lead to identification. Rather, they must either choose between honoring the requirement to share data and honoring the commitment to protect confidentiality, or somehow find a way to do both.

Sharing data is not by itself automatically harmful to research participants. Responsible researchers regularly analyze data that include confidential information, and do so without compromising the promises that were made when the data were collected. The challenge arises when the data are shared with secondary researchers, who must either guarantee that they will adhere to the promise of confidentiality made by the original researcher, or receive data that are stripped of useful identifying information. The goal is to make sure that responsible secondary users do not reveal respondent identities, and do not share the data with others who might do so. But locational information may also make it possible for a secondary researcher to identify research participants by linking to data from other sources, without requesting permission for that information.

Some recent research suggests that it is possible to gauge social, demographic, and economic characteristics from remote sensing data alone (Cowen and Jensen, 1998; Cowen et al., 1993; Weeks, Larson, and Fugate, 2005), but this suggestive idea is unproven and would require considerable supporting research to overcome the challenge that the data are of limited value and have a high likelihood of error.
Identifying social attributes from Earth-observing satellites is not easy, but satellite data, particularly from high-resolution satellites (launched since the late 1990s), make the identification of particular anthropogenic features (roads, buildings, infrastructure, vehicles) much easier than previously.3 Other forms of spatial data, such as aerial photographs, especially historic ones, are much less likely to be accurately georeferenced (if georeferenced at all) for fine-scale matching with other attributes, but may nevertheless foster identification.

Spatial data create the possibility that confidentiality may be compromised indirectly by secondary data users in ways that identify individual participants.4 Those ways relate to the spatial context of observations and the spatial covariance that exists among variables. Spatial covariance refers to the tendency of the magnitude of variables to be arranged systematically across space. For example, the locations of high values of one variable are often associated systematically with high values (or with low values) of another variable. Thus, if the spatial covariance structure between variables is known, and the value for one variable is also known, an estimate of the other variable can be made, along with an estimate of error. This knowledge can be applied in several ways, such as interpolation and contextual analyses associated with process models.

3 A review of satellites, their spatial and temporal resolutions and coverage, and detectable features can be found at http://sedac.ciesin.columbia.edu/tg/guide_frame.jsp?rd=RS&ds=1 [accessed January 2007].

4 Confidentiality issues rarely, if ever, arise for spatial data when unlinked to social data. Much spatial data are in the public domain, and the Supreme Court has ruled that privacy rights do not exist for observations made from publicly navigable airspace (see Appendix A).

Interpolation methods can be placed into two classes: exact and approximate (Lam, 1983). Exact methods enforce the condition that the interpolated surface will pass through the observations. Approximate methods use the data points to fit a surface that may pass above or below the actual observations. Kriging is a widely used exact method in which the link between location (x,y) and value of the observation (z) is preserved. Kriging, therefore, threatens confidentiality because it exactly reproduces data values for each sample point: if the spatial location of sample data points is known, the linked values of other variables can be revealed (Cox, 2004). Kriging also provides the analyst with an assessment of the error at each point.

Contextual data are sometimes used to facilitate analysis when detailed exact data are either too sensitive for release or unavailable. However, contextual data can themselves be identifying; for example, a sequence of daily air quality monitoring readings from the nearest monitor provides a complete "signature" for each monitor, revealing fairly precise locations for individuals whose data are linked to such air quality readings. Knowledge about context can also be used to infer locations when deterministic spatial process models are used.
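
The exactness property attributed to kriging above can be checked numerically. The following is a minimal sketch, not a method taken from the sources cited in this chapter: it assumes ordinary kriging with an exponential covariance function and no nugget term, and the function name `ordinary_kriging`, the parameters `sill` and `corr_range`, and the five sample points are all hypothetical, chosen only for illustration.

```python
import numpy as np

def ordinary_kriging(xy, z, targets, sill=1.0, corr_range=2.0):
    """Predict z at `targets` by ordinary kriging (illustrative sketch).

    Assumes an exponential covariance and no nugget term; the absence
    of a nugget is what makes the predictor exact at the sample points.
    """
    xy = np.asarray(xy, dtype=float)
    z = np.asarray(z, dtype=float)
    targets = np.asarray(targets, dtype=float)
    n = len(z)

    def cov(h):
        # Covariance decays with separation distance h.
        return sill * np.exp(-h / corr_range)

    # Kriging system: sample-to-sample covariances, bordered by the
    # unbiasedness constraint that the weights sum to one.
    dists = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    system = np.ones((n + 1, n + 1))
    system[:n, :n] = cov(dists)
    system[n, n] = 0.0

    predictions = np.empty(len(targets))
    for i, t in enumerate(targets):
        rhs = np.ones(n + 1)
        rhs[:n] = cov(np.linalg.norm(xy - t, axis=1))
        weights = np.linalg.solve(system, rhs)[:n]
        predictions[i] = weights @ z
    return predictions

# Hypothetical sample: five locations, each with a sensitive value.
sample_xy = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 2.0], [0.5, 3.0], [3.0, 1.0]])
sample_z = np.array([12.0, 7.5, 3.0, 9.0, 15.5])

# Querying the surface at the sample locations returns the sample values.
at_samples = ordinary_kriging(sample_xy, sample_z, sample_xy)
print(np.allclose(at_samples, sample_z))
```

Under these assumptions, the values predicted at the sample locations agree with the original observations to within floating-point error. A released kriged surface, combined with knowledge of where the samples were taken, would therefore disclose the sampled values themselves.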
Studies of the human effects of air pollution may use such models to study atmospheric dispersion of harmful substances. Given a model and a set of input parameters, such as wind speed, direction, temperature, and humidity, results are reported in the form of a plume "footprint" of dispersion (see, e.g., Chakraborty and Armstrong, 2001). If the location of a pollution source is known, along with the model and its parameters, a result from the model can be used to reveal the locations of participants in the dataset, who can then be identified, along with the confidential information they provided for the dataset.

DATA QUALITY, ACCESS, AND CONFIDENTIALITY: TRADEOFFS

More precise and accurate data are generally more useful for analysis. For analysis of social and spatial relationships, accuracy and precision in the spatial data are often crucial. However, having such data increases the chances that research participants can be identified, thus breaking researchers' promises of confidentiality. In general, as data with detailed locational
information about participants become more widely accessible, the risk of a confidentiality breach increases. The problem of tradeoffs involving data quality, access, and confidentiality is becoming more urgent because of two recent trends. One is increased demands from research funders, particularly federal agencies, for improving data access so as to increase the scientific benefit derived from a relatively fixed investment in data collection. The other is the continuing improvement in computer technologies generally, and especially in techniques for mining datasets, which can be used not only to provide more detailed understanding of social phenomena, but also to identify research participants despite researchers' promises of confidentiality. The current context and a consideration of the ethical, legal, and statistical issues are discussed in Chapter 2.

This report also addresses ways to solve the problem of increasing the value of linked social-spatial data, both to the original researchers and to potential secondary users, while at the same time keeping promises of confidentiality to research participants. Chapter 3 examines several methods available for dealing with the problem. They can be roughly classified as technical and institutional, and each has significant limitations.

Both technical and institutional approaches limit the amount of data available, the usefulness of the data for research, or the ways that researchers can access those data in return for increased protection of pledges of confidentiality. Most researchers believe that those restrictions have had a negative effect on the amount and value of research that has been done, but there is relatively little solid evidence about the quantity of research not performed for this reason. It is not surprising that such negative evidence does not exist, and its absence does not prevent us from recommending improvements.
At the workshop organized by the panel we heard testimony from users of data enclaves about the ways that the arduous rules of those institutions limited research. In addition, there was interesting testimony submitted at the time of the preparation of the 2000 U.S. census that documented research that could not be conducted because of variables and values that the Census Bureau proposed to remove from the Public Use Microdata Samples in order to reduce the risk of identification (Minnesota Population Center, 2000). The lack of readily accessible data about anything smaller than quite large areas does limit research. Research is not being done on certain topics that require knowledge of locations because the data are not available or access is difficult.

Some of the technical approaches involve changing data in various ways to protect confidentiality. One is to mask locations by shifting them randomly. This approach helps protect against identification, but makes the data less useful for understanding the spatial phenomena that justified creating the linked dataset in the first place: the significance of location of places (such as home and work) for the social conditions of interest. Researchers and data stewards need to be sensitive to linkages of data that are masked in order to avoid conclusions based on an overestimation of the accuracy of data that have been changed in some way.

Institutional approaches include restrictions on access to the data. The notion of tiers of access to data means that there is a gradient of accessibility: data that create the greatest risk of identification are least available and those with the lowest risk are the most available. At the same time, many analyses will only be possible with data that have the highest risk of disclosure and harm and therefore will be the least available.

The seriousness of these tradeoffs, in terms of the likelihood of identification or disclosure and of the potential for harm to research participants, depends on attributes of the research population, the information in the dataset, the contexts of inadvertent disclosure, and the motives of secondary users who may act as "data spies" (Armstrong et al., 1999) in relation to the dataset, as well as on the strategy used to protect confidentiality. Most of these factors apply regardless of whether the data include spatial information, but the availability of spatial characteristics of the research population can affect the seriousness of the tradeoffs. For example, a highly clustered sample of school-age students (with school as the primary sampling unit and with geographic identifiers) is more identifiable and more open to risk of harm than a nationally scattered sample of adults, especially if the data collected include information about social networks.5 Many nonspatial factors can also affect disclosure risk. For example, questions about individuals' attitudes (what do you think about "x") are less likely to increase disclosure risk than questions about easily known characteristics of family or occupation (age, number of children, occupation, distance to place of employment).
At the same time, some questions, if identification occurs, are more likely to be harmful than others, with a question about drug use more likely to cause harm than a question about retirement planning. Finally, the seriousness of the tradeoffs may depend on the identities and motives of secondary users. At present, little is known about such users, what they might want, the conditions under which they might seek what they want from a confidential dataset, the extent to which what they want would lead to identification of research participants and their attributes, or the techniques that they might use (see, e.g., Duncan and Lambert, 1986b; Armstrong et al., 1999).

5 Because social networks locate individuals within a social space, releasing social network data involves risks analogous to the risks related to spatial network data discussed in this report. For discussions of ethical issues in social network research, see Borgatti and Molina (2003), Breiger (2005), Kadushin (2005), and Klovdahl (2005).

It is possible for the linkage of social and spatial data to create significant risks of harm to research participants. For example, it has been claimed that the Nazis used maps and tabulations of "Jews and Mixed Breeds" to round up people for concentration camps (Cox, 1996) and that the U.S. government used special tabulations of 1940 census data to locate Japanese Americans for internment (Anderson and Fienberg, 1997). Improvements in the precision of spatial data and advances in geocoding are likely to lower the costs of identifying people for such purposes. We note, however, that risks of identification and harm by governments or other organizations with strong capabilities for tracking people and mining datasets exist even if social data are not being collected under promises of confidentiality. The key issue for this study concerns the incremental risks of linking confidential social data to precise spatial information about research participants.

Among secondary users who might seek information about particular individuals, those who know that another person is likely or certain to be included in a database (e.g., a parent knowing that a child was studied or one spouse knowing about another) have a much easier time identifying a respondent than someone who starts without that knowledge. Experts suspect that although those who know which participant they are looking for may be interested in harming that individual, they are unlikely to be interested in harming the entire class of participants or the research process itself. The benefit-risk tradeoffs created by linked social-spatial data are a major challenge for research policy.