4
The Tradeoff:
Confidentiality Versus Access
The previous three chapters describe the challenge of preserving confi-
dentiality while facilitating research in an era of increasingly detailed and
available data about research participants and their geographic locations.
This chapter presents the committee’s conclusions about what can—and
cannot—be done to achieve two goals: ensure that both explicit and im-
plied pledges of confidentiality are kept when social data are made spatially
explicit and provide access to important research data for analysts working
on significant basic and policy research questions. Following our conclu-
sions, we offer recommendations for data stewards, researchers, and re-
search funders.
CONCLUSIONS
Tradeoffs of Benefits and Risks
Recognition of the Benefits and Risks. Making social data spatially explicit
creates benefits and risks that must be considered in ethical guidelines and
research policy. Spatially precise and accurate data about individuals,
groups, or organizations, added to data records through processes of
geocoding, make it possible for researchers to examine questions they could
not otherwise explore and gain better understanding of human actors in
their physical and environmental contexts, and they create benefits for
society in terms of the knowledge that can flow from that research.
60 PUTTING PEOPLE ON THE MAP
CONCLUSION 1: Recent advances in the availability of social and
spatial data and the development of geographic information systems
(GIS) and related techniques to manage and analyze those data give
researchers important new ways to study important social, environ-
mental, economic, and health policy issues and are worth further
development.
Sharing of linked social-spatial data among researchers is imperative to
get the most from the time, effort, and money that go into obtaining the
data. However, to the extent that data are spatially precise and accurate,
the risk increases that the people or organizations that are the subject of the
data can be identified. Promises of confidentiality that are normally pro-
vided for research participants and that can be kept when data are not
linked could be jeopardized as a result of the data linkage, increasing the
risk of disclosure and possibly also of harm, particularly when linked data
are made available to secondary data users who may, for example, combine
the linked data with other spatially explicit information about respondents
that enables new kinds of analysis and, potentially, new kinds of harm.
These risks affect not only research participants, but also the scientific
enterprise that depends on participants’ confidence in promises of confiden-
tiality.
Researchers’ Obligations. Researchers who collect or undertake secondary
analysis of linked social-spatial data and organizations that support re-
search or provide access to such data have an ethical obligation to maxi-
mize the benefits of the research and minimize the risk of breaches of
confidentiality to research participants. This obligation exists even if legal
obligations are not clearly defined. Those who collect, analyze, or provide
access to such data need to articulate strong data protection plans, stipulate
conditions of access, and safeguard against possible breaches of confidenti-
ality through all phases of the research—from data collection through dis-
semination. Protecting against any breach of confidentiality is a priority for
researchers, in light of the need to honor confidentiality agreements be-
tween research participants and researchers, and to support public confi-
dence in the integrity of the research.
The Tradeoff of Confidentiality and Access. Restricting data access affords
the highest protection to the confidentiality of linked social-spatial data
that include exact locations. However, the costs to science are high. If
confidentiality has been promised, common public-use forms of data distri-
bution create unacceptable risks to confidentiality. Consequently, only more
restrictive forms of data management and dissemination are appropriate,
including extensive data reduction, strong licenses, and data center (en-
clave) access. When the precise data are available only in data enclaves,
many researchers simply do not use the datasets, so research that could be
done is not undertaken. Improved methods for providing remote access to
enclave data require research and development efforts.
CONCLUSION 2: The increasing use of linked social-spatial data has
created significant uncertainties about the ability to protect the confi-
dentiality promised to research participants. Knowledge is as yet inad-
equate concerning the conditions under which and the extent to which
the availability of spatially explicit data about participants increases
the risk of confidentiality breaches.
The risks created by the availability and publication of such information
add to the better-known risks associated with other publication-related
breaches of confidentiality, such as the publication of the names or
locations of primary sampling units or of specific tabular cell sizes. For
example, cartographic materials are often used in publications to illustrate
points or findings that do not lend themselves as easily to tabular or text
explication: what is not yet understood are the conditions under which they
also increase the ability to identify a research participant.
Technical Strategies for Reducing Risk
Cell Suppression, Data Swapping, and Aggregation. Cell suppression and
data swapping techniques can protect confidentiality, but they seriously
degrade the value of data for analyses in which spatial information is
essential. Aggregation can provide adequate protection and preserves analy-
sis at a level of aggregation, but it renders data useless when exact locations
are required. Hence, aggregation has merit for data that have low levels of
risk and are slated for public-use dissemination, but not for data that will
be used for analyses that require exact spatial information.
When analyses require exact locations, essentially all observations are
the equivalent of small cells in a statistical table: cell suppression would
therefore be tantamount to destroying the spatial component of the data.
Suppressing nonspatial attributes leaves so much missing information that
the data are difficult to analyze. Swapping exact locations may not prevent
identifications and can create serious distortions in analysis when a location
or a topological relationship is a critical variable. Swapping nonspatial
attributes to limit attribute disclosure risk may need to be done at so high a
rate that the associations in the data are badly attenuated. Suppression or
swapping can be used to preserve confidentiality when analyses require
inexact levels of geography, but aggregation is a superior approach in these
cases because it preserves analyses at those levels. Aggregation makes it
impossible to perform many types of analyses, and when it is used it can
lead to ecological inference problems.
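As a rough illustration of the aggregation and suppression ideas above, the following Python sketch bins point records into square grid cells and withholds counts that fall below a threshold. The cell size, the threshold of 5, the sample coordinates, and the function names are assumptions made for illustration; none of them comes from this report.

```python
from collections import Counter
import math

def aggregate_to_grid(points, cell_size):
    """Count records per square grid cell; exact coordinates are discarded."""
    counts = Counter()
    for x, y in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell] += 1
    return counts

def suppress_small_cells(counts, min_count=5):
    """Replace any count below min_count with None (a suppressed cell)."""
    return {cell: (n if n >= min_count else None) for cell, n in counts.items()}

# Hypothetical coordinates in arbitrary planar units.
points = [(0.2, 0.3), (0.8, 0.1), (0.5, 0.9), (0.4, 0.4), (0.6, 0.7), (3.1, 3.2)]
released = suppress_small_cells(aggregate_to_grid(points, cell_size=1.0))
```

In this toy case the five clustered records are released as one count and the isolated record is suppressed. If the cell size shrank until each record sat alone in its cell, the same threshold would suppress every cell, which is the sense in which exact locations behave like small cells in a table.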
Data Alteration. Data alteration methods, such as geographic masking or
adding noise to sensitive nonspatial attributes, may improve confidentiality
protection but at the expense of data quality. Altering data to mask precise
spatial locations impedes the ability of researchers to calculate accurate
spatial relationships, such as distances, directions, and inclusion of loca-
tions within an enumeration unit (e.g., a census tract). There is a tradeoff
between the magnitude of any masking displacement and the correspond-
ing utility of an observation for a particular use. Decisions about this
tradeoff affect the risk of a breach of confidentiality. A mask may also be
applied to nonspatial attributes associated with known locations: this might
be done when knowledge about the magnitude of an attribute, along with
knowledge about a generating process (such as a deterministic model of
toxic emissions), could enable the recovery of a location that could then be
linked to other information.
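The tradeoff between displacement magnitude and utility can be made concrete with a small sketch. The Python code below applies one simple form of geographic masking, a random displacement within a fixed radius, and measures how much a distance calculation changes as a result. The function names, the radius, and the two locations are hypothetical; real masking schemes vary and this is not presented as the method of any particular data steward.

```python
import math
import random

def mask_location(x, y, max_displacement, rng):
    """Move a point a random distance (at most max_displacement) in a random direction."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    # The square root spreads masked points evenly over the disc's area
    # instead of concentrating them near the true location.
    radius = max_displacement * math.sqrt(rng.random())
    return x + radius * math.cos(angle), y + radius * math.sin(angle)

def distance(a, b):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

rng = random.Random(0)  # fixed seed so the sketch is repeatable
home, clinic = (0.0, 0.0), (3.0, 4.0)  # hypothetical locations; true distance is 5.0
masked_home = mask_location(home[0], home[1], max_displacement=1.0, rng=rng)
error = abs(distance(masked_home, clinic) - distance(home, clinic))
```

The error in any single distance is bounded by the displacement radius, so a larger radius gives more protection and less accurate spatial relationships. Inclusion in a small enumeration unit can change even under a modest displacement.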
Synthetic Data. Synthetic data approaches may have the potential to provide
access to data with exact spatial identifiers while preserving confidentiality.
There is insufficient evidence at present to determine how well this
approach preserves the social-spatial relationships of interest to research-
ers. In addition, with current technologies, it is very difficult for data stew-
ards to create analytically valid synthetic datasets. The goal of synthetic
data approaches is to protect confidentiality while preserving certain rela-
tionships in the data. This approach depends on data simulation models
that capture the relationships among the spatial and nonspatial variables.
The effectiveness of such models has not been fully demonstrated across a
wide range of analyses and datasets. For example, it is not known how well
these models can preserve distance and topological relationships. It is also
not known whether and how the various synthetic data approaches can be
applied when linking datasets.
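To show what a data simulation model means in the simplest possible case, the Python sketch below fits a straight line relating a sensitive attribute to one spatial covariate and then replaces the real attribute values with draws from the fitted model. This is a deliberately minimal stand-in for partially synthetic data, not a method described in this report; the variable names and sample values are invented, and practical synthesizers use far richer models.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = mean_y - b * mean_x
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    residual_sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return a, b, residual_sd

def synthesize(xs, a, b, residual_sd, rng):
    """Draw a synthetic attribute for each record from the fitted model."""
    return [a + b * x + rng.gauss(0.0, residual_sd) for x in xs]

# Hypothetical records: distance to a site and a sensitive attribute.
distances = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
values = [9.8, 8.1, 6.2, 3.9, 2.1, 0.2]

a, b, residual_sd = fit_line(distances, values)
synthetic_values = synthesize(distances, a, b, residual_sd, random.Random(1))
```

The synthetic values keep only the relationship the model encodes, here a linear trend with distance. Any spatial structure the model omits, such as neighborhood or topological relationships, is lost, which is why the adequacy of such models across analyses remains an open question.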
Secure Access. Techniques for providing secure access to linked data, such
as sharing sums but not individual values or conducting data analyses on
request and returning the results but not the data, may have the potential to
provide results from spatial analyses without revealing data values. These
approaches are not yet extensively used by stewards of spatial data, and
their feasibility for social and spatial data is unproven. They are com-
putationally intensive and require expertise that is not available to many
data stewards. The value of some of these methods is limited by restric-
tions on the total number of queries that can be performed before queries
could be combined to identify elements in the original data.
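The query-limit concern can be seen in a toy example. The Python sketch below models a server that releases only sums over groups of records and enforces a query budget. The class name, the budget, the minimum group size, and the data are all hypothetical, and the design is an illustration rather than a description of any existing secure access system.

```python
class SumServer:
    """Holds confidential values and answers only sum queries, up to a budget."""

    def __init__(self, values, max_queries, min_group_size=3):
        self._values = dict(values)
        self._remaining = max_queries
        self._min_group_size = min_group_size

    def sum_for(self, record_ids):
        ids = set(record_ids)
        if self._remaining <= 0:
            raise PermissionError("query budget exhausted")
        if len(ids) < self._min_group_size:
            raise ValueError("group too small to release")
        self._remaining -= 1
        return sum(self._values[i] for i in ids)

# Hypothetical confidential values keyed by record identifier.
server = SumServer({"a": 10, "b": 20, "c": 30, "d": 40}, max_queries=2)
total = server.sum_for(["a", "b", "c", "d"])
without_d = server.sum_for(["a", "b", "c"])
recovered_d = total - without_d  # two permitted sums reveal one record's value
```

Both queries pass the group-size check, yet their difference exposes the value for record "d". Budgets and overlap checks limit this kind of differencing, at the cost of limiting how much legitimate analysis a user can do.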
CONCLUSION 3: Recent research on technical approaches for reduc-
ing the risk of identification and breach of confidentiality has demon-
strated promise for future success. At this time, however, no known
technical strategy or combination of technical strategies for managing
linked social-spatial data adequately resolves conflicts among the ob-
jectives of data linkage, open access, data quality, and confidentiality
protection across datasets and data uses.
In our judgment, it will remain difficult to reconcile these conflicting
objectives by technical strategies alone, though efforts to identify effective
methods and procedures should continue. It is likely that different methods
and procedures will be optimal for different applications and that the best
approaches will evolve with the data and with techniques for protecting
confidentiality and for identifying respondents.
Institutional Approaches
CONCLUSION 4: Because technical strategies will not be sufficient
in the foreseeable future for resolving the conflicting demands for data
access, data quality, and confidentiality, institutional approaches will
be required to balance those demands.
Institutional approaches involve establishing tiers of risk and access
and producing data-sharing solutions that match levels of access to the risks
and benefits of the planned research. Institutional approaches must address
issues of shared responsibility for the production, control, and use of data
among primary data producers, secondary producers who link additional
information, data users of all kinds, research sponsors, IRBs, government
agencies, and data stewards. It is essential that the power to decide about
data access and use be allocated appropriately among these responsible
actors and that those with the greatest power to decide are highly informed
about the issues and about the benefits and risks of the data access policies
they may be asked to approve. It is also essential that users of the data bear
the burden of confidentiality protection for the data they use.
RECOMMENDATIONS
We generally endorse the recommendations of two reports, Protecting
Participants and Facilitating Social and Behavioral Sciences Research (Na-
tional Research Council, 2003) and Expanding Access to Research Data:
Reconciling Risks and Opportunities (National Research Council, 2005a)
regarding general issues of confidentiality and data access. It is important to
note that the recommendations in those reports address only data collected
and held by federal agencies, and they do not deal with the special issues
that arise when social and spatial data are linked. This report extends those
recommendations to include the large body of data that are collected by
individual researchers and academic and research organizations and held at
universities and other public research entities. It also addresses the need for
research sponsors, research organizations such as universities, and research-
ers to pay special attention to data that record exact locations.
In particular, we support several key recommendations of these re-
ports:
• Access to data should be provided “through a variety of modes,
including various modes of restricted access to confidential data and unre-
stricted access to public-use data altered in a variety of ways to maintain
confidentiality” (National Research Council, 2005a:68).
• Organizations that sponsor data collection should “conduct or spon-
sor research on techniques for providing useful, innovative public-use data
that minimize the risk of disclosure” (National Research Council, 2005a:72)
and continue efforts to “develop and implement state-of-the-art disclosure
protection practices and methods” (National Research Council, 2003:4).
• Organizations that sponsor data collection “should conduct or spon-
sor research on cost-effective means of providing secure access to confiden-
tial data by means of a remote access mechanism, consistent with their
confidentiality assurance protocols” (National Research Council,
2005a:78).
• Data stewardship organizations that use licensing agreements should
“expand the files for which a license may be obtained [and] work with data
users to develop flexible, consistent standards for licensing agreements and
implementation procedures for access to confidential data” (National Re-
search Council, 2005a:79).
• Professional associations should develop strong codes of ethical con-
duct and should provide training in ethical issues for “all those involved in
the design, collection, distribution, and use of data collected under pledges
of confidentiality” (National Research Council, 2005a:84).
Some of these recommendations will not be straightforward to imple-
ment for datasets that link social and spatially explicit data. We therefore
elaborate on those recommendations for the special issues and tradeoffs
raised by linking social and spatial data.
Technical and Institutional Research
RECOMMENDATION 1: Federal agencies and other organizations
that sponsor the collection and analysis of linked social-spatial data—
or that support data that could provide added benefits with such link-
age—should sponsor research into techniques and procedures for dis-
seminating such data while protecting confidentiality and maintain-
ing the usefulness of the data for social and spatial analysis. This
research should include studies to adapt existing techniques from other
fields, to understand how the publication of linked social-spatial data
might increase disclosure risk, and to explore institutional mechanisms
for disseminating linked data while protecting confidentiality and main-
taining the usefulness of the data.
This research should include three elements. First, it should include
studies that focus on both adapting existing techniques and developing new
approaches in social science, computer science, geographical science, and
statistical science that have the potential to deal effectively with the prob-
lems of linked social-spatial data. The research should include assessments
of the disclosure risk, data quality, and implementation feasibility associ-
ated with the techniques, as well as seeking to identify ways for data stew-
ards to make these assessments for their data.
This line of research should include work on techniques that enable
data analysts to understand what analyses can be reliably done with shared
data. It should also include research on analytical methods that correct or
at least account for the effects of data alteration. Finally, the research
should be done through collaborations among data stewards, data users,
and researchers in the appropriate sciences. Among the most promising
techniques are spatial aggregation, geographic masking, fully and partially
synthetic data, and remote access model servers and other emerging methods
of secure access and secure record linkage.
Second, the research should include work to understand how the pub-
lication of spatially explicit material using linked social-spatial data might
increase disclosure risk and thus to increase sensitivity to this issue. The
research would include assessments of disclosure risk associated with carto-
graphic displays. It should involve researchers from the social, spatial, and
statistical sciences and would aim to better understand how the public
presentation of cartographic and other spatially explicit information could
affect the risk of confidentiality breaches. Efforts to increase sensitivity
to this issue should involve researchers, data stewards, reviewers, and
journal editors.
Third, the research should work on institutional mechanisms for dis-
seminating linked social-spatial data while protecting confidentiality and
maintaining the usefulness of the data for social and spatial analysis. This
research should include studies of modifications to traditional data enclave
institutions, such as expanded and virtual enclaves, and of modified licens-
ing arrangements for secondary data use. Direct data stewards, whether in
government agencies, academic institutions, or private organizations, should
participate in such research, which should seek to identify and examine the
effects of various institutional mechanisms and associated enforcement sys-
tems on data access, data use, data quality, and disclosure risk.
Education and Training
RECOMMENDATION 2: Faculty, researchers, and organizations in-
volved in the continuing professional development of researchers should
engage in the education of researchers in the ethical use of spatial data.
Professional associations should participate by establishing and incul-
cating strong norms for the ethical use and sharing of linked social-
spatial data.
Education is an essential tool for ensuring that linked social-spatial
data are organized and used in ways that balance the benefits of the data for
developing knowledge, the value of wide access to the data, and the need to
protect the confidentiality of research participants. Education and training,
both for students and as part of continuing education, require materials
that extrapolate from general ethical principles for data collection, mainte-
nance, dissemination, and access. These materials should include the ethical
issues raised by linked social-spatial data and, to the extent they are identi-
fied and accepted, best practices in the handling of these forms of data.
Organizations and programs involved in training members of institutional
review boards (IRBs) should incorporate attention to the benefits, uses, and
potential risks of linked social-spatial data.
Training in Ethical Issues
RECOMMENDATION 3: Training in ethical considerations needs to
accompany all methodological training in the acquisition and use of
data that include geographically explicit information on research par-
ticipants.
Education about how to collect, analyze, and maintain linked social-
spatial data, how to disseminate results without compromising the identi-
ties of individuals involved in the research, and how to share such data
consonant with confidentiality protections is essential for ensuring that
scientific gains from the capacity to obtain such information can be maxi-
mized. Graduate-level courses and professional workshops addressed to
ethical considerations in the conduct of research need to include attention
to social and spatial data; to enhance awareness of the ethical issues related
to consent, confidentiality, and benefits as well as risks of harm; and to
identify the best practices available to maximize the benefits from such
research while minimizing any added risks associated with explicit spatial
data. Similarly, institutes, courses, and programs focusing on spatial meth-
ods and their use need to incorporate substantive consideration of ethical
issues, in particular those related to confidentiality. Education needs to
extend to primary and secondary researchers, staffs of organizations en-
gaged in data dissemination, and institutional review boards (IRBs) that
consider research protocols that include linked social-spatial data.
Outreach by Professional Societies and Other Organizations
RECOMMENDATION 4: Research societies and other research orga-
nizations that use linked social-spatial data and that have established
traditions of protection of the confidentiality of human research par-
ticipants should engage in outreach to other research societies and
organizations less conversant in research with issues of human partici-
pant protection to increase their attention to these issues in the context
of the use of personal, identifiable data.
Expertise on outreach is not uniformly distributed across research dis-
ciplines and fields. Given the likely increased interest in using explicit spa-
tial data linked to other social data, funding agencies, scientific societies,
and related research organizations should take steps to ensure that exper-
tise in the conduct of research with human participants is broadly accessible
and shared. An outreach priority should be to develop targeted materials,
workshops, and short-course training institutes for researchers in fields or
subfields that have had little or no tradition of safeguarding personal,
identifiable information.
Research Design
RECOMMENDATION 5: Primary researchers who intend to collect
and use spatially explicit data should design their studies in ways that
not only take into account the obligation to share data and the disclo-
sure risks posed, but also provide confidentiality protection for human
participants in the primary research as well as in secondary research use
of the data. Although the reconciliation of these objectives is difficult,
primary researchers should nevertheless assume a significant part of
this burden.
Researchers need to consider the tradeoffs between data utility and
confidentiality at the very start of their research programs, when they are
making commitments to sponsors, designing procedures to obtain informed
consent, and presenting their plans to their IRBs. They should be mindful of
both potential benefits and potential harm and plan accordingly. Everyone
involved needs to understand that achieving a balance between benefits and
harms may turn out to be difficult, and at the very least it will require
innovative thinking, compromise, and partnership with others. It is impera-
tive to recognize that it may take a generation to find norms for sharing the
new kind of data and an equally long effort to ensure the safety of human
research subjects. If, for example, IRBs need to be continuously involved in
monitoring projects, they (and the researchers) should accept that role. If
researchers must turn their data over to more experienced stewards for
safe-keeping, that, too, will need to be acknowledged and accepted. Finally,
secondary researchers need to understand that access to confidential data
may involve difficulties, and plan their work accordingly.
Institutional Review Boards
RECOMMENDATION 6: Institutional Review Boards and their orga-
nizational sponsors should develop the expertise needed to make well-
informed decisions that balance the objectives of data access, confiden-
tiality, and quality in research projects that will collect or analyze
linked social-spatial data.
Given the rapidity with which advances are being made in collecting
and linking social and spatial data, maintaining appropriate expertise will
be an ongoing task. IRBs need to learn what they do not know and develop
plans to consult with experts when appropriate. Traditionally, IRBs have
concerned themselves more with the collection of data than its dissemina-
tion, but the heightened risks to confidentiality that arise from linking
social data to spatial data require increased attention to data dissemination.
Government agencies that sponsor research that requires the application
of the Common Rule, the Human Subjects Research Subcommittee of
the Executive Branch Committee on Research, and the Association for the
Accreditation of Human Research Protection Programs (AAHRPP) should
work together to convene an expert working group to address the issue of
social and spatial data and make recommendations for best practices.
Data Enclaves
RECOMMENDATION 7: Data enclaves deserve further development
as a way to provide wider access to high-quality data while preserving
confidentiality. This development should focus on the establishment of
expanded place-based enclaves, “virtual enclaves,” and meaningful pen-
alties for misuse of enclaved data.
Three elements are critical to this development. First, data producers,
data stewards, and academic and other research organizations should con-
sider expanding place-based (as opposed to virtual) data enclaves to hold
more extensive collections of social and spatial data. Currently, many such
data enclaves are maintained by a data producer (such as the U.S. Bureau
of the Census) and contain only the data produced by that organization or
agency. The panel’s recommendation proposes alternative models in which
organizations that store the research they produce also house social and
spatial datasets produced elsewhere or in which institutions that manage
multiple enclaves combine them into a single entity. This recommendation
may require that some agencies (e.g., the Census Bureau) obtain regulatory
or legislative approval in order to broaden their ability to manage re-
stricted data. This approach could make such data more accessible and
cost-effective for secondary researchers while also increasing the capacity
and sustainability of data enclaves. The main challenge is to work out
adequate confidentiality protection arrangements between data producers
and the stewards of expanded enclaves.
Second, “virtual enclaves,” in which data are housed in a remote loca-
tion but accessed in a secure setting by researchers at their own institution
under agreed rules, deserve further development. Virtual archives at aca-
demic institutions should be managed by their libraries, which have exper-
tise in maintaining the security of valuable information resources, such as
rare books and institutional archives. The Census Bureau has demonstrated
the effectiveness of such remote archives with the technology used for its
Research Data Centers, and Statistics Canada has created a system that is
more accessible (relative to the number of Canadian researchers)
through its Research Data Centre program (see
http://www.statcan.ca/english/rdc/index.htm). The extension of these approaches will reduce the
cost of access to research data if researchers and their home institutions
invest in construction and staffing and if principles of operation can be
agreed on. One key issue in the management of virtual or remote enclaves is
the location of the “watchful eye” that ensures that the behavior of re-
stricted data users follows established rules. In some cases, the observer will
be a remote computer or operator, while in others it will be a person
working at the location where the data user is working, for example, in a
college or university library.
Third, access to restricted data through virtual or place-based enclaves
should be restricted to those who agree to abide by the confidentiality
protections governing such data, and meaningful penalties should be en-
forced for willful misuse of the linked social-spatial data. High-quality
science depends on sound ethical practices. Ethical standards in all fields of
science require honoring agreements made as a condition of undertaking
professional work, whether those agreements are between primary
researchers and research participants or between researchers and a research
repository in the case of secondary use. Appropriate penalties might include
publication of reports of willful misuse, disbarment from future research
using restricted-access data, reduced access to federal research funding, and
mechanisms that would provide incentives to institutions that employ re-
searchers who willfully or carelessly misuse enclaved data so that they
enforce agreements to which they are party.
Licensing
RECOMMENDATION 8: Data stewards should develop licensing
agreements to provide increased access to linked social-spatial datasets
that include confidential information.
Licensing agreements place the burden of confidentiality protection on
the data user. Several aspects of licensing deserve further development.
First, nontransferable, time-limited licenses require the data user only to
ensure that his or her own use does not make respondents identifiable to
others or cause them harm and to return or destroy all copies of the data as
promised. However, to be effective, such agreements require strong incen-
tives for users to protect the confidentiality of the research participants.
Second, strong licensing, which requires data users to take special
precautions to protect the shared data, can make sensitive data more widely
available than has been the case to date. Data stewards who are responsible
for managing data enclaves or other restricted data centers, as well as
research sponsors who support research that can only be disseminated
under tight restrictions, should make these kinds of data as accessible as
possible. Strong licensing agreements provide an appropriate mechanism
for providing increased access in many situations.
Third, research planning should include mechanisms to facilitate data
use under license. Sponsors of primary research should ensure that plans
are developed at the outset, with sufficient resources provided (e.g., time to
do research, funds to pay for access) to prepare datasets that facilitate
analysis by secondary data users. Data sponsors and data stewards should
ensure that the plans for data access are carried through.
Fourth, explicit enforcement language should be included in contracts
and license agreements with secondary users setting forth penalties for
breaches of confidentiality and other willful misuse of the linked geospatial
and social data. Funding agencies and research societies with codes of ethics
should scrutinize confidentiality breaches that occur and take actions ap-
propriate to their roles and responsibilities.