Disclosure Risks in Social Science Data
Alter briefly noted some of the factors that are increasing the risks of disclosure in social science data and, in particular, in data from which direct identifiers have been removed so that the data seem to be anonymous. As other speakers had noted, even with de-identified data it can be possible to identify individuals from the information that remains in the dataset, and several trends are increasing concerns about re-identification. One is that more and more research is being done with geocoded data, he noted. Another is the increasing use of longitudinal datasets, such as the Health and Retirement Study, “where the accumulation of information about each individual makes them identifiable,” he said. Finally, many datasets have multiple levels (data on student, teacher, school, and school district, for example, or on patient, clinic, and community), which can make it possible to identify individuals by working down from the higher levels.
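To make the re-identification concern concrete, the following minimal Python sketch counts how many records become unique as more of the remaining variables are combined. The records, the field names (`birth_year`, `sex`, `tract`), and the helper `unique_share` are hypothetical illustrations, not anything presented at the workshop.

```python
from collections import Counter

# Hypothetical records with direct identifiers already removed (not real data).
rows = [
    {"birth_year": 1950, "sex": "F", "tract": "001"},
    {"birth_year": 1950, "sex": "F", "tract": "002"},
    {"birth_year": 1950, "sex": "M", "tract": "001"},
    {"birth_year": 1962, "sex": "F", "tract": "002"},
]

def unique_share(rows, keys):
    """Return the fraction of rows whose combination of key values is unique."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    singles = sum(1 for r in rows if counts[tuple(r[k] for k in keys)] == 1)
    return singles / len(rows)

# Each added variable, including a geocode, isolates more individuals.
for keys in (["birth_year"], ["birth_year", "sex"], ["birth_year", "sex", "tract"]):
    print(keys, unique_share(rows, keys))
```

In this toy dataset the share of unique records rises as variables are added, which is the same effect that accumulating waves of a longitudinal study or attaching geographic codes can have on real data.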
Protecting Confidential Data
With respect to protecting confidential data, Alter said, it is useful to think in terms of a framework with four different but complementary approaches to protecting confidentiality: safe data, safe places, safe people, and safe outputs. “You can approach making data safe in all of these different ways,” he said, “and in general we try to do some of each.”
It is possible to take steps to make data safer both before and after they are collected. Before data are collected, one can design studies in such a way that disclosure risks are reduced. Researchers can, for example, carry out their studies at multiple sites because “a study that is designed in one location, especially when that location is known, is much more difficult to protect from disclosure risk than a national survey or a survey in multiple sites.” Researchers can also work to keep the sampling locations secret, releasing characteristics of the contexts without providing locations.
After the data are collected, there are many procedures that researchers can use to make the data more anonymous. They can group values, for instance, or aggregate over geographical areas. They can suppress unique cases or swap values, and there are a variety of more intrusive approaches, such as adding noise to the data or replacing real data with synthetic data that preserve the statistical relationships in the original data.
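Three of the procedures named above (grouping values, aggregating geography, and suppressing unique cases), plus simple noise addition, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a description of any archive's actual practice: the records, the county-to-region mapping, the 10-year age bands, the minimum cell size of 2, and the noise scale are all hypothetical. Value swapping and synthetic data are not shown.

```python
import random
from collections import Counter

# Hypothetical survey records (not real data).
records = [
    {"age": 34, "county": "A", "income": 52000},
    {"age": 37, "county": "B", "income": 48000},
    {"age": 36, "county": "A", "income": 61000},
    {"age": 71, "county": "C", "income": 30000},
    {"age": 33, "county": "C", "income": 45000},
]
# Hypothetical mapping from small areas to larger regions.
region_of = {"A": "North", "B": "North", "C": "South"}

def group_age(age, width=10):
    """Group an exact age into a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def coarsen(record):
    """Replace exact age and county with an age band and a larger region."""
    return {
        "age_band": group_age(record["age"]),
        "region": region_of[record["county"]],
        "income": record["income"],
    }

def suppress_unique(rows, keys, min_count=2):
    """Drop rows whose combination of key values occurs fewer than min_count times."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    return [r for r in rows if counts[tuple(r[k] for k in keys)] >= min_count]

def add_noise(rows, field, scale, rng):
    """Return copies of rows with zero-mean Gaussian noise added to one numeric field."""
    return [{**r, field: r[field] + rng.gauss(0, scale)} for r in rows]

coarse = [coarsen(r) for r in records]
safe = suppress_unique(coarse, ["age_band", "region"])
noisy = add_noise(safe, "income", 1000.0, random.Random(0))
```

Each step trades analytic detail for protection: here the two records that remain unique after coarsening are dropped entirely, and the perturbed incomes no longer match the reported values exactly.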