Protecting and Accessing Data from the Survey of Earned Doctorates: A Workshop Summary
Partially synthesized data. Reiter pointed out that the NSF solution protects against reidentification based only on field, but it fails to protect against identification by a colleague or self-identification. He was concerned that, if a colleague knows a respondent's field, year, and gender (or similar sensitive information), the data cannot be fully protected.
One means of protecting against this problem is to partially synthesize only the small cells that are most susceptible to disclosure, publishing simulated data for the 4 percent of cells that are most sensitive. The simulated data would look and behave like actual data, standard statistical procedures would still yield valid inferences, and longitudinal series could be preserved. This methodology would avoid a problem inherent in the “cutoff of 25” rule: a cell would no longer be published in one year and then disappear in another because the cutoff was not attained.
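The instability of a fixed cutoff can be illustrated with a minimal sketch. The counts, the years, and the function name `publish_with_cutoff` are hypothetical, and the sketch assumes that a cell is published when its count is at least 25; the workshop summary does not state whether the threshold is inclusive.

```python
CUTOFF = 25  # the "cutoff of 25" rule; treating it as inclusive is an assumption


def publish_with_cutoff(counts_by_year, cutoff=CUTOFF):
    """Return the published series for one cell.

    A year's count is kept if it meets the cutoff; otherwise the cell is
    suppressed for that year (represented here as None).
    """
    return {
        year: (count if count >= cutoff else None)
        for year, count in counts_by_year.items()
    }


# Hypothetical counts for one small cell hovering near the cutoff.
cell = {2004: 27, 2005: 24, 2006: 26}
published = publish_with_cutoff(cell)

for year, value in published.items():
    print(year, "suppressed" if value is None else value)
```

In this illustration the middle year drops out of the series even though the underlying count changed by only a few doctorates, which is the gap in longitudinal series that partial synthesis is meant to avoid.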
However, as several participants pointed out, users of the data need actual counts for policy evaluation and other purposes, and synthetic data would not suit their needs. The use of synthetic data was compared to data perturbation, with the important difference that the aggregates would be unaffected. Schneider suggested that using synthetic data might lead to a further problem: a dual approach in which, because some users want the real data, different representations of the same data cells would be generated from the restricted data sets, and two sets of numbers, synthetic and real, would be published.
Concentration criteria. A possible solution to the problem of colleague identification, as suggested by Stephen Cohen, would be to set criteria for concentrations. This approach would be similar to the rules now used by some government agencies, which publish data for a cell only when the variable of interest is not concentrated in a few companies. In this case, the concentration rules could be based on the number of schools that contribute doctorates to the field.
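One way such a criterion might be expressed is as a dominance check on the schools contributing to a field. The sketch below is an assumption, not the rule Cohen proposed: it uses an (n, k) dominance rule of the kind applied to establishment data, with hypothetical parameters `n=2` and `k=0.8`, a hypothetical function name, and made-up school counts.

```python
def is_concentrated(doctorates_by_school, n=2, k=0.8):
    """Flag a field as concentrated under an (n, k) dominance rule.

    The field is flagged when the n largest contributing schools account
    for more than a share k of all doctorates in the field. Both n and k
    are illustrative values, not parameters taken from the workshop.
    """
    total = sum(doctorates_by_school.values())
    if total == 0:
        return False  # nothing to disclose in an empty field
    top_n = sorted(doctorates_by_school.values(), reverse=True)[:n]
    return sum(top_n) / total > k


# Hypothetical fields: one dominated by a single school, one spread evenly.
dominated_field = {"School A": 9, "School B": 2, "School C": 1}
dispersed_field = {"School A": 4, "School B": 4, "School C": 4, "School D": 4}

print(is_concentrated(dominated_field))  # top two schools hold 11 of 12
print(is_concentrated(dispersed_field))  # top two schools hold 8 of 16
```

Under a rule of this form, a field would be aggregated or withheld when it is flagged, regardless of whether its total count clears a size cutoff.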
In the discussion that followed Reiter’s presentation, Cohen gave an example of how a smart intruder with some knowledge could identify a respondent through the published tables. The NSF contractor was able to find a match in the SED using Google Scholar, dissertation abstracts, and a faculty photograph on a departmental website, from which the candidate’s gender and race were surmised. The match was judged to be correct. This exercise lent support to the aggregation decision.