6
Participant Views and Unresolved Issues
The workshop served to emphasize the tension between data access and confidentiality protection. It pointed out the need for the National Science Foundation (NSF) to continually solicit input from respondents and data users on decisions regarding the balance between access and protection.
As for the specific options for protecting the Survey of Earned Doctorates (SED) data, workshop participants expressed the view that NSF had made the correct decision in selecting an aggregation approach to limiting disclosure risk for the Race/Ethnicity, Gender, and Fine Field of Study Tables (called here the REG tables). In summarizing the workshop, Mark Schneider said that “we are now going in the direction of aggregation rules.” Jacob Bournazian’s overall assessment was that the data aggregation approach selected by NSF is compatible both with user needs and with future growth in data access. At the conclusion of his presentation, Jerome Reiter commended NSF for choosing an aggregation approach rather than a data suppression one.
Although most people at the workshop agreed with the approach selected by NSF, there were important caveats. Reiter, for example, agreed that suppression should be avoided but pointed out that aggregation, as envisioned in the NSF solution, also has some drawbacks. These drawbacks and other unresolved issues brought up in the general discussion period are outlined below.
Partially synthesized data. Reiter pointed out that the NSF solution protects against reidentification based on field alone. It does not protect against identification by a colleague or against self-identification. He was concerned that, if a colleague knows the field, the year, and the gender (or similar sensitive information), the data cannot be fully protected.
One means of protecting against this problem is to partially synthesize only the small cells that are most susceptible to disclosure, publishing simulated data for the 4 percent of cells that are most sensitive. The simulated data would look and behave like actual data, valid inferences could be derived from statistical procedures, and longitudinal series could be preserved. This methodology would avoid the problem inherent in the “cutoff of 25” rule, namely that a cell may be published one year and disappear in another year because the cutoff was not attained.
However, as several participants pointed out, the users of the data need actual counts for policy evaluation and other purposes, and synthetic data would not suit their purposes. The use of synthetic data was compared to data perturbation, with the important difference that the aggregates would be unaffected. Schneider suggested that using synthetic data might lead to a further problem: a dual approach. Because some users want the real data, different representations of the same data cells would be generated from the restricted data sets, and two sets of numbers, synthetic and real, would be published.
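The partial synthesis idea discussed above can be illustrated with a minimal sketch. It is not Reiter's method or an NSF procedure: the function name, the choice of which cells count as sensitive, and the modeling weights are all hypothetical. The sketch replaces the counts in the designated sensitive cells with a multinomial draw that keeps their combined total fixed, so that the aggregate is unaffected while the individual small-cell values are simulated.

```python
import random
from collections import Counter


def partially_synthesize(cells, sensitive, weights, seed=0):
    """Return a copy of `cells` with simulated counts in the sensitive cells.

    cells:     dict mapping a cell label to its actual count
    sensitive: labels of the cells to synthesize (how these are chosen
               is an assumption; the text only says "most sensitive")
    weights:   dict of positive modeling weights for the sensitive cells
               (hypothetical, e.g. a multi-year average share)
    """
    rng = random.Random(seed)
    labels = [label for label in cells if label in sensitive]
    total = sum(cells[label] for label in labels)

    # Multinomial draw: assign each of the `total` doctorates to one
    # sensitive cell, so the combined total is preserved exactly.
    draws = Counter(
        rng.choices(labels, weights=[weights[l] for l in labels], k=total)
    ) if labels else Counter()

    result = dict(cells)  # non-sensitive cells keep their actual counts
    for label in labels:
        result[label] = draws.get(label, 0)
    return result


# Made-up counts for five fine fields within one broad field.
actual = {"A": 120, "B": 7, "C": 3, "D": 40, "E": 12}
synthetic = partially_synthesize(
    actual,
    sensitive={"B", "C", "E"},
    weights={"B": 6.0, "C": 4.0, "E": 10.0},
)
```

In this sketch the large cells and the overall total are published unchanged, which mirrors the point that aggregates would be unaffected. It also makes the participants' objection concrete: the values in cells B, C, and E are no longer actual counts.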
Concentration criteria. A possible solution to the problem of colleague identification, as suggested by Stephen Cohen, would be to set criteria for concentrations. This approach would be similar to the rules now used by some government agencies to publish data only for companies in which there is not a concentration of the variable of interest. In this case, the concentration rules could be based on the number of schools that contribute doctorates to the field.
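A concentration criterion of this kind could be sketched as follows. The workshop did not specify any thresholds, so the function name and both parameters (a minimum number of contributing schools and a maximum share for the largest school) are assumptions chosen only for illustration.

```python
def passes_concentration_rule(doctorates_by_school, min_schools=3, max_top_share=0.6):
    """Check whether a field's doctorates are spread across enough schools.

    doctorates_by_school: dict mapping a school to its doctorate count in the field
    min_schools:   hypothetical minimum number of contributing schools
    max_top_share: hypothetical cap on the largest school's share of the total
    """
    counts = [n for n in doctorates_by_school.values() if n > 0]
    total = sum(counts)
    if total == 0 or len(counts) < min_schools:
        return False
    # The cell fails if a single school dominates the field.
    return max(counts) / total <= max_top_share


spread_out = passes_concentration_rule({"U1": 10, "U2": 9, "U3": 8})
dominated = passes_concentration_rule({"U1": 25, "U2": 1, "U3": 1})
too_few = passes_concentration_rule({"U1": 15, "U2": 15})
```

Here a field with 27 doctorates passes when they are spread evenly across three schools but fails when one school accounts for 25 of them, even though both cells would meet a count-only cutoff of 25.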
In the discussion that followed Reiter’s presentation, Cohen gave an example of how a smart intruder with some knowledge could identify a respondent through the published tables. Using Google Scholar, dissertation abstracts, and a faculty photograph on a departmental website from which a candidate’s gender and race were surmised, the NSF contractor was able to find a match in the SED. The result was judged to be a correct match. This exercise lent support to the aggregation decision.
Volatility of the data. Among the issues that warrant additional investigation, according to several participants, is the volatility of the estimates when the “cutoff of 25” rule is applied. NSF proposes to reassess the fine fields every 3 years to identify fields to be added or deleted based on the number of doctorates in a field and the number of schools granting those doctorates. Currently, NSF adds 8 to 10 new fields and loses 2 or 3 fields as a result of these triennial reviews. The decision is usually made jointly by NSF and the sponsoring agencies of the survey.
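The volatility concern can be made concrete with a small sketch that applies a cutoff to a cell's yearly counts and tallies how often its publication status changes. The counts are invented, and treating the cutoff as inclusive (a cell is published when its count is 25 or more) is an assumption rather than a stated NSF rule.

```python
def publication_flips(yearly_counts, cutoff=25):
    """Return the per-year publication status and the number of status changes.

    yearly_counts: cell counts in consecutive years
    cutoff:        assumed inclusive threshold for publishing the cell
    """
    published = [n >= cutoff for n in yearly_counts]
    # A flip is any year whose status differs from the previous year's.
    flips = sum(1 for prev, cur in zip(published, published[1:]) if prev != cur)
    return published, flips


# A cell hovering near the cutoff versus a comfortably large cell.
near_status, near_flips = publication_flips([27, 24, 26, 23, 30])
large_status, large_flips = publication_flips([40, 42, 39])
```

A cell whose count hovers around the cutoff appears and disappears from year to year, which breaks the longitudinal series even though the underlying counts barely change.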
Informed consent. One means of avoiding the problem of potential identification of persons in the published small cells is to obtain the permission of individuals to have personal characteristics and other data published. This would be done by asking for their informed consent to make their data available. It was suggested that informed consent would be sought only for certain sensitive data items, such as gender, race, or ethnicity. This might avoid increased nonresponse that might accompany asking for informed consent for the whole array of data collected by the survey.
According to Cohen, the NSF legislation seems to prohibit requesting the informed consent of the respondents for release of their data. Although the Confidential Information Protection and Statistical Efficiency Act permits the solicitation of the informed consent of respondents, the NSF legislation and the data collection strategy militate against using this authority for SED. Nonetheless, Lynda Carlson agreed that it would be useful to test an application of informed consent to the SED to see if obtaining such consent would be feasible. If so, NSF would then be in a position to deal with the implementation of such a procedure.