PROTECTING PARTICIPANTS AND FACILITATING SOCIAL AND BEHAVIORAL SCIENCES RESEARCH

E

Confidentiality and Data Access Issues for Institutional Review Boards

George T. Duncan
Carnegie Mellon University

INTRODUCTION

ACCEPTED PRINCIPLES of information ethics (see National Research Council, 1993) require that promises of confidentiality be preserved and that the data collected in surveys and studies adequately serve their purposes. A compromise of the confidentiality pledge could harm the research organization, the subject, or the funding organization. A statistical disclosure occurs when the data dissemination allows data snoopers to gain information about subjects by which the snooper can isolate individual respondents and corresponding sensitive attribute values (Duncan and Lambert, 1989; Lambert, 1993). Policies and procedures are needed to reconcile the need for confidentiality and the demand for data (Dalenius, 1988).

Under a body of regulation known as the Federal Policy for the Protection of Human Subjects, the National Institutes of Health Office of Human Subjects Research (OHSR) mandates that institutional review boards (IRBs) determine that research protocols assure the privacy and confidentiality of subjects. Specifically, it requires IRBs to ascertain whether (a) personally identifiable research data will be protected to the extent possible from access or use and (b) any special privacy and confidentiality issues are properly addressed, e.g., use of genetic information. This standard directs an IRB's attention, but without elaboration and clarification it does not provide IRBs with operational criteria for evaluation of research protocols. Nor does it provide guidance to researchers on how to establish research protocols that can merit IRB approval. The Office for Human Research Protections (OHRP) is responsible for interpreting and overseeing implementation of the regulations regarding the Protection of Human Subjects (45 CFR 46) promulgated by the Department of Health and Human Services (DHHS). OHRP is responsible for providing guidance to researchers and IRBs on ethical issues in biomedical and behavioral research.

As IRBs respond to their directive to ethically oversee the burgeoning research on human subjects, they require systematic ways of examining protocols for compliance with best practice for confidentiality and data access. Clearly, the task of an IRB is lightened if researchers are fully aware of such practices and how they can be implemented. This paper identifies key confidentiality and data access issues that IRB members must consider when reviewing protocols. It provides both a conceptual framework for such reviews and a discussion of a variety of administrative procedures and technical methods that can be used by researchers to simultaneously assure confidentiality protection and appropriate access to data.

CRITICAL ISSUES

Reason for Concern

Most generally, an ethical perspective requires researchers to maximize the benefits of their research while minimizing the risk and harm to their subjects. This beneficence notion is often interpreted to mean, first, that "one ought not to inflict harm" and, second, that "one ought to do or promote good." In the context of assuring data quality from research studies, this means first assuring an adequate degree of confidentiality protection and then maximizing the value of the data generated by the research. Confidentiality is afforded for reasons of ethical treatment of research subjects, pragmatic grounds of assuring subject cooperation, and, in some cases, legal requirements.
Aspects of Concern

Data have a serious risk of disclosure when (a) disclosure would have negative consequences, (b) a data snooper is motivated, both psychologically and pragmatically, to seek disclosure (Elliot, 2001), and (c) the data are vulnerable to disclosure attack. Based on their confidentiality pledges, researchers must protect certain sensitive objects from a data snooper. Sensitive objects can be any of a variety of variables associated with a subject entity (person, household, enterprise, etc.). Examples include the values of numerical variables, such as household income, an X-ray of a patient's lung, and a subject's report of their sexual history. Data with particular characteristics pose substantial risk of disclosure and suggest vulnerability:

• geographical detail: census block (Elliot, Skinner, and Dale, 1998; Greenberg and Zayatz, 1992);
• longitudinal or panel structure: criminal histories (Abowd and Woodcock, 2001);
• outliers, likely unique in the population, such as a 16-year-old widow (Dalenius, 1986; Greenberg, 1990);
• attributes with a high level of detail: income to the nearest dollar (Elliot, 2001);
• many attribute variables, such as a medical record (Sweeney, 2001);
• population data, as in a census, rather than a survey with a small sampling fraction (Elliot, 2001);
• databases that are publicly available, identified, and share individual respondents and attribute variables (key variables; Elliot and Dale, 1999) with the subject data: marketing and credit databases.

Data with geographical detail, such as census tract data, may be easily linked to known characteristics of respondents. Concern for this suggests placing minimum population levels on geographical identifiers. For particular geographical regions, this can mean specifying the minimum size of a region that can be reported. Longitudinal data, which track entities over time, also pose substantial disclosure risk. Many individuals had coronary bypass surgery in the Chicago area in 1998 and many had bypass surgery in Phoenix in 1999, but few did both. Outliers, say on variables like weight, height, or cholesterol level, can lead to identifiable respondents. Data with many attribute variables allow easier linkage with known attributes of identified entities, and entities that are unique in the sample are more likely to be unique in the population. Population data pose more disclosure risk than data from a survey having a small sampling fraction.
Finally, special concern must be shown when other databases are available to the data snooper and these databases are both identified and share with the subject data both individual respondents and certain attribute variables. Record linkage may then be possible between the subject data and the external database. The shared attribute variables provide the key.
Disclosure

The legitimate objects of inquiry for research involving human subjects are statistical aggregates over the records of individuals, for example, the median number of serious infections sustained by patients receiving a drug for treatment of arthritis. The investigators seek to provide the research community with data that will allow accurate inference about such population characteristics. At the same time, to respect confidentiality, the investigators must thwart the data snooper who might seek to use the disseminated data to draw accurate inferences about, say, the infection history of a particular patient. Such a capability by a data snooper would constitute a statistical disclosure.

There are two major types of disclosure: identity disclosure and attribute disclosure. Identity disclosure occurs with the association of a respondent's identity and a disseminated data record (Paass, 1988; Spruill, 1983; Strudler et al., 1986). Attribute disclosure occurs with the association of either an attribute value in the disseminated data or an estimated attribute value based on the disseminated data with the respondent (Duncan and Lambert, 1989; Lambert, 1993). In the case of identity disclosure, the association is assumed exact. In the case of attribute disclosure, the association can be approximate. Many investigators emphasize limiting the risk of identity disclosure, perhaps because of its substantial equivalence to the inadvertent release of an identified record. An attribute disclosure, even though it invades the privacy of a respondent, may not be so easily traceable to the actions of an agency. An IRB in its oversight capacity should be concerned that investigators limit the risk of both attribute and identity disclosures.

Risk of Disclosure

Measures of disclosure risk are required (Elliot, 2001).
In the context of identity disclosure, disclosure risk can arise because a data snooper may be able to use the disseminated data product to reidentify some deidentified records. Spruill (1983) proposed a measure of disclosure risk for microdata: (1) for each "test" record in the masked file, compute the Euclidean distance between the test record and each record in the source file; (2) determine the percentage of test records that are closer to their parent source record than to any other source record. She defines the risk of disclosure to be the percentage of test records that match the correct parent record multiplied by the sampling fraction (fraction of source records released).

More generally, and consistent with Duncan and Lambert (1986, 1989), an agency will have succeeded in protecting the confidentiality of a released data product if the data snooper remains sufficiently uncertain about a protected target value after data release. From this perspective, a measure of disclosure risk is built on measures of uncertainty. Furthermore, an agency can model the decision making of the data snooper as a basis for using disclosure limitation to deter inferences about a target. Data snoopers are deterred from publicly making inferences about a target when their uncertainty is sufficiently high. Mathematically, uncertainty functions provide a workable framework for this analysis. Examples include Shannon entropy, which has found use in categorizing continuous microdata and coarsening of categorical data (Domingo-Ferrer and Torra, 2001; Willenborg and de Waal, 1996:138).

Generally, a data snooper has a priori knowledge about a target, often in the form of a database with identified records (Adam and Wortmann, 1989). Certain variables may be in common with the subject database. These variables are called key or identifying variables (De Waal and Willenborg, 1996; Elliot, 2001). When a single record matches on the key variables, the data snooper has a candidate record for identification. That candidacy is promoted to an actual identification if the data snooper is convinced that the individual is in the target database. This would be the case either if the data snooper has auxiliary information to that effect or if the data snooper is convinced that the individual is unique in the population. The data snooper may find from certain key variables that a sample record is unique. The question then is whether the individual is also unique on these key variables in the population. Bethlehem, Keller, and Pannekoek (1990) have examined detection of records agreeing on simple combinations of keys based on discrete variables in the files.
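Spruill's matching measure, described earlier in this section, can be sketched in a few lines. The data, the noise mask, and the sampling fraction below are illustrative assumptions, not her exact implementation:

```python
import numpy as np

def spruill_risk(source, masked, sampling_fraction):
    """Spruill-style reidentification risk for masked microdata.

    For each masked "test" record, find the nearest source record by
    Euclidean distance; the risk is the fraction of test records whose
    nearest neighbor is their own parent record, multiplied by the
    sampling fraction (fraction of source records released).
    Records are assumed aligned by row: masked[i] is derived from source[i].
    """
    # Pairwise Euclidean distances: masked rows vs. source rows.
    d = np.linalg.norm(masked[:, None, :] - source[None, :, :], axis=2)
    matches = d.argmin(axis=1) == np.arange(len(masked))
    return matches.mean() * sampling_fraction

rng = np.random.default_rng(0)
source = rng.normal(size=(200, 4))                            # original microdata
masked = source + rng.normal(scale=0.1, size=source.shape)    # light noise mask
risk = spruill_risk(source, masked, sampling_fraction=1.0)
```

With noise this light relative to the spread of the data, most masked records still sit closest to their parents, so the computed risk is high; heavier noise drives it down.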
Record linkage methodologies have been examined by Domingo-Ferrer and Torra (2001), Fuller (1993), and Winkler (1998).

Deidentification

Deidentification of data is the process of removing apparent identifiers (name, e-mail address, social security number, phone number, address, etc.) from a data record. Deidentification does not necessarily make a record anonymous, as it may well be possible to reidentify the record using external information. In a letter to DHHS, the American Medical Informatics Association (2000) noted:

However, in discussions with a broad range of healthcare stakeholders, we have found the concept of "deidentified information" can be misleading, for it implies that if the
19 data elements are removed, the problem of reidentification has been solved. The information security literature suggests otherwise. Additionally, with the continuing and dramatic increase in computer power that is ubiquitously available, personal health data items that currently would be considered "anonymous" may lend themselves to increasingly easy reidentification in the future. For these reasons, we believe the regulations would be better served by adopting the conventions of personal health data as being of "High Reidentification Potential" (e.g., the 19 data elements listed in the current draft), and "Low Reidentification Potential." Over time, some elements currently considered of low potential may migrate to the high potential classification. More importantly, this terminology conveys the reality that virtually all personal health data has some confidentiality risk associated with it, and helps to overcome the mistaken impression that the confidentiality problem is solved by removing the 19 specified elements.

Most health care information, such as hospital discharge data, cannot be anonymized through deidentification. The reason that removing identifiers does not assure sufficient anonymity of respondents is that, today, a data snooper can get inexpensive access to databases with names attached to records. Marketing and credit information databases and voter registration lists are exemplars. Having this external information, the data snooper can employ sophisticated, but readily available, record linkage techniques. The resultant attempts to link an identified record from the public database to a deidentified record are often successful (Winkler, 1998). With such a linkage, the record would be reidentified.
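The linkage attack just described can be illustrated with a toy example. All names and values below are fabricated, and the chosen key variables (ZIP code, birth year, sex) are illustrative quasi-identifiers, not a claim about any particular data release:

```python
import pandas as pd

# Hypothetical deidentified research file: direct identifiers removed,
# but quasi-identifying key variables remain.
research = pd.DataFrame({
    "zip": ["15213", "15213", "60601"],
    "birth_year": [1958, 1971, 1958],
    "sex": ["F", "M", "F"],
    "diagnosis": ["asthma", "diabetes", "hypertension"],
})

# Hypothetical identified external database (e.g., a voter list).
voters = pd.DataFrame({
    "name": ["A. Jones", "B. Smith", "C. Wu"],
    "zip": ["15213", "15213", "60601"],
    "birth_year": [1958, 1971, 1944],
    "sex": ["F", "M", "F"],
})

# Exact-match record linkage on the shared key variables.
keys = ["zip", "birth_year", "sex"]
linked = research.merge(voters, on=keys, how="inner")
# Each row of `linked` now pairs a name with a supposedly anonymous diagnosis.
```

Two of the three "deidentified" records link to a unique named voter here; real attacks use probabilistic linkage to tolerate typos and coding differences, but the principle is the same.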
New Areas of Concern

Technological developments continue to raise new issues that must be addressed in the ethical direction of research involving human subjects. Of burgeoning importance in recent years are developments in information technology, especially the Internet, and in biotechnology, especially human genetics research.

The Internet

A good discussion of some of the issues involved in providing remote access to data through the web is provided by Blakemore (2001). These include security assurances against hacker attack and fears of
record linkage. A prominent example of web access to data is American FactFinder, maintained by the U.S. Census Bureau (http://factfinder.census.gov). American FactFinder provides access to population, housing, economic, and geographic data. The site gives a good description of the elaborate procedures followed to ensure confidentiality through statistical disclosure limitation (see also American Association for the Advancement of Science, 1999).

Genetic Research

The American Society of Human Genetics published the following statement on this issue:

Studies that maintain identified or identifiable specimens must maintain subjects' confidentiality. Information from these samples should not be provided to anyone other than the subjects and persons designated by the subjects in writing. To ensure maximum privacy, it is strongly recommended that investigators apply to the Department of Health and Human Services for a Certificate of Confidentiality. . . . Investigators should indicate to the subject that they cannot guarantee absolute confidentiality.

A statement by the Health Research Council of New Zealand (1998) is more specific:

Researchers must ensure the confidentiality and privacy of stored genetic information, genetic material or results of the research which relate to identified or identifiable participants. In particular, the research protocol must specify whether genetic information or genetic material and any information derived from studying the genetic material, will be stored in identified, deidentified or anonymous form. Researchers should consider carefully the consequences of storing information and material in anonymous form for the proposed research, future research and communication of research results to participants. Researchers should disclose where storage is to be and to whom their tissues will be accessible.
Tissue or DNA should only be sent abroad if this is acceptable to the consenting individual.
TENSION BETWEEN DISCLOSURE RISK AND DATA UTILITY

Data Quality Audit

The process of assuring confidentiality through statistical disclosure limitation while maintaining data utility has the following components:

• a data quality audit that, beginning with the original, collected data, assesses disclosure risk and data utility;
• a determination of adequacy of confidentiality protection;
• if confidentiality protection is inadequate, the implementation of a restricted access or restricted data procedure; and
• a return to the data quality audit.

A quality audit of collected data evaluates the utility of the data and assesses disclosure risk. Typically, with good research design and implementation, the data utility is high. But, also, the risk of disclosure through the release of the original, collected data is too high, even when the data collected have been deidentified, i.e., apparent identifiers (name, e-mail address, phone number, etc.) have been removed. Reidentification techniques have become too sophisticated to assure confidentiality protection (Winkler, 1998). A confidentiality audit will include identification of (1) sensitive objects and (2) characteristics of the data that make them susceptible to attack.

R-U Confidentiality Map

A measure of statistical disclosure risk, R, is a numerical assessment of the risk of unintended disclosures following dissemination of the data. A measure of data utility, U, is a numerical assessment of the usefulness of the released data for legitimate purposes. Illustrative results using particular specifications for R and U have been developed. The R-U confidentiality map was initially presented by Duncan and Fienberg (1999) and further explored for categorical data by Duncan et al. (2001).
As it is more fully developed by Duncan, Keller-McNulty, and Stokes (2002), the R-U confidentiality map provides a quantified link between R and U directly through the parameters of a disclosure limitation procedure. With an explicit representation of how the parameters of the disclosure limitation procedure affect R and U, the tradeoff between disclosure risk and data utility is apparent. With the R-U confidentiality map, data-holding groups have a workable new tool to frame decision making about data dissemination under confidentiality constraints.
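One way to see the tradeoff is to trace (R, U) for a single disclosure limitation procedure as its parameter varies. The sketch below uses additive noise as the procedure, a nearest-neighbor match rate as a stand-in for R, and a variance-preservation score as a stand-in for U. Both stand-ins are illustrative choices of this sketch, not the measures used by Duncan, Keller-McNulty, and Stokes:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=10, size=200)  # original confidential values

def r_u_point(noise_sd):
    """One (R, U) point for an additive-noise mask with the given sd.

    R: share of masked records whose nearest original value is their
       own parent (a simple reidentification proxy).
    U: 1 / (1 + relative error of the masked sample variance), so that
       U = 1 means the variance is reproduced exactly.
    """
    masked = x + rng.normal(scale=noise_sd, size=x.size)
    d = np.abs(masked[:, None] - x[None, :])
    r = np.mean(d.argmin(axis=1) == np.arange(x.size))
    u = 1.0 / (1.0 + abs(masked.var() - x.var()) / x.var())
    return r, u

# Tracing the map: heavier noise lowers disclosure risk R, but
# it also degrades utility U.
curve = [(sd, *r_u_point(sd)) for sd in (0.01, 0.1, 1.0, 5.0, 20.0)]
```

Plotting U against R for each parameter value yields the map; the data holder then picks the parameter whose point keeps R below a tolerable maximum while giving up as little U as possible.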
Restricted Access Procedures

Restricted access procedures are administrative controls on who can access data and under what conditions. These controls may include use of sworn agent status, licensing, and secure research sites. Each of these restricted access procedures requires examination of its structure and careful monitoring to ensure that it provides both confidentiality protection and appropriate access to data. Licensing systems, for example, require periodic inspections and a tracking database to monitor restricted-use data files (Seastrom, 2001). Even in secure research sites, only restricted data may be made available, say with deidentified data files. Secure sites require a trained staff who can impart a "culture of confidentiality" (Dunne, 2001).

Restricted Data Procedures: Disclosure Limitation Methods

Restricted data procedures are methods for disclosure limitation that require a disseminated data product to be some transformation of the original data. A variety of disclosure limitation methods have been proposed by researchers on confidentiality protection. Generally, these methods are tailored either to tabular data or to microdata. These procedures are widely applied by government statistical agencies since they face confidentiality issues directly in producing data products for their users. The most commonly used methods for tabular data are cell suppression based on minimum cell count or dominance rules; recoding variables; rounding; and geographic or minimum population thresholds. The most commonly used methods for microdata are microaggregation, deletion of data items, deletion of sensitive records, recoding data into broad categories, top and bottom coding, sampling, and geographic or minimum population thresholds (see Felsö, Theeuwes, and Wagner, 2001).

Direct transformations of data for confidentiality purposes are called disclosure-limiting masks (Jabine, 1993a, 1993b). With masked data sets, there is a specific functional relationship, possibly as a function of multiple records and possibly as a stochastic function, between masked values and the original data. Because of this relationship, the possibilities of both identity and attribute disclosures continue to exist, even though the risk of disclosure may be substantially reduced. The idea is to provide a response that, while useful for statistical analysis purposes, has sufficiently low disclosure risk. As a general classification, disclosure-limiting masks can be categorized as suppressions, recodings, or samplings.
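Such masks can be written concretely as linear matrix transformations M = AXB + C (Duncan and Pearson, 1991). The toy construction below builds one such mask; the particular choices of A (pairwise microaggregation), B (dropping a sensitive column), and C (additive noise) are illustrative assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 3
X = rng.normal(size=(n, p))          # original n x p data matrix

# A: record-transforming mask. Here it averages adjacent pairs of
#    records (a toy microaggregation), mapping n records to n // 2
#    group records; A is (n // 2) x n.
A = np.kron(np.eye(n // 2), np.full((1, 2), 0.5))

# B: variable-transforming mask. Here it drops the last (most
#    sensitive) column; B is p x (p - 1).
B = np.eye(p)[:, :p - 1]

# C: displacing mask -- additive noise on the released cells.
C = rng.normal(scale=0.05, size=(n // 2, p - 1))

M = A @ X @ B + C   # the disseminated, disclosure-limited product
```

Each released row of M is thus a noisy two-record average over a reduced set of variables; many of the specific methods discussed below correspond to particular structured choices of A, B, and C.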
Whether for microdata or tabular data, many of these transformations can be represented as matrix masks (Duncan and Pearson, 1991), M = AXB + C, where X is a data matrix, say n × p. In general, the defining matrices A, B, and C can depend on the values of X and be stochastic. The matrix A (since it operates on the rows of X) is a record-transforming mask, the matrix B (since it operates on the columns of X) is a variable-transforming mask, and the matrix C is a displacing mask (noise addition).

Methods for Tabular Data

A variety of disclosure limitation methods for tabular data are identified or developed and then analyzed by Duncan et al. (2001). The discussion below describes some of the more important of these methods.

Suppression

A suppression is a refusal to provide a data instance. For microdata, this can involve the deletion of all values of some particularly sensitive variable. In principle, certain record values could also be suppressed, but this is usually handled through recoding. For tabular data, the values of table cells that pose confidentiality problems are suppressed. These are the primary suppressions. Often, a cell is considered unsafe for publication according to the (n, p) dominance rule, i.e., if a few (n), say three, contributing entities represent a percentage p, say 70 percent, or more of the total. Additionally, enough other cells are suppressed so that the values of the primary suppressions cannot be inferred from released table margins. These additional cells are called secondary suppressions. Even tables of realistic dimensionality with only a few primary suppressions present a multitude of possible configurations for the secondary cell suppressions. This raises computational difficulties that can be formulated as combinatorial optimization problems.
Typical techniques that are used include mathematical programming (especially integer programming) and graph theory (Chowdhury et al., 1999).

Recoding

A disclosure-limiting mask for recoding creates a set of data for which some or all of the attribute values have been altered. Recoding can be applied to microdata or to tabular data. Some common methods of recoding for tabular data are global recoding and rounding. A new method of recoding is Markov perturbation.
• Under global recoding, categories are combined. This represents a coarsening of the data through combining rows or combining columns of the table.

• Under rounding, every cell entry is rounded to some base b. The controlled rounding problem is to find some perturbation of the original entries that will satisfy (marginal, typically) constraints and that is "close" to the original entries (Cox, 1987). Multidimensional tables present special difficulties. Methods for dealing with them are given by Kelley, Golden, and Assad (1990).

• Markov perturbation (Duncan and Fienberg, 1999) makes use of stochastic perturbation through entity moves according to a Markov chain. Because of the cross-classified constraints imposed by the fixing of marginal totals, moves must be coupled. This coupling is consistent with a Gröbner basis structure (Fienberg, Makov, and Steele, 1998). In a graphical representation, it is consistent with data flows corresponding to an alternating cycle, as discussed by Cox (1987).

Disclosure-Limitation Methods for Microdata

Examples of recoding as applied to microdata include data swapping; adding noise; and global recoding and local suppression. In data swapping (Dalenius and Reiss, 1982; Reiss, 1980; Spruill, 1983), some fields of a record are swapped with the corresponding fields in another record. Concerns have been raised that while data swapping lowers disclosure risk, it may excessively distort the statistical structure of the original data (Adam and Wortmann, 1989). A combination of data swapping with additive noise has been suggested by Fuller (1993). Masking through the introduction of additive or multiplicative noise has been investigated (e.g., Fuller, 1993). A disclosure limitation method for microdata that is used in the µ-Argus software is a combination of global recoding and local suppression.
Global recoding combines several categories of a variable to form less specific categories. Topcoding is a specific example of global recoding. Local suppression suppresses certain values of individual variables (Willenborg and de Waal, 1996). The aim is to reduce the set of records where only a few agree on particular combinations of key values. Both methods make the data less specific and so result in some information loss to legitimate researchers.
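The three devices just described can be sketched on a toy microdata file. The data, the age bands, the topcode cap, and the fewer-than-two rarity threshold are all illustrative assumptions, not the rules µ-Argus itself applies:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 37, 41, 67, 89, 17],
    "income": [18_000, 52_000, 250_000, 31_000, 44_000, 9_000],
    "occupation": ["nurse", "judge", "judge", "farmer", "nurse", "student"],
})

# Global recoding: collapse exact age into broad bands
# (less specific categories).
df["age_band"] = pd.cut(df["age"], bins=[0, 25, 45, 65, 120],
                        labels=["<=25", "26-45", "46-65", "66+"])

# Topcoding, a special case of global recoding: censor income at a cap
# so extreme (and identifiable) values are not released.
df["income_tc"] = df["income"].clip(upper=100_000)

# Local suppression: blank a value that leaves a record nearly unique
# on a key variable (here, any occupation held by fewer than 2 people).
counts = df["occupation"].value_counts()
rare = counts[counts < 2].index
df.loc[df["occupation"].isin(rare), "occupation"] = None
```

After these steps the released file would carry `age_band`, `income_tc`, and the partially suppressed `occupation` in place of the original columns.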
Sampling

Sampling, as a disclosure-limiting mask, creates an appropriate statistical sample of the original data. Alternatively, if the original data are themselves a sample, the data may be considered self-masked. Just the fact that the data are a sample may not result in disclosure risk sufficiently low to permit data dissemination. In that case, subsampling may be required to obtain a data product with adequately low disclosure risk.

Synthetic, Virtual, or Model-Based Data

The methods described so far have involved perturbations or masking of the original data. These are called data-conditioned methods by Duncan and Fienberg (1999). Another approach, while less studied, should be conceptually familiar to statisticians. Consider the original data to be a realization according to some statistical model. Replace the original data with samples (the synthetic data) drawn according to the model. Synthetic data sets consist of records of individual synthetic units rather than records the agency holds for actual units.

Rubin (1993) suggested synthetic data construction through a multiple imputation method. The effect of imputation of an entire microdata set on data utility is an open research question. Rubin (1993) asserts that the risk of identity disclosure can be eliminated through the dissemination of synthetic data and proposes the release of synthetic microdata sets for public use. His reasoning is that the synthetic data carry no direct functional link between the original data and the disseminated data. So while there can be substantial identity disclosure risk with (inadequately) masked data, identity disclosure is, in a strict sense, impossible with the release of synthetic data. However, the release of synthetic data may still involve risk of attribute disclosure (Fienberg, Makov, and Steele, 1998).
Rubin (1993) cogently argues that the release of synthetic data has advantages over other data dissemination strategies, because

• masked data can require special software for their proper analysis for each combination of analysis, masking method, and database type (Fuller, 1993);
• release of aggregates, e.g., summary statistics or tables, is inadequate because of the difficulty of anticipating at the data release stage what analysts might like to do with the data; and
• mechanisms for the release of microdata under restricted access conditions, e.g., user-specific administrative controls, can never fully satisfy the demands for publicly available microdata.
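A stripped-down version of the synthetic-data idea, without Rubin's multiple-imputation machinery, is to fit a model to the confidential values and release only draws from the fitted model. The lognormal income model below is an illustrative assumption of this sketch; a real synthesizer would model far richer multivariate structure and propagate model uncertainty:

```python
import numpy as np

rng = np.random.default_rng(3)
original = rng.lognormal(mean=10, sigma=0.5, size=1_000)  # e.g., incomes

# Fit a simple parametric model to the confidential data.
log_mu = np.log(original).mean()
log_sd = np.log(original).std()

# Release synthetic records drawn from the fitted model. No released
# record corresponds to any actual respondent, so identity disclosure
# in the strict sense is impossible; attribute disclosure risk remains
# to the extent the model reveals the population distribution.
synthetic = rng.lognormal(mean=log_mu, sigma=log_sd, size=original.size)
```

Aggregate features the model captures (here, location and spread on the log scale) survive in the synthetic file, which is exactly what makes it useful to analysts and what keeps attribute disclosure a live concern.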
The methodology for the release of synthetic data is simple in concept, but complex in implementation. Conceptually, the data-holding research group would use the original data to determine a model to generate the synthetic data. But the purpose of this model is not the usual prediction, control, or scientific understanding that argues for parsimony through Occam's Razor. Instead, its purpose is to generate synthetic data useful to a wide range of users. The agency must recognize uncertainty in both model form and the values of model parameters. This argues for the relevance of hierarchical and mixture models to generate the synthetic data.

CONCLUSIONS

IRBs must examine protocols for human subjects research carefully to ensure both that confidentiality protection is afforded and that appropriate data access is provided. Promising procedures are available based on restricted access, through means such as licensing and secure research sites, and restricted data, through statistical disclosure limitation.

REFERENCES AND BIBLIOGRAPHY

Abowd, J.M., and S.D. Woodcock
2001 Disclosure limitation in longitudinal linked data. Pp. 215-277 in Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J.I. Lane, J.J.M. Theeuwes, and L.V. Zayatz, eds. Amsterdam: North-Holland/Elsevier.

Adam, N.R., and J.C. Wortmann
1989 Security-control methods for statistical databases: A comparative study. ACM Computing Surveys 21:515-556.

Agarwal, R., and R. Srikant
2000 Privacy-preserving data mining. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 15-18, Dallas, Tex.

American Association for the Advancement of Science
1999 Ethical and Legal Aspects of Human Subjects Research on the Internet. Workshop Report. Available: http://www.aaas.org/spp/dspp/sfrl/projects/intres/report.pdf [4/12/02].
American Medical Informatics Association
2000 Letter to the U.S. Department of Health and Human Services. Available: http://www.amia.org/resource/policy/nprm response.html [4/1/03].

Blakemore, M.
2001 The potential and perils of remote access. Pp. 315-340 in Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J.I. Lane, J.J.M. Theeuwes, and L.V. Zayatz, eds. Amsterdam: North-Holland/Elsevier.
Chowdhury, S.D., G.T. Duncan, R. Krishnan, S.F. Roehrig, and S. Mukherjee
1999 Disclosure detection in multivariate categorical databases: Auditing confidentiality protection through two new matrix operators. Management Science 45:1710-1723.

Cox, L.H.
1980 Suppression methodology and statistical disclosure control. Journal of the American Statistical Association 75:377-385.
1987 A constructive procedure for unbiased controlled rounding. Journal of the American Statistical Association 82:38-45.

Dalenius, T.
1986 Finding a needle in a haystack. Journal of Official Statistics 2:329-336.
1988 Controlling Invasion of Privacy in Surveys. Department of Development and Research. Statistics Sweden.

Dalenius, T., and S.P. Reiss
1982 Data-swapping: A technique for disclosure control. Journal of Statistical Planning and Inference 6:73-85.

De Waal, A.G., and L.C.R.G. Willenborg
1996 A view on statistical disclosure for microdata. Survey Methodology 22:95-103.

Domingo-Ferrer, J., and V. Torra
2001 A quantitative comparison of disclosure control methods for microdata. Pp. 111-134 in Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J.I. Lane, J.J.M. Theeuwes, and L.V. Zayatz, eds. Amsterdam: North-Holland/Elsevier.

Duncan, G.T.
2001 Confidentiality and statistical disclosure limitation. In N.J. Smelser and P.B. Baltes, eds., International Encyclopedia of the Social and Behavioral Sciences. Oxford, England: Elsevier Science.

Duncan, G.T., and S.E. Fienberg
1999 Obtaining information while preserving privacy: A Markov perturbation method for tabular data. Eurostat. Statistical Data Protection '98, Lisbon, 351-362.

Duncan, G.T., and S. Kaufman
1996 Who should manage information and privacy conflicts?: Institutional design for third-party mechanisms. The International Journal of Conflict Management 7:21-44.

Duncan, G.T., and D. Lambert
1986 Disclosure-limited data dissemination (with discussion). Journal of the American Statistical Association 81:10-28.
1989 The risk of disclosure of microdata. Journal of Business and Economic Statistics 7:207-217.

Duncan, G.T., and S. Mukherjee
2000 Optimal disclosure limitation strategy in statistical databases: Deterring tracker attacks through additive noise. Journal of the American Statistical Association 95:720-729.

Duncan, G.T., and R. Pearson
1991 Enhancing access to microdata while protecting confidentiality: Prospects for the future (with discussion). Statistical Science 6:219-239.

Duncan, G.T., S.E. Fienberg, R. Krishnan, R. Padman, and S.F. Roehrig
2001 Disclosure limitation methods and information loss for tabular data. Pp. 135-166 in Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J.I. Lane, J.J.M. Theeuwes, and L.V. Zayatz, eds. Amsterdam: North-Holland/Elsevier.
CONFIDENTIALITY AND DATA ACCESS ISSUES FOR INSTITUTIONAL REVIEW BOARDS 249

Duncan, G.T., S. Keller-McNulty, and S.L. Stokes
2002 Disclosure risk vs. data utility: The R-U confidentiality map. Technical Reports: Statistical Sciences Group, Los Alamos National Laboratory, and Heinz School of Public Policy and Management, Carnegie Mellon University.

Dunne, T.
2001 Issues in the establishment and management of secure research sites. Pp. 297-314 in Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J.I. Lane, J.J.M. Theeuwes, and L.V. Zayatz, eds. Amsterdam: North-Holland/Elsevier.

Elliot, M.
2001 Disclosure risk assessment. Pp. 135-166 in Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J.I. Lane, J.J.M. Theeuwes, and L.V. Zayatz, eds. Amsterdam: North-Holland/Elsevier.

Elliot, M., and A. Dale
1999 Scenarios of attack: The data intruder's perspective on statistical disclosure risk. Netherlands Official Statistics 14:6-10.

Elliot, M., C. Skinner, and A. Dale
1998 Special uniques, random uniques and sticky populations: Some counterintuitive effects of geographical detail on disclosure risk. Research in Official Statistics 1:53-68.

Eurostat
1996 Manual on Disclosure Control Methods. Luxembourg: Office for Publications of the European Communities.

Federal Committee on Statistical Methodology
1994 Statistical Policy Working Paper 22: Report on Statistical Disclosure Limitation Methodology. Washington, DC: U.S. Office of Management and Budget.

Felsö, F., J. Theeuwes, and G. Wagner
2001 Disclosure limitation methods in use: Results of a survey. Pp. 17-42 in Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J.I. Lane, J.J.M. Theeuwes, and L.V. Zayatz, eds. Amsterdam: North-Holland/Elsevier.

Fienberg, S.E.
1994 Conflicts between the needs for access to statistical information and demands for confidentiality. Journal of Official Statistics 10:115-132.

Fienberg, S.E., U.E. Makov, and R.J. Steele
1998 Disclosure limitation using perturbation and related methods for categorical data. Journal of Official Statistics 14:347-360.

Fuller, W.A.
1993 Masking procedures for microdata disclosure limitation. Journal of Official Statistics 9:383-406.

Greenberg, B.
1990 Disclosure avoidance research at the Census Bureau. Pp. 144-166 in Proceedings of the U.S. Census Bureau Annual Research Conference, Washington, DC.

Greenberg, B., and L. Zayatz
1992 Strategies for measuring risk in public use microdata files. Statistica Neerlandica 46:33-48.

Health Research Council of New Zealand
1998 Statement. Available: http://www.hrc.govt.nz/genethic.htm.

Jabine, T.B.
1993a Procedures for restricted data access. Journal of Official Statistics 9:537-589.
1993b Statistical disclosure limitation practices of United States statistical agencies. Journal of Official Statistics 9:427-454.
Kelley, J., B. Golden, and A. Assad
1990 Controlled rounding of tabular data. Operations Research 38:760-772.

Kim, J.J.
1986 A method for limiting disclosure in microdata based on random noise and transformation. Pp. 370-374 in Proceedings of the Survey Research Methods Section, American Statistical Association.

Kim, J.J., and W. Winkler
1995 Masking microdata files. In Proceedings of the Section on Survey Research Methods, American Statistical Association.

Kooiman, P., J. Nobel, and L. Willenborg
1999 Statistical data protection at Statistics Netherlands. Netherlands Official Statistics 14:21-25.

Lambert, D.
1993 Measures of disclosure risk and harm. Journal of Official Statistics 9:313-331.

Little, R.J.A.
1993 Statistical analysis of masked data. Journal of Official Statistics 9:407-426.

Marsh, C., C. Skinner, S. Arber, B. Penhale, S. Openshaw, J. Hobcraft, D. Lievesley, and N. Walford
1991 The case for samples of anonymized records from the 1991 census. Journal of the Royal Statistical Society, Series A 154:305-340.

Marsh, C., A. Dale, and C.J. Skinner
1994 Safe data versus safe settings: Access to microdata from the British Census. International Statistical Review 62:35-53.

Mokken, R.J., P. Kooiman, J. Pannekoek, and L.C.R.J. Willenborg
1992 Disclosure risks for microdata. Statistica Neerlandica 46:49-67.

Mood, A.M., F.A. Graybill, and D.C. Boes
1963 Introduction to the Theory of Statistics. New York: McGraw-Hill.

Moore, R.A.
1996 Controlled data-swapping techniques for masking public use microdata sets. Statistical Research Division Report Series, RR 96-04. Washington, DC: U.S. Bureau of the Census.

National Research Council
1993 Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics. Panel on Confidentiality and Data Access, G.T. Duncan, T.B. Jabine, and V.A. de Wolf, eds. Committee on National Statistics and Social Science Research Council. Washington, DC: National Academy Press.
2000 Improving Access to and Confidentiality of Research Data: Report of a Workshop. Committee on National Statistics, C. Mackie and N. Bradburn, eds. Washington, DC: National Academy Press.

Paass, G.
1988 Disclosure risk and disclosure avoidance for microdata. Journal of Business and Economic Statistics 6:487-500.

Rubin, D.B.
1993 Satisfying confidentiality constraints through the use of synthetic multiply-imputed microdata. Journal of Official Statistics 9:461-468.

Seastrom, M.M.
2001 Licensing. Pp. 279-296 in Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J.I. Lane, J.J.M. Theeuwes, and L.V. Zayatz, eds. Amsterdam: North-Holland/Elsevier.

Skinner, C.J.
1990 Statistical Disclosure Issues for Census Microdata. Paper presented at International Symposium on Statistical Disclosure Avoidance, Voorburg, The Netherlands, December 13.
Spruill, N.L.
1983 The confidentiality and analytic usefulness of masked business microdata. Pp. 602-607 in Proceedings of the Section on Survey Research Methods, American Statistical Association.

Sweeney, L.
2001 Information explosion. Pp. 43-74 in Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J.I. Lane, J.J.M. Theeuwes, and L.V. Zayatz, eds. Amsterdam: North-Holland/Elsevier.

Willenborg, L., and T. de Waal
1996 Statistical Disclosure Control in Practice. Lecture Notes in Statistics #111. New York: Springer.

Winkler, W.E.
1998 Re-identification methods for evaluating the confidentiality of analytically valid microdata. Research in Official Statistics 1:87-104.

Zayatz, L.V., P. Massell, and P. Steel
1999 Disclosure limitation practices and research at the U.S. Census Bureau. Netherlands Official Statistics 14:26-29.