This chapter is presented in two parts. The first part introduces some of the terminology used throughout the workshop, as well as the federal laws and policies that form the framework within which the sharing of data on environmental health occurs. After the summary of the presentations, relevant discussions from sessions throughout the day[1] are described. The second part of the chapter summarizes the presentations that described current approaches to the sharing of environmental health data by federal agencies and identified the weaknesses and shortcomings of these approaches. Again, this is followed by a summary of the relevant discussion that occurred at the workshop.
Lynn Goldman, dean of the Milken Institute School of Public Health at George Washington University, introduced the topic of terminology in her workshop overview. “One thing that I have noticed is that some of the words that we use in science ... have sometimes not been used consistently,” she said, so she offered definitions for a series of terms that are important in talking about the sharing of environmental health data.
“Peer review is when you actually evaluate the scientific work by others in the same field,” she said. “A systematic review is a summary of the clinical literature,” she continued. “There are various methods that people can use. They can quantitatively pool the data or do a meta-analysis, which is a way of combining data from many different studies using a statistical process.”

[1] Presentations and, especially, discussion sessions often covered topics formally introduced in sessions different from the one being described in this summary. Where a discussion point from a different session is relevant, it is presented not in the order in which it was made but in proximity to the most appropriate discussion within this summary.
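To illustrate the statistical pooling that Goldman describes, the sketch below shows a simple fixed-effect meta-analysis, in which each study's effect estimate is weighted by the inverse of its variance before the estimates are combined. The study values are invented for illustration only; real meta-analyses involve many additional considerations (heterogeneity, random-effects models, publication bias).

```python
# Toy fixed-effect meta-analysis: pool per-study effect estimates by
# inverse-variance weighting. All study numbers are hypothetical.

def pool_fixed_effect(estimates, std_errors):
    """Return the pooled effect estimate and its standard error."""
    weights = [1.0 / se**2 for se in std_errors]   # inverse-variance weights
    total_w = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_w
    pooled_se = (1.0 / total_w) ** 0.5
    return pooled, pooled_se

# Three hypothetical studies of the same exposure-outcome association,
# expressed as effect estimates (e.g., log relative risks) with standard errors
effects = [0.30, 0.10, 0.25]
ses = [0.10, 0.20, 0.15]

est, se = pool_fixed_effect(effects, ses)
print(f"pooled effect = {est:.3f}, SE = {se:.3f}")
```

The pooled estimate is pulled toward the more precise studies (those with smaller standard errors), which is the core idea behind quantitative pooling across studies.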
There are various approaches to testing and validating previous scientific work, she said. “A reanalysis is when you conduct a further analysis of data.” A person doing a reanalysis of data may use the same programs and statistical methodologies that were originally used to analyze the data or may use alternative methodologies, but the point is to analyze exactly the same data and see if the same result emerges from the analysis.
“Replication means that you actually repeat a scientific experiment or a trial to obtain a consistent result,” she continued. The second experiment uses exactly the same protocols and statistical programs but with data from a different population. The goal is to see if the same results hold with data from a different population.
“And then, finally, when you reproduce, you are producing something that is very similar to that research, but it is in a different medium or context,” she said. In other words, a researcher who is reproducing an experiment addresses the same research question but from a different angle than the original researcher did. “Most of us, when we are doing systematic reviews, are more convinced that something is going on when we see reproducibility as well as replicability.”
Different Meanings of “Data”
During Session 2, Bernard Lo, president and chief executive officer of The Greenwall Foundation, noted that there are a number of different types of data. There are raw data, which come straight from the survey or the experiment. There are cleaned-up data, which consist of the raw data modified to remove obvious errors. There are processed data, which are data that have been computed and analyzed to extract relevant information. There is the final clean data set that is provided with a publication. And there are the metadata that describe the data. All of these types of data are important in different ways and for different purposes (see Figure 2-1).
Lo explained that investigators may want to make the different types of data available to different people at different times. For each data-sharing element or activity, different questions may need to be posed. What types of data will be shared? Who should be providing data for sharing? Who should be the recipients of the shared data? When in the study process should certain data be shared and how? Should some people have open access? Is it important to pay attention to proportionality in terms of the risks, benefits, and burdens to various parties who are going to be affected by data sharing (e.g., researchers, participants, sponsors, the public)?
Lo noted that he chairs an Institute of Medicine committee that is tasked with issuing a report on the responsible sharing of clinical trial data. In January 2014 that committee issued a preliminary report, Discussion Framework for Clinical Trial Data Sharing: Guiding
Principles, Elements, and Activities, which lays out the committee’s thoughts on the principles that should guide the sharing of clinical trial data, describes certain data-sharing activities, and defines the key elements of data and data-sharing activities (IOM, 2014). For its final report the committee’s charge is “to analyze the benefits, challenges, and risk of various models of data sharing and to make recommendations to enhance the responsible sharing of clinical trial data,” Lo said.[2] He added that while that committee is focused solely on clinical trials, many of its ideas and recommendations will likely apply to environmental health research more generally.
A variety of federal laws specify what data must be shared and under what circumstances. In general, these laws apply to data held by federal agencies and to data collected with federal funding.
Paul Verkuil, chairman of the Administrative Conference of the United States, offered some background on these laws. “As a traditional matter,” he said, “agencies were required to disclose data underlying investigations undertaken by agency scientists upon public request but were not required to disclose data from studies commissioned by the agency but performed by private entities. This framework has changed in recent years, and certain disclosure requirements also now apply to privately conducted research.”
Specifically, Verkuil said, several federal acts govern the sharing of data. The Freedom of Information Act (FOIA)[3] requires that agencies release records—including scientific data—upon public request. It does contain a number of exceptions that protect things like confidential business information and personal privacy. The Electronic Freedom of Information Act Amendments of 1996[4] require agencies to release electronic copies of documents that have been previously requested and that are likely to be the subject of future requests rather than waiting for subsequent requests. The result, Verkuil said, is much greater transparency than had previously been the case.

[2] The Committee on Strategies for Responsible Sharing of Clinical Trial Data released its final report in January 2015. Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk is available at www.nap.edu/catalog/18998.

[3] Freedom of Information Act, 5 U.S.C. § 552, amended by Public Law 104-231, 110 Stat. 3048, 104th Congress.

[4] Electronic Freedom of Information Act Amendments of 1996, Public Law 104-231, 104th Congress.
In 1998 a law referred to as the Shelby Amendment[5] was passed. That law required that all federally funded research data be made available to the public under FOIA. Traditionally, it was not the case that funded research had to be made available in response to a FOIA request. The Shelby Amendment changed that, requiring that data produced by grantees be released under FOIA, subject to the usual exceptions. The Shelby Amendment was enacted in response to a desire to reanalyze two studies, the Harvard Six Cities Study (Dockery et al., 1993) and the American Cancer Society Study (Pope et al., 1995), which examined the health risks caused by particulate matter in the air.
This was followed in 2001 by the enactment of the Information Quality Act, also called the Data Quality Act,[6] which was intended to improve the quality of information used and promulgated by agencies. Among other things, the act requires agencies to create a procedure that allows people to correct information that has been released if the information is erroneous, Verkuil said. Guidelines issued by the Office of Management and Budget (OMB) in response to the Information Quality Act state that when executive branch agencies provide “influential scientific, financial, or statistical information,” they also “shall include a high degree of transparency about data and methods to facilitate the reproducibility of information by qualified third parties” (OMB, 2002). These OMB guidelines have affected how agencies respond both to requests for data and to what are called “information corrections,” which are intended to correct information that has been promulgated.
Verkuil noted that contractors and grantees are treated differently in the Shelby Amendment, with only grantees being forced to make data available. “I do not think that this distinction is one that can last long,” he said, because once you get data from one type of federally funded research, it is difficult to imagine not requiring availability from all federally funded research. “The transparency is promoted when an agency relies upon a privately funded study. It urges the researcher to disclose the underlying data. When a private researcher declines, agencies should issue an explanation why they relied on such studies despite the declining. And agencies should require conflict-of-interest disclosures for all scientific research submitted to inform the decision-making process.”

[5] Shelby Amendment to the Omnibus Appropriations Act for Fiscal Year 1999, Public Law 105-277, 105th Congress.

[6] Data Quality Act, Section 515 of the Consolidated Appropriations Act for Fiscal Year 2001, Public Law 106-554, 106th Congress.
He noted that the ultimate beneficiary of such data sharing is the broader society. “Open communication among scientists and engineers and between these experts and the public accelerates scientific and technological advancement, strengthens the economy, educates the nation, and enhances democracy,” he said.
The Role of Courts
In addition to Congress and the executive branch, including federal agencies, courts also play a role in determining how data are shared. Verkuil described that role in his presentation.
Most of the rules produced by the U.S. Environmental Protection Agency (EPA) and other federal agencies are developed through informal rule making, Verkuil noted, but when agencies issue rules through that process, reviewing courts scrutinize them very carefully. The process is governed by the “arbitrary and capricious” standard of 5 U.S.C. § 706(2)(A).
For an agency’s rule to be upheld on judicial review, the agency must have placed in the administrative record all of the information that the agency relied upon to reach its decision—a requirement known as the “Portland Cement doctrine.”[7] “The E-Government Act makes it a little easier,” Verkuil said, “because you can post rules online and you can use regulations.gov to find out what else has been posted by other commenters.” Sometimes, however, there can be problems with paper-based comments, Verkuil said. For example, they may not be scanned. This is a transitional problem, he said, but a real one for agencies, depending on how many paper-based versus electronically transmitted comments they receive. Verkuil commented that his organization, the Administrative Conference of the United States, encourages agencies to do as much as possible electronically rather than on paper.

The amount of information that can be presented to the court is huge. “You can appreciate what the ‘record’ looks like on review,” Verkuil said. “It is enormous, and the agency has to decide what is in and what is out on its own.” The situation is different from formal adjudication or formal rule making, in which there is an administrative law judge making the decision, the judge decides what goes in and what does not, and what is accepted becomes part of a “record” in the traditional legal sense. In contrast, Verkuil explained, in informal rule making, the record is an accumulation of the best estimate of what needs to be in there to support the rule.

[7] The Portland Cement doctrine is named after the decision in Portland Cement Association v. Ruckelshaus, 486 F.2d 375, 393-94 (D.C. Cir. 1973).
Although courts can scrutinize the research and underlying data upon which the agency relied, Verkuil said, they generally defer to the agency on technical determinations. Courts recognize that they do not have the technical expertise to second-guess an agency’s scientific or technical judgment. Instead, the courts seek to make sure that an agency has behaved rationally in light of the data before it. A court also looks to determine whether an agency has observed appropriate procedures in reviewing the underlying evidence and whether it considered relevant information and alternative approaches. Those procedures, Verkuil said, all seek to make sure that what an agency has produced was produced in an appropriate scientific manner, with proper judgments being supported by the evidence.
Finally, Verkuil touched on the issue of whether someone can take an agency to court if it is believed that the agency failed to comply with the Data Quality Act. He explained that this issue is still being debated and has yet to be resolved convincingly. As you can imagine, he said, it could be a big issue if the courts decided to intervene every time an agency determines whether the Data Quality Act has been properly complied with.
Executive Branch Guidance
More generally, the executive branch puts forward a variety of policies regarding data sharing. George Gray, director of the Center for Risk Science and Public Health at the Milken Institute School of Public Health at George Washington University, described some of the more relevant policies during his presentation. These policies are different from laws and regulations, he emphasized. They provide guidance and do not have the same force as either laws or regulations.
Many of the relevant policies are promulgated by the OMB, he said, and he focused on what are referred to as the OMB “circulars,” which are instructions or information from the OMB to federal agencies that are generally in effect for 2 or more years.
The Shelby Amendment instructed the OMB to amend one of its circulars, Circular A-110.[8] In particular, the Shelby Amendment told the OMB to amend Circular A-110 to ensure that all data produced with funding from federal grants would be available to the public under FOIA.
In particular, Gray said, the amended version of Circular A-110 has the following provisions:
- It applies to grants and agreements with institutions of higher education, hospitals, and other nonprofit organizations.
- It obligates EPA to obtain from its contractors “research data” underlying findings used by the agency in developing action that has the force and effect of law.
- It has exceptions for drafts, peer reviews, personally identifiable information, and so on.
- It also exempts confidential business information or other information that needs to be confidential, but only until the data are published in a journal or cited by an agency in support of its action.
One of the things that is interesting about the circular, Gray said, is that rather than applying to research that is done within the federal government, it applies to grants and agreements that are made by the government with institutions of higher education, hospitals, and other nonprofits. In particular, it obligates EPA to get the research data that underlie findings that are used by the agency in developing agency actions. This is a way of building a record of what leads the agency to make a particular decision.
OMB Circular A-130[9] lays out the basic principles that the Executive Office of the President wishes agencies to follow in using information to come to decisions. Those principles, as laid out in the circular, include the following:
- The free flow of information between the government and the public is essential to a democratic society. In other words, Gray paraphrased, “Sharing is the right thing to do.”
- The nation can benefit from government information disseminated both by federal agencies and by diverse nonfederal parties, including state and local government agencies, educational and other not-for-profit institutions, and for-profit organizations.
- The open and efficient exchange of scientific and technical government information, subject to applicable national security controls and the proprietary rights of others, fosters excellence in scientific research and effective use of federal research and development funds.

[9] Circular A-130 can be found at http://www.whitehouse.gov/omb/circulars_a130_a130trans4 (accessed October 26, 2015).
Finally, Gray described a memorandum from the Office of Science and Technology Policy to the heads of executive departments and agencies (Executive Office of the President, 2013). “Again, this is the center of the executive branch giving instruction, guidance, policy approaches to the other executive branch agencies,” he said. The memorandum offered several of the administration’s “policy principles,” including the following:
- “The Administration is committed to ensuring that, to the greatest extent and with the fewest constraints possible and consistent with law and the objectives set out below, the direct results of federally funded scientific research are made available to and useful for the public, industry, and the scientific community” (Executive Office of the President, 2013). “They want to see the direct results of federally funded scientific research made available to and useful for the public, industry, and the scientific community,” Gray commented. “Again, this is the exhortation to more data sharing, more openness in the way things are done.”
- “Scientific research supported by the federal government catalyzes innovative breakthroughs that drive our economy. The results of that research become the grist for new insights and are assets for progress in areas such as health, energy, the environment, agriculture, and national security” (Executive Office of the President, 2013).
“These are the policies,” Gray concluded. “This is, again, the center of the executive branch speaking to all of the executive branch agencies, the ones that ultimately end up implementing all the various laws that come out of Congress.”
Several challenges related to the current approaches to data sharing were highlighted during the Session 1 discussion. Workshop speakers and participants provided individual remarks that are summarized in this section.
The Limits to Data-Sharing Requirements
Given that the federal government requires researchers to share data that have been collected through the use of federal funds, several workshop participants raised the issue of just how far that requirement extends. Does it extend to any research project that has accepted any federal funds for any aspect of the project? Does it extend to research that was done 10 or 20 years ago?
Gwen Collman, the director of the Division of Extramural Research and Training at the National Institute of Environmental Health Sciences (NIEHS), noted that the National Institutes of Health (NIH) has data-sharing policies for the researchers that it funds. “We require data-sharing plans for some of our investigators depending on the size and scope of the funding that they receive,” she said.
Goldman pointed out that “[t]here are many statutes now that require regulatory agencies to use [the] best available data.” “The agencies are not supposed to simply use data that are submitted to them, but they are supposed to do a data dragnet and find the best available data whether these are data that the investigators want to submit or not.” But this does not take into account the researchers’ wishes, she said. “I think a concern by investigators has been, ‘I did not do that research for the purpose of a regulation. Now I am being asked to undertake the burden of doing all these special things for a regulatory agency that did not fund me.’”
The issue becomes even more complicated if the best available research was done by researchers in other countries. So, Goldman asked, “What is the obligation of investigators to a regulatory agency that perhaps did not fund them?” She offered as an example an investigator in Norway who did a study that EPA decided was one of the best studies on a particular contaminant, so this study, as long as it was published in the scientific literature, is supposed to be included in the EPA assessment.
“I do not think you can subpoena the investigator in Norway and get his data,” Verkuil said, “but it does raise a bit of a conundrum for the agency.” That conundrum centers on the precise meaning of the word
“consider,” because anything that an agency considers is supposed to be put in the record on review. “Now what is ‘considering’?” he asked. “If you are doing a proactive review of the scientific literature before you make a decision and you see something in a Norwegian scientist’s report, are you ‘considering’ it? ... If the scientist does not have, let’s say, underlying codes available and other things, which means it cannot be analyzed, then it probably would not have been considered. But I do not know how you go beyond that.”
Linda Birnbaum, director of NIEHS of NIH, offered as an example a project in which NIEHS funded a small piece of an analysis of some data that had been collected by Norwegian investigators. “It is often unclear to us, even given the Shelby Amendment, whether that data has to be provided upon FOIA or not,” she said. If the analysis is being used in rule making, one interpretation would be that because at least part of the analysis was federally funded, then the data would need to be provided, but she added, “It is certainly not crystal clear to us what our grantees have to make available in those situations.”
A second type of complication revolves around the issue of who owns the data being requested. As Steven Lamm of Consultants in Epidemiology & Occupational Health, LLC, noted, “So many of the papers that we use were somebody’s Ph.D. dissertation, and they are now in some other institution, or they are not interested in that area anymore, and the professor has moved on to other things.” In that case, he said, it can be exceptionally difficult to, first, find out where the data are and, second, discover who is in a position to release the data. “Those are critical field-level issues,” he said.
Alan Morrison of the George Washington University Law School agreed with Lamm and added that the issue becomes particularly tricky for research done at state universities when the questions arise as to whether the researcher owns the data or the university owns the data and what happens when there is a conflict.
Lamm added that a related issue concerns reimbursement for supplying the data. “If the agency is going to request data, it ought to have a budget that allows it to pay for the acquisition,” he said. “Projects have been funded. The budgets no longer exist. Asking somebody to go and find the data in the archives requires time and money.”
Dan Greenbaum, president of the Health Effects Institute, explored the issue further. “Going back to the study of Norway: suppose it is one of a dozen studies that have found similar things, maybe some positive, some negative, but the agency is considering it as one of several. Or
suppose that study in Norway is one of two studies worldwide that have found an effect that the agency is trying to characterize and think about regulating. Is there a different legal standard there?” How would an agency approach those two different situations? And how would a judge think about them?
Verkuil suggested that a judge would likely rule differently on the two situations. “If it is one of two studies, then you have to have it, and, as the agency, you better well track it down and have it ready. If it is one of ten, you explain why you did not do it, but you do not need it, I think.” In short, he said, you use logic to determine how important a study is to a particular decision.
Morrison added that it would be important whether there were any countervailing studies. That is, the role that a study plays in an agency’s determination vis-à-vis other studies is an important factor in determining whether a study must be included. “It is very context specific,” he said. “You really need to go out and try to get the data,” but if you cannot get the data that you want, you get the data that you can. “Many statutes require you to use the best scientific evidence available,” he said. “You may not be able to get the very best,” so you get the best available.
Goldman suggested that there are other considerations to take into account as well. “What is the seriousness of the outcome? Is this something that kills people or gives them a slight headache? Also, where was it published? The best journal in the world?” It is necessary to take into account the quality of the peer review and the judgment of scientists about the quality of the study.
Finally, it is important to think about when the study was done. “There are many things that have been demonstrated in the environmental literature decades ago that you could not get published today,” Goldman said. “Benzene and leukemia—nobody is seriously able to even study some of those. Asbestos and lung cancer—the data for those are not available in raw form. If they are available anywhere at all, they are probably in media that you cannot read anymore. You cannot utilize them. To say I am going to lop things off and only allow the use of data that you can actually acquire is difficult to hear in that context.”
What Exactly Is Required When Sharing Data?
Morrison raised the question of exactly what is expected to be provided under data-sharing requirements. “In addition to data that are
actually produced, there are a lot more things, stuff that is out there as part of the process,” he noted, “for example, the original proposal to a federal agency to do a study. There is the protocol that was developed. There are algorithms. There are models that were used by a grantee. Do those have to be made available ... to agencies, and should they be made available?” After all, he commented, understanding the data requires more than just having access to the data themselves; all of these other pieces may play a role as well.
Gray answered that scientific journals generally require authors to provide whatever things were used to produce the results, such as “the raw data. Many journals will require you to post your computer code, whether it is an analytic code in SAS [Statistical Analysis System] or you write your own code to help a particular model to get a result.” Gray said he was not certain whether the Shelby Amendment would require computer codes to be provided by researchers who received federal funding.
Goldman offered her own thoughts on the issue. “Speaking as an epidemiologist,” she said, “most of us believe that our questionnaires and protocols are fair game and that we do need to make them available when people want to see them. There are times when if you do not see the questionnaire and understand exactly how the data are collected that you really cannot understand what the responses mean. It is the art of epidemiology. Depending on how you ask the question [and whom you are asking], you can get a very different answer.”
It is, then, up to the various federal agencies to carry out the laws passed by Congress and to follow the instructions provided by the OMB and other executive agencies. During different sessions throughout the day, four presenters offered details about what specific federal agencies do in response to these laws and directives.
U.S. Environmental Protection Agency
Goldman spoke about EPA’s experience with data sharing under the Information Quality Act (IQA) and Circular A-110 during her opening remarks. “A couple of years ago, a couple of us looked at the experience at the Environmental Protection Agency under the IQA and found that according to EPA’s Web page at that time, over 10 years there were 79
requests that had been filed,” she said. “Of these, only two actually asked for raw data. The request for raw data was not a common request.”
Goldman then offered some details about the two requests. The first was a case that involved a perchlorate study, and the requester was an industry consortium called the Perchlorate Study Group. EPA had the relevant data from a contractor, and it made the data available to the study group, which “allowed them to examine the original brain images from the animals that were studied in this study as well as the original contractor’s reports that actually contained data tables.”
The second case was not so straightforward. In 2008 an industry group called the Association of Battery Manufacturers asked for raw data from a systematic review of a number of studies of lead toxicity. The principal investigator, Bruce Lanphear, had solicited colleagues all over the world to provide raw data from studies that they had carried out on the effects of lead toxicity on the intelligence quotient of children (this example is also referenced in the Chapter 3 discussion). EPA provided Lanphear with some of the funding for his systematic review, and the agency then used the result of this review in developing a lead-in-air standard.
“I spoke with Bruce about this [request for raw data] at the time we did our review,” Goldman said. “He had signed data transfer agreements with these investigators promising that he would not release these data to other people.... He felt that those agreements precluded him from sharing those data with anyone else. However, EPA ruled that the data needed to be made available pursuant to the Shelby Amendment because of the fact that some federal funding had been made available for doing this systematic review.” There were a few other factors as well, Goldman said, such as the fact that the battery manufacturers were suing EPA about the National Ambient Air Quality Standards, and EPA did not want to act until the lawsuit was settled. “But at the end of the day,” she said, “EPA prevailed upon Cincinnati Children’s Medical Center, and Bruce’s hard drive was taken and provided to EPA.”
Currently, Goldman continued, there is a new request for EPA to release raw data from the Harvard Six Cities Study and the American Cancer Society Study. It is from Senator David Vitter, who provided a statement on the case on March 11, 2014:
As the input and output files are fundamental to conducting reanalysis, I repeatedly requested that EPA (1) obtain all the data files; (2) determine which data files pose a threat to privacy; (3) immediately release all
data files that do not pose a threat to privacy; and (4) investigate measures to remove all personal health information from the files that contain confidential data prior to release. (Vitter, 2014)
“I think we are going to hear a lot more about this situation,” Goldman said.
Furthermore, in February 2014 a bill[10] was introduced in the House of Representatives to “prohibit the U.S. Environmental Protection Agency from proposing, finalizing, or disseminating regulations or assessments based upon science that is not transparent or reproducible.” Specifically, Goldman said, the bill is aimed at making sure that EPA specifically identifies all scientific and technical information used in proposing, finalizing, or disseminating any action and that it makes such information “publicly available in a manner that is sufficient for independent analysis and substantial reproduction of research results.” The actions covered by the bill, Goldman said, are not just regulations but any risk, exposure, or hazard assessment; criteria document; standard; limitation; regulatory impact analysis; or guidance—in effect, almost anything that the agency does.
National Institute for Occupational Safety and Health
During Session 3 of the workshop, John Howard, director of the National Institute for Occupational Safety and Health (NIOSH), offered six principles for the sharing of data based on lessons learned at NIOSH. Overall, he said, “Scientific data developed with taxpayer dollars by taxpayer-supported scientists should be shared with data requestors unless there is a strong countervailing interest that can be articulated [and] that supports a decision to withhold data in a manner that prevents data reanalysis.” In short, the sharing of data should be the default position, and if it is not feasible to share all of the data—for instance, because of privacy concerns—then one should share as many of the data as possible.
His first lesson is that “researchers need to think about the optimal data-sharing practices at the study concept stage.” There should be a balance between the ability of the investigators to complete the research mission and the ability of legitimate data seekers to have timely access to data from the study that can be used for reanalysis. “Frequent communication with data seekers during the study, while it is happening, is really highly recommended,” he said.

10 H.R. 4012, Secret Science Reform Act of 2014, 113th Congress.
Lesson two is that research budgets should prospectively include sufficient resources for the investigators to implement robust data-sharing plans. “NIOSH has found that implementation of data-sharing plans can be resource intensive,” he said.
Lesson three is that data use agreements have limited value. “While they are somewhat popular,” he said, “they do not provide particularly strong protections for sensitive, potentially identifying data provided to data analyzers because it is difficult to monitor and it is difficult to enforce specified restrictions on data use. Recognizing these limitations, we found that data use agreements need to clearly state what happens in the case of nonadherence.”
Lesson four is that secure enclaves can protect the confidentiality of highly sensitive and potentially identifying data sets while providing access for analysis that is maximally useful. He noted that “the data can be used within those enclaves, but only nonidentifying aggregate analysis can be removed from the enclave.”
Lesson five is to ensure that reanalysis is based on a strong and reproducible foundation. “Only clean, verified, finalized data sets from completed and published studies should be shared,” he said. “We found that sharing preliminary data sets, which were not the basis of the published study, definitely confuses the data reanalyzers and creates scientific confusion in the end.”
Lesson six is to make certain that study participants are not surprised by data disclosure issues that can arise from reanalysis. He noted that study participants should be made aware of the possible scope of disclosure at the time they enroll in the study.
National Center for Health Statistics
Edward Sondik, former director of the National Center for Health Statistics (NCHS) of the Centers for Disease Control and Prevention, spoke about data sharing at NCHS during Session 4 of the workshop. “Our role is to provide information for policy and research—information about the health care system and about the health of people in this country as compared with other countries,” he said. Section 306 of the Public Health Service Act describes the general NCHS mandate by saying that the center “shall conduct and support statistical and epidemiological activities for the purpose of improving the effectiveness, efficiency, and quality of health services in the United States.”
A second part of this mandate, Sondik said, is to ensure the widespread dissemination of and access to the data that it collects. NCHS collects a wide variety of data. It collects health and health care data in such surveys as the National Health Interview Survey and the National Health and Nutrition Examination Survey, it coordinates and collates data concerning births and deaths in the United States, and so on. “All of that is extremely important,” Sondik said, “but if we put it in a safe, it does absolutely no good at all.” Thus, the center’s prime directive includes not only the collection of data but also the dissemination of the information that it collects.
“Then we have another prime directive, which is about confidentiality,” he said. “It is really clear. It prohibits the release of potentially identifiable data. This is under Section 308(d) of the Public Health Service Act. Other agencies have their own confidentiality legislation. But in general, over the last several years it has been covered by the act we called CIPSEA, the Confidential Information Protection and Statistical Efficiency Act.”
There are serious penalties for violating confidentiality, he said—fines of up to $250,000 or up to 5 years in prison. “I took this very personally. This is what it says under 308(d): No information identifying the person supplying the information may be released in any form without the consent of the person. It is really clear.”
NCHS disseminates information in a variety of ways, including publications and public-use data files. Essentially everything that the center provides is put on the Web now, Sondik said. In addition to the publications and data files, there are a number of data access tools and also various linkages between the data, such as links from one survey to another. “We have some very interesting things that we do along that line,” he said.
NCHS maintains a balance between the widespread dissemination of the data and maintaining the confidentiality of the people in the data files. The center uses a number of strategies to maintain confidentiality, Sondik said. “First of all, we create public use data sets which are as deidentified as we can make them.... We try to do the best job we can and produce public use data sets that anybody can use [and] that will safeguard the identity of the people who supply the data.” However, he noted, there is no way to know exactly what the probability of disclosure is for any of the information in the data sets. “I wanted a probability of disclosure for 17 years, but we do not have probability of disclosures. We do not know what those probabilities are.”
To allow researchers to work with sensitive, personal, identifiable data, NCHS uses a variety of approaches, said Sondik. It developed research data centers that researchers can visit to work with the data or can access remotely. It has also used other data enclaves, and it has reworked the data sets in ways that change the data enough to protect the identity of the individual respondents but not so much that the data are no longer useful to researchers.
A strategy that NCHS has chosen not to use is licensing, which provides identifiable data to researchers under a licensing agreement that prohibits them from sharing the data publicly or with unauthorized users. The National Center for Education Statistics does use this approach, Sondik noted. “That is something that we at NCHS felt that we were not able to do because of our ... interpretation of the legislation,” he said. “It is interesting. You have two federal statistical agencies taking different strategies.”
Given the potential effects of reidentification on individuals and the possibility that such risks could keep people from taking part in environmental health studies, Sondik suggested that the effects of disclosure should also be discussed in terms of the ability to carry out research in this area. “There is impact on the individual,” he said, “but from my viewpoint, it was impact on the agency and our ability to continue to function and to be able to collect very sensitive information and preserve that information” that merits consideration.
Sondik discussed a particular issue that must be taken into account when an attempt is made to understand the likelihood of reidentification of the data in a data set. “The mosaic problem,” he explained, “is the fact that there is this semi-infinite set of data sets out there, and data in one can relate to another, which can relate to another, which can relate to another. You can start with some piece of information, relate that to a different data set [and] to another data set and wind up, after you go through this sequence, being able to actually identify a person supplying data in that first data set.” Because of the complexity of the issue and the uncertainties surrounding exactly what data are available, he said, it is extremely difficult to get any sort of estimate of the risk of reidentification through such an approach.
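The chain of linkages that Sondik describes can be illustrated with a small, entirely hypothetical sketch: a de-identified health record is matched through two auxiliary data sets on shared quasi-identifiers until a name falls out. All data, field names, and values below are invented for illustration; real linkage attacks use the same joining logic on far larger and messier sources.

```python
# Toy illustration (hypothetical data) of the "mosaic" linkage problem:
# a de-identified record is re-identified by chaining it through other
# data sets that share quasi-identifiers.

# De-identified study data: no names, but quasi-identifiers remain.
health_records = [
    {"zip": "20037", "birth_year": 1961, "diagnosis": "asthma"},
    {"zip": "45229", "birth_year": 1978, "diagnosis": "hypertension"},
]

# A second, seemingly unrelated data set (e.g., a public roster)
# links the same quasi-identifiers to an employer.
roster = [
    {"zip": "20037", "birth_year": 1961, "employer": "Acme Corp"},
]

# A third data set links employer and birth year to a name.
directory = [
    {"employer": "Acme Corp", "birth_year": 1961, "name": "J. Smith"},
]

def reidentify(record):
    """Chain a de-identified record through the auxiliary data sets."""
    for r in roster:
        if r["zip"] == record["zip"] and r["birth_year"] == record["birth_year"]:
            for d in directory:
                if (d["employer"] == r["employer"]
                        and d["birth_year"] == r["birth_year"]):
                    return d["name"]
    return None

# The "anonymous" asthma record now carries a name.
print(reidentify(health_records[0]))  # -> J. Smith
```

The point of the sketch is Sondik's: each data set is harmless on its own, and the risk arises only from the combination, which is why the probability of disclosure is so hard to estimate.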
Finally, Sondik suggested that in understanding the likely response of the public to the risks of reidentification, it is important to understand what the public expects in terms of privacy, particularly in this era when people are willingly offering up more and more personal information in various venues. “There has been some work on the part of the federal statistical agencies to understand what the public expectations are,” he said, “but I do not think we have a very good handle on it.” In particular, no one has explored the “inconsistency” between the information that people allow various entities—Facebook and other websites, for example—to collect and to use and the thought that these same people expect absolute confidentiality when it comes to the information that they provide to scientific researchers.
National Institutes of Health
Birnbaum described several NIH efforts focused on data during her presentation in Session 5 of the workshop.
For one, NIH is taking the lead on working to improve the reproducibility of data, Birnbaum said. In February 2014, NIH director Francis Collins and principal deputy director Lawrence Tabak published a commentary in Nature (Collins and Tabak, 2014) that discussed the lack of reproducibility in health research, especially preclinical studies, and described what NIH is intending to do to address the issue. For instance, NIH is developing a mandatory training module on the responsible conduct of research that will be given to NIH-funded trainees, both intramural and extramural, and the various institutes and centers at NIH are developing checklists to ensure the more systematic evaluation of grant applications. Pilot programs are also being run to assess the value of such things as cross-reviewing panels. “You take one reviewer from each panel,” Birnbaum said, “and have those reviewers look at the whole review on another ongoing panel, and they also have the specific task of evaluating the scientific premise of the application. In other words, when they are reviewing a grant, [they ask concerning] the key publications on which an application is based, Are these in fact valid or appropriate publications?”
NIH also has a major initiative called BD2K, an abbreviation for Big Data to Knowledge. The goal of this $60 million project, Birnbaum said, is to make biomedical data intelligible, accessible, and citable. As part of the project, four centers of excellence are being established for biomedical big data analysis.
Birnbaum also described efforts at her own institute, NIEHS, aimed at sharing environmental data. “Our journal, Environmental Health Perspectives, was a pioneer in open access for scientific journals, which is now being called for by legislation in the House,” she said. “Our policy for open access is decades old, and it predates the policies from PLoS (the Public Library of Science), NIH, and PubMed requirements, for example. And we are evolving publication policies to deal with data-sharing issues, such as all the supplemental materials that are online.” The journal is also working to improve reproducibility by, for example, requiring a checklist to be filled out when a paper is submitted to the journal to make sure that some of the key information related to study design is clearly presented in the study.
This section is a summary of the discussions related to federal agency implementations that took place throughout the workshop.
Examples of Secure Enclaves
During the discussion after Session 3, Howard further described the secure data enclaves with which NIOSH is experimenting. The work is part of an effort to make data that have been used in analyses accessible to people who wish to look at the data themselves to verify that an analysis was accurate. “We are trying to figure out how to make [those data] accessible while maintaining ... privacy, not just personal privacy, but also trade secret issues and several other avenues of privacy,” he said. The data enclaves are one avenue that NIOSH is examining in order to make data available in this way, Howard said, but the institute’s work with data enclaves is not far enough along that he could report how well they work. “It is fairly new,” he said. “We are probably not at the stage where I could report that it is entirely worked out. That remains to be seen.”
In contrast, Greenbaum noted during Session 4 of the workshop that other federal agencies do have working data enclaves. “The research data centers, which NCHS runs in Hyattsville, Maryland, are set up,” he said. “Investigators do go in. They get access to NHANES [National Health and Nutrition Examination Survey] data and other information that [aren’t] just generally available.” These data centers are secure facilities where researchers carry out analyses on the data that are kept there. The centers provide various statistical packages for the researchers to use. “You cannot take the data set back out,” he said, “but there is a mechanism for doing it in the federal government. How well that will work with every single study that the federal government has funded is where we are still in the pilot stages of figuring out.”
Looking to Scientific Journals
During Session 1, Gray noted that PLoS, the first and largest open-access online publisher, had just published new data policy procedures that focus on public access to the data. “Access to research results, immediately and without restriction, has always been at the heart of PLoS’s mission and the wider open-access movement,” he said. “However, without similar access to the data underlying the findings, the article can be of limited use. PLoS is trying to increase access to data and is revising its data-sharing policy.” Authors who submit to PLoS must now make all data publicly available without restriction immediately upon publication of the article, he said. “Going forward, I think we are going to see a different world. This is being driven by a lot of forces.”
Furthermore, as Birnbaum noted in Session 5, the NIEHS publication Environmental Health Perspectives was one of the first journals to require the authors of manuscripts to make the data in their papers available in a database, but many journals have followed suit, and that is becoming standard practice in the scientific publishing industry.
Francesca Dominici, professor of biostatistics and senior associate dean for research at the Harvard University School of Public Health, said during the discussion after Session 2 that there are journals in biostatistics that require authors to make both their data and their software available to others. On a more global level, Gray had said in Session 1 that the scientific world is changing rapidly, driven by the increasing ability to move information around electronically and a growing desire for openness in the scientific community. “If you want to publish in Nature, one of the very best journals in the world, [you] are required to make materials, data, and associated protocols promptly available to readers without undue qualifications. They want to see data sharing there.”
Information Concerning Data Collection and Analysis
In Session 2, Collman noted that many of the difficulties in reproducing studies are related to the methods used and that simply sharing data will do little to help. The scientific papers that describe a study often have relatively short methods sections that are not very detailed, and someone in another laboratory who is trying to replicate those studies from scratch can find it quite difficult.
When there are human data from epidemiological or clinical studies, Collman said, the real question is: What kinds of communication are necessary to fully communicate the details and the nuances of how those studies were done as well as what all of the data mean and how the data variables are created? “We do have metadata, and we have data dictionaries,” she said. “But oftentimes ... the methods of how we recruit and how we select participants and what the demographics of the original group are is not the most exciting paper to write. But in thinking about a future world where those data that come from those things end up in an open-access database with the proper protections or are available for sharing with consent of the research group, ... we really need to pay a lot of attention to the details of how these things came to be in order for the next group of scientists to be able to use them.”
In addition to getting access to the data and to information about how the data were collected, it is also important to have access to information about how the data were processed and used to come to a decision, said Gray in his presentation during Session 1 of the workshop. He acknowledged that one of the factors that causes people to question how a governmental agency came to a decision is the unavailability of the data that underlay the decision. If data are unavailable—for example, because they are considered to be confidential business information—then some people may be concerned that “the agency is playing around with data when they are doing their analyses” and that “some part of data that might be important is not being released.”
However, Gray continued, “I would say the place where I had the greatest trouble with this is actually understanding the agency’s underlying reasoning. How did the data that were available to the agency end up in this result?” In using data and scientific information to come to conclusions, there are always many choices to be made, Gray said. “What models are we going to use? Which populations are we going to use? Which studies will we rely upon? Which ones will we not rely upon? All of those can ultimately have an influence on the end product, especially if the end product is something quantitative, like a national ambient air quality standard or a reference dose in EPA’s integrated risk information system.” Thus, it is crucial when sharing data to also share how those data were used in coming to a particular conclusion.
Collins, F. S., and L. A. Tabak. 2014. NIH plans to enhance reproducibility. Nature 505(7485):612–613.
Dockery, D. W., C. A. Pope, X. Xu, J. D. Spengler, J. H. Ware, M. E. Fay, B. G. Ferris, and F. E. Speizer. 1993. An association between air pollution and mortality in six U.S. cities. New England Journal of Medicine 329:1753–1759.
Executive Office of the President. 2013. Memorandum for the heads of executive departments and agencies. Available at http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo__2013.pdf (accessed October 26, 2015).
IOM (Institute of Medicine). 2014. Discussion framework for clinical trial data sharing: Guiding principles, elements, and activities. Washington, DC: The National Academies Press.
OMB (Office of Management and Budget). 2002. Guidelines for ensuring and maximizing the quality, objectivity, utility, and integrity of information disseminated by federal agencies. Available at http://www.whitehouse.gov/omb/fedreg_reproducible (accessed October 26, 2015).
Pope, C. A., M. J. Thun, M. M. Namboodiri, D. W. Dockery, J. S. Evans, F. E. Speizer, and C. W. Heath. 1995. Particulate air pollution as a predictor of mortality in a prospective study of U.S. adults. American Journal of Respiratory and Critical Care Medicine 151:669–674.
Vitter, D. 2014. Memo to Dr. Francesca Grifo. Available at http://www.epw.senate.gov/public/_cache/files/6de2a2b9-ad38-41bc-a0c4-c909b391a526/031714vitterlettertodrgrifodatamisconduct.pdf (accessed October 28, 2015).