This chapter is presented in two parts. The first part introduces some of the terminology used throughout the workshop, as well as the federal laws and policies that form the framework within which the sharing of data on environmental health occurs. After the summary of the presentations, relevant discussions from sessions throughout the day[1] are described. The second part of the chapter summarizes the presentations that described current approaches to the sharing of environmental health data by federal agencies and identified the weaknesses and shortcomings of these approaches. Again, this is followed by a summary of the relevant discussion that occurred at the workshop.
Lynn Goldman, dean of the Milken Institute School of Public Health at George Washington University, introduced the topic of terminology in her workshop overview. “One thing that I have noticed is that some of the words that we use in science ... have sometimes not been used consistently,” she said, so she offered definitions for a series of terms that are important in talking about the sharing of environmental health data.
“Peer review is when you actually evaluate the scientific work by others in the same field,” she said. “A systematic review is a summary of the clinical literature,” she continued. “There are various methods that people can use. They can quantitatively pool the data or do a meta-analysis, which is a way of combining data from many different studies using a statistical process.”

[1] Presentations and, especially, discussion sessions often covered topics formally introduced in sessions different from the one being described in this summary. Where a discussion point from a different session is relevant, it is presented not in the order in which it was made but in proximity to the most appropriate discussion within this summary.
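To illustrate the statistical pooling that Goldman describes, the sketch below shows a simple fixed-effect meta-analysis, in which each study's effect estimate is weighted by the inverse of its variance before the estimates are combined. The study values are invented for illustration only; real meta-analyses involve many additional considerations (heterogeneity, random-effects models, publication bias).

```python
# Toy fixed-effect meta-analysis: pool per-study effect estimates by
# inverse-variance weighting. All study numbers are hypothetical.

def pool_fixed_effect(estimates, std_errors):
    """Return the pooled effect estimate and its standard error."""
    weights = [1.0 / se**2 for se in std_errors]   # inverse-variance weights
    total_w = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_w
    pooled_se = (1.0 / total_w) ** 0.5
    return pooled, pooled_se

# Three hypothetical studies of the same exposure-outcome association,
# expressed as effect estimates (e.g., log relative risks) with standard errors
effects = [0.30, 0.10, 0.25]
ses = [0.10, 0.20, 0.15]

est, se = pool_fixed_effect(effects, ses)
print(f"pooled effect = {est:.3f}, SE = {se:.3f}")
```

The pooled estimate is pulled toward the more precise studies (those with smaller standard errors), which is the core idea behind quantitative pooling across studies.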
There are various approaches to testing and validating previous scientific work, she said. “A reanalysis is when you conduct a further analysis of data.” A person doing a reanalysis of data may use the same programs and statistical methodologies that were originally used to analyze the data or may use alternative methodologies, but the point is to analyze exactly the same data and see if the same result emerges from the analysis.
“Replication means that you actually repeat a scientific experiment or a trial to obtain a consistent result,” she continued. The second experiment uses exactly the same protocols and statistical programs but with data from a different population. The goal is to see if the same results hold with data from a different population.
“And then, finally, when you reproduce, you are producing something that is very similar to that research, but it is in a different medium or context,” she said. In other words, a researcher who is reproducing an experiment addresses the same research question but from a different angle than the original researcher did. “Most of us, when we are doing systematic reviews, are more convinced that something is going on when we see reproducibility as well as replicability.”
Different Meanings of “Data”
During Session 2, Bernard Lo, president and chief executive officer of The Greenwall Foundation, noted that there are a number of different types of data. There are raw data, which come straight from the survey or the experiment. There are cleaned-up data, which consist of the raw data modified to remove obvious errors. There are processed data, which are data that have been computed and analyzed to extract relevant information. There is the final clean data set that is provided with a publication. And there are the metadata that describe the data. All of these types of data are important in different ways and for different purposes (see Figure 2-1).
Lo explained that investigators may want to make the different types of data available to different people at different times. For each data-sharing element or activity, different questions may need to be posed. What types of data will be shared? Who should be providing data for sharing? Who should be the recipients of the shared data? When in the study process should certain data be shared and how? Should some people have open access? Is it important to pay attention to proportionality in terms of the risks, benefits, and burdens to various parties who are going to be affected by data sharing (e.g., researchers, participants, sponsors, the public)?
Lo noted that he chairs an Institute of Medicine committee that is tasked with issuing a report on the responsible sharing of clinical trial data. In January 2014 that committee issued a preliminary report, Discussion Framework for Clinical Trial Data Sharing: Guiding
Principles, Elements, and Activities, which lays out the committee’s thoughts on the principles that should guide the sharing of clinical trial data, describes certain data-sharing activities, and defines the key elements of data and data-sharing activities (IOM, 2014). For its final report the committee’s charge is “to analyze the benefits, challenges, and risk of various models of data sharing and to make recommendations to enhance the responsible sharing of clinical trial data,” Lo said.[2] He added that while that committee is focused solely on clinical trials, many of its ideas and recommendations will likely apply to environmental health research more generally.
A variety of federal laws specify what data must be shared and under what circumstances. In general, these laws apply to data held by federal agencies and to data collected with federal funding.
Paul Verkuil, chairman of the Administrative Conference of the United States, offered some background on these laws. “As a traditional matter,” he said, “agencies were required to disclose data underlying investigations undertaken by agency scientists upon public request but were not required to disclose data from studies commissioned by the agency but performed by private entities. This framework has changed in recent years, and certain disclosure requirements also now apply to privately conducted research.”
Specifically, Verkuil said, several federal acts govern the sharing of data. The Freedom of Information Act (FOIA)[3] requires that agencies release records—including scientific data—upon public request. It does contain a number of exceptions that protect things like confidential business information and personal privacy. The Electronic Freedom of Information Act Amendments of 1996[4] require agencies to release electronic copies of documents that have been previously requested and that are likely to be the subject of future requests rather than waiting for subsequent requests. The result, Verkuil said, is much greater transparency than had previously been the case.

[2] The Committee on Strategies for Responsible Sharing of Clinical Trial Data released its final report in January 2015. Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk is available at www.nap.edu/catalog/18998.

[3] Freedom of Information Act, 5 U.S.C. § 552, amended by Public Law 104-231, 110 Stat. 3048, 104th Congress.

[4] Electronic Freedom of Information Act Amendments of 1996, Public Law 104-231, 104th Congress.
In 1998 a law referred to as the Shelby Amendment[5] was passed. That law required that all federally funded research data be made available to the public under FOIA. Traditionally, it was not the case that funded research had to be made available in response to a FOIA request. The Shelby Amendment changed that, requiring that data produced by grantees be released under FOIA, subject to the usual exceptions. The Shelby Amendment was enacted in response to a desire to reanalyze two studies, the Harvard Six Cities Study (Dockery et al., 1993) and the American Cancer Society Study (Pope et al., 1995), which examined the health risks caused by particulate matter in the air.
This was followed in 2001 by the enactment of the Information Quality Act, also called the Data Quality Act,[6] which was intended to improve the quality of information used and promulgated by agencies. Among other things, the act requires agencies to create a procedure that allows people to correct information that has been released if the information is erroneous, Verkuil said. Guidelines issued by the Office of Management and Budget (OMB) in response to the Information Quality Act state that when executive branch agencies provide “influential scientific, financial, or statistical information,” they also “shall include a high degree of transparency about data and methods to facilitate the reproducibility of information by qualified third parties” (OMB, 2002). These OMB guidelines have affected how agencies respond both to requests for data and to what are called “information corrections,” which are intended to correct information that has been promulgated.
Verkuil noted that contractors and grantees are treated differently in the Shelby Amendment, with only grantees being forced to make data available. “I do not think that this distinction is one that can last long,” he said, because once you get data from one type of federally funded research, it is difficult to imagine not requiring availability from all federally funded research. “The transparency is promoted when an agency relies upon a privately funded study. It urges the researcher to disclose the underlying data. When a private researcher declines, agencies should issue an explanation why they relied on such studies despite the declining. And agencies should require conflict-of-interest disclosures for all scientific research submitted to inform the decision-making process.”

[5] Shelby Amendment to the Omnibus Appropriations Act for Fiscal Year 1999, Public Law 105-277, 105th Congress.

[6] Data Quality Act, Section 515 of the Consolidated Appropriations Act for Fiscal Year 2001, Public Law 106-554, 106th Congress.
He noted that the ultimate beneficiary of such data sharing is the broader society. “Open communication among scientists and engineers and between these experts and the public accelerates scientific and technological advancement, strengthens the economy, educates the nation, and enhances democracy,” he said.
The Role of Courts
In addition to Congress and the executive branch, including federal agencies, courts also play a role in determining how data are shared. Verkuil described that role in his presentation.
Most of the rules produced by the U.S. Environmental Protection Agency (EPA) and other federal agencies are developed through informal rule making, Verkuil noted, but when agencies issue rules through that process, reviewing courts scrutinize them very carefully. The process is governed by the “arbitrary and capricious” standard of 5 U.S.C. § 706(2)(A).
For an agency’s rule to be upheld on judicial review, the agency must have placed in the administrative record all of the information that the agency relied upon to reach its decision—a requirement known as the “Portland Cement doctrine.”[7] “The E-Government Act makes it a little easier,” Verkuil said, “because you can post rules online and you can use regulations.gov to find out what else has been posted by other commenters.” Sometimes, however, there can be problems with paper-based comments, Verkuil said. For example, they may not be scanned. This is a transitional problem, he said, but a real one for agencies, depending on how many paper-based versus electronically transmitted comments they receive. Verkuil commented that his organization, the Administrative Conference of the United States, encourages agencies to do as much as possible electronically rather than on paper.

The amount of information that can be presented to the court is huge. “You can appreciate what the ‘record’ looks like on review,” Verkuil said. “It is enormous, and the agency has to decide what is in and what is out on its own.” The situation is different from formal adjudication or formal rule making, in which there is an administrative law judge making the decision, the judge decides what goes in and what does not, and what is accepted becomes part of a “record” in the traditional legal sense. In contrast, Verkuil explained, in informal rule making, the record is an accumulation of the best estimate of what needs to be in there to support the rule.

[7] The Portland Cement doctrine is named after the decision in Portland Cement Association v. Ruckelshaus, 486 F.2d 375, 393-94 (D.C. Cir. 1973).
Although courts can scrutinize the research and underlying data upon which the agency relied, Verkuil said, they generally defer to the agency on technical determinations. Courts recognize that they do not have the technical expertise to second-guess an agency’s scientific or technical judgment. Instead, the courts seek to make sure that an agency has behaved rationally in light of the data before it. A court also looks to determine whether an agency has observed appropriate procedures in reviewing the underlying evidence and whether it considered relevant information and alternative approaches. Those procedures, Verkuil said, all seek to make sure that what an agency has produced was produced in an appropriate scientific manner, with proper judgments being supported by the evidence.
Finally, Verkuil touched on the issue of whether someone can take an agency to court if it is believed that the agency failed to comply with the Data Quality Act. He explained that this issue is still being debated and has yet to be resolved convincingly. As you can imagine, he said, it could be a big issue if the courts decided to intervene every time an agency determines whether the Data Quality Act has been properly complied with.
Executive Branch Guidance
More generally, the executive branch puts forward a variety of policies regarding data sharing. George Gray, director of the Center for Risk Science and Public Health at the Milken Institute School of Public Health at George Washington University, described some of the more relevant policies during his presentation. These policies are different from laws and regulations, he emphasized. They provide guidance and do not have the same force as either laws or regulations.
Many of the relevant policies are promulgated by the OMB, he said, and he focused on what are referred to as the OMB “circulars,” which are instructions or information from the OMB to federal agencies that are generally in effect for 2 or more years.
The Shelby Amendment instructed the OMB to amend one of its circulars, Circular A-110.[8] In particular, the Shelby Amendment told the OMB to amend Circular A-110 to ensure that all data produced with funding from federal grants would be available to the public under FOIA.
In particular, Gray said, the amended version of Circular A-110 has the following provisions:
- It applies to grants and agreements with institutions of higher education, hospitals, and other nonprofit organizations.
- It obligates EPA to obtain from its contractors “research data” underlying findings used by the agency in developing action that has the force and effect of law.
- It has exceptions for drafts, peer reviews, personally identifiable information, and so on.
- It also exempts confidential business information or other information that needs to be confidential, but only until the data are published in a journal or cited by an agency in support of its action.
One of the things that is interesting about the circular, Gray said, is that rather than applying to research that is done within the federal government, it applies to grants and agreements that are made by the government with institutions of higher education, hospitals, and other nonprofits. In particular, it obligates EPA to get the research data that underlie findings that are used by the agency in developing agency actions. This is a way of building a record of what leads the agency to make a particular decision.
OMB Circular A-130[9] lays out the basic principles that the Executive Office of the President wishes agencies to follow in using information to come to decisions. Those principles, as laid out in the circular, include the following:
- The free flow of information between the government and the public is essential to a democratic society. In other words, Gray paraphrased, “Sharing is the right thing to do.”
- The nation can benefit from government information disseminated both by federal agencies and by diverse nonfederal parties, including state and local government agencies, educational and other not-for-profit institutions, and for-profit organizations.
- The open and efficient exchange of scientific and technical government information, subject to applicable national security controls and the proprietary rights of others, fosters excellence in scientific research and effective use of federal research and development funds.

[9] Circular A-130 can be found at http://www.whitehouse.gov/omb/circulars_a130_a130trans4 (accessed October 26, 2015).
Finally, Gray described a memorandum from the Office of Science and Technology Policy to the heads of executive departments and agencies (Executive Office of the President, 2013). “Again, this is the center of the executive branch giving instruction, guidance, policy approaches to the other executive branch agencies,” he said. The memorandum offered several of the administration’s “policy principles,” including the following:
- “The Administration is committed to ensuring that, to the greatest extent and with the fewest constraints possible and consistent with law and the objectives set out below, the direct results of federally funded scientific research are made available to and useful for the public, industry, and the scientific community” (Executive Office of the President, 2013). “They want to see the direct results of federally funded scientific research made available to and useful for the public, industry, and the scientific community,” Gray commented. “Again, this is the exhortation to more data sharing, more openness in the way things are done.”
- “Scientific research supported by the federal government catalyzes innovative breakthroughs that drive our economy. The results of that research become the grist for new insights and are assets for progress in areas such as health, energy, the environment, agriculture, and national security” (Executive Office of the President, 2013).
“These are the policies,” Gray concluded. “This is, again, the center of the executive branch speaking to all of the executive branch agencies, the ones that ultimately end up implementing all the various laws that come out of Congress.”
Several challenges related to the current approaches to data sharing were highlighted during the Session 1 discussion. Workshop speakers and participants provided individual remarks that are summarized in this section.
The Limits to Data-Sharing Requirements
Given that the federal government requires researchers to share data that have been collected through the use of federal funds, several workshop participants raised the issue of just how far that requirement extends. Does it extend to any research project that has accepted any federal funds for any aspect of the project? Does it extend to research that was done 10 or 20 years ago?
Gwen Collman, the director of the Division of Extramural Research and Training at the National Institute of Environmental Health Sciences (NIEHS), noted that the National Institutes of Health (NIH) has data-sharing policies for the researchers that it funds. “We require data-sharing plans for some of our investigators depending on the size and scope of the funding that they receive,” she said.
Goldman pointed out that “[t]here are many statutes now that require regulatory agencies to use [the] best available data.” “The agencies are not supposed to simply use data that are submitted to them, but they are supposed to do a data dragnet and find the best available data whether these are data that the investigators want to submit or not.” But this does not take into account the researchers’ wishes, she said. “I think a concern by investigators has been, ‘I did not do that research for the purpose of a regulation. Now I am being asked to undertake the burden of doing all these special things for a regulatory agency that did not fund me.’”
The issue becomes even more complicated if the best available research was done by researchers in other countries. So, Goldman asked, “What is the obligation of investigators to a regulatory agency that perhaps did not fund them?” She offered as an example an investigator in Norway who did a study that EPA decided was one of the best studies on a particular contaminant, so this study, as long as it was published in the scientific literature, is supposed to be included in the EPA assessment.
“I do not think you can subpoena the investigator in Norway and get his data,” Verkuil said, “but it does raise a bit of a conundrum for the agency.” That conundrum centers on the precise meaning of the word
“consider,” because anything that an agency considers is supposed to be put in the record on review. “Now what is ‘considering’?” he asked. “If you are doing a proactive review of the scientific literature before you make a decision and you see something in a Norwegian scientist’s report, are you ‘considering’ it? ... If the scientist does not have, let’s say, underlying codes available and other things, which means it cannot be analyzed, then it probably would not have been considered. But I do not know how you go beyond that.”
Linda Birnbaum, director of NIEHS of NIH, offered as an example a project in which NIEHS funded a small piece of an analysis of some data that had been collected by Norwegian investigators. “It is often unclear to us, even given the Shelby Amendment, whether that data has to be provided upon FOIA or not,” she said. If the analysis is being used in rule making, one interpretation would be that because at least part of the analysis was federally funded, then the data would need to be provided, but she added, “It is certainly not crystal clear to us what our grantees have to make available in those situations.”
A second type of complication revolves around the issue of who owns the data being requested. As Steven Lamm of Consultants in Epidemiology & Occupational Health, LLC, noted, “So many of the papers that we use were somebody’s Ph.D. dissertation, and they are now in some other institution, or they are not interested in that area anymore, and the professor has moved on to other things.” In that case, he said, it can be exceptionally difficult to, first, find out where the data are and, second, discover who is in a position to release the data. “Those are critical field-level issues,” he said.
Alan Morrison of the George Washington University Law School agreed with Lamm and added that the issue becomes particularly tricky for research done at state universities when the questions arise as to whether the researcher owns the data or the university owns the data and what happens when there is a conflict.
Lamm added that a related issue concerns reimbursement for supplying the data. “If the agency is going to request data, it ought to have a budget that allows it to pay for the acquisition,” he said. “Projects have been funded. The budgets no longer exist. Asking somebody to go and find the data in the archives requires time and money.”
Dan Greenbaum, president of the Health Effects Institute, explored the issue further. “Going back to the study of Norway: suppose it is one of a dozen studies that have found similar things, maybe some positive, some negative, but the agency is considering it as one of several. Or
suppose that study in Norway is one of two studies worldwide that have found an effect that the agency is trying to characterize and think about regulating. Is there a different legal standard there?” How would an agency approach those two different situations? And how would a judge think about them?
Verkuil suggested that a judge would likely rule differently on the two situations. “If it is one of two studies, then you have to have it, and, as the agency, you better well track it down and have it ready. If it is one of ten, you explain why you did not do it, but you do not need it, I think.” In short, he said, you use logic to determine how important a study is to a particular decision.
Morrison added that it would be important whether there were any countervailing studies. That is, the role that a study plays in an agency’s determination vis-à-vis other studies is an important factor in determining whether a study must be included. “It is very context specific,” he said. “You really need to go out and try to get the data,” but if you cannot get the data that you want, you get the data that you can. “Many statutes require you to use the best scientific evidence available,” he said. “You may not be able to get the very best,” so you get the best available.
Goldman suggested that there are other considerations to take into account as well. “What is the seriousness of the outcome? Is this something that kills people or gives them a slight headache? Also, where was it published? The best journal in the world?” It is necessary to take into account the quality of the peer review and the judgment of scientists about the quality of the study.
Finally, it is important to think about when the study was done. “There are many things that have been demonstrated in the environmental literature decades ago that you could not get published today,” Goldman said. “Benzene and leukemia—nobody is seriously able to even study some of those. Asbestos and lung cancer—the data for those are not available in raw form. If they are available anywhere at all, they are probably in media that you cannot read anymore. You cannot utilize them. To say I am going to lop things off and only allow the use of data that you can actually acquire is difficult to hear in that context.”
What Exactly Is Required When Sharing Data?
Morrison raised the question of exactly what is expected to be provided under data-sharing requirements. “In addition to data that are
actually produced, there are a lot more things, stuff that is out there as part of the process,” he noted, “for example, the original proposal to a federal agency to do a study. There is the protocol that was developed. There are algorithms. There are models that were used by a grantee. Do those have to be made available ... to agencies, and should they be made available?” After all, he commented, understanding the data requires more than just having access to the data themselves; all of these other pieces may play a role as well.
Gray answered that scientific journals generally require authors to provide whatever things were used to produce the results, such as “the raw data. Many journals will require you to post your computer code, whether it is an analytic code in SAS [Statistical Analysis System] or you write your own code to help a particular model to get a result.” Gray said he was not certain whether the Shelby Amendment would require computer codes to be provided by researchers who received federal funding.
Goldman offered her own thoughts on the issue. “Speaking as an epidemiologist,” she said, “most of us believe that our questionnaires and protocols are fair game and that we do need to make them available when people want to see them. There are times when if you do not see the questionnaire and understand exactly how the data are collected that you really cannot understand what the responses mean. It is the art of epidemiology. Depending on how you ask the question [and whom you are asking], you can get a very different answer.”
It is, then, up to the various federal agencies to carry out the laws passed by Congress and to follow the instructions provided by the OMB and other executive agencies. During different sessions throughout the day, four presenters offered details about what specific federal agencies do in response to these laws and directives.
U.S. Environmental Protection Agency
Goldman spoke about EPA’s experience with data sharing under the Information Quality Act (IQA) and Circular A-110 during her opening remarks. “A couple of years ago, a couple of us looked at the experience at the Environmental Protection Agency under the IQA and found that according to EPA’s Web page at that time, over 10 years there were 79
requests that had been filed,” she said. “Of these, only two actually asked for raw data. The request for raw data was not a common request.”
Goldman then offered some details about the two requests. The first was a case that involved a perchlorate study, and the requester was an industry consortium called the Perchlorate Study Group. EPA had the relevant data from a contractor, and it made the data available to the study group, which “allowed them to examine the original brain images from the animals that were studied in this study as well as the original contractor’s reports that actually contained data tables.”
The second case was not so straightforward. In 2008 an industry group called the Association of Battery Manufacturers asked for raw data from a systematic review of a number of studies of lead toxicity. The principal investigator, Bruce Lanphear, had solicited colleagues all over the world to provide raw data from studies that they had carried out on the effects of lead toxicity on the intelligence quotient of children (this example is also referenced in the Chapter 3 discussion). EPA provided Lanphear with some of the funding for his systematic review, and the agency then used the result of this review in developing a lead-in-air standard.
“I spoke with Bruce about this [request for raw data] at the time we did our review,” Goldman said. “He had signed data transfer agreements with these investigators promising that he would not release these data to other people.... He felt that those agreements precluded him from sharing those data with anyone else. However, EPA ruled that the data needed to be made available pursuant to the Shelby Amendment because of the fact that some federal funding had been made available for doing this systematic review.” There were a few other factors as well, Goldman said, such as the fact that the battery manufacturers were suing EPA about the National Ambient Air Quality Standards, and EPA did not want to act until the lawsuit was settled. “But at the end of the day,” she said, “EPA prevailed upon Cincinnati Children’s Medical Center, and Bruce’s hard drive was taken and provided to EPA.”
Currently, Goldman continued, there is a new request for EPA to release raw data from the Harvard Six Cities Study and the American Cancer Society Study. It is from Senator David Vitter, who provided a statement on the case on March 11, 2014:
As the input and output files are fundamental to conducting reanalysis, I repeatedly requested that EPA (1) obtain all the data files; (2) determine which data files pose a threat to privacy; (3) immediately release all
data files that do not pose a threat to privacy; and (4) investigate measures to remove all personal health information from the files that contain confidential data prior to release. (Vitter, 2014)
“I think we are going to hear a lot more about this situation,” Goldman said.
Furthermore, in February 2014 a bill[10] was introduced in the House of Representatives to “prohibit the U.S. Environmental Protection Agency from proposing, finalizing, or disseminating regulations or assessments based upon science that is not transparent or reproducible.” Specifically, Goldman said, the bill is aimed at making sure that EPA specifically identifies all scientific and technical information used in proposing, finalizing, or disseminating any action and that it makes such information “publicly available in a manner that is sufficient for independent analysis and substantial reproduction of research results.” The actions covered by the bill, Goldman said, are not just regulations but any risk, exposure, or hazard assessment; criteria document; standard; limitation; regulatory impact analysis; or guidance—in effect, almost anything that the agency does.
National Institute for Occupational Safety and Health
During Session 3 of the workshop, John Howard, director of the National Institute for Occupational Safety and Health (NIOSH), offered six principles for the sharing of data based on lessons learned at NIOSH. Overall, he said, “Scientific data developed with taxpayer dollars by taxpayer-supported scientists should be shared with data requestors unless there is a strong countervailing interest that can be articulated [and] that supports a decision to withhold data in a manner that prevents data reanalysis.” In short, the sharing of data should be the default position, and if it is not feasible to share all of the data—for instance, because of privacy concerns—then one should share as many of the data as possible.
His first lesson is that “researchers need to think about the optimal data-sharing practices at the study concept stage.” There should be a balance between the ability of the investigators to complete the research mission and the ability of legitimate data seekers to have timely access to data from the study that can be used for reanalysis. “Frequent communication with data seekers during the study, while it is happening, is really highly recommended,” he said.

10 H.R. 4012, Secret Science Reform Act of 2014, 113th Congress.
Lesson two is that research budgets should prospectively include sufficient resources for the investigators to implement robust data-sharing plans. “NIOSH has found that implementation of data-sharing plans can be resource intensive,” he said.
Lesson three is that data use agreements have limited value. “While they are somewhat popular,” he said, “they do not provide particularly strong protections for sensitive, potentially identifying data provided to data analyzers because it is difficult to monitor and it is difficult to enforce specified restrictions on data use. Recognizing these limitations, we found that data use agreements need to clearly state what happens in the case of nonadherence.”
Lesson four is that secure enclaves can protect the confidentiality of highly sensitive and potentially identifying data sets while providing access for analysis that is maximally useful. He noted that “the data can be used within those enclaves, but only nonidentifying aggregate analysis can be removed from the enclave.”
Lesson five is to ensure that reanalysis is based on a strong and reproducible foundation. “Only clean, verified, finalized data sets from completed and published studies should be shared,” he said. “We found that sharing preliminary data sets, which were not the basis of the published study, definitely confuses the data reanalyzers and creates scientific confusion in the end.”
Lesson six is to make certain that study participants are not surprised by data disclosure issues that can arise from reanalysis. He noted that study participants should be made aware of the possible scope of disclosure at the time they enroll in the study.
National Center for Health Statistics
Edward Sondik, former director of the National Center for Health Statistics (NCHS) of the Centers for Disease Control and Prevention, spoke about data sharing at NCHS during Session 4 of the workshop. “Our role is to provide information for policy and research—information about the health care system and about the health of people in this country as compared with other countries,” he said. Section 306 of the Public Health Service Act describes the general NCHS mandate by saying that the center “shall conduct and support statistical and epidemiological activities for the purpose of improving the effectiveness, efficiency, and quality of health services in the United States.”
A second part of this mandate, Sondik said, is to ensure the widespread dissemination of and access to the data that it collects. NCHS collects a wide variety of data. It collects health and health care data in such surveys as the National Health Interview Survey and the National Health and Nutrition Examination Survey, it coordinates and collates data concerning births and deaths in the United States, and so on. “All of that is extremely important,” Sondik said, “but if we put it in a safe, it does absolutely no good at all.” Thus, the center’s prime directive includes not only the collection of data but also the dissemination of the information that it collects.
“Then we have another prime directive, which is about confidentiality,” he said. “It is really clear. It prohibits the release of potentially identifiable data. This is under Section 308(d) of the Public Health Service Act. Other agencies have their own confidentiality legislation. But in general, over the last several years it has been covered by the act we called CIPSEA, the Confidential Information Protection and Statistical Efficiency Act.”
There are serious penalties for violating confidentiality, he said—fines of up to $250,000 or up to 5 years in prison. “I took this very personally. This is what it says under 308(d): No information identifying the person supplying the information may be released in any form without the consent of the person. It is really clear.”
NCHS disseminates information in a variety of ways, including publications and public-use data files. Essentially everything that the center provides is put on the Web now, Sondik said. In addition to the publications and data files, there are a number of data access tools and also various linkages between the data, such as links from one survey to another. “We have some very interesting things that we do along that line,” he said.
NCHS maintains a balance between the widespread dissemination of the data and maintaining the confidentiality of the people in the data files. The center uses a number of strategies to maintain confidentiality, Sondik said. “First of all, we create public use data sets which are as deidentified as we can make them.... We try to do the best job we can and produce public use data sets that anybody can use [and] that will safeguard the identity of the people who supply the data.” However, he noted, there is no way to know exactly what the probability of disclosure is for any of the information in the data sets. “I wanted a probability of disclosure for 17 years, but we do not have probability of disclosures. We do not know what those probabilities are.”
To allow researchers to work with sensitive, personal, identifiable data, NCHS uses a variety of approaches, said Sondik. It developed research data centers that researchers can visit to work with the data or can access remotely. It has also used other data enclaves, and it has reworked the data sets in ways that change the data enough to protect the identity of the individual respondents but not so much that the data are no longer useful to researchers.
A strategy that NCHS has chosen not to use is licensing, which provides identifiable data to researchers under a licensing agreement that prohibits them from sharing the data publicly or with unauthorized users. The National Center for Education Statistics does use this approach, Sondik noted. “That is something that we at NCHS felt that we were not able to do because of our ... interpretation of the legislation,” he said. “It is interesting. You have two federal statistical agencies taking different strategies.”
Given the potential effects of reidentification on individuals and the possibility that such risks could keep people from taking part in environmental health studies, Sondik suggested that the effects of disclosure should also be discussed in terms of the ability to carry out research in this area. “There is impact on the individual,” he said, “but from my viewpoint, it was impact on the agency and our ability to continue to function and to be able to collect very sensitive information and preserve that information” that merits consideration.
Sondik discussed a particular issue that must be taken into account when an attempt is made to understand the likelihood of reidentification of the data in a data set. “The mosaic problem,” he explained, “is the fact that there is this semi-infinite set of data sets out there, and data in one can relate to another, which can relate to another, which can relate to another. You can start with some piece of information, relate that to a different data set [and] to another data set and wind up, after you go through this sequence, being able to actually identify a person supplying data in that first data set.” Because of the complexity of the issue and the uncertainties surrounding exactly what data are available, he said, it is extremely difficult to get any sort of estimate of the risk of reidentification through such an approach.
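The chain of linkages that Sondik describes can be illustrated with a small, entirely hypothetical sketch: a de-identified health record is matched through two auxiliary data sets on shared quasi-identifiers until a name falls out. All data, field names, and values below are invented for illustration; real linkage attacks use the same joining logic on far larger and messier sources.

```python
# Toy illustration (hypothetical data) of the "mosaic" linkage problem:
# a de-identified record is re-identified by chaining it through other
# data sets that share quasi-identifiers.

# De-identified study data: no names, but quasi-identifiers remain.
health_records = [
    {"zip": "20037", "birth_year": 1961, "diagnosis": "asthma"},
    {"zip": "45229", "birth_year": 1978, "diagnosis": "hypertension"},
]

# A second, seemingly unrelated data set (e.g., a public roster)
# links the same quasi-identifiers to an employer.
roster = [
    {"zip": "20037", "birth_year": 1961, "employer": "Acme Corp"},
]

# A third data set links employer and birth year to a name.
directory = [
    {"employer": "Acme Corp", "birth_year": 1961, "name": "J. Smith"},
]

def reidentify(record):
    """Chain a de-identified record through the auxiliary data sets."""
    for r in roster:
        if r["zip"] == record["zip"] and r["birth_year"] == record["birth_year"]:
            for d in directory:
                if (d["employer"] == r["employer"]
                        and d["birth_year"] == r["birth_year"]):
                    return d["name"]
    return None

# The "anonymous" asthma record now carries a name.
print(reidentify(health_records[0]))  # -> J. Smith
```

The point of the sketch is Sondik's: each data set is harmless on its own, and the risk arises only from the combination, which is why the probability of disclosure is so hard to estimate.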
Finally, Sondik suggested that in understanding the likely response of the public to the risks of reidentification, it is important to understand what the public expects in terms of privacy, particularly in this era when people are willingly offering up more and more personal information in various venues. “There has been some work on the part of the federal statistical agencies to understand what the public expectations are,” he said, “but I do not think we have a very good handle on it.” In particular, no one has explored the “inconsistency” between the information that people allow various entities—Facebook and other websites, for example—to collect and to use and the thought that these same people expect absolute confidentiality when it comes to the information that they provide to scientific researchers.
National Institutes of Health
Birnbaum described several NIH efforts focused on data during her presentation in Session 5 of the workshop.
For one, NIH is taking the lead on working to improve the reproducibility of data, Birnbaum said. In February 2014, NIH director Francis Collins and principal deputy director Lawrence Tabak published a commentary in Nature (Collins and Tabak, 2014) that discussed the lack of reproducibility in health research, especially preclinical studies, and described what NIH is intending to do to address the issue. For instance, NIH is developing a mandatory training module on the responsible conduct of research that will be given to NIH-funded trainees, both intramural and extramural, and the various institutes and centers at NIH are developing checklists to ensure the more systematic evaluation of grant applications. Pilot programs are also being run to assess the value of such things as cross-reviewing panels. “You take one reviewer from each panel,” Birnbaum said, “and have those reviewers look at the whole review on another ongoing panel, and they also have the specific task of evaluating the scientific premise of the application. In other words, when they are reviewing a grant, [they ask concerning] the key publications on which an application is based, Are these in fact valid or appropriate publications?”
NIH also has a major initiative called BD2K, an abbreviation for Big Data to Knowledge. The goal of this $60 million project, Birnbaum said, is to make biomedical data intelligible, accessible, and citable. As part of the project, four centers of excellence are being established for biomedical big data analysis.
Birnbaum also described efforts at her own institute, NIEHS, aimed at sharing environmental data. “Our journal, Environmental Health Perspectives, was a pioneer in open access for scientific journals, which is now being called for by legislation in the House,” she said. “Our policy for open access is decades old, and it predates the policies from PLoS (the Public Library of Science), NIH, and PubMed requirements, for example. And we are evolving publication policies to deal with data-sharing issues, such as all the supplemental materials that are online.” The journal is also working to improve reproducibility by, for example, requiring a checklist to be filled out when a paper is submitted to the journal to make sure that some of the key information related to study design is clearly presented in the study.
This section is a summary of the discussions related to federal agency implementations that took place throughout the workshop.
Examples of Secure Enclaves
During the discussion after Session 3, Howard further described the secure data enclaves with which NIOSH is experimenting. The work is part of an effort to make data that have been used in analyses accessible to people who wish to look at the data themselves to verify that an analysis was accurate. “We are trying to figure out how to make [those data] accessible while maintaining ... privacy, not just personal privacy, but also trade secret issues and several other avenues of privacy,” he said. The data enclaves are one avenue that NIOSH is examining in order to make data available in this way, Howard said, but the institute’s work with data enclaves is not far enough along that he could report how well they work. “It is fairly new,” he said. “We are probably not at the stage where I could report that it is entirely worked out. That remains to be seen.”
In contrast, Greenbaum noted during Session 4 of the workshop that other federal agencies do have working data enclaves. “The research data centers, which NCHS runs in Hyattsville, Maryland, are set up,” he said. “Investigators do go in. They get access to NHANES [National Health and Nutrition Examination Survey] data and other information that [aren’t] just generally available.” These data centers are secure facilities where researchers carry out analyses on the data that are kept there. The centers provide various statistical packages for the researchers to use. “You cannot take the data set back out,” he said, “but there is a mechanism for doing it in the federal government. How well that will work with every single study that the federal government has funded is where we are still in the pilot stages of figuring out.”
Looking to Scientific Journals
During Session 1, Gray noted that PLoS, the first and largest open-access online publisher, had just published new data policy procedures that focus on public access to the data. “Access to research results, immediately and without restriction, has always been at the heart of PLoS’s mission and the wider open-access movement,” he said. “However, without similar access to the data underlying the findings, the article can be of limited use. PLoS is trying to increase access to data and is revising its data-sharing policy.” Authors who submit to PLoS must now make all data publicly available without restriction immediately upon publication of the article, he said. “Going forward, I think we are going to see a different world. This is being driven by a lot of forces.”
Furthermore, as Birnbaum noted in Session 5, the NIEHS publication Environmental Health Perspectives was one of the first journals to require the authors of manuscripts to make the data in their papers available in a database, but many journals have followed suit, and that is becoming standard practice in the scientific publishing industry.
Francesca Dominici, professor of biostatistics and senior associate dean for research at the Harvard University School of Public Health, said during the discussion after Session 2 that there are journals in biostatistics that require authors to make both their data and their software available to others. On a more global level, Gray had said in Session 1 that the scientific world is changing rapidly, driven by the increasing ability to move information around electronically and a growing desire for openness in the scientific community. “If you want to publish in Nature, one of the very best journals in the world, [you] are required to make materials, data, and associated protocols promptly available to readers without undue qualifications. They want to see data sharing there.”
Information Concerning Data Collection and Analysis
In Session 2, Collman noted that many of the difficulties in reproducing studies are related to the methods used and that simply sharing data will do little to help. The scientific papers that describe a study often have relatively short methods sections that are not very detailed, and someone in another laboratory who is trying to replicate those studies from scratch can find it quite difficult.
When there are human data from epidemiological or clinical studies, Collman said, the real question is: What kinds of communication are necessary to fully communicate the details and the nuances of how those studies were done as well as what all of the data mean and how the data variables are created? “We do have metadata, and we have data dictionaries,” she said. “But oftentimes ... the methods of how we recruit and how we select participants and what the demographics of the original group are is not the most exciting paper to write. But in thinking about a future world where those data that come from those things end up in an open-access database with the proper protections or are available for sharing with consent of the research group, ... we really need to pay a lot of attention to the details of how these things came to be in order for the next group of scientists to be able to use them.”
In addition to getting access to the data and to information about how the data were collected, it is also important to have access to information about how the data were processed and used to come to a decision, said Gray in his presentation during Session 1 of the workshop. He acknowledged that one of the factors that causes people to question how a governmental agency came to a decision is the unavailability of the data that underlay the decision. If data are unavailable—for example, because they are considered to be confidential business information—then some people may be concerned that “the agency is playing around with data when they are doing their analyses” and that “some part of data that might be important is not being released.”
However, Gray continued, “I would say the place where I had the greatest trouble with this is actually understanding the agency’s underlying reasoning. How did the data that were available to the agency end up in this result?” In using data and scientific information to come to conclusions, there are always many choices to be made, Gray said. “What models are we going to use? Which populations are we going to use? Which studies will we rely upon? Which ones will we not rely upon? All of those can ultimately have an influence on the end product, especially if the end product is something quantitative, like a national ambient air quality standard or a reference dose in EPA’s integrated risk information system.” Thus, it is crucial when sharing data to also share how those data were used in coming to a particular conclusion.
Collins, F. S., and L. A. Tabak. 2014. NIH plans to enhance reproducibility. Nature 505(7485):612–613.
Dockery, D. W., C. A. Pope, X. Xu, J. D. Spengler, J. H. Ware, M. E. Fay, B. G. Ferris, and F. E. Speizer. 1993. An association between air pollution and mortality in six U.S. cities. New England Journal of Medicine 329:1753–1759.
Executive Office of the President. 2013. Memorandum for the heads of executive departments and agencies. Available at http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo__2013.pdf (accessed October 26, 2015).
IOM (Institute of Medicine). 2014. Discussion framework for clinical trial data sharing: Guiding principles, elements, and activities. Washington, DC: The National Academies Press.
OMB (Office of Management and Budget). 2002. Guidelines for ensuring and maximizing the quality, objectivity, utility, and integrity of information disseminated by federal agencies. Available at http://www.whitehouse.gov/omb/fedreg_reproducible (accessed October 26, 2015).
Pope, C. A., M. J. Thun, M. M. Namboodiri, D. W. Dockery, J. S. Evans, F. E. Speizer, and C. W. Heath. 1995. Particulate air pollution as a predictor of mortality in a prospective study of U.S. adults. American Journal of Respiratory and Critical Care Medicine 151:669–674.
Vitter, D. 2014. Memo to Dr. Francesca Grifo. Available at http://www.epw.senate.gov/public/_cache/files/6de2a2b9-ad38-41bc-a0c4-c909b391a526/031714vitterlettertodrgrifodatamisconduct.pdf (accessed October 28, 2015).