Key Messages Identified by Individual Speakers
• Data sharing can enhance understanding of the results of an individual clinical trial and enable the pooling of data from multiple trials to extend scientific discoveries beyond those derivable from any single study.
• The moral and ethical arguments for data sharing center on fulfilling obligations to research participants, minimizing safety risks, and honoring the nature of medical research as a public good.
• The practical and scientific arguments for data sharing include improving the accuracy of research, informing risk/benefit analysis of treatment options, strengthening collaborations, accelerating biomedical research, and restoring trust in the clinical research enterprise.
• A cultural shift has already begun as leaders in industry, academia, and regulatory agencies recognize the value in increased transparency and data sharing and are focusing on how—instead of why—data should be shared.
• Participant-level data are particularly useful when shared, but care must be taken to avoid drawing inaccurate conclusions from reanalysis of such data.
Clinical data come in a variety of formats (see Box 2-1), from the raw data collected in case report forms during trials to the coded data stored in computerized databases to the summary data made available through journals and registries like ClinicalTrials.gov. Data sharing can also occur at many levels. Several of the presenters at the workshop described these data-sharing continuums and discussed the benefits and risks of data sharing, based on the degree to which participant-level data are made available to researchers and the public.

BOX 2-1

Terms such as “participant-level data,” “individual patient data,” and “raw data” are not well defined, noted Elizabeth Loder of the BMJ. A mutual understanding of the way these data are generated and shared can help alleviate ambiguities in nomenclature. In a typical multicenter clinical trial, data originate with case report forms, which can be handwritten or electronic. Study monitors audit the data, either at individual sites or electronically, to ensure accuracy. When a form contains an entry that is difficult to interpret or obviously mistaken, the monitors send a query back to the investigator or study staff to resolve the problem. Each query has to be explained and resolved before the data are entered into the coordinating center database (Kirwan et al., 2008). At several points in this process, a portion of the data is coded or categorized, and additional checks are performed to make sure the data entry is correct. Sometimes in the process of data entry, additional queries about the data are generated that must be addressed by the original investigator and the study staff.

The term “participant-level data” generally refers to the de-identified records of individual patients generated through this process. De-identification is the process by which personal information that can be used to identify an individual is removed. However, even participant-level data may not capture all relevant information recorded in the raw dataset. For example, Loder described several challenges involved in coding adverse events. Misclassification of adverse events in clinical trials can have serious consequences—as when adverse events like suicidal behavior are coded only as emotional lability—so systems have evolved to minimize this possibility. Adverse events usually are categorized using a predefined hierarchy or organizational system. But the symptoms reported by patients do not necessarily fall into this hierarchy or system. As a result, such symptoms can be interpreted in different ways. Because of this ambiguity, some have argued for access to raw data as reported by patients or researchers on the case report forms before any coding has taken place (Gøtzsche, 2011).
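As a rough illustration of the de-identification step described above, the sketch below drops direct identifiers from a participant record and replaces the patient ID with a one-way pseudonym so records can still be linked within a study. The field names and the salting scheme are invented for illustration; real de-identification follows formal privacy standards (such as the enumerated identifiers in privacy regulations), not an ad hoc rule set like this.

```python
import hashlib

# Hypothetical field names; purely illustrative, not a regulatory list.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "medical_record_number"}

def deidentify(record: dict, salt: str = "study-secret") -> dict:
    """Drop direct identifiers and replace the patient ID with a
    one-way pseudonym so records remain linkable within the study."""
    clean = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    digest = hashlib.sha256((salt + record["patient_id"]).encode()).hexdigest()
    clean["patient_id"] = digest[:12]
    return clean

raw = {"patient_id": "P-0042", "name": "Jane Doe", "phone": "555-0100",
       "age": 57, "adverse_event": "headache"}
shared = deidentify(raw)
# 'shared' keeps age and adverse_event but carries no direct identifiers
```

Note that even a careful version of this step cannot restore information lost earlier in the pipeline, which is the point Loder makes about coded adverse events: what is shared downstream is only as complete as what was captured upstream.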
In some trials, data are not even made available to individual researchers participating in a multicenter trial. Sometimes, data are released to researchers not associated with the study only if they show a genuine research interest in the question and a track record of research capability. In some cases, data are shared with everyone.
De-identified patient data have two major uses, observed Deborah Zarin, director of ClinicalTrials.gov at the National Library of Medicine. They can improve transparency, helping to understand the results of an individual clinical trial, including what happened to individuals in the trial, and they can be pooled to discover new things not identified in the individual trials.
Data Sharing to Enable Independent Reanalysis
Steven Goodman, associate dean for clinical and translational research and professor of medicine and health policy and research at the Stanford University School of Medicine, discussed the former use case in the context of ensuring that a study was correctly analyzed and interpreted. Independent reanalysis of data is the basis of reproducible research and can be an extremely difficult task. An example he mentioned was a study of childhood asthma that had 72 different study forms, 109 form revisions, and almost 300,000 records in the database. The original manuscript started with 73 tables and 9 figures and underwent 40 revisions. The published manuscript contained three tables and two figures. “How do we begin from this tiny little slice that we see to work backward and figure out whether what they did is right?” he asked. While the top tier of journals may have methodologists who can begin to check the chain of scientific custody from protocol to conduct to data to analysis to results, other journals have to rely on peer reviewers to detect problems. The authors of published studies can put additional information on the Web in the form of supplementary material and appendixes, but in reality, checking the accuracy of the results for a study like this is extremely difficult.
In talking about the tools that are needed to ensure that published findings are based on sound data and analyses, Goodman referenced a paper titled “Reproducible Epidemiologic Research” that proposes a standard for reproducibility (Peng et al., 2006). The premise behind that paper is that independent replication of research findings is the fundamental mechanism by which scientific evidence accumulates to support a hypothesis. The authors, therefore, argue that datasets and software should be made available to allow other researchers to conduct their own analyses and verify the published results.
Peter Doshi, a postdoctoral fellow at the Johns Hopkins University School of Medicine, also discussed the application of shared data to credible assessment of clinical trial results. Doshi, however, argued for a broader view of what should be considered clinical trial data. He proposed that detailed records of measurements and analyses, as well as narratives—including descriptions of patient dispositions, study protocols, and even correspondence—are needed to evaluate the quality of published trial results.
Data Sharing for Discovery
Participant-level data from multiple trials also can be combined to learn more than can be derived from the results of a single trial. Elizabeth Loder, clinical epidemiology editor at BMJ, observed that although meta-analyses historically have been done using summary-level data, the number of meta-analyses of individual participant data has been growing substantially. Furthermore, meta-analyses of individual patient data are typically better able to detect treatment effects that differ across subgroups than are meta-analyses of aggregate data (Riley et al., 2010). These subgroup effects are frequently of great interest to clinical investigators. As Loder said, drawing from the title of an essay by Stephen Jay Gould, “the median is not the message.”
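The advantage Loder describes can be seen in a toy example. In the sketch below, two hypothetical trials enroll different mixes of two subgroups; the numbers are invented so that the treatment helps subgroup “A” and not subgroup “B.” Trial-level summaries blur this into two seemingly conflicting overall effects, while pooling the participant-level records and splitting by subgroup recovers the interaction.

```python
from statistics import mean

# Hypothetical participant-level records: (trial, subgroup, treated, outcome).
# Invented numbers: treatment raises the outcome by 10 points in subgroup "A"
# and not at all in subgroup "B"; the trials enroll different subgroup mixes.
records = []
for trial, n_a, n_b in [("T1", 30, 10), ("T2", 10, 30)]:
    for subgroup, n, effect in [("A", n_a, 10.0), ("B", n_b, 0.0)]:
        for _ in range(n):
            records.append((trial, subgroup, 0, 50.0))           # control arm
            records.append((trial, subgroup, 1, 50.0 + effect))  # treated arm

def effect_estimate(rows):
    """Mean treated outcome minus mean control outcome."""
    treated = [o for _, _, t, o in rows if t == 1]
    control = [o for _, _, t, o in rows if t == 0]
    return mean(treated) - mean(control)

# Summary-level view: one overall effect per trial. The trials appear to
# disagree (7.5 vs. 2.5) only because their subgroup mixes differ.
per_trial = {t: effect_estimate([r for r in records if r[0] == t])
             for t in ("T1", "T2")}

# Participant-level view: pool the records, then split by subgroup.
# The interaction hidden in the summaries becomes visible.
by_subgroup = {s: effect_estimate([r for r in records if r[1] == s])
               for s in ("A", "B")}

print(per_trial)    # {'T1': 7.5, 'T2': 2.5}
print(by_subgroup)  # {'A': 10.0, 'B': 0.0}
```

In real meta-analyses the same contrast is drawn with regression models that include treatment-by-subgroup interaction terms, but the underlying point is the same: the interaction is estimable only when participant-level data are available.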
The arguments in favor of sharing can be divided into two broad and overlapping categories, Loder explained. The first category consists of moral and ethical arguments. These arguments point to the necessity of fulfilling obligations to research participants, minimizing known risks
and potential harm from unnecessary exposure to previously tested interventions, and honoring the nature of medical research as a public good. Patients participate in clinical trials based at least in part on the understanding that their data may benefit others, and these benefits are more likely to occur if the data are widely available. Also, unpublished information might in some cases prevent the occurrence of adverse events (Chalmers, 2006). Data sharing may take different forms, from simply publishing the results of research to publicly sharing detailed patient-level datasets. Finally, taxpayers provide a large amount of money to support publicly funded research and expect to have access to the benefits of that research.
The second category consists of practical and scientific arguments. These include detecting and deterring selective or inaccurate reporting of research; enabling the replication of results and potential resolution of apparently conflicting results; informing risk/benefit analyses for treatment options; facilitating application of previously generated data to new study questions; accelerating research; enhancing collaboration; and building trust in the clinical research enterprise. Rob Califf, director of the Duke Translational Medicine Institute, professor of medicine, and vice chancellor for clinical and translational research at Duke University Medical Center, who also spoke during the first session, pointed to the need to resolve results that appear conflicting. Clinicians cannot interpret conflicting clinical trial data simply by looking at the data in the abstract, without some form of expert synthesis. Only through replication can one sort out whether conflicting results are due to chance or true differences.
Califf went on to describe a “cycle of quality” that can generate evidence to inform patient care (see Figure 2-1). Clinical trials generate knowledge, which is then applied in clinical practice. The measurement of patient outcomes then leads both to clinical practice guidelines that define standard of care and to further clinical trials. At the core of the cycle is measurement and education, which in turn depend on access to data. Box 2-2 describes how this paradigm of cumulatively building and sharing datasets has worked to reduce deaths due to heart attacks by 40 percent.
As an example of the kinds of advances that may be possible, Loder cited the case of a high school student who won $75,000 at the Intel International Science and Engineering Fair. The student cited searchable databases and free online science papers as the tools that allowed him to create his prize-winning entry. “How many collaborators are out there,
who we cannot even imagine at this point, who might make use of the data?” said Loder.
BOX 2-2

The risk of death after a heart attack is now 40 percent lower than it was before the development of medical therapies designed to reduce such deaths (Krumholz et al., 2009), and the development of these therapies relied extensively on clinical trials, said Rob Califf, director of the Duke Translational Medicine Institute, professor of medicine, and vice chancellor for clinical and translational research at Duke University Medical Center. As an example, he pointed to the Antithrombotic Trialists’ Collaboration (2002), which involved 135,000 patients and 287 randomized controlled trials. This study provided compelling evidence that the use of aspirin can reduce deaths from heart attacks. Replication of results from multiple trials has also demonstrated the benefits of fibrinolytics, beta blockers, angiotensin-converting enzyme inhibitors, and other treatments. These studies also showed that particular therapies were more or less useful in different groups of patients and at different times following presentation of symptoms, providing information that then shaped clinical practice guidelines.

Another example involves the effects of statins. By pooling data from multiple trials, it has been possible to show that statins confer benefits regardless of their effects on cholesterol levels (Baigent et al., 2005). In contrast, when data were not released and combined regarding the use of erythropoietin in renal patients who are anemic, the harmful effects of high-dose erythropoietin were overlooked (McCullough and Lepor, 2005). “This could have been detected much earlier if the right trials had been done and the data had been combined,” Califf asserted.

Loder also called attention to the need to build trust in the clinical research enterprise. This trust is at “an all-time low,” she said, which is causing a crisis in recruitment for clinical trials (Williams et al., 2008). The lack of trust extends even to physicians, who tend to discount studies of superior methodological rigor when they perceive that the studies have been funded by industry (Kesselheim et al., 2012). “If doctors do not believe the evidence, what hope is there for evidence-based medicine?” Loder asked.
Sharing data may generate problems that cannot be anticipated today, but it will also generate unanticipated benefits. “We are engaged in one of the great struggles of human knowledge—the struggle to liberate clinical trial information and make sure it is put to its best and highest use now and in the future,” Loder concluded. “It is a thrill to be part of this historic meeting.”
Commitment to Open Science
Every day, many people face difficult questions about health care, observed Harlan Krumholz, Harold H. Hines, Jr., Professor of Medicine at the Yale University School of Medicine. They need all of the information that is relevant to the options they are considering. If data are
missing, their ability to make informed decisions will be impaired. This is the central argument in favor of open science, Krumholz said.
Krumholz’s experience has been that whenever data are shared, whether voluntarily or not, new and important things are learned. In particular, the release of participant-level data has generated vital new information about the risks and benefits of drugs and devices. In some cases, access to this information leads to conclusions that contrast with the prevailing knowledge and changes the use of a drug or device. In other cases, it provides “nuance and understanding.” For instance, Krumholz described a study (also described by Loder) which found that unreleased data are about as likely to strengthen evidence for the use of a product as to weaken such evidence (Hart et al., 2012). “What is important is that we support the idea that data are a social good and the best science takes place in the light,” he said.
Krumholz shared his vision of a future where data sharing is widely accepted as being in everyone’s best interest and will be the cultural norm. “Data sharing [will be] an essential characteristic of being a good scientist and a good citizen,” he said. With the full release of data, companies would compete on the basis of science, not marketing. Academic researchers could get credit not only for the papers they publish, but for the knowledge generated from the databases they create.
Industry has the opportunity to demonstrate leadership, restore trust, and reclaim its position of integrity through meaningful actions to share data, Krumholz continued. “You have a meaningful motivation,” he said. “The [medical] profession has less trust in your science than in [National Institutes of Health]-sponsored studies and is less likely to act on the results of the trials you sponsored, not just the ones you conduct. The pharmaceutical and device industries no longer have the respect they once held…. The result is a situation that does a disservice to the public, the medical profession, and the vast majority of professionals in industry who have extraordinarily high integrity and are in that industry for the right reasons.”
Krumholz noted that an important cultural shift is already taking place. Some industry leaders have already taken steps to support data sharing and have contributed to major scientific advances as a result. For example, Medtronic’s decision to release the company’s data on a product that has nearly a billion dollars in annual sales was a powerful statement that the company was seeking the truth. The individuals who have made these decisions “realize that studies are only possible due to the generosity of people who consented to participate, and that we have an
obligation to ensure that the efforts of those subjects contribute as much as possible to knowledge generation.” Such transparency will also be essential to ensure the continuing flow of individuals who are willing to participate in trials, Krumholz added.
In return for the privilege of selling a medical product to the public, industry bears a responsibility to ensure that all the data concerning the risks and benefits are available to everyone, said Krumholz. The current challenge is not to decide whether data should be released, but how to do so while being attentive to the needs and concerns of all stakeholders. In addition, the publication of summary results is not enough, according to Krumholz. Rather, individual patient-level data need to be broadly and freely available for investigators. “We need the protocols and case report forms. We need full sharing of the source data…. With the talent in this room, and with those listening on the webinar and those who are interested, I know solutions can be found. If we are committed to the path, we can figure out how to do it.”
Jesse Berlin, vice president of epidemiology for Janssen Research & Development, LLC, provided a countervailing view by asking whether participant-level data are always needed. Complications can arise when the data are reexamined, he said. Decisions may have been made during a clinical trial that cannot be replicated. Published studies may not always incorporate the appropriate intent-to-treat analysis. Endpoints may be defined differently in different trials. Study designs, patient populations, and treatments can vary from trial to trial. As a result of these and other potential problems, such analyses can go “seriously wrong,” Berlin warned. “It is not just a matter of feeling more comfortable having the individual-level data. You can actually get wrong answers.”
Although there is a common belief that participant-level data can enable verification and reproduction of trial results, that premise depends on the trustworthiness of the shared data, warned Peter Doshi. Even participant-level data can lead investigators astray. For example, a computerized database of participant-level data may not reflect what is actually recorded on a case report form. In some cases, it may be necessary to look beyond what people typically consider data (i.e., numbers) into more narrative forms of documentation, depending on the intended use of the shared data.