In this session, panelists explored opportunities and potential solutions for overcoming the technical and sustainability challenges discussed throughout the workshop. Mark Helfand, professor of medicine, medical informatics, and clinical epidemiology at Oregon Health & Science University, shared his perspective as a user of shared data for research synthesis. Monica Bertagnolli, professor of surgery at Harvard Medical School, provided the perspective of an institution that generates and shares data. Rebecca Kush, chief scientific officer at Elligo Health Research, discussed next steps from a platform perspective. The panel was moderated by Ida Sim, professor of medicine at the University of California, San Francisco, School of Medicine.
Sim provided a brief review of the many components of data sharing that were discussed thus far in the workshop relative to the components of the data-sharing environment (see Figure 8-1).
The FAIR principles1 state that data should be findable, accessible, interoperable, and reusable. One suggestion to address barriers to findability and accessibility, Sim said, was to create a single portal to search for individual participant data across platforms (ClinicalTrials.gov and the World Health Organization International Clinical Trials Registry Platform were both mentioned as models).
Some of the barriers to data reusability have been addressed by the use of data-sharing platforms, which Sim said incorporate technical infrastructure, governance, and policies. Suggestions discussed that would further aid reusability included standards for both data and metadata (the Clinical Data Interchange Standards Consortium [CDISC] was mentioned as an example), and the need for an infrastructure to link an individual’s ORCID iD2 to the digital object identifiers (DOIs) for their contributions.
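The kind of contributor-to-dataset linkage described here can be sketched in a few lines. The following Python snippet builds a minimal, hypothetical metadata record associating a researcher’s ORCID iD with the DOIs of datasets they contributed; the field names and DOIs are illustrative assumptions (loosely modeled on common metadata conventions), not an infrastructure proposed at the workshop, and the ORCID iD shown is the well-known example identifier from ORCID’s own documentation.

```python
# Illustrative sketch: link a contributor's ORCID iD to the DOIs of
# datasets they shared. All field names and DOIs below are hypothetical.

def make_contribution_record(orcid, dataset_dois):
    """Return a simple metadata record for a data contributor.

    orcid        -- the contributor's ORCID iD (a 16-digit identifier)
    dataset_dois -- list of DOIs for datasets the contributor shared
    """
    return {
        "contributor": {"orcid": orcid},
        "contributions": [
            {"doi": doi, "role": "data contributor"} for doi in dataset_dois
        ],
    }

record = make_contribution_record(
    "0000-0002-1825-0097",  # ORCID's published example identifier
    ["10.1234/example.trial.001", "10.1234/example.trial.002"],
)
print(len(record["contributions"]))  # 2
```

A registry built on records like this would let a search by ORCID iD surface every shared dataset a researcher contributed, which is the crediting function the discussion pointed toward.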
With regard to governance, it was discussed that the patient voice needs to be at the center of the process. Engaging research participants in decisions about how their data will be shared and used fosters trust in the biomedical research enterprise, Sim said. Maintaining patient trust is essential because without participants, there can be no clinical trials.
Workshop participants also discussed the need for common data request functionality across platforms, and there were suggestions to combine the platforms. In this regard, Sim pointed out that Vivli bridges multiple platforms, and users can request datasets from other platforms (e.g., Johnson & Johnson, the National Heart, Lung, and Blood Institute’s Biologic Specimen and Data Repository Information Coordinating Center,3 and Project Data Sphere), which can be collected into a secure environment in Vivli for analysis.
With regard to workforce, discussion took place about training, incentives, and ensuring that those who generate and share data receive credit. Developing a career path for data stewards was also suggested.
Data sharing generates new findings and discoveries. One point of discussion, Sim said, was that these are not only clinical but also methodologic, and that reuse of data benefits patients and contributes to the scientific knowledge base. The topic of metrics was discussed throughout the workshop, and it was emphasized that there is a need for metrics beyond the number of publications.
These findings and discoveries are not the endpoint, Sim continued, but they feed back and inform the next round of primary research (e.g., trial design, methodology). She noted that improving the process of science is a key benefit of data sharing, and metrics are needed to better understand
1 See footnote 3 on p. 64.
2 See footnote 3 on p. 12.
3 See p. 68 for more information.
the value added. Another point of discussion was that, when data sharing is expected, trials are designed up front with data sharing in mind. One suggestion was to institute auditing of a subset of trials, with the premise being that the risk of audit could inspire uniform data husbandry.
Mark Helfand, Professor of Medicine, Medical Informatics, and Clinical Epidemiology, Oregon Health & Science University
Helfand focused his remarks on overcoming data-sharing challenges as they relate to research synthesis—specifically, systematic review and meta-analysis. Technical issues raised by speakers throughout the workshop included the availability of individual participant data (IPD) and the lack of any centralized, searchable catalog of IPD; insufficient data standardization; and the inability to merge data. In addition, Helfand said that “users can’t learn everything,” elaborating that, for researchers whose primary background is in systematic review and meta-analysis, navigating original clinical trial data can be challenging.
The goal of clinical research synthesis is to “maximize the relevance and utility of clinical research to clinical practice,” Helfand explained. Clinicians have difficulty applying clinical study findings in journal articles to practice. Reuse of data for research synthesis can pool results from different studies of an intervention, evaluate an outcome across interventions, and, importantly, assess how the design, conduct, and analysis of studies affect the results. Helfand elaborated on several areas for attention from his perspective as a data user.
Helfand observed that there is mistrust of secondary data users, which he suggested stems from the culture of clinical trial science (not from concerns about the data users themselves). In essence, trialists are being asked to trust a data analyst who is not associated with the study to evaluate their work (product, idea, surgical procedure, etc.). He said that activism by patients and clinicians is needed to change this culture. Patients who participate in clinical trials and clinicians who need to apply trial findings to practice should demand that researchers make the best use of participant data.
The depth of metadata needed for systematic review and meta-analysis can be different from that needed for other uses, Helfand said.
Systematic reviewers can find themselves working with trialists to refer back to the raw information from ascertainment, case report forms, and clinical notes to find the necessary metadata. This is generally beyond what platforms can address at this time, Helfand said.
Helfand also raised the issue of using shared data for replication of studies, voicing his disappointment in platforms’ experiences so far; some platforms reported receiving no data requests expressing an intent to replicate studies, and others reported very few. He speculated that the platform process might not be ideal for that type of “forensic” assessment of trials, or that perhaps platforms might be filtering out users interested in replication studies.
Data Sharing and Analysis
Helfand agreed with comments by several other speakers that “systematic reviewers, meta-analysts, and other reusers should have trial experience.” He elaborated that trialists use practical and subjective judgment in interpreting data, which data analysts with minimal trial experience tend to label as bias. Analysts without trial experience also tend to rediscover things that are generally already known by trialists. He recommended that the training pathway for data analysts include trial experience.
Helfand also agreed with the suggestions made regarding prioritizing which clinical trial data to share. He continued that the data that clinicians and patients need to use should be prioritized. He added that data reuse for validation and answering questions that help clinicians and patients make informed decisions should be rewarded.
Monica Bertagnolli, Professor of Surgery, Harvard Medical School
Bertagnolli spoke from her perspective as the chair of the Alliance for Clinical Trials in Oncology.4 A successful clinical trial, she said, “answers important questions, achieves timely accrual of appropriately representative patients, addresses disparities, generates high-quality data, and creates resources for future use,” which include data and biospecimens.
Alliance has long been committed to data sharing as a fundamental goal, Bertagnolli said. Alliance has been a member of the National Cancer Institute’s (NCI’s) National Clinical Trials Network (NCTN) since its establishment in 1955, and was involved in the first multi-institutional
4 See https://www.allianceforclinicaltrialsinoncology.org/main (accessed February 10, 2020).
cancer clinical trial, called “Protocol #1,” Bertagnolli said.5 She added that the protocol was 17 pages long, and four carbon copies were made. In 2007, Alliance began a major commitment to data sharing when the Adjuvant Colon Cancer End Points (ACCENT) Group was established. Researchers in the ACCENT Group agreed that data from their individual trials would be shared and pooled. For perspective on data volume, she said that, in 2010, Alliance had 61 actively enrolling clinical trials when the Alliance Statistics and Data Center was relocated from Duke University to the Mayo Clinic. Although electronic data capture was being used, there were seven semi-truck loads of paper data spanning 1955 to 2010 that had to be moved. The NCTN adopted a network-wide electronic data management system in 2011 for use by all NCTN groups, which Bertagnolli described as “a huge advance.” In 2015, the NCI launched its own platform, the NCTN/NCI Community Oncology Research Program (NCORP) Data Archive.
Alliance Contributions to Data Sharing
Alliance currently contributes data to the NCTN/NCORP Data Archive (all studies published on or after January 1, 2015). Studies that were completed before the NCTN/NCORP launch were contributed first to Project Data Sphere and later to Vivli. Biospecimens linked to Alliance clinical trials are contributed to NCTN Navigator, and genomic data are contributed to the database of Genotypes and Phenotypes. All other data requests are handled by Alliance, and Bertagnolli noted that Alliance receives many requests for legacy data.
Bertagnolli summarized Alliance’s contributions to data sharing for 2015 through late 2019 as follows:
- NCTN Data Archive: 38 publication datasets from 33 unique trials.
- Project Data Sphere: 22 publication datasets of legacy data from 20 unique trials.
- NCTN Navigator: 40 unique trials covering more than 23,000 patients and 355,000 specimens, 15 requests completed (providing specimens with links to clinical data).
- National Center for Biotechnology Information database of Genotypes and Phenotypes: 5 institutional review board–approved data transfers for 4 unique trials.
5 Alliance in its current form is the merger of the American College of Surgeons Oncology Group, the Cancer and Leukemia Group B, and the North Central Cancer Treatment Group. NCTN was formerly known as the NCI Clinical Trials Cooperative Group Program.
- Alliance (legacy data): 143 datasets shared with 100 external collaborators and 43 Alliance member groups.
Bertagnolli added that from January through October 2019, Alliance received 28 requests for legacy data (15 have been fulfilled thus far). Requests received are reviewed by the Alliance Statistics and Data Management Center (SDMC), which she explained is also responsible for new study design, data quality monitoring, results analysis and manuscript preparation, and Data and Safety Monitoring Board reports and funding agency reports for Alliance’s clinical research program. “A research group will always … saturate the capabilities of its statistics and data center,” Bertagnolli said. The SDMC must prioritize, she said, and processing secondary data use requests can take lower priority in an active clinical research program.
Motivations for Sharing
Alliance is motivated to share data for a variety of reasons. Data sharing is, and always has been, a core value of Alliance. Bertagnolli said that Alliance has a “responsibility to clinical trials participants to share data in ways that bring the most benefit to society.” Data sharing is also frequently a condition of funding awards, she said, and awards can include support for data-sharing activities. Other motivations are receiving recognition as the source of data in publications, and opportunities for co-authorship on papers when Alliance participates in the study.
Elements of Success
According to Bertagnolli, Alliance data sharing is enabled by three key elements:
- Planning. Data sharing is planned from the beginning and is specified in contracts and agreements as well as participant consent forms. The intent is to share the data widely to reap the most benefit.
- Data standards. The merging of the cooperative groups led to the use of one standard system, which she described as a major improvement. Alliance is also working toward expanding the use of standard data formats to cancer control and electronic health record (EHR)-based data.
- Resources. One valuable resource is the availability of data-sharing platforms for use by Alliance. Another important resource is the Alliance Data Sharing Working Group, which comprises Alliance volunteers and staff and is funded by the Alliance Foundation. Bertagnolli acknowledged three Alliance Data Sharing Champions: Selina Chow of the University of Chicago, who chairs the data-sharing working group; Sumithra Mandrekar of the Mayo Clinic, who is the group statistician; and Mark Watson of Washington University, who is the Alliance biorepository director. She also acknowledged the many researchers at Alliance member institutions who make oncology clinical research a priority.
Rebecca Kush, Chief Scientific Officer, Elligo Health Research
Kush discussed data usability challenges from a platform perspective. She shared a Venn diagram, originally created in 1997, that she said represented an approach for automating the extraction of clinical research case report form data from electronic health records (EHRs) in order to optimize clinical research. Standards were clearly needed and, she said, they needed to be independent of any specific platform or technology. At that time, initial standards had been developed by HL76 and CDISC. In 2004, the NCI, the U.S. Food and Drug Administration (FDA), HL7, and CDISC came together to develop the Biomedical Research Integrated Domain Group (BRIDG) model to bridge clinical research and health care, enhancing data interoperability.7 Kush said that the BRIDG model is now a global standard, which has been approved through HL7, CDISC, and the International Organization for Standardization. In 2010, through the Healthcare Information Technology Standards Panel/American National Standards Institute, an interoperability specification (IS #158) was developed for using EHRs for research, identifying standards from HL7 and CDISC and integration profiles from Integrating the Healthcare Enterprise to support this use case. In 2016, FDA and the Japanese Pharmaceuticals and Medical Devices Agency began requiring that data in regulatory filings be submitted in a CDISC standard format, which Kush noted other countries have now endorsed.8 Most recently, HL7 has released its Fast Healthcare Interoperability Resources (FHIR) standard, and the National Institutes of Health is evaluating how this standard might be used by its grantees. Kush referred participants to several recent publications that support the value of using standards for data sharing (see Kush, 2014; Kush and Goldman, 2014; Ohmann et al., 2017; Varnai et al., 2014).
8 See https://www.fda.gov/industry/fda-resources-data-standards/study-data-standards-resources (accessed February 10, 2020).
Common Data Model Harmonization
Four organizations or networks (Observational Health Data Sciences and Informatics, Informatics for Integrating Biology with the Bedside, the FDA Sentinel Initiative, and the National Patient-Centered Clinical Research Network [PCORnet]) currently use four different data models, Kush said, which increases the burden on researchers who share data in order to participate in multiple research networks. In addition, she said there has been a proliferation of common data elements, some of which are not necessarily standards and can potentially leave gaps in databases, including inadequate metadata. To address the challenge of low interoperability among networks, FDA has worked on a project with other federal agencies, funded by the Patient-Centered Outcomes Research Institute trust fund, to harmonize these four common data models. Because the BRIDG model is broad and robust and is a global standard, it was decided to map the different common data models to the BRIDG model. Terminologies were also mapped and harmonized across these data models.
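The mapping step described here can be illustrated abstractly. In the sketch below, records expressed in two simplified source data models are renamed into one shared target representation; the field names are invented for illustration and do not reflect the actual BRIDG mappings or the real schemas of the four networks.

```python
# Illustrative harmonization: map records from two simplified source data
# models into one shared target representation. Field names are
# hypothetical and do not reflect the actual BRIDG mappings.

# Per-model mapping: source field name -> harmonized field name
FIELD_MAPS = {
    "model_a": {"person_id": "subject_id", "gender_concept": "sex"},
    "model_b": {"PATIENT_NUM": "subject_id", "SEX_CD": "sex"},
}

def harmonize(record, model):
    """Rename a record's fields per the mapping for its source model."""
    mapping = FIELD_MAPS[model]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

a = harmonize({"person_id": 101, "gender_concept": "F"}, "model_a")
b = harmonize({"PATIENT_NUM": 202, "SEX_CD": "F"}, "model_b")

# Both records now share one schema, so a query written once against the
# harmonized representation can run over data from either source model.
assert set(a) == set(b) == {"subject_id", "sex"}
```

In practice a reference model like BRIDG also requires harmonizing terminologies (the coded values, not just the field names), which is why the project mapped both, as noted above.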
Elligo Health Research was contracted to identify data partners who would run queries for a pharmacovigilance use case, thereby testing whether the products of the common data model harmonization initiative make it easier for organizations using the four different data models to participate in research and data sharing. Kush reported that, interestingly, the barriers encountered during the use testing were not so much technical as legal, cultural, and political. A second phase of the project will leverage HL7 FHIR.
In closing, Kush summarized some of the lessons learned and key messages regarding data usability challenges:
- “The technical ability to share meaningful data should be platform-independent, and data standards enable this,” Kush said.
- Standards can create efficiencies in research and in data sharing, she added, and should be addressed up front in the planning stages of studies.
- To be most effective, Kush continued, standards and data models need to be broadly adopted and should include metadata.
- “Researchers should leverage existing robust/global standards and data models before creating new ones,” she said.
- Technical barriers to data sharing can be overcome with standards and harmonized common data models; however, cultural, political, and legal barriers can persist.
What Data to Share
Participants continued the discussion of what data should be shared. Amy Nurnberger advocated that data from all trials should be shared because the expectation that data will be shared promotes good data management. Also, as discussed, sharing trial data “honors the time and the investment of the participant.” Sharing all trial data would also help to reduce duplication and increase efficiency. Helfand said that “all trials” would need to be defined. For example, is there a need to share data from very small trials with no clinical outcomes, or from Phase 1 pharmacokinetic trials? His perspective was that the definition needs to be broader than just Phase 3 trials and beyond; he said, for example, that dose-finding studies would be relevant to share. He explained that his intention is not to restrict sharing but to “set a target for starting, get the data processes right … and eventually expand it to all trials.”

Bertagnolli said that the oncology clinical trials groups have continuously funded and supported data centers that conduct many trials, and that all of the publication datasets likely are shared via platforms. However, not all data collected are included in the publication dataset, and groups such as Alliance are needed to facilitate sharing of other data, such as case report forms that include all of the captured data elements.

Sim raised the issue of who should make the decisions about the scope of trials to share. Helfand said that as platforms continue to evolve, they will need continued input. The pressing decisions, he said, are not about whether all data from every trial should be shared but, rather, what should be shared for a particular trial. He added, for example, that limiting sharing to publication datasets would not fulfill his analysis interests. Bertagnolli reiterated the value of sharing case report forms.
Frank Rockhold recalled the discussions during the development of ClinicalStudyDataRequest.com (CSDR) regarding which trials to include and said that “the easy decision is all.” Although the concept of prioritizing Phase 3 trials is logical, it can be difficult to define what that would include. For example, he said, some Phase 2 trials enroll 1,000 participants. Furthermore, it seems clear that a 15,000-participant outcomes-based cardiovascular disease trial of public health importance should be shared, he added, while there is likely to be little interest by other researchers in a food-effect study of yet another beta blocker. Yet, he said, there would likely be a lot of overlap between Phase 2 and Phase 3 studies. He reiterated his philosophy that trials should be conducted with the expectation that data might be shared. The expectation of future data sharing drives better data husbandry and reproducibility.
Sim agreed that all studies should practice good data husbandry with the expectation of sharing, but added that taking this approach does not mean that all trials need to fully prepare their data for sharing that might not occur. Some activities are time and resource intensive and could be done if and when data are requested (e.g., anonymization). Sim suggested that a useful task might be to define what dataset preparation should be done “in anticipation of a potential request.”
Tim Feeney raised the issue of sharing analysis and code along with trial results. He speculated that some requests for data might be simply to check the accuracy of the analysis. In this regard, he noted that there is now a trend toward posting code on GitHub so that others can review the code for errors. Kush mentioned the Mobilizing Computable Biomedical Knowledge initiative,9 as well as a call for papers describing “computable knowledge” by the journal Learning Health Systems.10
Replication of Studies
Following up on Helfand’s remarks on replication studies, Joseph Ross clarified that there has been only one study using data from the Yale University Open Data Access (YODA) Project for the specific purpose of replicating a particular trial, and there were some discrepancies between the primary report and the secondary analysis. He added that the secondary data users worked with the clinical team regarding issues of replication, which he said provided a better understanding of the metadata needed to replicate the analyses (Gay et al., 2017).
Ross said that replication is one purpose of data-sharing platforms, and he recalled the inability to replicate studies associated with the Merck Vioxx trials due to, he said, changes in the case report forms. He described a current analysis of the impact of data sharing on assessing the cardiovascular risk associated with the diabetes drug Avandia. Using data shared in CSDR, he identified a signal of cardiovascular risk that was not apparent from analyzing the summary data in the clinical study reports. He emphasized that the main purpose of a data-sharing platform should be to improve the overall understanding of the products tested in the trials and to enable other avenues of research (i.e., not detecting misconduct or other challenges).
Helfand noted that the publication of the Merck Vioxx Gastrointestinal Outcomes Research trial did mention cardiovascular effects in the discussion, and FDA was aware of the concerns from the data in the regulatory submission. In writing a report for Medicaid, Helfand used the statistical and clinical summaries available on the FDA website, which he said included detailed analysis of the cardiovascular effects. He suggested that the regulatory role in data analysis is being overlooked, and added that he questioned the justification for replication analyses “validating what FDA has already discovered.” He observed that promotional materials and journal articles often do not fully reflect the data that are already in the public domain.
Rockhold supported the need for reproducibility and replicability of studies but cautioned that the failure of another group to replicate an original study does not necessarily mean that the replicators are correct and the original research is flawed. Helfand agreed and added that there can also be different interpretations of data. As was observed in the Systolic Blood Pressure Intervention Trial, for example, many opinions were expressed about the relevance of the findings to clinical practice. The medical establishment is familiar with disagreement, and he cited the American College of Physicians and American College of Cardiology guidelines, which have interpreted studies differently. Helfand and Rockhold both said that different findings from a reanalysis are an opportunity to learn and do not necessarily mean that one analysis or the other is wrong.
Engagement of Original Researchers
Matthew Sydes asked Bertagnolli to elaborate on the ways in which data sharing can lead to co-authorship on publications. For example, to what extent might collaboration with original researchers happen when data are shared via a repository versus a trials group such as Alliance? “When is it important to seek the engagement of the original researchers?” he asked, and what role do independent review panels (IRPs) play in facilitating or hindering collaboration? Bertagnolli said the study statistician is the original study team member most frequently engaged as a collaborator for the data requests received by Alliance. Data requests are reviewed by the Alliance working group, and also by the original study statistician, who can best address questions about the feasibility of the proposal. The statistician is also usually engaged in the secondary analysis. She added that she was not aware of a collaboration that had resulted from the sharing of a legacy trial through one of the platforms, but she acknowledged that Alliance does not track data that have been contributed to the platforms. Sydes asked about when a trial should be contributed to a data-sharing platform (versus sharing via a trials group website), and whether the original trial team should continue to be engaged in review of data access requests submitted to platforms.
He noted concern that a platform IRP cannot be expected to have the same depth of expertise in any given study as the original research team would. Bertagnolli clarified that all of Alliance’s publication datasets are submitted to a platform, unless prohibited by legacy third party agreements. Alliance researchers engage if asked, she said, but all requests for access to publication datasets are handled by the platform.
Sydes asked whether there is information about the extent to which contact with original study teams is made through a platform versus directly. Sim said that Vivli data requesters know who the data contributor is, and data contributors are informed that their data have been requested and by whom, but Vivli leaves it to the parties to contact each other if desired. Platforms cannot require collaboration, she said, but they could potentially take measures to make the process easier. She noted that Vivli does not yet formally track collaborations but said that this question was an interesting one. Helfand shared that, for one of the early YODA Project studies, Yale served as an intermediary among groups; he added that he did not know the extent to which this intermediary role happens now. Helfand also noted that there are other potential intermediaries, such as funders; he also raised a concern that original investigators might seek to have more control than perhaps they should (e.g., over what is emphasized, how results are disseminated). These issues are difficult to address in a data use agreement (DUA), he said, and he emphasized the need for intermediaries but added that there is not clear guidance available on negotiating such collaborations.
Bertagnolli reiterated that the vast workload of a statistics and data center will persist, as will resource constraints. She agreed with others that the solution is to plan for sharing from the outset so that there is no need to go back and prepare data for sharing at the end. Sim asked if that planning takes into account the statisticians’ time to work with data requesters (e.g., salary support). Bertagnolli responded that Alliance does fund the working group that reviews the requests. Jeffrey M. Drazen asked about handling the increased workload when a secondary analysis identifies an inconsistency or error in a shared dataset. Bertagnolli said they have not had this experience yet. Sherman shared that his group has experienced this, and they “dropped everything” to address it. They also then announced the issue and resolution so others who might have used those data could reevaluate.
Common Data Models
Ross said that aligning with common data models will be especially important as clinical trials increasingly rely on data from health systems.
Joanne Waldstreicher mentioned that companies are working together to develop master protocols, which she said ensures that sponsors are collecting the same endpoints and outcomes, and are using the same definitions and other parameters. Sponsors are also developing platform trials in several disease areas, which she said helps to reduce the number of participants needed in control groups and allows for “more meaningful comparisons on both efficacy and safety across different treatments.”
Choice of Data Standards
Alex Sherman observed that shared trial data will likely be in CDISC format as that is the format required by FDA for regulatory submission, but said that CDISC is not intuitive to many researchers. He asked Kush if other formats might be better for sharing. Kush responded that “there is no right or wrong when it comes to standards.” It is a matter of building consensus and all parties coming to agreement about which standards to use, then adopting these broadly and using them as the foundation for extensions such as therapeutic area–specific standards.
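For readers unfamiliar with why CDISC format might feel unintuitive, the sketch below constructs two rows of an SDTM-style demographics (DM) domain. The variable names (STUDYID, DOMAIN, USUBJID, SEX, AGE, AGEU) are standard SDTM demographics variables, but the values are invented, and a real SDTM submission carries many more required variables than shown here.

```python
# A two-row sketch of a CDISC SDTM demographics (DM) domain. The variable
# names are standard SDTM variables; the study and subject values are
# invented for illustration, and real DM datasets have many more columns.

rows = [
    {"STUDYID": "XYZ-001", "DOMAIN": "DM", "USUBJID": "XYZ-001-0001",
     "SEX": "F", "AGE": 54, "AGEU": "YEARS"},
    {"STUDYID": "XYZ-001", "DOMAIN": "DM", "USUBJID": "XYZ-001-0002",
     "SEX": "M", "AGE": 61, "AGEU": "YEARS"},
]

# Every record in a domain carries the same two-letter DOMAIN code, and
# USUBJID uniquely identifies a subject across the whole study -- the
# terse, coded style that researchers outside regulatory submission work
# can find unintuitive, but that makes datasets predictable to merge.
assert all(r["DOMAIN"] == "DM" for r in rows)
assert len({r["USUBJID"] for r in rows}) == len(rows)
```

The terseness is the trade-off Kush’s answer points at: the coded variable names are opaque to newcomers, but because every CDISC-formatted study uses the same ones, tools and analyses written against one study carry over to the next.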