National Academies Press: OpenBook

Applying Systems Thinking to Regenerative Medicine: Proceedings of a Workshop (2021)

Chapter: 4 Challenges Associated with Data Collection, Aggregation, and Sharing

Suggested Citation:"4 Challenges Associated with Data Collection, Aggregation, and Sharing." National Academies of Sciences, Engineering, and Medicine. 2021. Applying Systems Thinking to Regenerative Medicine: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/26025.

4

Challenges Associated with Data Collection, Aggregation, and Sharing


The third session of the workshop, moderated by Sadik Kassim of Vor Biopharma, explored challenges associated with data collection, aggregation, and sharing. The session featured presentations on how open science can be applied in omics and disease modeling and also on the use of big data in clinically stratifying patients. This session’s objectives were to discuss how big data can be used to identify which patients will respond best to a particular regenerative medicine and to highlight challenges in data collection and data sharing such as small sample sizes in clinical trials, proprietary issues, and patient privacy.

TOWARD OPEN SCIENCE IN OMICS ANALYSIS AND DISEASE MODELING

Larsson Omberg, the vice president of systems biology at Sage Bionetworks, spoke about the opportunities and challenges involved in using open science approaches in omics analyses and disease modeling. Sage Bionetworks helps research communities develop reliable outcomes to advance the understanding of human health by harnessing the power of open and collaborative science. Omberg added, however, that open science should be supported with appropriate incentives and structures to catalyze innovation.

Shift Toward Team-Based Production of Knowledge in Biomedical Research

In recent years teams have come to increasingly dominate the production of knowledge in biomedical research (Wuchty et al., 2007), although the biological sciences have lagged behind other hard sciences, such as physics and astronomy, in making this shift; in biomedical research it emerged in the 2000s. The shift has been driven largely by the increasing costs associated with cutting-edge research, Omberg said. In the field of genomics, the transition was also spurred by the large costs associated with data production. Historically, those costs were related to the expensive technologies that were required, but now the costs result primarily from the large sample sizes needed to conduct research. Furthermore, researchers have found that in the fields of science and technology, large teams tend to develop, while small teams are more likely to disrupt (Wu et al., 2019). Sage Bionetworks works mainly with large teams by helping and enabling large consortia, but there have been cases of disruption when individuals within teams have brought new ideas to bear, he added.


Moving Toward Open Science

Omberg said he prefers the term “open science” over “open data” because open data represent just one of many components of open science, all of which should be considered together. The volume of open data has increased substantially since 2004, as demonstrated by the growing rate of Internet searches for the term “open data” and related concepts over the past decade. Underlying concepts of fairness and accessibility are particularly strong drivers of many open research efforts. The U.S. government also has a strong interest in open data, he added. In 2014, for example, the National Institutes of Health (NIH) introduced a genomics data sharing policy1 that requires all data to be made open within 6 months after generation. Scientists are sometimes reluctant to use open methods because of a lack of incentives for collaborating and sharing data, Omberg said; however, this attitude is changing, particularly among younger researchers. Making science open is a good first step, he continued, but it is not enough: incentives must be adjusted appropriately (Chen et al., 2019). In physics research, for example, large numbers of astronomical datasets were made available openly, yet much of the data went unused. Incentive structures may help to ensure that scientists work with data that are made available, he said.

Consortium-Based Collaboration in Alzheimer’s Disease Research

Sage Bionetworks is involved in numerous collaborative projects and consortia, primarily in the fields of cancer research, neurodegeneration, and neuropsychiatric disease, Omberg said. As an example of collaborative science, he described the Accelerating Medicines Partnership Alzheimer’s Disease (AMP AD).2 This public–private consortium was established to bring together NIH, biopharmaceutical and life sciences companies, and nonprofit organizations to develop new diagnostics and treatments and address the lack of biological targets for Alzheimer’s disease (AD). Although AD has been cured in mice, not a single clinical trial has succeeded in curing the disease in humans despite investments of billions of dollars from both pharmaceutical companies and governments (Franco and Cedazo-Minguez, 2014). Research on AD has traditionally been focused on a limited set of targets, so the AMP AD was established to expand the number of AD targets by using systems biology approaches. Sage Bionetworks has taken on a coordinating role to facilitate numerous additional related efforts, including

___________________

1 For more information on the NIH genomics data sharing policy, see https://osp.od.nih.gov/scientific-sharing/genomic-data-sharing (accessed January 19, 2021).

2 More information about the Accelerating Medicines Partnership is available at https://fnih.org/what-we-do/programs/amp (accessed November 30, 2020).

3 More information about MODEL-AD is available at https://sagebionetworks.org/research-projects/model-ad (accessed December 22, 2020).

  • MODEL-AD,3 which is developing mouse models of AD;
  • Resilience-AD,4 which is studying resilience to AD;
  • M2OVE-AD,5 which studies vascularization;
  • Psych-AD,6 which studies the interaction between neuropsychiatric disease and AD; and
  • TREAT-AD,7 which is working to develop target-enabling packages to help with the preclinical development of drugs using targets identified through AMP AD.

At Sage, there is a data repository containing more than 17,000 biosamples and 15 genomic data types from 7,261 human donors collected from a range of sources, Omberg said. These open-access data are used to build algorithms (e.g., RNA-seq processing, proteomic analysis, single-cell RNA) to generate analytical results that can be used to identify new targets for AD.

Making genomic data useful across studies requires substantial collaborative work, Omberg emphasized. To illustrate, he described how RNA-seq analysis8 is conducted collaboratively within the AMP AD consortium. Initially, the consortium was working with highly heterogeneous data from postmortem brain samples collected using varying technologies and analytical approaches, which made it impossible to make direct comparisons. To homogenize these data, several groups within the consortium worked together to create a canonical dataset that could be used in downstream analysis to derive insights. This type of data pooling allows different groups to conduct their own analyses on the same dataset, he added. A working group was formed to evaluate the methods used by different teams for these analyses, as well as some pre-published methodologies, in order to begin consensus modeling across these methods using a multimethod co-expression network analysis and differential expression meta-analysis. A consensus was established by developing a common set of outputs generated by those methods. This was used to identify a canonical set of existing

___________________

4 More information about Resilience-AD is available at https://grants.nih.gov/grants/guide/rfa-files/RFA-AG-17-061.html (accessed December 22, 2020).

5 More information about M2OVE-AD is available at https://adknowledgeportal.synapse.org/Explore/Programs/DetailsPage?Program=M2OVE-AD (accessed December 22, 2020).

6 More information about Psych-AD is available at https://grants.nih.gov/grants/guide/rfa-files/RFA-MH-19-510.html (accessed December 22, 2020).

7 More information about TREAT-AD is available at https://sagebionetworks.org/research-projects/treat-ad (accessed December 22, 2020).

8 More information about RNA sequencing analysis is available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6096346 (accessed January 16, 2021).


networks; comparative module analysis was used to identify differences (Wan et al., 2020).

These analyses revealed a sizable degree of overlap between these groups’ work, Omberg said. A large subset of these consensus networks contained items that were found commonly across many methodologies, but did not correspond to known AD biology, thus highlighting the large universe of unknown or poorly understood AD biology. Described by Omberg as “the dark matter of AD biology,” these unknowns are of great interest in the context of identifying new possible targets and sources of disease. However, he said, much more research will be needed to operationalize this insight in a useful way (e.g., by using these newly identified targets in drug development). To that end, NIH’s National Institute on Aging has invested in TREAT-AD. The aim is to develop target-enabling packages for targets identified through the AMP AD consortium, to help identify which may be “druggable,” and to develop tools that can be used by those who work on drug development. This often involves expanding targets into a large set of possible targets based on “druggability,” then using associated targets through network models.
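The consensus idea described in this section can be illustrated with a minimal voting sketch: keep only the genes that several independent methods report, then flag those that fall outside known biology. The gene labels `GENE_X` and `GENE_Y`, the method names, the `known_ad_biology` set, and the `min_methods` threshold are all hypothetical, and this is not the multimethod co-expression pipeline used by the consortium (Wan et al., 2020).

```python
from collections import Counter

def consensus_genes(method_results, min_methods=2):
    """Return genes reported by at least `min_methods` of the methods."""
    votes = Counter()
    for genes in method_results.values():
        votes.update(set(genes))  # each method votes at most once per gene
    return sorted(g for g, n in votes.items() if n >= min_methods)

# Hypothetical module memberships from three hypothetical methods.
results = {
    "method_a": ["APOE", "TREM2", "GENE_X"],
    "method_b": ["APOE", "GENE_X", "GENE_Y"],
    "method_c": ["APOE", "TREM2"],
}

# Assumed stand-in for "known AD biology", for illustration only.
known_ad_biology = {"APOE", "TREM2"}

consensus = consensus_genes(results, min_methods=2)
# Consensus genes with no match in the known set: the "dark matter" candidates.
unexplained = [g for g in consensus if g not in known_ad_biology]
```

In this toy input, `GENE_X` is found by two methods but is absent from the known set, so it is the only candidate flagged for follow-up.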

Incentivizing Collaboration Through Competitions and Crowdsourcing

Omberg presented an example of a collaboration that brings working groups together to crowdsource solutions to fundamental biomedical questions: Dialogue on Reverse Engineering Assessment and Methods (DREAM) Challenges.9 When a new method is developed, its developer often tests and validates the method themselves, he said, which can result in better-than-average findings. DREAM Challenges were created to separate development from benchmarking and to address problems related to “self-assessment,” which has impeded the translation and dissemination of biomedical tools and methods. New incentives would help to push research communities to develop domain standards and benchmarks, he suggested. By posing each challenge as an open problem, the project aims to quickly explore a larger space of solutions through crowdsourcing.

mPower is a series of mobile research studies aimed at understanding the progression of Parkinson’s disease (PD) in individuals, Omberg said.10 In one of those studies, the researchers recruited individuals with PD to

___________________

9 More information about the DREAM Challenges program is available at http://dreamchallenges.org (accessed December 11, 2020).

10 More information about mPower is available at https://sagebionetworks.org/research-projects/mpower-researcher-portal (accessed December 11, 2020).


measure their symptoms using accelerometers and on-screen interactions via smartphones (Bot et al., 2016). In accordance with Sage Bionetworks’s values, this study released its first 6 months of data upon compilation and before analysis, making it a public resource.11 Within the first year, more than 130 individuals from 35 different institutions had requested access to the data, which eventually led to more than a dozen publications. However, several of these publications arrived at insights about these data that were inaccurate, Omberg said. He attributed this, in part, to the fact that some of those who accessed the data were better versed in machine learning than in disease. Thus, they did not consider the covariates and noise characteristics in the data that might affect the disease or measurements.

Competitions can help incentivize innovation, Omberg said. He described a challenge that Sage Bionetworks held to help develop impartial benchmarks from mPower data. Sage Bionetworks asked researchers to build diagnostic digital biomarkers using the mPower accelerometer data to determine whether an individual has PD and, if so, the severity of the individual’s disease. In previous analyses of these accelerometer and digital wearable data, a group of experts had taken years to develop a diagnostic biomarker with an area under the curve (AUC) of 0.69, which is not a good level of predictive accuracy. Through the challenge, more than 300 groups accessed the data and used them to build their own models; within weeks, the winning group had built a measure with an AUC of 0.84, far exceeding the predictive accuracy of the one developed by the expert group (Sieberts et al., 2021). This new measure represented roughly a 20 percent increase in performance. This was a prime example of how incentivizing competition can produce higher-quality work. Importantly, this entire effort was made open source so that the data, tools, and collaborative methodologies can be reused.12
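For readers unfamiliar with the AUC metric mentioned above, the following sketch computes it from first principles as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties count as one half). A value of 0.5 corresponds to chance and 1.0 to perfect ranking. The labels and scores are made up for illustration and are not mPower data; the last line simply works through the arithmetic behind the improvement from 0.69 to 0.84.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Count positive/negative pairs ranked correctly; ties count half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = has PD) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
example_auc = auc(labels, scores)  # 8 of 9 pairs ranked correctly

# Relative improvement implied by the two AUC values quoted in the text.
relative_gain = (0.84 - 0.69) / 0.69  # about 0.22
```

The absolute gain of 0.15 corresponds to a relative gain of about 22 percent, which is in line with the approximate figure reported in the session.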

Challenges Related to Data Governance

Many of the data needed to answer important clinical questions are not open data, Omberg said. Ethical considerations as well as regulatory considerations such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States and General Data Protection Regulation 2016/679 in Europe restrict access to and use of data. These considerations are rooted in data governance: the freedoms, constraints, and incentives that determine how two or more parties manage, among themselves and with others, the ingress, storage, analysis, and egress of data, tools, methods, and knowledge. Because data governance involves two or more parties, there are often associated issues related to communication, negotiation, and interpersonal power dynamics, Omberg said. Data governance also affects software, storage, computing power, and know-how, and it involves access to external digital resources. Data governance structures and their attributes can be characterized in terms of their associated freedoms and availability, Omberg said. The most closed and restricted data would rank lowest in both freedoms and availability, while data from open sources and citizen science have high degrees of freedom and availability.

___________________

11 More information about the mPower Researcher Community is available at https://www.synapse.org/#!Synapse:syn4993293/wiki/247859 (accessed December 1, 2020).

12 Sage Bionetworks’s open-source tools are available at https://github.com/Sage-Bionetworks/mhealthtools (accessed December 1, 2020).

Model-to-Data Challenge in Digital Mammography

One option for governance that Sage Bionetworks has experimented with is model-to-data challenges, Omberg said, which provide an opportunity to spur innovation without compromising the confidentiality of biomedical data (Guinney and Saez-Rodriguez, 2018). These challenges are appropriate for addressing specific research questions rather than general purpose analysis, Omberg noted. Challenge participants submit containerized models, built using training data, to a privacy-preserving cloud platform. Model submissions are then validated using datasets containing patient-identifying information that are not available to the challenge participants in order to develop competition leaderboards and benchmarks. For example, the Digital Mammography DREAM Challenge13 was a model-to-data challenge in which participants built models to predict whether mammogram images contained breast cancer. It included data collected from 86,801 women in 146,201 digital mammography examinations conducted at Kaiser Permanente; 1,006 of these exams were cancer-positive. A total of 640,394 images were collected, along with clinical and demographic information. Dozens of teams participated by submitting models based on deep-learning methods, which were trained and validated on these images without the participants ever having seen or accessed the data directly.
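The model-to-data pattern can be sketched in a few lines: the data holder keeps the records and returns only an aggregate score for each submitted model. Everything in this sketch is hypothetical, including the `HiddenEvaluator` class, the toy feature vectors, and the use of accuracy as the metric. In a real challenge the isolation comes from containers and the hosting platform rather than from Python naming conventions, and the scoring metric is chosen by the organizers.

```python
from typing import Callable, Sequence

# A submitted model maps a feature vector to a probability of disease.
Model = Callable[[Sequence[float]], float]

class HiddenEvaluator:
    """Holds private data and exposes only an aggregate score."""

    def __init__(self, features, labels):
        self._features = list(features)  # never returned to participants
        self._labels = list(labels)

    def score(self, model: Model) -> float:
        """Run the model on the hidden data and return its accuracy."""
        preds = [1 if model(x) >= 0.5 else 0 for x in self._features]
        correct = sum(p == y for p, y in zip(preds, self._labels))
        return correct / len(self._labels)

def submitted_model(x: Sequence[float]) -> float:
    """Hypothetical participant model: threshold on the mean feature value."""
    return 1.0 if sum(x) / len(x) > 0.5 else 0.0

# Toy stand-in for the data holder's private validation set.
evaluator = HiddenEvaluator(
    features=[[0.9, 0.8], [0.1, 0.2], [0.7, 0.6], [0.4, 0.3]],
    labels=[1, 0, 1, 0],
)
leaderboard_score = evaluator.score(submitted_model)
```

The participant sees only `leaderboard_score`, which is what allows a leaderboard to be built without releasing patient-level records.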

USING BIG DATA FOR CLINICAL STRATIFICATION OF PATIENTS

Atul Butte, the Priscilla Chan and Mark Zuckerberg Distinguished Professor and the director of the Bakar Computational Health Sciences

___________________

13 More information about the Digital Mammography DREAM Challenge is available at https://sagebionetworks.org/research-projects/digital-mammography-dream-challenge (accessed December 11, 2020).


Institute at the University of California, San Francisco, discussed how big data can be used for the clinical stratification of patients. He opened with an overview that emphasized the size and complexity of the University of California Health (UCH) system. The University of California system comprises ten campuses and three national labs, with approximately 200,000 employees and about 250,000 students per year, Butte said. Its health system includes 20 health professional schools and trains half of all the medical students and residents in California. Around 5,000 faculty physicians and 12,000 nurses work in the UCH system, with 100,000 outside doctors writing orders on patients within the system. The UCH system also includes five National Cancer Institute (NCI) comprehensive cancer centers. The system has a policy of institutional review board reliance, whereby approvals at one campus can easily be applied at other campuses, and it also benefits from centralized contracting.

In 2016 the UCH system and UnitedHealth Group entered into an agreement to form a single accountable care organization (ACO) for the entire University of California (UC) system as part of a 10-year strategic relationship with Optum to expand the use of its clinically integrated network services and advanced data analytics services.14 By their nature, ACOs take on risk; they are paid per member, per month, and must absorb the current prices for delivering care. Moving forward, the UCH system must determine how best to practice medicine in order to operate with a single ACO, because the various UC campuses deliver care in different ways. Thus, Butte said, the operational need to harmonize practice data across the entire system motivated the decision to aggregate all of its health data in a single place.

Centralizing Health Care Data Across the University of California Health System

Today, health care data from across the six UC medical schools15 are stored both locally and in the centralized UCH data warehouse.16 These data include basic data from more than 15 million patients treated since 2005 and detailed electronic health record (EHR) data on more than 7 million patients since EHR systems were installed in early 2012, providing the

___________________

14 More information about UHC’s ACO is available at https://hitconsultant.net/2016/10/03/uchealth-united-healthcare-form-new-aco (accessed December 1, 2020).

15 The six schools are the University of California, San Francisco; University of California, Los Angeles; University of California, Irvine; University of California, Davis; University of California, San Diego; and University of California, Riverside.

16 Before the coronavirus disease 2019 (COVID-19) pandemic, data backups were scheduled monthly, according to Butte; during the COVID-19 pandemic, data backups have been scheduled nightly from midnight to 6:00 a.m. Pacific time.


UCH system with a unique view of its medical system. No other example of comprehensive bulk data sharing across multiple academic medical centers exists in the United States, Butte said. All of the UC medical schools now use Epic for their EHRs, but the central database was built using the Observational Medical Outcomes Partnership (OMOP) common data model for the data backend rather than Epic because OMOP is an open, vendor-neutral method for storing patient data. Each campus moves its data to OMOP to facilitate centralization using commonly shared and governed tools. As of this writing, the database contains structured data from 2012 to the present, including data for 7.3 million patients, 192 million encounters, 553 million procedures, 739 million medical orders, 661 million diagnosis codes, and 2.1 billion laboratory tests and vital signs. This database includes “everything from Tylenol to chimeric antigen receptor T cells” as well as regulatory data related to California’s Office of Statewide Health Planning and Development, pathology and radiology text elements, and California state death index data.17 Claims data from the UCH system’s self-funded plans are also included. Elements of this data system are continuously being harmonized and, with new medications and cellular therapies approved on a weekly to monthly basis, the database must be constantly updated to include the latest terminologies. Data governance policies have been put in place, Butte said, to ensure the safe and respectful use of these data, both internally and externally.
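To make the benefit of a shared schema concrete, the sketch below loads records from two campuses into one in-memory SQLite database and answers a cross-campus question with a single query. The table and column names `person`, `drug_exposure`, `person_id`, `year_of_birth`, `drug_concept_id`, and `drug_exposure_start_date` follow the OMOP common data model, but the tables are heavily simplified; the `campus` column, the concept IDs, and all rows are hypothetical and do not reflect the UCH warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified subset of OMOP-style tables; `campus` is a hypothetical column.
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        year_of_birth INTEGER,
        campus TEXT
    );
    CREATE TABLE drug_exposure (
        drug_exposure_id INTEGER PRIMARY KEY,
        person_id INTEGER REFERENCES person(person_id),
        drug_concept_id INTEGER,
        drug_exposure_start_date TEXT
    );
    """
)

# Made-up rows standing in for data that each campus has mapped to the schema.
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, 1950, "A"), (2, 1962, "A"), (3, 1971, "B")],
)
conn.executemany(
    "INSERT INTO drug_exposure VALUES (?, ?, ?, ?)",
    [
        (10, 1, 111, "2019-01-05"),
        (11, 2, 111, "2019-03-02"),
        (12, 3, 111, "2020-06-11"),
        (13, 3, 222, "2020-07-01"),
    ],
)

# One query covers every campus because all rows share the same structure.
rows = conn.execute(
    """
    SELECT p.campus, COUNT(DISTINCT d.person_id)
    FROM drug_exposure AS d
    JOIN person AS p ON p.person_id = d.person_id
    WHERE d.drug_concept_id = ?
    GROUP BY p.campus
    ORDER BY p.campus
    """,
    (111,),
).fetchall()
```

Here `rows` holds the number of distinct patients per campus exposed to the hypothetical drug concept 111; without a common model, each campus would need its own query and mapping.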

The creation of the UCH system data warehouse has facilitated a range of benefits and improvements, Butte said. Many operational teams within the UCH system now use and benefit from the UCH system data warehouse, which is already saving UC millions of dollars. The UCH system data warehouse has also facilitated the central management of primary care patients. Central tools have been developed to improve the quality of care. For UCH system self-funded health plans, managing costs has led to some decreases in expenses, especially in pharmacy spending. Many UCH system and UC employees participate in the self-funded plans, choosing to receive care from their employer. Thus, there is an alignment of incentives for all parties to ensure that the self-funded plans provide the best possible care to the employees and families within these systems, Butte noted. Some measures can be taken to realize cost savings, including the use of generic medications instead of branded versions. Making a case for data aggregation and harmonization on the basis of improving research alone can be challenging, Butte said; however, making a compelling business case can

___________________

17 Butte explained that in the United States each state must track in-state deaths in order to manage Social Security and other programs. UCH has been contracted to manage these death indices and has merged that data with its central data warehouse.


bring heightened interest to these efforts and attract health system funding to pay for them.

University of California Cancer Consortium

In 2017 the UCH system announced that its five NCI-designated comprehensive cancer centers would collaborate through the newly created University of California Cancer Consortium to help patients benefit from therapies that are only deployed through trials.18 In 2019 the UCH system saw more than 160,000 cancer patients, a patient volume that Butte estimated to be three to four times the volume at the largest cancer centers in the United States. These types of consortia strengthen the work of smaller individual cancer centers, which benefit from scale through collaboration, Butte said. He presented the University of California Cancer Consortium’s Foundation Medicine cancer genomic reports as an example of how the group constructs and represents the entire UC system even with the latest tools used in precision medicine.

The consortium benefits from common contracting and the institutional review board reliance process, allowing the group to scale large trials for cancer and cancer therapies quickly, Butte said. For example, Foundation Medicine performs cancer genomics testing for cancer patients across UC and other institutions. UC can show those gene mutation results along with patient race, ethnicity, age, smoking status, and gender. Furthermore, the consortium is able to collect Foundation Medicine cancer genomics data from across the UCH system in one central database. Butte presented data from the San Francisco, Los Angeles, Irvine, and Davis campuses that showed that TP53 is the mutated gene most frequently found in UCH system cancer patients who have their cancer sequenced. The consortium can use these data on all genetic mutations, downstream therapies, and cancer cases within this centralized database. The consortium also collects cost and charge data for the drugs used across the UCH system, providing the ability to respond to questions in ways that are similar to, and in some respects enhanced relative to, how they are answered by groups such as the Patient-Centered Outcomes Research Institute (PCORI). As an example, Butte described how his team analyzed the system’s data on the top 10 drug charges across UCH and found that most of the drugs used are biologics and that drug charges across the UCH system total $1.6 billion. Given that the types of biologics used throughout the UCH system

___________________

18 More information about the University of California Cancer Consortium is available at https://www.ucsf.edu/news/2017/09/408271/university-california-cancer-consortium-takes-californias-14-billion-killer (accessed December 1, 2020).


vary, these real-world data enable comparative effectiveness studies to be conducted in order to evaluate these different therapies.

Using Real-World Data from the Cancer Consortium

Another advantage of the UCH system database, Butte said, is that a single researcher can conduct multicenter, real-world evidence studies about the use of a drug. He described preliminary work conducted by a graduate student, Michelle Wang, on the therapeutic axicabtagene ciloleucel (axicel/Yescarta). Across the UCH system, 120 patients have already been treated with this cellular therapy, primarily at the Los Angeles, San Diego, and San Francisco campuses. Wang used the data from the patients treated with the drug within the UCH system and analyzed the patients by race, age, ethnicity, and gender (Neelapu et al., 2017). Wang is now evaluating whether the UCH patients would have qualified for the ZUMA-1 clinical trial for axicabtagene ciloleucel,19 in which 111 individuals were treated with the study drug. The data available through the consortium can already provide a larger sample than the original trial, Butte said. Wang found that more than half of the UCH patients might not have qualified for the ZUMA-1 trial for a number of potential reasons, including (1) the UCH patients might have had worse health status (e.g., they were on oxygen) during the initial acquisition of cells; (2) their laboratory tests were otherwise abnormal; or (3) they required bridging therapy. The amount of time spent from acquiring the cells to delivering the cells (i.e., “vein-to-vein time”) determined eligibility for the trial, Butte said. In the real world this time span is longer than it is in the trial data—especially if patients need bridging therapy—thus precluding their qualification for ZUMA-1. This work highlights the importance of real-world evidence, because patients receiving treatments in the real-world setting may not match the patients studied in randomized controlled trials. Both randomized controlled trials and analysis of real-world patients are valuable, Butte said, but this combination of data will be especially useful in studying regenerative medicines.
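The kind of retrospective screening described above can be sketched as a rule check over patient records. The three flags below mirror the reasons listed in the text (oxygen use at cell collection, abnormal laboratory tests, bridging therapy), but they are a simplified stand-in and not the actual ZUMA-1 eligibility criteria; the `Patient` class, its field names, and the four records are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    patient_id: str
    on_oxygen: bool        # worse health status at cell collection
    labs_abnormal: bool    # laboratory tests outside the accepted range
    needed_bridging: bool  # required bridging therapy before infusion

def exclusion_reasons(p: Patient) -> list:
    """List the simplified reasons a patient would not have qualified."""
    reasons = []
    if p.on_oxygen:
        reasons.append("on oxygen at cell collection")
    if p.labs_abnormal:
        reasons.append("abnormal laboratory tests")
    if p.needed_bridging:
        reasons.append("required bridging therapy")
    return reasons

# Made-up real-world cohort, for illustration only.
cohort = [
    Patient("p1", False, False, False),
    Patient("p2", True, False, False),
    Patient("p3", False, True, True),
    Patient("p4", False, False, False),
]

screened = {p.patient_id: exclusion_reasons(p) for p in cohort}
eligible = [pid for pid, reasons in screened.items() if not reasons]
fraction_ineligible = 1 - len(eligible) / len(cohort)
```

Keeping the reasons per patient, rather than a single yes or no, makes it possible to report which criteria most often separate real-world patients from the trial population.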

Wang also prepared data, largely using automated tools provided by the consortium, which compared progression-free survival and overall survival at 12 months for patients treated with axicabtagene ciloleucel. These rates were comparable to the ZUMA-1 trial, Butte said. The availability of these data allows the UCH system to evaluate why outcomes may vary across UC campuses. Using the consortium’s data tools, Wang was able to analyze the number and type of adverse events across individual patients, with a focus on mild and severe neutropenia. She can also build computational tools to extract data from laboratory results, including various

___________________

19 See https://clinicaltrials.gov/ct2/show/NCT02348216 (accessed February 10, 2021).


markers and even text notes that can be parsed and used. In addition, Wang has begun working to compare data related to cytokine release syndrome (CRS) and neurotoxicity. These data can be plotted according to multiple grading strategies; doing so revealed that the timeline for CRS differs from that of neurotoxicity, with CRS peaking earlier than the neurological symptoms seen in neurotoxicity.

Use Cases for Real-World Data

The UCH system now captures real-world data on all activities within its campuses in its clinical data repositories. “Everything we do and measure on patients is captured now, and the electronic health record is now the legal record for the patient,” Butte said. Data should serve more than just academic interests, he emphasized. For example, broad and reliable data can be used to create a strong business case for a specific drug or intervention, which tends to speed up implementation. To enumerate the many potential uses of data—including EHR data, clinical data, and patient-reported data—Butte and his colleague compiled 21 use cases for real-world data (Rudrapatna and Butte, 2020) (see Table 4-1). These uses for real-world data extend far beyond the Food and Drug Administration’s requirements for pharmaceutical and biotech companies. Ultimately, Butte said, data should be used to communicate with patients about how they may benefit from treatment, how difficult their treatment will be, and the advantages and disadvantages of various potential therapies they may receive.

DISCUSSION

Establishing a Lingua Franca of Data

In order to promote data sharing, there is a need for a lingua franca, or common language, among industry, regulators, and academia, said session moderator Sadik Kassim. Such a system was established within UCH, but he asked how a lingua franca might be established more broadly. Various standards have been developed, Butte said, including Fast Healthcare Interoperability Resources (FHIR), a federal standard for data formats and an application programming interface for exchanging electronic health records. The UCH system uses this standard primarily to export data to patients via the FHIR feed. Many patients receive these feeds using their smartphone, but FHIR feeds can also be used with other tools and technology devices. For sharing data within health systems, UCH uses OMOP, which is also used by PCORI. OMOP is open source and vendor neutral, Butte said, making it as close as possible to a lingua franca. FHIR and OMOP both refer to other standards, so they do not eliminate the need to keep track of vocabularies and the names of pharmaceuticals. FHIR and OMOP are used far less commonly in the field of digital health, Omberg said. The lack of consistent standards has caused some difficulty in Omberg’s work, especially in working across datasets. The positive impact of FHIR on medical records affirms the importance of developing standards, he said.

TABLE 4-1 Use Cases for Real-World Data

Category of Use Use Cases for Real-World Data
Post-approval safety
  • Updating side effect rates
  • Discovering novel side effects
Supporting regulatory approval
  • Conducting single-arm experimental trials
  • Supporting “digital approvals”
  • Evaluating biosimilar development
Informing clinical trials design
  • Improving patient selection
  • Increasing efficiency of data collection (“trimming the trials”)
Continually establishing efficacy
  • Assessing the efficacy–effectiveness gap
  • Searching for efficacy in specific populations
  • Informing effect modifiers and precision medicine
  • Evaluating long-term, post-trial outcomes
Comparative effectiveness
  • Integrating costs with comparative effectiveness
  • Understanding effects of pharmacy practices on health care use
  • Studying novel on-label pharmaceuticals versus older off-label drugs
Studying the practice of medicine
  • Improving quality of practice and reducing medical errors
  • Standardizing care and care delivery
  • Studying the effect of payors on medical care
  • Evaluating impact of new-generation diagnostics on outcomes
Data-driven decision support
  • Improving clinical decision support: the provider perspective
  • Improving clinical decision support: the patient perspective
  • Improving clinical decision support: the community perspective

SOURCES: Atul Butte workshop presentation, October 22, 2020. Adapted from Rudrapatna and Butte, 2020.
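
To make the point about FHIR and OMOP relying on other vocabularies concrete, the sketch below converts a pared-down FHIR `Patient` resource into a row shaped like the OMOP `person` table. It is an illustration under stated assumptions, not the UCH pipeline: the sample resource is invented and the mapping covers only a few fields. The gender concept IDs (8507 for male, 8532 for female, 0 for no matching concept) come from the OMOP standard vocabulary, which is exactly the kind of external lookup the discussion says must still be tracked.

```python
import json

# Invented, minimal FHIR Patient resource (real resources carry many more fields).
FHIR_PATIENT_JSON = """
{
  "resourceType": "Patient",
  "id": "example-001",
  "gender": "female",
  "birthDate": "1970-04-12"
}
"""

# FHIR administrative gender codes mapped to OMOP standard concept IDs.
GENDER_CONCEPT_ID = {"male": 8507, "female": 8532}


def fhir_patient_to_omop_person(resource, person_id):
    """Map a FHIR Patient (as a dict) to a simplified OMOP person row."""
    if resource.get("resourceType") != "Patient":
        raise ValueError("expected a FHIR Patient resource")
    birth_date = resource.get("birthDate", "")
    return {
        "person_id": person_id,
        # 0 is the OMOP convention for "no matching concept".
        "gender_concept_id": GENDER_CONCEPT_ID.get(resource.get("gender"), 0),
        "year_of_birth": int(birth_date[:4]) if len(birth_date) >= 4 else None,
        # Keep the source identifier so the row can be traced back.
        "person_source_value": resource.get("id"),
    }


person_row = fhir_patient_to_omop_person(json.loads(FHIR_PATIENT_JSON), person_id=1)
```

Even in this small case the conversion depends on a vocabulary table that lives outside both formats, which is why adopting a common data model does not by itself remove the work of harmonizing terms.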

Expanding to a National Data-Sharing System

A member of the audience asked whether any efforts are under way to expand the system described by Butte beyond California and how the field of regenerative medicine can use the type of data centralization implemented in the UCH system. There are no immediate plans for expansion outside of California, Butte said, although there are some decentralized national efforts, such as PCORI, and private efforts, such as TriNetX.20 A business case would have to be presented to motivate the creation of a national centralized database, he said, but such a case may be difficult to make. To highlight this challenge, he asked why stakeholders would choose to share data for reasons other than getting grants and publishing papers. Kassim agreed that this is a matter of incentives.

Single-Investigator Versus Consortia-Driven Research

One audience member asked about Omberg’s earlier comments on insufficient data for finding “druggable” targets and on the challenges surrounding the use of single-investigator-initiated, hypothesis-driven research, which is often aimed at understanding the fundamental processes of disease rather than just hunting for targets. The audience member questioned whether the best use of a systems approach is to find “druggable” targets or, alternatively, whether it can involve higher priorities such as finding variables that control disease pathways and other preferences. Government-funded research often lacks access to data that come from hospital systems, Omberg replied; in many cases a single institution simply does not have enough data. Consortia can be instrumental in generating larger datasets when multiple academic institutions or industry partners collaborate to build them. In some cases such collaborative datasets already exist. For example, in mental health research investigating schizophrenia and depression, brain samples often must be collected from brain banks. However, there have been instances where no single brain bank had datasets that were sufficient in size to power RNA-sequencing analyses or genome-wide association studies. Only by pooling resources are there enough data, Omberg said.

Single-investigator and consortia-driven research serve different purposes, so neither can replace the other, Omberg said. Systems-level approaches alone are not sufficient to answer all pertinent research questions, so deep dives into individual mechanisms and other inquiries will always be of value, he added. He suggested that individual-investigator- and consortia-driven research approaches be paired, so that systems approaches are supplemented with deep-dive research to better understand and contextualize system-level discoveries. Experiments can then be conducted to validate these approaches.

___________________

20 More information about TriNetX is available at https://trinetx.com (accessed December 11, 2020).

As to whether to identify druggable targets or to use more comprehensive approaches to understand disease states—as well as the variables that determine and drive these states—Omberg said that in his collaborative work with AMP AD, they have identified multiple targets. Once a target is identified, certain researchers within AMP AD have advocated for investigating the target’s biological basis to better understand its mechanisms and significance. Additionally, the biological investigation of these targets may lead to the discovery of other targets that may be better suited in terms of druggability. Conversely, many pharmaceutical stakeholders have not expressed interest in such biological investigations, Omberg said. Rather, they tend to ask for a target to be identified with the intention of determining its druggability independently. This is an example of divergent views on the importance of these two approaches, he said.

Applying the Concept of Attractor States to Patient Identification

The concept of attractor states (see Chapter 2) may be useful if applied to understand disease processes and inform the identification and stratification of patients at the systems level, Kassim said. Butte proposed using “trajectories of care” to model how patients move through various states within health systems. Related concepts, not yet well explored, include patients’ transitions between states, the number of possible states in medicine, and similarities and differences among decision nodes. This modeling work, which is still in its infancy, is complicated because there are no simple trajectories of care to model as a first step. For example, a patient may accumulate multiple conditions or diseases that could each be treated in several different ways. Modeling these types of probabilistic transitions is not a trivial task, Butte said. For instance, acute and chronic diseases would likely be modeled quite differently.
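
One simple way to picture the probabilistic transitions Butte described is a Markov chain over care states. This is an illustrative sketch, not a model presented at the workshop: the three states and all transition probabilities are made up, and real trajectories of care would involve far more states and transitions that depend on a patient's history.

```python
# Hypothetical care states and one-step transition probabilities (rows sum to 1).
STATES = ["outpatient", "inpatient", "remission"]
TRANSITIONS = {
    "outpatient": {"outpatient": 0.7, "inpatient": 0.2, "remission": 0.1},
    "inpatient": {"outpatient": 0.5, "inpatient": 0.4, "remission": 0.1},
    "remission": {"outpatient": 0.0, "inpatient": 0.0, "remission": 1.0},
}


def step(distribution):
    """Advance a probability distribution over states by one time step."""
    next_distribution = {state: 0.0 for state in STATES}
    for source, p_source in distribution.items():
        for target, p_move in TRANSITIONS[source].items():
            next_distribution[target] += p_source * p_move
    return next_distribution


def distribution_after(start_state, n_steps):
    """Distribution over states after n_steps, starting from a known state."""
    distribution = {state: 0.0 for state in STATES}
    distribution[start_state] = 1.0
    for _ in range(n_steps):
        distribution = step(distribution)
    return distribution


after_two_steps = distribution_after("outpatient", 2)
```

A model like this treats every patient in a given state identically, which is one reason acute and chronic diseases would likely need different structures, as noted above.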

Daily measurements of patients with PD have begun to reveal that patients’ responses to their medications change over time, Omberg said. In addition, digital biomarker studies have revealed variations between groups, such as between males and females, suggesting a cluster of differences in how biomarkers appear by gender. Differences were also found between age groups; for example, responses to levodopa varied, with some patients showing a response in how they performed a finger-tapping test (a measure of bradykinesia), while others had less gait freezing while walking.

Assuring Patients About Data Security

One audience member commented that patients may seek assurance that their data are secure within health systems such as UCH. Every patient should feel secure, Butte emphasized. The UCH system’s primary goal in centralizing its database was “to get patients their own data back,” he said. First and foremost, it is for the patients’ benefit that their data are organized, harmonized, and entered into tools such as FHIR feeds. Still, it would be tragic if the multi-billion-dollar investments in EHR systems were not also used to improve the practice of medicine, but these improvements must be realized in a safe and respectful manner, Butte said. Because HIPAA has been in place for more than 20 years, managing data under it has become predictable. Those working with patient data know how to de-identify data in accordance with HIPAA; they also have developed research methods for re-introducing certain data elements, such as zip codes, to create “limited datasets,” Butte said. The stability of current regulations and policies related to patient data can offer patients assurance that their data are being used in a safe, respectful way. Furthermore, patients sit on UCH institutional review boards and participate in the institutional review process. Researchers and health systems are also subject to the policies of data governance before any patient data, even de-identified patient data, may be exported for any purpose. In cases where there is clearly mutual benefit in sharing de-identified patient data, the contracts used will require that the recipient of the data not re-identify any patients, further protecting patients’ data once they have been exported.21
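
The difference between a de-identified record and a “limited dataset” can be sketched in a few lines. This is a deliberately simplified illustration, not UCH's procedure or a compliant implementation: the field names and sample record are hypothetical, the identifier list is a small subset of the categories HIPAA enumerates, and the Safe Harbor conditions on three-digit zip codes (such as the population threshold) are not checked here.

```python
# Hypothetical field names standing in for direct identifiers.
DIRECT_IDENTIFIERS = {"name", "mrn", "phone", "email"}


def deidentify(record):
    """Drop direct identifiers and coarsen zip code and birth date."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "zip" in cleaned:
        # Keep only the first three digits of the zip code.
        cleaned["zip"] = cleaned["zip"][:3]
    if "birth_date" in cleaned:
        # Reduce a YYYY-MM-DD date to the year alone.
        cleaned["birth_year"] = cleaned.pop("birth_date")[:4]
    return cleaned


def limited_dataset(record):
    """Drop direct identifiers but retain full zip code and dates."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


# Invented record, for illustration only.
record = {
    "name": "Jane Example",
    "mrn": "000123",
    "zip": "94143",
    "birth_date": "1970-04-12",
    "diagnosis": "large B-cell lymphoma",
}

deidentified = deidentify(record)
limited = limited_dataset(record)
```

Because a limited dataset retains elements such as full zip codes, it is the kind of export that the contract terms described above, including the prohibition on re-identification, are meant to govern.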

In his work related to the DREAM Challenge, Omberg said, he used particular contract language in dealings with the Kaiser Permanente health system, but contracts were not created with the individuals representing the various organizations that participated in the challenge because those participants did not have access to the data. Challenge participants only agreed to particular requirements, such as making algorithms available using open-source licensing, but their participation did not involve access to patient-identifying data. Setting up contracts such as material transfer agreements or data use agreements can sometimes cause year-long delays because they take time to develop and approve, he added. In some extreme cases data transfers can involve agreements among multiple international and local stakeholders and can take more than 1 year to finalize. It is always best to harmonize and clean data to minimize complications in data transfers, Omberg said.

Navigating Data Shortcomings

One reason for discussing systems biology, Kassim said, is to enable the systematic identification of quality attributes of products that will lead to clinical responses in regenerative medicine. He asked how data can be used to understand the mechanisms of action or critical quality attributes and whether there are sufficient patient biology, immunology, or product characterization data being collected in the realm of regenerative medicine. Given the mixed nature of datasets, with some data coming from highly regulated analytical methods and other data coming from more exploratory methods, he asked how these data can be integrated into a single database. The amount of data collected will never be sufficient, Omberg said. Researchers must use what data are available and account for the features of those data in their modeling and analysis. Even in cases where sufficient data have ostensibly been captured, rapid changes within biological systems preclude the possibility of maintaining a complete dataset. For example, when a patient has blood drawn at a clinic for omics analysis, that analysis “misses” the patient’s entire life history. Thus, the resulting analysis is merely a snapshot. In this sense, researchers will never have enough data. They can, however, use available data to conduct their modeling and analysis, as long as they diligently account for the shortcomings of their datasets. In the face of insufficient data, the task for researchers is to explore what can be done with the data that are currently available, Butte said. Researchers can expect the quality and quantity of data to improve over time, but they should not wait for these improvements. Rather, Butte continued, they should find the best ways to put available data to use.

___________________

21 Butte remarked on the value of carefully chosen contract language, invoking the term “contract hygiene” to describe the careful crafting of contracts.

Behind these data are patients who need to benefit from this research, Butte pointed out—for instance, by using data from the research to identify a patient cohort for enrollment in post-approval studies to acquire molecular markers. Furthermore, residual data from clinical care can be used to study drugs and identify early signs of classifiers for predicting efficacy. Currently, research in regenerative medicine is aimed at predicting success, Butte said. However, introducing real-world evidence can affect early-stage discovery if researchers begin to study failures. For instance, if 30–40 percent of patients are not benefiting from a trial, researchers can enter these patients into a study to measure blood and serum and identify the markers that may be linked to drug failure. He compared this approach to the ways in which smartphone developers rely on bug reports to analyze how and why their technology fails, and suggested that a similar approach to seek out mechanisms of failure is needed in the pharmaceutical and biotechnology industries. Furthermore, those industries would benefit from a new mindset about how data should be used and how research should feed back into the design of products and services, Butte said. He also mentioned the “information commons,” which is being deployed across UC campuses as an internal, central repository for data. It includes a cloud-based, secure database where de-identified clinical, imaging, and genomics data can be viewed within the health system by researchers who are in compliance with the requisite data governance processes.

Regenerative medicine products, which are intended to repair or replace damaged cells or tissues in the body, include a range of therapeutic approaches such as cell- and gene-based therapies, engineered tissues, and non-biologic constructs. The current approach to characterizing the quality of a regenerative medicine product and the manufacturing process often involves measuring as many endpoints as possible, but this approach has proved to be inadequate and unsustainable.

The Forum on Regenerative Medicine of the National Academies of Sciences, Engineering, and Medicine convened experts across disciplines for a 2-day virtual public workshop to explore systems thinking approaches and how they may be applied to support the identification of relevant quality attributes that can help in the optimization of manufacturing and streamline regulatory processes for regenerative medicine. A broad array of stakeholders, including data scientists, physical scientists, industry researchers, regulatory officials, clinicians, and patient representatives, discussed new advances in data acquisition, data analysis and theoretical frameworks, and how systems approaches can be applied to the development of regenerative medicine products that can address the unmet needs of patients. This publication summarizes the presentation and discussion of the workshop.
