The 21st century has already seen the emergence of four pandemic viruses (chikungunya virus, Zika virus, 2009 H1N1 influenza virus, and severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2]), several viral epidemics (e.g., 2003 SARS, 2012 Middle East respiratory syndrome coronavirus [MERS-CoV], 2014 Ebola virus in West Africa, and 2018 Ebola virus in the Democratic Republic of the Congo), and intermittent sporadic outbreaks of other viruses such as H7N9 influenza. At the time of this writing, SARS-CoV-2 had spread worldwide, infecting at least 10 million people and causing an estimated 500,000 deaths within 6 months. The recurrence of such outbreaks suggests that preparedness and response strategies need modernization. Advances in metagenomics, epidemiology, and big data analyses provide new paradigms for tracing symptomatic and asymptomatic transmission networks, thereby improving our capacity to break or delay virus transmission and reduce morbidity and mortality. Recognizing this need, the U.S. Department of Health and Human Services’ Office of the Assistant Secretary for Preparedness and Response and the Office of Science and Technology Policy requested that the National Academies of Sciences, Engineering, and Medicine convene an ad hoc committee to lay out a framework to define and describe the data needs for a system to track and correlate viral genome sequences with clinical and epidemiological data. Such a system would help ensure the integration of data on viral evolution with detection, diagnostic, and countermeasure efforts.
1 This Summary does not include references. Citations for the discussion presented in the Summary appear in the subsequent report chapters.
Previous efforts to integrate genomic, clinical, and epidemiological data have led to new insights around the transmission and pathogenesis of disease, including for previous outbreaks of SARS-CoV, Ebola virus, Zika virus, seasonal influenza, mumps, foodborne illnesses, and antibiotic-resistant bacteria. The most successful approaches to date have involved multipronged approaches and the timely collaboration of public and private stakeholders.
CURRENT GENOMIC EPIDEMIOLOGY EFFORTS FOR SARS-CoV-2
Several ongoing efforts are leveraging the power of genomic epidemiology in response to the coronavirus disease 2019 (COVID-19) pandemic. In the United States, the U.S. Centers for Disease Control and Prevention’s SARS-CoV-2 Sequencing for Public Health Emergency Response, Epidemiology, and Surveillance consortium is working to coordinate a nationwide genomic sequencing effort. The National Institutes of Health supports the National COVID Cohort Collaborative (N3C), a secure portal for patient-level COVID-19 clinical data, and the National Center for Biotechnology Information’s reference sequence database. Several regional initiatives have emerged as well, integrating data sharing through existing global efforts like the Global Initiative on Sharing All Influenza Data and Nextstrain. Even as new efforts are being established, the committee found that several limitations blunt their effectiveness, such as insufficient funding, poor coordination, limited capacity for data integration, unrepresentative data, and lack of an adequately trained workforce with the multifaceted expertise needed to conduct this work. Fundamental governance and collaboration issues extending from the top down have led to the fragmentation of approaches and varying capacities at local and national levels.
Conclusion: Current sources of SARS-CoV-2 genome sequence data, and current efforts to integrate these data with relevant epidemiological and clinical data, are patchy, typically passive, reactive, uncoordinated, and underfunded in the United States. As a result, currently available data are unrepresentative of many important population features, biased, and inadequate to answer many of the pressing questions about the evolution and transmission of the virus, and the relationships of genome sequence variants with virulence, pathogenesis, clinical outcomes, and the effectiveness of countermeasures. Thus, the viral sequence data and associated data needed are not being collected.
RECOMMENDATION 1. The U.S. Department of Health and Human Services should ensure the generation of representative, high-quality full genome sequences of SARS-CoV-2 across the United States, and in the future, from emerging epidemic or pandemic pathogens, in order that these data can be used to meet key needs for genomic surveillance.
- Pathogen samples must be obtained from individuals who represent a broad diversity of factors such as race and ethnicity, gender, age, geography, and other demographic features such as housing type, clinical manifestations and outcomes, and transmissibility.
- Capacity for genomic sequencing should be developed and supported at many geographically distributed sites performing testing, including public health laboratories and academic and medical centers.
- Representative SARS-CoV-2 clinical samples from across the United States should be collected and sequenced on an ongoing basis to provide baseline data and facilitate near-real-time transmission tracking.
- Genome sequences should be shared openly on publicly accessible databases, such as those of the National Center for Biotechnology Information, linked to the Global Initiative on Sharing All Influenza Data.
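One simple way to reason about representativeness is proportional allocation: a fixed sequencing quota is divided among population strata in proportion to each stratum's share of positive tests. The following is a minimal sketch of that idea, not a method prescribed by the committee. The function name, the stratum labels, and the counts are hypothetical and chosen only for illustration.

```python
def allocate_sequencing_quota(positives_by_stratum, total_quota):
    """Split a sequencing quota across strata in proportion to positive tests.

    Uses the largest-remainder method so the allocations sum exactly
    to total_quota. All inputs here are illustrative assumptions.
    """
    total_positives = sum(positives_by_stratum.values())
    if total_positives == 0:
        return {stratum: 0 for stratum in positives_by_stratum}

    # Exact (fractional) share of the quota for each stratum.
    exact = {
        stratum: total_quota * count / total_positives
        for stratum, count in positives_by_stratum.items()
    }
    # Start from the whole-number part of each share.
    allocation = {stratum: int(share) for stratum, share in exact.items()}
    leftover = total_quota - sum(allocation.values())

    # Give the remaining slots to the strata with the largest fractional parts.
    by_remainder = sorted(
        exact, key=lambda s: exact[s] - allocation[s], reverse=True
    )
    for stratum in by_remainder[:leftover]:
        allocation[stratum] += 1
    return allocation


# Hypothetical weekly positive-test counts by geography.
weekly_positives = {"urban": 600, "suburban": 300, "rural": 100}
print(allocate_sequencing_quota(weekly_positives, 50))
```

Strictly proportional allocation can leave small groups with too few sequences to analyze, so a real sampling plan might add a minimum number of samples per stratum or deliberately oversample settings of concern.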
BUILDING A FRAMEWORK TO TRACK AND CORRELATE VIRAL GENOME SEQUENCES WITH CLINICAL AND EPIDEMIOLOGICAL DATA
Understanding the evolution of SARS-CoV-2 and its implications for transmission and clinical manifestations requires interpreting genomic data (see Recommendation 1) alongside linked clinical and epidemiological data. Table S-1 briefly outlines how viral genome sequence data, when combined with other types of data, can be used to inform questions related to transmission, evolution, and clinical disease.
Answering the questions outlined in Table S-1 will require integrating data across sources. Currently, no central repository exists for the collection and curation of infectious disease outbreak data from multiple sources such as federal, state, and local public health agencies; health care networks; and public health and clinical laboratories. Existing data integration efforts offer insights for building a more integrated data system. Leveraging and expanding existing infrastructure and planning, through programs such as N3C, will be crucial to addressing the data infrastructure challenge in a way that is strategic, innovative, and iterative.
TABLE S-1

| Goal | Question | Viral Genomic Sequence Data Needs | Clinical and/or Epidemiological Data Needs^a |
|---|---|---|---|
| Transmission patterns | Is outbreak due to multiple introductions? Where is the virus coming from? | Pathogen samples from individuals who represent broad diversity from outbreaks and many regions/countries | Time and place of virus isolation and travel history of cases |
| | Is outbreak due to local spread? How and/or where is the virus being transmitted? | Sequences from local groups/areas with increased incidence rates | Local population-based information on sites of exposure, gatherings, isolated communities, and congregate living (long-term care facilities, hospitals, prisons) |
| | Is there evidence of super-spreading events and how important are they? | Sequences of virus from groups of people infected in the same setting | Information on sites of exposure, gatherings |
| Evolution/influence of selective pressures | Is the virus changing in transmissibility? | Changes in viral genome sequence associated with increased spread | Calculations of R0 (contact tracing data: number of people infected) |
| | Is resistance to antiviral drugs or other treatments changing? | Changes in viral genome associated with failure to respond to treatment | Hospital or health care center data on patients who do not respond to therapy or show failure of treatment |
| | Is there altered escape from the host immune response/within host evolution? | Changes in viral genome associated with persistence | Hospital data on patients who show prolonged shedding |
| | Is there changed protection from vaccine-induced immunity? | Changes in virus that affect epitopes important for protective immunity and sequences of viruses associated with vaccine failure | Vaccine trial databases and post-marketing vaccine failures |
| Clinical disease | Are there strains/mutations associated with changes in disease severity? | Sequences of viruses from patients with different disease severity | Severity of symptoms, ICU, ventilation, mortality, length of hospitalization, co-infections |
| | Are there strains/mutations that affect virus loads or clearance? | Sequences of viruses from patients with viral load data | RT-PCR data to measure viral load of respiratory secretions, blood, and feces over time |
| | Are there strains/mutations that affect response to different treatments? | Sequences of viruses from before and after treatment | Treatment type, duration, and outcome |
| | Are there strains/mutations that are associated with response to different treatments? | Sequences of viruses from different body sites and patients with and without specific complications | Clinical data on complications related to different organ systems (e.g., kidney, liver, nervous system) |
| | Are there strains/mutations that predispose to MIS-C? | Sequences of viruses from children in the same community/family with and without MIS-C | Clinical data over time on immune response, viral load, treatment, and response |
NOTE: ICU = intensive care unit; MIS-C = multisystem inflammatory syndrome in children; R0 = basic reproduction number; RT-PCR = reverse transcription polymerase chain reaction.
a The committee recognizes that clinical and epidemiological data often come from very different data collection sources and efforts, but for the purposes of this table these data needs have been incorporated into one column.
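As an illustration of how sequence data and contact tracing data might be combined for the transmissibility question in the table, the sketch below computes the average number of traced secondary infections per case, grouped by viral lineage. It is a deliberately crude estimate under stated assumptions: the function name, case identifiers, and lineage labels ("X", "Y") are hypothetical, tracing is treated as complete, and the result approximates an effective reproduction number rather than R0, which strictly applies to a fully susceptible population.

```python
from collections import Counter, defaultdict


def secondary_cases_by_lineage(case_lineage, transmission_pairs):
    """Mean number of traced secondary infections per case, by lineage.

    case_lineage: dict mapping case ID -> lineage label from sequencing.
    transmission_pairs: iterable of (infector ID, infectee ID) tuples
        taken from contact tracing.
    """
    # Count how many people each case is recorded as having infected.
    secondary = Counter(infector for infector, _infectee in transmission_pairs)

    # lineage -> [total secondary cases, number of cases with that lineage]
    totals = defaultdict(lambda: [0, 0])
    for case_id, lineage in case_lineage.items():
        totals[lineage][0] += secondary[case_id]  # 0 if never an infector
        totals[lineage][1] += 1

    return {lineage: s / n for lineage, (s, n) in totals.items()}


# Hypothetical data: four sequenced cases and three traced transmissions.
lineages = {"c1": "X", "c2": "X", "c3": "Y", "c4": "Y"}
pairs = [("c1", "c2"), ("c1", "c3"), ("c3", "c4")]
print(secondary_cases_by_lineage(lineages, pairs))
```

A difference between lineages in such a summary would be a prompt for further investigation, not evidence of changed transmissibility by itself, because incomplete tracing, recent cases whose contacts have not yet become ill, and differences in setting can all produce the same pattern.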
RECOMMENDATION 2. The U.S. Department of Health and Human Services should develop and invest in a national data infrastructure system that constructively builds on existing programmatic infrastructure with the ability to accurately, efficiently, and safely link genomic data, clinical data, epidemiological data, and other relevant data across multiple sources critical to a public health response such as the current SARS-CoV-2 outbreak. Such a system should:
- Allow for the linkage of genomic data, clinical data, epidemiological data, and other relevant data in a way that is not overly burdensome to laboratories that collect data regularly.
- Create and foster safe data-sharing practices to ensure that individuals’ personal identifying information remains unexposed when data are being used and shared across the system.
- Be grounded in the pursuit of standardization, interoperability, flexibility, and the practical linkage of data, including consideration of a potential national patient identifier.
- Consider not only the data required to create such a system, but also investment in mechanisms supporting the collection and analysis of such data, including promoting formal education in “data wrangling” at the intersection of data science and infectious disease epidemiology.
- Conduct annual reviews, including scenario-based simulations, to identify capacity gaps; promote process improvement (for example, work modeled on the existing U.S. infrastructure for assessing the annual risk of seasonal influenza could improve the usability and coverage of health information exchanges and other initiatives); and ensure the inclusion of entities with supporting functions across scales, including private health care systems that provide data or state and local public health laboratories that collect data, in ongoing system development and evaluation.
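The linkage and privacy goals in the list above can be made concrete with a small sketch. It assumes, purely for illustration, that each data source holds a shared patient identifier; the identifier is replaced with a keyed hash (HMAC-SHA-256) so that records can be joined without the raw identifier appearing in the linked output. The field names, the record layout, and the single shared key are hypothetical, and keyed hashing alone would not satisfy de-identification requirements such as those under the Health Insurance Portability and Accountability Act.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Replace a patient identifier with a keyed, non-reversible token."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def link_records(genomic, clinical, epidemiological, secret_key):
    """Join three lists of records on a pseudonymous token.

    Each record is a dict with a 'patient_id' field (an assumed layout).
    The raw identifier is dropped from the linked output.
    """
    sources = (
        ("genomic", genomic),
        ("clinical", clinical),
        ("epidemiological", epidemiological),
    )
    linked = {}
    for source_name, records in sources:
        for record in records:
            token = pseudonymize(record["patient_id"], secret_key)
            payload = {k: v for k, v in record.items() if k != "patient_id"}
            linked.setdefault(token, {})[source_name] = payload
    return linked


# Hypothetical records from three separate sources.
key = b"demo-key-not-for-real-use"
genomic = [{"patient_id": "p1", "lineage": "X"}]
clinical = [{"patient_id": "p1", "icu": True}, {"patient_id": "p2", "icu": False}]
epi = [{"patient_id": "p1", "county": "A"}]

for token, sources in link_records(genomic, clinical, epi, key).items():
    print(token[:8], sorted(sources))
```

In a real system each data holder would more plausibly compute tokens before sharing, so raw identifiers never leave the source, and key management, governance, and re-identification risk from the remaining fields would all need separate treatment.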
GOVERNANCE AND LEADERSHIP
In the United States, neither federal nor state laws protect or mandate the sharing of viral samples or viral sequence data. As such, any sharing of such data and samples is done voluntarily and generally without concerns about possible regulatory barriers. Conversely, federal and state laws protect clinical and epidemiological data, including through the Health Insurance Portability and Accountability Act and the Common Rule at the federal level. The sharing of viral sequence data and associated information should be guided by national-level leadership to create supportive legal or strategic frameworks that instill principles of good governance. These data-sharing and reporting processes should be clearly established and resourced as an urgent matter, and prior to an emergency. Without a clear and urgent public health rationale, changing reporting processes during an emergency should be avoided, and an emergency should not justify noncompliance with principles of good governance, including data transparency. Principles and elements of good governance include accountability processes that clarify authorities and responsibilities, as well as maintenance of transparency, equity, participation, and clear and certain legal protections for public health agencies, researchers, and individuals’ rights.
RECOMMENDATION 3. The U.S. Department of Health and Human Services should establish an effective and sustainable science-driven leadership and governance structure for the use of SARS-CoV-2 genome sequences in addressing critical national public health and basic science issues, develop a national strategy, and ensure the funding needed for successful execution of the strategy.
- Leaders of this effort must have sufficient authorities and responsibilities to ensure that key issues are identified and prioritized, representative data are generated, and barriers to data sharing are diminished.
- A national strategy for SARS-CoV-2 genome sequences linked to clinical and epidemiological data should be developed that articulates goals, priorities, and a path for achieving them.
- A board with diverse relevant expertise should be established with broad authority to oversee and advise the national strategy for SARS-CoV-2 genome sequences linked to clinical and epidemiological data, and the delivery of actionable data for related investigations.