Suggested Citation:"4 Framework to Track and Correlate Viral Genome Sequences with Clinical and Epidemiological Data." National Academies of Sciences, Engineering, and Medicine. 2020. Genomic Epidemiology Data Infrastructure Needs for SARS-CoV-2: Modernizing Pandemic Response Strategies. Washington, DC: The National Academies Press. doi: 10.17226/25879.

4

Framework to Track and Correlate Viral Genome Sequences with Clinical and Epidemiological Data

To inform public health analysis of an infectious disease outbreak, the genomic sequence of the pathogen obtained from an infected person must be accurate and be linked with sufficient metadata for context. In this chapter, the committee lays out a framework to describe the types of clinical and epidemiological data that need to be linked to viral genome sequence data to answer specific questions related to transmission, evolution, treatment, and prevention of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and in the future, new emerging epidemic or pandemic pathogens. It concludes with a discussion about the data integration and infrastructure considerations for a system to track and correlate genomic, clinical, and epidemiological data. Demographic factors, such as age or occupation, are also important components to understand disease transmission and data collection needs for specific populations.

CONSIDERATIONS FOR TRANSMISSION, EVOLUTION, AND CLINICAL DISEASE

Overarching Data Collection Considerations

Acquiring genomic data is one piece of the effort (see Recommendation 1 in Chapter 3), but understanding the evolution of SARS-CoV-2 and its implications for transmission and clinical manifestations also relies on clinical and epidemiological data. The collection of clinical data is exceedingly important but is also one of the biggest hurdles.

Temporal and geographic information (date and location of specimen collection) is essential for assessing the spread of the pathogen in time and space throughout the epidemic, establishing transmission chains, developing predictions, and identifying clusters of similar sequences that may indicate, for example, a super-spreading event. Similarly, any recent travel to places, gatherings, or events that are currently or subsequently recognized as areas of high disease activity is fundamentally important for mitigation. Residence in a long-term care facility, recent (especially inpatient) clinical encounters, and close contact with a person known to have coronavirus disease 2019 (COVID-19) would be key variables for downstream use. Comorbid disease, immunosuppression, and disease severity may reveal associations with viral evolution that would otherwise be undecipherable. Prior antiviral treatment, prior episode(s) of COVID-19, and any prior SARS-CoV-2 vaccination are likely to become increasingly relevant in the future to contextualize SARS-CoV-2 evolution in response to selective pressure, such as escape from antiviral responses.
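As an illustration only, the variables named above could be carried alongside each genome in a single linked record. The sketch below is a minimal one under stated assumptions: the class name, field names, and the `missing_context` helper are hypothetical and are not drawn from the report or from any existing reporting standard.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class LinkedSequenceRecord:
    """Hypothetical record linking one viral genome to its context."""
    sequence_id: str                  # e.g., a repository accession
    collection_date: date             # temporal information
    collection_location: str          # geographic information
    recent_travel: List[str] = field(default_factory=list)
    long_term_care_resident: Optional[bool] = None   # None = not recorded
    recent_inpatient_encounter: Optional[bool] = None
    known_covid_contact: Optional[bool] = None
    comorbidities: List[str] = field(default_factory=list)
    immunosuppressed: Optional[bool] = None
    disease_severity: Optional[str] = None
    prior_antiviral_treatment: Optional[bool] = None
    prior_covid_episode: Optional[bool] = None
    prior_vaccination: Optional[bool] = None


def missing_context(record):
    """Return the names of fields whose value was never recorded (None)."""
    return [name for name, value in vars(record).items() if value is None]
```

Keeping "not recorded" distinct from "no" is a deliberate choice in this sketch, because an unrecorded exposure and a confirmed absence of exposure would support different downstream conclusions.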

A critical overarching consideration will be ensuring representation through participatory parties (Gould et al., 2017). To ensure that epidemiological sampling is representative of populations at risk, basic demographics should be linkable with the genomic sequence. A mix of representatives from public health, health care, tribal leadership, bioethics, and community health, along with those working in genomic epidemiology, would help determine how best to represent all critical parties. In fact, it may be helpful to establish a proactive “push” team that helps resource-challenged areas, such as tribal territories and critical access hospitals, ensure they are afforded representation. Adequate representation should go beyond geographical considerations and should also include gender, race, ethnicity, living situation, and occupation.

Table 4-1 briefly outlines how viral genome sequence data, when combined with other types of data, can be used to inform questions related to transmission, evolution, and clinical disease.

Transmission

Data on viral genomic sequences can answer questions related to the source(s) of the virus causing an outbreak. The “simplest use of genomic data” is to show how viral spread happens; combined with phylogeographic approaches, these data can be used to detect transmission hot spots and help direct interventions (Holmes et al., 2016). For the current situation with SARS-CoV-2, such data can determine how the virus is spreading between individuals and within a community. Once a vaccine is available, these data can determine whether new cases are due to virus importation or to local spread. For instance, genomic epidemiology with knowledge of viral sequences from different regions is regularly used to determine whether cases of measles virus infection are due to introduction from countries with continued endemic measles or to chains of transmission within the community due to inadequate population immunity (Harvala et al., 2015; Penedos et al., 2015).

TABLE 4-1 Summary Table of Considerations for Transmission, Evolution, and Clinical Disease

Goal	Question	Viral Genomic Sequence Data Needs	Clinical and/or Epidemiological Data Needs^a
Transmission patterns	Is outbreak due to multiple introductions? Where is the virus coming from?	Pathogen samples from individuals who represent broad diversity from outbreaks and many regions/countries	Time and place of virus isolation and travel history of cases
	Is outbreak due to local spread? How and/or where is the virus being transmitted?	Sequences from local groups/areas with increased incidence rates	Local population-based information on sites of exposure, gatherings, isolated communities, and congregate living (long-term care facilities, hospitals, prisons)
	Is there evidence of super-spreading events and how important are they?	Sequences of virus from groups of people infected in the same setting	Information on sites of exposure, gatherings
Evolution/influence of selective pressures	Is the virus changing in transmissibility?	Changes in viral genome sequence associated with increased spread	Calculations of R₀ (contact tracing data–number of people infected)
	Is resistance to antiviral drugs or other treatments changing?	Changes in viral genome associated with failure to respond to treatment	Hospital or health care center data on patients who do not respond to therapy or show failure of treatment
	Is there altered escape from the host immune response/within host evolution?	Changes in viral genome associated with persistence	Hospital data on patients who show prolonged shedding
	Is there changed protection from vaccine-induced immunity?	Changes in virus that affect epitopes important for protective immunity and sequences of viruses associated with vaccine failure	Vaccine trial databases and post-marketing vaccine failures
Clinical disease	Are there strains/mutations associated with changes in disease severity?	Sequences of viruses from patients with different disease severity	Severity of symptoms, ICU, ventilation, mortality, length of hospitalization, and co-infections
	Are there strains/mutations that affect virus loads or clearance?	Sequences of viruses from patients with viral load data	RT-PCR data to measure viral load of respiratory secretions, blood, and feces over time
	Are there strains/mutations that affect response to different treatments?	Sequences of viruses from before and after treatment	Treatment type, duration, and outcome
	Are there strains/mutations that are associated with specific complications?	Sequences of viruses from different body sites and patients with and without specific complications	Clinical data on complications related to different organ systems (e.g., kidney, liver, nervous system)
	Are there strains/mutations that predispose to MIS-C?	Sequences of viruses from children in the same community/family with and without MIS-C	Clinical data over time on immune response, viral load, treatment, and response

NOTE: ICU = intensive care unit; MIS-C = multisystem inflammatory syndrome in children; R₀ = basic reproductive number; RT-PCR = reverse transcription polymerase chain reaction.

^a The committee recognizes that clinical and epidemiological data often come from very different data collection sources and efforts, but for the purposes of this table these data needs have been incorporated into one column.

Where Is the Virus Coming From?

As described in Chapter 3, sequencing 87 SARS-CoV-2 genomes from infected patients early in the spread of COVID-19 in New York City demonstrated multiple independent introductions of dominant strains circulating in Europe followed by undetected local transmissions (Gonzalez-Reiche et al., 2020). As a hypothetical scenario, imagine the first group of college students arriving on a college campus in August 2020. If cases of COVID-19 begin to be detected in the days and weeks that follow, administrators and health care providers will need to respond in near real time. Important to their mitigation strategy will be distinguishing multiple independent introductions from local transmission. Understanding what proportion of students came to campus carrying SARS-CoV-2 strains from their home regions will require national and international baseline data. Moreover, to know which events to discourage, students will need to provide accurate data on their activities and contacts, many of whom they will not know.

Genomic data linked to time, place, and exposure history will help to cluster cases, delineate local transmissions, and illuminate which epidemiological links need not be investigated further.
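To make the idea of clustering cases concrete, the following is a minimal sketch that groups cases whose aligned genome sequences differ at only a few positions. It assumes the sequences are already aligned to the same length, and the threshold of 2 differences is an arbitrary illustrative value rather than a recommended cutoff. The function names are hypothetical, and genomic epidemiology studies typically rely on phylogenetic methods rather than this simple distance rule.

```python
from itertools import combinations


def hamming(a, b):
    """Count differing positions between two aligned sequences, ignoring 'N'."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y and "N" not in (x, y))


def cluster_cases(sequences, max_distance=2):
    """Single-linkage clustering of case IDs by pairwise sequence distance.

    sequences: dict mapping case ID -> aligned genome sequence.
    A case joins a cluster if it is within max_distance of any member.
    """
    parent = {case: case for case in sequences}

    def find(case):
        # Follow parent links to the cluster representative (with path halving).
        while parent[case] != case:
            parent[case] = parent[parent[case]]
            case = parent[case]
        return case

    for a, b in combinations(sequences, 2):
        if hamming(sequences[a], sequences[b]) <= max_distance:
            parent[find(a)] = find(b)

    clusters = {}
    for case in sequences:
        clusters.setdefault(find(case), []).append(case)
    return sorted(sorted(members) for members in clusters.values())
```

Cases that fall in different clusters are candidates for separate introductions, while epidemiological links between cases in different clusters may not need further investigation. Time, place, and exposure data would still be needed to interpret any cluster.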

Where Is the Virus Being Transmitted?

Of particular epidemiological importance for SARS-CoV-2 is the identification of routes of transmission, asymptomatic spread, and super-spreading events. Virus sequence data can help identify transmission via different pathways, both expected and unexpected (Holmes et al., 2017). For instance, SARS-CoV-2 RNA is frequently found in stool samples as well as respiratory secretions, with more persistent shedding from the gastrointestinal tract (Xu et al., 2020). Viral RNA in stool and as aerosols in the toilet areas of communal living facilities (Liu et al., 2020) may or may not represent infectious virus (Wölfel et al., 2020). Identification of fecal–oral transmission will require epidemiological information on exposures linked to virus sequence information and could have a substantial effect on public health interventions. Likewise, transmission from sites of virus persistence (particularly semen, as now recognized for Zika and Ebola viruses) provides opportunities for late transmission to reignite outbreaks after apparent control, and knowledge of such transmission affects public health interventions.

Super-spreading events and identification of the settings where they occur are of particular epidemiological importance. These events can only be identified with viral sequence data from multiple individuals involved in an outbreak linked to information on participant activities, such as religious services, sporting events, or concerts (Holmes et al., 2016).

Evolution and Influence of Selective Pressures

To better understand the evolution of SARS-CoV-2 in the United States or elsewhere, it would be ideal to integrate patient clinical data and genomic sequence data, with representation of both abundant and rare viral genotypes and of geographic, gender, racial, ethnic, and other demographic groups. The difficulty lies in obtaining such data, given the current lack of an efficient and reliable network to connect data drawn from local regions across the United States. Thus, it remains challenging to elucidate how the virus is currently evolving, which in turn limits the ability to predict its future potential for evolution in the face of ongoing and novel selection pressures, such as vaccine development.

A brief comparison of the evolution patterns of SARS-CoV, Middle East respiratory syndrome (MERS)-CoV, and SARS-CoV-2 reveals interesting similarities and differences. Although SARS-CoV-2 studies suggest an emergence event involving a single lineage, it is clear that multiple introductions of SARS-CoV and MERS-CoV occurred early in the expanding epidemic (Liya et al., 2020). During the SARS-CoV epidemic, distinct mutations in the receptor-binding domain were critically associated with the emergence of middle- and late-phase isolates that spread geographically, but transiently, throughout the world (Hu et al., 2017). Other interesting differences include the high transmissibility of SARS-CoV-2 prior to disease symptom onset, while both SARS-CoV and MERS-CoV are primarily transmitted after clinical disease onset. Mortality rates of the three emerging coronaviruses are estimated at 1, 10, and 35 percent for SARS-CoV-2, SARS-CoV, and MERS-CoV, respectively. While asymptomatic infections were rare in the 2003 SARS-CoV epidemic and are rare in the ongoing MERS-CoV outbreak, asymptomatic infections are common in SARS-CoV-2 infections, recently estimated to represent 40–50 percent of all cases (Feaster and Goh, 2020). What are the genetic differences between SARS-CoV and SARS-CoV-2 that regulate these fundamental differences in transmissibility, virulence, and pathogenesis? Could highly virulent, highly transmissible coronavirus strains emerge from zoonotic sources or during an expanding epidemic or pandemic? How does virulence evolve after a zoonotic emergence event? What is the relationship between the evolution of virulence and coronavirus transmissibility? Studies using model organisms suggest that the evolutionary relationships between virulence and transmissibility are complex and include examples of both synergistic and antagonistic relationships (Geoghegan and Holmes, 2018).

Given the large diversity of novel coronaviruses harbored in bats and other animals, it is conceivable that many highly transmissible and highly virulent zoonotic coronaviruses exist in nature and may threaten human populations in the future. Consequently, fundamental insights into the evolutionary trade-offs and genetic relationships between SARS-CoV-2 evolution, virulence, and transmissibility may better inform global preparedness efforts designed to minimize the impact of consequential coronavirus disease outbreaks of the future (Messenger et al., 1999).

For example, if the mutation rate (replication fidelity) changes such that more allele substitutions occur per round of genome replication, greater variation and adaptive potential would be available to the virus as raw fuel for evolution by natural selection (Duffy et al., 2008; Elena and Sanjuán, 2007). In turn, this scenario could lead to adaptive change whereby the major virus variants become either more or less dangerous (virulent), such that increased or decreased mortality risk becomes associated with COVID-19. Thus, evolution of a higher mutation rate in the virus may not necessarily be problematic from a clinical or public health perspective, because viral adaptation may coincide with greater or lesser host morbidity and mortality. The adaptive potential of any biological system relies on a positive correlation between increased mutation rate and a larger number of useful (beneficial) mutations occurring in the population (Orr, 2000). That is, a higher mutation rate can create more changes per unit of time, but there is no guarantee that this will also create a larger fraction of beneficial mutations, because the latter is determined by how well spontaneous mutations provide an adaptive match to the selective challenges faced by the population. Nevertheless, even if the mutation rate of SARS-CoV-2 remains unchanged, the short generation times of the virus, coupled with the very large number of infected human hosts, create ample opportunity for rare spontaneous mutations to arise and spread over short periods of time, indicating enormous virus evolutionary potential.
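The distinction between mutation rate and the supply of beneficial mutations can be written out in a simple form. The notation here is an assumption introduced for illustration and does not come from the report: let N be the number of genomes replicating per generation, U the expected number of new mutations per genome replication, and p_b the fraction of new mutations that are beneficial under current selective conditions.

```latex
% Expected number of new beneficial mutations arising per generation
\[
  B = N \, U \, p_b
\]
% Effect of changing the mutation rate from U to U'
\[
  \frac{B'}{B} = \frac{U'}{U} \cdot \frac{p_b'}{p_b}
\]
```

Under this sketch, a higher mutation rate raises B proportionally only if p_b stays the same, which is not guaranteed. It also shows why a very large N can yield many beneficial mutations even when U is unchanged.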

Is Virus Transmissibility Changing?

To date, about a dozen mutations in the gene encoding the spike protein are accumulating and being evaluated for positive selection. A prominent D614G mutation identified in both China and Europe in January 2020 is expanding in geographic range and frequency across the world (Korber et al., 2020). The mutation is located on the interface between spike protomers, where it may alter stability in a way that enhances infectivity. Identification of this mutation is leading to more detailed studies aimed at unraveling its importance in the biology of SARS-CoV-2 and its relationship to other mutations in the genome that may contribute to the selective sweep of this genotype across the globe. In addition to the in vitro data on competitive cell entry and growth rates that have been released recently (Grubaugh et al., 2020; Hu et al., 2020; Korber et al., 2020; Zhang et al., 2020), it will be important to examine the time course and natural history of mutant and isogenic parental strain experimental infections in relevant whole animal models, as well as in naturally infected humans and households.
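Tracking whether a mutation such as D614G is expanding in frequency depends on sequences carrying collection dates. The sketch below is a minimal illustration under assumed inputs: each record is a collection date paired with a mapping from spike residue position to the observed amino acid. The function name and record layout are hypothetical.

```python
from collections import defaultdict
from datetime import date


def weekly_variant_frequency(records, position=614, variant="G"):
    """Fraction of sequences per ISO week carrying `variant` at `position`.

    records: iterable of (collection_date, {residue_position: amino_acid}).
    Sequences with no call at the position are excluded from the total.
    """
    counts = defaultdict(lambda: [0, 0])  # (year, week) -> [variant, total]
    for collected, residues in records:
        amino_acid = residues.get(position)
        if amino_acid is None:
            continue
        iso = collected.isocalendar()
        week = (iso[0], iso[1])
        counts[week][1] += 1
        if amino_acid == variant:
            counts[week][0] += 1
    return {week: hits / total for week, (hits, total) in sorted(counts.items())}
```

A rising frequency alone does not demonstrate positive selection, since founder effects and uneven sampling can produce the same pattern. This is one reason the representativeness of sampling and the linked metadata matter.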

In addition, several other spike mutations have been observed in smaller clusters of cases. However, none have risen to global prominence. For example, signal peptide mutations (L5F and L8V) could potentially affect posttranslational modifications, folding, abundance, and glycosylation, while residue changes V367F, G476S, and V483A are found within the receptor-binding domain (RBD); of these, only G476S is located at the RBD binding interface. The functional significance of these mutations in mammalian angiotensin-converting enzyme 2 interaction networks (primate and animal) remains unknown and warrants additional study. Finally, several other mutations occur in regions of unknown function (H49Y, Y145H/del, Q239K) and appear to be diminishing, or remain stable in the population and are located in and around the fusion machinery (A831V and D839Y/N/E) or at the C-terminal end (P1263L). Other mutations have been recorded in ORF1ab and ORF8 regions, although their functional significance remains unknown (Chang et al., 2020).

Is the Virus Evolving in Response to Selective Pressures?

Many selective pressures could lead to evolved changes in viral traits that improve the success and spread of SARS-CoV-2 infection. For instance, increased virus particle stability in aerosols or on surfaces could promote transmission opportunities (van Doremalen et al., 2020), and methods exist to interrogate how spontaneous mutations can improve virus stability against environmental degradation (Ogbunugafor et al., 2013). Relatedly, increased time between transmission events should select for infectious viruses with increased particle stability, although evolution of increased particle stability tends to trade off with rate of genome replication, such that more durable viruses suffer slower reproduction (Goldhill and Turner, 2014). However, it is unclear whether the stability–reproduction trade-off would reduce viral load in an infected human, as examined in influenza virus (Handel et al., 2014), a highly relevant clinical concern for COVID-19. Therefore, prolonged social distancing could select for SARS-CoV-2 variants with increased particle stability that may or may not affect viral load during infection. If genomic epidemiology studies point to virus evolution at loci that affect SARS-CoV-2 particle stability, this would provide motivation to closely study whether clinical symptoms in infected patients are changing as well.

Additional selective pressures (e.g., antiviral and immune modulating treatments and eventually vaccination) are being introduced with unknown consequences for virus evolution. Antibody responses and drug activities are dependent on specific regions of the viral proteins. For instance, the drug remdesivir requires certain regions of the viral RNA-dependent RNA polymerase and exoribonuclease (Agostini et al., 2018), and the receptor-binding domain of the spike protein is the main target for neutralizing antibodies and most vaccines in development (Premkumar et al., 2020). Mutations in these proteins could affect treatment and vaccine efficacy.

More recently, host susceptibility loci have been reported on human chromosomes 3 and 9 (Ellinghaus et al., 2020). These loci may be unevenly distributed globally, potentially selecting for new variants of SARS-CoV-2 that interact particularly well with these human genotypes. Identification of these viral changes and their importance requires linkage of patient metadata to genome sequencing, to determine response to treatments and identify instances of vaccine failure. It will be critical to the control of this pandemic to recognize both types of evolution in real time to be able to institute corrective action.

Together, these data reveal a critical need for reverse genetic strategies and well-developed models to investigate the role of these mutations in SARS-CoV-2 biology and disease. Moreover, these data reveal the critical need for linkage of patient metadata to genome sequencing, without which definitive causal epidemiological associations between genotype and phenotype cannot be determined.

Clinical Disease

There has been relatively little sequence variability among human SARS-CoV-2 isolates so far, so compelling associations between sequence variants/mutations and specific clinical outcomes or features have not yet been identified. Nonetheless, identification of virus strains with different clinical features would provide insights into disease pathogenesis and potentially identify patients requiring specific interventions. Linking virus sequence data with data on patient demographics, hospitalization, duration of hospitalization, clinical complications, intensive care unit (ICU) stay, co-infections, ventilation/duration, use of extracorporeal membrane oxygenation (MacLaren et al., 2020), duration of positive reverse transcription polymerase chain reaction tests for SARS-CoV-2 RNA, and exposure/response to experimental treatments (e.g., remdesivir, convalescent plasma, dexamethasone, immune modulators) would facilitate identification of:

- Strains/mutations associated with changes in disease severity, viral loads, and viral shedding periods
- Strains/mutations associated with more co-infections or response to certain medical interventions, such as convalescent plasma
- Strains/mutations associated with specific complications, for example, neurologic manifestations (Wood, 2020), vascular complications (Ackermann et al., 2020; Lang et al., 2020), hypercoagulable states, gastrointestinal manifestations (Ong et al., 2020), or fecal shedding
  - A question that underlies all of these organ system complications: Are there SARS-CoV-2 mutations that affect organ tropism in humans?
- Strains/mutations associated with the pediatric multisystem inflammatory syndrome (Feldstein et al., 2020)

OPPORTUNITIES TO SUPPORT DATA INTEGRATION

An essential component of this framework is the timely integration of data. Information flows comprising viral genomic data, as well as associated clinical and epidemiological aspects, will be useless unless there are proactive efforts to distill the data into the most useful components and then to organize the data into an integrated format that can be used by researchers, health care providers, public health practitioners, and policy makers. Given the potentially large scale of the universe of data that might be available, it will be important to determine what types of information are thought to be the most relevant. Given the uncertainty as to how the current or future pandemics might progress, however, it will also be important to build flexibility and expansion capability into the resultant data management system in order to accommodate additional sources and types of data.

A national system for integrating genomic, clinical, and epidemiological data collected during an infectious disease outbreak would receive large volumes of data coming in from multiple sources, including federal, state, and local public health agencies; health care networks; and public health and clinical laboratories. Currently, no central repository exists for these different types of data and the entities contributing the information do not have dedicated staff to curate the data. Efforts to build an infrastructure to facilitate integration will likely face multiple challenges related to coordination, interoperability, flexibility, and privacy. For instance, interoperability may be a challenge because incoming data will likely be shared in a range of different formats. Constraints across existing databases, such as differences in the content fields for inputting information, can preclude the ability to record and share the full range of relevant data. Laws and regulations that govern data sharing and privacy—as well as their local-level interpretations—can also potentially impact the data in variable ways, depending on the data source, relevant regulatory or legal restrictions, and concerns related to protected health information. Regulatory and governance considerations are discussed further in Chapter 5.

In terms of encouraging participation, tying participation into existing Medicare and Medicaid financial incentives, similar to the efforts of the BioSense Platform (Gould et al., 2017), can help ensure a wide group of participants, including those in ambulatory settings. This effort would link into hospital data and help answer several important clinical questions. Hospital grants could also be made available for data agreements. Given competing data reporting requirements, and the fact that a new system may create new expectations for laboratories that would normally dispose of samples, data collection must find a middle ground that offers a return on the investment. Providing real-time data pushes to participating parties, such as infection prevention programs or hospitals, is also an important consideration.

A prime opportunity to address these barriers is to develop agreed-upon standard data packages for submission to the system. Importantly, these packages should allow for some degree of variation for different sources and types of data. For example, a hospital laboratory would submit a comprehensive package of clinical and diagnostic data, while a commercial clinical laboratory would submit a larger population-based data package lacking clinical details. State and local public health agencies would likely have a variety of data types. To support this work, scoping of those data packages should be factored into any analytical plans. An actual data repository, along with the requisite support and analytical staff, also needs to be established. A key component of building this repository is to establish countrywide reporting relationships across all levels to ensure that comprehensive data are being submitted. This data repository should be flexible—for example, it should ensure ease of adding new data types and fields—as well as be accessible for advanced analytical methods, such as machine learning and artificial intelligence analyses to inform disease and epidemiology models.
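One way to picture standard data packages that still allow variation by source is a per-source list of required fields, with any additional fields accepted rather than rejected. The sketch below is only an illustration: the source types, field names, and `validate_package` function are hypothetical, and an actual standard would need to be agreed upon by the participating organizations.

```python
# Hypothetical required fields per submitting source type.
REQUIRED_FIELDS = {
    "hospital_laboratory": {
        "sequence_id", "collection_date", "collection_location",
        "disease_severity", "treatment",
    },
    "commercial_laboratory": {
        "sequence_id", "collection_date", "collection_location",
    },
}


def validate_package(source_type, package):
    """Check one submission against the required fields for its source type.

    Returns (missing, extra): required fields that are absent, and
    additional fields that are kept so new data types can be added later.
    """
    if source_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown source type: {source_type}")
    required = REQUIRED_FIELDS[source_type]
    provided = set(package)
    missing = sorted(required - provided)
    extra = sorted(provided - required)
    return missing, extra
```

Accepting extra fields instead of failing on them reflects the flexibility goal described above, at the cost of requiring later curation of fields that were not anticipated.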

INFRASTRUCTURE NEEDS

Most previous efforts to integrate genomic, clinical, and epidemiological data in response to viral or microbial outbreaks have been conducted on a small scale. To optimize the application of integrated data to inform the response to SARS-CoV-2 and future outbreaks, these efforts will need to be scaled up to nationwide infrastructure through which data can be shared and reported. A primary role of the U.S. Centers for Disease Control and Prevention (CDC) is epidemiological surveillance. The agency has links to each state-level health department as well as a global network of other national agencies, in which CDC serves as the country’s representative in international cooperation to fight emerging infectious diseases. The fields of clinical microbiology and epidemiology have now largely embraced genomic sequencing. Although there have been successful efforts in applying genomic epidemiology to influenza and outbreaks of foodborne bacteria (see the case studies in Chapter 2), CDC has lagged behind in incorporating genomics to its full potential. CDC is responsible for funding public health laboratories nationwide to facilitate the integration of data; however, most of those laboratories remain substantially under-resourced.

To enable larger-scale collaboration and coordination of data in a national system, insights can be gleaned from the innovative elements and constraints of CDC’s ongoing efforts and from other existing regional networks of data integration. CDC’s PulseNet,¹ established in 1996, allows members of the network to compare whole genome sequencing of bacterial DNA to help detect and mitigate foodborne outbreaks. The National Action Plan for Combating Antibiotic-Resistant Bacteria² (CARB), a national strategy (PCAST, 2014) to track antibiotic-resistant bacteria, led to the establishment of CDC’s Antibiotic Resistance Laboratory Network.³ The network strengthens national laboratory capacity to rapidly perform genomic epidemiological studies, as well as providing a mechanism for coordination and reporting. This served as the impetus for the coordination of all reporting across New York State that was leveraged for SARS-CoV-2.

Enclave Model in the National COVID Cohort Collaborative to Enable Linkage of Detailed Clinical Metadata

The National COVID Cohort Collaborative (N3C) embodies a massive, scalable collection of medical record data from people infected with SARS-CoV-2 in a centralized, secure enclave (see Chapter 3).⁴ N3C uses a project-specific hashed identifier constructed using data security standards to support linking data from disparate sources without revealing the personal identifiers used to generate the hashed ID (N3C, 2020). To support linking SARS-CoV-2 genome sequences to clinical metadata in N3C, viral genome sequences, or links (e.g., accession numbers) to their records in GenBank or the Global Initiative on Sharing All Influenza Data, would need to be deposited into N3C.
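One common way to build an identifier of this kind is a keyed hash over normalized personal identifiers. The sketch below assumes an HMAC-SHA-256 construction with a secret project key. It is an illustrative stand-in, not a description of the method N3C actually uses, and the function names and input fields are hypothetical.

```python
import hashlib
import hmac


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not break linkage."""
    return " ".join(value.strip().lower().split())


def linkage_token(project_key: bytes, first: str, last: str, dob: str) -> str:
    """Derive a project-specific token from identifiers without exposing them.

    The key is assumed to be held by a trusted party; without it, the token
    cannot be recomputed from guessed identifiers.
    """
    message = "|".join(normalize(v) for v in (first, last, dob)).encode("utf-8")
    return hmac.new(project_key, message, hashlib.sha256).hexdigest()


def link(clinical: dict, sequences: dict) -> dict:
    """Join clinical records and sequence accessions that share a token."""
    return {t: (clinical[t], sequences[t]) for t in clinical.keys() & sequences.keys()}
```

In this sketch, a hospital and a sequencing laboratory that hold the same project key would each compute the token locally and submit only the token with their data, so records can be joined without names or birth dates leaving either site.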

N3C is expected to contain data from 2–3 million people with confirmed SARS-CoV-2 infection by the end of 2020, and is designed with the potential to accommodate data from the U.S. population. Data in N3C are converted to the Observational Medical Outcomes Partnership (OMOP) standard (version 5.3.1, currently) after ingestion, mapping, and harmonization from multiple supported data standards. Because data accessible in the N3C are a limited dataset under terms preventing re-identification, important epidemiological activities such as contact tracing are not supported; nonetheless, inclusion of SARS-CoV-2 genomic data into N3C would represent a clinically phenotyped collection of viral genomic sequences that could scale to the U.S. population.

___________________

¹ See https://www.cdc.gov/pulsenet/index.html (accessed June 25, 2020).

² See https://aspe.hhs.gov/pdf-report/national-action-plan-combating-antibiotic-resistant-bacteria-progress-report-year-3 (accessed June 25, 2020).

³ See https://www.cdc.gov/drugresistance/solutions-initiative/ar-lab-network.html (accessed June 25, 2020).

⁴ See https://ncats.nih.gov/news/releases/2020/NIH-launches-analytics-platform-to-harness-nationwide-COVID-19-patient-data-to-speed-treatments (accessed June 25, 2020).

Using Influenza Infrastructure to Integrate SARS-CoV-2 Data

Linking genomic data for SARS-CoV-2 with clinical and epidemiological data might be possible by utilizing pre-existing systems for tracking changes in the genomic structure of the influenza virus. CDC collaborates with many partners in state, local, and territorial health departments and laboratories, offices of vital statistics, health care providers, clinics, and emergency departments to monitor influenza on an annual basis (CDC, 2020). The U.S. influenza surveillance system is designed to find out when and where influenza activity is occurring; determine what influenza viruses are circulating; detect changes in influenza viruses; and measure the impact influenza is having on outpatient illness, hospitalizations, and deaths (CDC, 2020). These goals are in line with what the committee proposes for the use of genomic data on SARS-CoV-2.

Approximately 100 public health and 300 clinical laboratories in all 50 states, Puerto Rico, Guam, and the District of Columbia participate in surveillance for influenza viruses through either the U.S. World Health Organization Collaborating Laboratories System or through the National Respiratory and Enteric Virus Surveillance System.

Data from clinical laboratories provide useful information on the timing and intensity of influenza activity from respiratory specimens largely obtained for diagnostic purposes. Public health laboratories provide data useful to understand what influenza virus types, subtypes, and lineages are circulating and the age groups being affected, as test specimens are collected primarily for the purposes of surveillance. For genetic characterization, all influenza-positive surveillance samples are submitted for genomic sequencing by CDC to determine the genetic characteristics of circulating influenza viruses and to monitor the course of evolution of viruses circulating in the population under surveillance. Phylogenetic analysis classifies virus gene segments into genetic clades or subclades. CDC also tests a sample of the influenza viruses collected by public health laboratories for susceptibility to antivirals, such as neuraminidase inhibitors, using genomic sequencing analysis and/or a functional assay.

The U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet) collects information on outpatient visits to health care providers in all 50 states, Puerto Rico, the District of Columbia, and the U.S. Virgin Islands for influenza-like illness (ILI). More than 2,500 outpatient health care providers around the country report data to CDC every week recording the total number of patients seen, including specifically the number of those patients with ILI by age group (0–4 years, 5–24 years, 25–49 years, 50–64 years, and ≥65 years).
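To make the weekly reporting structure concrete, the following minimal sketch aggregates provider reports into counts by age group and an overall percentage of visits for ILI. The record layout and function names are assumptions for illustration, and the percentage is a simple unweighted ratio; any weighting applied in official surveillance summaries is not reproduced here.

```python
from typing import Dict, List

# Age groups as listed above; "65+" stands for 65 years and older.
AGE_GROUPS = ("0-4", "5-24", "25-49", "50-64", "65+")


def age_group(age: int) -> str:
    """Map an age in years to its reporting age group."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 4:
        return "0-4"
    if age <= 24:
        return "5-24"
    if age <= 49:
        return "25-49"
    if age <= 64:
        return "50-64"
    return "65+"


def weekly_summary(reports: List[dict]) -> Dict[str, object]:
    """Combine provider reports for one week.

    Each report is assumed to look like:
    {"total_patients": int, "ili_by_age": {age_group: count}}
    """
    total_patients = sum(r["total_patients"] for r in reports)
    ili_by_age = {g: sum(r["ili_by_age"].get(g, 0) for r in reports) for g in AGE_GROUPS}
    total_ili = sum(ili_by_age.values())
    percent_ili = 100.0 * total_ili / total_patients if total_patients else 0.0
    return {
        "total_patients": total_patients,
        "ili_by_age": ili_by_age,
        "percent_ili": percent_ili,
    }
```

A comparable structure could, in principle, carry counts of sequenced SARS-CoV-2 specimens alongside the syndromic counts, which is the kind of reuse of influenza infrastructure this section describes.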

The Influenza Hospitalization Surveillance Network (FluSurv-NET) monitors laboratory-confirmed influenza-associated hospitalizations in children younger than 18 years of age (since the 2003–2004 influenza season) and adults (since the 2005–2006 influenza season). High-risk medical conditions, including cardiovascular disease, chronic lung disease, immunocompromised condition, obesity, and pregnancy status, are extracted from patient medical charts at the time of hospitalization; these conditions are similar to the underlying conditions of interest in patients with COVID-19.

Health Information Exchanges

In the wake of the Health Information Technology for Economic and Clinical Health Act of 2009⁵ and continuing financial incentives from the Centers for Medicare & Medicaid Services, there is widespread adoption of electronic medical records systems by hospitals and physician practices (CDC, 2019). In addition, health information exchanges, built largely to facilitate the exchange of digital health information for clinical treatment purposes, exist across the country; according to one survey, 7 out of 10 hospitals in the United States belong to at least one nationwide health data sharing network (Johnson et al., 2018).

In the 21st Century Cures Act, the U.S. Congress required the U.S. Department of Health and Human Services (HHS) to establish a voluntary network to facilitate nationwide digital sharing of electronic health information, and a public–private partnership has been launched, led by the Sequoia Project, to create a national health information sharing “trusted exchange framework” pursuant to a common agreement (HealthIT.gov, 2020b). In addition, on May 1, 2020, the HHS Office of the National Coordinator for Health Information Technology finalized rules to prohibit health care providers, certified electronic medical record vendors, and health information exchanges from “blocking” the sharing of information, including for public health purposes; these rules will go into effect on November 2, 2020 (HealthIT.gov, 2020a). Although this national network is still in formation, certified electronic medical record vendors and health information exchanges across the country could be leveraged today to facilitate the sharing of clinical metadata that will help public health departments and researchers answer critical questions related to SARS-CoV-2 and COVID-19. Implementation of interoperable health records systems must be cognizant of the potential for sensitive and private personal health information to be inadvertently shared en masse. This not only risks violating individuals’ rights to privacy and non-discrimination, but also undermines public trust and, as a result, the potential accuracy of the public health data gathered. Upgrading existing records systems will likely be necessary to allow for options to protect privacy, such as segmentation of data in a manner that achieves both the goal of sharing data and the goal of protecting irrelevant or sensitive individual personal health data (Rothstein and Tovino, 2019). While unrestricted access to shared data would incur privacy risks, the enclave model of N3C described above illustrates sharing of national-scale health data with strong privacy protections.

___________________

⁵ See https://www.govinfo.gov/content/pkg/PLAW-111publ5/pdf/PLAW-111publ5.pdf (accessed June 25, 2020).

Participatory Surveillance

The growing field of participatory surveillance allows individuals to report symptoms of illness through crowd-sourced, voluntary systems that allow for community-level health monitoring (Smolinski et al., 2017). A number of participatory surveillance systems already exist worldwide. Most of these systems collect epidemiological data that are provided to public health authorities and research institutions and used to analyze trends and broaden surveillance beyond the traditional, sentinel surveillance approach. For instance, participatory surveillance provides a mechanism to collect information on influenza in the community at large. Because the majority of persons with influenza each year do not seek medical care (Biggerstaff et al., 2014; Van Cauteren et al., 2012; van Noort et al., 2007), a large number of self-reporting systems collect information on ILI. Boston Children’s Hospital’s Flu Near You⁶ is a self-reported ILI system that has combined laboratory testing for diagnosis of influenza or other respiratory pathogens with the epidemiological data collected by the open-source GoViral Study (Li, 2016). Individuals who report symptoms of illness compatible with influenza are provided with a home test kit and asked to collect a sputum sample for testing; the results of the test are then compared to the symptom data shared in the open-source system. Such participatory surveillance systems could potentially be expanded through the use of home test kits that would allow for genomic sequencing of pathogens in the community.

Numerous community-based surveillance systems have arisen during the COVID-19 pandemic. For example, the Flu Near You system was adapted with expanded symptoms related to SARS-CoV-2 infection into the COVID Near You⁷ system. Other systems have arisen as longitudinal research projects that are collecting epidemiological data along with testing results for COVID-19. All of these systems offer opportunities to incorporate genomic sequencing, which could target data collection from specific subsets of the population or from people in specific geographic regions. Genomic sequencing could be added to existing systems for COVID-19 symptom reporting, tracking, and contact tracing in the United States.

___________________

⁶ See https://flunearyou.org (accessed June 25, 2020).

⁷ See https://www.covidnearyou.org/us/en-US (accessed June 25, 2020).

PARTNERSHIPS, COORDINATION, AND CAPACITY CONSIDERATIONS

Fostering partnerships across laboratories in different sectors and at different levels—from state and local public health to clinical, academic, and commercial laboratories—will be critical for developing capacity and facilitating coordination necessary for national-level genomic data that can be integrated with clinical and epidemiological data. These efforts should seek to partner with a range of different laboratories to better represent the entire population. CDC’s SARS-CoV-2 Sequencing for Public Health Emergency Response, Epidemiology, and Surveillance will likely cover a large proportion of the population, but better coverage could be achieved by partnering with the third-party private laboratories that are often contracted with health care systems (e.g., LabCorp or Quest Diagnostics). Hospital and clinical laboratories are also valuable sources of information, especially in rural or critical access areas. Tests are going unused in many of these settings, where facilities often lack laboratory capacity and local health systems face multiple barriers to utilizing their samples (Maxmen, 2020). In such settings, partnerships can also provide access to important data on hard-to-reach patient populations. These same partnerships should exist with academic and commercial laboratories, albeit with an awareness of capacity considerations.

Data usability, capacity considerations, and key outputs (including the content and periodicity of reports) are important aspects of coordinating partnerships, particularly for smaller hospital and public health laboratories operating with limited resources. For example, close support, coordination, and capacity building are all valuable for mitigating the stress experienced by laboratories in the context of an infectious disease outbreak. Many of these laboratories may have valuable data on vulnerable patient populations that are not being reported into broader databases or larger systems due to a range of barriers, such as bureaucratic red tape, dependency on limited resources, and often outdated tools for communication and data sharing (e.g., faxing). Tying data collection and integration into hospitals’ meaningful use standards could be a beneficial approach for biosurveillance. Ultimately, however, the utilization of a low-cost approach that mitigates such barriers to collecting, analyzing, and sharing data is critical. Systems should be in place that enable hospitals to seamlessly push data to their key stakeholders, such as public health agencies, infection prevention programs, and clinical partners. Furthermore, partnerships can help to ensure these data are presented in a way that is beneficial to the end user. For instance, genomic data are of great value for many purposes, but clinical and epidemiological data may be more directly relevant and applicable to patient care and public health outcomes.

Workforce Capacity Development for Genomic Epidemiology

An important consideration in the development of such a system is the building of genome sequencing and analysis capabilities within public health agencies and health care systems. Even with advancing technology, multidisciplinary and highly trained teams remain the most valuable asset for combining genomic, clinical, and epidemiological data into actionable knowledge (Lesho et al., 2016). Since 2016, the Broad Institute has partnered with the Massachusetts State Public Health Laboratory and CDC to build distributed capacity for genomic sequencing through a train-the-trainer program for regional- and state-level laboratory personnel. This program, for example, could serve as a model for developing national coordination among state public health laboratories.

CONCLUDING REMARKS

There remains no central repository to house the large volume of individually identifiable data from the various actors involved in the public health response to SARS-CoV-2 in the United States, just as none existed for prior infectious disease outbreaks and none is in place for future ones. While it is important to learn from the smaller-scale examples described in this report, building out their successful elements to the national scale remains a major challenge. The committee recognizes that advancing beyond the current small-scale efforts to a national or even global repository is a challenging undertaking, but the current pandemic puts the lack of such a system in stark relief. Incremental efforts, such as establishing regional repositories, can be taken now and leveraged in the future for a large-scale effort. As noted above, leveraging and expanding existing infrastructure and planning, through programs such as N3C, PulseNet, CARB, ILINet, and health information exchanges, will be crucial to addressing the data infrastructure challenge in a way that is both innovative and iterative. The creation of a system of data infrastructure built on a standard data package could cultivate a more interoperable data environment, a challenge of paramount importance when principles such as flexibility and privacy remain priorities. Ultimately, a data management and infrastructure system with investment in the proper resources, staff, and storage will be critical for the coordination of data needs in response to SARS-CoV-2 and future outbreak responses.

RECOMMENDATION 2. The U.S. Department of Health and Human Services should develop and invest in a national data infrastructure system that constructively builds on existing programmatic infrastructure with the ability to accurately, efficiently, and safely link genomic data, clinical data, epidemiological data, and other relevant data across multiple sources critical to a public health response such as the current SARS-CoV-2 outbreak. Such a system should:

Allow for the linkage of genomic data, clinical data, epidemiological data, and other relevant data in a way that is not overly burdensome to laboratories that collect data regularly.
Create and foster safe data-sharing practices to ensure that individuals’ personal identifying information remains unexposed when data are being used and shared across the system.
Be grounded in the pursuit of standardization, interoperability, flexibility, and the practical linkage of data, including consideration of a potential national patient identifier.
Consider not only the data required to create such a system, but also investment in mechanisms supporting the collection and analysis of such data, including promoting formal education in “data wrangling” at the intersection of data science and infectious disease epidemiology.
Conduct annual reviews, including scenario-based simulations, to identify capacity gaps; promote process improvement (for example, work building on the existing U.S. infrastructure for assessing the annual risk of seasonal influenza could improve the usability and coverage of health information exchanges and other initiatives); and ensure inclusion of entities with supporting functions across scales, including private health care systems that provide data or state and local public health laboratories that collect data, in ongoing system development and evaluation.

REFERENCES

Ackermann, M., S. E. Verleden, M. Kuehnel, A. Haverich, T. Welte, F. Laenger, A. Vanstapel, C. Werlein, H. Stark, A. Tzankov, W. W. Li, V. W. Li, S. J. Mentzer, and D. Jonigk. 2020. Pulmonary vascular endothelialitis, thrombosis, and angiogenesis in COVID-19. New England Journal of Medicine 383:120–128.

Agostini, M. L., E. L. Andres, A. C. Sims, R. L. Graham, T. P. Sheahan, X. Lu, E. C. Smith, J. B. Case, J. Y. Feng, R. Jordan, A. S. Ray, T. Cihlar, D. Siegel, R. L. Mackman, M. O. Clarke, R. S. Baric, and M. R. Denison. 2018. Coronavirus susceptibility to the antiviral remdesivir (GS-5734) is mediated by the viral polymerase and the proofreading exoribonuclease. mBio 9(2):e00221-18.

Biggerstaff, M., M. A. Jhung, C. Reed, A. M. Fry, L. Balluz, and L. Finelli. 2014. Influenza-like illness, the time to seek healthcare, and influenza antiviral receipt during the 2010-2011 influenza season-United States. The Journal of Infectious Diseases 210(4):535–544.

CDC (U.S. Centers for Disease Control and Prevention). 2019. Public health and promoting interoperability programs: Introduction. https://www.cdc.gov/ehrmeaningfuluse/introduction.html (accessed July 7, 2020).

CDC. 2020. U.S. influenza surveillance system: Purpose and methods. https://www.cdc.gov/flu/weekly/overview.htm (accessed June 24, 2020).

Chang, T. J., D. M. Yang, M. L. Wang, K. H. Liang, P. H. Tsai, S. H. Chiou, T. H. Lin, and C. T. Wang. 2020. Genomic analysis and comparative multiple sequences of SARS-CoV-2. Journal of the Chinese Medical Association 83(6):537–543.

Duffy, S., L. A. Shackelton, and E. C. Holmes. 2008. Rates of evolutionary change in viruses: Patterns and determinants. Nature Reviews Genetics 9(4):267–276.

Elena, S. F., and R. Sanjuán. 2007. Virus evolution: Insights from an experimental approach. Annual Review of Ecology, Evolution, and Systematics 38(1):27–52.

Ellinghaus, D., F. Degenhardt, L. Bujanda, M. Buti, A. Albillos, P. Invernizzi, J. Fernández, D. Prati, G. Baselli, R. Asselta, M. M. Grimsrud, C. Milani, F. Aziz, J. Kässens, S. May, M. Wendorff, L. Wienbrandt, F. Uellendahl-Werth, T. Zheng, X. Yi, R. de Pablo, A. G. Chercoles, A. Palom, A.-E. Garcia-Fernandez, F. Rodriguez-Frias, A. Zanella, A. Bandera, A. Protti, A. Aghemo, A. Lleo, A. Biondi, A. Caballero-Garralda, A. Gori, A. Tanck, A. Carreras Nolla, A. Latiano, A. L. Fracanzani, A. Peschuck, A. Julià, A. Pesenti, A. Voza, D. Jiménez, B. Mateos, B. Nafria Jimenez, C. Quereda, C. Paccapelo, C. Gassner, C. Angelini, C. Cea, A. Solier, D. Pestaña, E. Muñiz-Diaz, E. Sandoval, E. M. Paraboschi, E. Navas, F. García Sánchez, F. Ceriotti, F. Martinelli-Boneschi, F. Peyvandi, F. Blasi, L. Téllez, A. Blanco-Grau, G. Hemmrich-Stanisak, G. Grasselli, G. Costantino, G. Cardamone, G. Foti, S. Aneli, H. Kurihara, H. ElAbd, I. My, I. Galván-Femenia, J. Martín, J. Erdmann, J. Ferrusquía-Acosta, K. Garcia-Etxebarria, L. Izquierdo-Sanchez, L. R. Bettini, L. Sumoy, L. Terranova, L. Moreira, L. Santoro, L. Scudeller, F. Mesonero, L. Roade, M. C. Rühlemann, M. Schaefer, M. Carrabba, M. Riveiro-Barciela, M. E. Figuera Basso, M. G. Valsecchi, M. Hernandez-Tejero, M. Acosta-Herrera, M. D’Angiò, M. Baldini, M. Cazzaniga, M. Schulzky, M. Cecconi, M. Wittig, M. Ciccarelli, M. Rodríguez-Gandía, M. Bocciolone, M. Miozzo, N. Montano, N. Braun, N. Sacchi, N. Martínez, O. Özer, O. Palmieri, P. Faverio, P. Preatoni, P. Bonfanti, P. Omodei, P. Tentorio, P. Castro, P. M. Rodrigues, A. Blandino Ortiz, R. de Cid, R. Ferrer, R. Gualtierotti, R. Nieto, S. Goerg, S. Badalamenti, S. Marsal, G. Matullo, S. Pelusi, S. Juzenas, S. Aliberti, V. Monzani, V. Moreno, T. Wesse, T. L. Lenz, T. Pumarola, V. Rimoldi, S. Bosari, W. Albrecht, W. Peter, M. Romero-Gómez, M. D’Amato, S. Duga, J. M. Banales, J. R. Hov, T. Folseraas, L. Valenti, A. Franke, and T. H. Karlsen. 2020. Genomewide association study of severe COVID-19 with respiratory failure. New England Journal of Medicine. doi: 10.1056/NEJMoa2020283.

Feaster, M., and Y.-Y. Goh. 2020. High proportion of asymptomatic SARS-CoV-2 infections in 9 long-term care facilities, Pasadena, California, USA, April 2020. Emerging Infectious Diseases 26(10).

Feldstein, L. R., E. B. Rose, S. M. Horwitz, J. P. Collins, M. M. Newhams, M. B. F. Son, J. W. Newburger, L. C. Kleinman, S. M. Heidemann, A. A. Martin, A. R. Singh, S. Li, K. M. Tarquinio, P. Jaggi, M. E. Oster, S. P. Zackai, J. Gillen, A. J. Ratner, R. F. Walsh, J. C. Fitzgerald, M. A. Keenaghan, H. Alharash, S. Doymaz, K. N. Clouser, J. S. Giuliano, Jr., A. Gupta, R. M. Parker, A. B. Maddux, V. Havalad, S. Ramsingh, H. Bukulmez, T. T. Bradford, L. S. Smith, M. W. Tenforde, C. L. Carroll, B. J. Riggs, S. J. Gertz, A. Daube, A. Lansell, A. Coronado Munoz, C. V. Hobbs, K. L. Marohn, N. B. Halasa, M. M. Patel, and A. G. Randolph. 2020. Multisystem inflammatory syndrome in U.S. children and adolescents. New England Journal of Medicine 383:334–346.

Geoghegan, J. L., and E. C. Holmes. 2018. The phylogenomics of evolving virus virulence. Nature Reviews Genetics 19(12):756–769.

Goldhill, D. H., and P. E. Turner. 2014. The evolution of life history trade-offs in viruses. Current Opinion in Virology 8:79–84.

Gonzalez-Reiche, A. S., M. M. Hernandez, M. J. Sullivan, B. Ciferri, H. Alshammary, A. Obla, S. Fabre, G. Kleiner, J. Polanco, Z. Khan, B. Alburquerque, A. van de Guchte, J. Dutta, N. Francoeur, B. S. Melo, I. Oussenko, G. Deikus, J. Soto, S. H. Sridhar, Y.-C. Wang, K. Twyman, A. Kasarskis, D. R. Altman, M. Smith, R. Sebra, J. Aberg, F. Krammer, A. García-Sastre, M. Luksza, G. Patel, A. Paniz-Mondolfi, M. Gitman, E. M. Sordillo, V. Simon, and H. van Bakel. 2020. Introductions and early spread of SARS-CoV-2 in the New York City area. Science 369(6501):297–301.

Gould, D. W., D. Walker, and P. W. Yoon. 2017. The evolution of BioSense: Lessons learned and future directions. Public Health Reports (Washington, DC: 1974) 132(1 Suppl):7S–11S.

Grubaugh, N. D., W. P. Hanage, and A. L. Rasmussen. 2020. Making sense of mutation: What D614G means for the COVID-19 pandemic remains unclear. Cell 182(4):794–795.

Handel, A., C. Lebarbenchon, D. Stallknecht, and P. Rohani. 2014. Trade-offs between and within scales: Environmental persistence and within-host fitness of avian influenza viruses. Proceedings of the Royal Society B: Biological Sciences 281(1787).

Harvala, H., Å. Wiman, A. Wallensten, K. Zakikhany, H. Englund, and M. Brytting. 2015. Role of sequencing the measles virus hemagglutinin gene and hypervariable region in the measles outbreak investigations in Sweden during 2013–2014. The Journal of Infectious Diseases 213(4):592–599.

HealthIT.gov. 2020a. Information blocking. https://www.healthit.gov/topic/information-blocking (accessed July 7, 2020).

HealthIT.gov. 2020b. Trusted exchange framework and common agreement. https://www.healthit.gov/topic/interoperability/trusted-exchange-framework-and-common-agreement (accessed July 7, 2020).

Holmes, E. C., G. Dudas, A. Rambaut, and K. G. Andersen. 2016. The evolution of Ebola virus: Insights from the 2013–2016 epidemic. Nature 538(7624):193–200.

Hu, B., L.-P. Zeng, X.-L. Yang, X.-Y. Ge, W. Zhang, B. Li, J.-Z. Xie, X.-R. Shen, Y.-Z. Zhang, N. Wang, D.-S. Luo, X.-S. Zheng, M.-N. Wang, P. Daszak, L.-F. Wang, J. Cui, and Z.-L. Shi. 2017. Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus. PLOS Pathogens 13(11):e1006698.

Hu, J., C.-L. He, Q.-Z. Gao, G.-J. Zhang, X.-X. Cao, Q.-X. Long, H.-J. Deng, L.-Y. Huang, J. Chen, K. Wang, N. Tang, and A.-L. Huang. 2020. The D614G mutation of SARS-CoV-2 spike protein enhances viral infectivity and decreases neutralization sensitivity to individual convalescent sera. bioRxiv. https://doi.org/10.1101/2020.06.20.161323.

Johnson, C., Y. Pylypchuk, and V. Patel. 2018. Methods used to enable interoperability among U.S. non-federal acute care hospitals in 2017. The Office of the National Coordinator for Health Information Technology.

Korber, B., W. M. Fischer, S. Gnanakaran, H. Yoon, J. Theiler, W. Abfalterer, N. Hengartner, E. E. Giorgi, T. Bhattacharya, B. Foley, K. M. Hastie, M. D. Parker, D. G. Partridge, C. M. Evans, T. M. Freeman, T. I. de Silva, C. McDanal, L. G. Perez, H. Tang, A. Moon-Walker, S. P. Whelan, C. C. LaBranche, E. O. Saphire, D. C. Montefiori, A. Angyal, R. L. Brown, L. Carrilero, L. R. Green, D. C. Groves, K. J. Johnson, A. J. Keeley, B. B. Lindsey, P. J. Parsons, M. Raza, S. Rowland-Jones, N. Smith, R. M. Tucker, D. Wang, and M. D. Wyles. 2020. Tracking changes in SARS-CoV-2 spike: Evidence that D614G increases infectivity of the COVID-19 virus. Cell 182(4):812–827.

Lang, M., A. Som, D. P. Mendoza, E. J. Flores, N. Reid, D. Carey, M. D. Li, A. Witkin, J. M. Rodriguez-Lopez, J. O. Shepard, and B. P. Little. 2020. Hypoxaemia related to COVID-19: Vascular and perfusion abnormalities on dual-energy CT. The Lancet Infectious Diseases. https://doi.org/10.1016/S1473-3099(20)30367-4.

Lesho, E., R. Clifford, F. Onmus-Leone, L. Appalla, E. Snesrud, Y. Kwak, A. Ong, R. Maybank, P. Waterman, P. Rohrbeck, M. Julius, A. Roth, J. Martinez, L. Nielsen, E. Steele, P. McGann, and M. Hinkle. 2016. The challenges of implementing next generation sequencing across a large healthcare system, and the molecular epidemiology and antibiotic susceptibilities of carbapenemase-producing bacteria in the healthcare system of the U.S. Department of Defense. PLOS ONE 11(5):e0155770.

Li, K. 2016. Dr. Rumi Chunara and Sofia Ahsanuddin: The GoViral Study. https://www.ghjournal.org/the-goviral-study (accessed July 6, 2020).

Liu, Y., Z. Ning, Y. Chen, M. Guo, Y. Liu, N. K. Gali, L. Sun, Y. Duan, J. Cai, D. Westerdahl, X. Liu, K. Xu, K.-f. Ho, H. Kan, Q. Fu, and K. Lan. 2020. Aerodynamic analysis of SARS-CoV-2 in two Wuhan hospitals. Nature 582(7813):557–560.

Liya, G., W. Yuguang, L. Jian, Y. Huaiping, H. Xue, H. Jianwei, M. Jiaju, L. Youran, M. Chen, and J. Yiqing. 2020. Studies on viral pneumonia related to novel coronavirus SARS-CoV-2, SARS-CoV, and MERS-CoV: A literature review. APMIS 128(6):423–432.

MacLaren, G., D. Fisher, and D. Brodie. 2020. Preparing for the most critically ill patients with COVID-19: The potential role of extracorporeal membrane oxygenation. JAMA 323(13):1245–1246.

Maxmen, A. 2020. Thousands of coronavirus tests are going unused in US labs. Nature 580(7803):312–313.

Messenger, S. L., I. J. Molineux, and J. J. Bull. 1999. Virulence evolution in a virus obeys a trade-off. Proceedings of the Royal Society B: Biological Sciences 266(1417):397–404.

N3C (National COVID Cohort Collaborative). 2020. National COVID Cohort Collaborative (N3C): A national resource for shared analytics. https://ncats.nih.gov/n3c/about (accessed August 20, 2020).

Ogbunugafor, C., B. W. Alto, T. M. Overton, A. Bhushan, N. M. Morales, and P. E. Turner. 2013. Evolution of increased survival in RNA viruses specialized on cancer-derived cells. The American Naturalist 181(5):585–595.

Ong, J., B. E. Young, and S. Ong. 2020. COVID-19 in gastroenterology: A clinical perspective. Gut 69(6):1144–1145.

Orr, H. A. 2000. The rate of adaptation in asexuals. Genetics 155(2):961–968.

PCAST (President’s Council of Advisors on Science and Technology). 2014. Report to the President on combating antibiotic resistance. Washington, DC.

Penedos, A. R., R. Myers, B. Hadef, F. Aladin, and K. E. Brown. 2015. Assessment of the utility of whole genome sequencing of measles virus in the characterisation of outbreaks. PLOS ONE 10(11):e0143081.

Premkumar, L., B. Segovia-Chumbez, R. Jadi, D. R. Martinez, R. Raut, A. J. Markmann, C. Cornaby, L. Bartelt, S. Weiss, Y. Park, C. E. Edwards, E. Weimer, E. M. Scherer, N. Rouphael, S. Edupuganti, D. Weiskopf, L. V. Tse, Y. J. Hou, D. Margolis, A. Sette, M. H. Collins, J. Schmitz, R. S. Baric, and A. M. de Silva. 2020. The receptor-binding domain of the viral spike protein is an immunodominant and highly specific target of antibodies in SARS-CoV-2 patients. Science Immunology 5(48):eabc8413.

Rothstein, M., and S. Tovino. 2019. Privacy risks of interoperable health records: Segmentation of sensitive information will help. Journal of Law, Medicine & Ethics 47:771–777.

Smolinski, M. S., A. W. Crawley, J. M. Olsen, T. Jayaraman, and M. Libel. 2017. Participatory disease surveillance: Engaging communities directly in reporting, monitoring, and responding to health threats. JMIR Public Health and Surveillance 3(4):e62.

Van Cauteren, D., S. Vaux, H. de Valk, Y. Le Strat, V. Vaillant, and D. Lévy-Bruhl. 2012. Burden of influenza, healthcare seeking behaviour and hygiene measures during the A(H1N1)2009 pandemic in France: A population based study. BMC Public Health 12(1):947.

van Doremalen, N., T. Bushmaker, D. H. Morris, M. G. Holbrook, A. Gamble, B. N. Williamson, A. Tamin, J. L. Harcourt, N. J. Thornburg, S. I. Gerber, J. O. Lloyd-Smith, E. de Wit, and V. J. Munster. 2020. Aerosol and surface stability of SARS-CoV-2 as compared with SARS-CoV-1. New England Journal of Medicine 382(16):1564–1567.

van Noort, S. P., M. Muehlen, H. Rebelo de Andrade, C. Koppeschaar, J. M. Lima Lourenço, and M. G. Gomes. 2007. GripeNet: An Internet-based system to monitor influenza-like illness uniformly across Europe. Eurosurveillance 12(7):5–6.

Wölfel, R., V. M. Corman, W. Guggemos, M. Seilmaier, S. Zange, M. A. Müller, D. Niemeyer, T. C. Jones, P. Vollmar, C. Rothe, M. Hoelscher, T. Bleicker, S. Brünink, J. Schneider, R. Ehmann, K. Zwirglmaier, C. Drosten, and C. Wendtner. 2020. Virological assessment of hospitalized patients with COVID-2019. Nature 581(7809):465–469.

Wood, H. 2020. New insights into the neurological effects of COVID-19. Nature Reviews Neurology 16(8):403.

Xu, Y., X. Li, B. Zhu, H. Liang, C. Fang, Y. Gong, Q. Guo, X. Sun, D. Zhao, J. Shen, H. Zhang, H. Liu, H. Xia, J. Tang, K. Zhang, and S. Gong. 2020. Characteristics of pediatric SARS-CoV-2 infection and potential evidence for persistent fecal viral shedding. Nature Medicine 26(4):502–505.

Zhang, L., C. B. Jackson, H. Mou, A. Ojha, E. S. Rangarajan, T. Izard, M. Farzan, and H. Choe. 2020. The D614G mutation in the SARS-CoV-2 spike protein reduces S1 shedding and increases infectivity. bioRxiv. https://doi.org/10.1101/2020.06.12.148726.