by sampling trees (BEAST), which is a software architecture for analyzing molecular sequences related by an evolutionary tree, and PhyloSift.

Phylogenetics, which infers evolutionary relationships from genetic data, can be used to determine what may be present in a sample. The more genetic data that are available, the finer the resolution at which the microbe(s) present can be identified. With limited information, such as a single marker, phylum-level classification may be the best that can be achieved, which is likely to be of little use for attribution or for determining the presence of, for example, a Select Agent. With a whole genome or a number of genes, deeper classification may be feasible. Bayesian hypothesis testing can be applied to any gene or genetic marker phylogeny, including toxin genes of interest. Metagenome-derived phylogenetic data can also be used to answer the question, “What is the closest match between a sample of interest and a database of related samples?” This can be accomplished using phylogenetic placement data and the “phylogenetic Kantorovich–Rubinstein distance” (Evans and Matsen, 2012).
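To make the closest-match question concrete, the following is a minimal sketch of a Kantorovich–Rubinstein (earth mover's) distance between two samples placed on a shared reference tree. It is an illustration under simplifying assumptions, not the implementation used by any particular tool: placement mass is assumed to sit on tree nodes rather than along branches, and the tree, the sample names, and the functions `kr_distance` and `closest_match` are hypothetical. Under those assumptions, the distance reduces to summing, over every branch, the branch length times the absolute difference in the mass each sample places below that branch.

```python
from collections import defaultdict


def kr_distance(parent, branch_length, p, q):
    """Earth mover's distance between two mass distributions on a rooted tree.

    parent: dict mapping each node to its parent (the root maps to None)
    branch_length: dict mapping each non-root node to the length of the
        branch joining it to its parent
    p, q: dicts mapping node -> placement mass (each assumed to sum to 1)
    """
    children = defaultdict(list)
    root = None
    for node, par in parent.items():
        if par is None:
            root = node
        else:
            children[par].append(node)

    # Pre-order traversal; reversing it visits descendants before ancestors.
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children[node])

    # diff[node] accumulates (mass of p) - (mass of q) in the subtree at node.
    diff = {node: p.get(node, 0.0) - q.get(node, 0.0) for node in parent}
    total = 0.0
    for node in reversed(order):
        par = parent[node]
        if par is not None:
            # Net mass that must cross this branch, weighted by its length.
            total += branch_length[node] * abs(diff[node])
            diff[par] += diff[node]
    return total


def closest_match(parent, branch_length, query, database):
    """Return the name of the database sample nearest to the query."""
    return min(
        database,
        key=lambda name: kr_distance(parent, branch_length, query, database[name]),
    )


# Hypothetical reference tree: R is the root; A and B are its children;
# A1 and A2 are children of A.
PARENT = {"R": None, "A": "R", "B": "R", "A1": "A", "A2": "A"}
BRANCH_LENGTH = {"A": 1.0, "B": 2.0, "A1": 0.5, "A2": 0.5}

# Hypothetical placement masses for a query sample and a small database.
QUERY = {"A1": 1.0}
DATABASE = {
    "sample_x": {"A2": 1.0},
    "sample_y": {"B": 1.0},
    "sample_z": {"A1": 0.5, "B": 0.5},
}

if __name__ == "__main__":
    for name, masses in DATABASE.items():
        print(name, kr_distance(PARENT, BRANCH_LENGTH, QUERY, masses))
    print("closest:", closest_match(PARENT, BRANCH_LENGTH, QUERY, DATABASE))
```

In this toy example the query is nearest to the sample whose mass sits on the sister tip, because moving mass between sister tips crosses only two short branches. Real placement data spread mass along branches and across many reads, which this sketch deliberately ignores.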

However, these methods can be applied only to a single gene at a time or to a set of phylogenetically coherent genes. The capability to analyze all genes simultaneously is needed. Given current technology, not all of the information contained in whole genomes will be captured. The analytical challenge corresponds to reconstructing the coevolutionary histories of all genes within a phylogenetic framework. Evolution is complex and reticulate. Within a species tree, there are phylogenies of individual genes, which reflect gene duplication, gene conversion, lateral gene transfer, and gene death. The target areas that define a species or strain and distinguish it from near neighbors must be well defined. One or more sites may be required, depending on the phylogenetic resolution needed. Efforts are under way using informatics and phylogenetic reconstruction to meet these challenges (Bérard et al., 2012; Boussau et al., 2013). Scaling these methods, however, presents technical, cost, and training challenges and is likely to remain difficult.


Even if there is a major effort to characterize microbial diversity comprehensively through nucleic acid sequencing, the data collected will not provide optimal benefit for either public health or microbial forensics unless they are shared in a way that gives scientists around the world access to them. During the discussion following the workshop session on microbial ecology and diversity, workshop participants pointed out a variety of problems involved with data sharing.
