BOX 3-1

Organizing Metagenomic Sequence Data

Clustering: An approach to data analysis in which a large dataset is divided into distinct subsets based on some specific measure. In analyzing DNA or protein sequences, clustering is used to identify groups of sequences that share an evolutionary origin (families) but can also identify larger sets, such as genomes (see binning). Genome annotation can be viewed as a form of clustering in which individual genes are assigned to well-characterized (or at least previously known) gene families. In metagenomics, direct clustering of DNA sequences is likely to remain a primary annotation method, as most of these sequences will not be easily assigned to any known gene family. In direct clustering, the nucleotide (or predicted protein) sequence itself is the basis for grouping the sequences.
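As an illustration of direct clustering, the following is a minimal sketch in Python. It is not a method described in this report: it assumes shared 3-letter subsequences (k-mers) as a crude stand-in for sequence similarity, and the function names, the Jaccard threshold of 0.5, and the toy sequences are all hypothetical choices made for the example.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-letter substrings in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items / all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(seqs, threshold=0.5, k=3):
    """Assign each sequence to the first cluster whose representative
    (the first sequence placed in that cluster) is similar enough;
    otherwise start a new cluster. Threshold and k are assumptions."""
    clusters = []  # list of (representative k-mer set, member names)
    for name, seq in seqs.items():
        profile = kmers(seq, k)
        for representative, members in clusters:
            if jaccard(profile, representative) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((profile, [name]))
    return [members for _, members in clusters]


# Toy sequences (invented): s1 and s2 differ by one base, s3 is unrelated.
seqs = {
    "s1": "ATGGCGTACGTTAGC",
    "s2": "ATGGCGTACGTTAGG",
    "s3": "TTTTCCCCAAAAGGG",
}
clusters = greedy_cluster(seqs)
print(clusters)
```

Here the grouping depends only on the sequences themselves, with no reference to known gene families. Practical tools typically rely on alignment-based similarity rather than raw k-mer overlap.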

Binning: A clustering method that uses composition and/or other characteristics of DNA contigs (contiguous sequences assembled from overlapping individual reads) to divide them into groups (clusters) that belong to specific genomes or groups of genomes. Examples of characteristics that can be used for binning are GC content and codon use. In metagenomic projects in which genome assembly is a goal, binning is used as a preliminary step.
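Binning on a single compositional characteristic can be sketched as follows. This is a simplified illustration rather than a production binning method: it uses GC content alone, the bin width of 0.10 is an arbitrary assumption, and the contig names and sequences are invented for the example.

```python
from collections import defaultdict


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def bin_by_gc(contigs, width=0.10):
    """Group contig names into equal-width GC-content bins.
    Bin i covers GC fractions from i*width up to (i+1)*width."""
    n_bins = round(1 / width)
    bins = defaultdict(list)
    for name, seq in contigs.items():
        # Clamp so that a GC fraction of exactly 1.0 falls in the last bin.
        index = min(int(gc_content(seq) * n_bins), n_bins - 1)
        bins[index].append(name)
    return dict(bins)


# Invented contigs: two AT-rich, one GC-rich.
contigs = {
    "contig_a": "ATATATTAAATTTATA",
    "contig_b": "GCGCGGCCGCGCGGCA",
    "contig_c": "ATTTAAATGATATTAA",
}
bins = bin_by_gc(contigs)
print(bins)
```

The two AT-rich contigs land in the lowest bin and the GC-rich contig in the highest, suggesting they derive from genomes of different base composition. Real contigs are far longer, and several characteristics (for example codon use) would normally be combined.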

Gene annotation: A process of classifying predicted genes into known and well-characterized gene families. In metagenomics, where a substantial percentage of sequences cannot be easily classified, annotations often remain at the preliminary stage of clustering the sequences into groups (families) that are otherwise uncharacterized.

Gene prediction: A process of analyzing genomic DNA sequences to predict which regions encode biological functions, such as coding for proteins, structural and regulatory RNAs, and other regulatory elements. Gene prediction is important for determining the functional repertoire of a microbial community and for comparing the capabilities of different communities.
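The simplest form of protein-coding gene prediction is a scan for open reading frames (ORFs). The sketch below is a deliberately naive illustration, not the approach of any particular gene-finding program: it assumes the standard start codon ATG and stop codons TAA, TAG, and TGA, searches only the forward strand, and uses a hypothetical minimum length of three codons. The function name and test sequence are invented for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) positions of forward-strand ORFs.
    An ORF runs from an ATG to the next in-frame stop codon;
    end is exclusive and includes the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Invented sequence containing one short ORF in the third reading frame.
dna = "CCATGAAATTTGGGTAACC"
orfs = find_orfs(dna)
for start, end in orfs:
    print(start, end, dna[start:end])
```

A realistic predictor would also scan the reverse complement, handle alternative start codons, and score candidates statistically. RNA genes and regulatory elements require different methods altogether.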

population, and metabolic processes that affect methane generation in the permafrost as it experiences global warming—is in principle not different from monitoring such changes in a culture of saccharomycetes as it adapts to a new substrate, in a fruit fly embryo as it develops, or in a human tumor as it progresses. Structural genomics—the systematic expression and structural characterization of the products of all the uncharacterized genes in a genome—will also be a boon; so far, this approach has been applied in the organismal context, but all the highly expressed but unidentified genes in a community metagenome would be an ideal target.

New concepts and methods will be developed for metagenomics that will expand the general genomic repertoire. Metagenomics captures microdiversity, or variation among strains of the same species, thereby producing

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.