IDR Team Summary 6
How can genomics be leveraged to develop coherent approaches for rapidly exploring the biochemical diversity in and engineering of non-model organisms?
The spectrum of biological organisms on Earth provides an extraordinary repertoire of biochemical synthetic and signal-processing systems that can be borrowed intact or modified to accomplish synthetic biological goals. Such efforts depend on a detailed understanding of the reactions that an organism carries out as well as the molecular players (e.g., proteins and metabolites) responsible for conducting these reactions. For compelling technical reasons, most molecular dissection of biological systems has focused on a bedrock group of five model organisms: fruit flies, baker’s yeast, roundworms, E. coli, and mice. The vast majority of breakthroughs in modern biology have come from work on these systems. A few other organisms, such as Arabidopsis (mustard weed), zebrafish, and the frog Xenopus laevis, are also widely studied. However, it requires a huge investment of time and resources to turn a wild organism into an experimentally tractable system, so researchers naturally try to get the most mileage out of the model organisms we already have. While understandable from a practical point of view, this focus comes at an enormous cost, as many of the most desirable reactions are not found in the common model organisms. For example, none of the “big five” are able to directly harness energy from light through photosynthesis. Yet photosynthesis is the keystone of biofuels efforts.
The emerging field of metagenomics promises to help overcome this limitation and allow us to better exploit the full biological diversity of the world we live in. Metagenomics takes advantage of the revolution in DNA sequencing technologies to define genetic material recovered directly from environmental samples. Traditional microbiology studies cultivated clonal cultures. Metagenomics, in contrast, enables studies of organisms that are not easily cultured in a laboratory as well as studies of organisms in their natural environment. One of the first results to come from metagenomics was the realization that species identification efforts based on organisms that can be cultured had vastly underestimated the true level of biodiversity. While this conclusion is well accepted, identifying and exploiting the mass of information obtainable from these new life forms represents a major challenge and one that we are only now beginning to address.
Automated DNA synthesis has rapidly improved in fidelity, length, speed and cost. This enables the nucleotide information from sequencing and metagenomic efforts to be converted into a physical DNA sequence without the exchange of genetic or cellular material. So-called synthetic metagenomics refers to mining of databases for functional sequences, the “printing” of this information, and screening for function. This methodology will revolutionize enzyme/pathway/genetic circuit discovery, sequence-function mapping, and annotation of sequences. Novel bioinformatic methods will be needed to identify genes to be synthesized and to analyze the functional information.
A number of applications could require the forward programming of meta communities. Understanding the natural language and metabolic interdependencies of natural communities will aid in this process. Natural systems will yield more quorum sensing circuits that enable multiple channels by which cells can be programmed to communicate. Understanding the metabolic origins of symbiosis will enable multiple cells to be programmed to interact in a fermenter to achieve stable populations and predictable product titers.
How do we identify environmental sources for metagenomics analyses that are most likely to contain organisms capable of novel biosynthetic strategies that will be of immediate value to synthetic biology efforts?
How do we identify novel synthetic and signal transduction pathways from genomic information alone even when we are not able to culture a given organism? Possible approaches include comparative genomics, analysis of the environmental conditions in which organisms are found, and metabolomics on polycultures.
Are there general strategies for increasing the spectrum of novel organisms that can be cultured?
For those organisms that can be cultured, can we build a robust toolkit for establishing the basic infrastructure needed to carry out systematic functional analyses of that organism to identify novel biosynthetic pathways? Examples include rapid strategies for creating collections of tagged and deletion strains and the integrated use of microarrays, proteomics, and metabolomics.
When it is possible to identify valuable biosynthetic pathways, how can the machinery responsible for this new chemistry be systematically identified, transplanted and modified to enhance synthetic biology efforts?
Are there general principles of polyculture life that can be revealed by metagenomics which will aid efforts to create robust, optimized polycultures for synthetic biology efforts?
Bayer TS, Widmaier DM, Temme K, Mirsky EA, Santi DV, and Voigt CA. Synthesis of methyl halides from biomass using engineered microbes. J Am Chem Soc 2009;131:6508: http://pubs.acs.org/doi/full/10.1021/ja809461u?cookieSet=1. Accessed online 28 July 2009.
Gaucher EA, Govindarajan S, Ganesh OK. Palaeotemperature trend for Precambrian life inferred from resurrected proteins. Nature 2008;451:704: http://www.nature.com/nature/journal/v451/n7179/full/nature06510.html. Accessed online 28 July 2009.
Brenner K, Karig DK, Weiss R, Arnold FH. Engineered bidirectional communication mediates a consensus in a microbial biofilm consortium. Proc Natl Acad Sci USA 2007;104:17300-17304: http://www.pnas.org/content/104/44/17300.full. Accessed online 28 July 2009.
IDR TEAM MEMBERS
Steven Benner, Foundation for Applied Molecular Evolution, Inc.
John Cumbers, Brown University/NASA
Gautam Dantas, Washington University School of Medicine
Catherine Goodman, Nature Chemical Biology
Wendy Kelly, Georgia Institute of Technology
Carla Koehler, University of California, Los Angeles
Reshma Shetty, Ginkgo BioWorks
Steven Skiena, Stony Brook University
Eileen Woo, Stanford University
Daniel van der Lelie, Brookhaven National Laboratory
James Berdahl, Massachusetts Institute of Technology
IDR TEAM SUMMARY
By James Berdahl, Graduate Science Writing Student, Massachusetts Institute of Technology
The field of genomics is in the midst of an explosion. As DNA sequencing becomes faster and cheaper, the genomes of various species are being completely sequenced in increasing numbers. New data are accumulating at astonishing rates. New techniques have given rise to new possibilities. Metagenomics, the analysis of genetic material gathered from environmental samples rather than from individual species, has given researchers the opportunity to look beyond the petri dish, beyond culturable cells, to the immense diversity of life in the world around them.
For compelling technological reasons, most molecular dissections of biological systems have focused on a bedrock group of five model organisms: fruit flies, baker’s yeast, roundworms, Escherichia coli (E. coli), and mice. Scientists have accomplished a great deal despite these limitations—many breakthroughs in modern biology have come from work on these systems—but there is much left to explore. Only in the past decade, for example, has the genome of a photosynthetic plant been sequenced.
In the face of the great potential unlocked by metagenomics, an Interdisciplinary Research (IDR) team of scientists at the 2009 National Academies Keck Futures Initiative Conference on Synthetic Biology thought about how best to use the technique to explore the Earth’s biosphere to discover its novel functions. The team began by reviewing issues that researchers have with gene databases, which already contain a wealth of undiscovered genes.
GenBank is one such database. Funded and maintained by the National Center for Biotechnology Information, it is a library of publicly known genetic sequences and the proteins they encode. It currently contains more than 100 billion base pairs from more than 150 million sequences; it is a valuable resource for researchers throughout the world of genomics and beyond. But it’s far from perfect. Data are flooding in, but with no quick way of identifying functional sequences of DNA amid the rest of the A’s, T’s, C’s and G’s, researchers are left with a tremendous amount of information to wade through. Within the database, annotations of gene function are often inaccurate, and though they can be corrected, doing so presents an awkward task that can further propagate errors. Complicating this, more than a third of GenBank consists of domains of unknown function, stretches of DNA that have yet to divulge their purposes, if they have any at all.
Current DNA sequencing techniques, which are still being improved, contribute to the problem because they sacrifice accuracy for efficiency. Error rates of one incorrect base pair for every 1000 seen in early sequencing methods have risen to as high as three for every 100 in more modern, faster techniques. Because just one erroneous nucleotide can radically alter a resultant protein’s structure and function, such error rates can be difficult or impossible to work with. Complicating things further, repetitive DNA sequences in a genomic sample can combine with the short read lengths generated by these rapid techniques to produce overlapping sequences that aren’t actually found in nature. But developing better DNA sequencing is hardly a new idea; the search for new techniques is ongoing, and the task itself wouldn’t exist without the breakthroughs that have already been made.
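The arithmetic behind these error rates can be made concrete with a rough sketch. It assumes that errors strike each base independently and that a gene is read in a single pass with no consensus from overlapping reads; both are simplifications, and the 1,000-base-pair gene length is a hypothetical value chosen only for illustration.

```python
def error_free_probability(per_base_error: float, length: int) -> float:
    """Chance that every base in a read of `length` bases is correct,
    assuming errors are independent (a simplifying assumption)."""
    return (1.0 - per_base_error) ** length


def expected_errors(per_base_error: float, length: int) -> float:
    """Average number of wrong bases in a read of `length` bases."""
    return per_base_error * length


GENE_LENGTH = 1000  # hypothetical gene length in base pairs, for illustration

# Error rates quoted in the text: 1 per 1000 (early) and 3 per 100 (faster).
RATES = {"early methods": 1 / 1000, "faster methods": 3 / 100}

for label, rate in RATES.items():
    print(
        f"{label}: about {expected_errors(rate, GENE_LENGTH):.0f} expected errors, "
        f"error-free probability {error_free_probability(rate, GENE_LENGTH):.3g}"
    )
```

Under these assumptions, a gene of this length has roughly a one-in-three chance of being read without error at the earlier rate and essentially no chance at the faster rate. Reading the same region many times and taking a consensus reduces the effective error, which this sketch deliberately leaves out.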
Still, the massive accumulation of unchecked data in GenBank has led some researchers to refer to it, half jokingly, as a “write-only” database; only a small fraction of what pours in is currently ever recalled and used. Improved annotation methods would allow researchers to quickly locate proteins based on function, giving them the chance to properly explore the tremendous amounts of genetic information that we have already collected before they turn their attentions and resources to the genomes of the rest of the world.
The problem with annotation, however, is that it is difficult to verify. Sequences in GenBank are annotated either by the researchers submitting them or by automated software that determines their function based on similarities with other sequences. There’s no guarantee against mistakes. The only way to know for sure what a given patch of DNA will do is through direct testing and wet biochemistry: inserting the sequence into the genomes of culturable cells and growing them to see what happens, then purifying whatever protein product might be produced and testing it in vitro to confirm or discredit a suspected function. Obviously, this presents something of a bottleneck when looking at billions of potential genes, so unless cheaper and faster methods of wet biochemistry are developed, a different approach is needed. The Holy Grail of automated annotation would be a program that could, given an input of A’s, C’s, T’s and G’s, use the physical properties of all the atoms involved to calculate the sequence, structure and function of the resultant protein, but such a program is a long way off, if it’s possible at all. For the moment, it is prohibitively difficult to even model water at such a detailed level. A different approach in making annotations more effective would be to streamline the searching process by organizing proteins on an evolutionary basis, thus grouping classes of structure and function and making newcomers easier to identify. Again, though, this technique presents a bottleneck because researchers with special knowledge are needed to set up the database’s new structure and classify the constant deluge of incoming data. For the immediate future, annotation might be most improved simply by making adjustments to the process itself, allowing users to more easily and concisely correct erroneous data that they find in GenBank.
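To illustrate how similarity-based annotation works and why it can pass mistakes along, here is a toy sketch. It is not the method used by GenBank or by any real annotation software: the k-mer Jaccard score, the 0.5 threshold, the function names, the sequences, and the labels are all hypothetical stand-ins chosen for illustration.

```python
def kmers(seq: str, k: int = 4) -> set:
    """Set of all overlapping substrings of length k in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Shared items divided by total distinct items (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def annotate(query: str, database: list, k: int = 4, threshold: float = 0.5):
    """Copy the label of the most similar database entry, or report
    'unknown function' when nothing scores at or above `threshold`."""
    query_kmers = kmers(query, k)
    best_label, best_score = "unknown function", 0.0
    for sequence, label in database:
        score = jaccard(query_kmers, kmers(sequence, k))
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return "unknown function", best_score
    return best_label, best_score


# Hypothetical entries: made-up sequences with made-up labels.
DATABASE = [
    ("ATGGCTAGCTAGGATCCGTA", "hypothetical hydrolase"),
    ("ATGTTTCCCGGGAAATTTCC", "hypothetical transporter"),
]

print(annotate("ATGGCTAGCTAGGATCCGTT", DATABASE))  # differs from entry 1 by one base
print(annotate("CCCCCCCCCCCCCCCCCCCC", DATABASE))  # resembles neither entry
```

The query inherits whatever label its nearest neighbor carries. If that label was wrong when it was deposited, the error is copied to the new sequence without any experimental check, which is the propagation problem described above.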
Moving on from data management problems, the IDR team agreed that the most pertinent problem scientists face in adapting metagenomics for use in synthetic biology is the issue of how to best search the biosphere for new genes or those with specific genetic functions. The problems here are varied.
One technique would be bulk geographic sampling, taking metagenomic samples across a planetary grid or taking representative samples from different ecological regions, but the team deemed such a comprehensive method impractical at best; apart from being logistically intensive, it would simply be adding to the scores of unexplored genetic data that we already have, unless these data passed through the bottleneck of wet biochemistry. The question was also raised of how useful such a program would ultimately be. The extent to which the overall genetic picture varies from one environment to another is not known, so a metagenomic sample from a swamp in Brazil, say, wouldn’t necessarily contain genes with much novel function compared to a field in Mongolia. But then again it could. Knowing how the diversity of genetic function relates to biodiversity is thus an important precursor to any attempts at more extensive environmental sampling.
The team outlined a straightforward plan to test this. By sampling a kilogram of soil from each of ten sites in differing ecological zones around the world, one would have a very rough approximation of the Earth’s diversity. For more direct processing, these samples could be analyzed with metabolomics—analysis of the chemical signatures of the biological activity in the samples. Once such a test was completed, it would quickly divulge whether or not further geographical sampling might be an effective method of bioprospecting. It would also point to which, if any, environments harbor greater concentrations of genetic diversity.
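One way the results of such a test might be compared is sketched below. The team did not specify an analysis; the Shannon index and Jaccard distance used here are common ecological measures swapped in for illustration, and the site names, functional categories, and counts are invented.

```python
import math
from itertools import combinations


def shannon_index(counts: dict) -> float:
    """Shannon diversity of a site: higher means more, and more even, categories."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def jaccard_distance(a, b) -> float:
    """0.0 when two sites share every category, 1.0 when they share none."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Hypothetical counts of detected functions per site (invented data).
SITES = {
    "site_A": {"nitrogen fixation": 40, "cellulose degradation": 30, "metal resistance": 30},
    "site_B": {"nitrogen fixation": 50, "cellulose degradation": 50},
    "site_C": {"metal resistance": 90, "halide metabolism": 10},
}

for name, counts in SITES.items():
    print(f"{name}: Shannon index {shannon_index(counts):.2f}")

for (name_1, counts_1), (name_2, counts_2) in combinations(SITES.items(), 2):
    print(f"{name_1} vs {name_2}: Jaccard distance {jaccard_distance(counts_1, counts_2):.2f}")
```

If most pairs of sites showed distances near zero, broader geographic sampling would be unlikely to turn up much new function. Large distances, or sites with unusually high diversity, would point to environments worth sampling further.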
It was generally agreed that such an experiment would find at least some differences in genetic function, so the team discussed which environments were most likely to harbor unique functions that would be useful for human application. These included toxic waste sites, where organisms living along a gradient of increased toxicity can evolve mechanisms to deal with the toxin in question. The functions of these organisms that allow them to survive could be exploited in other organisms to allow them to survive in similar environments, and even to clean up those environments.
In any case, the discussions of the IDR team touched upon many aspects of metagenomics, resulting in interesting suggestions for colleagues in synthetic biology to consider, from better database management and technology to the development of rational, inexpensive methods of targeted environmental sampling to exploit the diversity of the natural world. If better sense can be made of better incoming data, the field of metagenomics will come closer to realizing its incredible potential.