2
Challenges of Predicting Pathogenicity from Sequence

INTRODUCTION

Our committee was asked to “identify the scientific advances that would be necessary to permit serious consideration of developing and implementing an oversight system for Select Agents that is based on predicted features and properties encoded by nucleic acids rather than a relatively static list of specific agents and taxonomic definitions.” It is true that the microorganisms and toxins that currently make up the Select Agent list are defined by taxonomy and by their perceived importance to public health and security. They are (or are products of) pathogens, that is, microorganisms capable of causing disease. Most Select Agents are not typical of the common pathogenic microorganisms seen in human or animal medicine or in agricultural practice, but Select Agents and more commonly encountered pathogenic microorganisms do share a number of biological properties. It is essential to understand that pathogenic microorganisms are not defined by “taxonomy”; it is very common for a given microbial species to have pathogenic and non-pathogenic members. Escherichia coli, for example, is found in the colon of virtually all humans and animals and is part of their indigenous flora; these strains are typically harmless. However, a genetically defined, and more recently sequence-defined, subgroup of E. coli is the most common cause of urinary tract infection in humans and dogs. From a taxonomic standpoint the microorganisms are unequivocally called Escherichia coli; from a genetic and sequence homology standpoint they are distinct categories of E. coli. Similarly, the taxonomic genus Yersinia includes Y. pestis, the causative agent of plague, as well as other Yersinia species that are certainly enteric pathogens but are not Select Agents, and still others that are not known to be pathogens. The pathogens and the non-pathogens are not distinguished by taxonomy but can now be distinguished reasonably well with genetic and molecular analysis.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




Is there a potential for developing and implementing an oversight system for Select Agents that is based on features and properties encoded by nucleic acids? The general answer to this question is yes. The committee believes, however, that the entire concept of “predictive” oversight is flawed in that (1) the current Select Agent list has a non-biological as well as a biological basis for existence and (2) functional “prediction” alone cannot provide a level of certainty sufficient to designate a microorganism as a Select Agent, whose possession is legally restricted. Nevertheless, oversight of novel pathogens, whether natural or synthetic, is clearly seen by policy makers and legal experts to be a necessary component of a comprehensive biosecurity strategy.

We propose to discuss here only the biological factors relevant to establishing a sequence-based oversight system that is focused on identifying genes and gene products that are likely to be involved in survival and persistence of a microorganism and its interaction with a host. That would include genes of Select Agents, but would also include a far greater number of genes that are associated with pathogenicity (the ability to cause disease) and virulence (the degree of pathogenicity encoded by a given gene or group of genes). Understanding the basis of such an oversight system requires some understanding of the biology of pathogenicity and of the current limitations of genomic analysis.

THE ART OF SEQUENCE-BASED PREDICTION

It is clear that we are immersed in an age of genomics. As of December 27, 2009, the Web site Genomesonline.org (Genomesonline 2009) reported that 3,606 bacterial genomes were being sequenced and that complete DNA sequences of at least 712 distinct bacterial strains were in the public domain. The completed sequences include all the bacterial Select Agents and most common pathogens of humans, animals, and plants.
Entrez Genomes contained 3,498 reference sequences for 2,374 viral genomes, including all of the Select Agents and common plant and animal viral pathogens.

The genomes of prokaryotes possess specific and relatively well-understood promoter sequences (signals), such as transcription factor binding sites, that are relatively easy to identify. The gene sequence that codes for a protein occurs as one contiguous open reading frame (ORF), which is typically many hundreds or thousands of base pairs long. The nucleotide compositions and frequency of use of stop codons (the punctuation between genes) are well known. Furthermore, protein-coding DNA has characteristic periodicities and other statistical properties. Therefore, recognizing genes in prokaryotic systems is relatively straightforward, and there are well-designed algorithms to do it with high levels of accuracy.

However, identifying a gene and understanding its function are altogether different matters. At least one-fourth of genes that are identified in bacterial genomes, whether large or small, whether from pathogen or non-pathogen, are “hypothetical” or of unknown function. Considering the long history of biochemical and genetic examination of microorganisms, it is daunting that so much of the “equipment” of microorganisms is still unknown. The hypothetical genes are in two categories: ones that are found in a variety of organisms, and ones that are peculiar to particular lineages. For many genome sequences, the only annotation that will be available for the foreseeable future will be based on computational predictions and comparisons with known functional elements in related microorganisms.

Despite the hypothetical genes, it has been pointed out that much of the whole-genome sequence of many microorganisms, when viewed against a backdrop of 100 years or more of biochemistry and microbiology, is not at all a total surprise in that much that has been found was fully expected (Doolittle 1981). Moreover, the full genome itself is a tremendously valuable asset because, for a given organism, it provides all the details: the entire “parts list” is on the table, even if we do not know where some of the parts go or what they do. Seeing an assemblage of parts should not be mistaken for understanding how the parts function.

From the standpoint of the present report, for important pathogens, comparisons between strains can pinpoint differences between the virulent and the avirulent microorganisms, and comparisons between species can be informative about host or tissue specificity. Comparisons have become even more useful as we have factored in the complete genomes of humans and other animals that serve as microbial hosts. At the genetic level, genome comparisons begin to reveal the fundamental divergences of microbial life and their evolutionary origins. We have also begun to understand how pathogens got to where they are, and we know to some extent what to look for if we are trying to design a pathogenic microorganism.
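The gene-recognition step described earlier in this section, in which a prokaryotic protein-coding gene appears as one contiguous open reading frame ending at a stop codon, can be illustrated with a minimal Python sketch. It assumes the forward strand only, ATG as the sole start codon, and the three standard stop codons; the function name `find_orfs` and the `min_codons` cutoff are hypothetical choices made for illustration and do not come from any tool named in this report. Real gene finders also use promoter signals and codon statistics.

```python
# Toy ORF scan: forward strand only, ATG start, standard stop codons.
# Names and the min_codons cutoff are illustrative assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for ATG-to-stop reading frames.

    start is the index of the A in ATG; end is one past the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence one codon at a time in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count codons before the stop, start codon included.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAACCCGGGTAACC"))
```

Locating such a frame is the easy part; as the text stresses, it says nothing about what the encoded protein does.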
The objective of a “predictive oversight” system would be to forecast with a high degree of certainty the pathogenic potential of sequences of
• Single or small numbers of genes related to Select Agent toxins.
• Genomes or genomic regions that are closely related to Select Agent pathogens.
• Genomes or genomic regions of newly identified natural pathogens.
• Novel genomic sequences that are designed and assembled by synthetic biology.

Sequence prediction in biology is a hierarchy of increasing difficulty that reflects the complexity of the particular system under analysis. The simplest of such predictions would probably be that of a protein, such as a toxin.1 Next in order of predictive difficulty would be a genetic pathway (a group of co-regulated multiple proteins interacting in concert). The third most problematic set of sequences to evaluate as a means of forecasting function would be those of whole organisms alone in a controlled environment (with multiple pathways interacting in concert). The final and most difficult predictive situation would be one in which two or more organisms interact in their natural environment.2 It is this last level of complexity that gives rise to the key biological attributes of pathogenicity and transmissibility, factors that contribute to the criteria that form the basis of inclusion of an organism on the Select Agent list.

Predicting pathogenicity or transmissibility of a microorganism requires a detailed understanding of multiple attributes of both the pathogen and its host. It is a prediction problem of the greatest complexity. Using a single genomic sequence to predict the potential consequences of the interaction of a microorganism, or a microbial virulence determinant, with a host clearly is not within the bounds of contemporary biology. Current sequence prognostication methods are at best at the level of foretelling the function of an individual protein on the basis of its deduced amino acid sequence. Even with the availability of a high-resolution protein structure, projecting the activity of closely related molecules accurately is not straightforward. There is as yet little work that even attempts to make predictions at the next level, that of genetic or biochemical pathways.

1 This is by no means easy. For instance, Yoshida et al. have shown that three amino acid changes can turn the E. coli major chaperone GroEL into an insect toxin. That co-option of function presents a major problem for predictive systems, even at this level. Yoshida, N., K. Oeda, et al. (2001). “Protein function: Chaperonin turned insect toxin.” Nature 411(6833): 44.
2 Consider the enormous number of gene sequences that are at play and that must be choreographed as a microorganism leaves the salivary gland of a biting insect and is injected into human tissues.

Predicting Biological Function from Sequence

The integration of experimental and computational information suggests that the human genome encodes about 20,000 protein-coding genes and an unknown number of functional RNA molecules; the Bacillus anthracis (anthrax) genome encodes about 6,000 proteins; a large virus, such as the smallpox virus, encodes about 200 proteins; and small positive-strand or negative-strand viruses, such as coronaviruses and influenza viruses, encode 10-30 proteins. As noted above, although these expressed RNAs and proteins can be identified using computational approaches with relative certainty, assignment of function is problematic. Biological experiments are still needed to confirm computational predictions.

The dominant method of function “prediction” uses sequence homology software. The underlying principle of such an approach is that proteins are reused or modified for applications in similar functional systems in different species far more often than entirely new ones are introduced. Most proteins fall into a relatively small number of homologous protein families of related structure and usually of at least somewhat related function. For example, the Pfam protein family database contains about 10,000 protein families that account for about 75 percent of all known proteins. Two proteins that diverged through evolution from a common ancestral sequence (“homologous” sequences) tend to have structural and functional characteristics in common. The sequence that governs the mechanism of action of a particular protein evolves slowly, whereas the sequence that affects how a protein interacts with a binding partner, such as a cellular receptor, evolves rapidly.3 If the function of one protein is known, some aspects of the functional annotation can be inferred for other homologous proteins. Computer programs for sequence-database homology search (such as BLAST, HMMER, and FASTA) are widely used to discern whether a newly annotated protein or RNA sequence is homologous to an already known sequence or sequence family.

Homology offers only a “low-resolution” prediction of function. Sequence-homology analysis can often determine what a protein is likely to be (such as a protein kinase, metalloprotease, or oxidoreductase) but generally will not reveal the biochemical pathway to which its protein partner(s) belong or the particular residue(s) that will be its target or substrate. There are less well-developed computational prediction methods that may occasionally offer clues to help answer the more detailed questions, but generally such queries must be addressed directly with controlled laboratory experiments.
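The idea of inferring function by homology, and its “low-resolution” character, can be illustrated with a toy Python sketch. It is not how BLAST, HMMER, or FASTA work: those programs perform gapped alignment with statistical scoring, whereas this sketch simply compares equal-length strings position by position. The sequences, labels, function names, and the 40 percent threshold are all made-up assumptions for illustration.

```python
# Toy annotation transfer by ungapped identity (illustrative only).
# Sequences and labels are invented, not real protein families.
KNOWN = {
    "protein kinase (toy)": "MKVLAAGIVG",
    "metalloprotease (toy)": "HELGHGGSWA",
}


def ungapped_identity(a, b):
    """Percent of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def transfer_annotation(query, annotated, threshold=40.0):
    """Return (label, score) of the best match, or 'unknown' if too weak."""
    best_label, best_score = "unknown", 0.0
    for label, seq in annotated.items():
        score = ungapped_identity(query, seq)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return ("unknown", best_score)
    return (best_label, best_score)


if __name__ == "__main__":
    print(transfer_annotation("MKVLSAGIVG", KNOWN))
    print(transfer_annotation("WWWWWWWWWW", KNOWN))
```

The returned label names only a broad family. Nothing in the comparison identifies a substrate, a binding partner, or a pathway, which is the limitation the text describes.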
3 For example, a particular enzyme cleaves DNA and recognizes a specific, sequence-defined cleavage site; the enzyme structure that allows it to cleave DNA may evolve slowly (related enzymes also cut DNA), whereas the portion of the enzyme that recognizes the specific DNA sequence cleavage site might evolve rapidly (related enzymes cut different DNA sequences).

For example, if a novel influenza-like genome were obtained, sequence analysis would certainly and immediately recognize the homologous parts: the genes that encode hemagglutinin (HA) (Pfam protein family database code PF00509), neuraminidase (NA) (PF00064), nucleoprotein (PF00506), the matrix proteins M1 (PF00598) and M2 (PF00599), the proteins NS1 (PF00600) and NS2 (PF00601), and the RNA-dependent RNA polymerase components PA (PF00603), PB1 (PF00602), and PB2 (PF00604). These families have tens of thousands of examples of sequences; on the basis of the known diversity, the statistical models used in sequence-homology analyses are often capable of recognizing sequences that are separated by hundreds of millions or even billions of years of evolution. The computational approaches would identify the general kinds of “parts” in a genome and would be able to determine whether an expected part were present, missing, or unexpectedly quite different from a currently known virus sequence component. We can recognize some molecular signatures that may be essential for maintaining effective pathogen-host interaction,4 replication efficiency, and pathogenesis outcomes in natural, but not necessarily alternative, hosts. The identification of genera or species of pathogenic bacteria or viruses is likewise easily accomplished with DNA homology approaches. Those methods are used every day in clinical laboratories all over the world.

What could not be readily foretold from the sequence-homology analyses described above is whether the influenza-like virus is highly pathogenic for humans and other mammals, or whether a particular vaccine will protect against it. Those traits depend on a small number of genetic changes that evolve rapidly in ways that are not well understood; even subtle changes may have a profound biological effect. Those features change so rapidly that they do not correlate well with evolutionary history. Thus, sequence-homology analysis is less informative for such viral characteristics than for simply identifying the component parts of the genome of an influenza or related virus. For example, there is a strong correlation between high pathogenicity and trypsin-independent cleavage of the influenza virus hemagglutinin. This is not a perfect means of prediction, however, inasmuch as the 1918 influenza virus, which was associated with 50 million deaths worldwide, has a cleavage site that appears from sequence analysis to be associated with low pathogenic potential (Box 2.1). Such poor predictive power from sequence analysis is likely to be common for many, if not most, microbial virulence determinants because virulence is typically multifactorial and is affected by details of molecular interactions between a microorganism and a specific target in a specific host.
(See also Appendix G)

4 Methods such as CorrMut and CRASP identify functional domains within proteins that are co-evolving in response to one or more unknown selective pressures (Afonnikov, D. A. and N. A. Kolchanov (2004). “CRASP: a program for analysis of coordinated substitutions in multiple alignments of protein sequences.” Nucl. Acids Res. 32(suppl_2): W64-68; Fleishman, S. J., O. Yifrach, et al. (2004). “An evolutionarily conserved network of amino acids mediates gating in voltage-dependent potassium channels.” Journal of Molecular Biology 340(2): 307-318).

Protein Structure Prediction

Another possible route to prediction of function from sequence is to predict the folded 3D structure of a protein from its sequence and then use features of that structure (which evolve much more slowly than sequence) to infer catalysis, binding partners, or other functional properties. Function prediction based on structure has been one of the “Grand Challenge” problems in science for the last 50 years, since Anfinsen showed that the information to determine protein 3D structure is encoded in the linear amino-acid sequence (Haber and Anfinsen 1962). Pure de novo structure prediction was essentially impossible until recently and is only occasionally successful even now. Progress has come mostly from the growing database of experimentally determined structures (the Protein
Data Bank contains over 60,000 structures; http://www.rcsb.org/pdb), which enable the modeling of new sequences on the basis of homology to known, related structures. Recent achievements have been impressive, as demonstrated by the Critical Assessment of protein Structure Prediction (CASP) competitions (Moult 2005).

BOX 2.1
Influenza: Hemagglutinin Cleavage

One of the most important sequence features of influenza virus A pathogenesis is a protease cleavage site in hemagglutinin (HA). Cleavage is required for HA to catalyze membrane fusion, a necessary step for viral infectivity. In viruses of low pathogenicity, this essential cleavage step tends to be catalyzed by a host-encoded protease (trypsin, in the human respiratory tract), so viral infectivity is limited by tissue distribution of the host protease. Conversely, an essential feature of highly pathogenic influenza viruses is the presence of mutations in the gene for HA that lead to trypsin-independent cleavage of the protein; such mutations enable the virus to infect a broader range of tissues. The HA protease cleavage site is a sequence of only about seven to 10 residues, and the sites that contain small insertions of a few basic residues (lysine and arginine) tend to be associated with trypsin independence. With such changes, HA becomes cleavable by widely distributed subtilisin-like proteases that have a consensus recognition sequence. Yet the correlation of high pathogenicity and trypsin-independent cleavage of HA is not perfect; the 1918 influenza virus, which was associated with 50 million deaths worldwide, has a cleavage site that appears from sequence analysis to be of the low-pathogenicity type.
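The sequence feature described in Box 2.1, a short cleavage site enriched in basic residues (lysine, K, and arginine, R), lends itself to a very simple heuristic. The Python sketch below counts basic residues in a candidate cleavage-site peptide and flags sites above a cutoff. The peptide strings are invented for illustration and are not real HA sequences, and the function names and the `min_basic` threshold are hypothetical assumptions rather than an established rule.

```python
# Toy check for a basic-residue-rich cleavage site (illustrative only).
# The threshold and the example peptides are assumptions, not real data.
BASIC_RESIDUES = set("KR")  # lysine and arginine


def basic_residue_count(site):
    """Count lysine and arginine residues in a short peptide."""
    return sum(residue in BASIC_RESIDUES for residue in site.upper())


def looks_polybasic(site, min_basic=4):
    """Flag a site whose basic-residue count meets the chosen cutoff."""
    return basic_residue_count(site) >= min_basic


if __name__ == "__main__":
    for peptide in ("AAQSTRG", "AARKKRG"):  # made-up 7-residue sites
        print(peptide, basic_residue_count(peptide), looks_polybasic(peptide))
```

A flag from such a check describes a correlation only. As the box notes, the 1918 virus would pass as low risk by this kind of rule, so the result cannot stand in for a prediction of pathogenicity.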
Two rather distinct current approaches achieve good predictions fairly often: one is inspired by protein evolution and can recognize and piece together distant sequence relationships (Zhang, Wang et al.), and the other is inspired by protein folding and uses a combination of physics and empirical data to construct a model (Raman, Vernon et al. 2009). Homology models are usually approximately correct (Keedy, Williams et al. 2009), and even de novo predictions sometimes succeed (Box 2.2). Further prediction of binding or catalytic sites from successfully modeled 3D structures (López, Ezkurdia et al. 2009) or prediction of protein/protein binding modes by docking known components, as in the CAPRI (Critical Assessment of PRedicted Interaction) competition (Janin 2005), is also still successful only sometimes and only partially. Approximately correct homology models can enhance the power of purely sequence-based comparisons considerably, especially when they show that known functional residues are brought together into the right 3D relationships. Often, however, the critical biological details hinge on structural details that confer a difference in specificity or in regulation and that are exactly the most difficult places to achieve accurate prediction.

BOX 2.2
Critical Assessment of Protein Structure Prediction (CASP) Competition

Last year’s Critical Assessment of protein Structure Prediction (CASP8) results for free modeling (that is, when no clear template is available in the Protein Data Bank) included at least one good model for 5 of the 13 targets (Ben-David, Noivirt-Brik et al. 2009). Some predicted models have been shown to be accurate enough for the demanding application of molecular replacement for solving the crystallographic phase problem (Qian, Raman et al. 2007).

CASP8 included a cautionary example of extreme structure-prediction difficulty in the form of two targets that had only three different residues out of 56 (as shown in the figure); the pair had been designed and selected for maximum sequence match but to fold into entirely different 3D structures (He, Yeh et al. 2005). A purely sequence-based predictive method could not recognize the consequences of this tiny difference, and it seems that the predictor groups who got both targets right knew about the earlier stages of this tour-de-force design (He, Yeh et al. 2005). The example is analogous to the issue of the vaccine strain of a pathogen; in both cases, a very small, “linchpin” change in sequence causes a reversal of the large-scale relevant property: in this pair, the protein fold, and for the vaccine strain, its pathogenicity. Pure prediction is therefore chancy at best here, and correct classification depends on expert outside knowledge of such unusual near-neighbor cases.

[Figure Box 2-2: image not reproduced in this text]
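The arithmetic behind the cautionary example in Box 2.2 is short: two sequences that differ at 3 of 56 positions are 53/56, or about 94.6 percent, identical, a level at which a sequence comparison would normally be taken as strong evidence of the same fold. The Python sketch below reproduces that calculation on placeholder strings; the sequences are invented and are not the designed proteins of He, Yeh et al., and the function names are illustrative assumptions.

```python
# Percent identity for two equal-length sequences (illustrative only).
# seq_a and seq_b are placeholders, not the CASP8 target sequences.
def count_differences(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def percent_identity(seq_a, seq_b):
    """Share of identical positions, as a percentage."""
    same = len(seq_a) - count_differences(seq_a, seq_b)
    return 100.0 * same / len(seq_a)


# Build a 56-residue placeholder and change exactly three positions.
seq_a = ("ACDEFGHIKLMNPQRSTVWY" * 3)[:56]
residues = list(seq_a)
for pos in (10, 25, 40):
    residues[pos] = "A" if seq_a[pos] != "A" else "G"
seq_b = "".join(residues)

if __name__ == "__main__":
    print(count_differences(seq_a, seq_b))
    print(f"{percent_identity(seq_a, seq_b):.1f}")
```

The identity score is the same whether the three changes are neutral or are the “linchpin” residues that switch the fold, which is why the score alone cannot separate the two cases.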

Overall, this route of prediction based on 3D structure is well worth encouraging for both practical and intellectual benefits, but its utility in a robust system of predicting Select Agents (for legal oversight) is still extremely far off.

BOX 2.3
Ricin

Ricin is a plant toxin isolated from the seeds of the castor bean plant, Ricinus communis. It inhibits protein synthesis in affected cells by modifying the ribosome; this leads to ribotoxic stress and eventually cell death. Ricin represents a family of toxins known as Ribosome Inactivating Proteins (RIPs) that are found throughout the plant and bacterial kingdoms. Despite their very different sequences and sometimes quite different structures, these toxins share three highly conserved amino acids that are responsible for their catalytic activity.

One cautionary tale for the prediction of toxin activity from gene sequence comes from the work of Frankel and Robertus (Frankel, Welsh et al. 1990). They genetically mutated ricin amino acid glutamate-177 and predicted that the protein would be inactivated because this side chain is highly conserved and central to the catalytic activity of the toxin. Although more conservative mutations were inactive, when the glutamate was mutated to an alanine residue, the enzyme still retained about 5 percent of the activity of the wild-type sequence, enough to slow growth of yeast cells sensitive to the toxin. Based on the structure of ricin, the researchers predicted that the nearby glutamate residue at position 208 was able to move and substitute in the reaction. They produced an inactivated ricin A chain only after mutating both glutamates 177 and 208; a crystal structure of the Ala-177 mutant showed that the carboxyl of Glu-208 did indeed move into the former position of catalytic Glu-177 (Kim, Misna et al. 1992).

The genes encoding ricin (rtx genes) of the various castor bean cultivars have only one or two nucleotide differences in regions that do not affect protein structure or function. The R. communis genome would not be sequenced to determine the virulence of castor beans; instead, it would be assumed that unmodified beans contain active ricin toxin. Several research groups have genetically altered one or more of the conserved catalytic residues to produce inactive ricin A chain expressed recombinantly in E. coli in an attempt to produce a vaccine to protect against deliberate ricin poisoning (Munishkin and Wool 1995).

Gene Regulation

If an organism’s virulence depends on specialized gene products, it must be able to use them when they are needed but not squander its metabolic energy in producing them aimlessly or risk having them detected and prematurely neutralized by host defenses. Consequently, regulating the expression of virulence factors is an additional essential complication of a pathogenic microorganism’s life. The host presents an array of conditions strikingly distinct from those of the outside environment, conditions that are not easily reproduced in the laboratory. In fact, laboratory culture conditions bias our understanding of microbial adaptation to natural environments. Vibrio cholerae, for example, is thought to persist without expression of virulence factors in brackish estuaries and other saline aquatic environments, sometimes in association with the chitinous exoskeletons of various marine organisms. Transition from that milieu to the contrasting environment of the human small intestinal lumen is accompanied by substantial genetic regulatory events.

The microbial cell is relatively simple, yet it possesses the means to detect, often simultaneously, changes in temperature, ionic conditions, oxygen concentration, pH, and calcium, iron, and other metal concentrations that might appear to be subtle signals but are essential for the precise mobilization of virulence determinants. Similarly, environmental regulatory signals prepare a microorganism for its transition from an extracellular to an intracellular state. For example, iron is a critical component of many cell metabolic processes; therefore, it is not surprising that animals rely on high-affinity iron-binding and iron-storage proteins to deprive microorganisms of access to this nutrient, especially at the mucosal surface. In turn, most pathogens sense iron availability and induce or repress various iron acquisition systems accordingly. Moreover, many microorganisms produce toxins that are regulated by iron in such a way that low iron concentrations trigger toxin biosynthesis. Reversible regulation of the expression of virulence genes by temperature is common to many pathogens. Thus, a microorganism like E. coli, which may be deposited in feces and live for long periods under conditions of nutrient depletion and low temperature, mobilizes its colonization-specific genes when it is returned to the warm mammalian body.
The regulatory machinery used to accomplish that is an important feature of many pathogens, including Y. pestis and B. anthracis.

The number of well-characterized virulence regulatory systems is rapidly increasing, in part because of the development of rapid methods for screening gene expression on a genome-wide basis (for example, with the use of DNA microarrays). But relatively little is known about either the specific environmental signals to which these systems respond or the exact role of the responses in the course of human infection. One common mechanism of bacterial transduction of environmental signals involves two-component regulatory systems that act on gene expression, usually at the transcriptional level. Such systems make use of similar pairs of proteins: one protein of the pair spans the cytoplasmic membrane, contains a transmitter domain, and may act as a sensor of environmental stimuli; the other is a cytoplasmic protein (a “response regulator”) that has a receiver domain and regulates responsive genes or proteins. Those regulatory systems are common in both pathogens and non-pathogens, so their detection by sequence analysis cannot be used as a reliable predictor of whether a microorganism is pathogenic.

The coordinated control of pathogenicity incorporates a regulon, a group of operons or individual genes controlled by a common regulator, usually a protein activator or repressor. A regulon provides a means by which many genes can respond in concert to a particular stimulus. At other times, the same genes may respond to other signals independently. Global regulatory networks are a common feature of microbial virulence and basic microbial physiology, so their sequences, although often essential for a pathogen, are not reliable predictors of virulence. The apparent complexity of virulence regulation in a single microbial pathogen is magnified by the coexistence of multiple interacting (“cross-talking”) systems and by regulons within regulons.

Thus, the inherent pathogenicity of a microorganism can be greatly altered through regulation of virulence genes. It is extremely difficult to predict how even a single nucleotide change will affect regulation and thereby alter pathogenesis or the viability of the microorganism. (Additional detailed examples of the important role of regulation in allowing pathogens to respond to the environment of the human host are given in Appendix J.)

THE NATURE OF INFECTIOUS DISEASE AND THE ART OF PREDICTING PATHOGENICITY

The preceding sections have shown that several computational approaches have promise for predicting biological function from sequence. Can they be applied effectively to predict pathogenicity? If not, what is required to develop a predictive method that would suffice? To address this issue, we will first discuss the nature of infectious disease.

Infectious diseases affect all living things, from the smallest amoeba to insects, plants, and the largest mammals.
The co-existence and co-evolution of microorganisms with their hosts is a dynamic equilibrium ranging from one extreme of mutualism in which both partners benefit from the interaction (for example, bacterial production of organic nitrogen for plants or of vitamin K for the human), to a relationship of commensalism in which one organism benefits but the other is unaffected, to another extreme of parasitism that benefits one partner to the detriment of the other. Microorganisms are constant companions of plants and animals. Humans carry a vast indigenous microbial flora from shortly after birth until death, and the role of this human microbiome in human health and disease is the subject of considerable interest and recent investigation. Although it is biologically correct to say that most microorganisms that inhabit this planet are harmless to humans or may even benefit humankind, it is also true that humans have a prejudicial view of microorganisms and direct their focus to microorganisms as agents of disease. The biological reality is that most microbial infections are relatively benign and that symptoms of disease are sometimes the result of the human immune system’s response to infection rather than the product of the infecting microorganism (Box 2.4).

fever virus, can replicate efficiently in insects as well as mammals. Dead-end interactions often result in severe, fulminant disease involving infections of an incidental host species that is not needed for maintenance of the natural life cycle of the pathogen (for example, B. anthracis, dengue virus, Nipah virus, and Ebola virus infections of humans, and citrus tristeza virus and Phytophthora infestans infections in citrus grafted onto sour orange and in potatoes, respectively). The infectious agents originate in other vertebrate species or are carried by arthropods that cycle between insects and vertebrates or between insects and plants. In some circumstances, a “dead-end” infection can give rise to an emerging infection, as was the case for HIV, Rift Valley fever virus, and SARS-CoV in humans and for the citrus tristeza virus and P. infestans infections mentioned above. The majority of the human pathogens found on the Select Agent list cause dead-end interactions.

The outcome of a microbe-host encounter is based on interactions at the molecular and cellular level that take place over time. For certain viruses, a productive infection is determined by specific receptors that need to be engaged for virus binding and entry (for example, sialic acid and angiotensin-converting enzyme 2 for influenza A virus and SARS-CoV, respectively), the availability of intracellular complementing factors needed for efficient replication, the ability to manipulate intracellular antiviral signaling pathways (for example, interferon, pattern recognition receptors, apoptosis, and autophagy), and the adaptive immune response. In the case of Gram-negative bacteria, a productive infection may be initiated by adhesion through fimbriae (for example, enterotoxigenic E. coli); in the case of Gram-positive bacteria, it may be initiated through cell wall-anchored proteins (for example, microbial surface components recognizing adhesive matrix molecules). Additional virulence factors are needed to counter the innate and adaptive immune response. In the case of plant pathogens, a productive infection is determined in part by success in breaching the plant basal defense or innate immunity system through the expression of countermeasures. Appendix K presents examples of various factors that affect microbe-host interactions and of our lack of understanding of the basis of a pathogen’s ability to infect a host or multiple hosts, even if we know the genomic sequence of the pathogen and, in the case of human infections, of the host.

A key factor in the outcome of the microbe-host interaction is the effectiveness of the host innate and adaptive immune response in the face of sophisticated and redundant microbial countermeasures, some of which are conserved in bacterial pathogens that infect plants and animals. Another important factor is selection pressure, which can be manifested in physiological, epigenetic, and/or genetic changes in a pathogen in response to the innate or adaptive immune system, the absence of an adaptive immune response, or changes in the microbiome (Box 2.10). We are just beginning to understand the significance of the microbiome for human health. Microbial interactions may determine whether a would-be pathogen acquires increased virulence or transmissibility, and whether an infection will result in disease. Thus, the individual microorganism, microbe-host interactions, and the environmental context must be considered in assessing an organism or a gene sequence as a potential threat to human health.

BOX 2.10
Bacterial Superinfection Following Influenza A Virus Infection

Infection of the upper respiratory tract with a virus such as Influenza A virus predisposes the host to superinfection with bacterial pathogens, including Staphylococcus aureus and Streptococcus pneumoniae. The massive host innate response to Influenza A is focused on elimination of the virus from the respiratory tract. The overwhelming inflammatory response directed against the intracellular viral pathogen provides a perfect opportunity for a member of the normal respiratory flora (such as S. aureus or S. pneumoniae) to establish itself deeper in the respiratory tract, where it can cause bronchitis and/or pneumonia. The signals in the bacteria that cause them to transition from commensal organism to pathogen are not entirely known, although the complex regulatory networks are slowly being identified and characterized. Once the bacteria migrate to their new niche, a plethora of virulence factors are produced. The host innate response to the bacteria is impaired because the overwhelming response to the virus depletes the neutrophils, macrophages, and dendritic cells and the antimicrobial factors produced by them. As a result, the secondary bacterial infection is often far more life-threatening than the initial viral infection. S. pneumoniae has emerged as the most common cause of secondary bacterial pneumonia in the current H1N1 Influenza A outbreak, and S. aureus, both methicillin-resistant and methicillin-sensitive, is the second most commonly isolated organism in post-influenza bacterial pneumonia.

THE SPECIAL CASE OF SYNTHETIC BIOLOGY

Innovations in the chemical synthesis of DNA have led to dramatic improvements (the DNA can be longer, of higher quality, and less expensive per base pair) since the synthesis of the first copy of the 75-base pair tRNA Ala in the early 1960s (Agarwal, Buchi et al. 1970). DNA synthesis on solid supports combined with phosphoramidite nucleosides allowed the synthesis of a 2.7 kb plasmid DNA, an infectious 7.5 kb poliovirus genome, a 32 kb bat coronavirus (HKU3) that was the precursor to the SARS-CoV epidemic, and the first complete synthesis of a 582 kb artificial bacterial genome. Most recently, a synthetic bacterial genome has been “booted”6 into an autonomous life-form, so the artificial bacterial genomes are self-perpetuating (Gibson, Glass et al. 2010).

6 Synthetic biologists have adopted this terminology from computer science, in which it means “to start (a computer) by loading an operating system from a disk.” In the present case it refers to starting an organism from a genome.

It is clear that the price of DNA synthesis has steadily decreased, and a cursory survey of commercial suppliers shows a cost of about $0.39-0.50/base, a dramatic reduction from the early 2000s, when industry costs of about $5-10.00/base were common. New optical deprotection chemistries and microfluidic technologies that allow programmable synthesis of hundreds of thousands of oligomers in parallel with fairly high fidelity seem poised to revolutionize inexpensive synthesis. With those and current multiplex technologies, it seems likely that future costs will approach $0.03/base (commercial costs); additional costs will be associated with gene assembly, quality control, and other manufacturing issues. In the near future, as gene synthesis approaches $0.10-0.20/base, synthesis will replace most traditional recombinant DNA methods and allow the ready design and synthesis of new gene circuits and biological processes.

The design and testing of artificial biological systems and the understanding of functional interactions are key objectives of synthetic biology. The benefits and broad availability of affordable gene synthesis are expected to foster rapid-response platform technologies for producing candidate vaccines and therapeutics to address biothreat agents and newly emerging infectious diseases. Affordable gene synthesis will allow more effective diagnostic platform design and basic inquiry into fundamental biological mechanisms, including pathogenesis and pathogen-host interactions. Gene synthesis will assist in designing and testing complex biological systems to fulfill specific purposes ranging from biofuel and petrochemical production to genetically engineered foods, virus batteries, solar cells and energy systems, and the manufacture of new medicines.
Thus, when considering steps that aim to prevent the misuse of the technology, we should also recognize the dramatic impact and potential that synthetic biology offers to the future economic growth, competitiveness, and viability of the U.S. biotechnology industry.

In general, the discipline either uses natural cellular components and systems to construct new biological processes (the top-down approach) or generates unique biological and/or chemical systems that have novel properties and are designed to mimic living systems (the bottom-up approach). Most investigators have used the former approach because the foundational understanding needed for de novo biological design (for example, of protein structure, protein-protein interaction, and genetic regulation) is incomplete. One concern is that unanticipated outcomes may occur when engineered organisms reproduce, evolve, and interact with the environment. Another concern is the deliberate misuse of the technology to design and construct new pathogens, either by engineering in components that resist current vaccine or therapeutic interventions, or by altering pathogenesis by blending in virulence genes from alternative pathogens, or by the de novo design of new pathogens. Understanding of the complex genetic and protein networks that regulate replication and disease is substantially limited, so for the near future, synthesis can for the most part only copy, emulate, or recreate existing gene sets that have been designed by nature.

Top-Down Approach

Since 1980, a standard set of recombinant DNA techniques has been developed that allows the cloning of full-length DNA copies of most DNA and RNA virus genomes. In parallel, highly efficient strategies have been developed for “booting” infectivity from the DNA genomes and then recovering infectious virus, including recombinant viruses that contain design modifications (for example, mutations, new regulatory networks, and gene insertions or deletions). Proven strategies exist for reconstituting most of the viral Select Agent and category A-C biodefense pathogens from full-length DNA genomes; however, infectious clones have not been constructed for many Select Agent viruses on these lists. Developing infectious full-length DNAs is traditionally a time-consuming and uncertain process; single nucleotide changes can destroy genome infectivity. In general, the difficulty of generating an infectious molecular clone increases with genome size because of issues associated with genome and vector stability, sequence accuracy, and technical challenges in manipulating and recovering large genomes. For example, substantial technical sophistication, targeted expertise, and practice are required to reproducibly “boot” genome infectivity of large RNA genomes (for example, coronavirus, Ebola virus, and influenza virus), DNA genomes (such as poxvirus), and bacterial genomes. In contrast, small RNA and DNA genomes are much easier to manipulate and recover. The number of people capable of working with these systems is increasing on a daily basis. For example, students at Johns Hopkins University recently worked collaboratively and designed, synthesized, and recovered a 280 kb yeast chromosome (Dymond, Scheifele et al. 2009).
Synthetic biology will alter the standard approaches for reconstructing full-length infectious DNA genomes of most viruses, and computer-based genome design will probably become the norm in the near future. It is clear that gene and genome synthesis will allow for synthetic reconstruction of many highly pathogenic human, animal, and plant virus genomes and thereby remove one major limit to biological warfare and terrorism: availability. There are similar concerns regarding synthetic bacterial genomes, which are on a longer time horizon. Assuming that a cost of $0.10/base is achievable in the near future, the synthesis of the genomes of most “agents of concern” will be readily affordable (RNA genome: about 7-32 kb for $700-3,200; DNA genome: about 150-300 kb for $15,000-20,000), while bacterial genomes of half a million base pairs will cost about $50,000. Not only is the instrumentation affordable, highly portable, and globally available, but there are commercial vendors on every continent, so it would be difficult or impossible to track and police nefarious intent. Thus, it is possible that no virus (or microorganism) can ever be considered extinct (for example, poliovirus, 1918 influenza, smallpox, reconstructed extinct retroviruses, woolly mammoth viruses, etc.) as long as basic sequence information is available to support its synthetic reconstruction. The considerable concern surrounding the “dual-use” potential of synthetic DNA technology is understandable (NRC 2004).

In general, dual-use concerns include the use of synthetic biology to deliberately host-shift pathogenic microorganisms, to engineer drug or vaccine resistance, or to alter virulence potentials. As noted throughout this chapter, the list of “virulence genes” that have defined biological properties is growing at a considerable rate, and this fuels concerns that virulence is a readily malleable trait. Limited research has focused on the potential of synthetic genome design to enhance viral pathogenesis. With an existing genome as a chassis (from either a pathogenic or a nonpathogenic virus), it is certain that virulence genes from DNA and RNA viruses can easily be introduced into recombinant genomes in an attempt to alter the pathogenic potential of the chassis genome. The capability of using standard recombinant DNA techniques for that purpose on a more limited scale has existed for about 30 years. Moreover, we can imagine synthetic killer viruses that destroy civilization or that cause significant morbidity and mortality, a common topic in cinema. We know how to synthesize such imaginary “doomsday scenario viruses,” but how well the blended genomes will perform in human populations is unknown. For example, expressing the influenza virus NS1 type 1 interferon antagonist gene in SARS-CoV is simple with a top-down engineering approach, but the pathogenic properties of the chimera are difficult to predict.
That is because SARS-CoV encodes at least six other interferon antagonist genes, which raises the question of whether NS1 would offer considerable improvement in the pathogenic and innate immune antagonism capacity of the coronavirus. Moreover, most viral proteins form complex interaction networks that are essential for regulating efficient virus growth and virulence. Removing or introducing new potential interaction partners will most likely adversely affect virus-virus and virus-host interaction networks and thus influence pathogenesis outcomes in unanticipated ways. In addition, dramatically altering the genome content of most RNA viruses by inserting genes could easily attenuate virulence, probably by affecting global gene expression or by altering basic RNA structure and genome packaging and release. In spite of those limitations, insertion of the IL-4 gene (cloned from Mus musculus domesticus, the common mouse) into murine poxviruses (such as ectromelia virus) or insertion of the SARS-CoV ORF6 gene into mouse hepatitis virus enhanced pathogenesis in mice, and the influenza virus NS1 gene enhanced Newcastle disease virus replication in human cells.

It is also clear that synthetically designed chimeric viruses would elicit fear in exposed populations, regardless of the actual pathogenic outcomes associated with their intentional release. It might be prudent to shift the focus away from preventing dual-use proliferation to preparing for it by developing new platforms for rapid vaccine and therapeutic design and stockpiling these reagents against future bioterrorist attacks. It is important to note that many “non-Select Agent” human viruses and microorganisms are extremely pathogenic and encode virulence determinants that could be blended into other genome chassis or enhanced by introducing new virulence determinants. Thus, one danger in creating lists is that they focus resources on a select subset of human pathogens and ignore the broader innovation in genome design and gene function that exists throughout the larger microbial world (see Figure 2.2).

[Figure 2.2 labels: All Biology; Pathogens and Pathogenic Products; Infectious Disease; CDC Category Agents; Select Agents; BTRA]

FIGURE 2.2 The universe of potential genes and sequences that could be drawn upon to create a biological weapon involves all biology. From “All Biology,” some pathogens and pathogenic products such as toxins, venoms, and others may be known to cause disease or death in humans, animals, or plants. In the context of human health, some are recognized as causes of infectious diseases and are reasonably well characterized. Among all infectious diseases, some are further classified as Category A, B, or C pathogens by CDC or NIH, and they may or may not be assigned a biosafety level of laboratory containment (BSL-1, -2, -3, or -4) in the BMBL. Of these, some are designated Select Agents, and a few are prioritized under the DHS Bioterrorism Risk Assessment.
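
The synthesis-cost estimates quoted in the Top-Down discussion are flat-rate arithmetic: genome length multiplied by price per base. The following minimal Python sketch reproduces that arithmetic under the text's assumption of $0.10/base. The function name and genome labels are illustrative, not from the report, and the RNA genome range is read as 7 to 32 kb, which is what the quoted $700 to $3,200 implies. Assembly, quality control, and other manufacturing costs mentioned in the text are deliberately left out.

```python
def synthesis_cost_usd(length_bp: int, cents_per_base: int) -> float:
    """Flat-rate synthesis cost in US dollars: bases times price per base.

    The price is given in whole cents so the product stays exact.
    """
    return length_bp * cents_per_base / 100


# Genome sizes quoted in the text, in base pairs; labels are illustrative.
GENOMES_BP = {
    "RNA genome (low)": 7_000,
    "RNA genome (high)": 32_000,
    "DNA genome (low)": 150_000,
    "DNA genome (high)": 300_000,
    "bacterial genome": 500_000,
}

ASSUMED_CENTS_PER_BASE = 10  # the text's assumed $0.10/base

for label, length in GENOMES_BP.items():
    cost = synthesis_cost_usd(length, ASSUMED_CENTS_PER_BASE)
    print(f"{label:18s} {length:>8,d} bp  ${cost:>10,.2f}")
```

At a flat $0.10/base, a 300 kb genome comes to $30,000 rather than the $20,000 upper figure quoted in the text, so that figure presumably rests on an assumption beyond the flat rate; the other quoted amounts match the simple product.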

Bottom-Up Approach

The most commonly cited concern is that synthetic biology will allow the de novo design and creation of new life forms, such as synthetically designed killer viruses and microorganisms, either from scratch (which is extremely unlikely) or through the combination of existing gene sets from multiple viruses (which is more likely). Humans are surrounded by large numbers of intentionally genetically modified organisms (for example, domesticated corn from the small grass teosinte and domesticated microorganisms used in beer, wine, cheese, and yogurt production) that were generated by breeding, artificial selection, or, most recently, direct genetic engineering. In fact, 30 years of genetic manipulation with recombinant DNA technology in thousands of laboratories around the world has not resulted in any human-targeted super virus or microorganism. Although that does not preclude it in the future, it raises the question of whether the degree of concern is exaggerated. Will synthetic designer genomes be more dangerous to society than other highly pathogenic microorganisms that already exist in nature, such as measles (about 150,000 deaths per year), HIV (about 2 million deaths per year), influenza and respiratory syncytial virus (RSV) (about 0.5-1 million excess deaths per year), organisms that cause diarrheal diseases (more than 2 million deaths per year), or malaria (more than 1 million deaths per year)?

The scientific community does not have sufficient knowledge to create a novel, viable life form, even a virus, from the bottom up. Designing an infectious viral genome de novo by sequence requires the accurate prediction of protein structure and function and the design of protein-protein interactions and protein machines, all of which must produce progeny virions efficiently in a host cell that is an order of magnitude more complex.
If we cannot predict protein structure and function on the basis of sequences with any accuracy, how can we design and synthesize novel viruses that will replicate, regardless of their disease potential?

Alternatively, de novo design could focus on existing gene sets by emulating or copying known functions. That is also exceptionally difficult. First, entry requires specific interaction between viral and cellular components, including the deliberate orchestration of a series of sequential conformational changes that mediate docking and entry of the virus genome (particle) into the cell. Entry is often regulated by co-receptors and other host factors, such as proteases. While the process is clear, the ability to design these highly regulated systems de novo is in its infancy. Second, a functional replication complex is needed to replicate the viral genome. Replication complexes are complex multicomponent protein machines (made up of viral and perhaps cellular proteins) that specifically engage and replicate nucleic acid. The formation of a replication complex includes tailored protein-protein and protein-nucleic acid interaction networks that are not known and cannot be predicted or engineered with existing technology. Pathogenesis involves a regulated set of virus-host interaction networks that influences host responses that antagonize or potentiate disease; these networks are poorly understood and cannot be designed de novo. Virulence genes work together, and the levels of expression that permit virus persistence or spread and transmission, depending on the replication mechanism, are highly regulated. Finally, efficient virus egress from the cell usually involves discrete cellular and viral constituents. The protein-protein interactions, the regulation of the components involved in efficient release, and the design of de novo systems are beyond the capacity of the scientific community. Piecing together a new life form from defined parts is difficult enough (it is a misconception that a viable de novo microorganism can be designed today directly from sequences and a pool of uncoupled gene parts); it would be even more difficult to predict the virulence of such a microorganism if it were assembled and recovered.

Synthetic Biology: Summary

Synthetic biology offers considerable opportunity to improve human health and solve planetary problems in energy, food production, medicine, and public health. However, dual-use applications are of legitimate concern; new synthetic DNA technology alters the old paradigms that regulate pathogen bioavailability, but it is the traditional top-down approaches7 that present the greatest threat of altering virus virulence and pathogenesis. Bottom-up approaches remain extraordinarily complex, and it seems unlikely that sufficient vision and understanding exist to design and recover a true human pathogen using this approach.
The time frame for the realization of the synthetic biology revolution is impossible to predict and will depend on whether it remains unfettered and on the support that it receives; undoubtedly, however, it will be a reality within the lifetime of most who read this document.

7 See Chapter 3 for more discussion of the relative threats presented by these approaches.

WHAT CAN CURRENTLY BE PREDICTED FROM SEQUENCE ABOUT THE IDENTIFICATION OF PATHOGENIC MICROORGANISMS, INCLUDING SELECT AGENTS?

It is abundantly clear that the use of sequence alone to predict a naturally occurring or synthetic pathogenic microorganism accurately is infeasible: sequence cannot provide biological context. Whether synthetic or natural, a sequence by itself has no biological properties. However, a remarkable amount of information can be deduced from a sequence of genes or proteins. To begin with, it is clear that all the microorganisms that we deal with can be divided into relatively well-defined general groups. For example, among the groups of bacteria, viruses, and eukaryotic pathogens, a core of conserved genes of known and hypothetical function makes up the unique set of properties that permit a specific group of microorganisms to occupy, persist, and replicate in a particular biological niche, perform certain metabolic functions, and interact with other living things. That core of genes, which generally constitutes about three-fourths of the genes specific to a microbial group, is the basis of much of contemporary taxonomy. A remaining set of genes is concerned with specialization. For our purposes, the specialization is pathogenicity and virulence. We can confidently “predict” the sequence of most of the important microbial toxins or at least suspect, on the basis of the structure of a protein or the deduced catalytic site decoded from a nucleic acid sequence, that a toxin might be encoded therein. The common features and themes of many bacterial virulence genes are known. Just as we can tell from the sequence of “core” genes that we are dealing with a bacterium of the genus Yersinia or a poxvirus, we can accurately say whether there is a precise sequence of a known virulence gene associated with the plague bacillus or with smallpox. It might be difficult to trace the precise origin of a microbial sequence to a specific Select Agent, but it is surely less difficult to say with some accuracy that a particular sequence is related to a known virulence determinant and to a class of virulence genes with known function. The availability of complete microbial genomes, as well as the genomes of their hosts, has made it possible to combine genetic and molecular methods not only to identify microbial genes that are expressed only in certain animal hosts, including humans, but also to determine genes that are essential for pathogenicity using global screening methods. The methods themselves depend on gene sequencing technology.
Such lists of genes exist today for Salmonella, Mycobacteria, Yersinia, and the pneumococcus; we believe that they probably exist for most Select Agents in at least a surrogate animal model. Where such lists do not exist, it is relatively easy to determine them experimentally.

The utility of sequence information depends on a number of factors. In some circumstances, the quality of a sequence may be a factor, for example, if it comes from a small sample from an infected individual. The length of a sequence is, of course, also a factor. If a sequence is relatively short, it may be of no immediate consequence unless it can be shown to be part of a much larger assembly of genes. What would be required to determine that a sequence came from a known Select Agent? On the one hand, if the sequence were from a known virulence gene or from a specific (as opposed to core) region of the genome, one could say with a good deal of confidence that it was from that organism. On the other hand, if the sequence is from a core common to pathogenic and non-pathogenic microorganisms, as is common for enteric bacteria (and others), the information is less useful, although it is of value from a surveillance standpoint to eliminate suspicion. Difficulty would arise if a sequence were from a common set of core genes (let us say, a core common to the Bacillus group) and there were sequences that resembled the plasmid-mediated toxin of B. anthracis but were clearly distinct from the known anthrax virulence determinant. That might heighten the suspicion. However, the level of concern would depend on the biological context. If the sequence were derived from a patient who had sepsis, one would be appropriately alarmed. If the sample were from the soil, the level of suspicion would be far less because soil is a natural habitat for members of the genus Bacillus and there are harmless bacteria that have plasmids encoding related toxin molecules. If the sample were submitted to a DNA synthesis company, there would be cause to learn more about its source and about the reason the DNA was being synthesized.8 There is no foolproof method for predicting whether a sequence will have the biological properties of a Select Agent (such as pathogenicity), but the common-sense use of the considerable amount of sequence data we now have, combined with advances in understanding of microbe-host interactions, does provide a mechanism that is practical and sufficiently flexible to give guidance about the potential biological consequences of a DNA sequence from nature or from the biochemist’s bench.

None of this is meant to trivialize the difficulty of understanding the biological importance and function of microbial determinants of pathogenicity or the complexity of microbe-host interactions. Although we have the complete sequence of many of the most feared microbial pathogens and of their host animals or plants, our attempts to devise novel vaccines or therapeutic agents have been difficult at best. If genomics has taught us anything at all about microbe-host interaction, it is just how intimate and intertwined the biology of the two participants may be.
That is not always obvious if one thinks about the ravages of the Black Death of the 14th century or the horror of Ebola infection. Those are the nightmares of those who deal with bioterror, but the most efficient killers of the world are still infectious agents that have a longer-lasting intimate association with humanity. It is not easy to develop antiviral drugs when the interactions of the viral components and the human host are so closely intertwined that inhibiting the invader may mean poisoning the host. Nor is it easy to develop vaccines against many bacterial pathogens that have evolved to interact in subtle, specific ways with the biology of the human immune system. Continued investment in genomics, bioinformatics, and basic research on infectious diseases of humans, animals, and plants will help us develop new ways to control diseases that plague our planet.

Nevertheless, it is our judgment that the sequence information now available, when combined with that being obtained on a daily basis and deposited in public data banks, can be used to provide a new approach to identifying and monitoring microorganisms of interest much more effectively than the construction of rigid lists that reflect biological reality poorly and therefore can provide only limited utility for national security. Thus, in Chapter 3 we discuss how the Select Agent list might be clarified by sequence-based classification to better circumscribe the taxonomic distinctions blurred by natural and synthetic variation and modification. Moreover, we present a “yellow flag” biosafety system; this approach is not regulatory and therefore could provide sequence information relevant to biosecurity in a more dynamic and timely manner than the Select Agent Regulations.

8 A common question in this regard is, who should carry the burden: those submitting the sequence for synthesis, the synthesis companies, or government?
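
The context-dependent reasoning described in this section (what a sequence matches, and where the sample came from) can be written down as a small decision table. The Python sketch below is only an illustration of that reasoning: the category names, the function `triage`, and the returned labels are all hypothetical and are not part of the committee's proposal or of any screening tool.

```python
from enum import Enum, auto


class Match(Enum):
    """What the sequence resembles (hypothetical categories)."""
    KNOWN_VIRULENCE_GENE = auto()
    AGENT_SPECIFIC_REGION = auto()    # specific, as opposed to core, region
    SHARED_CORE = auto()              # core common to pathogens and non-pathogens
    TOXIN_LIKE_BUT_DISTINCT = auto()  # resembles a known toxin but differs from it


class Context(Enum):
    """Where the sample came from (the three cases named in the text)."""
    PATIENT_WITH_SEPSIS = auto()
    SOIL = auto()
    SYNTHESIS_ORDER = auto()


def triage(match: Match, context: Context) -> str:
    """Return an illustrative level of concern for a sequence in its context."""
    if match in (Match.KNOWN_VIRULENCE_GENE, Match.AGENT_SPECIFIC_REGION):
        # The text: one could say with confidence it came from that organism.
        return "likely from the known agent"
    if match is Match.SHARED_CORE:
        # The text: less useful, though of value for eliminating suspicion.
        return "uninformative alone; may help rule out suspicion"
    # Toxin-like but distinct: the level of concern depends on context.
    by_context = {
        Context.PATIENT_WITH_SEPSIS: "alarm",
        Context.SOIL: "low suspicion",
        Context.SYNTHESIS_ORDER: "ask about source and purpose",
    }
    return by_context[context]


print(triage(Match.TOXIN_LIKE_BUT_DISTINCT, Context.SOIL))
```

The point of the sketch is that the same sequence match maps to different responses depending on context, which a static list of agents cannot express.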