3
A Proposal for Consideration: Sequence-Based Classification of Select Agents

The previous chapter concluded that the primary answer to our charge is no—prediction of Select Agent status by genome sequence analysis is not feasible.

First, in Chapter 1, the committee found that prediction of Select Agent status is not possible, because Select Agent status involves economic, historical, and policy considerations beyond the biological properties of the organism encoded in its genome.

Second, the committee finds that even if it were possible to assign Select Agent status solely on the basis of genome-encoded biological properties, the answer would remain no. Chapter 2 described why accurate prediction of an organism’s pathogenicity from its genome sequence is not possible now, and will not be feasible in the foreseeable future—certainly not at the level of accuracy appropriate for statutory regulations. Reliable prediction of the hazardous properties of pathogens from their genome sequence alone will require an extraordinarily detailed understanding of host, pathogen, and environment interactions integrated at the systems, organism, population, and ecosystem levels. For the foreseeable future, the only reliable predictor of the hazard posed by a biological agent is actual experience with that agent.

The committee was charged to identify, supposing that the answer to these questions would be no, “the scientific advances that would be necessary to permit serious consideration” of such a predictive framework. The committee believes that we are so far from the goal of a predictive framework that it is premature to plan specific steps towards a Select Agent regulatory system based on predictive genome sequence analysis.

As described in Chapter 2, it is a major goal of all biology to understand how DNA sequence determines the properties of biological systems, ranging in complexity from single macromolecules to pathways, organisms, populations, and ecosystems. Successes in prediction and design at each subsequent level of complexity in biology as a whole are the relevant milestones to watch for, before we will be able to predict confidently from genome sequence analysis how a designed organism would replicate, interact with a host, evade a host immune system, and spread in a population to cause disease. The committee’s view is that for the specific purposes of the Select Agent Regulations, those general biological milestones should be passively monitored, not actively sought. A narrow focus on such milestones for the sole purpose of being able to predict what makes Select Agents dangerous may be a distortion of priorities in biology, and may also raise concerns about dual use. The ability to predict pathogenicity from genome sequence automatically confers the ability to design genome sequences of pathogens.

However, the committee is not satisfied with answering its charge narrowly and in the negative. The rapidly expanding capabilities of automated gene synthesis and of synthetic genomics to synthesize and “boot” complete Select Agent genomes mean that the Select Agent Regulations do need to be defined in terms of genome sequence analysis, not by the phenotypic properties of an encoded agent. A Select Agent genome is covered by the Select Agent Regulations whether or not it is ever “booted” into a living agent whose phenotype can be assayed. A DNA synthesis company needs to be able to tell, unambiguously and by sequence alone, if it is being asked by a customer to synthesize the genome of a Select Agent. That determination would not be a problem if each Select Agent had a unique genome sequence.
However, discrete taxonomic nomenclature in biology is already challenged by the great diversity and continuum of organisms observed in natural wild isolates, and the rapidly expanding ability of synthetic biology to create highly modified variants and chimeras of naturally occurring genomes poses an even greater challenge to taxonomic naming systems. Select Agent pathogens, like any biological organism, are not defined by a single DNA sequence. Given natural wild variation and the conceivable range of tolerable synthetic variation, a “cloud” of related sequences of similar biological properties are all assigned the same taxonomic name. There may be sequences that are just as closely related but are not Select Agents, including vaccine strains and attenuated research strains that the U.S. government explicitly wants to avoid encumbering with the Select Agent Regulations.

In its deliberations, the committee found that it was useful and important to distinguish sequence-based prediction of biological properties from sequence-based classification. A regulatory system based on prediction must be able to recognize that an entirely novel genome sequence (unrelated to any known sequence) encodes a pathogen that should be assigned Select Agent status. A regulatory system based on classification “merely” tries to decide whether a sequence is sufficiently similar to that of an already-known, already-named Select Agent to be assigned the same taxonomic name and status. The two concepts are easily confused and sometimes conflated because the state of the art of prediction from biological sequence is generally not based on a physics-based prediction of the molecular structure and function of the parts encoded by a genome, but rather on sequence comparison and classification: If one sequence is similar to another known sequence, it is assumed that they share evolutionary ancestry and have similar biological functions.

This chapter explores the conceptual difference between predictive systems and classification systems and considers the ramifications of using sequence-based classification for the Select Agent Regulations. In a narrow sense, the committee has addressed its charge by explaining that prediction-based systems are not feasible. However, the committee interprets its charge more broadly and in this chapter moves beyond infeasible predictive systems to consider a feasible sequence-based classification system. The reader may want to view this chapter as a proposal for consideration, rather than as a recommendation. However, in the sense that sequence-based classification is conflated with rough and limited prediction in biology, this chapter is the committee’s positive and constructive answer to its charge. We discuss how a sequence-based classification system might be used to encompass what we believe are the most technically feasible and likely scenarios whereby synthetic genomics and synthetic biology could be used to construct a hazardous agent with Select Agent properties. A sequence-based classification system would still be based on a discrete list of Select Agents, but could be used to create a pragmatic “brighter line” for deciding whether a new genome sequence should be regarded as one of the existing Select Agents or not.

NOVEL AGENTS: SYNTHETIC GENOMICS AND THE SELECT AGENT REGULATIONS

We need to examine what we mean by a “novel” synthetic agent. This chapter flows from the strong premise that for the foreseeable future, synthetic pathogens (at least those regulated by the Select Agent Regulations) will be composed largely or entirely of genes derived from existing pathogens. That premise requires careful consideration. We want to outline the most likely threat scenarios that might be created with synthetic genomics, and we want to consider what we should regulate at the level of possession and transfer of specific agents with regulations like the Select Agent Regulations, as opposed to overarching statutory prohibition of use and development of offensive biological weapons under the Biological Weapons Convention, or as opposed to prudent laboratory biosafety guidelines for handling of any pathogenic organism.

Synthetic genomics poses three main threat scenarios that would allow a “bad actor”[1] to obtain a pathogenic organism suited for use as a weapon:

• The bad actor orders a synthetic DNA (or RNA) genome of a known Select Agent. The exact sequence may be modified, either in an attempt to alter the phenotype of the agent (perhaps to introduce drug resistance, increase pathogenicity, or alter host range) or in an attempt to circumvent the Select Agent Regulations themselves (by making genetic changes intended to have little or no effect on pathogenicity, but to create ambiguity about the taxonomic classification of the organism). The bad actor then “boots” the synthetic genome into a full organism. We will call such an organism a “modified Select Agent” and we will describe its creation as “modification” of an existing organism.
• The bad actor “assembles” a synthetic pathogen by combining parts (genes and regulatory sequences) of known organisms, for example by creating a chimera of two or more viruses, or by attempting to express genes that encode a toxin or mechanism of pathogenesis in an otherwise innocuous “chassis” of a non-Select Agent host, such as any of several commonly used viral vectors. This would include cases in which no individual part originates in a Select Agent.[2] We will call the organism created by this scenario a “chimera,” and we will describe its creation as “assembly” of existing parts.
• The bad actor “designs” a synthetic pathogen by creating entirely new gene sequences, dissimilar to any known pathogen gene sequences in nature. We will call the organism created by such a process a “de novo” novel agent, and we will describe its creation as “design” (although sequences may be selected from randomized pools by high-throughput in vitro evolution or selection rather than actually designed).

Obviously, there is a continuum among these three scenarios.
As more new genes are moved into an existing organism, the line between modification and assembly blurs. Only part of an organism may be designed, and the rest assembled or modified. The important distinction is between genetic modification of an existing organism in essentially its original order, assembly of parts of known organisms in new combinations in new orders, and creation of entirely new gene sequences dissimilar to any known pathogen genes.

[1] “Bad actor” is used to mean “an individual or group with nefarious intent.” “Individual” or “person” does not convey the full meaning; “terrorist” was viewed as too strong or specific a term. In addition, a “bad actor” may be either an individual or a group.

[2] For example, experiments unexpectedly found that inserting a mouse interleukin-4 (IL-4) gene into ectromelia virus (mousepox) created a highly pathogenic virus against mice. Based on these results, there is concern that a related human virus like vaccinia, engineered to express human IL-4, could become highly pathogenic for humans, though neither vaccinia (the smallpox vaccine virus) nor the IL-4 gene by themselves are Select Agents.

The three scenarios are ordered by increasing technical difficulty and therefore by decreasing likelihood (see Table 3.1). We want to be sure that we are dealing with the more likely scenarios before worrying inordinately about less likely ones.

The first scenario is the simplest, easiest, and most likely to work in the absence of an expensive research and development program; therefore, it is the most dangerous. Most Select Agent viruses can now be booted from synthetic DNA, some relatively easily (such as small positive-stranded RNA viruses) and some with great difficulty (such as large DNA viruses like variola). As the scope of DNA synthesis increases and the technology becomes commoditized, an increasing number of Select Agents can be reconstituted with a modest level of skill in molecular biology. Synthetic genomes identical to complete Select Agent genomes are already clearly covered by the Select Agent Regulations. Whether a modified sequence is still called a Select Agent or not (the question of when the magnitude of modification requires a new discrete name) is a gray area in the current Select Agent Regulations.

TABLE 3.1

“Modified Select Agent”
  Three threat scenarios posed by synthetic genomics: Created by “modification” of an existing organism.
  Feasibility of scenario given the state of the art: Feasible. Modification is routine genetic engineering.
  Potential solution: Sequence-based classification. Anchored around a taxonomic name and a full genome reference sequence, this would define the “space” around each select agent, essentially translating the regulatory language into an operational definition that accommodates the biological complexity.

“Chimeric Select Agent”
  Three threat scenarios posed by synthetic genomics: Created by “assembly” of existing parts.
  Feasibility of scenario given the state of the art: Possible. Assembly is a frequent molecular biology technique as an extension of “modification.” Assembly of a radically novel agent is beyond the state of the art.
  Potential solution: Gene-sequence-based classification. This would identify individual “parts” of genomes (beginning with select agents), define “space” around each part, and determine which “parts,” when assembled, are operationally considered a “complete” genome for the purpose of the Select Agent Regulations.

“de novo Select Agent”
  Three threat scenarios posed by synthetic genomics: Created by “design” (or by high-throughput in vitro evolution or selection).
  Feasibility of scenario given the state of the art: Improbable. Beyond current capabilities; if possible at all, likely to require extensive experimental selection, refinement, and testing in susceptible hosts.
  Potential solution: Sequence-based function prediction. Neither design nor prediction is currently possible. Attempting to design and create a bioweapon is prohibited by BWC and USC Title 18 Section 175(b), among others.

The second scenario is plausible, but unlikely to work without an extensive research and development effort. Chimeric assemblies are the leading edge of early successes in “synthetic biology.” There are efforts to create standardized “parts”: genes or combinations of genes that have functions that are readily transferable into an organism “chassis” that provides basic biochemical, structural, and replication functions (Kwok 2010). The simplest chimeric pathogens would express a toxin gene in a chassis, but the most dangerous toxin genes are already covered by the Select Agent Regulations regardless of what organism they are in. Chimeric pathogen weapons that evade the current Select Agent Regulations would require the assembly of less obvious pathogenesis mechanisms than expression of known Select Agent toxins. The more such an assembly deviates from a known organism, the less likely it is to work, for all the same reasons that prediction of pathogenicity and other phenotypes from genome sequence is not possible. It is not possible with the current state of knowledge to predict and foresee all detailed interactions of gene products that determine an organism’s overall phenotype. It would require experimental characterization in suitable hosts to be sure that a chimeric weapon worked as intended. For some Select Agents, there are no surrogate experimental hosts for characterizing virulence, and the only suitable host for a human pathogen may be a human. Those considerations raise the research and development bar substantially, and expose such a program to existing legal prohibitions other than the Select Agent Regulations.
Thus novel chimeric constructions are unlikely as terror weapons,[3] although they are alleged to have been used in national offensive bioweapons research programs.

[3] Or at least as functionally dangerous pathogens. A release of a chimeric construct of suitably scary-sounding parts could cause societal damage from fear and panic even if it were utterly ineffective as an organism. The number of such imaginable scenarios is enormous, probably beyond the scope of the Select Agent Regulations to regulate effectively without overly impeding beneficial biomedical research. The best countermeasure against “mock” chimeric weapons is likely to be a resilient public health emergency response and communication system.

The third scenario is beyond the capabilities of current biology. There is no example of a designed organism or even of a designed genetic pathway, let alone a designed pathogen. Designing a self-replicating organism that has only to interact with simple molecules in a test tube is one thing, and it is hard enough; designing a pathogen that has to interact with a complicated host, evade its immune system, and be transmissible in the natural environment adds daunting layers of biological complexity. There are very few examples of single protein sequences that have been designed to fold in a particular novel way (Kuhlman, Dantas et al. 2003). These first few modest successes in de novo design of single proteins constitute the current state of the art. Design and prediction go hand in hand; our lack of predictive ability in biology means that we cannot now design genomes.[4]

Synthetic biology’s use of metaphors like “booting” a genome into a living organism, or use of a well-known organism such as Escherichia coli as a “chassis” for hosting synthetic constructs (Lee, Chou et al. 2008), may be misleading about the likelihood of de novo design. Among synthetic biologists, this metaphorical language emphasizes the long-term goal of making biological systems as engineerable as computers or machines; but the language also tends to trivialize the complexity of biological systems and the enormous gaps in our understanding of them by making it seem (perhaps especially to non-biologists) that we can already engineer biological systems easily. There are examples in which synthetic genomes have been “booted” into living viruses (Cello et al. 2002) and now even cells (Lartigue et al. 2009; Gibson et al. 2010), but it must be remembered that these experiments have “only” synthesized minor variants of known natural genome sequences. An E. coli “chassis” has been used to express genes for complex synthetic functions, such as the production of pharmaceutically important natural plant products like the antimalarial drug artemisinin (Chang et al. 2007), or of bulk chemicals like 1,3-propanediol (Bio-PDO, DuPont), a starting point for synthesis of plastics (patent WO/2004/101479), but these efforts have required massive multiyear iterative bioengineering and development processes “just” to move known genes from other organisms into the E. coli system and get them to work as desired. All existing (and reasonably foreseeable) uses of synthetic biology involve modification or rearrangement of existing biological components. The entirely de novo design of genomes and organisms remains science fiction.

We have distinguished three kinds of novel synthetic organisms because we believe that there is a tendency to imagine nightmare scenarios in which a de novo unnamed pathogen, dissimilar to any known pathogen and thus unrecognizable by any sequence comparison protocol, is created deliberately or accidentally with synthetic genomics. Clearly, a regulatory system like the Select Agent Regulations based on a list of known agents and their genome sequences is not effective for regulating entirely de novo agents. Such a concern seems to have been behind the charge to our committee. If one were worried about prohibiting possession and transfer of de novo agents at the point of their creation, one would need to develop a forward-looking system that predicts the Select Agent status of a de novo genome sequence, rather than a backward-looking system based on a taxonomy of known Select Agents. However, as discussed in Chapter 2, such a prediction-based system is not feasible.

[4] An important exception to the concept that design and prediction go hand in hand is that it is feasible to select de novo sequences for particular functions by high-throughput in vitro evolution or selection, and thus make it possible to arrive at functional sequences without designing or understanding them. Although selection and directed evolution have been widely applied at the level of individual RNAs or proteins, no novel genome sequences have been created this way. Given the complexity of the problem (the number of possible DNA sequences rises in proportion to 4^N for a sequence of length N), it is extremely unlikely that complete de novo genomes could be selected in a small number of iterations by such procedures. If it were done, it would require a long effort of iterative artificial evolution, which is expensive and complicated work beyond any current research program in biology.

We instead take the view that not only is the creation of de novo agents the most unlikely scenario, but also that it is not and cannot be the purpose of the Select Agent Regulations to regulate de novo agents any more than it is their purpose to regulate access to novel emerging diseases in nature. The Select Agent Regulations should not aim to prevent access to all possible pathogens. There is no way to anticipate all possible novel pathogens. The Select Agent Regulations should aim to impede access to the most dangerous known pathogens. The best defense against the unquantifiable threat of novel synthetic pathogens is not the Select Agent Regulations, but continued enhancements of the laboratory and clinical biosafety measures that we already have to deal with the real and measurable threat of emerging natural pathogens. When a novel agent emerges, research is initiated to study its mechanism of action, the potential threat that it presents, and its susceptibility to countermeasures. When the novel agent has been shown to meet the criteria for Select Agent regulation (with respect to not only its genome-encoded biological properties, but other medical and policy considerations), its name can be added to the Select Agent list so that future access to it can be regulated.

It is our view that the main biosecurity threat scenarios start with accessibility of proven known pathogens, so it is reasonable for the Select Agent Regulations to be backward-looking and based on a list of known agents.
These regulations protect us by restricting the availability of agents that we know from experience to be extraordinarily dangerous and that have a high potential for biowarfare or bioterror.

The foregoing establishes the premise for the remainder of this chapter. Modified Select Agents, made facile by the commoditization of synthetic genomics, constitute the most important and pressing practical issue related to the Select Agent Regulations. The taxonomic nomenclature of microorganisms is designed for wild isolates of actual organisms that have observable growth phenotypes, not for non-natural modified sequences that exist only as genomic DNA. A system based on (natural) taxonomic nomenclature does not establish a bright line that is sufficient for clear statutory regulation of possession and transfer of synthetic genomic DNA sequences.

As a specific working example, consider the situation of a DNA synthesis company. A DNA synthesis company is capable of synthesizing complete genomes, and the company (if unregistered) formally violates the Select Agent Regulations if it possesses or transfers a synthetic Select Agent genome.[5] Who judges whether a DNA sequence constructed by the company is considered to be a Select Agent? How similar to a Select Agent reference sequence does a DNA sequence have to be to still be deemed a Select Agent? Currently, each gene synthesis company must define for itself the sequence boundaries around each Select Agent, with only minimal guidance from regulators (DHHS 2009). The companies understandably want increased clarity about what sequences are and are not covered by the Select Agent Regulations. By addressing this issue as an example, we would also deal with a number of other scenarios in which synthetic genomics might be used to create modified Select Agents. And we will also be able to deal with some of the most obvious and likely ways that chimeric agents might be assembled with synthetic biology.

[5] Although we use synthetic genome orders as an illustrative example, we must recognize that from the standpoint of impeding bioterror scenarios, there will be myriad ways to get around any screening procedure used by DNA synthesis companies, ranging from splitting an order into apparently innocuous pieces among multiple companies to using offshore companies that do not adhere to U.S. regulations, to simply not using a DNA synthesis company at all. The technology of DNA synthesis is rapidly being commoditized, and DNA oligonucleotide synthesis machines can already be purchased cheaply from eBay. An ebay.com search on “oligo synthesizer” on 10 October 2009 found a used Applied Biosystems 394 DNA/RNA oligo synthesizer on sale for $8,900 (plus $106.16 shipping within 3-8 business days to a committee member’s home in Northern Virginia). With great difficulty and specialized technical skills, genes and even whole genomes can be assembled from individual short oligonucleotides. In much the same vein, a determined bioterrorist can obtain isolates of a Select Agent from the wild. The Select Agent Regulations can only raise the difficulty bar for acquiring cultures of proven highly virulent agents and provide law enforcement with tools to prosecute for possession of variants of such agents; because natural biological organisms are widely available, readily engineered, and increasingly easy to create, it is unrealistic to try to design the Select Agent Regulations to preclude acquisition completely.

As discussed in the next section, we take the view that the most pressing issues can be treated as a sequence classification problem more than a sequence-based prediction problem.

CLASSIFICATION IS DISTINCT FROM PREDICTION

Whereas sequence-based prediction of the properties of Select Agents will not be feasible in the foreseeable future, sequence-based classification can be addressed with current technology. At least at the level of individual gene sequences, there is an extensive literature on methods for automated classification of sequences into operationally defined taxonomies. Cellular organisms are routinely classified taxonomically by using small subunit ribosomal RNA sequence comparisons. Everyday examples of assigning protein sequences into annotated families include databases such as Pfam, InterPro, and TIGRFAMs. The existing literature does not quite suffice for the problem of sequence-based classification of complete Select Agent genomes. Viruses lack ribosomal RNA, for one thing; for another, we need to think about distinguishing a complete genome from a partial one; and we need to worry about artificially modified and chimeric constructs that would not be constrained to follow the patterns of natural sequence evolution. However, the adaptations needed to define complete Select Agent genomes operationally are fairly obvious, as we will discuss.

The computational sequence analysis technologies used for sequence-based classification define “sequence spaces” that circumscribe the known variation of sequences that are considered to belong to a useful name, while excluding the known variation of sequences that are considered to be attached to different names. Therefore, a necessary precondition is to have a number of representative sequences that belong to the desired classification and a number of the most closely related sequences that do not belong.

An important principle of automated classification (also known as “supervised learning” methods, in statistical inference) is that given the known sequences of things that we want to label as Select Agents and things that we do not want to label as Select Agents, there is always a classification scheme that can achieve the desired labeling of known sequences with 100 percent accuracy. The important concern is not that a classification system would misclassify a known sequence, but rather how well a classification system generalizes to correctly classify new sequences that it has not seen before. The existing methods for sequence-based classification of protein and DNA sequences provide a flexible set of software tools for human experts to use to define appropriately generalized sequence spaces on a case-by-case basis using expert knowledge. The methods enable 100 percent classification accuracy on known sequences (essentially by definition, since the classification system is defined on the known sequences already labeled with a set of known labels), can be expected to perform reasonably well on new sequences, and are readily updated if and when erroneous classifications occur.

The basic principle of sequence-based classification is simple (Figure 3.1).
A sequence is assigned the name of the known group that it is “closest” to, as long as it is also within the range of known variation accepted for that group. If it falls outside the range of known variation of any known named groups, it is operationally declared a “novel” sequence, possibly within some larger and more broadly defined sequence family. The key is the definition of close, that is, of what distance means in comparing sequences; various automated computational procedures differ in the details of their approach.

Sequence-based classification is related to taxonomic (species-level, evolutionary) classification, but the two do not always coincide. They coincide when the desired sequence classification corresponds to slowly evolving traits that are shared amongst a “clade” of evolutionarily related organisms to the exclusion of more distantly related organisms. For example, all variola (smallpox) virus isolates have sequences that are more similar to one another than are variola virus and the most related non-Select Agent orthopoxvirus. Sequence-based classification may differ from evolutionary classification for rapidly or recently evolved traits, particularly if only a small number of crucial changes are involved and the same sequence changes and phenotypes have evolved more

FIGURE 3.1 Profile-Based Classification System (distance). This graphic shows a tree based on sequence similarity, with points labeled SA1, Non-SA1, X1, X2, and X3. The Select Agent has high sequence similarity to the non-Select Agent nearest neighbor. The shaded spheres indicate the sequence “space,” or known variability, surrounding an agent’s taxonomic name.

• X1 sequence is within SA1 “space” and would be classified as Select Agent1.
• X2 sequence is most similar to non-SA1 and would be classified as non-Select Agent1.

IMPORTANT: this is not prediction! X1 may NOT be a pathogen. X2 may be a pathogen. This classification system identifies and clarifies what is subject to Select Agent Regulations. It cannot predict what is or is not a pathogen.

IMPORTANT: classification does NOT designate new Select Agents. It identifies sequences as belonging to a known sequence “space.” If the sequence does not belong to an established “space,” it is novel.

• X3 is a novel sequence that is more similar to the Select Agent than the non-Select Agent. This would initially be classified as a non-Select Agent. As biological information is acquired, this agent would be evaluated and might be added to the Select Agent list; or it would be considered another near neighbor, non-Select Agent sequence.

IMPORTANT: acquiring sequence data and biological information is needed to define the space around Select Agents and close the gap around novel sequences.
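The logic of Figure 3.1 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than a method proposed by the committee: the mismatch-fraction distance, the `radius` parameter standing in for an expert-set range of accepted variation, the names `SequenceSpace` and `classify`, and the made-up ten-letter sequences. A real system would use curated profile models rather than a simple mismatch count.

```python
from dataclasses import dataclass
from typing import List


def mismatch_fraction(a: str, b: str) -> float:
    """Toy distance: fraction of differing positions in two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


@dataclass
class SequenceSpace:
    name: str            # label attached to the space, e.g. "SA1"
    members: List[str]   # known representative sequences
    radius: float        # assumed expert-set limit of accepted variation

    def distance(self, query: str) -> float:
        # Distance to the closest known member of this space.
        return min(mismatch_fraction(query, m) for m in self.members)


def classify(query: str, spaces: List[SequenceSpace]) -> str:
    """Name of the closest space if the query lies inside it, else 'novel'."""
    nearest = min(spaces, key=lambda s: s.distance(query))
    return nearest.name if nearest.distance(query) <= nearest.radius else "novel"


# Made-up data mirroring the labels in Figure 3.1.
SPACES = [
    SequenceSpace("SA1", ["AAAAAAAAAA"], radius=0.2),
    SequenceSpace("non-SA1", ["AAAAATTTTT"], radius=0.2),
]
X1, X2, X3 = "AAAAAAAAAT", "AAAAATTTTA", "AAAAAAACCC"

for label, seq in (("X1", X1), ("X2", X2), ("X3", X3)):
    print(label, classify(seq, SPACES))
```

In this toy data X3 is closer to SA1 than to non-SA1 but lies outside both spaces, so it is reported as novel rather than assigned a name, which matches the treatment of X3 in the figure.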

OUTLINE OF A POSSIBLE SYSTEM FOR PROFILE-BASED CLASSIFICATION OF SELECT AGENTS

With current sequence analysis technology, it would be possible to develop an automated and precisely defined system for classifying genome sequences as to whether they are “complete” Select Agents. The system might look something like the following.

For each Select Agent, a minimal parts list would be identified by experts. It would be the set of genes that are thought necessary to make an infectious Select Agent genome. The parts list does not have to be exhaustively complete, because it is being used only to classify genomes, not to describe them fully; we might choose to include only a representative subset of the genes in a microbial genome to reduce the size of the classification system (and the work needed to create it). A genome that contained all the parts on the minimal parts list would be classified as a “complete, infectious” genome for operational purposes of the Select Agent Regulations. A genome that did not contain all of the parts would be a “genomic fragment” for the purposes of the Select Agent Regulations.

For each part, an automated profile-based classification system would be developed to differentiate the subfamily of sequences belonging to the Select Agent from the larger family of sequences belonging to non-Select Agent organisms. The specificity of these profiles would vary, some of them being very specific to only the Select Agent, and some of them being general and encompassing both Select Agent and non-Select Agent sequences. This step requires expert judgment. The more general models would allow the classification system to deal with the possibility that some parts are substitutable (by synthetic biologists) with “generic” parts, so these profiles might be made at a more generic level: any RNA-dependent RNA replicase, rather than specifically the Select Agent RNA-dependent RNA replicase, for example.
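The parts-list rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: a “profile” is reduced to a hypothetical function that reports whether a sequence matches a part, and the part names and motifs (`REP`, `SURF1`) are placeholders with no biological meaning.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a profile: any function that reports whether a
# sequence matches one part. A real system would use curated profile models.
Profile = Callable[[str], bool]


def classify_genome(genes: List[str],
                    parts_list: Dict[str, Profile]) -> Tuple[str, List[str]]:
    """Return the operational class and the names of any missing parts."""
    missing = [part for part, matches in parts_list.items()
               if not any(matches(g) for g in genes)]
    status = "complete, infectious" if not missing else "genomic fragment"
    return status, missing


# Toy parts list with placeholder motifs (not real biology).
TOY_PARTS: Dict[str, Profile] = {
    "replicase (generic profile)": lambda g: "REP" in g,
    "surface protein (agent-specific profile)": lambda g: "SURF1" in g,
}

print(classify_genome(["xxREPxx", "yySURF1yy"], TOY_PARTS))
print(classify_genome(["xxREPxx"], TOY_PARTS))
```

Returning the list of missing parts alongside the label is a design choice of this sketch; it makes it easy to see why a sequence was treated as a fragment.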
The more specific models would focus on the parts thought to be most responsible for pathogenicity as opposed to core replication, metabolism, and growth functions. These specific models (as distinguished from the generic models) might be specially marked to raise a flag indicating that parts of a Select Agent are present even though a complete Select Agent genome is not, for the purposes of prudent follow-up on the part of a DNA synthesis company (for instance, if an order might represent an attempt to obtain a Select Agent genome in several individually legal pieces).

For each Select Agent, given a minimal parts list and a profile-based classification system for each part, the classification system would be tested, benchmarked, and challenged using known genome sequences. To be useful, the classification system would be required to classify correctly all known sequence variants of a Select Agent (and a set of reasonably imaginable ones), and a representative set of the most closely related non-Select Agent genomes, including

very close relatives, such as vaccine strains or non-pathogenic variants used in laboratory research.

For good classification, it is not sufficient to know a single representative genome sequence of each Select Agent. Using a classification system is an attempt to determine whether a novel sequence fits into the “cloud” of sequences representing expected genetic variation for a Select Agent genome, as opposed to the “clouds” of sequences representing the most closely related non-Select Agents. The more sequences are known, the better the expected genetic variation will be understood. Genome sequences of almost all Select Agents are available, but there has been less emphasis on obtaining genome sequences for closely related non-Select Agents. Future studies are sure to discover numerous new microbial and viral species, and it is desirable that these new discoveries not be misclassified as Select Agents just because they are closely related to Select Agents. More systematic genome sequencing of non-Select Agents would improve our knowledge of biodiversity and would be useful in developing a good classification system.

The profile classification system would have to be reviewed and revised, as new knowledge accrued that required newly discovered Select Agent or non-Select Agent variants to be classified. The updating process would resemble the continuing curation of other profile library classification systems, such as TIGRfams and Pfam.

Because it is automatic and software-based, the classification system could be made readily and transparently available on the Web, where it could be reviewed and challenged by scientists in the community to be sure, for example, that it was not inadvertently misclassifying useful non-Select Agents, such as vaccines and attenuated research strains.[13] Timely testing, updating, and public review of the system would guard against classification errors.
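The testing and review step described above amounts to a regression check against sequences whose labels are already agreed on. A minimal sketch follows; the `benchmark` helper, the toy classifier, and the sequences are all hypothetical and serve only to show how a misclassified vaccine strain would surface in such a check.

```python
from typing import Callable, List, Tuple


def benchmark(classifier: Callable[[str], str],
              labeled: List[Tuple[str, str, str]]) -> List[Tuple[str, str, str]]:
    """Return (name, expected, got) for every known sequence that is misclassified."""
    failures = []
    for name, sequence, expected in labeled:
        got = classifier(sequence)
        if got != expected:
            failures.append((name, expected, got))
    return failures


# Hypothetical toy classifier: calls anything containing "TOX" a Select Agent.
def toy_classifier(sequence: str) -> str:
    return "Select Agent" if "TOX" in sequence else "non-Select Agent"


# Made-up labeled set: a known agent, a near neighbor, and a vaccine strain.
KNOWN = [
    ("field isolate", "aaTOXbb", "Select Agent"),
    ("near neighbor", "aaTAXbb", "non-Select Agent"),
    ("vaccine strain", "aaTOXb-", "non-Select Agent"),  # must not be flagged
]

print(benchmark(toy_classifier, KNOWN))
```

Here the overly general toy rule flags the vaccine strain, which is the kind of error that community review and periodic updating would be expected to catch and correct.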
Automated annotation of protein function based on sequence similarity analysis is robust but not error-free (Schnoes et al. 2009).

[13] This would require a process for reclassifying an agent in response to input from the scientific community.

This essentially phylogenetics-based system will work better for some Select Agents than for others. The greatest difficulty in classifying Select Agents with a phylogenetic subfamily system will occur in cases in which very closely related viruses in the same phylogenetic group have small, easily evolved genetic changes that differentiate highly pathogenic Select Agent strains from low-pathogenicity non-Select Agent strains, and in which those changes come and go in a phylogenetic tree; the “high-pathogenicity” avian influenza viruses are an example. Similar cases of convergent functional evolution arise in protein function annotation, in which a small number of changes in active site residues can shift a protein function and these changes convergently evolve multiple times in multiple lineages. Alternative methods that key on critical functional residues have been

developed to deal with the problem for protein function annotation (Hannenhalli and Russell 2000) and could be deployed and benchmarked for the Select Agent classification problem.

There will be cases in which any sequence-based classification system must fail altogether. For example, the bovine spongiform encephalopathy (BSE) prion agent is on the Select Agent list, but prions are an alternatively folded conformation of a host protein; the amino acid sequence of the prion form of the protein is identical to that of the benign host form. The only way to distinguish the BSE prion from the natural host protein is by experimental assay.

There would be no pretense of prediction in this classification system. Many genomes would be classified as a Select Agent because they have all the parts of a Select Agent, but there is little reason to think that all those parts would necessarily work in concert to produce a working, infectious, pathogenic organism; indeed, most synthetic genomes that had all the independent parts would probably not work as dependent wholes. From the standpoint of dealing with the implications of synthetic biology and synthetic genomes, the utility of the classification system would not be to distinguish successful genome designs from unsuccessful ones (“bootable” pathogens from inert DNA sequences) but to distinguish attempts to synthesize dangerous genomes similar to a Select Agent from attempts to synthesize benign genomes from a non-Select Agent organism, a non-pathogenic strain, or a vaccine. The classification system does not distinguish legitimate research from illegitimate research; rather, it identifies agents that are restricted under the Select Agent Regulations and provides a means of identifying “sequences of concern” that may be worth monitoring.
The goal of the Select Agent Regulations is to restrict availability of the most dangerous known pathogens while not impeding beneficial biomedical research on known or emerging pathogens. In dealing with synthetic biology and the potential threat posed by novel agents, our goal is to try to regulate the most obvious attempts to synthesize a potentially working pathogen, and the current state of the art in synthetic biology is the ability to produce new combinations of existing biological parts, not to devise new genomes entirely de novo. We can never exclude radically novel synthetic biology designs, but we can raise the bar to the point where bioterrorists would have to possess knowledge better than the current state of the art with respect to what biological parts are necessary in a pathogen to evade a parts-list-based Select Agent classification system or would have to engage in an offensive biological weapons research program on a scale that would come under the Biological Weapons Convention.

A classification system would clearly be the easiest to develop for Select Agents with the smallest parts lists. The easiest would be the protein toxins composed of one or a few proteins, such as abrin and ricin. The next easiest would be the proteins encoding the multistep synthetic pathways for metabolite toxins such as diacetoxyscirpenol, saxitoxin, and tetrodotoxin (on the presumption that this biosynthetic pathway might be moved in a modular form into a new host to create a new organism that expresses the toxin). Next would be the viruses, ranging from small genomes (such as Lassa virus) to large ones (such as smallpox). The microbial genomes would be the hardest to deal with, and would require the most thought about what parts are generic and what parts are specific to a Select Agent pathogen.

CONSIDERATIONS FOR IMPLEMENTATION OF A PROFILE-BASED CLASSIFICATION SYSTEM

It is not the role of our committee to recommend specific implementation plans, nor are we properly constituted to do so. But we were tasked with describing an “alternative framework” for oversight, so it is appropriate to make some observations about implementing a profile-based sequence classification system along the lines discussed in this chapter.

To be useful for unambiguous regulations, there would need to be a single agreed-on classification system as opposed to multiple competing systems developed by different research groups. That would require a centralized funding plan that would balance the benefits of single source standardization by a single Select Agent classification system team against the need for oversight and review to maintain quality and efficiency in the absence of peer competition.

A classification system would require a small team of full-time staff to develop and maintain it. The sequence curation work required is substantial. Classifying the current 82 Select Agents would require 82 parts lists and on the order of several thousand different profiles for the parts, and each Select Agent classification would need to be carefully tested and maintained over time. That would be on the same scale as the curation effort involved in the current Pfam or TIGRfams databases for automated protein sequence annotation.
The Pfam database, for example, consists of about 12,000 profiles of common protein domain families, maintained by four to six skilled full-time staff since the mid-1990s, including sequence analysts, database administrators, and software developers.

The curation team would need advice from a panel of leading scientists for each group of pathogens. The scientific advisory panels would need to meet regularly to review the relevant literature and research results and would need to develop and maintain up-to-date consensus on the parts lists and parts classifications that define suitable sequence spaces around each Select Agent. These defined sequence spaces would be embodied in an automated classification system by the curation team. The classification system, together with comments on its accuracy gathered from the scientific community, could be reviewed by the appropriate government departments, and the database system would be approved as guidance in interpreting the law, much as the Centers for Disease Control issued a written document to guide

gene synthesis companies in interpreting the application of the Select Agent Regulations to synthetic DNA.

These scientific advisory panels would probably include not only U.S. scientists, but the best scientists from around the world. International participation would have intangible additional benefits. Gene synthesis is an international industry; international harmonization of regulation and best practices for biosafety and biosecurity in synthetic genomics is an important area. In addition, participation of international scientists in the undertaking could raise awareness of dual-use issues among international researchers, a major objective of the National Strategy for Countering Biological Threats[14] and of the NSABB.

A balance would need to be struck between the need to keep definitions up to date with the state of scientific knowledge about the genetic composition of plausible complete and infectious Select Agent pathogens and the need to have a stable regulatory environment. It would be undesirable to have high-consequence regulations like the Select Agent Regulations changing on a rapid time scale. It would be unreasonable, for example, to have sequences moving on and off the list on a time scale much faster than the time scale of converting a laboratory to meet the Select Agent Regulations. A suitable time scale might be to issue an updated classification system every two years. This is consistent with the current review process for the Select Agent Program, which is overseen by the Intragovernmental Select Agents and Toxins Technical Advisory Committee (ISATTAC).

The periodic expert review and update cycle could be meshed well with recommendations of other recent advisory reports calling for increased cross-agency harmonization of the Select Agent Regulations, and for increased transparency in the procedures for moving agents on and off the list.
[15]

[14] “Transform the international dialogue on biological threats: Activities targeted to promote a robust and sustained discussion among all nations as to the evolving biological threat and identify mutually agreed steps to counter it.”

[15] 2009 National Research Council report Responsible Research with Biological Select Agents and Toxins, “RECOMMENDATION 2: To provide continued engagement of stakeholders in oversight of the Select Agent Program, a Biological Select Agents and Toxins Advisory Committee (BSATAC) should be established. The members, who should be drawn from academic/research institutions and the private sector, should include microbiologists and other infectious disease researchers (including Select Agent researchers), directors of BSAT laboratories, and those with experience in biosecurity, animal care and use, compliance, biosafety, and operations. Representatives from the federal agencies with a responsibility for funding, conducting, or overseeing Select Agent research would serve in an ex officio capacity . . .” (NRC 2009b).

As we have discussed, the decisions made in establishing classification boundaries in sequence space are unavoidably arbitrary. They cannot be interpreted as biological predictions of whether given synthetic genome sequences would function as dangerous pathogens. Nevertheless, such a system would be an improvement over the current process. It would transparently, consistently,

and unambiguously represent the harmonized views of a community of experts. A centralized system would almost certainly lead to better decisions than reliance on a series of dispersed judgments by individual scientists in gene synthesis companies who have little specific knowledge about the pathogen sequences that they might be asked to synthesize.

It would be prudent to develop the system in phases, starting with a pilot project on a subset of the smallest Select Agent sequences, such as the protein toxins and the smallest Select Agent viruses. Several of the smallest Select Agent sequences are at the same time considered the most dangerous and feasible threats for current synthetic genomic technology, and also the smallest and easiest test cases for a profile-based classification system (because they require definition of the fewest parts).

ROLE OF PREDICTION AND CLASSIFICATION IN BIOSAFETY

Chapter 2 discussed how high-level biological phenotypes, such as pathogenicity and transmissibility, cannot plausibly be predicted with the degree of certainty required for statutory biosecurity regulations, either now or in the foreseeable future. Nonetheless, predictive ability is a major goal of biology, and it is sure to develop. Any predictive ability will develop first in ways suited for probabilistic risk assessment (for raising potential biosafety concerns about an encoded organism) long before prediction reaches the far higher bar of making completely precise and accurate judgments suitable for declaring that novel synthetic constructs are dangerous weapons and immediately subject to Select Agent Regulations in the absence of actual experience with the encoded organism’s properties.
Predictive systems biology will slowly give us a better ability to assess whether a synthetic genome sequence might be more or less hazardous and whether the organism that it encodes might be more or less likely to have the phenotypic properties of Select Agents. Therefore, to the extent that prediction of biological properties of Select Agents does become possible, we believe that it will first be useful in the context of a yellow flag biosafety warning system, one that warns investigators and their institutions and institutional biosafety committees (IBCs) that a synthetic construct might be hazardous, perhaps inadvertently so, and that the construct and its encoded organism ought to be handled with an appropriate level of biosafety containment.

The centralized classification system for pathogen sequences described above could inform biosafety judgments long before prediction has a substantial impact. The two concepts of a minimal parts list for a complete pathogen and a profile-based classification system could help scientists to identify constructs that, although not complete pathogens within the definitions of the Select Agent Regulations, merit close scrutiny for their biosafety and biosecurity implications. Classification is not prediction, but it is completely plausible that constructs that are close to a minimal parts list for a pathogen or are similar

to a pathogen gene merit more thorough biosafety evaluation than constructs that are not.

RAISING A YELLOW FLAG FOR “SEQUENCES OF CONCERN”

A wide array of DNA sequences are in a category that, although not completely innocuous, should not be subject to the severe constraints on research imposed by the Select Agent Regulations. The Select Agent Regulations cover complete genomes (“nucleic acids that can produce infectious forms of any of the Select Agent viruses,” language that has been formally interpreted as excluding “genomic fragments from Select Agents”). There are many reasons for prudent concern about providing synthetic genomic fragments of Select Agents. A “bad actor” might seek to obtain a genome in fragments that individually did not meet the criteria of the Select Agent Regulations and then assemble the complete genome. The “bad actor” would be in violation of the Select Agent Regulations by constructing a complete genome, but from a regulatory perspective, we might want to make this technically plausible scenario more difficult to carry out and might want to have more advance notice when such a scenario might be playing out.

We would like to explore the concept of a “yellow flag” system that could provide a transition between the standard biosafety practices that are applied to the vast majority of research projects and the highly regulated, highly restrictive practices required by the Select Agent Regulations. (The concept of the yellow flag will also be discussed in Chapter 4 and Appendix L.) Our best chance to reduce public health risks posed by infectious disease is an active, efficient, and safe R&D program directed at known and emerging pathogens. This committee and others believe strongly that overbroad application of the Select Agent Regulations will increase risks to public health by increasing the cost and reducing the speed of critically important research on pathogens.
The aim of the yellow flag is to provide a set of practices that improve safety and security without increasing the number of sequences that are covered by the Select Agent Regulations.

Restrictions on pathogen research have two primary goals: to make it harder for bad actors to use pathogens as weapons or as tools for bioterror and to avoid the accidental, inadvertent, or ill-advised production of hazardous constructs by well-meaning investigators. Meeting both goals through a yellow flag system probably requires some form of disclosure of the pathogenic sequences or experimental plans to scientists outside the groups carrying out the research. That disclosure could involve centralized reporting of yellow flag sequences, public disclosure, or some kind of institutional review. A centralized system for reporting yellow flag sequences would allow detection of the simplest scheme for avoiding the Select Agent Regulations: splitting the order for a viral genome or a toxin between two different gene synthesis companies. It would

also enable intelligence analysis and monitoring of DNA sequence orders. Public disclosure would offer the virtues of an open-source system (the power of review by many different observers), but it would require a substantial change in the culture of science. A standardized system for identifying “yellow flag” sequences coupled with biosafety review (and biosecurity monitoring) might offer the simplest approach to reducing risks.

We envision that the actions taken in response to a yellow flag could be informal, prudent best practices in that they fall outside the strict regulatory boundaries of the Select Agent Regulations. It would also be possible to use a yellow flag system in more formal ways. For example, an IBC or funding sponsor could ask that yellow-flagged synthetic constructs trigger some sort of special notification for purposes of oversight, if for no other reason than to track what laboratories were in possession of such constructs. Similarly, DNA synthesis companies might be asked to maintain records of yellow-flagged constructs that they provide, to facilitate forensic investigation in the event of the criminal construction of a complete Select Agent from synthetic parts.

The concept of raising a yellow flag on synthetic genomic constructs clearly overlaps with the biosecurity goals of the Select Agents Program, providing a sort of buffer zone that identifies individual Select Agent synthetic parts that do not rise to the precise inclusive definitions of the “complete, infectious” genomes that come under the Select Agent Regulations. Moreover, we view a yellow flag system more broadly from a biosafety perspective. We believe on intuitive grounds that the probability of accidental, inadvertent, or ill-advised hazardous constructs by well-meaning investigators (including hobbyist biohackers) is much higher than the probability of deliberate weapons constructs,[16] although neither probability can be estimated.
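The centralized reporting idea discussed earlier in this section, in which an order split between two gene synthesis companies becomes visible once reports are pooled, can be sketched as follows. The report format, the function name `find_split_orders`, and all company, customer, agent, and part names are hypothetical placeholders; the sketch assumes each company reports only which yellow-flagged part of which agent a customer ordered.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Assumed report format: (company, customer, agent, part).
Report = Tuple[str, str, str, str]


def find_split_orders(reports: Iterable[Report],
                      parts_lists: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (customer, agent) pairs whose pooled orders from more than one
    company cover every part on that agent's minimal parts list."""
    parts_seen = defaultdict(set)   # (customer, agent) -> parts ordered
    companies = defaultdict(set)    # (customer, agent) -> companies used
    for company, customer, agent, part in reports:
        parts_seen[(customer, agent)].add(part)
        companies[(customer, agent)].add(company)

    flagged = []
    for (customer, agent), parts in parts_seen.items():
        required = parts_lists.get(agent)
        if required and required <= parts and len(companies[(customer, agent)]) > 1:
            flagged.append((customer, agent))
    return sorted(flagged)


# Made-up example: cust1 orders the two parts of agent-A from two companies.
PARTS = {"agent-A": {"part1", "part2"}}
REPORTS = [
    ("CompanyX", "cust1", "agent-A", "part1"),
    ("CompanyY", "cust1", "agent-A", "part2"),
    ("CompanyX", "cust2", "agent-A", "part1"),
]

print(find_split_orders(REPORTS, PARTS))
```

Neither company alone sees a complete parts list in this example; only the pooled view does, which is the point of centralizing the reports.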
A yellow flag system could help to build a web of reasonable oversight of synthetic genomic constructs. Investigators or biohackers could be warned if their constructs included potentially dangerous Select Agent-like parts, to be sure that they were aware and to be sure that they were prepared and equipped to handle the constructs appropriately. It would also be possible to expand the list of yellow flag parts used in profile-based classification beyond the Select Agents list to include other sequences of concern.

Initially, the “yellow flag” system could be based on the profile-based classification system described above, and best practices could be promulgated for how investigators, institutions, IBCs, and synthetic genomics companies should deal with constructs deemed to be “potentially risky.” As predictive methods are developed, they would naturally augment such risk assessments. Because the profile-based classification system would be transparently available in its function for the Select Agent Regulations, it would also be transparently available for parallel development of biosafety best practices of the yellow flag variety. That is, community awareness and best practices for responsibility, review, and oversight of biosafety of synthetic constructs could be developed now, starting with the profile-based classification system, and this biosafety framework would then be in place as scientifically sound de novo predictive systems develop.

[16] At the very least, those engaged in a deliberate weapons program are less likely to use DNA synthesis companies as a means of obtaining genomic fragments, when those companies are known to screen orders for “sequences of concern.”

SHOULD SUCH A SYSTEM BE BUILT?

All that said, should such a profile-based classification system be built? Our committee is not constituted to answer that question. Our answer to this part of our charge is that it could be built.

Together, the profile-based classification for Select Agents and the yellow flag system for sequences of concern would address many of the emerging biosecurity concerns posed by synthetic biology. Gene synthesis companies, which have to make daily judgments based on the Select Agent Regulations and other regulations, strongly favor development of such a system. As discussed in Chapter 1, several of the companies have agreed to a common set of procedures for screening customers and sequences. In addition to helping to identify sequences that qualify as Select Agents, the parts list profile-based classification system would provide information to be used for flagging sequences of concern. The companies have agreed to apply enhanced customer screening to orders that do not meet the definition of the Select Agent Regulations but do include Select Agent sequences or other pathogen sequences. The goal of such screening is to supply sequences only to customers that have a legitimate research use for them and have the resources to handle them safely.

The system could also have utility that is independent of its role in clarifying the regulation of pathogen sequences.
One of the most consistent lessons of modern biological research is that cross-pollination between diverse fields of expertise consistently yields valuable scientific insights and useful new tools (NRC 2009b). Gathering information on the most important human and animal pathogens into a single, consistently annotated database will make it easier for scientists and engineers who are not experts on a particular pathogen to develop new diagnostics, vaccines, and therapeutics.

However, the system we have described does have some objectionable qualities. It may be “overkill.” It erects a moderately complicated and pedantic definitional framework around the purely regulatory question of what sequences represent complete, infectious Select Agents or not.

One might take an alternative view that the regulatory language could be clarified more simply and elegantly with a language of “reasonableness,” expressing the concepts of sequence classification methods we have described here but without the potential heavy-handedness of a centralized automated computational implementation. For instance, there might be regulations and

guidelines that declare the intuitive notion that a synthetic genome is “reasonably” expected to be complete and infectious and that the sequence of its parts would appear to a “reasonable” person to be closer to that of a Select Agent reference sequence than to that of the nearest non-Select Agent reference sequence. Such language alone would be an improvement over the use of problematic percent identity thresholds in some current regulatory language, and over the absence of any guidance at all for dealing with modified or chimeric Select Agent synthetic genomes. The intuitive concepts of sequence-based classification are sufficiently clear for anyone to know whether a sequence is in the vicinity of any reasonable definition of the line.

However, a “reasonableness” approach does not solve the problem of vagueness that troubles the DNA synthesis companies, researchers, and law enforcement as they try to apply the Select Agent Regulations. Without a precise definition of the sequences covered by the Select Agent Regulations, companies might choose to reject any construct that contains Select Agent-related sequences, and researchers are left unsure whether they are subject to the Select Agent Regulations. Moreover, the reasonableness approach does not begin to address the issue of novel synthetic constructs and sequences of concern, which are a motivator for the development of an alternative framework for oversight.
