Read "Sequence-Based Classification of Select Agents: A Brighter Line" at NAP.edu

Page 73 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

3
A Proposal for Consideration: Sequence-Based Classification of Select Agents

The previous chapter concluded that the primary answer to our charge is no—prediction of Select Agent status by genome sequence analysis is not feasible.

First, in Chapter 1, the committee found that prediction of Select Agent status is not possible, because Select Agent status involves economic, historical, and policy considerations beyond the biological properties of the organism encoded in its genome.

Second, the committee finds that even if it were possible to assign Select Agent status solely on the basis of genome-encoded biological properties, the answer would remain no. Chapter 2 described why accurate prediction of an organism’s pathogenicity from its genome sequence is not possible now, and will not be feasible in the foreseeable future—certainly not at the level of accuracy appropriate for statutory regulations. Reliable prediction of the hazardous properties of pathogens from their genome sequence alone will require an extraordinarily detailed understanding of host, pathogen, and environment interactions integrated at the systems, organism, population, and ecosystem levels. For the foreseeable future, the only reliable predictor of the hazard posed by a biological agent is actual experience with that agent.

The committee was charged to identify, supposing that the answer to these questions would be no, “the scientific advances that would be necessary to permit serious consideration” of such a predictive framework. The committee believes that we are so far from the goal of a predictive framework that it is premature to plan specific steps towards a Select Agent regulatory system based on predictive genome sequence analysis.

As described in Chapter 2, it is a major goal of all biology to understand how DNA sequence determines the properties of biological systems, ranging

Page 74 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

in complexity from single macromolecules to pathways, organisms, populations, and ecosystems. Successes in prediction and design at each subsequent level of complexity in biology as a whole are the relevant milestones to watch for, before we will be able to predict confidently from genome sequence analysis how a designed organism would replicate, interact with a host, evade a host immune system, and spread in a population to cause disease.

The committee’s view is that for the specific purposes of the Select Agent Regulations, those general biological milestones should be passively monitored, not actively sought. A narrow focus on such milestones for the sole purpose of being able to predict what makes Select Agents dangerous may be a distortion of priorities in biology, and may also raise concerns about dual use. The ability to predict pathogenicity from genome sequence automatically confers the ability to design genome sequences of pathogens.

However, the committee is not satisfied with answering its charge narrowly and in the negative. The rapidly expanding capabilities of automated gene synthethesis and of synthetic genomics to synthesize and “boot” complete Select Agent genomes means that the Select Agent Regulations do need to be defined in terms of genome sequence analysis, not by the phenotypic properties of an encoded agent. A Select Agent genome is covered by the Select Agent Regulations whether or not it is ever “booted” into a living agent whose phenotype can be assayed. A DNA synthesis company needs to be able to tell, unambiguously and by sequence alone, if it is being asked by a customer to synthesize the genome of a Select Agent.

That determination would not be a problem if each Select Agent had a unique genome sequence. However, discrete taxonomic nomenclature in biology is already challenged by the great diversity and continuum of organisms observed in natural wild isolates, and the rapidly expanding ability of synthetic biology to create highly modified variants and chimeras of naturally occurring genomes poses an even greater challenge to taxonomic naming systems. Select Agent pathogens, like any biological organism, are not defined by a single DNA sequence. Given natural wild variation and the conceivable range of tolerable synthetic variation, a “cloud” of related sequences of similar biological properties are all assigned the same taxonomic name. There may be sequences that are just as closely related but are not Select Agents, including vaccine strains and attenuated research strains that the U.S. government want explicitly to avoid encumbering with the Select Agent Regulations.

In its deliberations, the committee found that it was useful and important to distinguish sequence-based prediction of biological properties from sequence-based classification. A regulatory system based on prediction must be able to recognize that an entirely novel genome sequence (unrelated to any known sequence) encodes a pathogen that should be assigned Select Agent status. A regulatory system based on classification “merely” tries to decide

Page 75 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

whether a sequence is sufficiently similar to that of an already-known, already-named Select Agent to be assigned the same taxonomic name and status. The two concepts are easily confused and sometimes conflated because the state of the art of prediction from biological sequence is generally not based on a physics-based prediction of the molecular structure and function of the parts encoded by a genome, but rather on sequence comparison and classification: If one sequence is similar to another known sequence, it is assumed that they share evolutionary ancestry and have similar biological functions.

This chapter explores the conceptual difference between predictive systems and classification systems and considers the ramifications of using sequence-based classification for the Select Agent Regulations. In a narrow sense, the committee has addressed its charge by explaining that prediction-based systems are not feasible. However, the committee interprets its charge more broadly and in this chapter moves beyond infeasible predictive systems to consider a feasible sequence-based classification system. The reader may want to view this chapter as a proposal for consideration, rather than as a recommendation. However, in the sense that sequence-based classification is conflated with rough and limited prediction in biology, this chapter is the committee’s positive and constructive answer to its charge. We discuss how a sequence-based classification system might be used to encompass what we believe are the most technically feasible and likely scenarios whereby synthetic genomics and synthetic biology could be used to construct a hazardous agent with Select Agent properties. A sequence-based classification system would still be based on a discrete list of Select Agents, but could be used to create a pragmatic “brighter line” for deciding whether a new genome sequence should be regarded as one of the existing Select Agents or not.

NOVEL AGENTS: SYNTHETIC GENOMICS AND THE SELECT AGENT REGULATIONS

We need to examine what we mean by a “novel” synthetic agent. This chapter flows from the strong premise that for the foreseeable future, synthetic pathogens (at least those regulated by the Select Agent Regulations) will be composed largely or entirely of genes derived from existing pathogens. That premise requires careful consideration. We want to outline the most likely threat scenarios that might be created with synthetic genomics, and we want to consider what we should regulate at the level of possession and transfer of specific agents with regulations like the Select Agents Regulations, as opposed to overarching statutory prohibition of use and development of offensive biological weapons under the Biological Weapons Convention, or as opposed to prudent laboratory biosafety guidelines for handling of any pathogenic organism.

Page 76 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

Synthetic genomics poses three main threat scenarios that would allow a “bad actor”¹ to obtain a pathogenic organism suited for use as a weapon:

The bad actor orders a synthetic DNA (or RNA) genome of a known Select Agent. The exact sequence may be modified, either in an attempt to alter the phenotype of the agent (perhaps to introduce drug resistance, increase pathogeneticity, or alter host range) or in an attempt to circumvent the Select Agent Regulations themselves (by making genetic changes intended to have little or no effect on pathogenicity, but to create ambiguity about the taxonomic classification of the organism). The bad actor then “boots” the synthetic genome into a full organism. We will call such an organism a “modified Select Agent” and we will describe its creation as “modification” of an existing organism.
The bad actor “assembles” a synthetic pathogen by combining parts (genes and regulatory sequences) of known organisms, for example by creating a chimera of two or more viruses, or by attempting to express genes that encode a toxin or mechanism of pathogenesis in an otherwise innocuous “chassis” of a non-Select Agent host, such as any of several commonly used viral vectors. This would include cases in which no individual part originates in a Select Agent.² We will call the organism created by this scenario a “chimera,” and we will describe its creation as “assembly” of existing parts.
The bad actor “designs” a synthetic pathogen by creating entirely new gene sequences—dissimilar to any known pathogen gene sequences in nature. We will call the organism created by such a process a “de novo” novel agent, and we will describe its creation as “design” (although sequences may be selected from randomized pools by high-throughput in vitro evolution or selection rather than actually designed).

Obviously, there is a continuum among these three scenarios. As more new genes are moved into an existing organism, the line between modification and assembly blurs. Only part of an organism may be designed, and the rest assembled or modified. The important distinction is between genetic modification of an existing organism in essentially its original order, assembly of parts of

¹

“Bad actor” is used to mean “an individual or group with nefarious intent.” “Individual” or “person” does not convey the full meaning; “terrorist” was viewed as too strong or specific a term. In addition, a “bad actor” may be either an individual or a group.

²

For example, experiments unexpectedly found that inserting a mouse interleukin-4 (IL-4) gene into ectromelia virus (mousepox) created a highly pathogenic virus against mice. Based on these results, there is concern that a related human virus like vaccinia, engineered to express human IL-4, could become highly pathogenic for humans, though neither vaccinia (the smallpox vaccine virus) nor the IL-4 gene by themselves are Select Agents.

Page 77 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

known organisms in new combinations in new orders, and creation of entirely new gene sequences dissimilar to any known pathogen genes.

The three scenarios are ordered by increasing technical difficulty and therefore by decreasing likelihood (see Table 3.1). We want to be sure that we are dealing with the more likely scenarios before worrying inordinately about less likely ones.

The first scenario is the simplest, easiest, and most likely to work in the absence of an expensive research and development program; therefore, it is the most dangerous. Most Select Agent viruses can now be booted from synthetic DNA, some relatively easily (such as small positive-stranded RNA viruses) and some with great difficulty (such as large DNA viruses like variola). As the scope of DNA synthesis increases and the technology becomes commoditized, an increasing number of Select Agents can be reconstituted with a modest level of skill in molecular biology. Synthetic genomes identical to complete Select Agent

TABLE 3.1

	“Modified Select Agent”	“Chimeric Select Agent”	“de novo Select Agent”
Three threat scenarios posed by synthetic genomics:	Created by “modification” of an existing organism.	Created by “assembly” of existing parts.	Created by “design” (or by high-throughput in vitro evolution or selection)
Feasibility of scenario given the state of the art:	Feasible	Possible	Improbable
	Modification is routine genetic engineering	Assembly is a frequent molecular biology technique as an extension of “modification.” Assembly of a radically novel agent is beyond the state of the art.	Beyond current capabilities; if possible at all, likely to require extensive experimental selection, refinement, and testing in susceptible hosts.
Potential solution:	Sequence-based classification.	Gene-sequence-based classification.	Sequence-based function prediction.
	Anchored around a taxonomic name and a full genome reference sequence, this would define the “space” around each select agent—essentially translating the regulatory language into an operational definition that accommodates the biological complexity.	This would identify individual “parts” of genomes (beginning with select agents), define “space” around each part, and determine which “parts,” when assembled, are operationally considered a “complete” genome for the purpose of the Select Agent Regulations	Neither design, nor prediction is currently possible.
			Attempting to design and create a bioweapon is prohibited by BWC and USC Title 18 Section 175(b), among others.

Page 78 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

genomes are already clearly covered by the Select Agent Regulations. Whether a modified sequence is still called a Select Agent or not (the question of when the magnitude of modification requires a new discrete name) is a gray area in the current Select Agent Regulations.

The second scenario is plausible, but unlikely to work without an extensive research and development effort. Chimeric assemblies are the leading edge of early successes in “synthetic biology.” There are efforts to create standardized “parts”—genes or combinations of genes that have functions that are readily transferable into an organism “chassis” that provides basic biochemical, structural, and replication functions (Kwok 2010). The simplest chimeric pathogens would express a toxin gene in a chassis, but the most dangerous toxin genes are already covered by the Select Agent Regulations regardless of what organism they are in. Chimeric pathogen weapons that evade the current Select Agent Regulations would require the assembly of less obvious pathogenesis mechanisms than expression of known Select Agent toxins. The more such an assembly deviates from a known organism, the less likely it is to work, for all the same reasons that prediction of pathogenicity and other phenotypes from genome sequence is not possible. It is not possible with the current state of knowledge to predict and foresee all detailed interactions of gene products that determine an organism’s overall phenotype. It would require experimental characterization in suitable hosts to be sure that a chimeric weapon worked as intended. For some Select Agents, there are no surrogate experimental hosts for characterizing virulence, and the only suitable host for a human pathogen may be a human. Those considerations raise the research and development bar substantially, and expose such a program to existing legal prohibitions other than the Select Agent Regulations. Thus novel chimeric constructions are unlikely as terror weapons,³ although they are alleged to have been used in national offensive bioweapons research programs.

The third scenario is beyond the capabilities of current biology. There is no example of a designed organism or even of a designed genetic pathway, let alone a designed pathogen. Designing a self-replicating organism that has only to interact with simple molecules in a test tube is one thing, and it is hard enough; designing a pathogen that has to interact with a complicated host, evade its immune system, and be transmissible in the natural environment adds daunting layers of biological complexity. There are very few examples of single protein sequences that have been designed to fold in a particular novel way (Kuhlman, Dantas et al. 2003). These first few modest successes in de novo design of single proteins constitute the current state of the art. Design and prediction go hand

³

Or at least as functionally dangerous pathogens. A release of a chimeric construct of suitably scary-sounding parts could cause societal damage from fear and panic even if it were utterly ineffective as an organism. The number of such imaginable scenarios is enormous—probably beyond the scope of the Select Agent Regulations to regulate effectively without overly impeding beneficial biomedical research. The best countermeasure against “mock” chimeric weapons is likely to be a resilient public health emergency response and communication system.

Page 79 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

in hand; our lack of predictive ability in biology means that we cannot now design genomes.⁴

Synthetic biology’s use of metaphors like “booting” a genome into a living organism or use of a well-known organism, such as Escherichia coli as a “chassis” for hosting synthetic constructs (Lee, Chou et al. 2008) may be misleading about the likelihood of de novo design. Among synthetic biologists, this metaphorical language emphasizes the long-term goal of making biological systems as engineerable as computers or machines; but the language also tends to trivialize the complexity of biological systems and the enormous gaps in our understanding of them by making it seem (perhaps especially to non-biologists) that we can already engineer biological systems easily. There are examples in which synthetic genomes have been “booted” into living viruses (Cello et al. 2002) and now even cells (Lartigue et al. 2009; Gibson et al. 2010), but it must be remembered that these experiments have “only” synthesized minor variants of known natural genome sequences. An E. coli “chassis” has been used to express genes for a complex synthetic function, such as the engineering of E. coli to produce pharmaceutically important natural plant products, such as the antimalarial drug arteminisin (Chang et al. 2007), or bulk chemicals, such as 1,3-propanediol (Bio-PDO, Dupont), a starting point for synthesis of plastics (patent WO/2004/101479), but these efforts have required massive multiyear iterative bioengineering and development processes “just” to move known genes from other organisms into the E. coli system and get them to work as desired. All existing (and reasonably foreseeable) uses of synthetic biology involve modification or rearrangement of existing biological components. The entirely de novo design of genomes and organisms remains science fiction.

We have distinguished three kinds of novel synthetic organisms because we believe that there is a tendency to imagine nightmare scenarios in which a de novo unnamed pathogen, dissimilar to any known pathogen and thus unrecognizable by any sequence comparison protocol, is created deliberately or accidentally with synthetic genomics. Clearly, a regulatory system like the Select Agent Regulations based on a list of known agents and their genome sequences is not effective for regulating entirely de novo agents. Such a concern seems to have been behind the charge to our committee. If one were worried about prohibiting possession and transfer of de novo agents at the point of their

⁴

An important exception to the concept that design and prediction go hand in hand is that it is feasible to select de novo sequences for particular functions by high-throughput in vitro evolution or selection, and thus make it possible to arrive at functional sequences without designing or understanding them. Although selection and directed evolution have been widely applied at the level of individual RNAs or proteins, no novel genome sequences have been created this way. Given the complexity of the problem (the number of possible DNA sequences rises in proportion to 4^N for a sequence of length N), it is extremely unlikely that complete de novo genomes could be selected in a small number of iterations by such procedures. If it were done, it would require a long effort of iterative artificial evolution—expensive and complicated work beyond any current research program in biology.

Page 80 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

creation, one would need to develop a forward-looking system that predicts the Select Agent status of a de novo genome sequence, rather than a backward-looking system based on a taxonomy of known Select Agents. However, as discussed in Chapter 2, such a prediction based system is not feasible.

We instead take the view that not only is the creation of de novo agents the most unlikely scenario, but also that it is not and can not be the purpose of the Select Agent Regulations to regulate de novo agents any more than it is their purpose to regulate access to novel emerging diseases in nature. The Select Agent Regulations should not aim to prevent access to all possible pathogens. There is no way to anticipate all possible novel pathogens. The Select Agent Regulations should aim to impede access to the most dangerous known pathogens. The best defense against the unquantifiable threat of novel synthetic pathogens is not the Select Agent Regulations, but continued enhancements of the laboratory and clinical biosafety measures that we already have to deal with the real and measurable threat of emerging natural pathogens. When a novel agent emerges, research is initiated to study its mechanism of action, the potential threat that it presents, and its susceptibility to countermeasures. When the novel agent has been shown to meet the criteria for Select Agent regulation—with respect to not only its genome-encoded biological properties, but other medical and policy considerations—its name can be added to the Select Agent list so that future access to it can be regulated.

It is our view that the main biosecurity threat scenarios start with accessibility of proven known pathogens, so it is reasonable for the Select Agent Regulations to be backward-looking and based on a list of known agents. These regulations protect us by restricting the availability of agents that we know from experience to be extraordinarily dangerous and that have a high potential for biowarfare or bioterror.

The foregoing establishes the premise for the remainder of this chapter. Modified Select Agents, made facile by the commoditization of synthetic genomics, constitute the most important and pressing practical issue related to the Select Agent Regulations. The taxonomic nomenclature of microorganisms is designed for wild isolates of actual organisms that have observable growth phenotypes, not for non-natural modified sequences that exist only as genomic DNA. A system based on (natural) taxonomic nomenclature does not establish a bright line that is sufficient for clear statutory regulation of possession and transfer of synthetic genomic DNA sequences.

As a specific working example, consider the situation of a DNA synthesis company. A DNA synthesis company is capable of synthesizing complete genomes, and the company (if unregistered) formally violates the Select Agent Regulations if it possesses or transfers a synthetic Select Agent genome.⁵ Who

⁵	Although we use synthetic genome orders as an illustrative example, we must recognize that from the standpoint of impeding bioterror scenarios, there will be myriad ways to get around any

Page 81 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

judges whether a DNA sequence constructed by the company is considered to be a Select Agent? How similar to a Select Agent reference sequence does a DNA sequence have to be to still be deemed a Select Agent? Currently, each gene sequence company must define for itself the sequence boundaries around each Select Agent, with only minimal guidance from regulators (DHHS 2009). The companies understandably want increased clarity about what sequences are and are not covered by the Select Agent Regulations. By addressing this issue as an example, we would also deal with a number of other scenarios in which synthetic genomics might be used to create modified Select Agents. And we will also be able to deal with some of the most obvious and likely ways that chimeric agents might be assembled with synthetic biology.

As discussed in the next section, we take the view that the most pressing issues can be treated as a sequence classification problem more than a sequence-based prediction problem.

CLASSIFICATION IS DISTINCT FROM PREDICTION

Whereas sequence-based prediction of the properties of Select Agents will not be feasible in the foreseeable future, sequence-based classification can be addressed with current technology. At least at the level of individual gene sequences, there is an extensive literature on methods for automated classification of sequences into operationally defined taxonomies. Cellular organisms are routinely classified taxonomically by using small subunit ribosomal RNA sequence comparisons. Everyday examples of assigning protein sequences into annotated families include databases such as Pfam, Interpro, and TIGRfams. The existing literature does not quite suffice for the problem of sequence-based classification of complete Select Agent genomes. Viruses lack ribosomal RNA, for one thing; for another, we need to think about distinguishing a complete genome from a partial one; and we need to worry about artificially modified and chimeric constructs that would not be constrained to follow the patterns of natural sequence

screening procedure used by DNA synthesis companies, ranging from splitting an order into apparently innocuous pieces among multiple companies to using offshore companies that do not adhere to U.S. regulations, to simply not using a DNA synthesis company at all. The technology of DNA synthesis is rapidly being commoditized, and DNA oligonucleotide synthesis machines can already be purchased cheaply from eBay. An ebay.com search on “oligo synthesizer” on 10 October 2009 found a used Applied Biosystems 394 DNA/RNA oligo synthesizer on sale for $8,900 (plus $106.16 shipping within 3-8 business days to a committee member’s home in Northern Virginia). With great difficulty and specialized technical skills, genes and even whole genomes can be assembled from individual short oligonucleotides. In much the same vein, a determined bioterrorist can obtain isolates of a Select Agent from the wild. The Select Agent Regulations can only raise the difficulty bar for acquiring cultures of proven highly virulent agents and provide law enforcement with tools to prosecute for possession of variants of such agents; because natural biological organisms are widely available, readily engineered, and increasingly easy to create, it is unrealistic to try to design the Select Agent Regulations to preclude acquisition completely.

Page 82 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

evolution. However, the adaptations needed to define complete Select Agent genomes operationally are fairly obvious, as we will discuss.

The computational sequence analysis technologies used for sequence-based classification define “sequence spaces” that circumscribe the known variation of sequences that are considered to belong to a useful name, while excluding the known variation of sequences that are considered to be attached to different names. Therefore, a necessary precondition is to have a number of representative sequences that belong to the desired classification and a number of the most closely related sequences that do not belong.

An important principle of automated classification (also known as “supervised learning” methods, in statistical inference) is that given the known sequences of things that we want to label as Select Agents and things that we do not want to label as Select Agents, there is always a classification scheme that can achieve the desired labeling of known sequences with 100 percent accuracy. The important concern is not that a classification system would misclassify a known sequence, but rather how well a classification system generalizes to correctly classify new sequences that it has not seen before. The existing methods for sequence-based classification of protein and DNA sequences provide a flexible set of software tools for human experts to use to define appropriately generalized sequence spaces on a case-by-case basis using expert knowledge. The methods enable 100 percent classification accuracy on known sequences (essentially by definition—the classification system is defined on the known sequences already labeled with a set of known labels), can be expected to perform reasonably well on new sequences, and are readily updated if and when erroneous classifications occur.

The basic principle of sequence-based classification is simple (Figure 3.1). A sequence is assigned the name of the known group that it is “closest” to, as long as it is also within the range of known variation accepted for that group. If it falls outside the range of known variation of any known named groups, it is operationally declared a “novel” sequence, possibly within some larger and more broadly defined sequence family. The key is the definition of close, that is of what distance means in comparing sequences; various automated computational procedures differ in the details of their approach.

Sequence-based classification is related to taxonomic (species-level, evolutionary) classification, but the two do not always coincide. They coincide when the desired sequence classification corresponds to slowly evolving traits that are shared amongst a “clade” of evolutionarily related organisms to the exclusion of more distantly related organisms. For example, all variola (smallpox) virus isolates have sequences that are more similar to one another than are variola virus and the most related non-Select Agent orthopoxvirus. Sequence-based classification may differ from evolutionary classification for rapidly or recently evolved traits, particularly if only a small number of crucial changes are involved and the same sequence changes and phenotypes have evolved more

Page 83 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

FIGURE 3.1 Profile-Based Classification System—distance.

This graphic shows a tree based on sequence similarity. The Select Agent has high sequence similarity to the non-Select Agent nearest neighbor. The shaded spheres indicate the sequence “space,” or known variability, surrounding an agent’s taxonomic name.

• X1 sequence is within SA1 “space” and would be classified as Select Agent1.

• X2 sequence is most similar to non-SA1 and would be classified as non-Select Agent1.

IMPORTANT—this is not prediction! X1 may NOT be a pathogen. X2 may be a pathogen. This classification system identifies and clarifies what is subject to Select Agent Regulations. It cannot predict what is or is not a pathogen.

IMPORTANT—classification does NOT designate new Select Agents. It identifies sequences as belonging to a known sequence “space.” If the sequence does not belong to an established “space,” it is novel.

• X3 is a novel sequence that is more similar to the Select Agent than the non-Select Agent. This would initially be classified as a non-Select Agent. As biological information is acquired, this agent would be evaluated and might be added to the Select Agent list; or it would be considered another near neighbor, non-Select Agent sequence.

IMPORTANT—acquiring sequence data and biological information is needed to define the space around Select Agents and close the gap around novel sequences.

Page 84 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

than once in different lineages (this is referred to as convergent evolution). For example, a vaccine strain may differ from a wild-type virus in only a few residues, so a sequence-based classification system might have to focus its definition of the vaccine strain’s space on the specific residue identities. Sequence-based classification is strictly operational (in the sense that it can be applied to non-evolved artificial sequences, not just to naturally evolved ones) as opposed to sequence-based taxonomies that are backed by a notion of biological species evolution and evolutionary trees. A sequence-based classification draws “lines” (decision boundaries) that separate desired groups in sequence space; it is, not necessarily constrained by attempts to reconcile the patterns of common evolutionary descent of an organism and its individual genes. That is important, because we are interested in classification of genetically modified organisms and synthetic DNAs that may have no ancestors or no evolutionary history per se.

The “space” defined by a sequence-based classification will usually be large—almost always vastly larger than the space of verifiably viable sequences. That is one reason to use methods that define a space rather than simply making a list of the sequences that we want to include in a group. For example, if we crudely defined the variola virus “sequence space” as everything within 10 DNA substitutions of about a 186,000 nucleotide reference variola sequence, there would be over 10 such sequences.

It is unlikely that all those over 10⁵⁰ sequences would function as a variola virus. Many substitutions would be deleterious or lethal to the virus. Predicting whether and how well each one would function is the realm of sequence-based prediction, whereas in a sequence-based classification scheme we do not necessarily care. Rather, we care whether the classifications we assign to viable sequences (or at least to sequences that someone would have a reason to synthesize) are correct. It is not a crucial mistake to misclassify a non-viable hypothetical genome that no one would make. That is a key distinction between prediction and classification. The classification system is not predicting that sequences actually “work” but only that they are closer to those of a known Select Agent than to those of any known non-Select Agents. Our main concern in using a broadened definition of a Select Agent’s sequence space is to balance the need to encompass the most likely modifications and chimeras with the need to avoid the classification of a useful non-Select Agent genome (including those of vaccines and attenuated research strains) as Select Agents. We need only be sure that we define suitable non-overlapping sequence spaces that capture all known and expected variation in a named Select Agent and the nearest non-Select Agent groups.

The size of the “space” around any given classification will depend on how closely related the nearest competing classifications are. That is important; classification boundaries are arbitrary and operational and are defined by human experts for the particular classification task at hand. There is no single threshold (for percent identity, BLAST score, and so on) that suffices to define

Page 85 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

the boundary around biological names.⁶ The boundaries for a name’s sequence space must always be set relative to the nearest neighbors that have different names. Experts might define a tight space around a vaccine strain of one virus, a larger space around the pathogenic Select Agent form of the virus, and several competing “adjacent” spaces of different sizes representing the sequence variation observed in non-Select Agent natural relatives of the virus.

High-throughput automated annotation of individual gene, protein, or RNA sequences as belonging to particular sequence superfamilies, families, and subfamilies is already a robust and routine technology. At the whole (cellular) organism level, sequence-based classification has most often been done by selecting one representative gene sequence (usually small subunit ribosomal RNA) and assuming that the phylogenetic tree of that sequence reflects the phylogenetic tree of the species. More recently, methods have begun to calculate consensus phylogenetic trees on the basis of the simultaneous analysis of multiple genes. Virus taxonomy is somewhat more challenging; viruses lack ribosomal RNA genes or any other universally conserved sequence, and seem even more prone to lateral gene transfer than bacteria. But it is likewise increasingly reliant on genome sequence-based classification that uses key conserved genes in each virus group. With a modest amount of research and development for the specific application, methods like those could be adapted to the problem of sequence-based classification of modified Select Agent genomes, and to some extent to chimeric assemblies.

SYNTHETIC GENOME CLASSIFICATION UNDER THE CURRENT SELECT AGENT REGULATIONS

As noted earlier, the current Select Agents Regulations cover not just culturable organisms, but also naked DNA, including synthetic genomes. The Select Agent Regulations language confers Select Agent status on “nucleic acids that can produce infectious forms of any of the Select Agent viruses” and “recombinant nucleic acids that encode for the functional form(s) of Select Agent toxins” (italics added). In a document interpreting the Select Agent Regulations with regard to synthetic genomics (“Applicability of the Select Agent Regulations to Issues of Synthetic Genomics” see Appendix E), we find a written interpretation that “non-infectious components of Select Agent viruses,” including “genomic fragments from Select Agents” and “material from regulated genomes that has been rendered non-infectious” are not covered as Select Agents.

This guidance is talking about using DNA synthesis technology to produce an exact copy of a known Select Agent. That is the most likely application of synthetic genomics; a naturally occurring genome sequence is already known

⁶	That is, each Select Agent requires that its own sequence space to be defined.

Page 86 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

to function.⁷ Almost all Select Agent genomes have now been completely sequenced. The guidance is not attempting to deal with using DNA or synthetic biology to create a novel (modified, chimeric, or de novo) “synthetic” agent. The difficulties of dealing with modified, chimeric, and de novo synthetic agents constitute the reason for the charge to our committee.

Limiting the nucleic acids language to viruses and toxins reflects the current technical state of the art that synthetic microbial genomes cannot be “booted” into new organisms, whereas many viruses can be, and toxin genes can be expressed in any host organism chassis (such as E. coli). The technical barriers to booting microbial genomes into new organisms are starting to fall (Lartigue et al. 2009; Gibson et al. 2010). Synthetic biology will sooner or later be able to “boot” a wide variety of organisms from synthetic DNA. The Select Agent Regulations language will need to broaden to cover nucleic acids of all Select Agents that can be booted from synthetic genomes.

The stickier point from the standpoint of sequence analysis and sequence-based classification—and where one enters a grey area of modified synthetic agents, with engineered differences from wild-type Select Agents—is what exactly is meant by such terms as infectious forms or functional forms or genomic fragments. This might mean having a complete set of all the right parts to attempt to boot an organism, or it might mean that the resulting organism would actually be expected to work as an infectious and/or functional Select Agent. For example, the easiest viruses to boot are the small positive-stranded RNA viruses because the RNA itself is “infectious”; a full-length RNA can simply be synthesized and transfected into a host cell, where it will produce new virus. Researchers refer to the complete RNA transcript as “infectious” in the sense of being complete and sufficient to assemble new virus in laboratory cultured cells, regardless of whether the new virus is “infectious” in the sense of transmitting serious disease to humans or other host organisms. For example, an RNA virus laboratory might synthesize and transfect “infectious” mutant RNAs to study defects in virus replication or maturation in cultured cells, although any resulting viruses might not be harmful to humans.

There is no way for a DNA synthesis company to decide, from sequence alone, whether a variant genome sequence is “infectious.” Advances in synthetic genomics make it clear that the Select Agent Regulations definition is underspecified. Drawing a distinction between an apparently complete set of parts to produce a virus and an actual working pathogenic virus and considering the problems posed by synthetic and natural variation bring us back to

⁷

A qualification: DNA sequencing has an error rate that ranges from about 1/100 to 1/10⁶ in most cases, depending on the sequencing strategy. Because of DNA sequencing error, the genome sequence in the public databases is not guaranteed to be a functional genome. DNA synthesis technology also has an error rate that makes it non-trivial to construct viable large constructs (Gibson, Glass et al., 2010).

Page 87 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

the distinction between classification and prediction. Determining whether a sequence encodes an “infectious form” of a Select Agent is an empirical experimental question and will long remain beyond any foreseeable predictive ability of biology. However, we should be able to use sequence-based classification to establish a reasonable operational definition of the sequence space that circumscribes complete agent genomes, as distinct from incomplete genomes or complete genomes of related non-Select Agents.

There is an important distinction between identifying a suspicious “sequence of concern” that might be part of a Select Agent and determining that a genome sequence is “complete” (“infectious”) and therefore itself subject to the Select Agent Regulations. The former can be in a gray area, but the latter should be as bright a line as possible. The Department of Health and Human Services (DHHS) has recently published a notice, “Screening Framework Guidance for Synthetic Double-Stranded DNA Providers” (DHHS 2009), that gives guidance to DNA synthesis companies on how to screen orders for possible components of Select Agents. The purpose of the DHHS screening guidance is not to define a complete Select Agent genome for the purposes of regulation but rather to harmonize the ways in which companies should screen customers and sequences and follow up on any suspicious orders with additional questions and scrutiny to be sure that appropriate biosafety procedures are used and regulations are followed. The guidance describes a “best match” procedure for identifying that one or more 200 nucleotide fragments of a synthetic DNA order is more similar to a known Select Agent genome sequence than to any non-Select Agent sequence, by a local alignment search of the entire available nucleotide database (Genbank) at the amino acid coding level. Our committee found the suggested guidance in this Federal Register document to be strong, useful, and well thought out for the purpose of identifying sequences of concern, and (as we will discuss in more detail later) “best match” by percent identity is a reasonable rule-of-thumb sequence analysis procedure. However, the guidance is neither well defined nor stable enough for use in defining Select Agents for regulatory purposes. It relies on non-trivial manual examination of results (Select Agent versus non-Select Agent sequences are not clearly identified in Genbank), and the Genbank database itself is rapidly growing and changing, so results will vary. The guidance also makes no attempt to define a complete Select Agent genome.

The concepts that we develop below expand on the existing synthetic genomics guidance, and we describe how a reproducible objective definition of complete Select Agent genomes might be accomplished to delineate clearly which synthetic DNA constructs are and are not covered by the regulations. We emphasize that the system described is based on classification rather than on prediction. The technologies and scientific knowledge required for this system are available or could be available soon.

Page 88 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

CLASSIFICATION DEPENDS ON BOTH GENE CONTENT AND GENETIC DISTANCE

The problem of classifying a genome sequence as a complete Select Agent genome has two dimensions: (a) content—how much sequence (how many parts) must be present to distinguish a potentially complete “infectious form” of the Agent from a non-covered “genomic fragment” or “non-infectious component”; and (b) distance—how close must the sequences of each of those parts be to an actual Select Agent sequence for the same Select Agent taxonomic name to be assigned to the synthetic organism.

Confusion will arise if content and distance are not distinguished. For example, the language of 18 U.S.C. 175c that the National Science Advisory Board for Biosecurity recommended repealing defines variola virus as “a virus that can cause human smallpox or any derivative of the variola major virus that contains more than 85 percent of the gene sequence of the variola major virus or the variola minor virus.”

What does “85 percent of the gene sequence” mean? Does it mean a fragment of 85 percent or more of the full-length variola virus genome? Does it mean a full-length genome of at least 85 percent or greater sequence identity to variola? The NSABB seems to have interpreted it as the latter (“the definition of ‘variola virus’ in 18 U.S.C. 175c is based on genome sequence similarity”). They point out that this is problematic because “there are many genomes of the variola major virus genome and variola minor genomes that are significantly greater than 85 percent similar to sequences found in related but relatively harmless viruses.” With that interpretation, 18 U.S.C 175c outlaws vaccinia virus, the smallpox vaccine. The NSABB recommended repealing this problematic language “particularly because the misuse of variola virus is adequately covered by other criminal laws already in place” (NSABB 2006:12), including the Select Agents Regulations and the Biological Weapons Convention. However, as we are seeing, although the language of the Select Agent Regulations avoids making the error of an overprecise and overbroad definition, it could easily be criticized for being underspecified with regard to technical capabilities in synthetic genomics.

We will deal with the issue of content versus distance separately.

USING “PARTS LISTS” TO DEFINE GENE CONTENT

“Content” is the first issue to address. What combinations of “parts” make a complete Select Agent genome? In the genome of a known Select Agent, how many parts can we alter, delete, or substitute before the organism ceases to be dangerous enough to still be considered a Select Agent?

Our state of knowledge will necessarily be partial. It is not feasible to test all possible alterations of a genome sequence experimentally. However, we have clues, from comparative analysis of genes conserved in similar organisms and

Page 89 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

from experimental studies. We generally have a good idea of what genes are essential (required for viability of the organism under laboratory conditions, such as an RNA polymerase gene in an RNA virus), and some idea of what genes appear to be most critically involved in pathogenesis mechanisms. For the purposes of sequence-based classification, we do not need to have complete knowledge. Partial answers, given the current state of knowledge, suffice for an operational definition of a “complete” Select Agent genome. If an agent actually has 20 genes essential for viability and pathogenesis, and we know only about 10 of them, and we define any genome that contains those 10 genes as a “complete Select Agent genome,” our definition is biologically incorrect, but it can suffice as an operational definition.⁸

That is, for an operational definition of a complete Select Agent genome we can define a parts list of genes that are thought to be necessary but not sufficient for a biologically functional Select Agent genome. For example, we would not worry about the fact that an organism is not merely a bag of genes but contains extensive non-coding regulatory sequence information that would be neglected by a gene-based parts list. Especially in large genomes, we might even choose to simplify a classification system by defining an operationally complete genome as one that has a necessary subset of parts, rather than a complete set. An operational definition of a complete Bacillus. anthracis-like Select Agent genome might use a handful of core essential genes (such as ribosomal RNA and DNA polymerase) combined with virulence genes carried on the pXO1 and pXO2 plasmids.

Synthetic genomics raises a complication that someone might substitute some parts of a pathogen for “generic” parts from other genomes—for instance, swapping out genes for core replication functions or swapping out capsid (coat) proteins for different ones while trying to leave alone genes that code for pathogenicity, or moving a toxic gene or pathway into an innocuous non-Select Agent “chassis,” such as E. coli. We would therefore want a classification system to have a concept of different “resolutions” at which individual parts are defined. For parts that we imagine are more likely to be swappable (such as a DNA polymerase), a part might be “any DNA polymerase,” whereas for parts that we think are essential to the Select Agent’s virulence, a part might be defined specifically as the Select Agent gene or a closely related modification.

The more modification, the less likely the resulting organism will work as a pathogen (this is related to the prediction problem). It is not at all likely, for example, that DNA polymerases are truly generic and swappable parts (except between closely related organisms), at least not without extensive trial-

⁸

It is this difference that makes “classification” feasible and applicable in the context of an oversight system, whereas “prediction” is not. Partial knowledge can usefully be applied for a classification system, while uncertainty and inaccuracy of prediction would result in a system that could not be implemented in a meaningful way.

Page 90 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

and-error engineering. But for an operational definition, we can be relatively inclusive and broad about defining a molecular parts list for each Select Agent for classification purposes without worrying too much about whether the assemblage of parts would actually work as a pathogen. We do not have to be concerned about impeding synthetic genomics designs that incorporate parts similar to Select Agents as long as there is no legitimate research purpose for the construct. The regulatory problem is to avoid writing regulations that unnecessarily impede beneficial biomedical research with innocuous organisms and constructs, such as research on vaccines and harmless non-pathogens related to Select Agents or on isolated genes that offer no possibility of resurrecting a complete organism.

It should be scientifically feasible, with current technology and knowledge, to write a clear and acceptable definition of a covered complete Select Agent genome in the form of a genetic parts list, enumerating a subset of key genes (and possibly key regulatory sequences) that are thought to be required for the growth and pathogenicity of the Select Agent. The primary aim would be to define the sequence space around each Select Agent genome including “trivial” modifications that are most likely to encode a similarly hazardous agent without needing highly technical testing and iterative optimization,⁹ such as removal of non-essential genes, insertion of foreign genes, or deletion or substitution of non-essential residues. Genes crucial to pathogenesis (in which small variations are known to distinguish Select Agents from non-Select Agents) might be defined more specifically to avoid encompassing vaccines and related non-pathogens, and genes involved in backbone pathways such as replication might be defined more broadly. Within this framework, it might prove reasonable to define an even broader sequence space to try to account for the most likely synthetic biology scenarios for chimeric agents, such as substitution of core genes for a more generic “chassis” and leaving alone genes for known mechanisms of pathogenesis.

Any given parts list would reflect only the current state of scientific knowledge about each Select Agent.¹⁰ It would need to be subject to review and revision to keep up with the state of knowledge distinguishing Select Agents from other organisms.

SEQUENCE ANALYSIS OF INDIVIDUAL “PARTS”

Given a parts list that addresses the content of a Select Agent genome, the other question is how to define sequences that are covered for each part—the

⁹	That is, modifications that would not require work on the scale of an offensive bioweapons research program, which we deem to be beyond the scope of concern for counterbioterrorism in general and for the Select Agent Regulations in particular.
¹⁰	And a parts list would have to be defined for each Select Agent.

Page 91 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

genetic distance within which a given sequence is defined as being within the sequence space that defines the part, and distinguishes it from related parts of non-Select Agents. There is substantial established methodology regarding this part of the problem.

First we need to explain what commonly used sequence database similarity search programs like BLAST are doing—what their scores mean—before we try to use the scores in an objectively defined classification system. More broadly speaking, what is the state of the art in sequence family classification in computational biology? How are biological sequences assigned to particular sequence groups rather than to others?

BLAST (like related programs) calculates a sequence alignment of the query sequence to a target database sequence, allowing insertions and deletions. It uses a simple scoring system to find a maximum-scoring local alignment involving any piece of the query sequence that is aligned to any piece of the target (it does not require the entirety of the two sequences to align over their whole lengths). It calculates two numbers: a total score, which is the sum of all the residue alignment scores and gap penalties in the optimal-scoring local alignment, and an E value (expectation value; or P value, probability value, which is essentially the same sort of number) which represents the statistical significance of the total score.

A BLAST score constitutes an answer to the specific question: Is the target sequence homologous (related by evolutionary common ancestry) to the query sequence? The higher the score, the more statistical evidence that supports an inference that the query is homologous to the target.

The E value tells us how many times we should have expected to see a score that high if we searched a random target sequence database of the same size but consisting only of nonhomologous (unrelated) sequences. If the E value is low (say, 0.01 or less), we say that the hit is statistically significant; that is, that it was unlikely to have occurred by chance.

Neither a BLAST score nor an E value is a “distance” suitable for sequence classification into families. A BLAST score measures how likely it is that two sequences are related at all, not how closely related they are, and the E value just measures the statistical significance of the score. For example, the longer the alignment is, the more statistical signal in favor of homology accumulates, the higher the score, and the lower the E value. It is easier to find high-scoring, more significant alignments for long proteins than for short ones. However, a measure of genetic distance between two sequences should be independent of the lengths of the aligned sequences.

The standard unit of distance between two sequences in phylogenetic (evolutionary) inference is residue substitutions per site. Assuming that two sequences are homologous and assuming that we have an alignment of each individual homologous pair of residues in the two sequences, we infer how many residue substitutions have occurred at each aligned residue in the evo-

Page 92 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

lutionary time since the divergence of the two sequences. At short distances, simple percent identity of the alignment approximates the evolutionary distance; at longer distances, more sophisticated models are used to convert the observed dissimilarity between two sequences into a distance estimate. Thus perhaps counterintuitively, simple percent identity (not percent “similarity,” not BLAST score, and not E value) is a reasonable although rough measure of genetic distance. A caveat in using simple percent identity of a BLAST alignment as a genetic distance is that BLAST local alignments tend to identify and align different regions of different targets to the same query, and some regions of a protein or DNA are more conserved (and therefore more highly identical) than other regions.

Different sequences evolve at different rates. (Indeed, even different residues in the same sequence family will evolve at different rates; the residues in a critical enzymatic active site may be highly conserved, and residues in a solvent-exposed surface loop of a protein may be highly variable.) That is why it is not possible to define an overall threshold for genetic distance (or percent identity) that separates one sequence family from another (or one species from another) that works for all sequence families or all species.¹¹

For evolved sequences, the distances can be used to infer a phylogenetic tree. A phylogenetic tree is the best representation of the hierarchy of relationships between a set of sequences. There is no natural discrete classification of a set of homologous sequences into a subset of families. Breaking the set of sequences into more than one discrete family and assigning some sort of (taxonomy-like) name to these subsets must always be arbitrary, relying on some operational goal. Different reasonable people might break the same tree into different subsets to serve different goals.

Over evolutionary time, biological function changes in ways that are not fully predictable from evolutionary distance. A single residue change can change an agonist, an activator of some function, to an antagonist that blocks that very function. Exactly the same protein sequence can be found serving completely different functional roles, such as central metabolic enzymes that also function as “crystallins” to build the transparent lens in an eye, or iron-requiring metabolic enzymes that also serve as regulatory proteins that turn other genes on and off in response to changing iron concentrations. It is only usually the case that homologous proteins have similar functions; the more closely related they are, the more likely they are to have more similar functions.

Those issues bring us back to the crucial difference between prediction and classification. We can classify a new sequence as being within a group of known sequences on a tree; we cannot necessarily predict the function that the

¹¹	That is, just as content (or a parts list) must be defined for each Select Agent, the distance must also be determined for each part of each Select Agent.

Page 93 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

sequence encodes. However, as long as our classification scheme does not mislabel a non-Select Agent as a Select Agent, it may suffice operationally.

METHODS FOR SEQUENCE SUBFAMILY CLASSIFICATION

Armed with background about such sequence comparison programs as BLAST, evolutionary distances, evolutionary trees, and functional sequence subgroups as distinct clades on trees, we return to the problem of screening DNA sequence orders for Select Agents.

Screening for significant BLAST hits to a database of sequences of concern does not work, because the parts of Select Agents are homologous to parts of many non-Select Agent organisms. Many non-Select Agent organisms would trigger false-positive results.

No single threshold on BLAST score or E value solves this problem, because (to give just one reason) scores and E values depend on the length of the alignment, but different sequences of different genes have different lengths. No single threshold of percent identity will work, because some proteins are less conserved in sequence than others (that is, more tolerant of sequence differences while retaining function).

Intuitively, we would like to identify when a sequence is “closer” to a Select Agent sequence than to a non-Select Agent sequence. To do that, we must have a database not only of Select Agent sequences of concern, but also of at least a representative set of homologous non-Select Agent sequences that we are trying to distinguish from Select Agent sequences.

The simplest thing to do with the combined database is to look at the best BLAST hit for a new sequence: If the best BLAST hit is a Select Agent sequence, the new sequence is “closer” to a Select Agent than not.¹²

“Best BLAST match” classification is a reasonable and often used strategy, but it does not represent the state of the art in inferring subfamily classifications. Most important with “best match” it is hard to distinguish the case in which the new sequence belongs to a known subfamily (say, a Select Agent gene) from the case in which the new sequence is a member of a new subfamily (say, a homologous non-Select Agent gene) that is not represented in the database yet. For example, if a sequence is only 30 percent identical to a Select Agent coding gene and 20 percent identical to the nearest known non-Select Agent homolog, but all known variants of the Select Agent gene are greater than 95 percent identical to each other, it is likely that the sequence is from a previously undescribed organism (because it falls well outside the expected range of variation within the known Select Agent sequences). A simple “best match” criterion nonetheless classifies the sequence as a Select Agent. That could be problematic if newly

¹²	This is the method outlined in the DHHS Screening guidance for DNA Synthesis Companies (DHHS, 2009), and the combined database that it suggests is NCBI-Entrez.

Page 94 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

discovered organisms came to be classified as Select Agents solely because their closest known homolog in the database was a Select Agent, regardless of genetic distance or the disease-causing phenotype of the organism.

The state of the art in subfamily classification is called phylogenomic analysis. The position of the new sequence in a tree of known homologous sequences that represent different functional subfamilies (or Select Agent and non-Select Agent sequences) is examined. If the new sequence falls within a clade on the tree that defines the observed range of variation in a subfamily, one can assign it to that subfamily with confidence. If the new sequence falls outside any known subfamily but is still homologous with the overall family, one would not assign it to a subfamily and instead would annotate it according to the more generally conserved features of the larger family. Phylogenomic analysis formalizes the intuitive idea of classifying sequences with their closest sequence relatives, while being careful to see the case in which a sequence is “novel,” falling outside the known range of variation of known nomenclature groups.

Phylogenomic analysis is usually done manually by human experts. The process of making high-quality phylogenetic trees involves several steps (collecting representative sequences, making an accurate multiple alignment, and inferring a reliable sequence tree) that benefit from human judgment. There is a good deal of research into automating phylogenomic analysis, but one would probably not now seek to develop an automated high-reliability classification system based on explicitly tree-based phylogenomic inference.

The state of the art in automated subfamily classification uses models called profiles to represent each different subfamily. Especially in probabilistic form as what are called profile hidden Markov models (profile HMMs), a profile is a model of the space of possible sequences that belong to some group. Representatives of the sequence group are identified and aligned into a multiple sequence alignment. Given a multiple sequence alignment, such software tools as HMMER build a profile of the alignmentin which each aligned column is represented by a position-specific scoring system that captures information about how conserved each residue is in any new homologous sequence that would belong to the sequence family. Profile HMMs calculate the likelihood that a new sequence “belongs” to the profile’s family. A comparison of two profile HMM scores is a hypothesis test for whether a new sequence belongs to one profile or the other. It allows experts to define protein or DNA sequence families of interest and to prebuild profiles that represent the sequences that belong to those families. New sequences are compared with profile libraries and automatically classified on the basis of profile score.

Profile-based classification is related to phylogenomic analysis. The main difference is that the phylogenetic tree is taken into account by human experts in building an appropriate set of profiles before any new sequences are analyzed; afterwards, the profile itself does not use an explicit phylogenetic tree. An expert identifies which subfamilies are of interest for classification and builds

Page 95 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

a profile for each subfamily and for the entire family (and possibly at different levels of the hierarchy of the phylogenetic tree, capturing progressively smaller and more detailed levels of subfamily annotation).

For example, if a sequence analyst wanted to distinguish NAD-dependent malate dehydrogenases, NAD-dependent lactate dehydrogenases, and novel NAD-dependent dehydrogenases that might work on a novel substrate, she might produce a profile of known malate dehydrogenases, a second profile of known lactate dehydrogenases, and a third profile of all known NAD-dependent dehydrogenases. If a new sequence is an NAD-dependent dehydrogenase, it will probably get a significant score against all three profiles (because malate and lactate dehydrogenases are evolutionarily homologous and structurally similar). If it is a lactate dehydrogenase, the highest score will (should) be to the lactate dehydrogenase profile; if it is a novel dehydrogenase, the highest score should be to the general dehydrogenase profile.

Large profile databases are in widespread use for protein-gene annotation. A good example of a large collection used for microbial gene function annotation is the TIGRfams database at the J. Craig Venter Institute. For example, in TIGRfams, NAD-dependent L-lactate dehydrogenases are represented by model TIGR01771 (L-LDH-NAD), and bacterial NAD-dependent malate dehydrogenases are represented by model TIGR01763 (MalateDH_bact). TIGRfams does not include a general dehydrogenase profile to catch cases of novel dehydrogenases; instead it uses a strategy of predefining a curated score threshold for each family.

Different profile databases are in widespread use, aiming at different annotation goals with different levels of protein classification. TIGRfams is an example of a profile library that aims at functional protein subfamily classification. In contrast, databases like Pfam and SMART aim at the “superfamily” level, aiming to capture the widest possible diversity of all remote homologs of protein families, regardless of functional classification. Pfam, for example, has a single profile (PF02615, Ldh_2) that classifies all NAD-dependent dehydrogenases into a superfamily that includes both L-lactate and malate dehydrogenases. Almost all profile libraries are based on the same underlying profile HMM approaches and use one of two software packages (SAM or HMMER).

Profile-based sequence classification is highly automatable because the classification system relies on a relatively stable set of sequence alignments of representative sequences that define the desired families. Thus, the classification system is relatively robust to the exponential growth of the sequence databases. In contrast, classification systems that rely on all-versus-all comparison of all known sequences tend to be unstable, and it is difficult to ensure their quality because the sequence database changes rapidly. A profile-based system can be developed, tested, and benchmarked with care by experts and then maintained and used stably over long time periods.

Page 96 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

OUTLINE OF A POSSIBLE SYSTEM FOR PROFILE-BASED CLASSIFICATION OF SELECT AGENTS

With current sequence analysis technology, it would be possible to develop an automated and precisely defined system for classifying genome sequences as to whether they are “complete” Select Agents. The system might look something like the following.

For each Select Agent, a minimal parts list would be identified by experts. It would be the set of genes that are thought necessary to make an infectious Select Agent genome. The parts list does not have to be exhaustively complete, because it is being used only to classify genomes, not to describe them fully; we might choose to include only a representative subset of the genes in a microbial genome to reduce the size of the classification system (and the work needed to create it). A genome that contained all the parts on the minimal parts list would be classified as a “complete, infectious” genome for operational purposes of the Select Agent Regulations. A genome that did not contain all of the parts would be a “genomic fragment” for the purposes of the Select Agent Regulations.

For each part, an automated profile-based classification system would be developed to differentiate the subfamily of sequences belonging to the Select Agent from the larger family of sequences belonging to non-Select Agent organisms. The specificity of these profiles would vary, some of them being very specific to only the Select Agent, and some of them being general and encompassing both Select Agent and non-Select Agent sequences. This step requires expert judgment. The more general models would allow the classification system to deal with the possibility that some parts are substitutable (by synthetic biologists) with “generic” parts, so these profiles might be made at a more generic level—any RNA-dependent RNA replicase, rather than specifically the Select Agent RNA-dependent RNA replicase, for example. The more specific models would focus on the parts thought to be most responsible for pathogenicity as opposed to core replication, metabolism, and growth functions. These specific models (as distinguished from the generic models) might be specially flagged to raise a flag to indicate that parts of a Select Agent are present even though a complete Select Agent genome is not, for the purposes of prudent follow-up on the part of a DNA synthesis company—for instance, if an order might represent an attempt to obtain a Select Agent genome in several individually legal pieces.

For each Select Agent, given a minimal parts list and a profile-based classification system for each part the classification system would be tested, benchmarked, and challenged using known genome sequences. To be useful, the classification system would be required to classify correctly all known sequence variants of a Select Agent (and a set of reasonably imaginable ones), and a representative set of the most closely related non-Select Agent genomes, including

Page 97 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

very close relatives, such as vaccine strains or non-pathogenic variants used in laboratory research.

For good classification, it is not sufficient to know a single representative genome sequence of each Select Agent. Using a classification system is an attempt to determine whether a novel sequence fits into the “cloud” of sequences representing expected genetic variation for a Select Agent genome, as opposed to the “clouds” of sequences representing the most closely related non-Select Agents. The more sequences are known, the better the expected genetic variation will be understood. Genome sequences of almost all Select Agents are available, but there has been less emphasis on obtaining genome sequences for closely related non-Select Agents. Future studies are sure to discover numerous new microbial and viral species, and it is desirable that these new discoveries not be misclassified as Select Agents just because they are closely related to Select Agents. More systematic genome sequencing of non-Select Agents would improve our knowledge of biodiversity and would be useful in developing a good classification system.

The profile classification system would have to be reviewed and revised, as new knowledge accrued that required newly discovered Select Agent or non-Select Agent variants to be classified. The updating process would resemble the continuing curation of other profile library classification systems, such as TIGRfams and Pfam.

Because it is automatic and software-based, the classification system could be made readily and transparently available on the Web, where it could be reviewed and challenged by scientists in the community to be sure, for example, that it was not inadvertently misclassifying useful non-Select Agents, such as vaccines and attenuated research strains.¹³

Timely testing, updating, and public review of the system would guard against classification errors. Automated annotation of protein function based on sequence similarity analysis is robust but not error-free (Schnoes et al. 2009).

This essentially phylogenetics-based system will work better for some Select Agents than for others. The greatest difficulty in clasifying Select Agents with a phylogenetic subfamily system will occur in cases in which very closely related viruses in the same phylogenetic group that have small, easily evolved genetic changes that differentiate highly pathogenic Select Agent strains from low-pathogenicity non-Select Agent strains, that come and go in a phylogenetic tree; the “high-pathogenicity” avian influenza viruses are an example. Similar cases of convergent functional evolution arise in protein function annotation, in wihch a small number of changes in active site residues can shift a protein function and these changes convergently evolve multiple times in multiple lineages. Alternative methods that key on critical functional residues have been

¹³	This would require a process for reclassifying an agent in response to input from the scientific community.

Page 98 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

developed to deal with the problem for protein function annotation (Hannenhalli and Russell 2000) and could be deployed and benchmarked for the Select Agent classification problem.

There will be cases in which any sequence-based classification system must fail altogether. For example, the bovine spongiform encephalopathy (BSE) prion agent is on the Select Agent list, but prions are an alternatively folded conformation of a host protein; the amino sequence of the prion form of the protein is identical to the benign host form. The only way to distinguish the BSE prion from the natural host protein is by experimental assay.

There would be no pretense of prediction in this classification system. Many genomes would be classified as a Select Agent because they have all the parts of a Select Agent, but there is little reason to think that all those parts would necessarily work in concert to produce a working, infectious, pathogenic organism; indeed, most synthetic genomes that had all the independent parts would probably not work as dependent wholes. From the standpoint of dealing with the implications of synthetic biology and synthetic genomes, the utility of the classification system would not be to distinguish successful genome designs from unsuccessful ones—“bootable” pathogens from inert DNA sequences—but to distinguish attempts to synthesize a dangerous genomes similar to a Select Agent from an attempt to synthesize benign genomes from a non-Select Agent organism, a non-pathogenic strain, or a vaccine. The classification system does not distinguish legitimate research from illegitimate research; rather it identifies agents that are restricted under the Select Agent Regulations and provides a means of identifying “sequences of concern” that may be worth monitoring.

The goal of the Select Agent Regulations is to restrict availability of the most dangerous known pathogens while not impeding beneficial biomedical research on known or emerging pathogens. In dealing with synthetic biology and the potential threat posed by novel agents, our goal is to try to regulate the most obvious attempts to synthesize a potentially working pathogen, and the current state of the art in synthetic biology is the ability to produce new combinations of existing biological parts, not to devise new genomes entirely de novo. We can never exclude radically novel synthetic biology designs, but we can raise the bar to the point where bioterrorists would have to possess knowledge better than the current state of the art with respect to what biological parts are necessary in a pathogen to evade a parts-list-based Select Agent classification system or would have to engage in an offensive biological weapons research program on a scale that would come under the Biological Weapons Convention.

A classification system would clearly be the easiest to develop for Select Agents with the smallest parts lists. The easiest would be the protein toxins composed of one or a few proteins, such as abrin and ricin. The next easiest would be the proteins encoding the multistep synthetic pathways for metabolite toxins such as diacetoxyscirpenol, saxitoxin, and tetrodotoxin (on the presump-

Page 99 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

tion that this biosynthetic pathway might be moved in a modular form into a new host to create a new organism that expresses the toxin). Next would be the viruses, ranging from small genomes (such as Lassa virus) to large ones (such as smallpox). The microbial genomes would be the hardest to deal with, and would require the most thought about what parts are generic and what parts are specific to a Select Agent pathogen.

CONSIDERATIONS FOR IMPLEMENTATION OF A PROFILE-BASED CLASSIFICATION SYSTEM

It is not the role of our committee to recommend specific implementation plans, nor are we properly constituted to do so. But we were tasked with describing an “alternative framework” for oversight, so it is appropriate to make some observations about implementing a profile-based sequence classification system along the lines discussed in this chapter.

To be useful for unambiguous regulations, there would need to be a single agreed-on classification system as opposed to multiple competing systems developed by different research groups. That would require a centralized funding plan that would balance the benefits of single source standardization by a single Select Agent classification system team against the need for oversight and review to maintain quality and efficiency in the absence of peer competition.

A classification system would require a small team of full-time staff to develop and maintain it. The sequence curation work required is substantial. Classifying the current 82 Select Agents would require 82 parts lists and on the order of several thousand different profiles for the parts, and each Select Agent classification would need to be carefully tested and maintained over time. That would be on the same scale as the curation effort involved in the current Pfam or TIGRfams databases for automated protein sequence annotation. The Pfam database, for example, consists of about 12,000 profiles of common protein domain families, maintained by four to six skilled full-time staff since the mid-1990s, including sequence analysts, database administrators, and software developers.

The curation team would need advice from a panel of leading scientists for each group of pathogens. The scientific advisory panels would need to meet regularly to review the relevant literature and research results and would need to develop and maintain up-to-date consensus on the parts lists and parts classifications that define suitable sequence spaces around each Select Agent. These defined sequence spaces would be embodied in an automated classification system by the curation team. The classification system (and comments from the scientific community on its accuracy, gathered from the scientific community) could be reviewed by the appropriate government departments, and the database system would be approved as guidance in interpreting the law, much as the Centers for Disease Control issued a written document to guide

Page 100 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

gene synthesis companies in interpreting the application of the Select Agent Regulations to synthetic DNA.

These scientific advisory panels would probably include not only U.S. scientists, but the best scientists from around the world. International participation would have intangible additional benefits. Gene synthesis is an international industry; international harmonization of regulation and best practices for biosafety and biosecurity in synthetic genomics is an important area. In addition, participation of international scientists in the undertaking could raise awareness of dual-use issues among international researchers—a major objective of the National Strategy for Countering Biological Threats¹⁴ and of the NSABB.

A balance would need to be struck between the need to keep definitions up to date with the state of scientific knowledge about the genetic composition of plausible complete and infectious Select Agent pathogens and the need to have a stable regulatory environment. It would be undesirable to have high-consequence regulations like the Select Agent Regulations changing on a rapid time scale. It would be unreasonable, for example, to have sequences moving on and off the list on a time scale much faster than the time scale of converting a laboratory to meet the Select Agent Regulations. A suitable time scale might be to issue an updated classification system every two years. This is consistent with the current review process for the Select Agent Program, which is overseen by the Intragovernmental Select Agents and Toxins Technical Advisory Committee (ISATTAC).

The periodic expert review and update cycle could be meshed well with recommendations of other recent advisory reports calling for increased cross-agency harmonization of the Select Agent Regulations, and for increased transparency in the procedures for moving agents on and off the list.¹⁵

As we have discussed, the decisions made in establishing classification boundaries in sequence space are unavoidably arbitrary. They cannot be interpreted as biological predictions of whether given synthetic genome sequences would function as dangerous pathogens. Nevertheless, such a system would be an improvement over the current process. It would transparently, consistently,

¹⁴

“Transform the international dialogue on biological threats: Activities targeted to promote a robust and sustained discussion among all nations as to the evolving biological threat and identify mutually agreed steps to counter it.”

¹⁵

2009 National Research Council report Responsible Research with Biological Select Agents and Toxins, “RECOMMENDATION 2: To provide continued engagement of stakeholders in oversight of the Select Agent Program, a Biological Select Agents and Toxins Advisory Committee (BSATAC) should be established. The members, who should be drawn from academic/research institutions and the private sector, should include microbiologists and other infectious disease researchers (including Select Agent researchers), directors of BSAT laboratories, and those with experience in biosecurity, animal care and use, compliance, biosafety, and operations. Representatives from the federal agencies with a responsibility for funding, conducting, or overseeing Select Agent research would serve in an ex officio capacity …” (NRC 2009b).

Page 101 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

and unambiguously represent the harmonized views of a community of experts. A centralized system would almost certainly lead to better decisions than reliance on a series of dispersed judgments by individual scientists in gene synthesis companies who have little specific knowledge about the pathogen sequences that they might be asked to synthesize.

It would be prudent to develop the system in phases, starting with a pilot project on a subset of the smallest Select Agent sequences, such as the protein toxins and the smallest Select Agent viruses. Several of the smallest Select Agent sequences are at the same time considered the most dangerous and feasible threats for current synthetic genomic technology, and also the smallest and easiest test cases for a profile-based classification system (because they require definition of the fewest parts).

ROLE OF PREDICTION AND CLASSIFICATION IN BIOSAFETY

Chapter 2 discussed how high-level biological phenotypes, such as pathogenicity and transmissibility, cannot plausibly be predicted with the degree of certainty required for statutory biosecurity regulations, either now or in the foreseeable future. Nonetheless, predictive ability is a major goal of biology, and it is sure to develop. Any predictive ability will develop first in ways suited for probabilistic risk assessment—for raising potential biosafety concerns about an encoded organism—long before prediction reaches the far higher bar of making completely precise and accurate judgments suitable for declaring that novel synthetic constructs are dangerous weapons and immediately subject to Select Agent Regulations in the absence of actual experience with the encoded organism’s properties. Predictive systems biology will slowly give us a better ability to assess whether a synthetic genome sequence might be more or less hazardous and whether the organism that it encodes might be more or less likely to have the phenotypic properties of Select Agents. Therefore, to the extent that prediction of biological properties of Select Agents does become possible, we believe that it will first be useful in the context of a yellow flag biosafety warning system—warning investigators and their institutions and institutional biosafety committees (IBCs) that a synthetic construct might be hazardous, perhaps inadvertently so, and that the construct and its encoded organism ought to be handled with an appropriate level of biosafety containment.

The centralized classification system for pathogen sequences described above could inform biosafety judgments long before prediction has a substantial impact. The two concepts of a minimal parts list for a complete pathogen and a profile-based classification system could help scientists to identify constructs that, although not complete pathogens within the definitions of the Select Agent Regulations, merit close scrutiny for their biosafety and biosecurity implications. Classification is not prediction, but it is completely plausible that constructs that are close to a minimal parts list for a pathogen or are similar

Page 102 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

to a pathogen gene merit more thorough biosafety evaluation than constructs that are not.

RAISING A YELLOW FLAG FOR “SEQUENCES OF CONCERN”

A wide array of DNA sequences are in a category that, although not completely innocuous, should not be subject to the severe constraints on research imposed by the Select Agent Regulations. The Select Agent Regulations cover complete genomes (“nucleic acids that can produce infectious forms of any of the Select Agent viruses”—language that has been formally interpreted as excluding “genomic fragments from Select Agents”). There are many reasons for prudent concern about providing synthetic genomic fragments of Select Agents. A “bad actor” might seek to obtain a genome in fragments that individually did not meet the criteria of the Select Agent Regulations and then assemble the complete genome. The “bad actor” would be in violation of the Select Agent Regulations by constructing a complete genome, but from a regulatory perspective, we might want to make this technically plausible scenario more difficult to carry out and might want to have more advance notice when such a scenario might be playing out.

We would like to explore the concept of a “yellow flag” system that could provide a transition between the standard biosafety practices that are applied to the vast majority of research projects and the highly regulated, highly restrictive practices required by the Select Agent Regulations. (The concept of the yellow flag will also be discussed in Chapter 4 and Appendix L.) Our best chance to reduce public health risks posed by infectious disease is an active, efficient, and safe R&D program directed at known and emerging pathogens. This committee and others believe strongly that overbroad application of the Select Agent Regulations will increase risks to public health by increasing the cost and reducing the speed of critically important research on pathogens. The aim of the yellow flag is to provide a set of practices that improve safety and security without increasing the number of sequences that are covered by the Select Agent Regulations.

Restrictions on pathogen research have two primary goals: to make it harder for bad actors to use pathogens as weapons or as tools for bioterror and to avoid the accidental, inadvertent, or ill-advised production of hazardous constructs by well-meaning investigators. Meeting both goals through a yellow flag system probably requires some form of disclosure of the pathogenic sequences or experimental plans to scientists outside the groups carrying out the research. Thiat disclosure could involve centralized reporting of yellow flag sequences, public disclosure, or some kind of institutional review. A centralized system for reporting yellow flag sequences would allow detection of the simplest scheme for avoiding the Select Agent Regulations—splitting the order for a viral genome or a toxin between two different gene synthesis companies. It would

Page 103 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

also enable intelligence analysis and monitoring of DNA sequence orders. Public disclosure would offer the virtues of an open-source system—the power of review by many different observers—but it would require a substantial change in the culture of science. A standardized system for identifying “yellow flag” sequences coupled with biosafety review (and biosecurity monitoring) might offer the simplest approach to reducing risks.

We envision that the actions taken in response to a yellow flag could be informal, prudent best practices in that they fall outside the strict regulatory boundaries of the Select Agent Regulations. It would also be possible to use a yellow flag system in more formal ways. For example, an IBC or funding sponsor could ask that yellow-flagged synthetic constructs trigger some sort of special notification for purposes of oversight if for no other reason than to track what laboratories were in possession of such constructs. Similarly, DNA synthesis companies might be asked to maintain records of yellow-flagged constructs that they provide, to facilitate forensic investigation in the event of the criminal construction of a complete Select Agent from synthetic parts.

The concept of raising a yellow flag on synthetic genomic constructs clearly overlaps with the biosecurity goals of the Select Agents Program—providing a sort of buffer zone that identifies individual Select Agent synthetic parts that do not rise to the precise inclusive definitions of the “complete, infectious” genomes that come under the Select Agent Regulations. Moreover, we view a yellow flag system more broadly from a biosafety perspective. We believe on intuitive grounds that the probability of accidental, inadvertent, or ill-advised hazardous constructs by well-meaning investigators (including hobbyist biohackers) is much higher than the probability of deliberate weapons constructs,¹⁶ although neither probability can be estimated. A yellow flag system could help to build a web of reasonable oversight of synthetic genomic constructs. Investigators or biohackers could be warned if their constructs included potentially dangerous Select Agent-like parts, to be sure that they were aware and to be sure that they were prepared and equipped to handle the constructs appropriately. It would also be possible to expand the list of yellow flag parts used in profile-based classification beyond the Select Agents list to include other sequences of concern.

Initially, the “yellow flag” system could be based on the profile-based classification system described above, and best practices could be promulgated for how investigators, institutions, IBCs, and synthetic genomics companies should deal with constructs deemed to be “potentially risky.” As predictive methods are developed, they would naturally augment such risk assessments. Because the profile-based classification system would be transparently available in its func-

¹⁶	At the very least, those engaged in a deliberate weapons program are less likely to use DNA sysnthesis companies as a means of obtaining genomic fragments, when those companies are known to screen orders for “sequences of concern.”

Page 104 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

tion for the Select Agent Regulations, it would also be transparently available for parallel development of biosafety best practices of the yellow flag variety. That is, community awareness and best practices for responsibility, review, and oversight of biosafety of synthetic constructs could be developed now, starting with the profile-based classification system, and this biosafety framework would then be in place as scientifically sound de novo predictive systems develop.

SHOULD SUCH A SYSTEM BE BUILT?

All that said, should such a profile-based classification system be built? Our committee is not constituted to answer that question. Our answer to this part of our charge is that it could be built.

Together, the profile-based classification for Select Agents, and the yellow flag system for sequences of concern would address many of the emerging biosecurity concerns posed by synthetic biology. Gene synthesis companies, which have to make daily judgments based on the Select Agent Regulations and other regulations, strongly favor development of such a system. As discussed in Chapter 1, several of the companies have agreed to a common set of procedures for screening customers and sequences. In addition to helping to identify sequences that qualify as Select Agents, the parts list profile-based classification system would provide information to be used for flagging sequences of concern. The companies have agreed to apply enhanced customer screening to orders that do not meet the definition of the Select Agent Regulations but do include Select Agent sequences or other pathogen sequences. The goal of such screening is to supply sequences only to customers that have a legitimate research use for them and have the resources to handle them safely.

The system could also have utility that is independent of its role in clarifying the regulation of pathogen sequences. One of the most consistent lessons of modern biological research is that cross-pollination between diverse fields of expertise consistently yields valuable scientific insights and useful new tools (NRC 2009b). Gathering information on the most important human and animal pathogens into a single, consistently annotated database will make it easier for scientists and engineers who are not experts on a particular pathogen to develop new diagnostics, vaccines, and therapeutics.

However, the system we have described does have some objectionable qualities. It may be “overkill.” It erects a moderately complicated and pedantic definitional framework around the purely regulatory question of what sequences represent complete, infectious Select Agents or not.

One might take an alternative view that the regulatory language could be clarified more simply and elegantly with a language of “reasonableness,” expressing the concepts of sequence classification methods we have described here but without the potential heavy-handedness of a centralized automated computational implementation. For instance, there might be regulations and

Page 105 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×

guidelines that declare the intuitive notion that a synthetic genome is “reasonably” expected to be complete and infectious and that the sequence of its parts would appear to a “reasonable” person to be closer to that of a Select Agent reference sequence than to that of the nearest non-Select Agent reference sequence. Such language alone would be an improvement over the use of problematic percent identity thresholds in some current regulatory language, and over the absence of any guidance at all for dealing with modified or chimeric Select Agent synthetic genomes. The intuitive concepts of sequence-based classification are sufficiently clear for anyone to know whether a sequence is in the vicinity of any reasonable definition of the line. However, a “reasonableness” approach does not solve the problem of vagueness that troubles the DNA synthesis companies, researchers, and law enforcement as they try to apply the Select Agent Regulations. Without a precise definition of the sequences covered by the Select Agent Regulations, companies might choose to reject any construct that contains Select Agent-related sequences, and researchers are left unsure whether they are subject to the Select Agent Regulations. Moreover, the reasonableness approach does not begin to address the issue of novel synthetic constructs and sequences of concern, which are a motivator for the development of an alternative framework for oversight.

Page 106 Cite

Suggested Citation:"3 A Proposal for Consideration: Sequence-Based Classification of Select Agents." National Research Council. 2010. Sequence-Based Classification of Select Agents: A Brighter Line. Washington, DC: The National Academies Press. doi: 10.17226/12970.

×