Is there a potential for developing and implementing an oversight system for Select Agents that is based on features and properties encoded by nucleic acids? The general answer to this question is yes. The committee believes, however, that the entire concept of “predictive” oversight is flawed in that (1) the current Select Agent list rests on a non-biological as well as a biological basis and (2) functional “prediction” alone cannot provide a level of certainty sufficient to designate a microorganism as a Select Agent, a designation that legally restricts its possession. Nevertheless, policy makers and legal experts clearly regard oversight of novel pathogens, whether natural or synthetic, as a necessary component of a comprehensive biosecurity strategy. We discuss here only the biological factors relevant to establishing a sequence-based oversight system that is focused on identifying genes and gene products that are likely to be involved in the survival and persistence of a microorganism and in its interaction with a host. Such a system would include genes of Select Agents, but it would also include a far greater number of genes that are associated with pathogenicity (the ability to cause disease) and virulence (the degree of pathogenicity encoded by a given gene or group of genes). Understanding the basis of such an oversight system requires some understanding of the biology of pathogenicity and of the current limitations of genomic analysis.
It is clear that we are immersed in an age of genomics. As of December 27, 2009, the Web site Genomesonline.org (Genomesonline 2009) reported that 3,606 bacterial genomes were being sequenced and that complete DNA sequences of at least 712 distinct bacterial strains were in the public domain. The completed sequences include all the bacterial Select Agents and most common pathogens of humans, animals, and plants. Entrez Genomes contained 3,498 reference sequences for 2,374 viral genomes, including all of the Select Agents and common plant and animal viral pathogens.
The genomes of prokaryotes possess specific and relatively well-understood promoter sequences (signals), such as transcription factor binding sites, that are relatively easy to identify. The gene sequence that codes for a protein occurs as one contiguous open reading frame (ORF), which is typically many hundreds or thousands of base pairs long. The nucleotide composition of stop codons (the punctuation between genes) and the frequency with which each is used are well known. Furthermore, protein-coding DNA has characteristic periodicities, such as a three-base periodicity that reflects its codon structure, and other statistical properties that distinguish it from noncoding DNA. Therefore, recognizing genes in prokaryotic systems is relatively straightforward, and there are well-designed algorithms that do it with high levels of accuracy.
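The idea of a contiguous ORF bounded by a start codon and a stop codon can be illustrated with a short Python sketch. It is not one of the gene-finding algorithms referred to above, which also use promoter signals and the statistical properties of coding DNA. It only scans the six reading frames for runs that begin with ATG and end at the first in-frame stop codon. The function names, the restriction to ATG as the only start codon, and the `min_codons` threshold are assumptions made for illustration.

```python
# Minimal ORF scan: an illustrative sketch, not a production gene finder.
# Assumptions: ATG is the only start codon (alternative starts such as GTG
# and TTG are ignored), and the three standard stop codons are used.

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Return a list of (strand, frame, start, end) tuples.

    start and end are 0-based, end-exclusive offsets on the scanned strand
    (for strand "-", offsets refer to the reverse complement). Each ORF
    runs from its ATG through its stop codon, and min_codons counts both.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # offset of the first ATG of the currently open ORF
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None:
                    if codon == START_CODON:
                        start = i
                elif codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # close the ORF and look for the next ATG
    return orfs
```

Calling `find_orfs(genome_sequence)` on a genome held as a string returns candidate ORFs of at least 100 codons. A length threshold of this kind is a crude stand-in for the statistical tests that real gene finders apply, since long stop-free stretches are unlikely to arise by chance in noncoding DNA.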
However, identifying a gene and understanding its function are altogether different matters. At least one-fourth of genes that are identified in bacterial genomes, whether large or small, whether from pathogen or non-pathogen,