defined only with respect to the phylogeny of the genes and not with respect to function.
Identifying orthology by using relative levels of sequence identity. Ideally one would expect that the orthologous genes of two genomes are those that have the highest pairwise identity, having bifurcated relatively recently compared with genes that duplicated before the speciation. The most straightforward approach to identifying orthologous genes is to compare all genes in the two genomes with each other, and then to select pairs of genes with significant pairwise similarities. The pair of sequences with the highest level of identity is then considered orthologous.
Auxiliary information for detection of orthology. One type of auxiliary information that is useful to assess orthology is “synteny”: the presence in both genomes of neighboring sequences that are also orthologs of each other. As shown below, little of the gene order in genomes is conserved by the time the divergence of their orthologous genes reaches a level of 50% amino acid identity (see Fig. 3). Hence the potential for using synteny to identify orthologs is limited mainly to genomes that have speciated only relatively recently. A second type of auxiliary information that can be used is the comparison of genes with those of a third genome. If two genes from different genomes have the highest level of identity both to each other and to a single gene from a third genome, then this is a strong indication that they are orthologs (see ref. 15 for a large-scale implementation of this idea). However, for a large fraction of genes, identifying orthologs by relative sequence identity is hampered by a variety of evolutionary processes. We describe these in the following sections.
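The third-genome test described above can be sketched in a few lines of Python. This is an illustration only, not the implementation of ref. 15: the function name and the four best-hit dictionaries (each mapping a gene to the gene with the highest identity in another genome) are hypothetical, and the best hits are assumed to have been computed beforehand.

```python
def supported_by_third_genome(a, b, best_ab, best_ba, best_ac, best_bc):
    """Return True if genes a (genome A) and b (genome B) are each other's
    best hit and both have the same best hit in a third genome C.

    best_xy maps each gene of genome X to its highest-identity gene in
    genome Y (hypothetical, precomputed inputs).
    """
    # a and b must have the highest identity to each other.
    if best_ab.get(a) != b or best_ba.get(b) != a:
        return False
    # Both must point to one and the same gene in the third genome.
    c = best_ac.get(a)
    return c is not None and best_bc.get(b) == c


# Toy example with made-up gene names.
best_ab = {"a1": "b1", "a2": "b2"}
best_ba = {"b1": "a1", "b2": "a2"}
best_ac = {"a1": "c1", "a2": "c2"}
best_bc = {"b1": "c1", "b2": "c9"}

print(supported_by_third_genome("a1", "b1", best_ab, best_ba, best_ac, best_bc))
print(supported_by_third_genome("a2", "b2", best_ab, best_ba, best_ac, best_bc))
```

In the toy data the pair a1/b1 is supported by the shared best hit c1, whereas a2/b2 is not, because the two genes have different best hits in the third genome.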
Sequence divergence. At large evolutionary distances, e.g., between Archaea and Bacteria, sequence similarities may be eroded to such an extent that the distance between orthologous sequences is similar to that between sequences that are merely part of the same gene family. More dramatically, homologous sequences can diverge “beyond recognition,” such that the similarity between two orthologs is no higher than the similarity between sequences that are not part of the same gene family, and automatic procedures for the recognition of homology fail. A recent survey of genes in Drosophila shows that one-third of the cDNAs code for very fast evolving genes, for which the frequency of amino acid substituting mutations is only 2-fold lower than that of silent mutations, leading to a situation where homologous proteins are barely recognizable after 8,000 years of evolution (16).
Nonorthologous gene displacement. A second event problematic to ortholog identification is nonorthologous gene displacement: two nonorthologous genes that are unrelated or only remotely related perform the same function in two organisms (17). This occurs relatively frequently: a comparison of M. genitalium to H. influenzae revealed 12 clear-cut cases (17). As a consequence, orthologs may not be detectable (or are classified as paralogs) in another organism even when the corresponding function is retained.
Gene duplication, gene loss, and horizontal gene transfer. A third process that restricts the identification of orthologous genes is that of gene loss in combination with gene duplication. If two genomes lose different paralogs of an ancestral gene that was duplicated before the speciation event, the remaining genes have the highest sequence identity even though they are not orthologs (18). One may test for such an event by checking whether the protein similarity falls into an expected range. This is done implicitly by including (presumably orthologous) sequences from other species in the phylogeny and checking whether the gene tree is in accordance with the species tree (18, 19). Inconsistencies between the species tree and the gene tree can indicate nonorthologous relationships between genes. However, they also can be caused by horizontal gene transfer, in which case the genes still could be orthologs. In general, the identification of orthologous sequences cannot be separated from the detection of horizontal gene transfer and of ancient gene duplications. Besides the construction of phylogenetic trees, an additional strategy for finding horizontal gene transfer is the comparison of nucleotide frequencies within a genome. Recently transferred genes often display nucleotide frequencies that deviate significantly from the rest of the genome (20, 21). A conservative estimate of the number of genes that recently have been transferred to E. coli, based on nucleotide frequencies and dinucleotide frequencies in genomes, is 10% to 15% of the E. coli genome (Phil Green, personal communication; ref. 21). A third strategy for finding horizontal gene transfer is synteny. Because gene order is rarely conserved in evolution, the presence in two distant evolutionary branches of the same order of genes, combined with the absence of this gene order in other more closely related branches, can point to horizontal gene transfer. This strategy has been used successfully to find the example of horizontal gene transfer described in Fig. 1.
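The idea of detecting recently transferred genes by their deviating nucleotide frequencies can be illustrated with a deliberately simple Python sketch. It is not the method of refs. 20 and 21: it assumes that G+C content alone serves as the composition measure, that every sequence is non-empty, and that a z-score cutoff of 2 is a reasonable threshold. The function names and the cutoff are hypothetical choices made for this illustration.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C nucleotides in a non-empty DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_genes(genes, z_cutoff=2.0):
    """Return the names of genes whose G+C content deviates from the
    genome-wide mean by more than z_cutoff standard deviations.

    genes maps a gene name to its nucleotide sequence. The cutoff is an
    illustrative assumption, not a published threshold.
    """
    gc = {name: gc_content(seq) for name, seq in genes.items()}
    mu = mean(gc.values())
    sd = pstdev(gc.values())
    if sd == 0:
        return []  # all genes have identical composition
    return sorted(name for name, value in gc.items()
                  if abs(value - mu) / sd > z_cutoff)


# Toy genome: nine genes with 50% G+C and one gene with 90% G+C.
genome = {f"gene{i}": "ATGC" * 5 for i in range(9)}
genome["foreign"] = "GGGGGCCCCA" * 2

print(atypical_genes(genome))
```

A realistic analysis would also use dinucleotide frequencies or codon usage, as the text indicates, and would have to account for native genes with unusual composition.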
Orthology in multidomain proteins. In multidomain proteins two levels of orthology can be distinguished: one is at the level of single domains, a second at the level of the whole protein. This may lead to situations where nonorthologous proteins possess orthologous domains. Modularity of genes, in the sense that modules can have different positions but the same function in various proteins, is not well documented in Bacteria and Archaea. A first step toward modularity, the presence of “gene fusion” or “gene splitting,” however, does occur regularly. Comparative analysis of the genomes of H. influenzae and E. coli showed 10 clear-cut cases of genes that were separate in E. coli but part of a single gene in H. influenzae, and 24 clear-cut cases of genes that were separate in H. influenzae but part of a single gene in E. coli (unpublished data).
A much more complicated scenario, in which many of the factors described above (multidomain proteins, synteny, and horizontal gene transfer) are involved, is shown in Fig. 1. In general, a combination of the various evolutionary processes described above leads to a situation where, although orthology was defined originally as a one-to-one relationship between proteins, it must be considered a many-to-many relationship.
From homologs to orthologs. The advent of powerful, easy-to-use tools, such as PSIBLAST (22), to find homologous sequences is likely to shift the emphasis in sequence analysis from predicting homology to predicting orthology. It is clear that, at present, there is not a single, simple, and perfect solution to the question of orthology. Orthology is methodologically defined; that is, depending on what is asked of the genomes that are compared, different methods to find orthologous genes are being used. We use a minimal definition when we are interested only in the number of orthologs shared between genomes at various phylogenetic distances. Orthologs then are defined in the following manner: (i) they have the highest level of pairwise identity when compared with the identities of either gene to all other genes in the other’s genome; (ii) the pairwise identity is significant (E, the expected fraction of false positives, is smaller than 0.01); and (iii) the similarity extends to at least 60% of one of the genes. The region of similarity is not required to cover the majority of both genes, so as to include the possibility of gene fusion and gene splitting. In more detailed comparisons between a small number of genomes, auxiliary information was used to determine orthology, such as the order of genes and the comparison to genes from a third genome (see legend to Fig. 1).
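The minimal definition given by criteria (i) to (iii) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure actually used: the `Hit` record and its field names are hypothetical, the hits are assumed to come from a prior all-against-all comparison of two genomes, the best hit of each gene is taken among the hits that already pass criteria (ii) and (iii), and ties in identity are resolved in favor of the hit seen first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise alignment between a gene of genome A and one of genome B
    (hypothetical record; field names are illustrative)."""
    gene_a: str
    gene_b: str
    identity: float   # fraction of identical residues in the alignment
    evalue: float     # E, expected fraction of false positives
    aligned_a: int    # residues of gene_a covered by the alignment
    aligned_b: int    # residues of gene_b covered by the alignment
    len_a: int        # length of gene_a
    len_b: int        # length of gene_b


def minimal_orthologs(hits, max_evalue=0.01, min_coverage=0.6):
    """Return sorted (gene_a, gene_b) pairs satisfying criteria (i)-(iii)."""
    # (ii) significance; (iii) the similarity covers at least 60% of ONE of
    # the two genes, which keeps gene fusions and gene splits detectable.
    kept = [h for h in hits
            if h.evalue < max_evalue
            and max(h.aligned_a / h.len_a, h.aligned_b / h.len_b) >= min_coverage]

    # (i) highest identity for each gene against the other genome.
    best_for_a, best_for_b = {}, {}
    for h in kept:
        if h.gene_a not in best_for_a or h.identity > best_for_a[h.gene_a].identity:
            best_for_a[h.gene_a] = h
        if h.gene_b not in best_for_b or h.identity > best_for_b[h.gene_b].identity:
            best_for_b[h.gene_b] = h

    # Keep a pair only if the same hit is the best one from both sides.
    return sorted((h.gene_a, h.gene_b) for h in best_for_a.values()
                  if best_for_b[h.gene_b] is h)


# Toy data with made-up gene names and scores.
hits = [
    Hit("a1", "b1", 0.80, 1e-30, 300, 300, 310, 320),
    Hit("a1", "b2", 0.50, 1e-10, 250, 250, 310, 400),
    Hit("a2", "b1", 0.60, 1e-15, 280, 280, 300, 320),  # b1 prefers a1
    Hit("a3", "b3", 0.90, 0.5, 100, 100, 110, 110),    # not significant
    Hit("a4", "b4", 0.70, 1e-20, 90, 90, 100, 500),    # fusion-like case
]

print(minimal_orthologs(hits))
```

In the toy data the a4/b4 pair is retained although the alignment covers only a small part of b4, which reflects the stated reason for requiring coverage of just one of the two genes.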
Given all of these complications in the finding of orthologs and the oversimplified view of evolution that the term suggests, one could conclude that it is better not to use it at all, or only in those cases where one does not have conflicting information from various sources about the phylogeny of the genes. One also can argue that it is exactly these cases where there are conflicts in the information about orthology from different sources that evolution shows some of its most interesting aspects. Orthology is an important refinement over homology in describing the phylogenetic relations between genes, as long