database (22) contains short, ungapped regions that are highly conserved, according to sequence characteristics. The HSSP database (23) contains global alignments of sequences based on structural alignments. We examined all possible subsets S of amino acids to find those groups that are well conserved. We had two criteria for conservation: (i) compactness—amino acids within the group should substitute for one another with relatively high frequency, and (ii) isolation—amino acids outside the group should substitute for those in the group with relatively low frequency. These criteria follow those often used in cluster analysis (24).
To measure compactness and isolation, we first used the BLOCKS and HSSP databases to provide a set of conditional counts c(a|S), each of which equals the total number of occurrences of amino acid a in all aligned positions that contain the group S. Conceptually, we found all aligned positions that contain S, and then tabulated all amino acids from those positions. Then, we computed conditional frequencies
where the quantity f(a|S) is defined only for amino acids a not in group S.
For each group, we computed the expected conditional frequencies and the standard error of the proportion for amino acids outside the group:
where c(a′) is the marginal count of amino acid a′ over all aligned positions.
We then computed a separation score for each group, as follows:
where Z(a|S) is a conditional relative deviate, or Z-score. The first term represents our measure of compactness, and the second term represents our measure of isolation. Based on these separation scores, we found all amino acid groups that had a separation score greater than three standard errors, which is equivalent to a significance level of 0.01. Further details of our analysis are presented in ref. 25.
Our criteria were met by 30 substitution groups in the BLOCKS database and 51 substitution groups in the HSSP database. The HSSP database yielded more groups because of its larger size, and because our criterion is based on statistical significance. Twenty substitution groups were conserved empirically in both databases, and the validation by both databases provides good evidence that these groups are indeed conserved in nature. If we arrange these groups hierarchically, we obtain the set of amino acid groups shown in Fig. 1. We used these substitution groups to define the space of motifs available to describe protein families.
Motif Enumeration and Ranking. A conserved region may be described by many possible motifs, with different levels of coverage and specificity. To better understand the choices involved, consider the sequence alignment in Fig. 2a. We can cover all sequences in the training set if we select the smallest group of amino acids that accounts for all of the amino acids
in each position. For example, every sequence has methionine in the first position, so the first position of the motif should specify M. In the second position, both phenylalanine and tyrosine occur. The smallest group of amino acids from Fig. 1 that accounts for the entire position is [FYW], which allows tryptophan to occur in addition to phenylalanine and tyrosine. Using this group is tantamount to inferring that this position requires an aromatic amino acid. In the third position, no allowable group can account for the diverse amino acids that are observed, so to achieve complete coverage we must place a wild-card character in this position.
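The position-by-position construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alignment is a toy example rather than the one in Fig. 2a, and the `GROUPS` list is a hypothetical stand-in for the groups of Fig. 1 (only [FYW] is named in the text). The function names are likewise our own.

```python
# Hypothetical substitution groups standing in for Fig. 1.
# Only "FYW" is named in the text; the others are placeholders.
GROUPS = ["FYW", "ILMV", "KR"]


def column_symbol(residues, groups=GROUPS):
    """Return the motif symbol for one aligned position."""
    observed = set(residues)
    # A fully conserved position is described by the residue itself.
    if len(observed) == 1:
        return next(iter(observed))
    # Otherwise look for the smallest allowed group covering every residue.
    covering = [g for g in groups if observed <= set(g)]
    if not covering:
        return "."  # no allowed group fits: wild card
    return "[" + min(covering, key=len) + "]"


def full_coverage_motif(alignment, groups=GROUPS):
    """Build the motif that matches every sequence in the alignment."""
    columns = zip(*alignment)  # transpose rows (sequences) into columns
    return "".join(column_symbol(col, groups) for col in columns)


toy_alignment = ["MFAR", "MYGK", "MFSR"]  # toy data, not from the paper
print(full_coverage_motif(toy_alignment))
```

With these toy inputs, the second column contains F and Y and is therefore generalized to [FYW], while the third column (A, G, S) fits no placeholder group and becomes a wild card.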
The resulting motif, shown in Fig. 2b, has complete coverage, because it describes the entire training set, but it can be affected by problems with the data. Consider again the alignment in Fig. 2a. In the eighth position from the right, every sequence but one contains a leucine. The first sequence, however, contains a proline at this position. This may be the result of a sequencing error, a rare mutation, or a sequence that has been erroneously assigned to the family. In any case, if the first sequence were removed from consideration in the formation of the motif, this position in the motif would change from ‘.’ to L. Doing this reduces the coverage of the motif by one sequence, but makes it more specific.
Even in the absence of problems in the data, motifs with high coverage generally have low specificity, thereby resulting in false positives. In constructing a motif, we are therefore faced with a fundamental tradeoff between coverage and specificity. The EMOTIF algorithm explores this tradeoff for a particular alignment by exhaustively generating all possible motifs using the allowable substitution groups and quantifying the coverage and specificity for each motif.
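To make the two quantities concrete, the following Python sketch scores a single motif. It takes coverage to be the fraction of training sequences the motif matches, and it follows the paper's definition of specificity as the probability that a random sequence matches the motif, with positions treated as independent. The uniform `BACKGROUND` frequencies are an assumption made purely for illustration (the paper uses the amino acid distribution observed in SWISSPROT), and the function names and toy sequences are hypothetical.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption for illustration only: a uniform background distribution.
# The paper instead uses frequencies observed in SWISSPROT.
BACKGROUND = {aa: 1 / 20 for aa in AMINO_ACIDS}


def parse_motif(motif):
    """Split a motif such as 'M[FYW].L' into per-position residue sets."""
    positions = []
    for token in re.findall(r"\[[A-Z]+\]|[A-Z]|\.", motif):
        if token == ".":
            positions.append(set(AMINO_ACIDS))  # wild card allows anything
        else:
            positions.append(set(token.strip("[]")))
    return positions


def coverage(motif, sequences):
    """Fraction of the (equal-length, aligned) sequences the motif matches."""
    positions = parse_motif(motif)
    hits = sum(
        len(seq) == len(positions)
        and all(res in allowed for res, allowed in zip(seq, positions))
        for seq in sequences
    )
    return hits / len(sequences)


def random_match_probability(motif, background=BACKGROUND):
    """Probability that a random sequence matches, assuming independence."""
    prob = 1.0
    for allowed in parse_motif(motif):
        prob *= sum(background[aa] for aa in allowed)
    return prob


toy_sequences = ["MFAL", "MYGL", "MFSP"]  # toy data, not from the paper
for motif in ("M[FYW]..", "M[FYW].L"):
    print(motif, coverage(motif, toy_sequences), random_match_probability(motif))
```

In this toy case, replacing the final wild card with L drops one sequence from coverage but lowers the random match probability by a factor of 20 under the uniform assumption, which is the tradeoff the text describes.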
Another feature of our example bears discussion. The sequences can be partitioned into two subclasses based on the amino acid in the fourth position. The first subclass has arginine in this position, whereas the second has lysine. All sequences in the first subclass have tyrosine in the final position, whereas none in the second do. Indeed, partitioning the sequences in this way allows the conserved region to be described by two highly specific motifs, rather than a single, more general one. Fig. 2c shows the motif for the first subclass. Thirteen positions are more specific than in the motif for the entire set of sequences, resulting in a factor of 10^10 increase in specificity. Thus, by finding motifs that cover only part of the training set, EMOTIF is potentially able to discover subfamilies within a superfamily and characterize each with a specific motif.
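The effect of partitioning can be illustrated with a small Python sketch. It splits a toy alignment by the residue at one position and lists the residues seen in each column of every subclass. The data and function names are hypothetical, and the sketch shows only the partitioning idea, not the exhaustive enumeration EMOTIF performs.

```python
from collections import defaultdict


def partition_by_column(alignment, index):
    """Group aligned sequences by the residue found at a given position."""
    subclasses = defaultdict(list)
    for seq in alignment:
        subclasses[seq[index]].append(seq)
    return dict(subclasses)


def column_residues(alignment):
    """Set of residues observed at each aligned position."""
    return [set(col) for col in zip(*alignment)]


# Toy data, not from the paper: position 4 (index 3) is R or K.
toy_alignment = ["MFARY", "MYGRY", "MFSKF", "MFTKW"]

print("all:", column_residues(toy_alignment))
for residue, members in partition_by_column(toy_alignment, 3).items():
    print(residue, column_residues(members))
```

Each subclass shows fewer distinct residues per column than the full alignment, so a motif built from a subclass alone can be more specific at those positions.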
We define specificity as the probability that a random sequence would match the motif, so a lower value indicates a more specific motif. To calculate this, we assume that the amino acids at each position of a random sequence are independent and identically distributed. We use the observed distribution of amino acids in the SWISSPROT database as an estimate for this distribution. The