There are currently many methods for structural alignment (20–31). Some of these are associated with probabilistic scoring schemes. In particular, one method (VAST) computes a P value for an alignment based on measuring how many secondary structure elements are aligned as compared with the chance of aligning this many elements randomly (28). Another method (27, 32) expresses the significance of an alignment in terms of the number of standard deviations it scores above the mean alignment score in an all-vs.-all comparison (i.e., a Z-score).
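As a brief illustration of the Z-score idea mentioned above, the following sketch computes how many standard deviations one score lies above the mean of a background set. The function name `z_score` and the background numbers are hypothetical and are not taken from any of the cited methods. The use of the population standard deviation is likewise an assumption.

```python
import statistics

def z_score(score, background_scores):
    """Number of standard deviations `score` lies above the background mean."""
    mean = statistics.fmean(background_scores)
    # Assumption: population standard deviation of the all-vs.-all scores.
    sd = statistics.pstdev(background_scores)
    if sd == 0:
        raise ValueError("background scores have zero variance")
    return (score - mean) / sd

# Made-up alignment scores standing in for an all-vs.-all comparison.
background = [10.0, 12.0, 11.0, 9.0, 13.0]
z = z_score(17.0, background)
print(f"Z-score: {z:.2f}")
```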

Data Set Used for Testing. One of the most important aspects of our analysis is that we carefully tested it against known structural relationships. This testing allowed us to decide unambiguously whether a given comparison resulted in a true- or false-positive and to decide objectively between different statistical schemes. In particular, structures were taken from the Protein Data Bank (33, 34), and definitions of domains, structural classes, and structural similarities were taken from the Structural Classification of Proteins (scop) database (version 1.32; refs. 35–37). The creators of scop have clustered the domains in the Protein Data Bank on the basis of sequence identity (38, 39). At a sequence identity level of 40%, this clustering resulted in 941 unique sequences corresponding to the known structural domains. These 941 sequences were what we used as test data for both the sequence and structure comparisons. They contained 390 different superfamilies and 281 different folds. Because superfamily pairs had a considerably closer and more certain relationship than fold pairs, we concentrated here on superfamily pairs. These 2,107 nontrivial pairwise relationships between the domains formed our set of true-positives.
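The bookkeeping behind such a test set can be sketched as follows. The domain names and superfamily labels are hypothetical, and counting true-positives as unordered same-superfamily pairs is an assumption, since the text does not say whether the 2,107 relationships are ordered or unordered. The ordered all-vs.-all count does reproduce the 884,540 pairs quoted in the Fig. 1 caption for 941 sequences.

```python
from collections import Counter

def count_pairs(superfamily_of):
    """Return (ordered all-vs.-all pairs, unordered same-superfamily pairs)."""
    n = len(superfamily_of)
    ordered_all = n * (n - 1)  # A-B and B-A both counted, self-pairs excluded
    sizes = Counter(superfamily_of.values())
    # Assumption: a true-positive is an unordered pair within one superfamily.
    true_pos = sum(k * (k - 1) // 2 for k in sizes.values())
    return ordered_all, true_pos

# Hypothetical toy classification: six domains in three superfamilies.
toy = {"d1": "sfA", "d2": "sfA", "d3": "sfA",
       "d4": "sfB", "d5": "sfB", "d6": "sfC"}
ordered_all, true_pos = count_pairs(toy)
print(ordered_all, true_pos)

# Consistency check against the figure caption: 941 sequences, ordered pairs.
print(941 * 940)
```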

Sequence Comparison Statistics. Sequence matching was done with standard approaches. In particular, we used the SSEARCH implementation of the Smith-Waterman algorithm (7) [from the FASTA package, version 3 (12, 40)], with a gap-opening penalty of –12, a gap-extension penalty of –2, and the BLOSUM50 substitution matrix [which has a maximal match score of 13 (for C to C) and an average match score of –0.36].
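For readers unfamiliar with the algorithm, a minimal sketch of a Smith-Waterman local alignment score with affine gap penalties is given below. It is not the SSEARCH implementation. Two things are assumptions made for illustration: the gap cost convention (a gap of length k costs 12 + 2(k − 1)) and the toy scoring function `toy_score`, which stands in for BLOSUM50 so that the example stays self-contained.

```python
def smith_waterman_score(a, b, score, gap_open=12, gap_extend=2):
    """Best local alignment score with affine gaps (Gotoh-style recurrences).

    Assumption: a gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    neg = float("-inf")
    m = len(b)
    h_prev = [0] * (m + 1)    # best score of an alignment ending at (i-1, j)
    f_prev = [neg] * (m + 1)  # same, but ending with a gap in b
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (m + 1)
        f_cur = [neg] * (m + 1)
        e = neg               # best score at (i, j) ending with a gap in a
        for j in range(1, m + 1):
            e = max(h_cur[j - 1] - gap_open, e - gap_extend)
            f_cur[j] = max(h_prev[j] - gap_open, f_prev[j] - gap_extend)
            diag = h_prev[j - 1] + score(a[i - 1], b[j - 1])
            h_cur[j] = max(0, diag, e, f_cur[j])  # 0 restarts the local alignment
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best

def toy_score(x, y):
    """Hypothetical substitution scores (not BLOSUM50): +5 match, -4 mismatch."""
    return 5 if x == y else -4

identical = smith_waterman_score("ACGT", "ACGT", toy_score)
gapped = smith_waterman_score("AAAAATTTTT", "AAAAAGGTTTTT", toy_score)
print(identical, gapped)
```

Only the score is returned, which is all that the statistics discussed here require. Recovering the alignment itself would need a traceback over the full matrices.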

A probability-density function for sequence-comparison scores. Each pairwise sequence comparison was best quantified by three numbers, Sseq, n, and m, where Sseq is the raw sequence alignment score and n and m are the lengths of the two sequences compared. Comparing all possible pairs of sequences allowed us to calculate an observed probability density, ρ°seq, for the chance of finding a pair of sequences with particular values for Sseq and ln(nm). Fig. 1A shows the density for pairs between all sequences. This includes the scores for

FIG. 1. A probability-density distribution for sequence comparison scores, contoured against Sseq, the sequence alignment score (along the horizontal axis), and ln(nm), where n and m are the lengths of the two sequences in the pair (along the vertical axis). This density is related closely to the raw data (via normalization) obtained by counting the number of pairs with particular Sseq and ln(nm) values. Because of the wide range of density values, contours of the logarithm of the density are drawn with an interval of 1 (a full order of magnitude). When contouring the logarithm of a density function, special attention must be paid to the zero values. Here, a zero value is set to 0.001, which effectively lifts the entire surface by 3 log units. The data then are smoothed by averaging with a Gaussian function [exp(−s²/(ΔSseq/3)²)] over a window 14 units wide along the Sseq axis. This smoothing together with the treatment of zeros serves to emphasize the smallest observed counts (values of 1) by surrounding them with three contour levels. (A) Data from all 884,540 pairs between any one of the 941 sequences and any other sequence (pairs A-B and B-A are both included). The significant sequence matches are seen as the isolated spots at high values of the score Sseq. (B) Data from 352,168 pairs, including only those pairs of sequences in different scop classes. We also exclude pairs between an all-α or all-β domain and an α+β domain, as well as sequences that are not in one of the five main scop classes: α, β, α/β, α+β, and α+β (multidomain). This exclusion is done to ensure that no significant matches will be found, which indeed is seen in the figure by the absence of any outlying spots at high score values. Thus, the density in B is free of any significant matches and shows the underlying density distribution expected for comparison of unrelated sequences.
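The processing described in the caption can be sketched roughly as follows. Several details are assumptions rather than the authors' procedure: the bin sizes, the order of operations (logarithm of counts first, then smoothing of the log values), the Gaussian weight taken as exp(−s²/(ΔS/3)²) with ΔS equal to the 14-unit window, and the omission of the final normalization from counts to a density. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def log_count_grid(pairs, s_bin=1.0, l_bin=0.5, floor=0.001):
    """Bin (S, n, m) triples on (S, ln(nm)) and return log10 counts per cell."""
    counts = Counter()
    for s, n, m in pairs:
        key = (math.floor(s / s_bin), math.floor(math.log(n * m) / l_bin))
        counts[key] += 1
    s_keys = [k[0] for k in counts]
    l_keys = [k[1] for k in counts]
    s_range = range(min(s_keys), max(s_keys) + 1)
    l_range = range(min(l_keys), max(l_keys) + 1)
    # Empty cells are set to `floor` so that the logarithm is defined.
    grid = {(i, j): math.log10(counts.get((i, j), 0) or floor)
            for i in s_range for j in l_range}
    return grid, s_range, l_range

def smooth_along_s(grid, s_range, l_range, window=14.0, s_bin=1.0):
    """Gaussian-weighted average of each cell over a window along the S axis."""
    half = int(window / (2 * s_bin))
    width = window / 3.0
    out = {}
    for j in l_range:
        for i in s_range:
            num = den = 0.0
            for d in range(-half, half + 1):
                if (i + d) in s_range:  # ignore cells beyond the grid edge
                    w = math.exp(-((d * s_bin) / width) ** 2)
                    num += w * grid[(i + d, j)]
                    den += w
            out[(i, j)] = num / den
    return out

# Made-up (score, length n, length m) triples for illustration only.
pairs = [(20, 100, 120), (21, 100, 120), (20, 90, 110), (60, 150, 150)]
grid, s_range, l_range = log_count_grid(pairs)
smoothed = smooth_along_s(grid, s_range, l_range)
print(min(smoothed.values()), max(smoothed.values()))
```

Under these assumptions an isolated cell with a single count sits at log value 0 in a background of −3, so the smoothed surface falls through three unit contour levels around it, which is the effect the caption describes.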

Copyright © National Academy of Sciences. All rights reserved.