There are currently many methods for structural alignment (20–31). Some of these are associated with probabilistic scoring schemes. In particular, one method (VAST) computes a P value for an alignment based on measuring how many secondary structure elements are aligned as compared with the chance of aligning this many elements randomly (28). Another method (27, 32) expresses the significance of an alignment in terms of the number of standard deviations it scores above the mean alignment score in an all-vs.-all comparison (i.e., a Z-score).
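As a minimal illustration of the Z-score idea mentioned above, the sketch below expresses one alignment score as the number of standard deviations it lies above the mean of a background set of scores. The function name `z_score` and the sample numbers are hypothetical and are not taken from any of the cited methods. The population standard deviation is used here as an assumption, since the text does not say which estimator those methods use.

```python
from statistics import mean, pstdev


def z_score(score, background_scores):
    """Return how many standard deviations `score` lies above the background mean.

    `background_scores` stands in for the scores of an all-vs.-all comparison.
    The population standard deviation is an assumption of this sketch.
    """
    mu = mean(background_scores)
    sigma = pstdev(background_scores)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (score - mu) / sigma


if __name__ == "__main__":
    # Hypothetical background scores with mean 5 and standard deviation 2.
    background = [2, 4, 4, 4, 5, 5, 7, 9]
    print(z_score(14, background))
```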
Data Set Used for Testing. One of the most important aspects of our analysis is that we carefully tested it against the known structural relationships. This testing allowed us to decide unambiguously whether a given comparison resulted in a true-positive or a false-positive and to decide objectively between different statistical schemes. In particular, structures were taken from the Protein Data Bank (33, 34) and definitions of domains, structural classes, and structural similarities were taken from the Structural Classification of Proteins (scop) database (version 1.32: refs. 35–37). The creators of scop have clustered the domains in the Protein Data Bank on the basis of sequence identity (38, 39). At a sequence identity level of 40%, this clustering resulted in 941 unique sequences corresponding to the known structural domains. These 941 sequences were what we used as test data for both the sequence and structure comparisons. They contained 390 different superfamilies and 281 different folds. Because superfamily pairs have a considerably closer and more certain relationship than fold pairs, we concentrated here on superfamily pairs. The resulting 2,107 nontrivial, pairwise relationships between the domains formed our set of true-positives.
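The construction of a true-positive set from superfamily assignments can be sketched as follows. This is an illustrative outline only: the function name `true_positive_pairs`, the domain identifiers, and the superfamily labels are hypothetical, and "nontrivial" is assumed here to mean pairs of two distinct domains (no self-pairs), counted once regardless of order.

```python
from collections import defaultdict
from itertools import combinations


def true_positive_pairs(superfamily_of):
    """Return the set of unordered pairs of distinct domains sharing a superfamily.

    `superfamily_of` maps a domain identifier to its superfamily label.
    """
    members = defaultdict(list)
    for domain, superfamily in superfamily_of.items():
        members[superfamily].append(domain)

    pairs = set()
    for domains in members.values():
        # combinations() never pairs a domain with itself and yields each
        # unordered pair exactly once.
        for a, b in combinations(sorted(domains), 2):
            pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    # Hypothetical labels: three domains in sfA, one in sfB, two in sfC.
    labels = {"d1": "sfA", "d2": "sfA", "d3": "sfA",
              "d4": "sfB", "d5": "sfC", "d6": "sfC"}
    print(len(true_positive_pairs(labels)))
```

Any comparison that scores well between two domains outside this set would then be counted as a false-positive under the same labeling.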
Sequence Comparison Statistics. Sequence matching was done with standard approaches. In particular, we used the SSEARCH implementation of the Smith-Waterman algorithm (7) [from the FASTA package, version 3 (12, 40); the URL is ftp://ftp.virginia.edu/pub/fasta], with a gap-opening penalty of –12, a gap-extension penalty of –2, and the BLOSUM50 substitution matrix [which has a maximal match score of 13 (for C to C) and an average match score of –0.36].
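For readers unfamiliar with the algorithm, a local alignment score with affine gap penalties can be computed roughly as in the sketch below. It is not the SSEARCH implementation. Two assumptions are made for illustration: a gap of k residues costs gap_open + (k - 1) * gap_extend, and the hypothetical `toy_score` function (+5 for a match, –4 for a mismatch) stands in for BLOSUM50, which is not reproduced here.

```python
def toy_score(x, y):
    """Hypothetical stand-in for a substitution matrix such as BLOSUM50."""
    return 5 if x == y else -4


def smith_waterman_score(a, b, score=toy_score, gap_open=-12, gap_extend=-2):
    """Return the best local alignment score of sequences a and b.

    Assumption of this sketch: the first residue of a gap costs `gap_open`
    and every further residue in the same gap costs `gap_extend`.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H: best score of a local alignment ending at (i, j), floored at 0.
    # E: best score ending at (i, j) with a gap that skips b[j-1].
    # F: best score ending at (i, j) with a gap that skips a[i-1].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            diagonal = H[i - 1][j - 1] + score(a[i - 1], b[j - 1])
            H[i][j] = max(0, diagonal, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    # The second sequence lacks the G, so the best alignment opens one gap.
    print(smith_waterman_score("ACDEFGHIKL", "ACDEFHIKL"))
```

The three-matrix form keeps gap opening and gap extension separate, which is what allows a long gap to be penalized less per residue than several short ones.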
A probability-density function for sequence-comparison scores. Each pairwise sequence comparison was best quantified by three numbers, Sseq, n, and m, where Sseq is the raw sequence alignment score and n and m are the lengths of the two sequences compared. Comparing all possible pairs of sequences allowed us to calculate an observed probability density, ρ°seq, for the chance of finding a pair of sequences with particular values for Sseq and ln(nm). Fig. 1A shows the density for pairs between all sequences. This includes the scores for