mologous to each other. If they are homologous, they are likely to have the same, or a closely related, function, although there are exceptions. Inferring that two proteins are homologous when they are far from identical in amino acid sequence and locating their related sequences in a complete genomic sequence requires the application of mathematical and computational methods that have been developed over the last 40 years.
In the 1960s, Emile Zuckerkandl and Linus Pauling (1965) first realized that DNA and protein sequences, molecules they called “semantides” (for information-carrying polymers) contain the history of their divergence from their ancestors. From the information in genetic sequences, one could, they argued, do “paleogenetics” to find the relationships between genes and therefore also between species. This became the field of molecular evolution, which has flourished and become ever more mathematical as the sophistication of the models for evolutionary change has increased and more complex algorithms have become common for inferring evolutionary events and phylogenetic trees. Also in the 1960s, Motoo Kimura (1968) introduced the theory of neutral evolution and its large contribution to sequence divergence. One effect of his work was simply to emphasize the enormous amount of change that can be observed in biological sequences, which makes paleogenetics that much more challenging, because it means large differences in sequence can accumulate without changes in function. Margaret Dayhoff (1965) produced the First Atlas of Protein Sequence and Structure. Among other things, the atlas allowed her to analyze the substitutions observed in closely related proteins and obtain an empirically derived estimate for the rate of substitution of one amino acid for any other. The resulting percentage accepted mutations (PAM) matrices were a much improved measure of the similarity between protein sequences.
To identify the changes between two proteins, one has to find the correct, or at least an optimal, alignment between them. If they are very different, it is not easy to obtain the optimal alignment. Methods referred to generally as dynamic programming, developed by Richard Bellman in 1953, can obtain optimal alignments in such cases very efficiently. Needleman and Wunsch (1970) first published a dynamic programming algorithm to find the optimum alignment between two biological sequences, and over the next several years several variations of that method were developed, differing in how the alignments were scored and how they treated gaps. Most of the efforts were directed at global alignments, where both sequences are aligned along their entire lengths. The more challenging problem was to find local alignments, where only a portion of the two sequences has significant similarity. Local alignment is needed to compare genomes with each other, or even to ask whether a homologue of a particular gene occurs within a genome sequence. Smith and