Chapter 3
Seeing Conserved Signals: Using Algorithms to Detect Similarities between Biosequences

Eugene W. Myers
University of Arizona

The sequence of amino acids in a protein determines its three-dimensional shape, which in turn confers its function. Segments of the protein that are critical to its function resist evolutionary pressures because mutations of such segments are often lethal to the organism. These critical "active sites" tend to be conserved over time and so can be found in many organisms and proteins that have similar function. Analogously, functionally important segments of an organism's DNA tend to be conserved and to recur as common motifs. In this chapter, the author introduces algorithms for comparing DNA and protein sequences to reveal similar regions. Particular attention is given to the problem of searching a large database of catalogued sequences for regions similar to a newly determined sequence of unknown function.

Since the advent of deoxyribonucleic acid (DNA) sequencing technologies in the late 1970s, the amount of data about the protein and DNA sequence of humans and other organisms has been growing at an exponential rate. It is estimated that by the turn of the century there will be terabytes of such biosequence information, including DNA sequences of entire human chromosomes. Databases of these sequences will contain a wealth of information about the nature of life at the molecular level if we can decipher their meaning.
Proteins and DNA sequences are polymers consisting of a chain of monomers with a common backbone substructure that links them together. In the case of DNA, there are 4 types of monomers, the nucleotides, each having a different side chain. For proteins, there are 20 types of monomers, the amino acids. With just a few exceptions, the sequence of monomers, that is, the primary structure, of a given protein or DNA strand completely determines the three-dimensional shape of the biopolymer. Because the function of a molecule is determined by the position of its atoms in space, this almost perfect correlation between sequence and structure implies that to know the function of a biopolymer, it in principle suffices to know its primary sequence.

The primary sequence of a DNA segment is denoted by a string consisting of the four letters A, C, G, and T. Analogously, the primary sequence of a protein is denoted by a string consisting of 20 letters of the alphabet, one for each type of amino acid. In principle, these strings of symbols encode everything one needs to know about the protein or DNA strand in question. If the primary sequences of two proteins are similar, then it is reasonable to conjecture that they perform the same function. Because DNA's principal role is one of encoding information (including all of an organism's proteins), the similarity of two segments of DNA suggests that they code similar things.

Mutation in a DNA or protein sequence is a natural evolutionary process. Errors in the replication of DNA can cause a change in the nucleotide at a given position. Less often, a nucleotide is deleted or inserted. If the mutation occurs in a region of DNA that codes for protein, these changes cause related changes in the primary sequence and, hence, the shape and activity of the protein.
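
To make the string view of sequences concrete, the following minimal Python sketch represents a DNA strand as a string over the four-letter alphabet and applies the three kinds of point mutation just described: substitution, deletion, and insertion. The function names, the example strand, and the use of the standard one-letter amino acid codes for the protein alphabet are illustrative assumptions, not part of the chapter.

```python
# Alphabets for the two kinds of biosequence. The 20 letters below are the
# standard one-letter amino acid codes (an assumption; the chapter only says
# that 20 letters are used).
DNA_ALPHABET = set("ACGT")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid(seq, alphabet):
    """Return True if every symbol of seq belongs to the given alphabet."""
    return all(symbol in alphabet for symbol in seq)


def substitute(seq, pos, symbol):
    """Replace the symbol at 0-based position pos (a replication error)."""
    return seq[:pos] + symbol + seq[pos + 1:]


def delete(seq, pos):
    """Remove the symbol at 0-based position pos."""
    return seq[:pos] + seq[pos + 1:]


def insert(seq, pos, symbol):
    """Insert a symbol just before 0-based position pos."""
    return seq[:pos] + symbol + seq[pos:]


if __name__ == "__main__":
    original = "GATTACA"  # hypothetical example strand
    print(is_valid(original, DNA_ALPHABET))  # True
    print(substitute(original, 2, "C"))      # GACTACA
    print(delete(original, 2))               # GATACA
    print(insert(original, 2, "G"))          # GAGTTACA
```

A substitution preserves the length of the string, whereas a deletion or insertion shifts every later position by one, which is why comparing two related sequences involves more than checking them position by position.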
The impact of a particular mutation depends on the degree to which the original and new amino acid sequences differ in their physical and chemical properties. Mutations that result in proteins that are so altered that they function improperly or not at all tend to be lethal to the organism. Nature is biased against mutations in those critical regions central to a protein's function and is more lenient toward changes in other regions.

Similarity of DNA sequences is a clue to common evolutionary origin. If two proteins in two organisms evolved from a common precursor, one will generally find highly similar segments, reflecting strongly conserved critical regions. If the proteins are very recent derivatives, one might expect to see similarity over the entire length of the sequences. While