

Suggested Citation: "Chapter 4 Hearing Distant Echoes: Using Extremal Statistics to Probe Evolutionary Origins." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.


Chapter 4
Hearing Distant Echoes: Using Extremal Statistics to Probe Evolutionary Origins

Michael S. Waterman
University of Southern California

The comparison of DNA and protein sequences provides a powerful tool for discerning the function, structure, and evolutionary origin of important macromolecules. Sequence comparison sometimes reveals striking matches between molecules that were hitherto not known to be related, immediately suggesting hypotheses that can be tested in the laboratory. In other cases, sequence comparison reveals only weak similarities. In such instances, statistical theory is essential for interpreting the significance of the matches. The author discusses large deviation theory for sequence matching and applies it to evaluate a tantalizing report concerning distant echoes from the earliest period in the origin of life.

As soon as new deoxyribonucleic acid (DNA) or protein sequences are determined, molecular biologists immediately examine them for clues about their biological significance. A number of important questions about the function of a newly determined protein are often asked, including the following. What can be inferred about the function of a new protein on the basis of its amino acid sequence? Can one discern the reactions it catalyzes or the molecules it binds? What three-dimensional shape will the linear amino acid sequence of a protein assume when it folds up according to the laws of thermodynamics?

Another class of questions concerns the evolutionary relationships between known sequences. For example, some questions concerning hemoglobin are the following. What is the evolutionary relationship between three related α, β, and γ hemoglobin genes? What is the evolutionary relationship between the hemoglobin molecules from various organisms? What do these sequences tell us about the evolutionary history of humans, chimpanzees, and gorillas? Each of these questions can be approached, if not always entirely solved, by sequence comparison.

Sequence comparison is of tremendous interest to molecular biologists because it is becoming easy to determine DNA and protein sequences, whereas it remains difficult to determine molecular structure or function by experimental means. Thus, functional and structural clues from sequence analysis can save years of work at the laboratory bench.

An important early example illustrates the point. Some years ago, molecular biologists compared the protein sequence encoded by a cancer-causing gene (or oncogene) called v-sis to the available database of protein sequences. Remarkably, a computer search revealed that the sequence showed more than 90 percent identity to the sequence of a previously discovered gene encoding a growth-stimulating molecule, called platelet-derived growth factor (PDGF). Instantly, cancer researchers had a precise hypothesis about how this oncogene causes unregulated cell growth. Subsequent experiments confirmed the guess.

Nowadays, molecular biologists routinely carry out such computer searches against the current databases (which now contain both protein and DNA sequences) and are rewarded with striking and suggestive matches at a high frequency (perhaps 20 to 30 percent for a new gene). In some cases, the matches extend across the entire length of the protein.
In other cases, there is a strong match across a restricted domain; examples include particular sequences at the catalytic site of enzymes that hydrolyze adenosine triphosphate (ATP) or at the DNA-binding site of proteins that regulate the activity of genes. The frequency with which such strong matches are found is a tribute to the tremendously conservative nature of evolution: many of the basic building blocks of proteins and DNA have been reused in hundreds of different ways.

For the majority of new sequences, however, there is no striking match in the database. Although this may change with time (some molecular biologists believe that there are only a few thousand or a few tens of thousands of basic architectural motifs for proteins and that it is just a matter of time before we collect them all), computer searches will turn up only weak similarities. Before attempting to read biological significance into such weak similarities, one must evaluate their statistical significance. Not surprisingly, this is an area in which mathematics has much to offer molecular biology.

To motivate the study of the statistical significance of sequence similarities, we consider a single data set that provoked a great deal of excitement a few years ago when a team of researchers thought that they saw extraordinary clues about early evolution in the sequences of genes encoding certain ribonucleic acid (RNA) molecules.

The origin of the universe and the origin of life are topics of wide interest to both biologists and nonbiologists. One approach to studying the origin of the universe is to listen to faint echoes from the Big Bang. Similar approaches are used in studying the origin of life. Are there any molecular echoes remaining from the origin of life?

Each of the three key molecules in molecular biology (DNA, RNA, and protein) has been championed by some theorists as the earliest self-replicating molecule. Proteins have seemed attractive to some because of their ability to catalyze chemical reactions. DNA has seemed attractive to others because it is a stable store of information. Lately, however, RNA has taken the lead based on the well-known ability of RNA to encode information in the same manner as DNA and the recently discovered ability of RNA to act as nonprotein enzymes that are able to catalyze some chemical reactions. These properties suggest that some RNA sequence might have been able to achieve the key feat of self-replication, serving as both self-template and replication enzyme. Thus, life may have started out as an RNA world.
As indicated in Chapter 1, modern RNAs come in three varieties: messenger RNAs (mRNAs), ribosomal RNAs (rRNAs), and transfer RNAs (tRNAs). mRNAs are the messages copied from genes. rRNAs are components of the macromolecular structure, called the ribosome, used for translating RNA sequences into protein sequences. tRNAs are the "adapter molecules" that read the genetic code, with an anticodon loop recognizing a particular codon at one end and an attachment site for the amino acid corresponding to this codon at the other. rRNAs and tRNAs are clearly ancient inventions, necessary for the progression from life based only on RNA to organisms employing proteins for efficient catalysis of biochemical reactions.

In the early 1980s David Bloch and colleagues reported that they had found that these two types of RNA, tRNA and rRNA, had significant sequence similarities implying a common evolutionary ancestry (Bloch et al., 1983). In his paper, Bloch reported:

"Many tRNAs of E. coli and yeast contain stretches whose base sequences are similar to those found in their respective rRNAs. The matches are too frequent and extensive to be attributed to coincidence. They are distributed without discernible pattern along and among the RNAs and between the two species. They occur in loops as well as in stems, among both conserved and non-conserved regions. Their distributions suggest that they reflect common ancestral origins rather than common functions, and that they represent true homologies."

Such tantalizing arguments should be grounded in statistics, since we cannot test the origin of life by direct experiment (as we could test a proposed function for a protein based on sequence similarity). In this chapter, we develop some tools for evaluating statistical significance and apply them to Bloch's data.

The biological hypothesis that relationships between the RNAs are true homologies is necessarily imprecise. The evidence given is frequent and extensive matchings of stretches of sequences between the molecules, just the sort of matchings that the local algorithm presented below is designed to find. To "test" the biological hypothesis, we form a statistical hypothesis that the sequences are generated with independent and identically distributed letters. Then we test this hypothesis by computing scores using the local algorithm. Since letters in real sequences are not independent, it is possible to change the hypothesis to a Markov hypothesis, for example. This does not change the score distribution very much for the distributions obtained from real sequences. If the score distribution is consistent with that from comparison of random sequences, we would fail to reject the statistical hypothesis and thus have evidence against the biological hypothesis. If, on the other hand, the scores are frequently too large, showing strongly matching stretches or intervals of sequence, we have evidence for the biological hypothesis and against the statistical hypothesis.

Statistical questions are increasingly important in molecular biology. While statistical significance is not directly related to biological
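The testing logic described above (score a pair of sequences, then ask how often independent and identically distributed random sequences score at least as high) can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not the chapter's own procedure: it uses a Smith-Waterman-style local score with arbitrary scoring values (match +1, mismatch -1, gap -2), estimates letter frequencies from the two input sequences, and replaces the analytic large deviation theory with a plain Monte Carlo estimate. The function names `local_score`, `iid_sequence`, and `monte_carlo_p_value` are hypothetical.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman style, linear gap penalty).

    The scoring values are illustrative assumptions, not the chapter's.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            s = max(
                0,                                            # start a new local alignment
                prev[j - 1] + (match if x == y else mismatch),  # align x with y
                prev[j] + gap,                                # gap in b
                cur[j - 1] + gap,                             # gap in a
            )
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best


def iid_sequence(length, letters, weights, rng):
    """Draw a sequence of independent, identically distributed letters."""
    return "".join(rng.choices(letters, weights=weights, k=length))


def monte_carlo_p_value(a, b, trials=200, seed=0):
    """Estimate P(random score >= observed score) under the i.i.d. hypothesis."""
    rng = random.Random(seed)
    pooled = a + b
    letters = sorted(set(pooled))
    weights = [pooled.count(c) for c in letters]  # empirical letter frequencies
    observed = local_score(a, b)
    hits = 0
    for _ in range(trials):
        ra = iid_sequence(len(a), letters, weights, rng)
        rb = iid_sequence(len(b), letters, weights, rng)
        if local_score(ra, rb) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (trials + 1)
```

A small estimated p-value means the observed score is rarely reached by random sequences, which counts against the statistical hypothesis; a large one means the observed similarity is consistent with chance. Simulation of this kind resolves only moderately small probabilities (roughly down to 1/trials), which is one reason analytic results for the score distribution are valuable.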
