National Academies Press: OpenBook
« Previous: BASIC INSIGHTS ABOUT PROTEIN STRUCTURE
Suggested Citation:"THREADING METHODS." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 248
Suggested Citation:"THREADING METHODS." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 249
Suggested Citation:"THREADING METHODS." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 250
Suggested Citation:"THREADING METHODS." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 251
Suggested Citation:"THREADING METHODS." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 252
Suggested Citation:"THREADING METHODS." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 253

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

FOLDING THE SHEETS: USING COMPUTATIONAL METHODS TO PREDICT THE STRUCTURE OF PROTEINS 248 globular proteins partition most hydrophobic groups away from the proteinsolvent interface; and similar sequences yield similar structures, but quite distinct sequences can produce remarkably similar structures. Molecular biologists have suggested that nature can borrow pieces of several structures to construct new structures or simply extract a whole structure and co-opt it for a new use (Dorit et al., 1990). This chapter focuses on two of the major computational approaches that biologists have taken to the protein folding problem: (1) Threading methods, which rely on aligning a "new" protein with sufficiently similar proteins for which structures are already known. Of course, these methods require that sufficiently similar proteins exist in the available databases. (2) Hierarchical condensation methods, which involve identification of the building blocks of protein structure and an understanding of the rules that govern the assembly of these building blocks into higher-order structure. Unlike the template methods, these methods can be applied even to proteins that do not closely resemble previously studied ones. The rest of this chapter outlines current work, underscores the limitations of each strategy, and highlights the computational challenges we face. THREADING METHODS If two proteins have evolved from a common ancestor, their structures tend to remain similar (because structure underlies function) even though their sequences may diverge considerably. The converse of this statement offers a recipe for protein structure prediction: if a "new" protein sequence is recognizably similar to the sequence of a protein of known structure, then it should be possible to approximate the "new" structure by threading the sequence of the "new" protein through the ''old" structure guided by the sequence alignment. The key step is to accurately align the new sequence with the known structure. While this is straightforward when the number of exact residue-for-residue matches between the two sequences exceeds 70 percent, as the degree of identity declines, this task becomes much harder (Smith and Smith, 1990). Without an accurate alignment, model-building efforts are doomed to failure. Proteins are the translated products of information encoded in DNA stored in the chromosomes of the nucleus. As has been discussed in

FOLDING THE SHEETS: USING COMPUTATIONAL METHODS TO PREDICT THE STRUCTURE OF PROTEINS 249 previous chapters, a group of three consecutive nucleotides specifies, by the genetic code, an amino acid for addition to a growing protein chain. From time to time, spontaneous mutations occur in the DNA sequence; these mutations produce changes in the amino acid sequence. While some mutations are deleterious and are eliminated by evolutionary selection, other mutations are essentially neutral in their effect and thus suffer no negative selection. As a result, the exact blueprint for a protein will drift over time as mutations accumulate. Genetic recombination can introduce additional variation to a gene. New sections of sequence can be inserted or existing sections deleted. Distinct genes or gene parts can be duplicated or concatenated. To trace the putative evolutionary relationship of two proteins, we must be able to assess the likelihood that a series of mutations, insertions, and deletions could relate the two sequences. The likelihood that any residue will be mutated into another residue could depend on the mechanisms for somatic mutation as well as on the implications of the substitution for the stability or catalytic efficiency of a protein—that is, it could depend both on mutation and selection. We need to determine the likelihood of all possible interchanges between the 20 natural amino acids, that is, create a 20 × 20 transition matrix for a Markov process (although sequence evolution is not strictly Markovian). A number of approaches can be taken: 1. Focusing on mutation, we have to know the frequency of transitions (A ↔ G, C ↔ T) and transversions (AG ↔ CT) as well as the number of base changes required to change the triplet code for one amino acid into another. A minimum base change matrix has been constructed and used to align protein sequences. This mechanistic approach is most successful with very closely related protein sequences. Over long periods of time, however, there is ample opportunity for most mutations to occur, and so the observed spectrum of changes tends to depend more on the selective advantage or disadvantage of the amino acid substitutions than on the probabilistic nature of the mutation process itself. 2. An alternative approach to constructing the 20 × 20 transition matrix depends on the chemical similarity of the residues. Smith and Smith (1990) have employed this strategy to categorize and align all of the known sequences. 3. Finally, an unabashedly empirical approach can be taken. Dayhoff et al. (1972) manually aligned protein sequences from a functional family (for

FOLDING THE SHEETS: USING COMPUTATIONAL METHODS TO PREDICT THE STRUCTURE OF PROTEINS 250 example, cytochromes) and then tabulated the observed interchange frequency between all pairs of amino acids. Clearly, the success of the manual alignment and the evolutionary distance between family members have an impact on the interchange matrix. (In particular, Dayhoffs matrix corresponds to the transition matrix for the results of the Markov process over, for example, 250 million years. From this, one can infer the transition matrix for other periods.) In spite of its shortcoming, the Dayhoff matrix has proved extremely useful. It is simplistic to expect that any one matrix can be appropriate for all positions along the chain. In fact, mutational tolerance varies widely at different positions along a protein chain (Overington et al., 1990). Regions of the protein interior and secondary structure are less tolerant of mutation. Conformational restrictions dictate the residue type in tight turns. Some cysteine side chains are covalently cross-linked to another cysteine along the chain in a disulfide bridge (and thus are under strict selective pressure to remain constant), while others exist as free sulfhydryls (and are presumably less constrained). Moreover, two cysteines in a disulfide bridge have correlated fates: if the bridge is broken by the mutation of one residue (and the protein is not sufficiently compromised that evolution selects against it), the remaining cysteine is under much less selective pressure. Proteins are highly cooperative structures, and so coupled behavior of sequentially distant residues is common. For particular proteins with well-understood structures, specific sequence "profiles" have been developed that incorporate the coupled behavior of specific amino acids along the chain. This approach performs extremely well (Bowie et al., 1991), but no general scheme exists for discerning the cooperative aspects of the sequence of a folded protein. The existence of insertions and deletions of sequences during evolution creates a second problem in the development of an alignment metric. What is the correct penalty for creating a gap in one sequence, and how should the penalty grow as the gap becomes longer? With a sufficiently weak penalty for gaps, any two sequences can be aligned even if they are unrelated. To make matters worse, experiments have shown that the energetic penalty for inserting a sequence along the chain varies with the location of the insertion (Sondek and Shortle, 1990). Turn regions are accommodating, but the middles of secondary structure elements are more problematic. Smith has asserted that the gap penalty should be equal to

FOLDING THE SHEETS: USING COMPUTATIONAL METHODS TO PREDICT THE STRUCTURE OF PROTEINS 251 G1 + n G2, where G1 is a penalty for creating a gap, G2 is the penalty for extending the gap by one additional residue and n is the length of the insertion. No obvious formalism exists for determining G1 and G2, so they are parameters to be chosen to fit the problem at hand. Sternberg has further complicated this issue by suggesting a strategy to bias the definition of G1 based on the position along a sequence of known structure that relies on the presence or absence of secondary structure (Barton and Sternberg, 1987). From a structural viewpoint, this is extremely sensible. From the computational point of view, the additional difficulties are considerable. More recent work by Gonnet et al. (1992) based on actual insertions and deletions suggests that gap penalties should take the form G1n −1.7. Once the mutation and gap metric have been chosen, one wishes to find the best alignment of two or more sequences. For two sequences, dynamic programming algorithms offer a rapid method for creating the optimal pairwise alignment (Smith and Waterman, 1981). For the alignment of multiple sequences, the problem becomes computationally intractable (alignment of k sequences of length N takes time O(Nk) ), and so iterative pairwise strategies are used. The significance of a final alignment is often evaluated by comparison with a family of "random" sequences. Unfortunately, protein sequences are far from random, and so this approach usually overestimates the significance of an alignment. New methods are being developed to improve the scoring metrics and to produce "random" sequences with the Markov dependences commonly observed in proteins (Karlin and Altschul, 1990). When an accurate alignment is available for a new protein sequence and a protein of known structure, it is possible to construct a useful model of the "new" three-dimensional structure based on the "old" structure. For closely related structures, the residue-by-residue positional error is small. However, the error grows as the divergence between the sequences of the model and known structures increases (Chothia and Lesk, 1986). Because "reasoning by homology'' plays such an important role in the prediction of protein structure, various recipes have been developed for model building (Greer, 1990; Chothia et al., 1989; Jones and Thirup, 1986; Bruccoleri and Karplus, 1987; Ponder and Richards, 1987; Wilson et al., 1993). One method is based on using secondary structure segments to create a molecular scaffold, thereby decomposing the problem. Onto this scaffold,

FOLDING THE SHEETS: USING COMPUTATIONAL METHODS TO PREDICT THE STRUCTURE OF PROTEINS 252 particular loops can be added, and the side chain conformation of the individual residue can be specified. The loops can be selected from a "loop thesaurus" (although this suffers from the drawback that it cannot predict novel loops) or by computational search (although this becomes prohibitive for loops of more than 10 residues). The conformation of the side chains cannot be determined by direct conformational search because too many local minima exist, but the problem can be made more tractable by focusing the computation on "rotamer libraries" containing the most commonly found conformations and by performing comparisons with the conformations found in collections of related sequences (Wendoloski and Salemme, 1992). While model construction is straightforward, model validation is much more difficult. Is a homology-built model correct? If not, where is it wrong, and how can it be improved? The problem turns out to be surprisingly hard. Indeed, Novotny et al. (1988) showed that it is possible to construct "incorrect" models that nonetheless satisfy a variety of energetic constraints. They took hemerythrin and an immunoglobulin domain, two proteins with entirely different secondary and tertiary structure. The hemerythrin sequence was forced into the immunoglobulin structure and vice versa. The difficulty in distinguishing the correct and misfolded structures based on energy calculations was both remarkable and profound. This suggests that the errors inherent in existing potential functions may dominate the energy differences between correctly and incorrectly folded proteins. In practice, homology modeling has had varied success (Pearl and Taylor, 1987; Wierenga and Hol, 1993), ranging from important triumphs to ambiguous or erroneous results. Clearly, better methods are needed to evaluate the quality of model-built structures. One such test stems from the observation that a hallmark of folded proteins is the close packing of atoms (Richards, 1977). Using various computational methods, various workers (Richards, 1977; Frauenfelder et al., 1987) have shown that the packing density within a protein turns out to be comparable to that seen in the crystalline arrangement of small organic molecules, with cavities being rare (and those that are observed can be functionally relevant). Gregoret and Cohen (1990) have exploited tests for packing defects as a strategy for sorting correct structures from their incorrectly modeled counterparts. While this approach provides a useful filter for rejecting some structures, well-packed but incorrect structures are found.

FOLDING THE SHEETS: USING COMPUTATIONAL METHODS TO PREDICT THE STRUCTURE OF PROTEINS 253 A second approach to distinguishing correct from incorrect structures is based on the concept of solvent- accessible surface area, a notion that has found broad utility in the analysis of the energetics of protein folding and molecular recognition (Lee and Richards, 1971). The solvent-accessible surface area is defined as the area of the locus traced by the center of a sphere of radius 1.4 angstroms (Å) (the radius of a water molecule) when it is rolled along the surface of the protein. The solvent-accessible contact area is defined as the region of the molecule that contacts the probe sphere. These two measures tend to be roughly proportional and can be calculated using numerical integration (Lee and Richards, 1971) or analytical methods from differential geometry (Richmond, 1984). Solvent accessibility calculations turn out to be quite relevant to the energetics of protein folding (Chothia, 1974). While side chains on the exterior of a protein are in contact with water, the interior side chain environment resembles the organic solvent octanol (an eight-carbon alcohol). It is this difference that makes it favorable for hydrophilic side chains to lie on the surface and hydrophobic residues to cluster in the interior. The free energy required to transfer particular amino acid side chains from octanol to water has been measured (Lesser and Rose, 1990). The experimental values correlate closely with the accessible surface area of the side chain (1−Å2 surface area = 47 calories per mole), and the data provide a basis for distinguishing correct from incorrect protein structures. Indeed, Novotny et al. (1988) found that it was possible to sort the correct hemerythrin and immunoglobulin structures from their incorrect models by taking into account the accessible surface area. For this and other reasons, fourth-generation potential functions are being developed that include a solvent accessibility term as a method for implicitly incorporating proteinsolvent interactions. Another issue in evaluating proposed protein structures is the treatment of electrostatic interactions (Gilson and Honig, 1986). Proteins contain many charged side chains, and the peptide backbone forms a dipole. The dielectric behavior of the protein interior is significantly different from the behavior at or near the protein-solvent interface. This creates difficulties in computational efforts to evaluate properties of proteins. For example, a simple coulombic formulation of electrostatic interactions does a poor job of replicating experimental data on the intrinsic affinity of an ionizable group for a proton (pKa). Since the groups in the active site of a protein often have unusual pKa's, advances

Next: PREDICTING HIV PROTEASE STRUCTURE:AN EXCURSION »
Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology Get This Book
×
Buy Paperback | $80.00
MyNAP members save 10% online.
Login or Register to save!
Download Free PDF

As researchers have pursued biology's secrets to the molecular level, mathematical and computer sciences have played an increasingly important role—in genome mapping, population genetics, and even the controversial search for "Eve," hypothetical mother of the human race.

In this first-ever survey of the partnership between the two fields, leading experts look at how mathematical research and methods have made possible important discoveries in biology.

The volume explores how differential geometry, topology, and differential mechanics have allowed researchers to "wind" and "unwind" DNA's double helix to understand the phenomenon of supercoiling. It explains how mathematical tools are revealing the workings of enzymes and proteins. And it describes how mathematicians are detecting echoes from the origin of life by applying stochastic and statistical theory to the study of DNA sequences.

This informative and motivational book will be of interest to researchers, research administrators, and educators and students in mathematics, computer sciences, and biology.

  1. ×

    Welcome to OpenBook!

    You're looking at OpenBook, NAP.edu's online reading room since 1999. Based on feedback from you, our users, we've made some improvements that make it easier than ever to read thousands of publications on our website.

    Do you want to take a quick tour of the OpenBook's features?

    No Thanks Take a Tour »
  2. ×

    Show this book's table of contents, where you can jump to any chapter by name.

    « Back Next »
  3. ×

    ...or use these buttons to go back to the previous chapter or skip to the next one.

    « Back Next »
  4. ×

    Jump up to the previous page or down to the next one. Also, you can type in a page number and press Enter to go directly to that page in the book.

    « Back Next »
  5. ×

    To search the entire text of this book, type in your search term here and press Enter.

    « Back Next »
  6. ×

    Share a link to this book page on your preferred social network or via email.

    « Back Next »
  7. ×

    View our suggested citation for this chapter.

    « Back Next »
  8. ×

    Ready to take your reading offline? Click here to buy this book in print or download it as a free PDF, if available.

    « Back Next »
Stay Connected!