


Suggested Citation:"HIERARCHICAL APPROACHES." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.


FOLDING THE SHEETS: USING COMPUTATIONAL METHODS TO PREDICT THE STRUCTURE OF PROTEINS

a region of the genome that could code for a protein with some resemblance to the aspartyl proteases known from prokaryotic and eukaryotic sources. The problems were that the sequence was too short (~100 residues) when compared to the known aspartyl proteases (~240 amino acids) and that it contained only one of the two key aspartyl groups that form the active site. Moreover, the degree of sequence similarity between the viral sequence and the known aspartyl proteases was sufficiently low that researchers were unsure of exactly what went where. This was of more than passing interest as genetic studies of HIV demonstrated that mutation of the putative aspartyl protease blocked viral replication, and hence this was a promising target for pharmaceutical intervention. Pearl and Taylor (1987) studied the structures of several aspartyl proteases and constructed a template that encoded the essential features of this class of enzymes. A sequence template was found that could be used to scan the database of known sequences and efficiently sort aspartyl proteases from all other proteins. The HIV sequence fit half of the template and exhibited very economical tendencies with regard to the loop regions that joined the β-strands in the molecular framework. The prokaryotic and eukaryotic aspartyl proteases contain two domains that are structurally similar and can be related by a dyad axis. Moreover, one of the two aspartate residues in the active site is contributed by each domain. Pearl and Taylor reasoned that HIV, in an attempt to achieve additional genomic economy, elaborated a protease that had to pair or dimerize to form the active enzyme. A three-dimensional model of the HIV protease, constructed by following the template-directed homology to the aspartyl proteases of known structure, facilitated subsequent chemical and biochemical efforts. Their structural model proved to be extremely insightful when compared with the structural data provided by X-ray crystallographers several years later.

HIERARCHICAL APPROACHES

If protein folding is so hard, how do proteins manage to get it right? Does the folding process obey simple, logical rules that could guide computational efforts to reproduce it? Biochemists tend to describe protein structure in a hierarchical fashion, and many believe that the folding

process tends to proceed up the hierarchy (Linderstrom-Lang and Schellman, 1959). According to this view, the amino acid sequence or primary structure would first collapse from a disordered chain to form ordered elements of secondary structure, α-helices and β-strands. These secondary structure elements would coalesce to form a stable tertiary structure, consisting of the packing of the secondary structure elements against one another in the complete protein molecule. Finally, individual protein monomers with stable tertiary structures can sometimes aggregate to form multimers, with the interactions defining the quaternary structure, often having complex functional and regulatory roles. This suggests a computational strategy to relate protein sequence and structure. First, predict the location of α-helices and β-strands, and then pack secondary structure units together to form an approximate tertiary structure that can be refined to the folded protein structure (see Figure 9.7).

Is secondary structure prediction possible? Stated more precisely, is secondary structure determined predominantly by local interactions along the chain, or is it dependent on numerous nonlocal (tertiary) interactions? Various small peptides have been synthesized and spectroscopically characterized in an effort to understand the origins of the stability of an α-helix (Padmanabhan et al., 1990). Experiments show that several short sequences (< 20 residues in length) can form stable α-helices and that the individual amino acid conformational preferences correlate with those observed in globular proteins. Studying β-structure in this fashion, however, has proved more difficult.

Studies of whole proteins shed further light on the degree to which secondary structure arises from local interactions. Experiments have suggested that one can identify a "molten globule" state that can be stabilized under acidic conditions (Kuwajima, 1989). This intermediate state appears to contain native-like secondary structure as inferred from circular dichroism studies, but the tertiary structure appears not to be present. In short, secondary structure appears to be stable in the absence of tertiary or nonlocal interactions. On the other hand, the local information producing secondary structure cannot be too local, for it is known that identical pentapeptide sequences chosen from distinct proteins can adopt entirely different structures (Kabsch and Sander, 1984). For example, the same pentapeptide may form part of an α-helix in one protein and a β-strand in another. Thus, the necessary information must extend beyond

five residues. Presumably, the solution is that the conformation of some sequences is specified in large part by local interactions, while others are stable only in the context of the neighboring sequences. The most difficult challenge for secondary structure prediction methods is to determine the structure of these context-dependent regions.

Figure 9.7 A hierarchical condensation model for protein folding. Sequence determines secondary structure, and secondary structure elements assemble to form an approximate tertiary structure. Energy refinement yields a detailed three-dimensional structure.

Two general strategies have been applied to the secondary structure prediction problem: statistical approaches and rule-based approaches. The statistical approaches assert that proteins of known tertiary structure provide a useful data set describing secondary structure preferences of individual amino acids. Two presumptions are made: tertiary structure will exert no consistent effects on secondary structure, and the existing database is of sufficient size to provide important information. The first assertion recalls our discussion of the local versus global determinants of protein organization. We believe that local interactions play a significant role in protein folding, but, in the literature,

this remains an open question. The adequacy of our current database can be approached in a more straightforward way. The conformations of over 40,000 amino acids in approximately 200 distinct protein structures are known. These distribute between α-structure (~30 percent), β-structure (~30 percent), and turns or coils (~40 percent). The likelihood that alanine will appear in an α-helix, Lα(Ala), can be calculated easily from this data set. Even the 400 (20 × 20) amino acid doublet propensities, which reflect the conditional probability that an alanine will occur in an α-helix contingent on the amino acid type of the neighboring residue, can be usefully approximated. However, the current data set is not adequate to provide information about the 8,000 (20 × 20 × 20) triplet amino acid preferences: 40,000 residues spread over 8,000 triplets leave only about five observations per triplet on average. Moreover, it is not clear that the triplet interaction preferences will be the sum of three doublet interactions or that complete triplet preferences adequately define the conformational preferences of amino acids. Additional protein crystal structures and studies on model peptide systems will help in overcoming these limitations.

In a landmark paper on protein secondary structure prediction, Chou and Fasman (1974), working with a much smaller protein database, calculated the secondary structure propensities of each amino acid, for example, the helical propensity Pα. From this information, residues were classified as helix formers (Pα ≥ 1.05), intermediate (0.70 ≤ Pα < 1.05), and helix breakers (Pα < 0.70). Local clusters of helix formers defined helical nucleation sites. These nuclei were extended toward the N- and C-termini following rules based on the aggregate Pα values. Although no computer algorithm accompanied the initial work, the method was sufficiently simple that it could be applied by hand. The accuracy of this algorithm (that is, the

percentage of correct predictions on a residue-by-residue basis) approached 60 percent. Subsequent work has developed more sophisticated variations on the theme. In 1978, Robson and co-workers introduced an information-theoretic formalism to supplement conformational preferences of the isolated residues with preferences based on pairwise interactions (Garnier et al., 1978). The method was easy to implement in a computer algorithm and achieved ~64 percent accuracy.

More recently, various authors (Qian and Sejnowski, 1988; Holley and Karplus, 1989) have employed neural networks, which belong to a general class of machine learning algorithms that can efficiently "learn" an optimal translation of one data string (for example, a protein sequence) into another (for example, the sequential secondary structure assignments). The network is a group of input nodes connected to a group of output nodes with an optional hidden layer or layers of nodes (see Figure 9.8). A matrix of weights is developed to map the input information into the nodes on a path to the output layer. Like neurons in the nervous system, a cooperative nonlinear "firing" potential is used to decide if adequate information has accumulated to switch on an output node (see Figure 9.9). For secondary structure prediction, this "all or none" output node predicts an α-helix when the accumulated helical propensity of the residue of interest and its neighbors crosses the threshold. The weights for the connections that relate input nodes to output nodes are learned by example. A window specifies the number of neighboring residues that can contribute to the conformational state of the residue of interest. Case after case of input amino acid sequence and output secondary structure is presented to the network.
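The windowing step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `windowed_examples`, the spacer symbol `-` used beyond the chain ends, and the toy sequence and structure strings are all assumptions made for the example. The default half-width of six residues follows the caption of Figure 9.8.

```python
from typing import List, Tuple

PAD = "-"  # hypothetical spacer symbol for window positions beyond the chain ends


def windowed_examples(
    sequence: str, structure: str, half_width: int = 6
) -> List[Tuple[str, str]]:
    """Pair each residue's sequence window with its secondary structure label.

    sequence  -- one-letter amino acid codes
    structure -- one label per residue (for example H = helix, E = strand, C = coil)
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    padded = PAD * half_width + sequence + PAD * half_width
    examples = []
    for i, label in enumerate(structure):
        # In the padded string the window centered on residue i starts at index i.
        window = padded[i : i + 2 * half_width + 1]
        examples.append((window, label))
    return examples


if __name__ == "__main__":
    # Made-up sequence and labels, for illustration only.
    for window, label in windowed_examples("MKVLAAGIVG", "CCHHHHHHCC", half_width=2):
        print(window, label)
```

Each (window, label) pair is one "case" of the kind presented to the network during training; how the weights are then fitted is outside the scope of this sketch.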
A least squares algorithm defines an optimal set of weights for the encoding of the data set using a back propagation strategy. A more complete description of neural networks can be found in a chapter of the book Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1 (Rumelhart et al., 1986). Neural networks easily achieve an accuracy of 64 percent, a figure comparable to that for other methods. It is useful to explore the connection weights derived by the network that relate amino acids to their secondary structure preferences. Figure 9.10 is a Hinton diagram of these weights (the magnitude of the weight is proportional to the area of the square; positive weights are in white, and negative weights are in black). Alanine

Figure 9.8 The secondary structure neural network. The input pattern is a sequence of amino acids centered around a central amino acid. Each amino acid is mapped to one input group, which is a collection of 21 units. Each amino acid causes an input of 1.0 to one of the units of its group and an input of 0.0 to the other units. We typically use the six amino acid residues on each side of the central amino acid, for a total of 13 × 21 = 273 input units. There are three output units: helix, strand, and coil. Each input unit is connected to each output unit, and the output unit with the greatest output is taken to be the secondary structure prediction for the central amino acid. Additional input units can be accommodated. Reprinted, by permission, from Kneller et al. (1990). Copyright 1990 by Academic Press Limited.
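The encoding and prediction rule in the caption of Figure 9.8 can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the published network: the caption does not say what the 21st unit of each group represents, so a spacer symbol `-` is assumed here; the logistic output function is also an assumption, chosen because unit outputs are described as ranging from 0.0 to 1.0. The names `encode_window` and `predict` are hypothetical, and the weights would normally come from training.

```python
import math
from typing import List, Sequence

# 20 standard amino acids plus one assumed spacer unit, giving 21 units per group.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
CLASSES = ("helix", "strand", "coil")


def encode_window(window: str) -> List[float]:
    """One-hot encode a window: 1.0 for the unit matching each residue, 0.0 elsewhere."""
    units: List[float] = []
    for residue in window:
        group = [0.0] * len(ALPHABET)
        group[ALPHABET.index(residue)] = 1.0
        units.extend(group)
    return units


def logistic(activation: float) -> float:
    """Assumed output function mapping an activation into the range 0.0 to 1.0."""
    return 1.0 / (1.0 + math.exp(-activation))


def predict(
    window: str, weights: Sequence[Sequence[float]], biases: Sequence[float]
) -> str:
    """Return the class whose output unit has the greatest output.

    weights holds one row per output unit, each with one entry per input unit.
    """
    inputs = encode_window(window)
    outputs = []
    for row, bias in zip(weights, biases):
        activation = sum(w * o for w, o in zip(row, inputs)) + bias
        outputs.append(logistic(activation))
    return CLASSES[outputs.index(max(outputs))]
```

With a 13-residue window, `encode_window` returns 13 × 21 = 273 values, exactly 13 of which are 1.0, matching the input layer described in the caption.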

and leucine are seen to strongly favor α-helices, while proline and glycine disrupt the α-helix structure and prefer turn conformations. This is consistent with the structural information derived from previous studies and reinforces the sensibility of this approach.

Figure 9.9 The basic neural network. Circles represent the units, and squares the weights between the units. The larger the square, the greater its absolute value. Solid squares represent negative values, and open squares represent positive values. The bars represent the outputs of the units, with values ranging from 0.0 to 1.0. The symbols are defined as follows: $o_k$ is the output of unit k; $w_{ik}$ is the weight to unit i from unit k; and $b_i$ is the bias of unit i. The activation of unit i is $a_i = \sum_k w_{ik} o_k + b_i$, and its output is [formula missing from the source text]. Reprinted, by permission, from Kneller et al. (1990). Copyright 1990 by Academic Press Limited.

Why do neural networks not perform more accurately? We have begun to address this question. If a network is trained on proteins restricted to one structural class, especially all-helical proteins, the accuracy improves significantly (Kneller et al., 1990). For example, nodes can be added that capture the alternating distributions of hydrophobic and

Figure 9.10 Hinton diagram of the weight matrix. The weights connecting each input unit to each output unit are shown. The size of each box correlates with the magnitude of the weight. Solid boxes have negative values, and open boxes have positive values. The three output units are shown as separate blocks. The 273 input units connected to each output unit are divided into 13 groups of 21 units each. Along the abscissa, the groups are located at positions in the range −6, −5, . . ., 0, . . ., 5, 6 relative to the central residue. Along the ordinate, the 21 units are labeled by residue type. Reprinted, by permission, from Kneller et al. (1990). Copyright 1990 by Academic Press Limited.

hydrophilic residues in a phase-insensitive way. Together, these methods improve secondary structure prediction accuracy for all-helical proteins to 79 percent and for all-β proteins to at least 70 percent. α/β proteins remain problematic. Presumably, this relates to the fact that the fundamental structural unit in α/β proteins involves an α-helix and the preceding and/or following β-strands. This super-secondary structure involves at least 30 residues, far more than are included in the windows currently used. By exploiting a family of aligned protein sequences, improvements in secondary structure prediction are anticipated. Other aspects of this problem continue to make this a fertile area for study.

The second general approach to secondary structure prediction is rule-based methods, which try to capture biochemical regularities in protein structure. In an important early paper, Lim (1974) described "rules" that specify residue combinations along the chain that stabilize or destabilize α-helices and β-sheets. The rules attempted to capture the notion that secondary structure elements need to be compatible with the overall tertiary structure consisting of a hydrophobic core and hydrophilic exterior. Among other constraints, isolated β-strands are stable only in the context of larger β-sheets, and the edge strands in these β-sheets have significantly different properties than the interior strands. Technical difficulties in the formulation of the rules hampered efforts to implement these ideas in a computer program, but this does not detract from the insightfulness of the approach.

Our group has followed up on the rule-based approach pioneered by Lim. We have constructed PLANS, a Pattern Language for Amino and Nucleic Acids Sequences, and implemented this language in LISP and C (Cohen et al., 1983, 1986). Accurate patterns can be written to identify various structural features. For example, rule-based patterns can be used to identify "turns" in protein chains. Turns contain hydrophilic stretches without periodic structure; they tend to be evenly distributed throughout the protein chain. Extremely hydrophilic clusters of amino acids are nearly always turns. Weaker turns can be identified as relatively hydrophilic clusters of residues appropriately separated from the more obvious turns. The spacing between turns depends on the secondary structure content of the protein. For example, an α-helical segment bounded by turns contains approximately twice as many residues as a similar β-strand segment. The class of a protein (α/α, α/β, β/β) offers a simple way of specifying the expected link length

between turns. Using a hierarchical set of turn patterns embodying these ideas, one can identify turns with ~90 percent accuracy.

Other work on rule-based methods has focused on finding the exact location of α-helices in α/α proteins (Presnell et al., 1992). Even though helices are heterogeneous objects, patterns have been developed to recognize their beginnings (N-caps or N-termini), cores, and ends (C-caps or C-termini). While the core patterns are very accurate (> 90 percent of helices can be located), the termini, especially the C-termini, remain poorly defined. Because of these deficiencies, amalgamation of the three groups of patterns produces a secondary structure prediction that is only 71 to 78 percent correct. Developing sequence patterns to represent protein substructures is labor intensive. Recently, there have been attempts to automate this process by means of heuristic, iterative algorithms for pattern construction (King and Sternberg, 1990).

The hierarchical approach to protein structure prediction is premised on the notion that secondary structure will be a useful computational intermediate for the prediction of overall tertiary structure. How exactly can one use secondary structure information to bootstrap the process? Conceptually, the most straightforward approach to this problem would be to construct all possible three-dimensional arrangements of the secondary structure segments and then eliminate those with structural flaws (high potential energy). Combinatorial approaches can be used to search over the many possible arrangements and evaluate their plausibility (Cohen et al., 1979; Ptitsyn and Rashin, 1975).
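The generate-and-eliminate idea can be illustrated with a deliberately simplified toy. This is not the method of the cited papers: here the segments are merely assigned to slots along a line, and the only "structural flaw" tested is whether the loop joining two consecutive segments is long enough to span the distance between their slots. The function name `feasible_arrangements` and the default values for `slot_spacing` and `reach_per_residue` (in ångströms) are assumptions chosen for illustration, standing in for a real geometric model and energy evaluation.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def feasible_arrangements(
    loop_lengths: Sequence[int],
    slot_spacing: float = 10.0,      # assumed distance between neighboring slots
    reach_per_residue: float = 3.0,  # assumed span contributed by each loop residue
) -> List[Tuple[int, ...]]:
    """Enumerate slot assignments and keep those whose loops can connect.

    loop_lengths[i] is the number of residues in the loop joining segment i
    to segment i + 1, so there are len(loop_lengths) + 1 segments.
    Each returned tuple gives, for every segment in chain order, its slot index.
    """
    n_segments = len(loop_lengths) + 1
    kept = []
    for slots in permutations(range(n_segments)):
        # Eliminate arrangements in which any loop is too short to bridge
        # the gap between the two segments it connects.
        connectable = all(
            abs(slots[i + 1] - slots[i]) * slot_spacing
            <= loop_lengths[i] * reach_per_residue
            for i in range(n_segments - 1)
        )
        if connectable:
            kept.append(slots)
    return kept
```

Even this toy shows the characteristic behavior of the combinatorial approach: the number of candidates grows factorially with the number of segments, while simple constraints such as short loops remove most of them before any costly evaluation is needed.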
The approach is particularly well developed for the case of α-helices, owing to the fact that the periodicity of α-helices tends to favor distinct packing geometries for pairs of α-helices (Crick, 1953; Chothia et al., 1977; Richmond and Richards, 1978; Murzin and Finkelstein, 1988). Hydrophobic residues tend to dominate the interfacial region between paired α-helices. Moreover, there is a correlation between the extent of the hydrophobic interface on the α-helices and the preferred packing geometry. In the next section, we describe an application of this approach to the oxygen-bearing protein myoglobin. Similar work has been done on β/β proteins (Cohen et al., 1980, 1982; Finkelstein and Reva, 1991). For α/β proteins, the combinatorial complexity is much greater than for β/β and α/α proteins, but it is still possible to use
