Page 236

Chapter 9—
Folding the Sheets:
Using Computational Methods to Predict the Structure of Proteins

Fred E. Cohen
University of California, San Francisco

In principle, the laws of physics completely determine how the linear sequence of amino acids in a protein will fold into a complex three-dimensional structure with useful biochemical properties. In practice, however, predicting structure from sequence remains a major unsolved problem. In this chapter the author outlines current approaches to structure prediction. The most fruitful approaches are not based on physical simulations of the folding process, but rather exploit the conservative nature of evolution. Using statistical methods, pattern matching techniques, and combinatorial problem solving, protein structure prediction is becoming steadily more tractable.

At the crossroads of physics, chemistry, biology, and computational mathematics lies the protein folding problem: How does a linear polymer of amino acids assemble into a three-dimensional object capable of executing a precise chemical function? Implicit within this question are both kinetic and thermodynamic issues: Given a particular protein sequence, what is the conformation of the folded state? What path does the unfolded chain follow to reach this folded state? This chapter outlines the history of the protein folding problem, current research efforts, the obstacles to accurate prediction of protein structure, and the areas for future inquiries.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.



Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.


Page 237

A Primer on Protein Structure

Proteins are constructed by the head-to-tail joining of amino acids, chosen from a 20-letter alphabet. The 20 natural amino acids have a common backbone, but a variable side chain or R-group. The R-groups may be large or small, charged or neutral, hydrophobic or hydrophilic, and conformationally restricted or flexible (see Figure 9.1). It is the physical properties of these R-groups that determine the diverse structures into which a given amino acid chain will fold.

Broadly speaking, proteins can adopt fibrous or globular shapes. Repetitive amino acid sequences adopt elongated periodic fibrous structures, with common examples including elastin (skin), collagen (cartilage), keratin (hair), and β-fibroin (silk). This chapter focuses on globular proteins.

Figure 9.1 Twenty amino acids: R-groups are shown clustered by functional types: aliphatic hydrophobic, aromatic hydrophobic, hydrophilic, negatively charged, positively charged, and conformationally special.

Page 238

The enzyme ribonuclease, which catalyzes the breakdown of ribonucleic acid (RNA), provides a useful example. The sequence contains 124 amino acids. Under appropriate conditions, the amino acid chain is covalently cross-linked in four locations through disulfide bridges between cysteines in the protein chain. (The amino acid cysteine has a reactive sulfur atom that forms such bridges, which provide the only covalent bonds joining nonneighboring amino acids in the chain.)

In a classic series of experiments, Anfinsen et al. (1961) demonstrated that the amino acid sequence of ribonuclease contained enough information to code for the folded structure. Specifically, they showed that ribonuclease lost its enzymatic activity in the presence of a chemical denaturant (which disrupted the protein's structure) but spontaneously regained its activity when the denaturant was removed. Even when the disulfide pairings were scrambled after denaturation, renaturation could occur. Thus, without any outside assistance, the protein could refold. Independent of the starting conformation, the amino acid sequence contains sufficient information to direct the chain to the correct folded structure. Similar experiments have been repeated with many other proteins.

This work suggests that proteins follow an energy gradient from the denatured state to the native state. The free energy difference between these two states favors the folded state, and the height of the activation barrier along the folding pathway governs the rate of chain assembly (see Figure 9.2).

Recently, molecular biologists have discovered that some proteins can assist the folding process. These proteins, dubbed foldases, include the chaperonins (Kumamoto, 1991) that prevent proteins from assembling inside an undesirable cellular compartment, prolyl isomerases that increase the rate of the cis-trans isomerization of the amino acid proline (Fischer and Schmid, 1991), and protein disulfide isomerases (Freedman, 1989), which shuffle disulfide bridges. While it is conceivable that these foldases might take a protein to a kinetically trapped final state different from the state of lowest free energy, this seems unlikely. Instead, I imagine that these foldases simply lower the activation barrier to folding into the lowest energy state. In the absence of an appropriate foldase, the height of the activation barrier might be such that in some cases, protein folding will not occur on a biologically sensible time scale.
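The two thermodynamic quantities discussed above can be made concrete with a small numerical sketch. It assumes a simple two-state model (denatured and native only), a Boltzmann relation for the equilibrium populations, and an Arrhenius-style dependence of the folding rate on barrier height. The function names and the free energy values are hypothetical illustrations, not measurements from this chapter.

```python
import math

# Gas constant in kJ/(mol*K).
R = 8.314462618e-3


def fraction_folded(delta_g, temperature=298.15):
    """Equilibrium fraction of native protein in a two-state model.

    delta_g is G(native) - G(denatured) in kJ/mol, so a negative value
    means the native state is favored. (Two-state behavior is an assumption.)
    """
    return 1.0 / (1.0 + math.exp(delta_g / (R * temperature)))


def rate_speedup(barrier, lowered_barrier, temperature=298.15):
    """Factor by which folding accelerates when the activation barrier drops.

    Assumes an Arrhenius-like rate, k proportional to exp(-barrier / RT),
    with the same prefactor before and after (an assumption).
    """
    return math.exp((barrier - lowered_barrier) / (R * temperature))


if __name__ == "__main__":
    # Hypothetical numbers chosen only to show orders of magnitude.
    print(fraction_folded(-40.0))    # strongly favors the native state
    print(rate_speedup(80.0, 70.0))  # effect of a 10 kJ/mol lower barrier
```

In this picture a foldase changes only the second quantity: lowering the barrier speeds folding without altering which state is most stable.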

Page 239

Figure 9.2 Thermodynamics of protein folding: the folding chain must surmount a free energy barrier to move from the denatured to the native state. The native state is more stable than the denatured state by free energy ΔG.

One reason for the tremendous interest in the protein folding problem is that it has become simple to determine the amino acid sequence of large numbers of proteins while it remains difficult to determine the structure of even a single protein. The first protein sequences were laboriously determined by classical biochemical methods (Konigsberg and Steinman, 1977). The proteins in question were isolated, purified to homogeneity, and enzymatically digested into smaller fragments. Amino acids in each such fragment were chemically cleaved, one residue at a time, from one end and from each successive amino acid. Automated methodologies and improved chemistry accelerated this process, but protein sequencing remained a tedious task until molecular biology supplied a different approach (Maxam and Gilbert, 1980).

By determining the deoxyribonucleic acid (DNA) sequence of the gene encoding the protein (using methods that were quite rapid), one could infer the amino acid sequence of the protein by simply translating the DNA codons according to the genetic code. The approach is much faster and more reliable than direct protein sequencing. With the advent of this technology has come a flood consisting of tens of thousands of protein sequences.
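The translation step described above is mechanical enough to sketch in a few lines. The following is a minimal illustration, assuming the standard genetic code, a coding-strand DNA sequence, and a reading frame that starts at the first base. The names `CODON_TABLE` and `translate` are illustrative choices.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
RESIDUES = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    first + second + third: RESIDUES[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(dna):
    """Translate a coding-strand DNA sequence into one-letter amino acids.

    Reads codons from position 0 and stops at the first stop codon
    or when fewer than three bases remain.
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[start:start + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTGA"))  # prints "MA"
```

A real gene would also require locating the correct reading frame and, in eukaryotes, removing introns before translating; the sketch ignores both.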

Page 240

By comparison, the rate at which new protein structures are determined remains a trickle because structure determination remains a formidable experimental task. X-ray crystallography was the first technique used to determine the structure of proteins (Kendrew, 1963). One must first coax a protein to crystallize with sufficient regularity to diffract X-rays. Then the crystal must be bombarded with X-rays and the X-ray diffraction pattern collected, either on film or with an electronic detector system.

In principle, the X-ray diffraction pattern corresponds to a Fourier transform of the electron density D of the crystal, with the amplitude and phase of the signal at each point corresponding to the amplitude and phase of the corresponding complex Fourier coefficient. Unfortunately, detectors can record only the amplitude, not the phase. Solving an X-ray crystal structure thus involves determining the density D from the amplitudes of its Fourier coefficients alone, which can be a formidable task. In general, the problem is underdetermined.

A mathematical approach is to add constraints (for example, D must be everywhere positive, since it represents a density). An experimental approach is to use additional information from the X-ray diffraction pattern obtained when the protein is crystallized in the presence of a heavy atom (for example, mercury, uranium, or platinum) or anomalous scatterers (for example, selenium) bound to the protein in a covalent or non-covalent fashion. The difference between the original and modified patterns or the patterns as a function of X-ray wavelength provides the missing phase information. Although the approach is very powerful, it requires that the protein architecture not be significantly changed by this molecular perturbation, and it is more successful when several derivatives are available for study (Blundell and Johnson, 1976).

Finally, one can start with a good guess at the protein structure. The Fourier transform of this structure yields a set of intensities and phases.
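To see what computing amplitudes and phases from a density involves, and why amplitudes alone leave the density underdetermined, consider a one-dimensional toy example. The sketch below uses a plain discrete Fourier transform on a made-up eight-point "density"; the values are hypothetical, and real crystallographic transforms are three-dimensional. It shows that a shifted copy of the density has exactly the same amplitudes but different phases.

```python
import cmath
import math


def dft(density):
    """Discrete Fourier transform of a real 1D density (direct O(n^2) sum)."""
    n = len(density)
    return [
        sum(density[x] * cmath.exp(-2j * math.pi * h * x / n) for x in range(n))
        for h in range(n)
    ]


def amplitudes(density):
    """What a detector can record: the magnitude of each coefficient."""
    return [abs(coefficient) for coefficient in dft(density)]


def phases(density):
    """What a detector cannot record: the phase of each coefficient."""
    return [cmath.phase(coefficient) for coefficient in dft(density)]


if __name__ == "__main__":
    density = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 2.0, 0.0]  # hypothetical values
    shifted = density[-2:] + density[:-2]               # same shape, moved by 2

    same_amplitudes = all(
        abs(a - b) < 1e-9
        for a, b in zip(amplitudes(density), amplitudes(shifted))
    )
    print(same_amplitudes)                         # amplitudes cannot tell them apart
    print(phases(density)[1], phases(shifted)[1])  # but the phases differ
```

Two different densities giving one amplitude pattern is the simplest instance of the ambiguity that the extra constraints and heavy-atom data described above are meant to resolve.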
The hypothetical structure is rotated and translated until the intensities match the experimental data. If the correlation between the hypothetical and actual structure is strong, then the structure determination can succeed without the need for heavy atom derivatives. More recently, nuclear magnetic resonance (NMR) spectroscopy has been used to determine protein structure (Wuthrich, 1986). Pairs of hydrogen atoms (protons) produce resonances when they lie in neighboring positions in the protein chain or when they lie very close together in space. By determining the correspondence of resonances with

Page 241

individual amino acids in the protein, one can determine which amino acids lie near each other. Based on these constraints, one can use the mathematical technique of distance geometry (Crippen and Havel, 1988) or restrained molecular dynamics with simulated annealing to build a partially constrained structure. (The isotopes 13C and 15N can also provide additional information.) Currently, this approach requires a noncrystalline but highly concentrated protein solution and works only for relatively small proteins (the resonances broaden as the molecule size increases and its tumbling slows).

Basic Insights about Protein Structure

If a protein sequence contains sufficient information to code for a folded structure, it should be possible to construct a potential energy function that reflects the energetics of an assembling polypeptide chain. In principle, one would "only" need to find the minimum of this potential function to know the protein's folded state. In practice, this goal has proved elusive. Some early workers defined molecular force fields compatible with the experimentally measured conformational preferences of small molecules (Lifson and Warshel, 1969). Unfortunately, attempts to fold a denatured chain using this approach were unsuccessful (Levitt, 1976; Hagler and Honig, 1978) because multiple local minima along the potential energy surface trapped the folding chain in unproductive conformations (see Figure 9.3). Even with improved search strategies including molecular dynamics and Monte Carlo methods, it has not been possible to find the native structure from a random starting point (Howard and Kollman, 1988; Wilson and Doniach, 1989). This has been called the "multiple minima problem." It remains a critical problem for the conformational analysis of complex molecules. Despite the inability to fold proteins de novo, this approach has proved valuable for studying the behavior of proteins by studying small perturbations around the known structure.

Because direct computation is difficult, one approach would be to look for patterns and regularities in protein structures that might simplify the task of prediction. In fact, considerable insight can be gained by simply looking at experimentally determined protein structures. First of all, one

Page 242

Figure 9.3 The multiple minima problem: a two-dimensional schematic of the energy surface of a folding protein. Different starting points lead to different metastable states. Only S2 finds the global minimum.

observes that proteins tend to employ certain stereotypical local conformations called secondary structures. The most important are called α-helices and β-sheet structures and were suggested by Pauling (Pauling et al., 1951) based on first principles. In an α-helix, the chain follows a right-handed spiral with hydrogen bonds between the amino group (NH) of one amino acid and the carbonyl group (C=O) of an amino acid a few steps further along the chain. The result is a stable structure with a sequentially local network of hydrogen bonds (see Figure 9.4A). β-sheets offer a different solution to the hydrogen bonding problem. These sheets involve segments of the chain that are sequentially distant but conformationally similar, forming an alternating pattern of hydrogen bonds (see Figure 9.4B). The β-strands may lie parallel or antiparallel to one another.

In fibrous proteins, repeated amino acid sequences yield elongated α-helices like α-keratin (or hair) and β-sheets like β-fibroin (or silk). Globular proteins must contain amino acid sequences that break α-helix and β-sheet

Page 243

Figure 9.4 (A) An α-helix. (B) A β-sheet: four parallel β-strands are shown. Hydrogen bonds exist between oxygen atoms on one strand and nitrogen atoms on the neighboring strand.

Page 244

structure and cause the chain to turn back toward the center of the molecule. Secondary structure provides a useful building block for constructing more complex protein structure (Crick, 1953; Levitt and Chothia, 1976). Proteins are usefully classified by their use of secondary structures: α/α proteins are structures dominated by α-helices (for example, myoglobin); β/β proteins are predominantly β-sheet structures (for example, plastocyanin); α/β proteins are characterized by the regular alternation of α-helices and β-strands (for example, flavodoxin); and α + β proteins are characterized by the irregular alternation of α-helices and β-strands (for example, lysozyme) (see Figure 9.5).

Although the building blocks are common, the connectivity of the chain varies within these folding classes. Molecular biologists have borrowed the term "topology" (inappropriately) to describe the path that the chain takes in joining consecutive secondary structure elements. For example, many proteins contain four α-helices packed one against another to form a square four-helix bundle. With one helix taken as the reference point, the other three helices can be visited in six distinct orders. Moreover, each of these three helices can lie parallel or antiparallel to the reference helix. Thus, 48 motifs are possible.

Is there any preference in the arrangements found in nature? By their general structure, α-helices have a dipole moment with partial positive charges near their N-terminus (start) and partial negative charges near their C-terminus (end). If electrostatic considerations are significant, one might expect to see antiparallel arrangements predominate (since opposite charges attract). In fact, a review of available protein structures reveals that 17 of 18 four-helix bundle structures conform to this expectation (Presnell and Cohen, 1989).
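The count of 48 motifs quoted above is the product of 3! = 6 visiting orders and 2^3 = 8 orientation choices. A short enumeration makes this explicit; the helix labels B, C, and D and the variable names are hypothetical, with the reference helix left implicit.

```python
from itertools import permutations, product

# The three helices placed relative to the reference helix (labels are illustrative).
OTHER_HELICES = ("B", "C", "D")
ORIENTATIONS = ("parallel", "antiparallel")

# A motif is one visiting order plus one orientation for each of the three helices.
motifs = [
    (order, orientation)
    for order in permutations(OTHER_HELICES)
    for orientation in product(ORIENTATIONS, repeat=len(OTHER_HELICES))
]

# Fixing any single orientation pattern leaves only the six visiting orders.
one_pattern = ("antiparallel", "parallel", "antiparallel")
with_fixed_pattern = [m for m in motifs if m[1] == one_pattern]

if __name__ == "__main__":
    print(len(motifs))              # 48
    print(len(with_fixed_pattern))  # 6
```

The particular pattern chosen for `one_pattern` is arbitrary; the point is only that committing to one orientation pattern reduces the 48 motifs to 6.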
Of the six possible motifs involving antiparallel arrangements, five have been observed in nature so far, and the sixth is expected to crop up as the database of protein structures grows (see Table 9.1 and Figure 9.6). An important corollary of the study of four-helix bundles is that quite distinct sequences can adopt similar structures: the code for folding is degenerate. Further insight into protein structure is gained by considering the physicochemical properties of the different amino-acid side chains. Some side chains (those called hydrophilic) interact favorably with water, while others (called hydrophobic) do not. For globular proteins, one would expect (Kauzmann, 1959) that the hydrophilic side chains would tend to

Page 245

Figure 9.5 Tertiary structure classes.

Page 246

Table 9.1 Topologies of Currently Known Four-α-Helix Bundles

Number of Overhand Connection(s) | All Antiparallel, Left-handed | All Antiparallel, Right-handed | Others (right-handed)
0 | Complement C3a; Complement C5a; Cytochrome b5; Interleukin 2; T4 lysozyme | Cytochrome b-562; Cytochrome c'; Methemerythrin; TMV coat protein |
1 | Ferritin | Phospholipase C (b) | Cytochrome P-450cam
2 | Human growth hormone | |

NOTE: There are no left-handed topologies for "other" four-α-helix bundles. TMV is the tobacco mosaic virus.

dominate the exterior of the protein (where it interacts with the aqueous environment) while hydrophobic side chains would occupy the molecule's interior. Richards devised a simple method for defining the "solvent-accessible" portion of a protein by rolling a sphere with a radius comparable to that of a water molecule along the molecular surface (Lee and Richards, 1971). When amino acid residues are categorized in this way, it is indeed found that hydrophobic residues tend to occur on the inside and hydrophilic residues tend to occur on the outside, although the correlation is far from perfect. Solvent-accessible surface area calculations have shed light on the importance of the "hydrophobic effect" in driving protein folding and have proved valuable in dissecting the stabilization of protein-protein and secondary structure-secondary structure interactions.

In summary, the analysis of protein structures has produced some unassailable conclusions: packing is an important element of protein stability; secondary structure is a common component of protein structure;

Page 261

Figure 9.9 The basic neural network. Circles represent the units, and squares the weights between the units. The larger the square, the greater its absolute value. Solid squares represent negative values, and open squares represent positive values. The bars represent the outputs of the units, with values ranging from 0.0 to 1.0. The symbols are defined as follows: $o_k$ is the output of unit k; $w_{ik}$ is the weight to unit i from unit k; and $b_i$ is the bias of unit i. The activation of unit i is $a_i = \sum_k w_{ik} o_k + b_i$, and its output is $o_i = 1/(1 + e^{-a_i})$. Reprinted, by permission, from Kneller et al. (1990). Copyright © 1990 by Academic Press Limited.

and leucine are seen to strongly favor α-helices, while proline and glycine disrupt the α-helix structure and prefer turn conformations. This is consistent with the structural information derived from previous studies and reinforces the sensibility of this approach.

Why do neural networks not perform more accurately? We have begun to address this question. If a network is trained on proteins restricted to one structural class, especially all-helical proteins, the accuracy improves significantly (Kneller et al., 1990). For example, nodes can be added that capture the alternating distributions of hydrophobic and

Page 262

Figure 9.10 Hinton diagram of the weight matrix. The weights connecting each input unit to each output unit are shown. The size of each box correlates with the magnitude of the weight. Solid boxes have negative values, and open boxes have positive values. The three output units are shown as separate blocks. The 273 input units connected to each output unit are divided into 13 groups of 21 units each. Along the abscissa, the groups are located at positions in the range -6, -5, ..., 0, ..., 5, 6 relative to the central residue. Along the ordinate, the 21 units are labeled by residue type. Reprinted, by permission, from Kneller et al. (1990). Copyright © 1990 by Academic Press Limited.
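The input layout described in the caption (13 window positions times 21 units, 273 inputs in all) and the weighted-sum activation of a single output unit can be sketched as follows. This is a minimal illustration, not the network of Kneller et al.: it assumes that the 21st unit in each group marks window positions that fall off the end of the chain, that the output is a logistic function of the activation, and that residues are ordered alphabetically by one-letter code. The function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"    # 20 residue types (ordering is an assumption)
UNITS_PER_POSITION = 21                 # 20 residues + 1 assumed "off the chain" unit
WINDOW_OFFSETS = range(-6, 7)           # positions -6 ... +6 around the central residue


def encode_window(sequence, center):
    """One-hot encode the 13-residue window centered on sequence[center]."""
    inputs = []
    for offset in WINDOW_OFFSETS:
        group = [0.0] * UNITS_PER_POSITION
        position = center + offset
        if 0 <= position < len(sequence):
            group[AMINO_ACIDS.index(sequence[position])] = 1.0
        else:
            group[UNITS_PER_POSITION - 1] = 1.0  # window extends past the chain end
        inputs.extend(group)
    return inputs


def unit_output(weights, inputs, bias):
    """Output of one unit: logistic function of the weighted sum plus bias."""
    activation = sum(w * o for w, o in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-activation))


if __name__ == "__main__":
    x = encode_window("MKTAYIAKQR", center=4)
    print(len(x))                                # 273 input units
    print(unit_output([0.0] * len(x), x, 0.0))   # 0.5 for an untrained unit
```

A full predictor would hold one weight vector of length 273 for each of the three output units and assign the central residue to whichever unit produces the largest output.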

Page 263

hydrophilic residues in a phase-insensitive way. Together, these methods improve secondary structure prediction accuracy for all-helical proteins to 79 percent and for all-β proteins to at least 70 percent. α/β proteins remain problematic. Presumably, this relates to the fact that the fundamental structural unit in α/β proteins involves an α-helix and the preceding and/or following β-strands. This super-secondary structure involves at least 30 residues, far more than are included in the windows currently used. Exploiting a family of aligned protein sequences is expected to improve secondary structure prediction. Other aspects of this problem continue to make this a fertile area for study.

The second general approach to secondary structure prediction is rule-based methods, which try to capture biochemical regularities in protein structure. In an important early paper, Lim (1974) described "rules" that specify residue combinations along the chain that stabilize or destabilize α-helices and β-sheets. The rules attempted to capture the notion that secondary structure elements need to be compatible with the overall tertiary structure consisting of a hydrophobic core and hydrophilic exterior. Among other constraints, isolated β-strands are stable only in the context of larger β-sheets, and the edge strands in these β-sheets have significantly different properties than the interior strands. Technical difficulties in the formulation of the rules hampered efforts to implement these ideas in a computer program, but this does not detract from the insightfulness of the approach.

Our group has followed up on the rule-based approach pioneered by Lim. We have constructed PLANS, a Pattern Language for Amino and Nucleic Acids Sequences, and implemented this language in LISP and C (Cohen et al., 1983, 1986). Accurate patterns can be written to identify various structural features. For example, rule-based patterns can be used to identify "turns" in protein chains. Turns contain hydrophilic stretches without periodic structure; they tend to be evenly distributed throughout the protein chain. Extremely hydrophilic clusters of amino acids are nearly always turns. Weaker turns can be identified as relatively hydrophilic clusters of residues appropriately separated from the more obvious turns. The spacing between turns depends on the secondary structure content of the protein. For example, an α-helical segment bounded by turns contains approximately twice as many residues as a similar β-strand segment. The class of a protein (α/α, α/β, β/β) offers a simple way of specifying the expected link length

Page 264

between turns. Using a hierarchical set of turn patterns embodying these ideas, one can identify turns with ~90 percent accuracy.

Other work on rule-based methods has focused on finding the exact location of α-helices in α/α proteins (Presnell et al., 1992). Even though helices are heterogeneous objects, patterns have been developed to recognize their beginnings (N-caps or N-termini), cores, and ends (C-caps or C-termini). While the core patterns are very accurate (> 90 percent of helices can be located), the termini, especially the C-termini, remain poorly defined. Because of these deficiencies, amalgamation of the three groups of patterns produces a secondary structure prediction that is only 71 to 78 percent correct. Developing sequence patterns to represent protein substructures is labor intensive. Recently, there have been attempts to automate this process by means of heuristic, iterative algorithms for pattern construction (King and Sternberg, 1990).

The hierarchical approach to protein structure prediction is premised on the notion that secondary structure will be a useful computational intermediate for the prediction of overall tertiary structure. How exactly can one use secondary structure information to bootstrap the process? Conceptually, the most straightforward approach to this problem would be to construct all possible three-dimensional arrangements of the secondary structure segments and then eliminate those with structural flaws (high potential energy). Combinatorial approaches can be used to search over the many possible arrangements and evaluate their plausibility (Cohen et al., 1979; Ptitsyn and Rashin, 1975). The approach is particularly well developed for the case of α-helices, owing to the fact that the periodicity of α-helices tends to favor distinct packing geometries for pairs of α-helices (Crick, 1953; Chothia et al., 1977; Richmond and Richards, 1978; Murzin and Finkelstein, 1988). Hydrophobic residues tend to dominate the interfacial region between paired α-helices. Moreover, there is a correlation between the extent of the hydrophobic interface on the α-helices and the preferred packing geometry. In the next section, we describe an application of this approach to the oxygen-bearing protein myoglobin.

Similar work has been done on β/β proteins (Cohen et al., 1980, 1982; Finkelstein and Reva, 1991). For α/β proteins, the combinatorial complexity is much greater than for β/β and α/α proteins, but it is still possible to use

Page 265

secondary structure as a guide to approximate tertiary structure. A complete description of these combinatorial algorithms can be found in a recent review (Cohen and Kuntz, 1989). If the hierarchical approach to protein structure prediction is to succeed, secondary structure prediction must improve (to at least the 80 percent accuracy level), the combinatorial methods must be further refined, and the radius of convergence of existing potential functions must be extended to allow optimization of the final structure (Wilson and Doniach, 1989; Sippl, 1990; Troyer and Cohen, 1991).

Predicting Myoglobin Structure: An Excursion

Myoglobin, a 153-residue oxygen-carrying protein, was the first protein structure to be determined by X-ray crystallography. It is composed of six long α-helices and two other smaller helices that do not contribute to the protein's hydrophobic core. In the 1970s we showed that it is possible to construct three-dimensional models of myoglobin by the successive addition of helices to an initial helix using the putative hydrophobic interfaces, while respecting the geometric preferences of helix-helix interactions. From the work of Pauling (1967), we know the conformation and hence the atomic coordinates of the backbone of an α-helix.

To begin, we are free to place helix A (residues 3 through 18) so that its axis is coincident with the x-axis with its centroid at the origin. Residue 10 is the center of a potential helix-helix interaction site and creates a sticky patch on the surface of helix A. One possible pairing of helices would join A and B (20 through 35) through sticky patches centered at residues 10 and 28. A line segment perpendicular to the axis of helix A that passes through the Cα of residue 10 with a length of 8.5 Å can be used to place helix B such that the segment passes through the Cα of residue 28, is normal to the axis of helix B, and terminates at this axis. Helix E could be placed via its interaction with helix B, and so on. While the packing of helices B and E will be sensible, nothing in this procedure prevents helices A and E from colliding. For the six helices of myoglobin, there are 14 likely helix-helix interaction sites and 3.4 × 10^8 possible structures.
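The successive-addition procedure above can be organized as a depth-first enumeration in which a partial model is abandoned as soon as a newly placed helix collides with one already placed. The sketch below is a toy version under loud assumptions: each helix is reduced to a single point (its centroid), each helix has a short list of hypothetical candidate positions, and a "collision" means two centroids closer than a chosen minimum distance. The actual method works with full helix geometry and sticky-patch pairings; none of the coordinates here come from myoglobin.

```python
import math


def collides(placed, candidate, min_distance):
    """True if the candidate centroid is too close to any helix already placed."""
    return any(math.dist(point, candidate) < min_distance for point in placed)


def enumerate_structures(candidate_sites, min_distance, placed=()):
    """Depth-first generation of all collision-free assignments.

    candidate_sites[i] lists the possible centroid positions for helix i.
    A branch is pruned the moment a placement collides, so none of its
    descendants are ever generated.
    """
    depth = len(placed)
    if depth == len(candidate_sites):
        yield placed
        return
    for site in candidate_sites[depth]:
        if collides(placed, site, min_distance):
            continue  # prune this whole branch of the tree
        yield from enumerate_structures(candidate_sites, min_distance, placed + (site,))


if __name__ == "__main__":
    # Hypothetical candidate positions (in angstroms) for three helices.
    sites = [
        [(0.0, 0.0, 0.0)],
        [(8.5, 0.0, 0.0), (0.0, 8.5, 0.0)],
        [(8.5, 8.5, 0.0), (8.5, 0.0, 0.0), (0.0, 0.0, 0.0)],
    ]
    survivors = list(enumerate_structures(sites, min_distance=8.0))
    print(len(survivors))  # 3 of the 1 * 2 * 3 = 6 unpruned combinations
```

Because pruning removes a branch before any of its descendants are built, the work saved grows rapidly with the number of helices, which is what makes a search space of hundreds of millions of candidates manageable.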

Page 266

When this calculation was first carried out, it took 48 hours on a PDP 11/70 that filled an entire machine room. Today, a laptop computer can complete the same calculation in much less than one hour. An algorithm with a tree architecture can be used to generate these structures. Fortunately, many of the possible structures violate steric constraints (that is, parts of the molecule collide) or disrupt the connectivity of the chain (that is, the interhelix portion of the protein chain cannot stretch from the end of one helix to the start of the next helix), and so large branches of the "tree" can be removed from further consideration. Remarkably, only 20 plausible structures are obtained.

Using the additional information that myoglobin contains an iron-bearing heme group, the list can be winnowed: only 2 of the 20 structures could use two histidines to chelate an iron atom surrounded by a heme ring in a sterically reasonable way. As it happens, these 2 structures are extremely similar (root-mean-square displacement (rmsd) between Cα atoms = 0.7 Å) and resemble the crystal structure of myoglobin (rmsd = 4.4 Å). Presumably, detailed energy calculations could be used to refine these structures. To date, the radius of convergence of existing molecular dynamics algorithms is too small to close the 4-Å gap between these approximate structures and the X-ray structure.

Conclusion

The protein folding problem is enormously important to biologists. Sequences for exciting new proteins are relatively easy to determine. Structural data for these molecules are much more difficult to obtain. Yet proteins contain a structural blueprint within their sequence. The computational challenge to unravel this blueprint is great. This chapter has highlighted the important problems in this field and identified fertile territory for new investigations.

Acknowledgments

This work was supported by grants from the National Institutes of Health, the Searle Scholars Program, and the Advanced Research Projects Agency.

Page 267

References

Anfinsen, C.B., E. Haber, M. Sela, and F.H. White, 1961, "The kinetics of the formation of native ribonuclease during oxidation of the reduced polypeptide domain," Proceedings of the National Academy of Sciences USA 47, 1309-1314.
Barton, G.J., and M.J.E. Sternberg, 1987, "Evaluation and improvements in the automatic alignment of protein sequences," Protein Engineering 1, 89-94.
Blundell, T.L., and L.N. Johnson, 1976, Protein Crystallography, New York: Academic Press.
Bowie, J.U., R. Luthy, and D. Eisenberg, 1991, "A method to identify protein sequences that fold into a known 3-dimensional structure," Science 253, 164-170.
Bruccoleri, R.E., and M. Karplus, 1987, "Prediction of the folding of short polypeptide segments by uniform conformational sampling," Biopolymers 26, 137-168.
Chothia, C., 1974, "Hydrophobic bonding and accessible surface area in proteins," Nature 248, 338-339.
Chothia, C., and A.M. Lesk, 1986, "The relation between divergence of sequence and structure in proteins," EMBO J. 5, 823-826.
Chothia, C., A.M. Lesk, A. Tramontano, M. Levitt, S.J. Smith-Gill, G. Air, S. Sheriff, E.A. Padlan, D. Davies, and W.R. Tulip, 1989, "Conformations of immunoglobulin hypervariable regions," Nature 343, 877-883.
Chothia, C., M. Levitt, and D. Richardson, 1977, "Structure of proteins: Packing of α-helices and pleated sheets," Proceedings of the National Academy of Sciences USA 74, 4130-4134.
Chou, P.Y., and G.D. Fasman, 1974, "Conformational parameters for amino acids on helical, β-sheet, and random coil regions calculated from proteins," Biochemistry 13, 211-245.
Cohen, F.E., M.J.E. Sternberg, and W.R. Taylor, 1980, "Analysis and prediction of protein β-sheet structures by a combinatorial approach," Nature 285, 378-382.
Cohen, F.E., M.J.E. Sternberg, and W.R. Taylor, 1982, "The analysis and prediction of tertiary structure of globular proteins involving the packing of α-helices against a β-sheet: A combinatorial approach," Journal of Molecular Biology 156, 821-862.
Cohen, F.E., R.A. Abarbanel, I.D. Kuntz, and R.J. Fletterick, 1983, "Secondary structure assignment for α/β proteins by a combinatorial approach," Biochemistry 22, 4894-4904.
Cohen, F.E., R.A. Abarbanel, I.D. Kuntz, and R.J. Fletterick, 1986, "Turn prediction in proteins using a pattern-matching approach," Biochemistry 25, 266-275.
Cohen, F.E., and I.D. Kuntz, 1989, "Tertiary structure predictions," pp. 647-706 in Prediction of Protein Structure and the Principles of Protein Conformation, G.D. Fasman (ed.), New York: Plenum.
Cohen, F.E., T.J. Richmond, and F.M. Richards, 1979, "Protein folding: Evaluation of some simple rules for the assembly of helices into tertiary structures with myoglobin as an example," Journal of Molecular Biology 132, 275-288.
Crick, F.H.C., 1953, "The packing of α-helices: Simple coiled coils," Acta Crystallogr. 6, 689-697.

OCR for page 236
Page 268 Crippen, G.M., and T.F. Havel, 1988, Distance Geometry and Molecular Conformation, New York: John Wiley & Sons. Dayhoff, M.O., L.T. Hunt, P.J. McLaughlin, and D.D. Jones, 1972, "Gene duplications in evolution: The globins," pp. 17-30 in Atlas of Protein Sequence and Structure, Vol. 5, M.O. Dayhoff (ed.), Silver Spring, Md.: National Biomedical Research Foundation. Dorit, R.L., L. Schoenbach, and W. Gilbert, 1990, "How big is the universe of exons?" Science 250, 1377-1382. Finkelstein, A.V., and B.A. Reva, 1991, "A search for the most stable folds of protein chains," Nature 351,497-499. Fischer, G., and F.X. Schmid, 1991, "The mechanism of protein folding: Implications of in vitro refolding mode for de novo protein folding and translocation in the cell," Biochemistry 29, 2205-2212. Frauenfelder, H., H. Hartmann, M. Karplus, I.D. Kuntz, Jr., J. Kuriyan, F. Darak, G.A. Petsko, D. Ringe, R.F. Tilton, Jr., and M.L. Connolly, 1987, "Thermal expansion of a protein," Biochemistry 26, 254-261. Freedman, R.B., 1989, "Protein disulfide isomerase: Multiple roles in the modification of nascent secretory proteins," Cell 57, 1069-1072. Garier, J., D.J. Osguthorpe, and B. Robson, 1978, "Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins," Journal of Molecular Biology 120, 97-120. Gilson, M.K., and B.H. Honig, 1986, "The dielectric constant of a folded protein," Biopolymers 25, 2097-2119. Gonnet, G.H., M.A. Cohen, and S.A. Benner, 1992, "Exhaustive matching of the entire protein sequence database," Science 256, 1443-1445. Greer, J., 1990, "Comparative modeling methods: Application to the family of the mammalian serine proteases," Proteins Struct. Funct. Genet. 7, 317-334. Gregoret, L.M., and F.E. Cohen, 1990, "Novel method for the rapid evaluation of packing in protein structures," Journal of Molecular Biology 211, 959-974. Hagler, A.T., and B. 
Honig, 1978, "On the formation of protein tertiary structure on a computer," Proceedings of the National Academy of Sciences USA 75, 554-558. Holley, L.H., and M. Karplus, 1989, "Protein secondary structure prediction with a neural network," Proceedings of the National Academy of Sciences USA 86, 152-156. Howard, A.E., and P.A. Kollman, 1988, "An analysis of current methodologies for conformational searching of complex molecules," J. Med. Chem. 31, 1675-1679. Jones, T.A., and S. Thirup, 1986, "Using known substructures in protein model building and crystallography," EMBOJ. 5, 819-822. Kabsch, W., and C. Sander, 1984, "On the use of sequence homologies to predict protein structure: Identical pentapeptides can have completely different conformations," Proceedings of the National Academy of Sciences USA 81, 1075-1078. Karlin, S., and S.F. Altschul, 1990, "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes," Proceedings of the National Academy of Sciences USA 87, 2264-2268. Kauzmann, W., 1959, "Some factors in the interpretation of protein denaturation," Adv. Protein Chem. 14, 1-63. Kendrew, J.C., 1963, "Myoglobin and the structure of proteins," Science 139, 1259-1266.

OCR for page 236
Page 269 King, R.D., and M.J.E. Sternberg, 1990, "Machine learning approach for the prediction of protein secondary structure," Journal of Molecular Biology 216,441-457. Kneller, D.G., F.E. Cohen, and R. Langridge, 1990, "Improvements in protein secondary structure prediction by an enhanced neural network," Journal of Molecular Biology 214, 171-182. Konigsberg, W.H., and H.M. Steinman, 1977, "Strategy and methods of sequence analysis," pp. 1-178 in The Proteins, Vol. 3, 3rd ed., H. Neurath and R.L. Hill (eds.), New York: Academic Press. Kumamoto, C.A., 1991, "Molecular chaperones and protein translocation across the Escherichia coli inner membrane," Molecular Microbiology 5, 19-22. Kuwajima, K., 1989, "The molten globule state as a clue for understanding the folding and cooperativity of globular-protein structure," Proteins Struct. Funct. Genet. 6, 87-103. Lee, B., and F.M. Richards, 1971, "The interpretation of protein structures: Estimation of solvent accessibility," Journal of Molecular Biology 55, 379-400. Lesser, G.J., and G.D. Rose, 1990, "Hydrophobicity of amino acid subgroups in proteins," Proteins Struct. Funct. Genet. 8, 6-13. Levitt, M., 1976, "A simplified representation of protein structures and implications for protein folding," Journal of Molecular Biology 104, 59-107. Levitt, M., and C. Chothia, 1976, "Structural patterns in globular proteins," Nature 261, 552558. Lifson, S., and A. Warshel, 1969, "Consistent force field for calculations of conformations, vibrational spectra, and enthalpies of cycloalkane and n-alkane molecules," J. Chem. Phys. 49, 5116-5129. Lim, V.I., 1974, "Structural principles of the globular organization of protein chains: A stereochemical theory of globular protein secondary structure," Journal of Molecular Biology 88, 857-894. Linderstrom-Lang, K.V., and J.A. Schellman, 1959, "Protein structure and enzyme activity," pp. 443-510 in The Enzymes, Vol. 1, P.D. Boyer (ed.), New York: Academic Press. Maxam, A., and W. 
Gilbert, 1980, "Nucleic acids Part I," pp. 499-560 in Methods in Enzymology, Vol. 65, L. Grossman and K. Moldave (eds.), New York: Academic Press. Murzin, A.G., and A.V. Finkelstein, 1988, "General architecture of the alpha-helical globule," Journal of Molecular Biology 204, 749-769. Novotny, J., A.A. Rashin, and R.E. Bruccoleri, 1988, "Criteria that discriminate between native proteins and incorrectly folded models," Proteins Struct. Funct. Genet. 4, 19-30. Overington, J., M.S. Johnson, A. Sali, and T. L. Blundell, 1990, "Tertiary structural constraints on protein evolutionary diversity: Templates, key residues and structure prediction," Proceedings of the Royal Society of London, Series B: Biological Sciences 241, 132-145. Padmanabhan, S., S. Marqusee, T. Ridgeway, T.M. Laue, and R.L. Baldwin, 1990, "Relative helix-forming tendencies of nonpolar amino acids," Nature 344, 268-270. Pauling, L., 1967, The Chemical Bond: A Brief Introduction to Modern Structural Chemistry, Ithaca, N.Y.: Cornell University Press.

OCR for page 236
Page 270 Pauling, L., R.B. Corey, and H.R. Branson, 1951, "The structure of proteins: Two hydrogenbonded helical configurations of the polypeptide chain," Proceedings of the National Academy of Sciences USA 37, 205-211. Pearl, L.H., and W.R. Taylor, 1987, "A structural model for the retroviral proteases," Nature 329, 351-354. Ponder, J.W., and F.M. Richards, 1987, "Tertiary templates for proteins: Use of packing criteria in the enumeration of allowed sequences for different structural classes," Journal of Molecular Biology 193, 775-791. Presnell, S.R., B.I. Cohen, and F.E. Cohen, 1992, "A segment based approach to protein structure prediction," Biochemistry 31, 983-993. Presnell, S.R., and F.E. Cohen, 1989, "Topological distribution of four-alpha-helix bundles," Proceedings of the National Academy of Sciences USA 86, 6592-6596. Ptitsyn, O.B., and A.A. Rashin, 1975, "A model of myoglobin self-organization," Biophys. Chem. 3, 1-20. Qian, N., and T.J. Sejnowski, 1988, "Predicting the secondary structure of globular proteins using neural network models," Journal of Molecular Biology 202, 865-884. Richards, F.M., 1977, "Areas, volumes, packing, and protein structure," Annu. Rev. Biophys. Bioeng. 6, 151-176. Richmond, T.J., 1984, "Solvent accessible surface area and excluded volume in proteins," Journal of Molecular Biology 176, 63-89. Richmond, T.J., and F.M. Richards, 1978, "Packing of a-helices: Geometrical constraints and contact areas," Journal of Molecular Biology 119, 537-555. Rumelhart, D.E., G.E. Hinton, and R.J. Williams, 1986, Parallel Distributed Processing. Explorations in the Microstructure of Cognition, Vol. 1, Cambridge, Mass.: MIT Press, pp. 318-362. Sippl, M.J., 1990, "Calculation of conformational ensembles from potentials of mean force: An approach to the knowledge-based prediction of local structures in globular proteins," Journal of Molecular Biology 213, 859-883. Smith, R.F., and T.F. 
Smith, 1990, "Automation generation of primary sequence patterns from sets of related protein sequences," Proceedings of the National Academy of Sciences USA 87, 118-122. Smith, T.F., and M.S. Waterman, 1981, "Comparison of biosequences," Adv. Appl. Math. 2, 482. Sondek, J., and D. Shortle, 1990, "Accommodation of single amino acid insertions by the native state staphylococcal nuclease," Proteins Struct. Funct. Genet. 7, 299-305. Troyer, J.M., and F.E. Cohen, 1991, "Simplified models for understanding and predicting protein structure," pp. 57-80 in Reviews in Computational Chemistry, K.B. Lipkowitz and D.B. Boyd (eds.), New York: VCH Publishers, Inc. Wendoloski, J.J., and F.R. Salemme, 1992, "PROBIT-A statistical approach to modeling proteins from partial coordinate data using substructure libraries," J Molec. Graphics 10, 124-126. Wierenga, R.K., and W.G. Hol, 1993, "Predicted nucleotide binding properties of p21 protein and its cancer associated variant," Nature 302, 842-844. Wilson, C., and S. Doniach, 1989, "A computer model to dynamically simulate protein folding: Studies with Crambin," Proteins Struct. Funct. Genet. 6, 193-209.

OCR for page 236
Page 271 Wilson, C., L.M. Gregoret, and D.A. Agard, 1993, "Modeling side-chain conformation for homologous proteins using an energy-based rotamer search," Journal of Molecular Biology 229, 996-1006. Wuthrich, K., 1986, NMR of Proteins and Nucleic Acids, New York: John Wiley & Sons.