

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




6. Tertiary Structure of Proteins and Nucleic Acids: Theory

ENERGY OPTIMIZATION

According to the thermodynamic hypothesis, based on Anfinsen's (Anfinsen et al., 1961) experiments on the oxidative refolding of bovine pancreatic ribonuclease A from the reduced form, the amino acid sequence determines the three-dimensional structure of a protein in a given medium as the thermodynamically most stable one. It must be emphasized that this hypothesis applies to the length of the polypeptide chain at the stage when folding takes place, and not at some subsequent processing stage. For example, this hypothesis would be applicable to the single-chain pro-insulin and (possibly) not to the processed product, the two-chain disulfide-linked insulin. Thus, it is a challenge to chemists to understand how the interatomic interactions within the polypeptide chain and the interactions between the atoms of the chain and those of the surrounding solvent lead to the thermodynamically stable structure, that is, the one for which the statistical weight of the system is a maximum.

The hypothesis that the statistical weight is a maximum immediately implies that some kind of optimization strategy is necessary to find the most stable structure. This requires procedures to generate arbitrary conformations of a polypeptide chain, compute
the statistical weight for each conformation, and then alter the conformation so that it ultimately corresponds to the global maximum of the statistical weight. Although the problem is formidable, current indications are that it can be solved using a sound scientific approach without resorting to ad hoc procedures to deduce "folding rules" that do not explain the molecular basis of such "rules."

To compute the global statistical weight, one optimizes the conformational energy of the polypeptide, incorporates the effect of the solvent, and takes account of the entropy of the system. Procedures are available for carrying out such computations, and currently available supercomputers permit the computations to be applied to very large systems. Although these procedures will undoubtedly improve, they are now adequate for computing and obtaining results that can be checked experimentally. The major difficulty still to be overcome, although partial success has been achieved (see review by Gibson and Scheraga, 1988; also Robson, 1986; Crippen, 1984), arises from the presence of many local minima in the multidimensional energy surface. Although algorithms are available for minimizing an energy function of many variables, there are no efficient ones for passing from one local minimum, over a potential barrier, to the next local minimum and ultimately to the global minimum of the potential energy in a very high dimensional space. Thus, minimization leads to the nearest local minimum, where the procedure is trapped. This trapping in a local, rather than the global, minimum is referred to as the "multiple-minima problem." Efforts are being made to overcome this problem using a variety of procedures, including approximations that place the system in the potential well in which the global minimum lies (Gibson and Scheraga, 1988). Then, any approximations (introduced in the initial stages) are abandoned, and the full energy function is minimized.
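The trapping behavior described above is easy to reproduce. The sketch below is plain Python; the quartic double-well "energy" and the step size are invented for illustration. Naive gradient descent started from two points on a one-dimensional surface with two minima slides into the nearest well and stays there, so only one of the runs finds the global minimum:

```python
def energy(x):
    # Toy one-dimensional "conformational energy" with two wells:
    # the global minimum near x = -1.04 and a local minimum near x = +0.96.
    return x**4 - 2.0 * x**2 + 0.3 * x

def gradient(x):
    return 4.0 * x**3 - 4.0 * x + 0.3

def minimize(x, step=0.01, iters=5000):
    # Plain gradient descent: it only ever moves downhill, so it stops
    # at whichever local minimum lies downhill of the starting point.
    for _ in range(iters):
        x -= step * gradient(x)
    return x

x_a = minimize(-1.5)  # lands in the left (global) well
x_b = minimize(+1.5)  # trapped in the right (local) well
```

Uphill-capable moves (Monte Carlo, molecular dynamics) are needed precisely because `minimize` here has no mechanism for crossing the barrier between the two wells.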
The use of molecular dynamics for minimization is an alternative strategy; it is considered in the next section. The energy minimization approach and its associated computational program, described here for polypeptides and proteins, is equally applicable to any other type of macromolecule, such as polynucleotides and polysaccharides, as well as to interactions between the various types of macromolecule. General references to this methodology include the works of Anfinsen and Scheraga
(1975), Nemethy and Scheraga (1977), Levitt (1982), Karplus and McCammon (1983), Scheraga (1984), and Richards (1986).

Generation of an Arbitrary Conformation

Analytical geometry and associated matrix algebra provide the tools to generate a polypeptide chain. To do so, internal coordinates (dihedral angles for rotation about bonds) or Cartesian coordinates may be used as independent variables. When internal coordinates are used, the bond lengths and bond angles are held fixed, but chosen very carefully from x-ray structures of model systems so as to properly reflect the geometric results of side chain-backbone interactions within each amino acid residue. The validity of this approach has been demonstrated for both polypeptide and polysaccharide systems in which strained conformations rarely arise (Scheraga, 1984). When Cartesian coordinates are used, and hence bond lengths and bond angles are allowed to vary, one faces the problem that force constants for these motions are not as well known as the geometric features of the molecule. Further, bond-angle bending modes are anharmonic, and no currently available forcefield takes anharmonicity into account. Thus, we will have to overcome inadequacies in the force constants and problems of anharmonicity before we can rely on computed bond angles. Imposition of fixed bond lengths and bond angles is at least reconciled with observed crystal structures of small molecules. The subject of fixed versus flexible geometries has been discussed by Swenson et al. (1978).

Potential Functions

Many research groups have developed potential energy functions with which to carry out such computations on polypeptides, polysaccharides, polynucleotides, and synthetic polymers. The various potential functions have many similarities, but differ enough in detail to make them difficult to compare.
This difficulty is compounded by the small number of cases for which parameters that characterize the strengths of the various interactions have been established in a self-consistent way on experimental data such as crystal structures, lattice energies, and barriers to internal rotation. At present, the potential functions for polypeptides,
polysaccharides, and synthetic polymers have been better parameterized than have those for polynucleotides, primarily because there are fewer reliable data for model nucleotide compounds. Similarly, more experimental data will be required for proper parameterization of the potential functions for the prosthetic groups attached to biological macromolecules. Until recently, the potential functions involving water molecules still would not account adequately for the observed radial distribution function of water, but this situation is improving (see, e.g., the work of Jorgensen et al., 1983).

It must be emphasized that the molecule responds to the total (in principle, quantum mechanical) potential function, and its partition into various empirical components (such as nonbonded, electrostatic, hydrogen-bonding, and other interactions) is at best arbitrary, although there is some physical basis for assigning such names to the various components. Consequently, one must avoid combining components from forcefields from different research groups. Each forcefield must be parameterized by itself in a self-consistent way. The total energy of a given conformation is expressed as a sum of the energies of all nonbonded pairs of atoms. When considering biological function, such as the formation of an enzyme-substrate complex, the pair interactions both within and between the two partners of the complex must be included in the total energy. Thus, the influence of each member of the complex on the other (the so-called induced-fit phenomenon) is taken into account. Consequently, the conformations of the partners in the complex can differ from their conformations as individual species. This means that the computed conformation of an isolated hormone may not resemble the biologically active one when it is bound to a receptor (Scheraga, 1984).
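The pairwise sum described above can be sketched in a few lines. This is only an illustration of the bookkeeping, not any published forcefield: the 6-12-plus-Coulomb form is generic, and the `sigma`, `eps`, and charge values are placeholders. Note that feeding the concatenated atom lists of two partners through the same double loop automatically includes the between-partner (induced-fit) terms:

```python
import itertools
import math

def pair_energy(r, qi, qj, sigma=3.4, eps=0.2, coul=332.0636):
    # One nonbonded pair: a 6-12 Lennard-Jones term (minimum of depth
    # -eps at r = sigma) plus a Coulomb term. sigma (Angstrom), eps
    # (kcal/mol), and the charges are illustrative placeholders, not
    # values taken from any self-consistent forcefield.
    s6 = (sigma / r) ** 6
    return eps * (s6 * s6 - 2.0 * s6) + coul * qi * qj / r

def total_energy(coords, charges):
    # Total conformational energy as a sum over all nonbonded pairs.
    # For an enzyme-substrate complex, concatenating the two partners'
    # atom lists makes this sum include the within-partner and the
    # between-partner pair terms automatically.
    e = 0.0
    for i, j in itertools.combinations(range(len(coords)), 2):
        r = math.dist(coords[i], coords[j])
        e += pair_energy(r, charges[i], charges[j])
    return e
```

With two uncharged atoms placed exactly `sigma` apart, `total_energy` returns the single-pair well depth, `-eps`.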
Although potential functions could still be improved, those for which parameters have been determined in a self-consistent way have led to many computed structures that have subsequently been checked by experiment. For example, the computed structure of the collagen-like poly(Gly-Pro-Pro) (Miller and Scheraga, 1976) agrees with the subsequently determined crystal structure (Okuyama et al., 1976) within a root-mean-square (rms) deviation of 0.3 A. Scheraga (1984) has cited many similar examples of experimental verification of computed structures. Therefore, we can be confident that the problem of adequate potential functions
is not serious, and more effort should be devoted to the most difficult problem of all, the multiple-minima problem. After the multiple-minima problem is solved, it will then be worthwhile to reexamine the possibility or necessity of improving the potential functions further.

Solvation

There are several methods to include the effect of hydration in the computations. Hydration tends to force the polar groups to the surface of the molecule, putting them in contact with water, and forces the nonpolar groups to the interior, removing them from contact with water.

One method includes the water molecules explicitly and calculates the interaction energy between the molecule and the water. The success of this approach depends on the adequacy of the potential function describing the water-water interaction. Another approach ignores the structural features of the water molecule and assigns a hydration shell (and an accompanying free energy of hydration) to each atom or group of atoms. As the conformation changes and hydration shells overlap, a free-energy penalty is assessed. The hydration-shell model is parameterized on experimental data on free energies of hydration (Kang et al., 1987), but current efforts are being made to obtain such free energies from Monte Carlo and molecular dynamics studies of aqueous solutions of small molecules (see, e.g., Jorgensen et al., 1983). The computation of free energies by these simulation techniques faces many theoretical obstacles, although this very active field of research is progressing quickly.

Entropy

A variety of methods exist to incorporate entropic effects. One entropic effect arises from the hydration; this effect is treated as described earlier. The other entropic effect arises from the conformational fluctuations of the molecule. Several procedures may be used to compute this contribution, the most direct being the evaluation of the second derivative of the potential function.
This describes the curvature at the bottom of the potential energy well and hence the fluctuation in conformation about each local minimum in the potential energy surface. Local minima that are
not the global minimum of the potential energy can become the conformations of higher statistical weight if there is a large enough entropy gain from large conformational fluctuations. Paine and Scheraga (1987) have encountered such effects.

Optimization Procedures

Optimization procedures are available for searching for the local minima. These include direct energy minimization, Monte Carlo, and molecular dynamics procedures. In energy minimization, the variables describing the conformation are altered systematically so as to lower the energy continuously. The Monte Carlo method makes random changes in the conformational variables and accepts the new conformation according to various protocols that compute the energies before and after the random changes in the conformation. In molecular dynamics calculations, Newton's equations of motion for the atoms of the macromolecule (subject to interatomic forces determined by the potential functions) are solved to obtain a trajectory in conformational space. Very efficient energy minimization algorithms exist, even for functions of many variables, but lead only to local minima (however, see the following section). Conventional Monte Carlo procedures can overcome local minima but tend not to cover conformational space efficiently enough. However, this difficulty is being overcome using modifications that include adaptive importance sampling and other efficiency-seeking procedures (Gibson and Scheraga, 1988). Molecular dynamics can also surmount local barriers, but the picosecond time scale of practical computations does not approach the millisecond time scale of actual protein folding. Because most published papers do not provide the relevant data, it is difficult to compare the computer time needed for various optimization procedures. The required computation time will be very sensitive to how well the code is optimized, whether parallel processing is carried out, and other factors.
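The Monte Carlo acceptance protocol mentioned above is, in its simplest Metropolis form, only a few lines of code. The double-well "energy", step width, and temperature below are invented for illustration; the point is that uphill moves are accepted with probability exp(-dE/kT), which is what lets the walk cross barriers that would trap a pure minimizer:

```python
import math
import random

def metropolis_step(x, energy, rng, step=0.3, kT=1.0):
    # Propose a random change in the conformational variable, then
    # accept downhill moves always and uphill moves with probability
    # exp(-dE/kT): the Metropolis criterion.
    x_new = x + rng.uniform(-step, step)
    dE = energy(x_new) - energy(x)
    if dE <= 0.0 or rng.random() < math.exp(-dE / kT):
        return x_new
    return x

# Usage: a walk started in one well of a symmetric double-well surface
# (barrier of 1 kT at x = 0) crosses to the other well and back.
E = lambda x: x**4 - 2.0 * x**2
rng = random.Random(0)
xs = []
x = -1.0
for _ in range(20000):
    x = metropolis_step(x, E, rng)
    xs.append(x)
```

Contrast this with pure minimization: the same walk at kT near zero would almost never accept an uphill move and would stay confined to its starting well.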
To obtain such benchmarks, it would be necessary to run each procedure on several different computer systems, a task not yet undertaken.

Solutions to the Multiple-Minima Problem for Macromolecules

Since no mathematical procedures are available to locate the global minimum for any macromolecule (except in energy surfaces
of very low dimensionality), as mentioned above we must first resort to approximate procedures to obtain structures that might lie close to that of the native macromolecule. Then, the approximations are abandoned, and a full-scale energy minimization, Monte Carlo, or molecular dynamics procedure is carried out. A variety of such procedures have been developed (Gibson and Scheraga, 1988). These include:

- a "build-up" method, in which large structures are built up from ensembles of low-energy conformations of smaller ones (with energy minimization being carried out at each stage);
- optimization of electrostatic interactions;
- optimization in a space of high dimensionality where fewer intervening barriers exist (with subsequent relaxation back to three dimensions);
- Monte Carlo sampling among local minima (accompanied by energy minimization);
- adaptive importance Monte Carlo sampling (to drive the system more efficiently to the global minimum);
- pattern recognition methods to assemble organized backbone structures such as alpha-helices and beta-sheets;
- use of distance constraints either from experiment (nuclear Overhauser effects (NOEs), nonradiative energy transfer, or NMR on spin-labeled molecules) or from statistical analysis of x-ray structures of proteins; and
- empirical methods to predict the locations of alpha-helices, beta-sheets, and beta-turns.

Numerous valid predictions of global minimum structures of peptides have been made using these methods (Gibson and Scheraga, 1988). However, they have thus far been successful only for structures that contain at most 20 residues, and current efforts (most of which require access to supercomputers) are being made to extend these methods to larger molecules, to proteins containing on the order of 100 amino acid residues.

Successes and Failures

Numerous structures have been predicted and subsequently confirmed experimentally (Scheraga, 1984).
The right- or left-handed twists of the fundamental structures (alpha-helices and
beta-sheets) from which proteins are built have been accounted for by energy minimization. The observed packing features of alpha-helices and beta-sheets have likewise been accounted for in energetic terms. Parameters calculated for conformational transitions (e.g., the helix-coil transition in water) have been verified by experiment. The computed structures of open-chain and cyclic molecules (e.g., the 20-residue membrane-bound portion of melittin and the 10-residue gramicidin S, respectively), and those of collagen-like poly-tripeptides, have also been verified by experiment (Scheraga, 1984). Finally, the computed structure of an enzyme-substrate complex (hen egg white lysozyme and a hexasaccharide substrate) (Pincus and Scheraga, 1979) has been verified by experiment (Smith-Gill et al., 1984). These and other examples should give us confidence in the validity of the potential functions and computational methodology (Gibson and Scheraga, 1988).

The failures, in the sense of not yet having solved the protein-folding problem, exist because no one has yet used optimization techniques to deduce the three-dimensional structure of even a small protein, such as the 58-residue bovine pancreatic trypsin inhibitor (BPTI). Current procedures applied to BPTI have not yet yielded a computed structure that comes closer to the x-ray structure than 2-3 A. Several procedures that work to overcome the multiple-minima problem on small molecules become computationally intensive as they are used on larger molecules. However, the increasing use of supercomputers will help overcome this problem.

Impediments to Progress

Although supercomputers will allow larger calculations and thus cover conformational space better, workers in this field will need additional time to be allotted on these machines to do the research necessary to achieve greater efficiency. Parallel processing offers a breakthrough, and this will require new software to take advantage of the hardware enhancements. With new hardware and software, it should be possible to surmount the major hurdle created by the multiple-minima problem. However, it is conceivable that bottlenecks may develop as we attempt to scale up procedures that work on 20-residue segments to proteins containing 100 to 200 residues. We will also need imaginative new approaches to overcome this problem.
Potential functions should be improved, especially those for polynucleotides and prosthetic groups, and for water-water interactions, but this is not now the most serious problem. Certainly, this problem should be addressed again when the multiple-minima problem is solved for bovine pancreatic trypsin inhibitor. Finally, new developments will be needed to bring molecular dynamics from the picosecond to the millisecond time scale.

Future Prospects

At every stage in the development of conformational energy calculations over the past 25 years, we always seemed to face insurmountable obstacles. However, the steady progress during this period indicates that many of these obstacles have been overcome. The remaining major hurdle is the multiple-minima problem (Gibson and Scheraga, 1988), but we have an array of possible solutions to it. The solutions have worked for small molecules, and current and impending developments in computer hardware and software should justify our confidence that, within 5 to 10 years, we may expect to understand how interatomic interactions dictate not only the final folded structure but the pathways taken by the newly formed polypeptide chain to reach the native structure.

MOLECULAR DYNAMICS

The principle behind a molecular dynamics simulation is simply the application of Newton's equations of motion to the atoms of one or more molecules. Newton's equations relate three independent quantities: time, conformation (atomic coordinates), and potential energy. As the calculation progresses and the positions and velocities of the atoms change, the system will traverse many different states; as the simulation is prolonged, the observed states together approach a perfect sample of the thermal equilibrium ensemble of all states the system will occupy. The thermal equilibrium distribution may also be sampled without considering motion, using appropriate purely statistical methods (Monte Carlo techniques).
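Stepping Newton's equations forward in time can be sketched with the velocity-Verlet scheme, a standard integrator in molecular dynamics codes. The one-particle harmonic "bond" below, and its force constant, are invented for illustration:

```python
def velocity_verlet(x, v, force, mass=1.0, dt=0.01, steps=2000):
    # Velocity-Verlet time stepping for m * a = F(x): the position is
    # advanced with the current force, and the velocity with the
    # average of the old and new forces. Returns the (x, v) trajectory.
    f = force(x)
    traj = []
    for _ in range(steps):
        x += v * dt + 0.5 * (f / mass) * dt * dt
        f_new = force(x)
        v += 0.5 * ((f + f_new) / mass) * dt
        f = f_new
        traj.append((x, v))
    return traj

# Usage: one particle on a harmonic spring, F = -k x, started at rest.
# The scheme is time-reversible and keeps the total (kinetic plus
# potential) energy nearly constant along the trajectory.
k = 1.0
traj = velocity_verlet(x=1.0, v=0.0, force=lambda x: -k * x)
```

The near-constancy of the total energy is the property that makes trajectories like this usable for estimating equilibrium ensemble averages.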
In principle, a Monte Carlo calculation might produce a representative sample using less computer time. Noguti and Go (1985) indicate how, with knowledge of the second-derivative matrix of the potential energy, the atomic coordinates can be effectively used to speed up the Monte Carlo process. However, it is
as yet uncertain whether this accelerated Monte Carlo procedure produces a more rapid exploration of the conformation space of a protein than a molecular dynamics simulation. Thus, the molecular dynamics simulation gives us a way to make theoretical estimates of mean atomic positions and deviations from the mean; of rates of motion and conformation change; and of ensemble averages, including thermodynamic functions such as energy, enthalpy, specific heat, and free energy. Since free energies can be expressed as equilibrium constants and vice versa, simulations are being used to obtain theoretical estimates of differences of affinity of proteins for small molecules. Recent results show remarkably good agreement with experimental values. Major pharmaceutical companies have already noted the usefulness of accurately predicting these differences.

Molecular dynamics simulations, although simple in concept, were not practical before the advent of high-speed computers. This method of theoretical chemistry is particularly useful for the study of condensed phases and was first used to study the structure and dynamics of liquids. Later, several investigators applied existing techniques to protein molecules (Karplus and McCammon, 1983; Berendsen [cf. Hermans, 1985; Beveridge and Jorgensen, 1987]). At present, several laboratories are active in the field. More are becoming involved as the methods are applied to increasingly quantitative studies that aim to reproduce experimental observations as closely as possible in the computer model. Many investigators express the belief that molecular dynamics calculations will soon produce useful predictions of the structure, dynamics, and thermodynamics of proteins, nucleic acids, and complexes of these macromolecules with one another and with other molecules.

The simulation requires two pieces of information at the outset: a starting conformation and a potential energy function or forcefield.
For a protein, current technology requires that the starting conformation be firmly based on experimental observation: because many conformations exist at local minimum energy, a conformation that is very different from the correct most stable conformation evolves too slowly to reach the correct conformation in the length of a typical calculation.

The forcefield is a very simple empirical approximation to the underlying physics, which properly should be expressed in terms of quantum mechanics but is totally unmanageable in that form. Parameters of the forcefields now in use have been proposed on
the basis of a variety of experimental data and to some extent on theoretical considerations. Overall, the several forcefields proposed by different groups for computation of the internal energy of proteins tended to have very similar sets of parameters. Recently developed forcefields for water-water and water-protein interactions permit the simulation of the dynamics of proteins in solution, which is a prerequisite for modeling events at the protein surface, including most interactions of proteins with other molecules. (The problems of developing adequate forcefields are discussed in the following section on "Solvation.")

Carrying out molecular dynamics simulations of proteins is very much an art of the feasible, the limiting factor always being the available computing power. One is always facing the consequence of an inescapable physical fact: the most rapidly fluctuating atomic motions, bond stretching and bond angle bending vibrations, have periodicities of the order of once in every few femtoseconds (10^-15 sec). Current simulation methodology requires that periodic motions be sampled several times per period, and each sampling requires an evaluation of the system's potential energy, requiring computer time in milliseconds on the fastest machines, Cray and Cyber 205. Clearly, simulations cannot now span a time that is on the biological time scale of microseconds to seconds. Unavoidably, molecular dynamics simulations use simple forcefields to span a longer time. Given more computer time, those working in the field will improve simulations in various ways: use of more detailed forcefields; longer simulations; simulation of larger systems posing new physical and biological questions; and application of new, more time-consuming dynamics methods to ask different questions about the system.
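The arithmetic behind that gap is worth making explicit. Using round numbers in the spirit of the text (a 2-femtosecond sampling interval and, say, 2 milliseconds of computer time per energy evaluation; both figures are illustrative, not benchmarks), even one microsecond of simulated time is far out of reach:

```python
# Back-of-envelope cost of reaching biological time scales.
dt_fs = 2.0            # sampling interval, femtoseconds (illustrative)
per_step_s = 2e-3      # computer time per energy evaluation, seconds (illustrative)
target_s = 1e-6        # one microsecond of simulated time

n_steps = target_s / (dt_fs * 1e-15)       # force evaluations required
wall_days = n_steps * per_step_s / 86400.0 # total computer time, in days
print(f"{n_steps:.1e} steps, about {wall_days:.0f} days of computer time")
```

Half a billion force evaluations, roughly a dozen days of continuous supercomputer time, for a single microsecond; a millisecond would take a thousand times longer.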
To those working in the field, the future is bright; ideas and interesting problems abound, and new computer technology continues to widen the limits of feasibility.

RESULTS

Collected papers for symposia held in 1984 and 1985 (Hermans, 1985; Beveridge and Jorgensen, 1986) give an overview of methods and results of applications of molecular dynamics to proteins; some results described in these symposium papers are not explicitly referenced in this section. Subsequent work achieved many of the
for proteins because of the ubiquitous charges and requisite counterions present in solutions of nucleic acids. For these macromolecules, it is insufficient to deal with solvent alone. Current approaches have considered various polyionic charges, solvent, and, in some cases, counterions. In many studies, the polyionic phosphate charges have been artificially reduced to about 25 percent of the physical value to account crudely for counterion association (Hingerty and Broyde, 1982; Tidor et al., 1983). This value arises from the counterion condensation formalism, which describes a required fractional counterion screening for counterions that are far from the polyelectrolyte (Manning, 1978). This approach is a valuable qualitative tool, but we do not expect quantitative results from such a strong approximation.

Only recently have initial all-atom studies of polynucleotide-ion-solvent systems been carried out (Corongiu and Clementi, 1981; Seibel et al., 1985; van Gunsteren et al., 1986), but it is clear that the exceptionally time-consuming nature of these simulations with ions present does not yet permit such calculations to be very informative in practical ways. To simulate a duplex oligonucleotide without added salt for 2 nanoseconds (a relevant motional time scale for the polymer) would require roughly 1,000 hours of supercomputer time.

In the case of nucleic acids, an intermediate ground exists that is not relevant for many proteins. That is, the set of explicitly simulated atoms can be extended to include the macromolecule and ambient ions, while retaining the implicit treatment of only the solvent. In terms of the number of atoms to be followed, the simplification is substantial. In any of the cases discussed above in which the solvent is treated implicitly, one still must implement realistic potentials of mean force or, at least, invoke a firmly based dielectric continuum treatment.
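The "about 25 percent" figure quoted above can be traced to Manning's condensation result: for monovalent counterions, condensation leaves a residual charge fraction of 1/xi per phosphate, where xi = l_B / b is the ratio of the Bjerrum length to the axial charge spacing. The sketch below uses the usual textbook values for B-DNA in water at 25 degrees C; treat the specific numbers as illustrative:

```python
# Where the "about 25 percent" reduced phosphate charge comes from.
bjerrum_A = 7.1   # Bjerrum length of water at 25 C, Angstrom (textbook value)
spacing_A = 1.7   # axial charge spacing of B-DNA, Angstrom (3.4 A rise
                  # per base pair, two phosphates per base pair)

xi = bjerrum_A / spacing_A  # Manning parameter; condensation occurs for xi > 1
residual = 1.0 / xi         # uncondensed (residual) charge fraction per phosphate
print(f"xi = {xi:.2f}, residual charge fraction = {residual:.0%}")
```

The residual fraction comes out near 0.24, which is the physical basis for scaling the phosphate charges to roughly a quarter of their formal value.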
Since the potential payoff of knowing viable implicit solvation routes is very large, it is important to encourage research into implicit modeling in biopolymer-related systems.

A potentially useful approach to the ionic atmosphere that avoids even the intermediate-level treatment of ions is the implementation of solvent- and ion-averaged potentials within the biopolymer. The use of reduced phosphate charges is an ad hoc form of such a potential function. An oversimplified but well-founded alternative is the use of a Debye-Huckel-like screening between polymer sites (Hesselink et al., 1973; Manning, 1978;
Soumpasis, 1984), employing the bulk solvent dielectric constant. The latter approach cannot, however, account for the unusually high degree of ionic association that is present in the immediate vicinity of a polyion.

Another unusually promising approach involves applying more analytical theories for the influence of the solution environment, while retaining a detailed description of the biopolymer. In essence, one evaluates the effective potentials that govern the intrapolymer interactions for a fixed polymer configuration by (numerically) solving the relevant equations of an essentially analytical theory. An example of significant recent progress along these lines is the work of Pack and coworkers (Klein and Pack, 1983). They have used an algorithm for solution of the Poisson-Boltzmann equation for the ionic distributions around a detailed DNA model, and from such distributions the relevant free energies of different conformations are, in principle, obtainable. At least for simplified models of DNA, the Poisson-Boltzmann mean field theory has proved accurate compared to computer simulation for the same models (Murthy et al., 1985). Closely related approaches have been considered for enzyme-substrate binding involving charged species (Klapper et al., 1986). However, a Poisson-Boltzmann treatment is tied to a dielectric continuum view of the solvent, although dielectric heterogeneity can be readily accounted for within this context.

A brief comment on biopolymer dynamics is appropriate here. Dynamics are clearly connected to the general question of protein folding, and are likely to be significant for function in many cases. Although molecular dynamics may not be directly related to the issue of predicting function, it is clearly connected to the more general question of protein folding. As for the equilibrium time-independent problem, one can, in principle, consider the full atomic description.
However, one can also focus only on an explicit subset of solute atoms, such as the biopolymer or the biopolymer plus ions. The formal theory to be applied when only some of the atoms are considered explicitly is well established (for recent discussions in a variety of contexts, see Adelman, 1982; Ermak and McCammon, 1978; Tully, 1981). The motion proceeds according to the forces prescribed by the effective potential, but with additional random forces and friction due to the implicit solvent collisions. In general, these solvent forces are not simple, but depend on the history of
the solute dynamics (memory) and the current solute conformation (hydrodynamic interaction). In the general case, the relevant equation is the so-called generalized Langevin equation, with the friction described by a memory function that embodies the statistical properties of the solvent collisional correlations in space and time. The approximation that neglects any frictional correlation involves only constant drag coefficients, the so-called ordinary Langevin equation. Further approximation leads to equations of a diffusion type.

As with most areas of theory in physical science, the time-dependent theory lags behind the equilibrium theory in terms of development and application. A few examples of attempts to apply these methods to realistic and simplified models exist (Levy et al., 1979; McCammon et al., 1980), but only once does it appear that a polypeptide has been examined (Brooks and Karplus, 1986). The problems encountered in any application involve, first, computational limitations, since the dynamics that are of biochemical interest are relatively slow. Perhaps more significant is our extremely limited knowledge of the memory function and hydrodynamic interactions for a complex solute. In principle, we can test our assumptions against all-atom simulations or the relevant functions extracted from these simulations, but this route is itself limited by the current sparsity of the requisite simulations. Nevertheless, future developments in this area seem very likely, although they are further away than are those in equilibrium theory.

Conclusions

The ability to adequately test predictions of theoretical calculations is an element of overriding importance in the future of the modeling of solvation and electrostatics.
This testing can occur at two levels: first and foremost, by comparing theory and experiment, and second, by comparing results obtained through convenient but approximate theory with those derived from accurate theoretical treatment. The first is essential for accuracy and the second for the future development of viable theoretical methods for studying increasingly complex systems. Therefore, we should continue to encourage both experiment and theory for both macromolecules and smaller model compounds. We cannot expect an immediate and completely satisfactory way to account for the influence of the solution environment on
the behavior of macromolecules. The above discussion indicates that some unsolved and many partially solved problems remain. Nevertheless, reasonable approaches already exist that provide adequate grounds for qualitative study. The degree of quantitative accuracy is not yet well established and awaits both further model calculations and further thermodynamic and spectroscopic experimental data, so that we may make unequivocal comparisons between theory and experiment. Based on the steady progress described above over only the past few years, we have every reason to expect rapid incremental progress to continue. A clear view of the capabilities of current models and methods for describing flexible small molecules should be available within only a few years. At the same time, the large amount of theoretical activity, both in the development of well-founded approximate approaches and in the simulation of atomic-level solvated molecules, virtually assures our ability to make the appropriate comparisons between the two in the near future. A very significant element in recent developments has come from algorithmic breakthroughs. Biased sampling techniques and thermodynamic perturbation/integration methods are two new methods that contribute essential capabilities to the theoretical effort. Hence, we should encourage theoretical developments as much as computational applications.

The rapidly increasing access to the necessary computer facilities has contributed significantly to progress, and it is essential that this access continue to grow. For all-atom models of the environment, the current limitations on macromolecular simulation are primarily computational. Although limitations of the model theory are also likely, we now have insufficient data to make that judgment.
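The thermodynamic perturbation method mentioned above can be illustrated with a minimal sketch. The example below (all parameters and the toy system are illustrative assumptions, not taken from the text) estimates the free energy change for stiffening a one-dimensional harmonic well via the Zwanzig exponential-averaging formula and compares it with the exact analytic answer:

```python
import math
import random

def fep_estimate(k0, k1, kT=1.0, n_samples=200_000, seed=0):
    """Free energy perturbation (Zwanzig) estimate of dA for changing
    a harmonic potential U = 0.5*k*x**2 from spring constant k0 to k1.

    Samples x from the Boltzmann distribution of the reference state
    (a Gaussian with variance kT/k0) and exponentially averages the
    energy difference dU = U1 - U0:  dA = -kT ln <exp(-dU/kT)>_0."""
    rng = random.Random(seed)
    sigma = math.sqrt(kT / k0)          # width of the reference ensemble
    acc = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, sigma)
        dU = 0.5 * (k1 - k0) * x * x    # perturbation energy
        acc += math.exp(-dU / kT)
    return -kT * math.log(acc / n_samples)

k0, k1 = 1.0, 2.0
dA_fep = fep_estimate(k0, k1)
dA_exact = 0.5 * math.log(k1 / k0)      # analytic result for a harmonic well
print(f"FEP estimate: {dA_fep:.3f}, exact: {dA_exact:.3f}")
```

For a real biomolecular system the reference ensemble would come from a Monte Carlo or molecular dynamics trajectory rather than direct sampling, but the estimator is the same.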
An order-of-magnitude increase in available computing power would be enough to make a dramatic difference in this area; two orders of magnitude would permit simulations into the interesting many-nanosecond regime. Such changes are likely within the next five years through the combined effects of new hardware, improved performance, and lower cost. To explore adequately events such as protein folding that occur on much longer time scales (or involve vast conformational exploration), these computational improvements would still be inadequate by many orders of magnitude. Thus, a theoretical breakthrough appears necessary if we are to make real progress within the next several years. Such a breakthrough would be, for example, the demonstration of an implicit treatment for the
solution environment that yielded accurate biopolymer dynamics on a nanosecond time scale when compared to a full atomic-level simulation. Such a treatment could then be applied for longer times. Considering the very limited knowledge available about the performance of alternative implicit modeling techniques, we cannot now say whether such an approach is workable, even in principle. However, the process of determining the usefulness of these techniques requires a one to two order-of-magnitude gain in computer power.

In summary, the rapid progress in our ability to describe the environmental aspects of biopolymer systems gives solid ground for optimism that this element of biomolecular modeling will not impede development of useful predictive methods. However, for the most challenging aspects, we are at least several years away from demonstrating the ability to mimic accurately solution environmental effects.

HEURISTIC METHODS

There are two major approaches to the prediction of three-dimensional structures of proteins: modeling by extension and hierarchical searching. Both methods can combine heuristic ideas and energy calculations. They differ from the energy calculations described earlier, and from each other, in the way they arrive at a starting structure. Modeling by extension uses the known structure of a protein or proteins with strong sequence homology to the unknown. The hierarchical methods use packing considerations derived from the crystallography of many proteins. The following section describes these approaches in more detail.

Protein Homology

Proteins occur in families. Evidence for this comes first from protein sequence homology and then from the architectural similarity of homologous proteins determined by x-ray crystallography and NMR spectroscopy. A family of proteins can be modeled by homology if several conditions are fulfilled. First and most important, the structure of at least one member of the family must
be known. Second, the three-dimensional structure of the protein to be modeled must be sufficiently homologous to that of the known protein. Many proteins have been modeled over the past five years, and the general consensus is that if two proteins share at least 30 percent sequence similarity, then computer graphics and energy modeling will be useful. If the global homology is less than 30 percent, then it is difficult but not impossible to say whether the two proteins are in the same family. If there are important conserved residues, such as disulfide bridges, then even 30 percent homology might be sufficient.

The difficulty of modeling a given protein depends on the degree of homology with the known structure. When the homology is between 80 and 100 percent, normally only the surface amino acids are changing. In these cases, there is usually no change in peptide length. With homology of between 50 and 80 percent, again mainly the surface amino acids are changing, but there may be additions and deletions in the peptide length. The amino acids on the surface of the protein can be easily substituted. Energy minimization and/or molecular dynamics are sufficient to reduce the errors caused by any changes in surface sidechain conformation. Charged surface amino acids under molecular dynamics must either be neutralized by artificially altering the parameters or screened by adding a solvent box around the protein. When interior amino acids are changed from one protein to another, the changes either make a long amino acid shorter, thus creating a hole in the interior of the protein, or replace a long-short pair of amino acids with a short-long pair. When amino acids are added or deleted in a helix, they are generally in multiples of three amino acids; this preserves the hydrophobicity/hydrophilicity relations in the helix. In contrast, beta strands tend to have insertions or deletions of two amino acids, thus preserving the inside/outside relations (Feldmann et al., 1985).
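The 30 percent rule of thumb presumes a way to score an alignment. A minimal sketch in Python (the sequences and the gap-handling convention are illustrative assumptions, not taken from the text) computes percent identity over a gapped alignment:

```python
def percent_identity(aln_a: str, aln_b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal
    length; '-' marks a gap.  Columns in which either sequence has a
    gap are excluded from the denominator (one common convention)."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != '-' and b != '-']
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)

# Hypothetical aligned fragments (not real protein sequences).
seq1 = "MKT-AYIAKQR"
seq2 = "MKSWAYLAKQR"
pid = percent_identity(seq1, seq2)
print(f"{pid:.1f}% identity ->",
      "homology modeling plausible" if pid >= 30 else "family membership uncertain")
```

Real comparisons of course depend on how the alignment itself was produced, which is the subject of the automatic and manual methods discussed below.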
The inside amino acids of a protein normally are hydrophobic, while the outside amino acids are normally hydrophilic. Graphic modeling of such changes is accomplished by moving the additions and deletions in helices and beta strands toward the ends of the secondary structure feature. Graphic modeling of loops is easy to do but fraught with large inaccuracies. Insertion or deletion of amino acids on a loop can be accomplished by breaking the peptide chain, making the appropriate change, and then bending
the loop ends to accommodate the change. Energy modeling under local conditions of relatively high simulated temperature can be used to explore the local conformational space.

There are two ways to align sequences: automatically and manually. Automatic alignment methods, such as that of Wilbur and Lipman, tend to align the sequences for the highest local match. Manual methods (see Feldmann et al., 1985) permit the alignment of secondary structure features, which minimizes the number of disturbances that must be made to convert the crystallographic structure into the model structure. One way to overcome the uncertainties in the structure of a particular loop is to build a library of representative loops. Alwyn Jones at Uppsala has done this and recently integrated it into FRODO, his modeling program.

After all the changes have been made, either by graphic methods or by using a loop library, extensive molecular dynamics simulation is generally used to improve the quality of the model. Whether a broad range of scientists can use molecular dynamics calculations depends on the availability of appropriate modeling software and on a variety of displays and workstations. To complete the modeling by molecular dynamics calculations, sufficient computer power must be available, either on a personal supercomputer (PSC) or by network access to a national supercomputer.

Modeling by Extension

Twenty years ago, Phillips (1967) made a model of the protein lactalbumin without obtaining a single crystal. This was possible because the amino acid sequence of lactalbumin had been found to be 35 percent identical with that of the enzyme lysozyme, a protein whose crystal structure had recently been determined. The residues for lactalbumin were simply placed in the equivalent positions known to be occupied by the residues of lysozyme (Browne et al., 1969). Subsequently, Warme et al.
(1974) underscored the validity of the approach with computational studies of the structures of the two proteins. The structure of alpha-lactalbumin computed by energy minimization by Warme et al. (1974) has recently been verified by x-ray crystallography (D.C. Phillips, personal communication, 1987). Since then, "modeling by homologous extension"
has become a common, if sometimes casually applied, approach that has benefited from modern computer graphics (Greer, 1985). A recent example of the utility of modeling by extension is that an important but elusive factor, angiogenin, involved in the biogenesis of blood vessels, was isolated after a search of more than 15 years. The sequence of this factor was determined and found to be 45 percent identical to pancreatic ribonuclease. As a result, Palmer et al. (1986) promptly generated a three-dimensional structure.

Naturally, the closer the resemblance of the unknown protein to the one whose structure has been determined, the more accurate the modeled structure. Recently, however, Moult and James (1986) have shown that it is possible to construct good models even when the sequence resemblance is barely recognizable. In many instances, then, all that is needed is the family relationship of the new protein.

Exon Shuffling

Many recently evolved proteins exhibit evidence of "exon shuffling." In this phenomenon, mosaic proteins result from the genomic rearrangement of segments that encode small portions of different proteins. For such proteins, it is thought that the peptide segments, which often range from 30 to 90 amino acids in length, all fold independently (Doolittle, 1985); as such, they constitute domains in the truest sense. As of mid-1987, about six such domains had been found in a variety of proteins in different three-dimensional settings. Most of them are tightly folded and contain disulfide bonds that hold the structure in place. They include such well-known motifs as the "EGF domain," the "Kringle," and the "fibronectin fingers." They are readily identified by routine computer searches of sequence and, when such a structure has been identified, can be immediately modeled in place.
Exon shuffling, which is due at least in part to the additional recombination that ensues from the presence of introns between exons, is not restricted to the small set of stable structures listed above, and it is anticipated that hybrid and mosaic proteins of many sorts will be identified. In all these situations, prior knowledge of any structural motif will aid in interpreting the overall structure. Doubtless, other motifs will emerge as more sequences are determined, compared, and analyzed.
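The routine computer searches of sequence mentioned above often reduce to scanning for a consensus pattern. The sketch below (the pattern and sequence are purely illustrative stand-ins, not the real consensus for any of the motifs named in the text) finds candidate matches of a cysteine-anchored pattern with a regular expression:

```python
import re

# Illustrative pattern: two cysteines separated by 2-4 residues, then
# any 5-8 residues, then a third cysteine.  (A made-up toy consensus,
# NOT the actual EGF or Kringle signature.)
MOTIF = re.compile(r"C.{2,4}C.{5,8}C")

def find_motifs(sequence: str):
    """Return (start, matched_fragment) for each non-overlapping hit."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(sequence)]

# Hypothetical sequence with one planted match.
seq = "MALWTRCAAGCDEFGHIKCPQRSTV"
for start, frag in find_motifs(seq):
    print(f"candidate domain at residue {start}: {frag}")
```

Once a hit is found, the corresponding known domain structure can be placed at that position of the model, as the text describes.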

HIERARCHICAL MODELS OF PROTEIN FOLDING

The major goal of hierarchical modeling is to build three-dimensional structures that incorporate, directly or indirectly, the architectural principles by which nature constructs globular proteins. The direct approach uses empirical rules or models that capture recognized aspects of protein folding. The indirect approach uses homology to provide the basic structure and explores the local environment with energy calculations or other modeling efforts. This latter work was described in the previous section. Here, we assess the success of rule-based procedures.

Investigators have used these procedures at various levels of formality. These efforts generally follow the same plan: predicting secondary structure and then packing the secondary features. They also use the Kauzmann hydrophobic model as the basic packing principle. Classifying protein domains by structure has been particularly important because it provides major rules (for a discussion and extensive references see Richardson, 1981; Cohen et al., 1983). Such rules include statements about the geometry of helices packing against helices and beta sheets, and of beta sheet-beta sheet packing. The efforts have also led to the development of a list of rules concerning the ordering of strands within a beta sheet. We should note clearly that rules such as these only summarize what is observed in known structures; they are not derived from first principles of physics or chemistry. Nevertheless, many of the idealized structures built to be consistent with such rules do look recognizably like protein domains, and some are rather close (average error 4 Å) to the crystallographic result.

PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE

We differentiate the techniques of pattern recognition from those of artificial intelligence, especially its subdiscipline of expert systems.
Pattern recognition is usually defined as including numerical techniques for clustering observed data into binary or higher-order categories (but see below). Artificial intelligence is usually defined to include the use of rule-based systems to express and use empirical knowledge of a subject.

The usual techniques of pattern recognition, such as linear discriminant analysis, cluster analysis, and other parametric and
nonparametric methods, have not proved useful in the analysis of protein structures and functions. These methods pose difficulties in determining the statistical significance of a derived classification. Given the relative lack of knowledge about structure-function relationships in complex molecules such as proteins, it is very difficult even to pick a reasonable set of structural descriptors upon which to build a clustering scheme.

Other definitions of pattern recognition, however, are less controversial in technique, if not in interpretation of results. These techniques involve, for example, the representation of three-dimensional protein structures in the form of C-alpha distance maps (Kuntz, 1975; Rao and Rossmann, 1973) to infer the presence of secondary and supersecondary structures and the location of intron/exon boundaries (Go, 1981).

The technology of expert systems may be applied when empirical knowledge, which can be expressed in rule-based systems, can be used to solve problems. For example, production rules of the form IF (x) THEN (y) may be an integral part of the hierarchical pattern comparison described previously (Figure 4-1). To achieve high performance, such rule-based approaches are often supplemented with methods and data from other sources. Two systems under development, KARMA (Klein et al., 1986) and PROTEAN (Hayes-Roth et al., 1986), illustrate this point. The KARMA system employs rule-based proposal and evaluation of small molecules and their predicted binding activities in receptor sites of known proteins. A variety of mathematical and graphic procedures are used to evaluate candidate structures and their affinities for binding. PROTEAN uses artificial intelligence techniques, with interatomic (nonbonded) distance information from NMR and a variety of mathematical and graphic techniques, to explore structural possibilities for proteins of known primary structure.

Cohen et al.
(1983, 1986b) have carried out one of the most extensive projects in their studies of alpha/beta domains and four-helix bundles. In the former case, they identified secondary features by using pattern matching and then built tertiary structures from exhaustive combinatorial packing of the secondary elements. In the most favorable case, flavodoxin, they could generate a unique prediction for the alpha carbons involved in helix or sheet of the protein. Similar predictions for molecules such as interleukin-2
have been made based on a core structure of four helices (Cohen et al., 1986b). The important strengths of such projects are (1) they achieve low-resolution structures of the central residues of proteins that contain many protein features, including a prediction of the "active" portion of the molecules; (2) the computational labor is modest; and (3) the rule system can be tested directly against known structures and their homologs. In some sense, they are the best solution currently available to the folding problem. On the negative side, the low resolution is an obvious limitation. Details of loops are often neglected. Only certain classes of proteins can be dealt with successfully.

The next several years should bring improvements in all aspects. More realistic models will reduce errors. Loops and side chains can be treated either from rule-based or energy-based approaches. Expansion to more protein structural classes is proceeding rapidly. It is difficult to see the ultimate limitations of these heuristic methods. We are hopeful that they will yield good first-order approximations that can be refined by energy minimization and molecular dynamics calculations.
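The C-alpha distance maps used in the pattern-recognition work described above are straightforward to construct. A minimal sketch in Python (the coordinates below are made-up illustrative points tracing a hairpin-like turn, not a real structure) builds the map and flags residue pairs closer than a cutoff as contacts:

```python
import math

def distance_map(ca_coords):
    """Pairwise C-alpha distance matrix from a list of (x, y, z) tuples."""
    n = len(ca_coords)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(ca_coords[i], ca_coords[j])
    return d

def contacts(dmap, cutoff=7.0, min_separation=3):
    """Residue pairs within `cutoff` (angstroms, illustrative value)
    that are at least `min_separation` apart in sequence -- the
    off-diagonal features from which secondary and supersecondary
    structure are inferred."""
    n = len(dmap)
    return [(i, j) for i in range(n) for j in range(i + min_separation, n)
            if dmap[i][j] < cutoff]

# Hypothetical coordinates: five residues bending back on themselves.
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.8, 3.2, 0.0),
      (3.8, 6.4, 0.0), (0.0, 6.4, 0.0)]
dmap = distance_map(ca)
print("close pairs:", contacts(dmap))  # → close pairs: [(0, 4)]
```

The sequence-distant but spatially close pair (0, 4) is exactly the kind of off-diagonal feature that distinguishes a turn or sheet from an extended chain in a distance map.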