Suggested Citation:"6 Tertiary Structure of Proteins and Nucleic Acids: Theory." National Research Council. 1987. Computer Assisted Modeling: Contributions of Computational Approaches to Elucidating Macromolecular Structure and Function. Washington, DC: The National Academies Press. doi: 10.17226/1136.



6. Tertiary Structure of Proteins and Nucleic Acids: Theory

ENERGY OPTIMIZATION

According to the thermodynamic hypothesis, based on Anfinsen's (Anfinsen et al., 1961) experiments on the oxidative refolding of bovine pancreatic ribonuclease A from the reduced form, the amino acid sequence determines the three-dimensional structure of a protein in a given medium as the thermodynamically most stable one. It must be emphasized that this hypothesis applies to the length of the polypeptide chain at the stage when folding takes place, and not at some subsequent processing stage. For example, this hypothesis would be applicable to the single-chain pro-insulin and (possibly) not to the processed product, the two-chain disulfide-linked insulin. Thus, it is a challenge to chemists to understand how the interatomic interactions within the polypeptide chain and the interactions between the atoms of the chain and those of the surrounding solvent lead to the thermodynamically stable structure, that is, the one for which the statistical weight of the system is a maximum.

The hypothesis that the statistical weight is a maximum immediately implies that some kind of optimization strategy is necessary to find the most stable structure. This requires procedures to generate arbitrary conformations of a polypeptide chain, compute the statistical weight for each conformation, and then alter the conformation so that it ultimately corresponds to the global maximum of the statistical weight. Although the problem is formidable, current indications are that it can be solved using a sound scientific approach without resorting to ad hoc procedures to deduce "folding rules" that do not explain the molecular basis of such "rules."

To compute the global statistical weight, one optimizes the conformational energy of the polypeptide, incorporates the effect of the solvent, and takes account of the entropy of the system. Procedures are available for carrying out such computations, and currently available supercomputers permit the computations to be applied to very large systems. Although these procedures will undoubtedly improve, they are now adequate for carrying out computations and obtaining results that can be checked experimentally.

The major difficulty still to be overcome, although partial success has been achieved (see review by Gibson and Scheraga, 1988; also, Robson, 1986; Crippen, 1984), arises from the presence of many local minima in the multidimensional energy surface. Although algorithms are available for minimizing an energy function of many variables, there are no efficient ones for passing from one local minimum, over a potential barrier, to the next local minimum and ultimately to the global minimum of the potential energy in a very high dimensional space. Thus, minimization leads to the nearest local minimum, where the procedure is trapped. This trapping in a local, rather than the global, minimum is referred to as the "multiple-minima" problem. Efforts are being made to overcome this problem using a variety of procedures, including approximations that place the system in the potential well in which the global minimum lies (Gibson and Scheraga, 1988). Then, any approximations (introduced in the initial stages) are abandoned, and the full energy function is minimized.
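The trapping described above can be illustrated with a small sketch. It is only a toy under stated assumptions: the one-dimensional double-well function `energy`, the starting points, and the plain steepest-descent routine `minimize` are all hypothetical choices made for illustration and are not taken from the work cited in this chapter.

```python
def energy(x):
    # Hypothetical 1-D "energy surface" with two wells:
    # a global minimum near x = -1.04 and a local minimum near x = +0.96.
    return x**4 - 2.0 * x**2 + 0.3 * x


def gradient(x):
    # Analytical first derivative of the toy energy function.
    return 4.0 * x**3 - 4.0 * x + 0.3


def minimize(x, step=0.01, tol=1e-8, max_iter=100_000):
    # Plain steepest descent: always move downhill until the slope vanishes.
    for _ in range(max_iter):
        g = gradient(x)
        if abs(g) < tol:
            break
        x -= step * g
    return x


left = minimize(-0.5)   # starts in the basin of the global minimum
right = minimize(0.5)   # starts in the basin of the higher local minimum

print("from -0.5:", left, energy(left))
print("from +0.5:", right, energy(right))
```

Both runs converge, but the second one stops in the higher well: a downhill-only procedure has no way to cross the barrier between the two basins, which is the one-dimensional analogue of the multiple-minima problem.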
The use of molecular dynamics for minimization is an alternative strategy; it is considered in the next section.

The energy minimization approach and its associated computational programs, described here for polypeptides and proteins, are equally applicable to any other type of macromolecule, such as polynucleotides and polysaccharides, as well as to interactions between the various types of macromolecule. General references to this methodology include the works of Anfinsen and Scheraga (1975), Nemethy and Scheraga (1977), Levitt (1982), Karplus and McCammon (1983), Scheraga (1984), and Richards (1986).

Generation of an Arbitrary Conformation

Analytical geometry and associated matrix algebra provide the tools to generate a polypeptide chain. To do so, internal coordinates (dihedral angles for rotating about bonds) or Cartesian coordinates may be used as independent variables. When internal coordinates are used, the bond lengths and bond angles are held fixed, but chosen very carefully from x-ray structures of model systems so as to properly reflect the geometric results of side chain-backbone interactions within each amino acid residue. The validity of this approach has been demonstrated for both polypeptide and polysaccharide systems in which strained conformations rarely arise (Scheraga, 1984). When Cartesian coordinates are used, and hence bond lengths and bond angles are allowed to vary, one faces the problem that force constants for these motions are not as well known as the geometric features of the molecule. Further, bond-angle bending modes are anharmonic, and no currently available forcefield takes anharmonicity into account. Thus, we will have to overcome inadequacies in the force constants and problems of anharmonicity before we can rely on computed bond angles. Imposition of fixed bond lengths and bond angles is at least reconciled with observed crystal structures of small molecules. The subject of fixed versus flexible geometries has been discussed by Swenson et al. (1978).

Potential Functions

Many research groups have developed potential energy functions with which to carry out such computations on polypeptides, polysaccharides, polynucleotides, and synthetic polymers. The various potential functions have many similarities, but differ enough in detail to make them difficult to compare.
This difficulty is compounded by the small number of cases for which parameters that characterize the strengths of the various interactions have been established in a self-consistent way from experimental data such as crystal structures, lattice energies, and barriers to internal rotation. At present, the potential functions for polypeptides, polysaccharides, and synthetic polymers have been better parameterized than have those for polynucleotides, primarily because there are fewer reliable data for model nucleotide compounds. Similarly, more experimental data will be required for proper parameterization of the potential functions for the prosthetic groups attached to biological macromolecules. Until recently, the potential functions involving water molecules still would not account adequately for the observed radial distribution function of water, but this situation is improving (see, e.g., the work of Jorgensen et al., 1983).

It must be emphasized that the molecule responds to the total (in principle, quantum mechanical) potential function, and its partition into various empirical components (such as nonbonded, electrostatic, hydrogen-bonding, and other interactions) is at best arbitrary, although there is some physical basis for assigning such names to the various components. Consequently, one must avoid combining components from forcefields from different research groups. Each forcefield must be parameterized by itself in a self-consistent way.

The total energy of a given conformation is expressed as a sum of the energies of all nonbonded pairs of atoms. When considering biological function, such as the formation of an enzyme-substrate complex, the pair interactions both within and between the two partners of the complex must be included in the total energy. Thus, the influence of each member of the complex on the other (the so-called induced fit phenomenon) is taken into account. Consequently, the conformations of the partners in the complex can differ from their conformations as individual species. This means that the computed conformation of an isolated hormone may not resemble the biologically active one when it is bound to a receptor (Scheraga, 1984).
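As a rough illustration of a pairwise-additive energy, the sketch below sums a 12-6 Lennard-Jones term and a Coulomb term over all atom pairs. The functional form is a common textbook choice, not the specific forcefield of any group cited here, and the parameter values (`epsilon`, `sigma`, the charges, the coordinates) are hypothetical. It also assumes that every pair counts as nonbonded, whereas a real forcefield would treat directly bonded atoms separately.

```python
import math
from itertools import combinations

# Conversion factor giving kcal/mol when charges are in units of e and
# distances are in Angstroms.
COULOMB_CONSTANT = 332.06


def pair_energy(r, epsilon, sigma, qi, qj, dielectric=1.0):
    # 12-6 Lennard-Jones term plus Coulomb term for one atom pair.
    lennard_jones = 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    coulomb = COULOMB_CONSTANT * qi * qj / (dielectric * r)
    return lennard_jones + coulomb


def total_energy(coords, charges, epsilon=0.1, sigma=3.4):
    # Sum over all distinct pairs (illustrative: one atom type for all atoms).
    total = 0.0
    for i, j in combinations(range(len(coords)), 2):
        r = math.dist(coords[i], coords[j])
        total += pair_energy(r, epsilon, sigma, charges[i], charges[j])
    return total


# Hypothetical two-partner "complex": the energy of the complex is computed
# over the combined atom list, so pairs within and between partners are included.
partner_a = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
partner_b = [(0.0, 4.0, 0.0)]
charges = [0.3, -0.3, 0.1]
print(total_energy(partner_a + partner_b, charges))
```

Because the sum runs over the combined atom list, moving either partner changes the cross terms, which is how the mutual influence of the partners enters the computed energy.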
Although potential functions could still be improved, those for which parameters have been determined in a self-consistent way have led to many computed structures that have subsequently been checked by experiment. For example, the computed structure of the collagen-like poly(Gly-Pro-Pro) (Miller and Scheraga, 1976) agrees with the subsequently determined crystal structure (Okuyama et al., 1976) within a root-mean-square (rms) deviation of 0.3 Å. Scheraga (1984) has cited many similar examples of experimental verification of computed structures. Therefore, we can be confident that the problem of adequate potential functions is not serious, and more effort should be devoted to the most difficult problem of all, the multiple-minima problem. After the multiple-minima problem is solved, it will then be worthwhile to reexamine the possibility or necessity of improving the potential functions further.

Solvation

There are several methods to include the effect of hydration in the computations. Hydration tends to force the polar groups to the surface of the molecule, putting them in contact with water, and forces the nonpolar groups to the interior, removing them from contact with water.

One method includes the water molecules explicitly, and calculates the interaction energy between the molecule and the water. The success of this approach depends on the adequacy of the potential function describing the water-water interaction. Another approach ignores the structural features of the water molecule, and assigns a hydration shell (and an accompanying free energy of hydration) to each atom or group of atoms. As the conformation changes and hydration shells overlap, a free-energy penalty is assessed. The hydration-shell model is parameterized on experimental data on free energies of hydration (Kang et al., 1987), but current efforts are being made to obtain such free energies from Monte Carlo and molecular dynamics studies of aqueous solutions of small molecules (see, e.g., Jorgensen et al., 1983). The computation of free energies by these simulation techniques faces many theoretical obstacles, although this very active field of research is progressing quickly.

Entropy

A variety of methods exist to incorporate entropic effects. One entropic effect arises from the hydration; this effect is treated as described earlier. The other entropic effect arises from the conformational fluctuations of the molecule. Several procedures may be used to compute this contribution, the most direct being the evaluation of the second derivative of the potential function.
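A minimal sketch of the second-derivative approach follows, under a classical harmonic approximation in one dimension: the statistical weight of a well is taken as exp(-E_min/kT) multiplied by sqrt(2*pi*kT/curvature). The two well functions, the value of `kT`, and the finite-difference step are hypothetical and chosen only to make the effect visible.

```python
import math


def second_derivative(f, x, h=1e-4):
    # Central finite-difference estimate of the curvature of f at x.
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


def harmonic_weight(f, x_min, kT):
    # Classical harmonic approximation to the statistical weight of one well:
    # Boltzmann factor of the minimum times the width of the well.
    curvature = second_derivative(f, x_min)
    return math.exp(-f(x_min) / kT) * math.sqrt(2.0 * math.pi * kT / curvature)


def narrow_well(x):
    # Lower in energy (0.0) but steep: small fluctuations.
    return 50.0 * x * x


def broad_well(x):
    # Higher in energy (0.5) but shallow: large fluctuations.
    return 0.5 + 0.5 * (x - 3.0) ** 2


for kT in (0.05, 0.6):
    w_narrow = harmonic_weight(narrow_well, 0.0, kT)
    w_broad = harmonic_weight(broad_well, 3.0, kT)
    print(kT, w_narrow, w_broad)
```

In this toy case the narrow, lower-energy well dominates at the lower temperature, while the broad, higher-energy well carries the larger weight at the higher one, because its larger fluctuations outweigh its energy penalty.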
The second derivative describes the curvature at the bottom of the potential energy well and hence the fluctuation in conformation about each local minimum in the potential energy surface. Local minima that are not the global minimum of the potential energy can become the conformations of higher statistical weight if there is a large enough entropy gain from large conformational fluctuations. Paine and Scheraga (1987) have encountered such effects.

Optimization Procedures

Optimization procedures are available for searching for the local minima. These include direct energy minimization, Monte Carlo, and molecular dynamics procedures. In energy minimization, the variables describing the conformation are altered systematically so as to lower the energy continuously. The Monte Carlo method makes random changes in the conformational variables and accepts the new conformation according to various protocols that compute the energies before and after the random changes in the conformation. In molecular dynamics calculations, Newton's equations of motion for the atoms of the macromolecule (subject to interatomic forces determined by the potential functions) are solved to obtain a trajectory in conformational space. Very efficient energy minimization algorithms exist, even for functions of many variables, but lead only to local minima (however, see the following section). Conventional Monte Carlo procedures can overcome local minima but tend not to cover conformational space efficiently enough. However, this difficulty is being overcome using modifications that include adaptive importance sampling and other efficiency-seeking procedures (Gibson and Scheraga, 1988). Molecular dynamics can also surmount local barriers, but the picosecond time scale of practical computations does not approach the millisecond time scale of actual protein folding.

Because most published papers do not provide the relevant data, it is difficult to compare the computer time needed for various optimization procedures. The required computation time will be very sensitive to how well the code is optimized, whether parallel processing is carried out, and other factors.
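One widely used acceptance protocol of the kind mentioned above is the Metropolis criterion. The sketch below applies it to a hypothetical one-dimensional double-well energy. The function `energy`, the temperature `kT`, the maximum move size, and the fixed random seed are all illustrative assumptions rather than settings from the cited work.

```python
import math
import random


def energy(x):
    # Hypothetical double-well energy: global minimum near x = -1.04,
    # higher local minimum near x = +0.96.
    return x**4 - 2.0 * x**2 + 0.3 * x


def metropolis(x0, n_steps, kT=0.3, max_move=0.5, seed=0):
    # Metropolis Monte Carlo: random trial moves, accepted if the energy
    # drops, or with probability exp(-dE/kT) if it rises.
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(n_steps):
        trial = x + rng.uniform(-max_move, max_move)
        e_trial = energy(trial)
        if e_trial <= e or rng.random() < math.exp(-(e_trial - e) / kT):
            x, e = trial, e_trial
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


# Start inside the higher well; uphill moves are sometimes accepted,
# so the walk is not confined to the starting basin.
best_x, best_e = metropolis(0.96, 20_000)
print(best_x, best_e)
```

Unlike a downhill-only minimizer, the walk can cross barriers, but how quickly it does so depends strongly on `kT` and `max_move`, which is one face of the sampling-efficiency issue raised in the text.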
To obtain such benchmarks, it would be necessary to run each procedure on several different computer systems, a task not yet undertaken.

Solutions to the Multiple-Minima Problem for Macromolecules

Since no mathematical procedures are available to locate the global minimum for any macromolecule (except in energy surfaces of very low dimensionality), as mentioned above we must first resort to approximate procedures to obtain structures that might lie close to that of the native macromolecule. Then, the approximations are abandoned, and a full-scale energy minimization, Monte Carlo, or molecular dynamics procedure is carried out. A variety of such procedures have been developed (Gibson and Scheraga, 1988). These include:

• a "build-up" method, in which large structures are built up from ensembles of low-energy conformations of smaller ones (with energy minimization being carried out at each stage);
• optimization of electrostatic interactions;
• optimization in a space of high dimensionality where fewer intervening barriers exist (with subsequent relaxation back to three dimensions);
• Monte Carlo sampling among local minima (accompanied by energy minimization);
• adaptive importance Monte Carlo sampling (to drive the system more efficiently to the global minimum);
• pattern recognition methods to assemble organized backbone structures such as alpha-helices and beta-sheets;
• use of distance constraints either from experiment (nuclear Overhauser effects (NOEs), nonradiative energy transfer, or NMR on spin-labeled molecules) or from statistical analysis of x-ray structures of proteins; and
• empirical methods to predict the locations of alpha-helices, beta-sheets, and beta-turns.

Numerous valid predictions of global minimum structures of peptides have been made using these methods (Gibson and Scheraga, 1988). However, they have thus far been successful only for structures that contain at most 20 residues, and current efforts (most of which require access to supercomputers) are being made to extend these methods to larger molecules, that is, to proteins containing on the order of 100 amino acid residues.

Successes and Failures

Numerous structures have been predicted and subsequently confirmed experimentally (Scheraga, 1984).
The right- or left-handed twists of the fundamental structures (alpha-helices and beta-sheets) from which proteins are built have been accounted for by energy minimization. The observed packing features of alpha-helices and beta-sheets have likewise been accounted for in energetic terms. Parameters calculated for conformational transitions (e.g., the helix-coil transition in water) have been verified by experiment. The computed structures of open-chain and cyclic molecules (e.g., the 20-residue membrane-bound portion of melittin and the 10-residue gramicidin S, respectively), and those of collagen-like poly-tripeptides have also been verified by experiment (Scheraga, 1984). Finally, the computed structure of an enzyme-substrate complex (hen egg white lysozyme and a hexasaccharide substrate) (Pincus and Scheraga, 1979) has been verified by experiment (Smith-Gill et al., 1984). These and other examples should give us confidence in the validity of the potential functions and computational methodology (Gibson and Scheraga, 1988).

The failures, in the sense of not yet having solved the protein-folding problem, exist because no one has yet used optimization techniques to deduce the three-dimensional structure of even a small protein, such as the 58-residue bovine pancreatic trypsin inhibitor (BPTI). Current procedures applied to BPTI have not yet yielded a computed structure that comes closer to the x-ray structure than 2-3 Å. Several procedures that work to overcome the multiple-minima problem on small molecules become computationally intensive as they are used on larger molecules. However, the increasing use of supercomputers will help overcome this problem.

Impediments to Progress

Although supercomputers will allow larger calculations and thus cover conformational space better, workers in this field will need additional time to be allotted on these machines to do the research necessary to achieve greater efficiency. Parallel processing offers a breakthrough, and this will require new software to take advantage of the hardware enhancements. With new hardware and software, it should be possible to surmount the major hurdle created by the multiple-minima problem. However, it is conceivable that bottlenecks may develop as we attempt to scale up procedures that work on 20-residue segments to proteins containing 100 to 200 residues. We will also need imaginative new approaches to overcome this problem.

Potential functions should be improved, especially those for polynucleotides and prosthetic groups, and for water-water interactions, but this is not now the most serious problem. Certainly, this problem should be addressed again when the multiple-minima problem is solved for bovine pancreatic trypsin inhibitor. Finally, new developments will be needed to bring molecular dynamics from the picosecond to the millisecond time scale.

Future Prospects

At every stage in the development of conformational energy calculations over the past 25 years, we always seemed to face insurmountable obstacles. However, the steady progress during this period indicates that many of these obstacles have been overcome. The remaining major hurdle is the multiple-minima problem (Gibson and Scheraga, 1988), but we have an array of possible solutions to it. The solutions have worked for small molecules, and current and impending developments in computer hardware and software should justify our confidence that, within 5 to 10 years, we may expect to understand how interatomic interactions dictate not only the final folded structure but the pathways taken by the newly formed polypeptide chain to reach the native structure.

MOLECULAR DYNAMICS

The principle behind a molecular dynamics simulation is simply the application of Newton's equations of motion to the atoms of one or more molecules. Newton's equations relate three independent quantities: time, conformation (atomic coordinates), and potential energy. As the calculation progresses and the positions and velocities of the atoms change, the system will traverse many different states; as the simulation is prolonged, the observed states together approach a perfect sample of the thermal equilibrium ensemble of all states the system will occupy. The thermal equilibrium distribution may also be sampled without considering motion, using appropriate purely statistical methods (Monte Carlo techniques).
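To make the principle concrete, the sketch below integrates Newton's equation for a single particle on a harmonic spring. The velocity Verlet scheme used here is one common integrator and is an assumption of this example, since the text does not name one; the spring constant, mass, time step, and function names are likewise hypothetical.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    # Integrate m * d2x/dt2 = force(x) with the velocity Verlet scheme and
    # return the trajectory as a list of (position, velocity) pairs.
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append((x, v))
    return trajectory


SPRING_CONSTANT = 1.0  # hypothetical units
MASS = 1.0


def spring_force(x):
    # Force is minus the derivative of the potential 0.5 * k * x**2.
    return -SPRING_CONSTANT * x


def total_energy(x, v):
    return 0.5 * MASS * v * v + 0.5 * SPRING_CONSTANT * x * x


trajectory = velocity_verlet(1.0, 0.0, spring_force, MASS, dt=0.01, n_steps=1000)
x_end, v_end = trajectory[-1]
print(x_end, total_energy(x_end, v_end))
```

The time step is a small fraction of the oscillation period, so the total energy stays very nearly constant along the trajectory; the stored positions and velocities are the "observed states" from which averages would be taken.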
In principle, a Monte Carlo calculation might produce a representative sample using less computer time. Noguti and Go (1985) indicate how, with knowledge of the second-derivative matrix of the potential energy, the atomic coordinates can be effectively used to speed up the Monte Carlo process. However, it is as yet uncertain whether this accelerated Monte Carlo procedure produces a more rapid exploration of conformation space of a protein than a molecular dynamics simulation.

Thus, the molecular dynamics simulation gives us a way to make theoretical estimates of mean atomic positions and deviations from the mean; of rates of motion and conformation change; and of ensemble averages, including thermodynamic functions such as energy, enthalpy, specific heat, and free energy. Since free energies can be expressed as equilibrium constants and vice versa, simulations are being used to obtain theoretical estimates of differences of affinity of proteins for small molecules. Recent results show remarkably good agreement with experimental values. Major pharmaceutical companies have already noted the usefulness of accurately predicting these differences.

Molecular dynamics simulations, although simple in concept, were not practical before the advent of high speed computers. This method of theoretical chemistry is particularly useful for the study of condensed phases and was first used to study the structure and dynamics of liquids. Later, several investigators applied existing techniques to protein molecules (Karplus and McCammon, 1983; Berendsen [cf. Hermans, 1985; Beveridge and Jorgensen, 1987]). At present, several laboratories are active in the field. More are becoming involved as the methods are applied to increasingly quantitative studies that aim to reproduce experimental observations as closely as possible in the computer model. Many investigators express the belief that molecular dynamics calculations will soon produce useful predictions of structure, dynamics, and thermodynamics of proteins, nucleic acids, and complexes of these macromolecules with one another and with other molecules.

The simulation requires two pieces of information at the outset: a starting conformation and a potential energy function or forcefield.
For a protein, current technology requires that the starting conformation be firmly based on experimental observation: because many conformations exist at local minimum energy, a conformation that is very different from the correct most stable conformation evolves too slowly to reach the correct conformation in the length of a typical calculation.

The forcefield is a very simple empirical approximation to the underlying physics, which properly should be expressed in terms of quantum mechanics but is totally unmanageable in that form. Parameters of the forcefields now in use have been proposed on the basis of a variety of experimental data and to some extent on theoretical considerations. Overall, the several forcefields proposed by different groups for computation of the internal energy of proteins tended to have very similar sets of parameters. Recently developed forcefields for water-water and water-protein interactions permit the simulation of dynamics of proteins in solution, which is a prerequisite for modeling events at the protein surface, including most interactions of proteins with other molecules. (The problems of developing adequate forcefields are discussed in the section on "Solvation.")

Carrying out molecular dynamics simulations of proteins is very much an art of the feasible, the limiting factor always being the available computing power. One is always facing the consequence of an inescapable physical fact: that the most rapidly fluctuating atomic motions, the bond-stretching and bond-angle-bending vibrations, have periodicities of the order of once in every few femtoseconds (10^-15 sec). Current simulation methodology requires that periodic motions be sampled several times per period, and each sampling requires an evaluation of the system's potential energy, requiring computer time in milliseconds on the fastest machines, Cray and Cyber 205. Clearly, simulations cannot now span a time that is on the biological time scale of microseconds to seconds. Unavoidably, molecular dynamics simulations use simple forcefields to span a longer time. Given more computer time, those working in the field will improve simulations in various ways: use of more detailed forcefields; longer simulations; simulation of larger systems posing new physical and biological questions; and application of new, more time-consuming, dynamics methods to ask different questions about the system.
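The scale of this limitation can be seen with simple arithmetic. The sketch below assumes a time step of 1 femtosecond and a cost of 1 millisecond of computer time per energy evaluation; both figures are illustrative round numbers consistent with the orders of magnitude quoted above, not measured benchmarks, and the function names are hypothetical.

```python
TIMESTEP_S = 1e-15        # assumed integration step: 1 femtosecond
COST_PER_STEP_S = 1e-3    # assumed computer time per energy evaluation: 1 ms
SECONDS_PER_DAY = 86_400.0


def steps_required(simulated_time_s, timestep_s=TIMESTEP_S):
    # Number of integration steps needed to cover the simulated time span.
    return round(simulated_time_s / timestep_s)


def wall_clock_seconds(n_steps, seconds_per_step=COST_PER_STEP_S):
    # Total computer time if every step costs one energy evaluation.
    return n_steps * seconds_per_step


for label, span in [("100 picoseconds", 100e-12),
                    ("1 microsecond", 1e-6),
                    ("1 millisecond", 1e-3)]:
    n = steps_required(span)
    days = wall_clock_seconds(n) / SECONDS_PER_DAY
    print(f"{label}: {n:.0e} steps, about {days:.3g} days of computer time")
```

Under these assumptions each thousandfold increase in simulated time costs a thousandfold increase in computer time, which is why picosecond trajectories were practical while microsecond and longer ones were not.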
To those working in the field, the future is bright; ideas and interesting problems abound, and new computer technology continues to widen the limits of feasibility.

RESULTS

Collected papers for symposia held in 1984 and 1985 give an overview of methods and results of applications of molecular dynamics to proteins.* Subsequent work achieved many of the possibilities proposed in these papers, but did not deviate radically from the directions anticipated at the symposia. The following section summarizes achievements and describes possible future applications and advances. This section is divided into three parts that cover structure, dynamics and kinetics, and thermodynamics of macromolecules. The section concludes with a prognosis of developments to come.

* Hermans, 1985; Beveridge and Jorgensen, 1986; results described in these symposium papers are not explicitly referenced in this section. Some interesting work has been reported on nucleic acids. However, the technical difficulties of working with these highly charged molecules much exceed the difficulties encountered in simulations of proteins; given a limited amount of resources, it is understandable that technically less formidable problems have received priority.

The Determination of Macromolecular Structure

The first molecular dynamics calculations of protein molecules produced trajectories whose mean atomic positions deviated very significantly from the starting positions (by root-mean-square (rms) displacements of several Angstroms), which are known to within a much smaller error from x-ray crystallography. With the development of better forcefields and the inclusion of a solvent environment or even a complete crystalline matrix that consisted of solvent and other protein molecules, the root-mean-square difference of the atomic positions decreased significantly. Nevertheless, x-ray crystallographic structures, especially after crystallographic refinement, have a precision well inside this difference. The situation appears to be reversed with regard to the thermal distribution of the atomic positions about their means. Except in rare instances, x-ray crystallography produces a single parameter for each atom that represents the width of an isotropic Gaussian distribution of the atomic center. In contrast, the results of molecular dynamics simulation can be used to describe in detail anisotropic distributions of any shape, even distributions with several maxima. In the one case where results of molecular dynamics simulation have been compared with anisotropic thermal parameters from x-ray crystallography, the agreement was very good. Methods for introducing theoretical estimates of thermal restraints into crystallographic structure refinement are being developed.

Molecular dynamics simulations show considerable promise of being able to increase our knowledge of structures proposed on the basis of incomplete information, particularly information derived from two-dimensional NMR. Two-dimensional NMR produces a set of distances between hydrogen atoms, the Nuclear Overhauser Effect (NOE) distances; for any given (small) protein, many but not all of these distances can be assigned to individual atom pairs. Regular structures such as helices and beta sheets are easily identified and assigned to particular segments of the molecule. However, it is seldom possible to obtain a sufficiently large number of the longer distances that define the relative packing of the regular parts and the structure of irregular chain segments. In this situation, additional information must be brought to bear, an obvious choice for this information being the constraints imposed on the structure by its chemistry and by the requirement of adequate interchain "packing." As these requirements are those observed in a typical molecular dynamics simulation, this has led to the development of a method of molecular dynamics with added constraints, i.e., the NOE distances. By varying the importance of the constraints that determine the conformation and varying the temperature (the total kinetic energy), the structure can be made to evolve to one with a lower conformational energy. This structure also meets the requirements imposed by the NOE measurements as well as or better than the starting model and may show new distances between hydrogens that are sufficiently short to be detected in the NOE measurement, but whose assignment had been ambiguous (see also the section on "Tertiary structures of macromolecules using NMR" in this report).

Spectroscopy/Kinetics and Molecular Dynamics

Molecular dynamics is a unique tool for simulating time-dependent processes in condensed systems. The problem of evaluating the time dependence of motion of a protein is formidable. Because of a lack of symmetry, each atom introduces molecular motion at three new frequencies (normal modes), each of which may be distributed over all atoms.
This both overstates and un- derstates the situation: it is an overstatement because the fre- quencies of many normal modes (e.g. bonds/retching modes) are predictable and correspond to localized vibrations; it is an un- derstatement both because each conformation of minimum energy (that contributes to the thermal ensemble) contributes its own set of normal modes, and because transitions between conformations, across energy barriers, produce additional motion. This additional

82 motion is usually not a periodic oscillatory motion, but one gov- erned by the statistics of barrier crossing. The high-frequency oscilIations of a molecular dynarn~cs tra- jectory are easily determined, but are also the lent interesting type of motion. Slower motions are far more likely to be relevant to biological function. These typically involve many atoms and have large amplitudes, which are also expected of molecular motions required for the macromolecule's biological function, although not every low-frequency mode will be significant in this respect. Important motions that have been studied by simulation in- clude the hinge bending of lysozyme and the internal breathing motions necessary to transport oxygen to reach the active site theme group) of myogIobin and hemoglobin. The hinge-bending motion is typical of that presumably required for many enzymes to accept a substrate in the active site and release the products of catalysis. Because of the low frequency of these motions, the hinge bending was not simulated by a direct molecular dynamics calculation. Instead, its frequency was estimated by combining an analysis of the potential energy required to bend the hinge re- gion with hydrodynamic consiclerations. In contrast, the breathing motion of myogiobin was observed in a molecular dynamics sim- ulation of 100 picoseconds (10-~°sec) to have a period of roughly 30 pico~econds. It would have been missed had its frequency been only twice as large. The motion of carbon monoxide in hemoglobin following the photodissociation of carbon monoxide hemoglobin has been sim- ulated with molecular dynamics (Henry et al., 1985) to compare the results with extensive spectroscopic experunental studies of the events that follow this reaction. The agreement Is very good; a striking result of the simulation was the considerable local in- crease of atomic thermal motion that follows the absorption of the photon and breakage of the heme-CO bond. 
This increase in thermal motion has a very significant effect on the early kinetics, i.e., during the time required for the excess kinetic energy to dissipate into the protein and then into the solvent.

Case and McCammon (1986) have analyzed the dynamics of ligands in the interior of the myoglobin molecule, with emphasis on the details of the passage of the ligand molecule into and out of the heme pocket. This movement of oxygen is an example of a process requiring passage of the system over a (free) energy barrier. The breathing motions of myoglobin appear to open passages or gates

that, when open, allow probes (and by implication, oxygen) to move back and forth between cavities inside the protein (Tilton et al., in press).

Very few thorough investigations have been conducted of such gated events. The best is a study of the manner in which tyrosine rings inside proteins rotate by 180°, a process for which experimental information is available from NMR spectroscopy. This rotation is an essentially stochastic process, as opposed to the regularly occurring oscillatory motions, and can occur only when the protein assumes particular favorable local conformations, incidentally, in the course of its internal vibrations. This is often described as a "gated" event. The kinetic process is best studied by placing the protein in a gate-open conformation and determining the relaxation, during which the otherwise rare event (i.e., ring flip) may take place with a good probability (Gosh and McCammon, 1987). Because the trajectories are reversible, the required kinetic information can be extracted. A difficulty of these studies is that the scientist chooses what parts form the gate and how it opens. As this work has been refined with careful attention to detail and use of improved potential functions, the model's kinetic parameters have tended to approach the experimentally observed values. However, because the motion is so localized, this study of tyrosine ring flips may be the only well-developed example of its kind. We seem far from being able to analyze the kinetic pathways with molecular dynamics, let alone predict rate constants, for the biologically important conformation changes of allosteric proteins, in which many residues readjust their conformation and parts of the protein may undergo relative shifts in position of many Angstroms.

Many interesting conformational relaxation processes of proteins are too slow to be directly accessible using current techniques of molecular dynamics simulation.
However, there is a range of fundamental interest (10^-12 to 10^-9 sec) that can be studied by both molecular dynamics and NMR spin relaxation spectroscopy. We believe there is substantial opportunity for productive comparison of NMR and the results of molecular dynamics simulation. Perturbed NMR resonances for spin one-half nuclei relax primarily because of modulation of dipolar interactions by global and internal molecular motion. Resonances in NMR spectra can also be assigned to discrete sites and, in cases where the geometry of the dipolar interaction is well defined, time scales for motion at a

particular site can be extracted. Although relaxation mechanisms can be very complex, especially for protons, useful analysis should be possible in some cases. Use of 13C NMR can, for example, simplify relaxation time analysis, because most relaxation interactions occur with directly bonded protons (Wagner and Brühwiler, 1986). It is also becoming increasingly possible to introduce amino acids enriched in 13C at specific sites. NMR of 13C-enriched proteins and peptides offers substantial possibilities for extraction of experimental time scales of motion in the 10^-12 to 10^-9 sec range for verification of theoretical predictions.

When motion of groups or relaxation interactions are complex, molecular dynamics simulations may also improve the interpretation of NMR data. Here, Levy et al. (1981) have shown that it is possible to construct appropriate dipolar correlation functions from states sampled in a molecular dynamics simulation. In principle, this allows calculation of NMR relaxation parameters that can be used to validate models used for interpreting NMR data. Thus, the improvement of molecular dynamics simulations and the development of experimental methods for determining structure may prove symbiotic. Carrying out this dual strategy will require substantial investment in producing an accurate description of spin relaxation, as well as coordination among those developing simulation programs.

Thermodynamics of Macromolecules

Physicists have known for a considerable time about techniques to calculate equilibrium thermodynamic properties from molecular dynamics calculations. The techniques have been applied to proteins very recently, but already their use has shifted the emphasis of the simulation field to calculations of free energy differences.
Several factors explain the great current interest in this application, the most important being the availability of many precise experimental data for a variety of equilibria that involve biological macromolecules and the unexpectedly excellent theoretical estimates that the simulations produce. The first successes were obtained in studies of the hydration of small molecules and ions, in which the free energy of transfer of a small molecule to bulk water could be compared with accurate experimental data (see following section on "Solvation").

An important feature of the ongoing research program on

macromolecular equilibria is that investigators are carefully isolating a small subset of the global problem to avoid overwhelming available computers. For example, in studying enzyme inhibition equilibria, McCammon's group is using molecular dynamics simulations to estimate the differences in binding free energy of a series of small inhibitors to the enzyme trypsin. A complete calculation consists of two parts, one in which one substrate bound to the protein is replaced with another, and one in which the first substrate solvated in water is replaced by the other. As can be seen from the following thermodynamic cycle (E is enzyme, S1 and S2 are two different substrates),

    E (solvated) + S1 (solvated)  →  E-S1
                    ↓                  ↓
    E (solvated) + S2 (solvated)  →  E-S2

the difference of the two free energy changes obtained from the simulations (indicated by vertical arrows) is equal to the difference of free energy of binding the two substrates to the enzyme (indicated by horizontal arrows). The agreement between theory and experiment is of the order of a few kJ/mole, which amounts to a factor of two to three in the equilibrium constant. These methods are easily adapted to the estimation of differences in affinity of substrates and inhibitors caused by alteration of the enzyme by site-directed mutagenesis (e.g., Bash et al., 1987b). In one study of the interaction of a protein and a small molecule, the binding of xenon gas to myoglobin, Hermans and Shankar (1987) found that the molecular dynamics simulation was able to give a direct estimate of the binding equilibrium constant, which was within a factor of two of that observed experimentally.

Similar methods are being used to study conformational equilibria of macromolecules. Most thoroughly studied are conformational equilibria of the alanine dipeptide, a well-known low-molecular-weight model of a polypeptide.
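The bookkeeping implied by the enzyme-substrate thermodynamic cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by the groups cited: the function names and the numerical inputs are hypothetical, and the two free energy changes (S1 replaced by S2 in the bound state and in water) are assumed to have been obtained already from separate simulations.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol K)


def relative_binding_free_energy(dg_replace_bound, dg_replace_solvated):
    """Close the cycle: ddG_bind = dG_bind(S2) - dG_bind(S1).

    dg_replace_bound    : free energy of replacing S1 by S2 in the complex (kJ/mol)
    dg_replace_solvated : free energy of replacing S1 by S2 in water (kJ/mol)
    Both inputs are hypothetical simulation results (the vertical arrows).
    """
    return dg_replace_bound - dg_replace_solvated


def equilibrium_constant_ratio(ddg_bind, temperature=298.15):
    """Ratio K(S2)/K(S1) of the binding constants for a given ddG_bind (kJ/mol)."""
    return math.exp(-ddg_bind / (R_KJ_PER_MOL_K * temperature))


if __name__ == "__main__":
    # Hypothetical numbers chosen only to illustrate the arithmetic.
    ddg = relative_binding_free_energy(-10.0, -7.5)
    print("ddG_bind (kJ/mol):", ddg)
    print("K(S2)/K(S1):", equilibrium_constant_ratio(ddg))
```

With these illustrative inputs the difference is -2.5 kJ/mole, and the same conversion shows why an error of a few kJ/mole in the free energy corresponds to a factor of two to three in the equilibrium constant at room temperature.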
The most important result concerns the equilibrium between two conformations, alpha and beta, which correspond to different helical structures of polypeptides. In an environment of water molecules, simulations performed by two groups with different methods gave similar results: a preference by a factor of two to five for the (extended)

beta conformation. Experiment indicates, imprecisely, a value of around 10.

Prognosis of Developments

The success of the free energy simulations has suddenly changed the scope and emphasis of molecular dynamics simulations. The early simulations either clarified properties of proteins that were difficult to study experimentally (kinetics and dynamics on the picosecond time scale) or else gave unsatisfactory agreement with experiment (mean atomic positions). However, free energy calculations suffer from neither of these problems. In addition, the results are in a field that is traditionally of great interest to biochemists. Biochemists routinely and accurately achieve experimental determination of free energies of binding (from binding equilibrium constants) of inhibitors and substrates to enzymes. Furthermore, the agreement between theory and experiment is so good that molecular dynamics simulations are widely believed to be a useful tool to predict the inhibitory power of new compounds. This tool will at least screen out a large fraction of possible inhibitors, and thereby greatly reduce the synthetic work required in the search for the perfect inhibitor. Replace "inhibitor" with "drug," and one realizes the potential of these tools. Add to this the possibility of predicting the properties of genetically altered proteins produced by the biotechnology industry and the demand for such tools soars.

This new application has created sudden and perhaps unexpected demands for computer time for two reasons. First, investigators suddenly have a seemingly limitless number of technologically and biochemically interesting questions to answer; one may envision the possibility of rationalizing the inhibition constants of all studied inhibitors of any one enzyme and its mutants (the latter designed and manufactured in the laboratory on order).
Second, as the emphasis has shifted from problems of structure and dynamics to problems of equilibrium thermodynamics, there is less reason to analyze the details of most trajectories and conformations. This is because free energy simulations typically pass through a series of artificially constructed intermediates that are physically unrealizable. Thus, each researcher will be able to perform more simulations without being overwhelmed by the time requirements

of analyzing the results. Consequently, one researcher can more effectively use more computer time.

Accordingly, progress in free energy simulations, although potentially very rapid, is heavily limited by available computer time. As recently as 5 years ago the demands of these calculations exceeded the available computer power. At present, each of several research groups is using hundreds of machine hours of Cray time. In addition, a number of groups have been able to acquire Star array processors, which may have the power of a Cray but a much lower price. Dedicating one or more array processors full time to the single task of molecular dynamics is extremely efficient in terms of total cost of hardware and programming. Similarly, the economics of building a hard-wired special-purpose machine for molecular dynamics may be justified in terms of the economics of building and operating several copies of the final product, and such a machine is almost complete. In spite of these rapid developments, the scope of free-energy simulations is still severely limited.

Within a short time, we will need a radical increase in computer time to realize possibilities that are now clearly defined. An immediate tenfold increase appears needed, and does not seem an extravagant objective with proper planning (i.e., duplicate existing hardware that is already programmed and otherwise inexpensive). The pharmaceutical and biotechnology industries will make some investment since companies' efforts at rational drug design require simulation capability that works in parallel with NMR and x-ray determination of physiologically crucial enzymes.

Apart from drug and protein design, others within and outside of industry will apply these techniques to the broad problems of protein-protein and protein-nucleic acid recognition, by using a combination of molecular dynamics simulations and the results of site-directed mutagenesis.
Once we have dealt with the problem of rationalizing these equilibria in terms of molecular interactions in atomic detail, our attention will shift to the application of the newly acquired skills to problems of the dynamics of interaction of proteins with other molecules, which will presumably require just as much computer time. In contrast to the protein folding problem described in the previous section of this report, the problem of computer modeling of the dynamics of protein interactions can be tackled in a series of small, increasingly complex steps, each of

which solves a discrete problem of immediate biochemical interest, yet also provides additional insight and experience needed to advance the technical expertise.

Beyond the need for sufficient computer time, we must improve forcefields for proteins, carbohydrates, and nucleic acids by determining better values for the parameters and extending these to include drug-type molecules; attempts should also be made to adapt molecular dynamics techniques in ways designed to overcome some of the intrinsic imperfections of existing forcefields. At present, there is a disturbing trend towards the development and commercialization of proprietary software instead of free exchange of programs and subroutines, and there appears to be a parallel development of a proprietary forcefield. The members of this committee hope that these trends are temporary; the emphasis on commercialization is appearing too early in the scientific process. If the trend persists, a national agency should commission the development of state-of-the-art programs and forcefields that would be available to all workers.

SOLVATION AND ELECTROSTATICS IN COMPUTER SIMULATION OF BIOPOLYMERS

Biomolecular systems function in vivo in an aqueous solution environment. That environment includes solvent as well as a substantial component that consists of a variety of salt ions. Because these components interact significantly with each other and typically also with the macromolecular species, this environment can contribute substantially to the observed state of macromolecules in solution. Manifestations of its influence include the relative stability of various macromolecular conformations and the binding constants that characterize the association of macromolecules with each other or with other biochemically significant molecules.

The simplest effect of water can be thought of as that of a high dielectric medium that screens the interaction of charged and polar groups.
Hence, the interaction is in effect much smaller than the corresponding interaction in the absence of solvent. However, solvent influence cannot be described solely in such terms. For example, it has long been appreciated that nonpolar moieties are preferentially excluded from water, and such "hydrophobic"

effects have long been believed to be a major component in protein conformational free energies (Kauzmann, 1959). For all polar interactions, hydrogen bonding in particular, the net energy is determined primarily by the mismatch between short-ranged solvent-solute and solute-solute interactions. In the case of the highly charged nucleic acid polymers, the identity and distribution of solution counterions also seem to be particularly significant.

As this brief discussion makes clear, we cannot expect to succeed in the quantitative treatment of biopolymer structure and function without paying due attention to the molecular role of the solution environment. In particular, we must take into account the significant role played by the environment in determining the strength of ligand binding, as well as the relative stability (free energy) of the varied structures that must be encountered during protein folding and that may accompany function.

In the following section, we describe briefly alternative frameworks for considering environmental effects and discuss the current state of theory in this area. We focus our attention on the limitations of currently available results and methods, as well as on the potential for significant progress in the near future. Finally, we point out important areas for attention in the short term, and comment on the prospects for successful quantitative treatments in the longer term.

All-Atom Modeling

Detailed molecular models for the solution environment take the same form as those used for the biopolymer as such. That is, the solution components are represented as a collection of sites, typically atomic sites, that carry partial electrostatic charges and are each associated with a spherical short-ranged potential that accounts for short distance interatomic repulsion and for London attractive forces. Molecular entities are usually specified through sets of bond length and angle constraints.
For water, those models that most successfully reproduce experimental liquid data (Jorgensen et al., 1983) include an additional charged site that is not aligned with any of the three nuclei of each molecule. The solvent molecules (and ions) then interact with each other and with macromolecular components by a superposition of electrostatic and short-ranged terms.

This format for the potential is itself an approximation, and

the quantitative limitations of this form are not yet completely determined. The most significant limitation is that, in reality, the polarity of a molecule in solution is substantially influenced by its surroundings due to electronic polarization. For hydrogen-bonded liquids such as water, it is known that a successful nonpolarizable model must include electrostatic site charges for each molecule that correspond to an increased dipole moment with respect to the gas phase; the increase represents the average additional polarization induced by neighboring molecules (Stillinger, 1975). Such effects appear, in principle, for all components of the solution, including macromolecular species. Although we have no evidence that the neglect of explicit polarizability is now limiting the predictive power of such models, it should be kept in mind as a potential limitation to quantitative prediction. Whether such effects are included in modeling efforts is limited primarily by computational rather than theoretical capabilities, so even in the worst case, such effects can eventually be added.

Carrying out any computational study of macromolecular behavior taking full atomic account of the solvent is an extraordinarily demanding task, since the surrounding solvent constitutes most of the whole system. In the presence of finite concentrations of ions, the problem is substantially more difficult, since in that case, the solvent associated with ionic dilution must be included as well (25 ion pairs require 15,000 water molecules to dilute to 0.1 M). Such a computer simulation study remains at least an order of magnitude beyond what is now feasible. Nevertheless, considerable progress is being made in simulating macromolecular systems and related model compounds, taking full account of the solvent environment. This progress is due in large part to the vast increase in available computational power.
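The number of water molecules needed to dilute a given number of ion pairs, quoted above, follows from the molarity of liquid water. The short sketch below redoes that arithmetic; the value of 55.5 mol/L for pure water near room temperature and the function names are assumptions introduced here, and the volume occupied by the ions themselves is ignored.

```python
WATER_MOLARITY = 55.5  # mol/L, approximate value for liquid water (assumption)


def waters_per_ion_pair(salt_molarity):
    """Water molecules per ion pair at the given salt concentration (mol/L)."""
    return WATER_MOLARITY / salt_molarity


def waters_for_ion_pairs(n_pairs, salt_molarity):
    """Total water molecules needed to dilute n_pairs ion pairs to salt_molarity."""
    return n_pairs * waters_per_ion_pair(salt_molarity)


if __name__ == "__main__":
    print(waters_for_ion_pairs(25, 0.1))
```

For 25 ion pairs at 0.1 M this gives a little under 14,000 water molecules, the same order as the figure of 15,000 given in the text.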
In particular, in the area of model systems, this power has permitted a few studies of small molecule conformational equilibrium (Jorgensen, 1982; Rosenberg et al., 1982; Zichi and Rossky, 1986) and one direct investigation of the conformational free energy of a dipeptide (Mezei et al., 1985). Carrying out such studies requires specialized sampling techniques, termed umbrella sampling, that allow the accurate determination of relative populations of conformers that are separated from one another by significant free energy barriers. Such techniques and their efficient use are the products of recent research on simulations of molecular model systems.
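The idea behind umbrella sampling can be illustrated on a one-dimensional toy problem. The sketch below is not taken from the studies cited: the double-well potential, the barrier height of 8 kT, the idealized bias that exactly cancels the potential, and all function names are assumptions chosen for illustration. Sampling is done on the biased surface, where the barrier is crossed easily, and each sample is then reweighted by exp(+bias/kT) to recover unbiased relative populations.

```python
import math
import random

KT = 1.0        # energies are measured in units of k_B * T
BARRIER = 8.0   # barrier height in kT (illustrative assumption)


def potential(x):
    """Toy double-well potential with minima at x = -1 and x = +1."""
    return BARRIER * (x * x - 1.0) ** 2


def umbrella_bias(x):
    """Idealized bias that cancels the potential, flattening the sampled surface."""
    return -potential(x)


def sample_biased(n_steps, step=0.2, x_max=1.5, seed=0):
    """Metropolis Monte Carlo on the biased surface potential + umbrella_bias."""
    rng = random.Random(seed)
    x = 1.0
    samples = []
    for _ in range(n_steps):
        trial = x + rng.uniform(-step, step)
        if abs(trial) <= x_max:  # moves outside the sampled range are rejected
            d_e = (potential(trial) + umbrella_bias(trial)) \
                - (potential(x) + umbrella_bias(x))
            if d_e <= 0.0 or rng.random() < math.exp(-d_e / KT):
                x = trial
        samples.append(x)
    return samples


def unbiased_free_energy(samples, centers, half_width=0.05):
    """Return -kT ln(P) at each bin center after removing the bias.

    Each biased sample carries the weight exp(+bias/kT); only differences
    between the returned values are meaningful.
    """
    weights = [0.0] * len(centers)
    for x in samples:
        for i, center in enumerate(centers):
            if abs(x - center) < half_width:
                weights[i] += math.exp(umbrella_bias(x) / KT)
    return [-KT * math.log(w) for w in weights]


if __name__ == "__main__":
    trajectory = sample_biased(200_000)
    f_top, f_well = unbiased_free_energy(trajectory, [0.0, 1.0])
    print("estimated barrier (kT):", f_top - f_well)
```

In a real application the bias is only an approximate guess at the free energy surface, and several overlapping biased simulations are usually combined; the single flat window used here is the simplest case.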

The results of these studies are consistent with the limited experimental data available, as documented in the cited reports. A few other studies have been carried out on peptides in water without any attempt to fully explore conformational space (Brady and Karplus, 1985; Hagler et al., 1980a).

Recently, a number of studies have been carried out that aimed to evaluate directly the relative hydration free energies of small molecules, amino acids, and nucleotide bases (Bash et al., 1987a; Jorgensen and Ravimohan, 1985; Lybrand et al., 1986; Singh et al., 1987). Such studies are essential for calibrating the potentials in use. These relative free energy quantities are amenable to calculation using a thermodynamic perturbation approach, another tool added recently to simulators' methods (Postma et al., 1982).

The results of the studies are encouraging in that the investigators obtained relative free energies within about 1 kcal of experimental determinations. Although this level of accuracy may not be sufficient to determine quantitatively the stability of systems involving many such groups, it strongly suggests that the potential functions are sufficiently close to being right that relatively small adjustments may adequately finish the job.

Similarly encouraging results have been obtained in small model system binding equilibria. Studies of relative affinities of ions for a cyclic ionophore (Lybrand et al., 1986) and of base pair stacking and hydrogen bonding (Bash et al., 1987) have been carried out. In the latter case, the results do not agree quantitatively with experimental estimates but are, again, close enough to warrant optimism.

In parallel, several groups have carried out true macromolecular binding studies. These will be discussed in detail in the following section on molecular dynamics of biopolymers.
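The thermodynamic perturbation approach mentioned above rests on the identity dA = -kT ln <exp(-dU/kT)>, where the average is taken over configurations of the reference state and dU is the energy difference between the perturbed and reference states. The sketch below applies it to a case with a known answer: two one-dimensional harmonic oscillators, for which dA = (kT/2) ln(k1/k0). The harmonic test system, the force constants, and the function names are assumptions made for illustration only.

```python
import math
import random


def free_energy_perturbation(delta_u_samples, kt):
    """Estimate dA = -kT ln < exp(-dU/kT) >_0 from reference-state samples of dU."""
    boltzmann_factors = [math.exp(-du / kt) for du in delta_u_samples]
    return -kt * math.log(sum(boltzmann_factors) / len(boltzmann_factors))


def harmonic_test(k0=1.0, k1=2.0, kt=1.0, n_samples=200_000, seed=1):
    """Perturb a harmonic well of force constant k0 into one of force constant k1.

    Returns (estimate, exact). The reference ensemble is sampled exactly,
    because its Boltzmann distribution is a Gaussian of variance kt / k0.
    """
    rng = random.Random(seed)
    sigma0 = math.sqrt(kt / k0)
    delta_u = []
    for _ in range(n_samples):
        x = rng.gauss(0.0, sigma0)
        delta_u.append(0.5 * (k1 - k0) * x * x)  # U1(x) - U0(x)
    estimate = free_energy_perturbation(delta_u, kt)
    exact = 0.5 * kt * math.log(k1 / k0)
    return estimate, exact


if __name__ == "__main__":
    estimate, exact = harmonic_test()
    print("perturbation estimate:", estimate)
    print("exact result:         ", exact)
```

In a molecular simulation the samples of dU would come from the trajectory of the reference system, and the perturbation is normally broken into several small steps so that the two ensembles overlap well.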
In the area of globular protein structure per se, several attempts have now been made to compare the structural and dynamic behavior of the hydrated model to experimental hydrated crystal structures and to the results obtained in the absence of solvent (Kruger et al., 1985; Teleman, 1986; van Gunsteren and Berendsen, 1984; van Gunsteren and Karplus, 1982; van Gunsteren et al., 1983; Wong and McCammon, in press). The results of these studies are encouraging in that they show that the addition of solvent makes simulated structures agree better with experimental crystallographic atomic positions. Significant quantitative differences remain, however. Further, for hydrated single

proteins, a simulation initiated in the crystallographic protein structure was found to produce an increasingly deviant structure (as measured by the R factor) from the crystallographic structure as the simulation proceeds (Kruger et al., 1985; Teleman, 1986). In a recent hydrated crystal study, similar behavior appeared to be present (Berendsen et al., 1986). One possible interpretation is that these studies simply sample fluctuations that are not fully averaged, and the deviation is only an apparent one; the simulations are relatively short, less than 100 picoseconds. Alternatively, for the noncrystalline simulations, this result may reflect real differences between crystal and solution structures. However, one also should suspect the underlying theoretical interaction potentials as the source of the deviation, and much more testing is required to narrow the alternatives.

In this context, it is important to emphasize that the state of the art has not yet reached a level where computational complexity is the only limiting issue, as a simple example can illustrate. Although short-range solute-solvent forces play a very important role, we have already noted that the dielectric screening of solute charges is of substantial importance. In light of this, it is notable that for those popular water models for which the dielectric constant has been determined, the agreement with experiment is not very good; at room temperature the so-called MCY model yields a value near 35 (Neumann, 1985), while for the structurally and thermodynamically excellent TIP4P model of Jorgensen (Jorgensen et al., 1983) one finds about 50 (Neumann, 1986), and the ST2 model yields about 120 (Steinhauser, 1983), all compared to the experimental result of about 80. Clearly, for the interaction of charges at long range, such discrepancies would be quantitatively disastrous.
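The size of the long-range error implied by these dielectric constants can be gauged with a simple continuum picture, in which the interaction of two charges is Coulomb's law divided by the dielectric constant. That picture is an assumption made here for illustration, as are the function names; the Coulomb prefactor of about 1389.35 kJ Å/(mol e^2) is the standard conversion for charges in units of the electron charge and distances in Angstroms.

```python
COULOMB_KJ_ANGSTROM_PER_MOL = 1389.35  # e^2 / (4 pi eps0) in kJ mol^-1 Angstrom

EXPERIMENTAL_DIELECTRIC = 80.0
# Dielectric constants of the water models as quoted in the text.
MODEL_DIELECTRIC = {"MCY": 35.0, "TIP4P": 50.0, "ST2": 120.0}


def screened_coulomb(q1, q2, r_angstrom, dielectric):
    """Continuum-screened Coulomb energy (kJ/mol) of two point charges."""
    return COULOMB_KJ_ANGSTROM_PER_MOL * q1 * q2 / (dielectric * r_angstrom)


def relative_error(model_dielectric, reference=EXPERIMENTAL_DIELECTRIC):
    """Fractional error of the screened interaction relative to experiment.

    In this picture the error does not depend on the charges or the distance.
    """
    return reference / model_dielectric - 1.0


if __name__ == "__main__":
    for name, eps in MODEL_DIELECTRIC.items():
        print(name, "relative error in long-range energy:", relative_error(eps))
```

Under this assumption a model with a dielectric constant of 35 overestimates long-range charge-charge energies by more than a factor of two, while a value of 120 underestimates them by about one third.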
This does not imply that results obtained, for example, for polypeptide conformational equilibria would have comparable relative errors, but it does indicate that caution is warranted, and that further model development is necessary.

Inexplicit Environmental Modeling

Since the solvent and small ions as such are often not of primary interest, it is in principle simplest to avoid giving an explicit account of the surrounding solution and treat its influence only implicitly. Formally, this can be done by introducing effective, or so-called solvent-averaged, potentials among the solute atoms

of explicit interest. The rigorous existence and formulation for such a reduction is well known. However, such effective potentials are not generally represented by only pairwise interactions, but can be resolved into pair, three-body, and other terms. The pair term, for example, is the potential of mean force between an isolated solute pair in an infinite amount of solvent. The lack of pairwise additivity is present even if the full unreduced system is described by pairwise additive potentials. The use of a continuum dielectric model of an ionic solution represents the simplest form of a pairwise additive effective potential, where, in addition, the pair potential is only roughly modeled.

The question is whether pairwise additivity of the effective potentials is a good approximation. The validity of this approximation remains largely untested, although for ionic solutions pairwise additive semiempirical potentials adequately reproduce experimental solution thermodynamics up to about 1 M concentration (see Friedman et al., 1973 and references therein).

Current macromolecular modeling of the effects of solution environments typically invokes such effective potentials in a relatively crude form. Most often, an effective dielectric constant typical of a nonpolar, polarizable material is used to account for polarization screening of electrostatic charges (Weiner et al., 1984). In some treatments, a modification to the potential to account for short-range, molecular, solvent effects is then added (Gibson and Scheraga, 1967; Nemethy et al., 1978; Hodes et al., 1979a, 1979b; Kang et al., 1987). This added "hydration shell" term introduces a free energy bonus or penalty associated with the close approach of solute atoms, typically proportional to the overlap volume of the first solvation shells of the approaching polypeptide atoms.
This approach is closely analogous to the method widely used in models of ionic solutions pioneered by Friedman et al. (1973).

A significant problem of the implicit approaches to solvents now in use is that they use an ad hoc form for effective potentials, the reliability of which is not well established. The ability of such potential functions to produce correct quantitative results for protein/nucleic acid systems is obviously difficult to assess since such systems are complex, with many theoretical parameters and relatively little experimental data for comparison. However, Go and Scheraga (1984) have demonstrated that such approaches are useful in analyzing differential hydration effects in specific

cases. The current procedure is to establish the parameters for the potentials from the thermodynamics of hydration (Nemethy et al., 1978; Hodes et al., 1979a, 1979b; Kang et al., 1987); this procedure is valid at the present stage. However, in the near future, we should emphasize more direct comparison to spectroscopic and NMR results for small molecules such as oligopeptides.

In fact, some skepticism of the current form of the potentials is warranted, given the results obtained for the most simple solute systems. It is known, for example, that for ionic solutions, a detailed molecular solvent treatment of the interionic potential of mean force does not closely resemble the hydration shell model, although both are consistent with observed thermodynamics (Pettitt and Rossky, 1986). The true effective potential exhibits large oscillations as a function of distance. The minima are shifted in spatial position compared to the simpler model, but the depth of the alternative potentials appears to compare favorably. Hence, the hydration shell model may be a viable way to estimate solvent effects associated with native versus completely unfolded states, but not for intermediate structures. This last hypothesis is consistent with the use of aqueous thermodynamic data to parameterize the potential. It is clear that the folded state prediction per se is of great import in the a priori prediction of protein structure and function.

Recent efforts to generalize the molecular solvent theory available for ionic solutions to the atoms that make up peptides appear promising (Pettitt and Karplus, 1985; Pettitt et al., 1986), but no quantitative comparison to experimental data has yet been made.

Even if we accept a purely continuum fluid description of the solvent environment, the issue of dielectric screening itself is a major one. If one is studying a protein crystal, a small dielectric constant may well be appropriate.
However, the use of such a value to determine structures does not seem warranted. Particularly for atomic charges near the solvent interface in a folded protein, one expects that the pure solvent value is more relevant. Recent work is aimed directly at rigorously examining the usefulness of different distance-dependent dielectric functions for such interactions within a dielectric continuum picture, but for dielectrically inhomogeneous systems (Gilson et al., 1985; Klapper et al., 1986). This work may well produce an optimal and well-founded treatment within the scope of this simplified model.

The approach for nucleic acid modeling is less refined than

for proteins because of the ubiquitous charges and requisite counterions present in solutions of nucleic acids. For these macromolecules, it is insufficient to deal with solvent alone. Current approaches have considered various polyionic charges, solvent, and, in some cases, counterions. In many studies, the polyionic phosphate charges have been artificially reduced to about 25 percent of the physical value to account crudely for counterion association (Hingerty and Broyde, 1982; Tidor et al., 1983). This value arises from the counterion condensation formalism, which describes a required fractional counterion screening for counterions that are far from the polyelectrolyte (Manning, 1978). This approach is a valuable qualitative tool, but we do not expect quantitative results from such a strong approximation.

Only recently have initial all-atom studies of polynucleotide-ion-solvent systems been carried out (Corongiu and Clementi, 1981; Seibel et al., 1985; van Gunsteren et al., 1986), but it is clear that the exceptionally time-consuming nature of these simulations with ions present does not yet permit such calculations to be very informative in practical ways. To simulate a duplex oligonucleotide without added salt for 2 nanoseconds (a relevant motional time scale for the polymer) would require roughly 1,000 hours of supercomputer time.

In the case of nucleic acids, an intermediate approach exists that is not relevant for many proteins. That is, the set of explicitly simulated atoms can be extended to include the macromolecule and ambient ions, while retaining the implicit treatment of only the solvent. In terms of the number of atoms to be followed, the simplification is substantial.

In any of the cases discussed above in which the solvent is treated implicitly, one still must implement realistic potentials of mean force, or, at least, invoke a firmly based dielectric continuum treatment.
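The figure of about 25 percent quoted earlier in this section for the reduced phosphate charge can be rationalized with the counterion condensation picture. In that picture a line of charges with axial spacing b has a charge density parameter xi = l_B / b, where l_B is the Bjerrum length, and for counterions of valence z the residual charge fraction is 1/(z xi) whenever z xi exceeds 1. The sketch below evaluates this; the charge spacing of 1.7 Angstroms for double-helical DNA, the dielectric constant of 78.4, the temperature, and the function names are assumptions not given in the text.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19    # C
VACUUM_PERMITTIVITY = 8.8541878128e-12  # F/m
BOLTZMANN = 1.380649e-23                # J/K


def bjerrum_length_angstrom(dielectric=78.4, temperature=298.15):
    """Distance at which two unit charges interact with energy kT, in Angstroms."""
    length_m = ELEMENTARY_CHARGE ** 2 / (
        4.0 * math.pi * VACUUM_PERMITTIVITY * dielectric * BOLTZMANN * temperature
    )
    return length_m * 1e10


def manning_residual_charge_fraction(charge_spacing_angstrom, counterion_valence=1,
                                     dielectric=78.4, temperature=298.15):
    """Fraction of the polyion charge left after counterion condensation."""
    xi = bjerrum_length_angstrom(dielectric, temperature) / charge_spacing_angstrom
    if counterion_valence * xi <= 1.0:
        return 1.0  # charge density too low for condensation
    return 1.0 / (counterion_valence * xi)


if __name__ == "__main__":
    # 1.7 Angstrom axial spacing per phosphate charge is an assumed value for B-form DNA.
    print(manning_residual_charge_fraction(1.7))
```

With these assumed inputs the residual charge fraction for monovalent counterions comes out a little below one quarter, in line with the reduction used in the studies cited.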
Since the potential payoff of knowing viable implicit solvation routes is very large, it is important to encourage research into implicit modeling in biopolymer-related systems.

A potentially useful approach to the ionic atmosphere that avoids even the intermediate-level treatment of ions is the implementation of solvent and ion-averaged potentials within the biopolymer. The use of reduced phosphate charges is an ad hoc form of such a potential function. An oversimplified but well-founded alternative is the use of a Debye-Hückel-like screening between polymer sites (Hesselink et al., 1973; Manning, 1978;
Soumpasis, 1984), employing the bulk solvent dielectric constant. The latter approach cannot, however, account for the unusually high degree of ionic association that is present in the immediate vicinity of a polyion.

Another unusually promising approach involves applying more analytical theories for the influence of the solution environment, while retaining a detailed description of the biopolymer. In essence, one evaluates the effective potentials that govern the intrapolymer interactions for fixed polymer configuration by (numerically) solving the relevant equations of an essentially analytical theory. An example of significant recent progress along these lines is the work of Pack and coworkers (Klein and Pack, 1983). They have used an algorithm for solution of the Poisson-Boltzmann equation for the ionic distributions around a detailed DNA model, and from such distributions the relevant free energies of different conformations are, in principle, obtainable. At least for simplified models of DNA, the Poisson-Boltzmann mean field theory has proved accurate compared to computer simulation for the same models (Murthy et al., 1985). Closely related approaches have been considered for enzyme-substrate binding involving charged species (Klapper et al., 1986). However, a Poisson-Boltzmann treatment is tied to a dielectric continuum view of the solvent, although dielectric heterogeneity can be readily accounted for within this context.

A brief comment on biopolymer dynamics is appropriate here. Dynamics are clearly connected to the general question of protein folding, and are likely to be significant for function in many cases. Although molecular dynamics may not be directly related to the issue of predicting function, it is clearly connected to the more general question of protein folding. As for the equilibrium time-independent problem, one can, in principle, consider the full atomic description.
However, one can also focus only on an explicit subset of solute atoms, such as biopolymer or biopolymer plus ions. The formal theory to be applied when only some of the atoms are considered explicitly is well established (for recent discussions in a variety of contexts see Adelman, 1982; Ermak and McCammon, 1978; Tully, 1981). The motion proceeds according to the forces prescribed by the effective potential, but with additional random forces and friction due to the implicit solvent collisions. In general, these solvent forces are not simple, but depend on the history of
the solute dynamics (memory) and the current solute conformation (hydrodynamic interaction). In the general case, the relevant equation is the so-called generalized Langevin equation with the friction described by a memory function that embodies the statistical properties of the solvent collisional correlations in space and time. The approximation that neglects any frictional correlation involves only constant drag coefficients and yields the so-called ordinary Langevin equation. Further approximation leads to equations of a diffusion type.

As with most areas of theory in physical science, the time-dependent theory lags behind the equilibrium theory in terms of development and application. A few examples of attempts to apply these methods to realistic and simplified models exist (Levy et al., 1979; McCammon et al., 1980) but only once does it appear that a polypeptide has been examined (Brooks and Karplus, 1986). The problems encountered in any application involve, first, computational limitations, since the dynamics that are of biochemical interest are relatively slow. Perhaps more significant is our extremely limited knowledge of the memory function and hydrodynamic interactions for a complex solute. In principle, we can test our assumptions against all-atom simulations or the relevant functions extracted from these simulations, but this route is itself limited by the current sparsity of the requisite simulations. Nevertheless, future developments in this area seem very likely, although they are further away than are those in equilibrium theory.

Conclusions

The ability to adequately test predictions of theoretical calculations is an element of overriding importance in the future of the modeling of solvation and electrostatics.
This testing can occur at two levels: first and foremost by comparing theory and experiment and second, by comparing results obtained through convenient but approximate theory with those derived from accurate theoretical treatment. The first is essential for accuracy and the second for the future development of viable theoretical methods for studying increasingly complex systems. Therefore, we should continue to encourage both experiment and theory for both macromolecular and smaller model compounds.

We cannot expect an immediate and completely satisfactory way to account for the influence of the solution environment on
the behavior of macromolecules. The above discussion indicates that some unsolved and many partially solved problems remain. Nevertheless, reasonable approaches already exist that provide adequate grounds for qualitative study. The degree of quantitative accuracy is not yet well established and awaits both further model calculations and further thermodynamic and spectroscopic experimental data, so that we may make unequivocal comparisons between theory and experiment. Based on the steady progress described above over only the past few years, we have every reason to expect rapid incremental progress to continue. A clear view of the capabilities of current models and methods for describing flexible small molecules should be available within only a few years. At the same time, the large amount of theoretical activity both in the development of well-founded approximate approaches and in the simulation of atomic-level solvated molecules virtually assures our ability to make the appropriate comparisons between the two in the near future. A very significant element in recent developments has come from algorithmic breakthroughs. Biased sampling techniques and thermodynamic perturbation/integration methods are two new methods that contribute essential capabilities to the theoretical effort. Hence, we should encourage theoretical developments as much as computational applications.

The rapidly increasing access to the necessary computer facilities has contributed significantly to progress, and it is essential that this access continue to grow. For all-atom models of the environment, the current limitations on macromolecular simulation are primarily computational. Although limitations of the model theory are also likely, we now have insufficient data to make that judgement.
An order-of-magnitude increase in available computing power would be enough to make a dramatic difference in this area; two orders of magnitude would permit simulations into the interesting many-nanosecond regime. Such changes are likely within the next five years through the combined effects of new hardware, improved performance, and lower cost.

To explore adequately events such as protein folding that occur on much longer time scales (or involve vast conformational exploration), these computational improvements would still be inadequate by many orders of magnitude. Thus, a theoretical breakthrough appears necessary if we are to make real progress within the next several years. Such a breakthrough would be, for example, the demonstration of an implicit treatment for the
solution environment that yielded accurate biopolymer dynamics on a nanosecond time scale when compared to a full atomic-level simulation. Such a treatment could then be applied for longer times. Considering the very limited knowledge available about the performance of alternative implicit modeling techniques, we cannot now say whether such an approach is workable, even in principle. However, the process of determining the usefulness of these techniques requires a one to two order-of-magnitude gain in computer power.

In summary, the rapid progress in our ability to describe the environmental aspects of biopolymer systems gives solid ground for optimism that this element of biomolecular modeling will not impede development of useful predictive methods. However, for the most challenging aspects, we are at least several years away from demonstrating the ability to mimic accurately solution environmental effects.

HEURISTIC METHODS

There are two major approaches to the prediction of three-dimensional structures of proteins: modeling by extension and hierarchical searching. Both methods can combine heuristic ideas and energy calculations. They differ from the energy calculations described earlier and from each other in the way they arrive at a starting structure. Modeling by extension uses the known structure of a protein or proteins with strong sequence homology to the unknown. The hierarchical methods use packing considerations derived from the crystallography of many proteins. The following section describes these approaches in more detail.

Homology

Protein Homology

Proteins occur in families. Evidence for this comes first from protein sequence homology and then from the architectural similarity of homologous proteins determined by x-ray crystallography and NMR spectroscopy. A family of proteins can be modeled by homology if several conditions are fulfilled. First and most important, the structure of at least one member of the family must
be known. Second, the sequence of the protein to be modeled must be sufficiently homologous to that of the known protein. Many proteins have been modeled over the past five years, and the general consensus is that if two proteins share at least 30 percent similarity, then computer graphics and energy modeling will be useful. If the global homology is less than 30 percent, then it is difficult but not impossible to say whether the two proteins are in the same family. If there are important conserved residues such as disulfide bridges then even 30 percent homology might be sufficient.

The difficulty of modeling a given protein depends on the range of homology with the known structure. When the homology is between 80 and 100 percent, normally only the surface amino acids are changing. In these cases, there is usually no change in peptide length. With homology of between 50 and 80 percent, again, mainly the surface amino acids are changing, but there may be additions and deletions in the peptide length. The amino acids on the surface of the protein can be easily substituted. Energy minimization and/or molecular dynamics are sufficient to reduce the errors caused by any changes in surface sidechain conformation. Surface-charged amino acids under molecular dynamics must be handled either by neutralizing them through artificially altered parameters or by adding a solvent box around the protein.

When interior amino acids are changed from one protein to another, the changes either make a long amino acid shorter, thus creating a hole in the interior of the protein, or change a long-short pair of amino acids to a short-long pair. When amino acids are added or deleted in a helix, they are generally in multiples of three amino acids. This preserves the hydrophobicity/hydrophilicity relations in the helix. In contrast, beta strands tend to have insertions or deletions of two amino acids, thus preserving the inside/outside relations (Feldmann et al., 1985).
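The insertion and deletion patterns just described can be rationalized with a simple periodicity argument. The sketch below is illustrative and assumes idealized geometry: about 100 degrees of rotation per residue in an alpha helix (3.6 residues per turn) and 180 degrees per residue in a beta strand (side chains alternating between the two faces). The function name is hypothetical. It reports how far an insertion of n residues rotates everything downstream of it relative to the original inside/outside pattern.

```python
def wrapped_phase_shift(n_residues, degrees_per_residue):
    """Rotation in degrees, wrapped into (-180, 180], that an insertion of
    n_residues imposes on the residues that follow it."""
    shift = (n_residues * degrees_per_residue) % 360.0
    if shift > 180.0:
        shift -= 360.0
    return shift


# Idealized angular periodicities (assumptions, see the text above).
HELIX_DEGREES_PER_RESIDUE = 100.0
STRAND_DEGREES_PER_RESIDUE = 180.0

for n in range(1, 5):
    helix = wrapped_phase_shift(n, HELIX_DEGREES_PER_RESIDUE)
    strand = wrapped_phase_shift(n, STRAND_DEGREES_PER_RESIDUE)
    print(f"insert {n}: helix {helix:+.0f} deg, strand {strand:+.0f} deg")
```

Under these assumptions a two-residue insertion leaves the alternation of a strand unchanged, while in a helix insertions of three or four residues leave a residual rotation of 60 or 40 degrees, much less than the 100 or 160 degrees produced by one or two residues.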
The inside amino acids of a protein normally are hydrophobic, while the outside amino acids are normally hydrophilic. Graphic modeling of such changes is accomplished by moving the additions and deletions in helices and beta strands toward the ends of the secondary structure feature. Graphic modeling of loops is easy to do but fraught with large inaccuracies. Insertion or deletion of amino acids on a loop can be accomplished by breaking the peptide chain, making the appropriate change, and then bending
the loop ends to accommodate the change. Energy modeling under local conditions of relatively high simulated temperature can be used to explore the local conformational space.

There are two ways to align sequences, automatically and manually. The automatic alignment methods such as Wilbur and Lipman tend to align the sequence for highest local match. Manual methods (see Feldmann et al., 1985) permit the alignment of secondary structure features, which minimizes the number of disturbances that must be made to the protein in converting from the crystallographic structure to the model structure.

One of the ways to overcome the uncertainties of the structure of a particular loop is to build a library of representative loops. Alwyn Jones at Uppsala has done this and recently integrated it into FRODO, his modeling program.

After all the changes have been made either by graphic methods or by using a loop library, extensive molecular dynamics simulation is generally used to improve the quality of the model. Whether a broad range of scientists can use molecular dynamics calculations depends on the availability of appropriate modeling software and on a variety of displays and workstations. To complete the modeling by molecular dynamics calculations, sufficient computer power must be available either on a personal supercomputer (PSC) or by network access to a national supercomputer.

Modeling by Extension

Twenty years ago, Phillips (1967) made a model of the protein lactalbumin without obtaining a single crystal. This was possible because the amino acid sequence of lactalbumin had been found to be 35 percent identical with that of the enzyme lysozyme, a protein whose crystal structure had recently been determined. The residues for lactalbumin were simply placed in the equivalent positions known to be occupied by the residues of lysozyme (Browne et al., 1969). Subsequently, Warme et al.
(1974) underscored the validity of the approach through computational studies of the structures of the two proteins. The structure of alpha-lactalbumin computed by energy minimization by Warme et al. (1974) has recently been verified by x-ray crystallography (D.C. Phillips, personal communication, 1987). Since then, "modeling by homologous extension"
has become a common, if sometimes casually applied, approach that has benefited from modern computer graphics (Greer, 1985).

A recent example of the utility of modeling by extension involves angiogenin, an important but elusive factor in the biogenesis of blood vessels that was isolated after a search of more than 15 years. The sequence of this factor was determined and found to be 45 percent identical to pancreatic ribonuclease. As a result, Palmer et al. (1986) promptly generated a three-dimensional structure.

Naturally, the closer the resemblance of the unknown protein to the one whose structure has been determined, the more accurate the modeled structure. Recently, however, Moult and James (1986) have shown that it is possible to construct good models even when the sequence resemblance is barely recognizable. In many instances, then, all that is needed is the family relationship of the new protein.

Exon Shuffling

Many recently evolved proteins exhibit evidence of "exon shuffling." In this phenomenon, mosaic proteins result from the genomic rearrangement of segments that encode small portions of different proteins. For such proteins, it is thought that the peptide segments, which often range from 30 to 90 amino acids in length, all fold independently (Doolittle, 1985); as such, they constitute domains in the truest sense. As of mid-1987, about six such domains had been found in a variety of proteins in different three-dimensional settings. Most of them are tightly folded and contain disulfide bonds that hold the structure in place. They include such well-known motifs as the "EGF domain," the "Kringle," and the "fibronectin fingers." They are readily identified by routine computer searches of sequence, and, when such a structure has been identified, can be immediately modeled in place.
Exon shuffling, which is due at least in part to the additional recombination that ensues from the presence of introns between exons, is not restricted to the small set of stable structures listed above, and it is anticipated that hybrid and mosaic proteins of many sorts will be identified. In all these situations, prior knowledge of any structural motif will aid in interpreting the overall structure. Doubtless, other motifs will emerge as more sequences are determined, compared, and analyzed.
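The percent-identity figures quoted in this section, and the rough difficulty ranges given under Protein Homology, can be made concrete with a small sketch. It is illustrative only: the function names, the gap convention (columns in which either sequence has a gap are skipped), and the toy sequences are assumptions. The text uses "homology," "similarity," and "identical" somewhat loosely, whereas the code counts strict residue identity in an alignment that is assumed to be given.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the gap-free aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # insertion or deletion: not counted either way
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def modeling_outlook(identity_percent):
    """Map an identity value onto the qualitative ranges given in the text."""
    if identity_percent >= 80.0:
        return "mostly surface substitutions; chain length usually unchanged"
    if identity_percent >= 50.0:
        return "mostly surface substitutions; insertions and deletions possible"
    if identity_percent >= 30.0:
        return "graphics and energy modeling expected to be useful"
    return "family membership uncertain; look for conserved residues"


# Toy aligned fragments (hypothetical, not real protein sequences).
known = "ACDEFGHIKL"
target = "ACDQFG-IKV"
identity = percent_identity(known, target)
print(f"{identity:.1f} percent identical: {modeling_outlook(identity)}")
```

The thresholds are the rough ranges stated in the text, not sharp boundaries, and a real comparison would also depend on how the alignment itself was produced.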

HIERARCHICAL MODELS OF PROTEIN FOLDING

The major goal of hierarchical modeling is to build three-dimensional structures that incorporate directly or indirectly the architectural principles by which nature constructs globular proteins. The direct approach uses empirical rules or models that capture recognized aspects of protein folding. The indirect approach uses homology to provide the basic structure and explores the local environment with energy calculations or other modeling efforts. This latter work was described in the previous section. Here, we assess the success of rule-based procedures.

Investigators have used these procedures at various levels of formality. These efforts generally follow the same plan: predicting secondary structure followed by packing secondary features. They also use the Kauzmann hydrophobic model as the basic packing principle. Classifying protein domains by structure has been particularly important because it provides major rules (for a discussion and extensive references see Richardson, 1981; Cohen et al., 1983). Such rules include statements about the geometry of helices packing against helices and beta sheets and beta sheet-beta sheet packing. The efforts have also led to the development of a list of rules concerning the ordering of strands within a beta sheet. We should note clearly that rules such as these only summarize what is observed in known structures; they are not derived from first principles of physics or chemistry. Nevertheless, many of the idealized structures built to be consistent with such rules do look recognizably like protein domains and some are rather close (average error 4 Å) to the crystallographic result.

PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE

We differentiate the techniques of pattern recognition from those of artificial intelligence, especially its subdiscipline of expert systems.
Pattern recognition is usually defined as including numerical techniques for clustering observed data into binary or higher order categories (but see below). Artificial intelligence is usually defined to include use of rule-based systems to express and use empirical knowledge of a subject.

The usual techniques of pattern recognition, such as linear discriminant analysis, cluster analysis, and other parametric and
nonparametric methods, have not proved useful in the analysis of protein structures and functions. These methods pose difficulties in determining the statistical significance of a derived classification. Given the relative lack of knowledge about structure-function relationships in complex molecules such as proteins, it is very difficult even to pick a reasonable set of structural descriptors upon which to build a clustering scheme.

Other definitions of pattern recognition, however, are less controversial in technique, if not in interpretation of results. These techniques involve, for example, the presentation of three-dimensional protein structures in the form of C-alpha distance maps (Kuntz, 1975; Rao and Rossmann, 1973) to infer the presence of secondary and supersecondary structures and location of intron/exon boundaries (Go, 1981).

The technology of expert systems may be applied when empirical knowledge, which can be expressed in rule-based systems, can be used to solve problems. For example, production rules of the form IF (x) THEN (y) may be an integral part of a hierarchical pattern comparison described previously (Figure 4-1). To achieve high performance, such rule-based approaches are often supplemented with methods and data from other sources. Two systems under development, KARMA (Klein et al., 1986) and PROTEAN (Hayes-Roth et al., 1986), illustrate this point. The KARMA system employs rule-based proposal and evaluation of small molecules and their predicted binding activities in receptor sites of known proteins. A variety of mathematical and graphic procedures are used to evaluate candidate structures and their affinities for binding. PROTEAN uses artificial intelligence techniques with interatomic (nonbonded) distance information from NMR and a variety of mathematical and graphic techniques to explore structural possibilities for proteins of known primary structure.

Cohen et al.
(1983, 1986b) have carried out one of the most extensive projects in their studies of alpha/beta domains and four-helix bundles. In the former case, they identified secondary features by using pattern matching and then built tertiary structures from exhaustive combinatorial packing of the secondary elements. In the most favorable case, flavodoxin, they could generate a unique prediction for the alpha carbons involved in helix or sheet of the protein. Similar predictions for molecules such as interleukin-2
have been made based on a core structure of four helices (Cohen et al., 1986b).

The important strengths of such projects are (1) they achieve low resolution structures of the central residues of proteins that contain many protein features, including a prediction of the "active" portion of the molecules; (2) the computational labor is modest; and (3) the rule system can be tested directly against known structures and their homologs. In some sense, they are the best solution currently available to the folding problem. On the negative side, the low resolution is an obvious limitation. Details of loops are often neglected. Only certain classes of proteins can be dealt with successfully.

The next several years should bring improvements in all aspects. More realistic models will reduce errors. Loops and side chains can be treated either from rule-based or energy-based approaches. Expansion to more protein structural classes is proceeding rapidly. It is difficult to see the ultimate limitations of these heuristic methods. We are hopeful that they yield good first-order approximations that can be refined by the energy minimization and molecular dynamics calculations.
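The C-alpha distance maps mentioned under pattern recognition can be illustrated with a minimal sketch. It is not the procedure of the cited authors; the ideal-helix geometry (radius 2.3 Å, rise 1.5 Å, and 100 degrees per residue) and all function names are assumptions chosen for illustration. In such a map a helix appears as a band of short distances close to the diagonal.

```python
import math


def ideal_helix_ca(n_residues, radius=2.3, rise=1.5, turn_degrees=100.0):
    """C-alpha coordinates (angstroms) for an idealized alpha helix."""
    coords = []
    for i in range(n_residues):
        angle = math.radians(i * turn_degrees)
        coords.append((radius * math.cos(angle),
                       radius * math.sin(angle),
                       i * rise))
    return coords


def distance_map(coords):
    """Square matrix of pairwise C-alpha distances."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]


def contacts(dmap, cutoff=7.0, min_separation=3):
    """Pairs (i, j) at least min_separation apart in sequence and closer
    than cutoff in space."""
    n = len(dmap)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_separation, n)
            if dmap[i][j] < cutoff]


helix_map = distance_map(ideal_helix_ca(8))
helix_contacts = contacts(helix_map)
print(f"adjacent C-alpha distance: {helix_map[0][1]:.2f} angstroms")
print("sequence separations in contact:",
      sorted({j - i for i, j in helix_contacts}))
```

With the assumed geometry and a 7 Å cutoff, only pairs three or four residues apart are in contact, which is the near-diagonal signature that would be read from a map as helical. The cutoff is an arbitrary illustrative choice; strands and packed elements would produce different, off-diagonal patterns.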


