OCR for page 69
6.
Tertiary Structure of Proteins and
Nucleic Acids: Theory
ENERGY OPTIMIZATION
According to the thermodynamic hypothesis, based on Anfin-
sen's (Anfinsen et al., 1961) experiments on the oxidative refolding
of bovine pancreatic ribonuclease A from the reduced form, the
amino acid sequence determines the three-dimensional structure
of a protein in a given medium as the thermodynamically most
stable one. It must be emphasized that this hypothesis applies
to the length of the polypeptide chain at the stage when folding
takes place, and not at some subsequent processing stage. For
example, this hypothesis would be applicable to the single-chain
pro-insulin and (possibly) not to the processed product, the two-
chain disulfide-linked insulin. Thus, it is a challenge to chemists to
understand how the interatomic interactions within the polypep-
tide chain and the interactions between the atoms of the chain and
those of the surrounding solvent lead to the thermodynamically
stable structure, that is, the one for which the statistical weight of
the system is a maximum.
The hypothesis that the statistical weight is a maximum im-
mediately implies that some kind of optimization strategy is neces-
sary to find the most stable structure. This requires procedures to
generate arbitrary conformations of a polypeptide chain, compute
the statistical weight for each conformation, and then alter the
conformation so that it ultimately corresponds to the global maxi-
mum of the statistical weight. Although the problem is formidable,
current indications are that it can be solved using a sound scientific
approach without resorting to ad hoc procedures to deduce
"folding rules" that do not explain the molecular basis of such
"rules."
To compute the global statistical weight, one optimizes the
conformational energy of the polypeptide, incorporates the effect
of the solvent, and takes account of the entropy of the system.
Procedures are available for carrying out such computations, and
currently available supercomputers permit the computations to
be applied to very large systems. Although these procedures will
undoubtedly improve, they are now adequate for computing and
obtaining results that can be checked experimentally. The major
difficulty still to be overcome, although partial success has been
achieved (see review by Gibson and Scheraga, 1988; also, Robson,
1986; Crippen, 1984), arises from the presence of many local minima
in the multidimensional energy surface. Although algorithms
are available for minimizing an energy function of many variables,
there are no efficient ones for passing from one local minimum,
over a potential barrier, to the next local minimum and ultimately
to the global minimum of the potential energy in a very
high dimensional space. Thus, minimization leads to the nearest
local minimum, where the procedure is trapped. This trapping
in a local, rather than the global, minimum is referred to as the
"multiple-minima" problem. Efforts are being made to overcome
this problem using a variety of procedures, including approxima-
tions that place the system in the potential well in which the global
minimum lies (Gibson and Scheraga, 1988). Then, any approximations
(introduced in the initial stages) are abandoned, and the
full energy function is minimized. The use of molecular dynamics
for minimization is an alternative strategy; it is considered in the
next section.
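The trapping behavior described above can be illustrated with a minimal sketch. It assumes a hypothetical one-dimensional double-well function in place of a real conformational energy surface; the function names, the step size, and the starting points are illustrative assumptions, not part of any published program.

```python
def energy(x):
    """Hypothetical double-well 'energy surface' (illustrative only)."""
    return x**4 - 2.0 * x**2 + 0.5 * x

def gradient(x):
    """Analytical derivative of energy(x)."""
    return 4.0 * x**3 - 4.0 * x + 0.5

def minimize(x, step=0.01, tol=1e-10, max_iter=100_000):
    """Plain steepest descent: it only ever moves downhill, so it stops
    in whichever well contains the starting point."""
    for _ in range(max_iter):
        g = gradient(x)
        if abs(g) < tol:
            break
        x -= step * g
    return x

x_right = minimize(1.5)    # starts in the shallower right-hand well
x_left = minimize(-1.5)    # starts in the deeper left-hand well
print(x_right, energy(x_right))
print(x_left, energy(x_left))
```

The run started at 1.5 converges to the higher-energy well and never reaches the lower one, which is the multiple-minima problem in miniature.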
The energy minimization approach and its associated computational
program, described here for polypeptides and proteins,
are equally applicable to any other type of macromolecule, such
as polynucleotides and polysaccharides, as well as to interactions
between the various types of macromolecule. General references
to this methodology include the works of Anfinsen and Scheraga
(1975), Nemethy and Scheraga (1977), Levitt (1982), Karplus and
McCammon (1983), Scheraga (1984), and Richards (1986).
Generation of an Arbitrary Conformation
Analytical geometry and associated matrix algebra provide
the tools to generate a polypeptide chain. To do so, internal co-
ordinates (dihedral angles for rotating about bonds) or Cartesian
coordinates may be used as independent variables. When inter-
nal coordinates are used, the bond lengths and bond angles are
held fixed, but chosen very carefully from x-ray structures of model
systems so as to properly reflect the geometric results of side chain-
backbone interactions within each amino acid residue. The validity
of this approach has been demonstrated for both polypeptides and
polysaccharides, systems in which strained conformations rarely
arise (Scheraga, 1984). When Cartesian coordinates are used, and
hence bond lengths and bond angles are allowed to vary, one faces
the problem that force constants for these motions are not as well
known as the geometric features of the molecule. Further, bond-angle
bending modes are anharmonic, and no currently available
forcefield takes anharmonicity into account. Thus, we will have
to overcome inadequacies in the force constants and problems of
anharmonicity before we can rely on computed bond angles. Impo-
sition of fixed bond lengths and bond angles is at least reconciled
with observed crystal structures of small molecules. The subject
of fixed versus flexible geometries has been discussed by Swenson
et al. (1978).
Potential Functions
Many research groups have developed potential energy func-
tions with which to carry out such computations on polypep-
tides, polysaccharides, polynucleotides, and synthetic polymers.
The various potential functions have many similarities, but differ
enough in detail to make them difficult to compare. This difficulty
is compounded by the small number of cases for which parameters
that characterize the strengths of the various interactions have
been established in a self-consistent way on experimental data
such as crystal structures, lattice energies, and barriers to inter-
nal rotation. At present, the potential functions for polypeptides,
polysaccharides and synthetic polymers have been better parameterized
than have those for polynucleotides, primarily because
there are fewer reliable data for model nucleotide compounds.
Similarly, more experimental data will be required for proper pa-
rameterization of the potential functions for the prosthetic groups
attached to biological macromolecules. Until recently, the poten-
tial functions involving water molecules still would not account
adequately for the observed radial distribution function of water,
but this situation is improving (see, e.g., the work of Jorgensen et
al., 1983).
It must be emphasized that the molecule responds to the to-
tal (in principle, quantum mechanical) potential function, and
its partition into various empirical components (such as non-
bonded, electrostatic, hydrogen-bonding, and other interactions)
is at best arbitrary, although there is some physical basis for assigning
such names to the various components. Consequently, one
must avoid combining components from forcefields from different
research groups. Each forcefield must be parameterized by itself
in a self-consistent way. The total energy of a given conformation
is expressed as a sum of the energies of all nonbonded pairs of
atoms.
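The pairwise summation can be sketched as follows. This is a minimal illustration under stated assumptions: the 12-6 Lennard-Jones plus Coulomb form, the constant dielectric, and the parameter values `eps` and `sigma` are chosen here for illustration and are not the parameters of any particular forcefield discussed in the text. The exclusion of bonded pairs is also omitted for brevity.

```python
import math
from itertools import combinations

COULOMB = 332.06  # kcal*Angstrom/(mol*e^2); converts q1*q2/r to kcal/mol

def pair_energy(r, eps, sigma, q1, q2, dielectric=1.0):
    """Energy of one atom pair: 12-6 Lennard-Jones plus Coulomb (assumed form)."""
    sr6 = (sigma / r) ** 6
    lennard_jones = 4.0 * eps * (sr6 * sr6 - sr6)
    coulomb = COULOMB * q1 * q2 / (dielectric * r)
    return lennard_jones + coulomb

def total_energy(coords, charges, eps=0.1, sigma=3.4):
    """Sum the pair energy over every distinct pair of atoms."""
    total = 0.0
    for i, j in combinations(range(len(coords)), 2):
        r = math.dist(coords[i], coords[j])
        total += pair_energy(r, eps, sigma, charges[i], charges[j])
    return total

# Three hypothetical atoms (coordinates in Angstroms, charges in units of e).
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
charges = [0.0, 0.0, 0.0]
print(total_energy(coords, charges))
```

Because every term depends on the same parameter set, swapping in a single component from another group's forcefield would break the self-consistency that the text insists on.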
When considering biological function, such as the formation
of an enzyme-substrate complex, then the pair interactions both
within and between the two partners of the complex must be
included in the total energy. Thus, the influence of each member of
the complex on the other (the so-called induced fit phenomenon)
is taken into account. Consequently, the conformations of the
partners in the complex can differ from their conformations as
individual species. This means that the computed conformation
of an isolated hormone may not resemble the biologically active
one when it is bound to a receptor (Scheraga, 1984).
Although potential functions could still be improved, those for
which parameters have been determined in a self-consistent way
have led to many computed structures that have subsequently
been checked by experiment. For example, the computed structure
of the collagen-like poly(Gly-Pro-Pro) (Miller and Scheraga,
1976) agrees with the subsequently determined crystal structure
(Okuyama et al., 1976) within a root-mean-square (rms) deviation
of 0.3 Å. Scheraga (1984) has cited many similar examples of
experimental verification of computed structures. Therefore, we
can be confident that the problem of adequate potential functions
is not serious, and more effort should be devoted to the most
difficult problem of all, the multiple-minima problem. After the
multiple-minima problem is solved, it will then be worthwhile to
reexamine the possibility or necessity of improving the potential
functions further.
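The rms deviation used above to compare a computed structure with a crystal structure can be sketched as follows. This is a minimal illustration that assumes the two coordinate sets list the same atoms in the same order and have already been superimposed; the coordinates are made up and the function name is hypothetical.

```python
import math

def rms_deviation(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of
    (x, y, z) points that are assumed to be already superimposed."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must contain the same number of atoms")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Made-up coordinates in Angstroms: every atom is displaced by 0.3 along y.
computed = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
observed = [(0.0, 0.3, 0.0), (1.5, 0.3, 0.0), (3.0, 0.3, 0.0)]
print(rms_deviation(computed, observed))
```

In practice an optimal rigid-body superposition would be carried out first; that step is left out of this sketch.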
Solvation
There are several methods to include the effect of hydration
in the computations. Hydration tends to force the polar groups to
the surface of the molecule, putting them in contact with water,
and forces the nonpolar groups to the interior, removing them
from contact with water.
One method includes the water molecules explicitly, and cal-
culates the interaction energy between the molecule and the water.
The success of this approach depends on the adequacy of the po-
tential function describing the water-water interaction. Another
approach ignores the structural features of the water molecule,
and assigns a hydration shell (and an accompanying free energy
of hydration) to each atom or group of atoms. As the conformation
changes and hydration shells overlap, a free-energy penalty is
assessed. The hydration-shell model is parameterized on experimental
data on free energies of hydration (Kang et al., 1987), but
current efforts are being made to obtain such free energies from
Monte Carlo and molecular dynamics studies of aqueous solutions
of small molecules (see, e.g., Jorgensen et al., 1983). The computation
of free energies by these simulation techniques faces many
theoretical obstacles, although this very active field of research is
progressing quickly.
Entropy
A variety of methods exist to incorporate entropic effects.
One entropic effect arises from the hydration; this effect is treated
as described earlier. The other entropic effect arises from the
conformational fluctuations of the molecule. Several procedures
may be used to compute this contribution, the most direct being
the evaluation of the second derivative of the potential function.
This describes the curvature at the bottom of the potential energy
well and hence the fluctuation in conformation about each local
minimum in the potential energy surface. Local minima that are
not the global minimum of the potential energy can become the
conformations of higher statistical weight if there is a large enough
entropy gain from large conformational fluctuations. Paine and
Scheraga (1987) have encountered such effects.
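The way curvature enters the statistical weight can be shown with a small sketch. It assumes a classical harmonic approximation in one dimension, where the weight of a well is the Boltzmann factor of its minimum energy multiplied by a width factor that grows as the second derivative shrinks. The well depths and curvatures below are hypothetical numbers chosen only to show the effect.

```python
import math

KT = 0.593  # kcal/mol, roughly room temperature (about 298 K)

def well_weight(e_min, curvature, kt=KT):
    """Classical statistical weight of a 1-D harmonic well:
    exp(-E_min/kT) * sqrt(2*pi*kT / curvature)."""
    return math.exp(-e_min / kt) * math.sqrt(2.0 * math.pi * kt / curvature)

# Hypothetical wells: a deep narrow one and a slightly higher, much broader one.
narrow = well_weight(e_min=0.0, curvature=100.0)   # global energy minimum
broad = well_weight(e_min=0.5, curvature=1.0)      # local energy minimum
print(broad / narrow)
```

With these numbers the broad local minimum carries the larger weight even though its energy is higher. For many variables the width factor would involve the full second-derivative matrix rather than a single curvature.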
Optimization Procedures
Optimization procedures are available for searching for the
local minima. These include direct energy minimization, Monte
Carlo, and molecular dynamics procedures. In energy minimization,
the variables describing the conformation are altered systematically
so as to lower the energy continuously. The Monte Carlo
method makes random changes in the conformational variables
and accepts the new conformation according to various protocols
that compute the energies before and after the random changes in
the conformation. In molecular dynamics calculations, Newton's
equations of motion for the atoms of the macromolecule (subject
to interatomic forces determined by the potential functions) are
solved to obtain a trajectory in conformational space. Very ef-
ficient energy minimization algorithms exist, even for functions
of many variables, but lead only to local minima (however, see
the following section). Conventional Monte Carlo procedures can
overcome local minima but tend not to cover conformational space
efficiently enough. However, this difficulty is being overcome using
modifications that include adaptive importance sampling and
other efficiency-seeking procedures (Gibson and Scheraga, 1988).
Molecular dynamics can also surmount local barriers, but the picosecond
time scale of practical computations does not approach
the millisecond time scale of actual protein folding.
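One common acceptance protocol of the kind mentioned above is the Metropolis criterion, sketched here on a hypothetical one-dimensional double-well energy. The choice of Metropolis, the energy function, the temperature `kt`, the move size, and the fixed random seed are all assumptions made for illustration; the adaptive importance sampling variants cited in the text are not shown.

```python
import math
import random

def energy(x):
    """Hypothetical double-well energy (illustrative only)."""
    return x**4 - 2.0 * x**2 + 0.5 * x

def metropolis(x0, n_steps, kt, max_move=0.5, seed=0):
    """Metropolis Monte Carlo: propose a random move, always accept it
    if the energy drops, otherwise accept with probability exp(-dE/kT).
    Returns the lowest-energy point visited."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(n_steps):
        x_new = x + rng.uniform(-max_move, max_move)
        e_new = energy(x_new)
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / kt):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

# Start in the shallower right-hand well and let the walk cross the barrier.
best_x, best_e = metropolis(x0=0.93, n_steps=20_000, kt=0.3)
print(best_x, best_e)
```

Unlike pure minimization, the occasional uphill moves let the walk leave the starting well. In a space of many variables the same scheme samples far less efficiently, which is the limitation noted in the text.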
Because most published papers do not provide the relevant
data, it is difficult to compare the computer time needed for
various optimization procedures. The required computation time
will be very sensitive to how well the code is optimized, whether
parallel processing is carried out, and other factors. To obtain
such benchmarks, it would be necessary to run each procedure on
several different computer systems, a task not yet undertaken.
Solutions to the Multiple-Minima Problem for Macromolecules
Since no mathematical procedures are available to locate the
global minimum for any macromolecule (except in energy surfaces
of very low dimensionality), as mentioned above we must first resort
to approximate procedures to obtain structures that might lie
close to that of the native macromolecule. Then, the approximations
are abandoned, and a full-scale energy minimization, Monte
Carlo, or molecular dynamics procedure is carried out. A variety
of such procedures have been developed (Gibson and Scheraga,
1988). These include:
• a "build-up" method, in which large structures are built
  up from ensembles of low-energy conformations of
  smaller ones (with energy minimization being carried
  out at each stage);
• optimization of electrostatic interactions;
• optimization in a space of high dimensionality where
  fewer intervening barriers exist (with subsequent relaxation
  back to three dimensions);
• Monte Carlo sampling among local minima (accompanied
  by energy minimization);
• adaptive importance Monte Carlo sampling (to drive the
  system more efficiently to the global minimum);
• pattern recognition methods to assemble organized backbone
  structures such as alpha-helices and beta-sheets;
• use of distance constraints either from experiment (nuclear
  Overhauser effects (NOEs), nonradiative energy
  transfer, or NMR on spin-labeled molecules) or from
  statistical analysis of x-ray structures of proteins; and
• empirical methods to predict the locations of alpha-helices,
  beta-sheets, and beta-turns.
Numerous valid predictions of global minimum structures of
peptides have been made using these methods (Gibson and Scheraga,
1988). However, they have thus far been successful only for
structures that contain at most 20 residues, and current efforts
(most of which require access to supercomputers) are being made
to extend these methods to larger molecules, namely to proteins containing
on the order of 100 amino acid residues.
Successes and Failures
Numerous structures have been predicted and subsequently
confirmed experimentally (Scheraga, 1984). The right- or left-handed
twists of the fundamental structures (alpha-helices and
beta-sheets) from which proteins are built have been accounted
for by energy minimization. The observed packing features of
alpha-helices and beta-sheets have likewise been accounted for in
energetic terms. Parameters calculated for conformational transitions
(e.g., the helix-coil transition in water) have been verified
by experiment. The computed structures of open-chain and cyclic
molecules (e.g., the 20-residue membrane-bound portion of melittin
and the 10-residue gramicidin S, respectively), and those of
collagen-like poly-tripeptides have also been verified by experiment
(Scheraga, 1984). Finally, the computed structure of an
enzyme-substrate complex (hen egg white lysozyme and a hexas-
accharide substrate) (Pincus and Scheraga, 1979) has been verified
by experiment (Smith-Gill et al., 1984). These and other examples
should give us confidence in the validity of the potential functions
and computational methodology (Gibson and Scheraga, 1988).
The failures, in the sense of not yet having solved the protein-
folding problem, exist because no one has yet used optimization
techniques to deduce the three-dimensional structure of even a
small protein, such as the 58-residue bovine pancreatic trypsin
inhibitor (BPTI). Current procedures applied to BPTI have not
yet yielded a computed structure that comes closer to the x-ray
structure than 2-3 Å. Several procedures that work to overcome
the multiple-minima problem on small molecules become computationally
intensive as they are used on larger molecules. However,
the increasing use of supercomputers will help overcome this problem.
Impediments to Progress
Although supercomputers will allow larger calculations and
thus cover conformational space better, workers in this field will
need additional time to be allotted on these machines to do the
research necessary to achieve greater efficiency. Parallel processing
offers a breakthrough, and this will require new software to
take advantage of the hardware enhancements. With new hardware
and software, it should be possible to surmount the major
hurdle created by the multiple-minima problem. However, it is
conceivable that bottlenecks may develop as we attempt to scale
up procedures that work on 20-residue segments to proteins containing
100 to 200 residues. We will also need imaginative new
approaches to overcome this problem.
Potential functions should be improved, especially those for
polynucleotides and prosthetic groups, and for water-water inter-
actions, but this is not now the most serious problem. Certainly,
this problem should be addressed again when the multiple-minima
problem is solved for bovine pancreatic trypsin inhibitor.
Finally, new developments will be needed to bring molecular
dynamics from the picosecond to the millisecond time scale.
Future Prospects
At every stage in the development of conformational energy
calculations over the past 25 years, we always seemed to face in-
surmountable obstacles. However, the steady progress during this
period indicates that many of these obstacles have been overcome.
The remaining major hurdle is the multiple-minima problem (Gibson
and Scheraga, 1988), but we have an array of possible solutions
to it. The solutions have worked for small molecules, and current
and impending developments in computer hardware and software
should justify our confidence that, within 5 to 10 years, we may ex-
pect to understand how interatomic interactions dictate not only
the final folded structure but the pathways taken by the newly
formed polypeptide chain to reach the native structure.
MOLECULAR DYNAMICS
The principle behind a molecular dynamics simulation is simply
the application of Newton's equations of motion to the atoms
of one or more molecules. Newton's equations relate three in-
dependent quantities: time, conformation (atomic coordinates),
and potential energy. As the calculation progresses and the posi-
tions and velocities of the atoms change, the system will traverse
many different states; as the simulation is prolonged, the observed
states together approach a perfect sample of the thermal equilibrium
ensemble of all states the system will occupy. The thermal
equilibrium distribution may also be sampled without considering
motion, using appropriate purely statistical methods (Monte Carlo
techniques). In principle, a Monte Carlo calculation might produce
a representative sample using less computer time. Noguti and
Go (1985) indicate how, with knowledge of the second-derivative
matrix of the potential energy, the atomic coordinates can be effectively
used to speed up the Monte Carlo process. However, it is
as yet uncertain whether this accelerated Monte Carlo procedure
produces a more rapid exploration of conformation space of a pro-
tein than a molecular dynamics simulation. Thus, the molecular
dynamics simulation gives us a way to make theoretical estimates
of mean atomic positions and deviations from the mean; of rates
of motion and conformation change; and of ensemble averages,
including thermodynamic functions such as energy, enthalpy, spe-
cific heat, and free energy. Since free energies can be expressed as
equilibrium constants and vice versa, simulations are being used
to obtain theoretical estimates of differences of affinity of proteins
for small molecules. Recent results show remarkably good agree-
ment with experimental values. Major pharmaceutical companies
have already noted the usefulness of accurately predicting these
differences.
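The stepwise application of Newton's equations can be sketched for a single degree of freedom. The velocity Verlet integrator used here is an assumption (the text does not name an integration scheme), and the harmonic force, unit mass, and time step are hypothetical values chosen so the example runs quickly.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equation m*a = F(x) with the velocity Verlet scheme.
    Returns the list of (position, velocity) pairs along the trajectory."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append((x, v))
    return trajectory

# Hypothetical harmonic "bond": F = -k*x with k = m = 1, so the period is 2*pi.
k, m = 1.0, 1.0
traj = velocity_verlet(x=1.0, v=0.0, force=lambda x: -k * x, mass=m,
                       dt=0.05, n_steps=2000)
energies = [0.5 * m * v * v + 0.5 * k * x * x for x, v in traj]
mean_x = sum(x for x, _ in traj) / len(traj)
print(mean_x, max(energies) - min(energies))
```

Averages such as `mean_x` are the simplest examples of the ensemble averages mentioned above. The small spread in total energy is a routine check that the time step is short enough.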
Molecular dynamics simulations, although simple in concept,
were not practical before the advent of high speed computers. This
method of theoretical chemistry is particularly useful for the study
of condensed phases and was first used to study the structure and
dynamics of liquids. Later, several investigators applied existing
techniques to protein molecules (Karplus and McCammon, 1983;
Berendsen [cf. Hermans, 1985; Beveridge and Jorgensen, 1986]).
At present, several laboratories are active in the field. More are becoming
involved as the methods are applied to increasingly quantitative
studies that aim to reproduce experimental observations
as closely as possible in the computer model. Many investigators
express the belief that molecular dynamics calculations will
soon produce useful predictions of structure, dynamics and ther-
modynamics of proteins, nucleic acids, and complexes of these
macromolecules with one another and with other molecules.
The simulation requires two pieces of information at the outset:
a starting conformation and a potential energy function or
forcefield. For a protein, current technology requires that the
starting conformation be firmly based on experimental observa-
tion: because many conformations exist at local minimum energy,
a conformation that is very different from the correct most stable
conformation evolves too slowly to reach the correct conformation
in the length of a typical calculation.
The forcefield is a very simple empirical approximation to the
underlying physics, which properly should be expressed in terms
of quantum mechanics but is totally unmanageable in that form.
Parameters of the forcefields now in use have been proposed on
the basis of a variety of experimental data and to some extent
on theoretical considerations. Overall, the several forcefields pro-
posed by different groups for computation of the internal energy of
proteins tended to have very similar sets of parameters. Recently
developed forcefields for water-water and water-protein interac-
tions permit the simulation of dynamics of proteins in solution,
which is a prerequisite for modeling events at the protein surface,
including most interactions of proteins with other molecules. (The
problems of developing adequate forcefields are discussed in the
following section on "Solvation.")
Carrying out molecular dynamics simulations of proteins is
very much an art of the feasible, the limiting factor always being
the available computing power. One is always facing the consequence
of an inescapable physical fact: that the most rapidly fluctuating
atomic motions, bond-stretching and bond-angle-bending
vibrations, have periods of the order of a few femtoseconds
(10⁻¹⁵ sec). Current simulation methodology requires
that periodic motions be sampled several times per period, and
each sampling requires an evaluation of the system's potential
energy, requiring computer time in milliseconds on the fastest
machines, Cray and Cyber 205. Clearly, simulations cannot now
span a time that is on the biological time scale of microseconds to
seconds. Unavoidably, molecular dynamics simulations use simple
forcefields to span a longer time. Given more computer time,
those working in the field will improve simulations in various ways:
use of more detailed forcefields; longer simulations; simulation of
larger systems posing new physical and biological questions; and
application of new, more time-consuming, dynamics methods to
ask different questions about the system. To those working in the
field, the future is bright; ideas and interesting problems abound,
and new computer technology continues to widen the limits of
feasibility.
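The gap between simulated time and computer time follows from simple arithmetic, sketched below. The 1 femtosecond time step and the 1 millisecond cost per energy evaluation are assumed round numbers in the ranges quoted above, not measured benchmarks, and the function name is hypothetical.

```python
def wall_clock_seconds(simulated_time_s, time_step_s, seconds_per_step):
    """Computer time needed when every time step costs one energy evaluation."""
    n_steps = simulated_time_s / time_step_s
    return n_steps * seconds_per_step

# Assumed values: 1 fs time step, 1 ms of computer time per energy evaluation.
for label, span in [("1 ns", 1e-9), ("1 microsecond", 1e-6), ("1 ms", 1e-3)]:
    days = wall_clock_seconds(span, 1e-15, 1e-3) / 86_400
    print(f"{label}: {days:.3g} days")
```

Under these assumptions a microsecond of simulated motion already costs about a million seconds of computer time, which is why the biological time scale is out of reach.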
RESULTS
Collected papers for symposia held in 1984 and 1985 give
an overview of methods and results of applications of molecular
dynamics to proteins.* Subsequent work achieved many of the
* Hermans, 1985; Beveridge and Jorgensen, 1986; results described in
these symposium papers are not explicitly referenced in this section. Some
[...]
for proteins because of the ubiquitous charges and requisite counterions
present in solutions of nucleic acids. For these macromolecules,
it is insufficient to deal with solvent alone. Current approaches
have considered various polyionic charges, solvent, and,
in some cases, counterions. In many studies, the polyionic phos-
phate charges have been artificially reduced to about 25 percent
of the physical value to account crudely for counterion association
(Hingerty and Broyde, 1982; Tidor et al., 1983). This value arises
from the counterion condensation formalism, which describes a
required fractional counterion screening for counterions that are
far from the polyelectrolyte (Manning, 1978). This approach is a
valuable qualitative tool, but we do not expect quantitative results
from such a strong approximation.
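The origin of the figure of about 25 percent can be checked with a short calculation. In the counterion condensation picture for monovalent counterions, the fraction of each polyion charge left uncompensated is 1/ξ, where ξ is the ratio of the Bjerrum length to the axial spacing between charges. The sketch below assumes water at 25 °C (dielectric constant 78.4) and a charge spacing of about 1.7 Å for double-helical B-DNA; both numbers and the function names are assumptions introduced here.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, C
EPS0 = 8.8541878128e-12      # vacuum permittivity, F/m
K_B = 1.380649e-23           # Boltzmann constant, J/K

def bjerrum_length(temperature, dielectric):
    """Distance (Angstroms) at which two unit charges interact with energy kT."""
    meters = E_CHARGE**2 / (4.0 * math.pi * EPS0 * dielectric * K_B * temperature)
    return meters * 1e10

def residual_charge_fraction(charge_spacing, temperature=298.15, dielectric=78.4):
    """Condensation estimate for monovalent counterions: uncompensated
    fraction 1/xi with xi = l_B / b; no condensation occurs when xi <= 1."""
    xi = bjerrum_length(temperature, dielectric) / charge_spacing
    return min(1.0, 1.0 / xi)

# Assumed axial charge spacing of about 1.7 Angstroms for B-DNA.
fraction = residual_charge_fraction(1.7)
print(fraction)
```

The result is close to one quarter, consistent with the reduced phosphate charge quoted above.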
Only recently have initial all-atom studies of polynucleotide-ion-solvent
systems been carried out (Corongiu and Clementi,
1981; Seibel et al., 1985; van Gunsteren et al., 1986), but it
is clear that the exceptionally time-consuming nature of these
simulations with ions present does not yet permit such calculations
to be very informative in practical ways. To simulate a duplex
oligonucleotide without added salt for 2 nanoseconds (a relevant
motional time scale for the polymer) would require roughly 1,000
hours of supercomputer time.
In the case of nucleic acids, an intermediate ground state exists
that is not relevant for many proteins. That is, the set of explicitly
simulated atoms can be extended to include the macromolecule
and ambient ions, while retaining the implicit treatment of only
the solvent. In terms of the number of atoms to be followed, the
simplification is substantial.
In any of the cases discussed above in which the solvent is
treated implicitly, one still must implement realistic potentials of
mean force, or, at least, invoke a firmly based dielectric continuum
treatment. Since the potential payoff of knowing viable implicit
solvation routes is very large, it is important to encourage research
into implicit modeling in biopolymer-related systems.
A potentially useful approach to the ionic atmosphere that
avoids even the intermediate-level treatment of ions is the im-
plementation of solvent and ion-averaged potentials within the
biopolymer. The use of reduced phosphate charges is an ad hoc
form of such a potential function. An oversimplified but well-founded
alternative is the use of a Debye-Hückel-like screening
between polymer sites (Hesselink et al., 1973; Manning, 1978;
Soumpasis, 1984), employing the bulk solvent dielectric constant.
The latter approach cannot, however, account for the unusually
high degree of ionic association that is present in the immediate
vicinity of a polyion.
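A Debye-Hückel-like screened interaction between two polymer sites can be sketched as follows. The functional form (a Coulomb term in the bulk solvent dielectric multiplied by an exponential decay over the Debye length) is the standard textbook expression; the ionic strength, site separation, and function names are illustrative assumptions.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, C
EPS0 = 8.8541878128e-12      # vacuum permittivity, F/m
K_B = 1.380649e-23           # Boltzmann constant, J/K
N_A = 6.02214076e23          # Avogadro constant, 1/mol
COULOMB = 332.06             # kcal*Angstrom/(mol*e^2)

def debye_length(ionic_strength, dielectric=78.4, temperature=298.15):
    """Debye screening length in Angstroms; ionic strength in mol/L."""
    ions_per_m3 = ionic_strength * 1000.0 * N_A
    kappa_sq = (2.0 * E_CHARGE**2 * ions_per_m3
                / (EPS0 * dielectric * K_B * temperature))
    return 1e10 / math.sqrt(kappa_sq)

def screened_coulomb(r, q1, q2, ionic_strength, dielectric=78.4):
    """Pair energy (kcal/mol): bulk-dielectric Coulomb term times exp(-r/lambda)."""
    lam = debye_length(ionic_strength, dielectric)
    return COULOMB * q1 * q2 / (dielectric * r) * math.exp(-r / lam)

# Two hypothetical phosphate charges 7 Angstroms apart in 0.1 M salt.
print(debye_length(0.1), screened_coulomb(7.0, -1.0, -1.0, 0.1))
```

Because the screening depends only on the bulk ionic strength, this form has no way to represent the locally enhanced counterion concentration near the polyion, which is the limitation stated above.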
Another unusually promising approach involves applying more
analytical theories for the influence of the solution environment,
while retaining a detailed description of the biopolymer. In essence,
one evaluates the effective potentials that govern the intrapolymer
interactions for fixed polymer configuration by (numerically) solv-
ing the relevant equations of an essentially analytical theory. An
example of significant recent progress along these lines is the work
of Pack and coworkers (Klein and Pack, 1983). They have used
an algorithm for solution of the Poisson-Boltzmann equation for
the ionic distributions around a detailed DNA model, and from
such distributions the relevant free energies of different conformations
are, in principle, obtainable. At least for simplified models
of DNA, the Poisson-Boltzmann mean field theory has proved
accurate compared to computer simulation for the same models
(Murthy et al., 1985). Closely related approaches have been
considered for enzyme-substrate binding involving charged species
(Klapper et al., 1986). However, a Poisson-Boltzmann treatment
is tied to a dielectric continuum view of the solvent, although
dielectric heterogeneity can be readily accounted for within this
context.
A brief comment on biopolymer dynamics is appropriate here.
Dynamics are clearly connected to the general question of pro-
tein folding, and are likely to be significant for function in many
cases. Although molecular dynamics may not be directly related
to the issue of predicting function, it is clearly connected to the
more general question of protein folding. As for the equilibrium
time-independent problem, one can, in principle, consider the full
atomic description. However, one can also focus only on an explicit
subset of solute atoms, such as biopolymer or biopolymer plus ions.
The formal theory to be applied when only some of the atoms are
considered explicitly is well established (for recent discussions in a
variety of contexts see: Adelman, 1982; Ermak and McCammon,
1978; Tully, 1981). The motion proceeds according to the forces
prescribed by the effective potential, but with additional random
forces and friction due to the implicit solvent collisions. In general,
these solvent forces are not simple, but depend on the history of
the solute dynamics (memory) and the current solute conformation
(hydrodynamic interaction). In the general case, the relevant
equation is the so-called generalized Langevin equation with the
friction described by a memory function that embodies the sta-
tistical properties of the solvent collisional correlations in space
and time. The approximation that neglects any frictional correla-
tion involves only constant drag coefficients, the so-called ordinary
Langevin equation. Further approximation leads to equations of a
diffusion type.
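The hierarchy of equations just described can be written out for a single coordinate. The notation is an assumption introduced here, not taken from the cited papers: x is one explicit solute coordinate, m its mass, W the effective potential (potential of mean force), ζ the friction kernel, R the random force, and hydrodynamic coupling between different coordinates is left out.

```latex
% Generalized Langevin equation: friction with memory kernel \zeta(t)
\[
m\,\ddot{x}(t) = -\frac{\partial W}{\partial x}
  - \int_0^{t} \zeta(t - t')\,\dot{x}(t')\,dt' + R(t),
\qquad
\langle R(t)\,R(t') \rangle = k_B T\,\zeta(|t - t'|)
\]

% Ordinary Langevin equation: no memory, \zeta(t) = 2\zeta_0\,\delta(t)
\[
m\,\ddot{x}(t) = -\frac{\partial W}{\partial x} - \zeta_0\,\dot{x}(t) + R(t),
\qquad
\langle R(t)\,R(t') \rangle = 2\,\zeta_0\,k_B T\,\delta(t - t')
\]

% Diffusive (overdamped) limit: the inertial term is neglected
\[
\zeta_0\,\dot{x}(t) = -\frac{\partial W}{\partial x} + R(t)
\]
```

Each step down the list drops one piece of solvent information: first the time correlation of the friction, then the inertia of the solute.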
As with most areas of theory in physical science, the time-
dependent theory lags behind the equilibrium theory in terms of
development and application. A few examples of attempts to apply
these methods to realistic and simplified models exist (Levy
et al., 1979; McCammon et al., 1980) but only once does it appear
that a polypeptide has been examined (Brooks and Karplus,
1986). The problems encountered in any application involve, first,
computational limitations, since the dynamics that are of biochemical
interest are relatively slow. Perhaps more significant is
our extremely limited knowledge of the memory function and hydrodynamic
interactions for a complex solute. In principle, we can
test our assumptions against all-atom simulations or the relevant
functions extracted from these simulations, but this route is itself
limited by the current sparsity of the requisite simulations. Nevertheless,
future developments in this area seem very likely, although
they are further away than are those in equilibrium theory.
Conclusions
The ability to adequately test predictions of theoretical calcu-
lations is an element of overriding importance in the future of the
modeling of solvation and electrostatics. This testing can occur at
two levels: first and foremost by comparing theory and experiment
and second, by comparing results obtained through convenient but
approximate theory with those derived from accurate theoretical
treatment. The first is essential for accuracy and the second for
the future development of viable theoretical methods for studying
increasingly complex systems. Therefore, we should continue to
encourage both experiment and theory for both macromolecular
and smaller model compounds.
We cannot expect an immediate and completely satisfactory
way to account for the influence of the solution environment on
the behavior of macromolecules. The above discussion indicates
that some unsolved and many partially solved problems remain.
Nevertheless, reasonable approaches already exist that provide
adequate grounds for qualitative study. The degree of quantita-
tive accuracy is not yet well established and awaits both further
model calculations and further thermodynamic and spectroscopic
experimental data, so that we may make unequivocal comparisons
between theory and experiment. Based on the steady progress de-
scribed above over only the past few years, we have every reason to
expect rapid incremental progress to continue. A clear view of the
capabilities of current models and methods for describing flexible
small molecules should be available within only a few years.
At the same time, the large amount of theoretical activity both
in the development of well-founded approximate approaches and in
the simulation of atomic-level solvated molecules virtually assures
our ability to make the appropriate comparisons between the two
in the near future. A very significant element in recent developments
has come from algorithmic breakthroughs. Biased sampling
techniques and thermodynamic perturbation/integration methods
are two new methods that contribute essential capabilities to the
theoretical effort. Hence, we should encourage theoretical devel-
opments as much as computational applications.
The rapidly increasing access to the necessary computer facil-
ities has contributed significantly to progress, and it is essential
that this access continue to grow. For all-atom models of the
environment, the current limitations on macromolecular simulation
are primarily computational. Although limitations of the
model theory are also likely, we now have insufficient data to make
that judgement. An order-of-magnitude increase in available computing
power would be enough to make a dramatic difference in
this area; two orders of magnitude would permit simulations into
the interesting many-nanosecond regime. Such changes are likely
within the next five years through the combined effects of new
hardware, improved performance, and lower cost.
To explore adequately events such as protein folding that
occur on much longer time scales (or involve vast conformational
exploration), these computational improvements would still be
inadequate by many orders of magnitude. Thus, a theoretical
breakthrough appears necessary if we are to make real progress
within the next several years. Such a breakthrough would be,
for example, the demonstration of an implicit treatment for the
solution environment that yielded accurate biopolymer dynamics
on a nanosecond time scale when compared to a full atomic-level
simulation. Such a treatment could then be applied for longer
times.
Considering the very limited knowledge available about the
performance of alternative implicit modeling techniques, we cannot
now say whether such an approach is workable, even in principle.
However, the process of determining the usefulness of these
techniques requires a gain of one to two orders of magnitude in
computer power.
In summary, the rapid progress in our ability to describe the
environmental aspects of biopolymer systems gives solid ground for
optimism that this element of biomolecular modeling will not im-
pede development of useful predictive methods. However, for the
most challenging aspects, we are at least several years away from
demonstrating the ability to mimic accurately solution environ-
mental effects.
HEURISTIC METHODS
There are two major approaches to the prediction of three-
dimensional structures of proteins: modeling by extension and
hierarchical searching. Both methods can combine heuristic ideas
and energy calculations. They differ from the energy calculations
described earlier and from each other in the way they arrive at
a starting structure. Modeling by extension uses the known structure
of a protein or proteins with strong sequence homology to the
unknown. The hierarchical methods use packing considerations
derived from the crystallography of many proteins. The following
section describes these approaches in more detail.
Homology
Protein Homology
Proteins occur in families. Evidence for this comes first from
protein sequence homology and then from the architectural simi-
larity of homologous proteins determined by x-ray crystallography
and NMR spectroscopy. A family of proteins can be modeled by
homology if several conditions are fulfilled. First and most im-
portant, the structure of at least one member of the family must
be known. Second, the sequence of the protein to be modeled
must be sufficiently homologous to that of the known protein.
Many proteins have been modeled over the past five years, and
the general consensus is that if two proteins share at least 30 percent
similarity, then computer graphics and energy modeling will
be useful. If the global homology is less than 30 percent, then
it is difficult but not impossible to say whether the two proteins
are in the same family. If there are important conserved residues
such as disulfide bridges, then even 30 percent homology might be
sufficient.
The difficulty of modeling a given protein depends on the
range of homology with the known structure. When the homology
is between 80 and 100 percent, normally only the surface amino
acids are changing. In these cases, there is usually no change in
peptide length. With homology of between 50 and 80 percent,
again, mainly the surface amino acids are changing, but there may
be additions and deletions in the peptide length. The amino acids
on the surface of the protein can be easily substituted. Energy
minimization and/or molecular dynamics are sufficient to reduce
the errors caused by any changes in surface sidechain conformation.
Under molecular dynamics, charged surface amino acids must
either be neutralized by artificially altering the parameters or
be surrounded by a solvent box placed about the protein.
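The homology ranges quoted above can be turned into a small screening helper. The following is a minimal sketch, not a method from this chapter: it assumes the two sequences are already aligned with "-" as the gap character, and it uses percent identity over ungapped columns as a stand-in for the "similarity" and "homology" figures in the text. The function names and the wording of the returned labels are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # gapped columns are not counted
        compared += 1
        if res_a == res_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def modeling_outlook(identity: float) -> str:
    """Map an identity value onto the ranges discussed in the text (labels are ours)."""
    if identity >= 80.0:
        return "80-100: mostly surface substitutions, chain length usually unchanged"
    if identity >= 50.0:
        return "50-80: mostly surface substitutions, possible insertions and deletions"
    if identity >= 30.0:
        return "30-50: graphics and energy modeling expected to be useful"
    return "<30: family membership difficult to establish"


if __name__ == "__main__":
    pid = percent_identity("ACDEFGHIK", "ACD-FGHLK")
    print(round(pid, 1), "->", modeling_outlook(pid))
```

Counting only ungapped columns is one of several common conventions; dividing by the full alignment length or by the shorter sequence would give lower values for the same pair.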
When interior amino acids are changed from one protein to
another, either a long amino acid is made shorter,
thus creating a hole in the interior of the protein, or a long-short
pair of amino acids is changed to a short-long pair of amino acids.
When amino acids are added or deleted in a helix, they are
generally in multiples of three amino acids. This preserves the
hydrophobicity/hydrophilicity relations in the helix. In contrast,
beta strands tend to have insertions or deletions of two amino
acids, thus preserving the inside/outside relations (Feldmann et
al., 1985). The inside amino acids of a protein normally are hydrophobic,
while the outside amino acids are normally hydrophilic.
Graphic modeling of such changes is accomplished by moving the
additions and deletions in helices and beta strands toward the ends
of the secondary structure feature. Graphic modeling of loops is
easy to do but fraught with large inaccuracies. Insertion or dele-
tion of amino acids on a loop can be accomplished by breaking the
peptide chain, making the appropriate change and then bending
the loop ends to accommodate the change. Energy modeling under
local conditions of relatively high simulated temperature can
be used to explore the local conformational space.
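The periodicity argument for insertions and deletions in helices and beta strands, made earlier in this paragraph, can be illustrated with a short calculation. This is a sketch under assumptions that are not from this chapter: an ideal alpha helix is taken to turn about 100 degrees per residue (roughly 3.6 residues per turn), and side chains in an ideal beta strand are taken to alternate faces (180 degrees per residue). The function name is hypothetical.

```python
HELIX_DEG_PER_RESIDUE = 100.0   # assumed ideal alpha helix, about 3.6 residues per turn
STRAND_DEG_PER_RESIDUE = 180.0  # assumed ideal beta strand, side chains alternate faces


def phase_shift(n_residues: int, degrees_per_residue: float) -> float:
    """Rotation (degrees, in the range -180..180) applied to the side chains that
    follow an insertion or deletion of n_residues."""
    shift = (n_residues * degrees_per_residue) % 360.0
    return shift - 360.0 if shift > 180.0 else shift


if __name__ == "__main__":
    for n in range(1, 5):
        print(n,
              "helix:", phase_shift(n, HELIX_DEG_PER_RESIDUE),
              "strand:", phase_shift(n, STRAND_DEG_PER_RESIDUE))
```

Under these assumptions a two-residue change leaves the inside/outside pattern of a strand untouched, while in a helix a change of three residues (or four) disturbs the hydrophobic face far less than a change of one or two.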
There are two ways to align sequences: automatically and
manually. Automatic alignment methods, such as that of Wilbur and
Lipman, tend to align the sequences for the highest local match. Manual
methods (see Feldmann et al., 1985) permit the alignment of
secondary structure features, which minimizes the number of
disturbances that must be made to the protein to convert from
the crystallographic structure to the model structure.
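To make the idea of automatic alignment concrete, here is a minimal sketch of a global dynamic-programming alignment in the style of Needleman and Wunsch. It is not the Wilbur and Lipman method named above, which is a faster heuristic; it is swapped in here only because it is short and self-contained. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, and no secondary structure information is used.

```python
from typing import List, Tuple


def global_align(a: str, b: str, match: int = 1, mismatch: int = -1,
                 gap: int = -2) -> Tuple[str, str, int]:
    """Return (aligned_a, aligned_b, score) for a simple global alignment."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # align residues
                              score[i - 1][j] + gap,            # gap in b
                              score[i][j - 1] + gap)            # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a: List[str] = []
    out_b: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


if __name__ == "__main__":
    top, bottom, total = global_align("ACDEFG", "ACDFG")
    print(top)
    print(bottom)
    print("score:", total)
```

A manual alignment of the kind described in the text would override such a result wherever the automatic gap placement falls inside a helix or strand of the known structure.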
One of the ways to overcome the uncertainties of the structure
of a particular loop is to build a library of representative loops.
Alwyn Jones at Uppsala has done this and recently integrated it
into FRODO, his modeling program.
After all the changes have been made, either by graphic methods
or by using a loop library, extensive molecular dynamics simulation
is generally used to improve the quality of the model.
Whether a broad range of scientists can use molecular dynamics
calculations depends on the availability of appropriate modeling
software and on a variety of displays and workstations. To com-
plete the modeling by molecular dynamics calculations, sufficient
computer power must be available either on a personal supercom-
puter (PSC) or by network access to a national supercomputer.
Modeling by Extension
Twenty years ago, Phillips (1967) made a model of the protein
lactalbumin without obtaining a single crystal. This was possible
because the amino acid sequence of lactalbumin had been found to
be 35 percent identical with that of the enzyme lysozyme, a protein
whose crystal structure had recently been determined. The
residues for lactalbumin were simply placed in the equivalent positions
known to be occupied by the residues of lysozyme (Browne
et al., 1969).
Subsequently, Warme et al. (1974) underscored the validity
of the approach through computational studies of the structures of
the two proteins. The structure of alpha-lactalbumin computed
by energy minimization by Warme et al. (1974) has recently been
verified by x-ray crystallography (D.C. Phillips, personal communication,
1987). Since then, "modeling by homologous extension"
has become a common, if sometimes casually applied, approach
that has benefited from modern computer graphics (Greer, 1985).
A recent example of the utility of modeling by extension involves
angiogenin, an important but elusive factor involved in the
biogenesis of blood vessels, which was isolated after a search of more than
15 years. The sequence of this factor was determined and found
to be 45 percent identical to pancreatic ribonuclease. As a result,
Palmer et al. (1986) promptly generated a three-dimensional
structure.
Naturally, the closer the resemblance of the unknown protein
to the one whose structure has been determined, the more accu-
rate the modeled structure. Recently, however, Moult and James
(1986) have shown that it is possible to construct good models
even when the sequence resemblance is barely recognizable. In
many instances, then, all that is needed is the family relationship
of the new protein.
Exon Shuffling
Many recently evolved proteins exhibit evidence of "exon shuffling."
In this phenomenon, mosaic proteins result from the genomic
rearrangement of segments that encode small portions of
different proteins. For such proteins, it is thought that the peptide
segments, which often range from 30 to 90 amino acids in
length, all fold independently (Doolittle, 1985); as such, they constitute
domains in the truest sense. As of mid-1987, about six such
domains had been found in a variety of proteins in different
three-dimensional settings. Most of them are tightly folded and contain
disulfide bonds that hold the structure in place. They include
such well-known motifs as the "EGF" domain, the "Kringle,"
and the "fibronectin fingers." They are readily identified by routine
computer searches of sequence, and, when such a structure
has been identified, can be immediately modeled in place. Exon
shuffling, which is due at least in part to the additional recombination
that ensues from the presence of introns between exons,
is not restricted to the small set of stable structures listed above,
and it is anticipated that hybrid and mosaic proteins of many
sorts will be identified. In all these situations, prior knowledge
of any structural motif will aid in interpreting the overall structure.
Doubtless, other motifs will emerge as more sequences are
determined, compared and analyzed.
HIERARCHICAL MODELS OF PROTEIN FOLDING
The major goal of hierarchical modeling is to build three-
dimensional structures that incorporate directly or indirectly the
architectural principles by which nature constructs globular pro-
teins. The direct approach uses empirical rules or models that
capture recognized aspects of protein folding. The indirect ap-
proach uses homology to provide the basic structure and explores
the local environment with energy calculations or other modeling
efforts. This latter work was described in the previous section.
Here, we assess the success of rule-based procedures.
Investigators have used these procedures at various levels of
formality. These efforts generally follow the same plan: predicting
secondary structure, followed by packing secondary features.
They also use the Kauzmann hydrophobic model as the basic
packing principle. Classifying protein domains by structure has
been particularly important because it provides major rules (for a
discussion and extensive references see Richardson, 1981; Cohen
et al., 1983). Such rules include statements about the geometry
of helices packing against helices and beta sheets and beta sheet-
beta sheet packing. The efforts have also led to the development
of a list of rules concerning the ordering of strands within a beta
sheet. We should note clearly that rules such as these only sum-
marize what is observed in known structures; they are not derived
from first principles of physics or chemistry. Nevertheless, many
of the idealized structures built to be consistent with such rules do
look recognizably like protein domains and some are rather close
(average error 4 Å) to the crystallographic result.
PATTERN RECOGNITION AND ARTIFICIAL
INTELLIGENCE
We differentiate the techniques of pattern recognition from
those of artificial intelligence, especially its subdiscipline of ex-
pert systems. Pattern recognition is usually defined as including
numerical techniques for clustering observed data into binary or
higher order categories (but see below). Artificial intelligence is
usually defined to include use of rule-based systems to express and
use empirical knowledge of a subject.
The usual techniques of pattern recognition, such as linear
discriminant analysis, cluster analysis, and other parametric and
nonparametric methods, have not proved useful in the analysis of
protein structures and functions. These methods pose difficulties
in determining the statistical significance of a derived classification.
Given the relative lack of knowledge about structure-function
relationships in complex molecules such as proteins, it is very
difficult even to pick a reasonable set of structural descriptors upon
which to build a clustering scheme.
Other definitions of pattern recognition, however, are less con-
troversial in technique, if not in interpretation of results. These
techniques involve, for example, the presentation of three-
dimensional protein structures in the form of C-alpha distance
maps (Kuntz, 1975; Rao and Rossmann, 1973) to infer the presence
of secondary and supersecondary structures and the location of
intron/exon boundaries (Go, 1981).
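The C-alpha distance map mentioned above is simply the matrix of distances between every pair of alpha carbons. The following minimal sketch shows how such a map, and a list of close non-neighbor pairs read from it, might be computed. The function names, the 6 Å cutoff, the minimum sequence separation of three residues, and the six toy coordinates (a flat hairpin-like path with 3.8 Å steps) are all assumptions for illustration, not data or parameters from the cited studies.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def distance_map(ca_coords: Sequence[Point]) -> List[List[float]]:
    """Symmetric matrix of pairwise C-alpha distances (same units as the input)."""
    def dist(p: Point, q: Point) -> float:
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    return [[dist(p, q) for q in ca_coords] for p in ca_coords]


def close_pairs(dmap: List[List[float]], cutoff: float = 6.0,
                min_separation: int = 3) -> List[Tuple[int, int]]:
    """Residue index pairs (i < j) at least min_separation apart in sequence
    whose C-alpha distance is below the cutoff."""
    n = len(dmap)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_separation, n)
            if dmap[i][j] < cutoff]


if __name__ == "__main__":
    # Toy coordinates: three residues out along x, then three back at y = 3.8.
    toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
           (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
    dmap = distance_map(toy)
    print(close_pairs(dmap))
```

In a real map, bands of short distances parallel to the diagonal suggest helices, while bands perpendicular to it, as in the toy example, suggest antiparallel strand pairs.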
The technology of expert systems may be applied when empirical
knowledge, which can be expressed in rule-based systems,
can be used to solve problems. For example, production rules of
the form IF (x) THEN (y) may be an integral part of a hierarchical
pattern comparison described previously (Figure 4-1). To
achieve high performance, such rule-based approaches are often
supplemented with methods and data from other sources. Two
systems under development, KARMA (Klein et al., 1986) and
PROTEAN (Hayes-Roth et al., 1986) illustrate this point. The
KARMA system employs rule-based proposal and evaluation of
small molecules and their predicted binding activities in receptor
sites of known proteins. A variety of mathematical and graphic
procedures are used to evaluate candidate structures and their
affinities for binding. PROTEAN uses artificial intelligence techniques
with interatomic (nonbonded) distance information from
NMR and a variety of mathematical and graphic techniques to ex-
plore structural possibilities for proteins of known primary struc-
ture.
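Production rules of the form IF (x) THEN (y), as described in this paragraph, can be sketched as a small forward-chaining loop. The example below is purely illustrative: the Rule type, the forward_chain function, and the three sample rules are hypothetical and do not reproduce rules from KARMA, PROTEAN, or any other system named in the text.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Set


class Rule(NamedTuple):
    """IF all conditions hold THEN assert the conclusion."""
    conditions: FrozenSet[str]
    conclusion: str


def forward_chain(facts: Iterable[str], rules: List[Rule]) -> Set[str]:
    """Apply the rules repeatedly until no rule adds a new fact."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.conditions <= known and rule.conclusion not in known:
                known.add(rule.conclusion)
                changed = True
    return known


# Hypothetical rules, invented for this sketch only.
EXAMPLE_RULES = [
    Rule(frozenset({"candidate helix", "hydrophobic face"}), "helix packs on core"),
    Rule(frozenset({"helical periodicity in sequence"}), "candidate helix"),
    Rule(frozenset({"helix packs on core"}), "include helix in packing search"),
]

if __name__ == "__main__":
    start = {"helical periodicity in sequence", "hydrophobic face"}
    for fact in sorted(forward_chain(start, EXAMPLE_RULES)):
        print(fact)
```

Systems of the kind discussed in the text supplement such symbolic rules with numerical evaluation, which this sketch leaves out.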
Cohen et al. (1983, 1986b) have carried out one of the most ex-
tensive projects in their studies of alpha/beta domains and four he-
lix bundles. In the former case, they identified secondary features
by using pattern matching and then built tertiary structures from
exhaustive combinatorial packing of the secondary elements. In
the most favorable case, flavodoxin, they could generate a unique
prediction for the alpha carbons involved in helix or sheet of the
protein. Similar predictions for molecules such as interleukin-2
have been made based on a core structure of four helices (Cohen
et al., 1986b).
The important strengths of such projects are (1) they achieve
low resolution structures of the central residues of proteins that
contain many protein features, including a prediction of the "active"
portion of the molecules; (2) the computational labor is
modest; and (3) the rule system can be tested directly against
known structures and their homologs. In some sense, they are the
best solution currently available to the folding problem. On the
negative side, the low resolution is an obvious limitation. Details
of loops are often neglected. Only certain classes of proteins can
be dealt with successfully.
The next several years should bring improvements in all as-
pects. More realistic models will reduce errors. Loops and side
chains can be treated either from rule-based or energy-based ap-
proaches. Expansion to more protein structural classes is proceed-
ing rapidly. It is difficult to see the ultimate limitations of these
heuristic methods. We are hopeful that they yield good first-order
approximations that can be refined by the energy minimization
and molecular dynamics calculations.