The generation of δX_{i} (but not the decision on its acceptance) depends only on X_{i} and does not take into account the penalty function on structural diversity. This protocol may lead to a large number of step rejections. However, the above choice is the simplest to use in a parallel computing environment, and it further leaves some room for future publications.
The parallel environment is important because the computations are pursued typically on a cluster of workstations or on a parallel computer with multiple CPUs. Each of the homologous proteins is assigned to one processor. The processor calculates the displacement δX_{i} and the energy ε_{i}(X_{i})+Δ_{i}(X_{i}) and decides whether to accept or to reject the move. The correlation with other structures is built using the function Δ_{i}(X_{i}) that depends on all the other coordinates. The conformations of the set are therefore sampled from the canonical ensemble with an “energy”, E_{total}.
To compute the penalty function Δ_{i}(X_{i}), it is necessary to have the coordinates of all the proteins on each of the processors. The update of the coordinates can be done with every Monte Carlo step. However, to reduce communication overhead, we usually update the coordinates only after a few Monte Carlo steps.
A related algorithm can be easily formulated for molecular dynamics, solving the Newton’s equations of motion:
where M is the mass matrix for protein i. Nevertheless, the lattice approach suggests a number of unique advantages for the protein folding problem, which are discussed elsewhere (3).
We now return to the functional form of the penalty on structural variations. We experiment with two measures:
Root mean square difference (RMSD) of the shared C_{α} coordinates after optimal overlap (4):
L is the fraction of dissimilar contacts in the maps of the two structures. L=(number of dissimilar contacts)/(total number of contacts in the two maps). (Two residues are considered at a contact if their C_{α} distance is ≤6.5 Å.)
The RMSD is a common and useful measure of global similarity. However, it is doing poorly in detecting similar folds of structural segments. For example, if the secondary structure elements are predicted correctly but their packing is incorrect, the RMSD is typically high. In contrast, L, which is not as widely spread as the RMSD, detects local similarities and shows more uniform decrease in value as the structure quality decreases.
Both functions are useful in comparing the final structure to the native fold; however, the task of forcing the different chains to look alike is best done with the RMSD. The application of L is problematic because maps with no contacts at all have (of course) no dissimilar contacts. As a result, restraining the structures to similar L values pushes the system to unfolded swollen states. We therefore used the RMSD. The specific functional form of Δ_{ij}(X_{i}, X_{j}) is listed below:
Simulation Protocol and Results. We provide a numerical example for a family of pancreatic hormones (5). In Table 1, we list the seven sequences that were used in the runs with coupled optimizations (6).
We performed 100 Monte Carlo simulated annealing runs of the protein 1ppt and 142 coupled runs of structures of seven sequences, which were optimized in parallel. Only one experimental structure (of 1ppt) is available, and we compared with it the results of the computations. In Fig. 1, we show the energies of the 100 Monte Carlo runs of 1ppt as a function of the RMSD from the native structure. Also shown are the energies of the 142 coupled runs.
It is clearly seen that runs, which employed seven coupled proteins, cluster near lower RMSD values and therefore provide better prediction. The lowest energy structure of the coupled and the uncoupled runs (our best guess of the native conformation) are shown in Fig. 2. Again, the coupled runs provide a better answer. The improvement does not require an increase in computational effort. Each of the uncoupled 1ppt runs was seven times longer than the run of the seven sequences.
Another example for a protein family (homeodomain) can be found in ref. 6. Yet, another study employed coupling in a two-dimensional lattice (7) and showed even more profound improvement.
Here we discuss the question of ''why.” Why does the proposed algorithm improve structure prediction? We have seen one example, and other examples are available in the literature (6, 7). From a global optimization perspective, it may be surprising that optimizing a system, N times larger (N homologous sequences) is easier than optimizing one sequence at a time. At the limit in which the optimizations are completely independent, they should take approximately N times longer.
Obviously, the coupling plays an important role in increasing the computation efficiency and accuracy. To understand the effect, it is useful to consider a simpler system first in which the “homologous” proteins are all of the same molecule. Hence, only structural diversity remains. E_{total} is now E_{total}= [ε(X_{i})+Δ_{i}]. A single energy function [ε(X_{i})] is used for the different conformations of the proteins {i=1,…, N}.
In Fig. 3, we compare annealing results with coupled and uncoupled energy function for the protein Ifsd. The distribution clearly shows that better energies are obtained when the coupling (of identical proteins) is employed. Hence, a better optimization protocol is obtained without sequence diversity. However, it is important to emphasize that the quality of the structures (as opposed to the energies) is not necessarily better because it depends also on the quality of the energy function.