Read "Computational Needs and Resources in Crystallography: Proceedings of a Symposium, Albuquerque, New Mexico, April 8, 1972." at NAP.edu

« Previous: What are the Needs of Crystallographers?

Page 47 Cite

Suggested Citation:"What New Developments are in the Wind?." National Research Council. 1973. Computational Needs and Resources in Crystallography: Proceedings of a Symposium, Albuquerque, New Mexico, April 8, 1972.. Washington, DC: The National Academies Press. doi: 10.17226/18587.

Page 48 Cite

Page 49 Cite

Page 50 Cite

Page 51 Cite

Page 52 Cite

Page 53 Cite

Page 54 Cite

Page 55 Cite

Page 56 Cite

Page 57 Cite

Page 58 Cite

Page 59 Cite

Page 60 Cite

Page 61 Cite

Page 62 Cite

Page 63 Cite

Page 64 Cite

Page 65 Cite

Page 66 Cite

Page 67 Cite

Page 68 Cite

Page 69 Cite

Page 70 Cite

Page 71 Cite

Page 72 Cite

Page 73 Cite

Page 74 Cite

Page 75 Cite

Page 76 Cite

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Session II What New Developments Are in the Wind? Session Chairman: William R. Busing 47

New Computational Techniques, Particularly For Refinement (1) Carroll K. Johnson The two principal numerical techniques used to refine crystal structures (2) are the Fourier transform method and the method of linearized least-squares. The following remarks will be restricted to the least-squares approach; however, significant developments are also occurring in the Fourier field, the Fast Fourier Transform algorithm being used to decrease computing time substantially. An important preliminary for any crystal structure refinement is the selection of an appropriate mathematical model for the structure under study. The selection is usually influenced by the following three consi- derations. 1. What is the relative importance to the investigator of the dif- ferent types of information that can be obtained from a structure refine- ment? 2. Are there any unusual problems involved, such as major disorder in the structure or poor quality diffraction data? 3. Are the available computer hardware, program software, and computing budget adequate to handle the proposed refinement? Ideally, consideration number 1 is of greatest importance, and the refinement model should be based on the particular type of chemical or physical information that the investigator wants to gain from the structure refinement. There seem to be two different areas of interest to crystal- lographers doing crystal structure analysis. The first area concerns the geometrical properties of the idealized configuration of point atoms (i.e., metrical properties such as distances and angles), and the second area concerns the elucidation of atomic density function properties such as electron density. There are two different schools of thought concerning what is the best method to use in refining a crystal structure. These two schools may be termed the free-model school and the constrained-model school. The free-model school reasons that we should refine a structure in the least restrictive way possible, with independent parameters for each atom so that the final results are unbiased by preconceived chemical con- cepts incorporated into the model. The most commonly used model, with 3 positional parameters and 6 anisotropic temperature factor parameters for each atom, is an example of an unconstrained model. The constrained-model school argues that we should put as much chem- ical information as possible into the model so that the variables to be 48

determined are reduced to the basic parameters of direct interest to the investigator. Examples of constrained models are the rigid-body model, the segmented-body model, and the models which force chemically symmetrical groups to be geometrically symmetrical even though they are not crystallographically equivalent. Such constraints can be applied to both positional and thermal-motion parameters. The majority of the crystallographers seem to follow the free- model school of reasoning. The advantage of the unconstrained model is its simplicity and easy direct application to a wide variety of problems. A disadvantage is often the large number of variable parameters that must be handled when crystal structures of even modest complexity are refined. For example, a full-matrix refinement with anisotropic thermal parameters for a 45-atom structure will involve at least 406 variables and will require 82 621 words of core storage for the least-squares matrix alone. The economic importance of the least-squares calculation is empha- sized by the survey taken by Dr. Hamilton for this symposium. The sur- vey shows that 80 to 90% of the computing time used by U.S. crystallo- graphers is spent in the structure-refinement step. Furthermore, the greater part of this computer time is used in forming the matrix of the least-squares normal equations; consequently, it is often worthwhile and sometimes essential to approximate the matrix by an alternate matrix requiring less computer time and less computer memory. Table 1 lists some old and some new methods for approximating the crystallographic least-squares matrix. The principal approach used to minimize computer core requirements is to omit as many off-diagonal terms as possible, thus transforming the full matrix to a sparse matrix. The block-diagonal matrix with one atom per block is the most commonly used sparse-matrix approximation although further reduction is possible. Diag- onal matrix approximations are of little value for general crystallographic refinement because of the oblique coordinate systems used for trigonal, monoclinic, and triclinic crystals. TABLE 1 Approximations For The Crystallographic Least-Squares Matrix 1. Sparse matrix approximations (a) Block diagonal with one atom per block (b) Cross-word puzzle (block diagonal + first neighbor interaction terms) TABLE 1 continued 49

TABLE 1 continued 2. Recycle and update approximations (a) Use the same full matrix unchanged for several cycles (b) Recalculate only the block-diagonal sub- matrices and simply rescale the rest of the old full matrix (c) Recalculate only the matrix elements influenced by parameters which undergo appreciable shifts 3. Analytical matrix approximations An untried but seemingly logical extension from the one-atom block- diagonal matrix is the "cross-word puzzle" matrix where all interaction terms between close-neighbor atom are added to the block-diagonal matrix. It is well known that close-neighbor atoms have a greater least-squares interaction than distant-neighbor atoms. The cross-word matrix would have to be stored by blocks and inverted with a partitioned-matrix inver- sion scheme. The Recycle and Update Procedures listed in Table 1 assume that the complete matrix has been calculated and stored once and that the changes in it from cycle to cycle are small. The option of using the same matrix unchanged for several cycles was available in the original Busing and Levy least-squares program for the IBM-704. Unfortunately, there is no recorded evaluation of the usefulness of this approximation; however, most of the verbal reports received indicate erratic behavior. There are several rather obvious modifications of the Recycle Procedure which might prove to be useful. For example, the atom-block-diagonal submatrices might be recalculated each cycle and the rest of the matrix simply rescaled by the new over-all scale factor. Alternatively, an algorithm might be devised whereby the only matrix elements to be updated would be those in- volving parameters that shifted appreciably in the preceding cycle. The final method listed in Table 1 utilizes a completely different approach to reduce computer time. It replaces the time-consuming numer- ical summations over the thousands of reciprocal lattice points by anal- ytical integrations. The results on analytical matrix approximations pre- sented here are from my own work; however, I recently learned that Pro- fessor Verner Schomaker at the University of Washington has derived in- dependently a number of related results. There are five factors which are functions of the scaled reciprocal- 50

lattice vector t^ (i.e. t_ = 2trh) in each term of the sum for any particular matrix element. The five factors for a centrosymmetric structure with refinement based on F are listed in Table 2. TABLE 2 Factors In The Pl Crystallographic Least-Squares Matrix Sums Which Are Functions Of The Reciptocal Lattice Vector t, = 2ffh* (1) F*(t)/o*[FÂ£(t)] (2) f (t)f (t) m â n ~ (3) exp{- y t'[(b + b )/2772]t}. L* """-" j^~m â.11 ~~~~ (4) t.t. or t.t.t. or t.t t. t g ; (i,j,k,je = 1,2,3) 1J 1 J K IJKi (5) Tcos[t'(x - x )] Â± COS[t'(x -f x )]â¢ sin"*' â¢**M Â«**n ' sin Â«Â»* â¢Â»â¢!Â« ~n x , XH are positional vectors and b , b are anisotropic thermal-motion matrices for atoms m and n. The first factor in Table 2 contains the calculated squared structure factor divided by the variance of the observed squared structure factor. This factor can be eliminated from the list by making the following approx- imation: Approximation 1 - The magnitude of the calculated squared structure is assumed to be proportional to the variance of the observed squared factor. Consequently, the ratio F^/o^(Fp) becomes a constant for all reciprocal lattice points. Approximation 1 is completely valid for the special case where variances are based on counting statistics alone with no correction for background counts. The second factor in Table 2, a product of atomic scattering factors 51

for atoms m and n, may be replaced by an analytical expression. Approximation 2 - The product of two atomic scattering factors is assumed to be approximated adequately by a short sum of spherical Gaussian functions. Sums of 3 to 5 Gaussian functions currently are used successfully to re- place scattering- factor table- look-up procedures in crystallographic programs. The same tabulated Gaussian coefficients could be used in a double summation; however, a more efficient procedure is to fit new Gaussian coefficients directly to the scattering factor product, taking care to make the fit acceptable for the entire range of The third factor in Table 2, the product of anisotropic Gaussian temperature factors for atoms m and n, presents no difficulty. The fourth factor t.t....t is the n degree product of the components for the three-dimensional reciprocal lattice vector t. The 2n , 3 , and 4th degree products occur in position-position, position-thermal, and thermal-thermal matrix elements respectively. The fifth factor is a product of trigonometric terms which can be rewritten as a sum of trig- onometric terms with arguments containing inner products of t with an interatomic vector between atoms m and n. When the periodic properties of the trigonometric functions and the crystal lattice are considered, this factor is seen to contain the Patterson vectors between all atoms of types m and n in the crystal. By incorporating approximations 1 and 2, we can write a simplified equation for any element in the crystallographic least-squares matrix L. For example, for space group Pl, the equation for the interaction term relating the i component of the position vector .x^ for atom m and the j component of the position vector for atom n (i,j = 1,2,3) is L(xm'xn} = K EZ'EV exp(-Â£'M t/2) cosCt/^) (1) with the matrix M defined as SB M = 0 G-1 + (b In these equations, K is a constant, Â£ is an interatomic vector between atoms on crystallographic sites m and n (i.e., x - x , JXL, - Jc , x + g , and -jc - x ), J)m and b are anisotropic temperature factor matrices for atoms m and n, G~^ is the contravariant metric matrix, and QJ.,,/3 are 52

coefficients in the Gaussian expansion for the scattering-factor product for atom pair m,n. The main step in the approximation procedure involves replacing the summation over Â£ in Eq. (1) by an integration over Â£. Approximation 3 - We assume that enough reciprocal-lattice points are included in the reciprocal-lattice summation to justify the replacement of the summation by an integration without including higher-order correction terms. Approximation 3 is based on the classical Euler-MacLaurin summation for- mula suitably generalized to the three-dimensional case. The one- dimensional Euler-MacLaurin summation formula is m b f(a + kh) = I /f(t)dt + |[f(b) + f(a)J +-.. , (3) k=0 a where h = (b - a)/m. The higher-order terms (not shown) involve powers of h and odd-order derivatives of f at the limits a and b. A number of special cases now occur, depending upon the integration limits. In the simplest case, the integration extends over all of the reciprocal space and we obtain the result, Lfx1 v^h â¢ K-' V V r inf~iÂ« r M-II < 'â i MW '~ K i/Z/V-IMI H^Cjr.M-MexpC-E M-'y/2) (4) with H.-CyjM-1) = z.z. - Mr1., and z = M-1V. ^â¢J â i J ij ~ ~ -c. The tensor component H..(Y.M ) is a second-order three-dimensional Hermite T_ 1 '^ dc polynomial. Corresponaing formulas for the position-thermal and the thermal-thermal interaction have the same form as the position-position interaction equation shown in Eq. 4 except that the second order H.. is replaced by the third and fourth order polynomials H^K and H-jiW respectively. Equation 4 represents the asymptotically limiting situation which is approached only when the entire reciprocal-lattice data set is inclu- ded. A more common experimental practice is spherical truncation of the data set at some fixed value of |t|. An exact solution for this general case with anisotropic temperature factors and spherical truncation is quite difficult, but some success has been achieved with empirical correct- ion factors applied to Eq. 4. If the temperature factor for each atom is isotropic and the truncation is spherical, the finite summation over t in 53

Eq. 1 can be replaced by an integration which has an anlytical solution involving Legendre polynomials and error functions. Equations 1 and 4, which are specialized for space group Pl, can be generalized to include any centrosymmetric space group by incorporating a double summation over the symmetry operation for the space group. The task of obtaining an analytical formulation for the noncentrosymmetric space groups is less straightforward because the first factor given in Table 1 (i.e., FÂ£ (t)/cr {jo^t)]) is replaced by three terms. For example in the noncentrosymmetric space group Pl, the matrix element for a position position interaction may be written p i Â° Â° 2A B -i'| t/2>cos[(xm-xn)'t] A2 - B2 1tjexp(-t'F t/2)cos[(xn+xn)'t] (5) -t Â£ t/2)sin[(5iB+xn)'t]l where A and B are the real and imaginary parts of the structure factor and F " A + B Â« The problem is to predict the behavior of the factors (AÂ§ - B^/Qo2^ + B*)] and A^/fa2^ + B2)] . Intuitively, it seems that the terms containing these factors may tend to integrate to zero if the entire reciprocal lattice is included and if the structure is a "random structure"; however, the conjecture has not been proven. The integral behavior of these terms for a real structure with a truncated data set seems rather unpredictable. Evaluation of the analytical matrix technique is underway. With favorable conditions (i.e., low crystallographic symmetry and extensive, but finite, diffraction data) the computing time required to form the matrix has been reduced by an order of magnitude. For cases where the symmetry is very high and the data collected are not extensive, there may be no saving of time. The principal testing of the procedure to date has been for an application rather different from least-squares refinement. We use the inverted analytical matrix to calculate the complete variance-covariance matrix for a published structure without computing structure factors or their derivatives. The only data needed to generate the analytical matrix are the structural parameters, a matrix scale factor, and a truncation parameter. The latter two parameters can usually be obtained quite 54

accurately from the standard deviations, which usually are published with the crystal-structure paper. In addition to the matrix approximations described above, there are also possibilities for saving computer time by utilizing some special redundancy properties of the full crystallographic least-squares matrix in space group Pl. The basic approach is simply to examine the equations for the elements in the matrix. An example is for a hypothetical structure displaying space group symmetry Pl, with two atoms (m and n) in the asym- metric unit. If we write out the equation for the 171 supposedly unique elements in the symmetric 18 by 18 matrix, ^, for positional and aniso- tropic thermal parameters, we quickly discover that considerable redun- dancy is present, only 103 elements actually being unique. The remaining 68 elements are simple multiples of other elements. For example, we find 1919 1199 1 â¢} 9 ^ 23 IT that L(b^2,bm 2) = 4L(b* ^bj 2) and L<bJ 3, bn 3) = L(bm ,bj 3) = 2L(b >t>n ) = 2L(b ,b ). For other centrosymmetric space groups, the redundant linear combinations are fewer and more complicated. 55

DISCUSSION Sparks: Concerning the calculation of correlation coefficients from position parameters and thermal parameters, I always thought that wouldn't work if you had a situation where you had refined a set of data having not only the termination problem you mention but also a lot of weak reflections that are just left out of the data set. It would seem to me this would tend, in some peculiar way, to bias the results. Johnson: The numerical agreement between the variances calculated from the regular inverted least-squares matix and those calculated from the inverted analytical matrix is usually quite good for any data set. The agreement for the covariances become much better if the data set is fairly extensive. Missing reflections may present a serious problem if the data set is quite sparse, but we have not examined this aspect numerically or theoretically. Templeton: This sounds like magic until you think about it, but there's a way of restating it which I think makes evident what you're doing. If you have a published structure, and use this to calculate structure factors, this is a data set more or less like the experimental data set, more to the structure's right and less to the structure's wrong. From that data set you select, depending on your knowledge, which reflections have been left out. For example commonly people say, "We observed 1600 reflections of which 400 were zero." If you simply chop off the 400 smallest ones then you would have a very good replica. One thing I noticed in your suggestion that one leave out matrix ele- ments not affected by temperature, it is not evident how you know which ones these are because they're not necessarily just the ones labeled by the subscripts of the parameter that is shifted. Johnson: I have to admit that I have not thought this through in detail and cannot at present describe an algorithm that would keep track of the major changes from cycle to cycle. Templeton: Then part of my next question is, how do you know which ones they are, because all of the derivatives include in them the structure factor? Johnson: But we have in this formulation eliminated the structure factor, but you're right, very good point. However, the structure factor was eliminated in the analytical formulation and perhaps, as an approxima- tion, the same reasoning could be extended to this approach. Hamilton: Would you predict that this may be the answer for people who are refining protein structures and that they should really be think- ing seriously about this method? 56

Johnson: I must admit I harbor the fond hope that the analytical approach might someday be applicable to protein refinement. Unfortunately, I cannot at present see how to handle the non-centrosymmetric problem properly. Luckily, you scheduled me in a session where I can discuss what should be done and not necessarily what can be done. I think it is certainly worthwhile to put some additional effort into this approach to see if it might be a feasible solution for proteins. Busing: You say this speeds up the computation of the matrix by perhaps a factor of 10. If the number of observations and parameters is very large, does this method become even more favorable? Johnson: The approximation improves as the number of observations increases and we are in best shape if the data set includes everything that can possibly be measured. In this case we also obtain our maximum time advantage. Additional parameters may also improve the time advantage because the sum over the Patterson vectors converges rapidly as a function of interatomic separation; consequently the long vectors can safely be omitted from the summation. 57

New Computer Needs for Direct Methods D. Sayre Our purpose in this part of the symposium is to try to foresee whether new developments in crystallographic techniques are likely to generate changes in the computational resources that crystallographers should have. In this paper we consider whether such changes are likely to arise out of developments in direct methods. Structures of Moderate Size In the direct methods, the problem of structure determination is con- verted into the mathematical problem of solving a system â usually a large system â of equations or relations involving the structure fac- tors as principal variables. The equations or relations express the non-negativity, atomicity, or other property of the structure, and the phases of the structure factors are the quantities solved for. Up to a certain size of structure it is feasible to attempt the solution with- out any initial information about its location, i.e. the solution can be attempted ab initio. In this case, which is called the pure direct method case, if the solution is successful a very convenient structure- determining method results. In the usual formulation of this approach, the basic step is that of passing from a situation in which the phases of i of the structure factors exist (these phases being such as to satisfy the relations being solved) to the situation in which i+1 values (still satisfying the rela- tions) exist. The information available for this 1 step consists of all relations involving the as yet unassigned structure factor phases, which can be written in terms of the structure factor magnitudes plus the i phases already assigned. Unfortunately, the phase-limiting relations known to us today have the property of being generally weak for small values of i, though they become reasonably strong by the time i reaches, say, 10N, where N is the number of independent atoms in the structure; moreover the larger N is the weaker the relations become for small i. The result is that the solution process must pass through an initial stage in which the partial solutions that must be retained in some form if the possibility of missing the solution altogether is to be avoided, may branch to impractically large numbers, and the severity of this stage rises steeply with N. (It rises roughly as mn, where n is the number of branch-points and m is their average multiplicity, and where both are increasing functions of N.) This rapid rise with N in the average time needed to produce a solution, although its exact behavior as a function of N is not known, is what puts the limit on the size of structure that can be handled by purely direct methods. For a procedure of this kind, if its method of exploring partial 58

solutions allows it quickly to go deeply into some of those solutions, it may produce a solution to a problem of size N in a time short compared to the time needed to guarantee a solution at that size. This effect complicates the interpretation of empirical data on the relationship be- tween the difficulty of solution and N. Thus we may state on the basis of the recent solution by pure direct methods of the structure of adeno- sine triphosphate, with 72 atoms in the asymmetric unit of P2i2..2-, (Kennard et al., 1971), and of that of valinomycin, with 78 atoms in P2i (Duax and Hauptman, 1972), that the difficulty of solution is probably not in general beyond the range of our computers up to N=75, but this conclusion must be somewhat tentative at present. Much of the work going on in direct methods today can be traced to the desire to strengthen the phase-limiting relations at low values of i. Thus Karle (1971) has recently re-examined the determinantal inequali- ties introduced into crystallography by himself and Hauptman and has sug- gested that E^ may be more closely located by higher-order forms of the determinants than by the m=3 forms commonly used until now; at the 1 step in the solution process orders up to the i+ls would be available. Similarly Weeks and Hauptman (1971) have re-examined the tangent formula which is often used to produce an estimate of an i+1 phase from the values of i phases and have suggested a modification intended to reduce somewhat the errors which occur when i is small. Again the interest in the method of Tsoucaris (1970) stems from the possibility that his prin- ciple of estimating the phase of E, by maximizing a certain Karle-Hauptman determinant may provide a still more accurate localization of that phase. Finally both Karle (1970) and Hauptman (1970) have recently been concerned with improving the accuracy of the type of formula that permits approxi- mate values, if not of the phases at least of the cosines of the structure invariants, to be computed from the structure factor magnitudes alone; these values are therefore available at every step i. Indeed the most interesting thing about this latter type of relation may be the fact that the possession from the outset of the full set of cosine invariants, even in approximation, allows one to think in terms of pure direct methods that are not step-by-step in nature; these may possibly have no greater overall chances of success than the step-by-step technique, but would at least be free to some extent of the extreme branching in which the latter can so easily get caught. For example, in the centrosymmetrical space groups the cosine invariants are exactly what is needed to compute the double Patterson function. Large Structures Above the size of structures we have been considering thus far lies the entire range of large structures. Although these have not been approach- able by direct methods up to now, it may well be that the most significant developments in direct methods will relate to structures of this class. The circumstance making this possible is that certain equation-solving 59

techniques can accept some initial information about the location of the solution, while physical phasing techniques based upon the inclusion of heavy atoms can provide such initial information, though not sufficient to determine all desired phases. The most dramatic instance of this mixed type of solution to date is that of Sobell et al. (1971) in solving the structure of an actinomycin-deoxyguanosine complex containing 140 atoms in the asymmetric unit of P2^2^2j. In this case the single heavy-atom isomorphous replacement technique was followed up both by tangent-formula refinement and Patterson search methods. Going upward in size, tangent- formula refinement has been attempted without success in the extension of protein phases beyond 3 A resolution, by Hoppe and Gassmann (1964), and by Reeke and Lipscomb (1969). An attempt has been made also by Dickerson, but with results that have not been published. Barrett and Zwick (1971) attempted phase extension beyond the limit of multiple isomorphous replace- ment on myoglobin by a different method, which is not really a direct meth- od, but without success. Some recent work of my own on a method similar in aim to tangent- formula refinement may be of interest. The method may be characterized as a stronger solution technique than tangent-formula refinement, but also computationally more expensive. The first stage of the work (Sayre, 1972) consisted of a demonstration that structure factor phases can be refined in much the same way as we refine atomic parameters, i.e. by least-squares minimization of an appropriate function of the quantities being refined. The structures considered obeyed the relations a, F, = VFkFh-k anc* refinement process consisted in minimizing the function ^Ja^F^ - The favorable thing about these refinements was that they could be succes- fully initiated from a quite incomplete set of starting phases. For ex- ample, it was sufficient to have the phases out to 3 A resolution, with no phase information beyond that point initially available. The principle requirement in using the above function for minimization is that there be available a reasonably complete set of magnitudes out to 1.5 A or better; thus, given an initial set of phases over the low-resolution terms, the effect of the refinement process is to produce a set of phases extending over the higher-resolution terms as well. Significant errors in the phases initially given could be tolerated and corrected by the refinement process. Final mean errors for the phases ran from 8Â° to 10Â° in these experiments, in which all magnitudes were assumed to be accurately known. This direct phase-refinement process for phase extension is now being tried on the protein rubredoxin. L.H. Jensen at the University of Washington has kindly suppj-ied observed magnitudes for rubredoxin to 1.5 A together with phases to 3 A as determined several years ago by multiple isomorphous replacement and anomalous dispersion techniques. A refined set of phases for the full set of magnitudes (5034 terms in all) has been produced by the above process, and the resulting Fouriers at 1.5 A reso- lution have been sent to Jensen for evaluation. The Fouriers are definite- ly those of an atomic structure, but it is not yet known whether the method placed the atoms in the correct positions. 60

The use of this technique in protein crystallography carries with it certain difficulties. Firstly, the computation is a massive one, owing to the large set of quantities to be refined. On rubredoxin each cycle of the refinement takes just under an hour on a large machine (IBM 360/91), with a total of approximately 35 hours required for the refinement as a whole. The cost of this amount of computation at a typical large computing installation might be about $25 000. This fi- gure could be reduced through reprogramming and redesign of the algor- ithm, but no real estimate of the reduction has been made. On the neg- ative side, the computation increases approximately as N , N being the number of independent atoms. Of course if one drops below the protein size-range, the N dependence helps considerably; thus at 200 atoms a refinement might cost about $5000. Secondly, although the method will seek out a minimum of the function in question, there is little experience as yet with real structures con- cerning the likelihood that the minimum is the one corresponding to the physically correct structure. In any event, the phases obtained by this type of refinement should not be regarded as final phases but as being subject to final refinement by the usual methods. The phase-refinement technique, in other words, should be thought of as a possible bridge from the preliminary phasing techniques to final refinement of atom para- meters. In another development Luzzati, Tardieu, and Taupin (1972) have described the phasing of reflections from biological membranes and re- lated systems by a direct method. The method is based upon the recogni- tion that structures of this type consist of separate constituents (water, hydrocarbon, and protein, for example) with a somewhat different electron density for each. At low resolution, therefore, the electron-density function for the structure should be divisible into regions such that the density is reasonably flat within each region and changes fairly rapidly only at the region boundaries. The authors have found a means of expressing this requirement in terms of conditions among the structure factors, and have devised a discrete search method for selecting sets of phases by computer that approximately satisfy the requirement. The method is appa- rently in regular use at their laboratory, where it facilitates consider- ably the structure determination for this type of system. It shows the possibility of direct methods in structure work that does not even approach atomic resolution. The authors indicate a belief that the technique may be applicable to other biological macromolecules, at low resolution, inclu- ding proteins. Conclusions 1. For the crystallography of structures below 30-40 atoms in size, developments in direct methods will probably not have much future effect on the scale of computational needs. This does not mean that future 61

technical improvements in direct methods for such structures are in any way ruled out, or that each individual crystallographer working in this area today necessarily has adequate computing resources, but simply that the main impact of direct methods on computation for such structures has already largely taken place. 2. The steep general rise of difficulty of the pure direct methods with N makes a gradual extension of their range beyond the present approx- imate limit at N=75 much more likely than a rapid one. From the point of view of computational resources, maintaining this gradual extension is likely to involve a rather steeply increasing need for resources. In addition, as N increases, not only the mean time but the uncertainty of the time to solution can be expected to rise, i.e. the actual time to solution in any given case may prove to be much smaller, or much larger , than the predicted mean time. As larger structures are attempted by purely direct methods, this unpredictability of machine time will increas- ingly have to be allowed for in the budgeting of resources. This latter conclusion may not necessarily hold for the non-step-by-step type of solution, if that type of solution should develop. 3. The next few years may see for the first time the application of direct methods, sometimes in combination with other methods, to proteins and other large structures. If this occurs, very large machine resources will be required. References Barrett, A.N. and Zwick, M. (1971). Acta Cryst. A27, 6. Duax, W.L. and Hauptman, H. (1972). Program of ACA Winter Meeting, Albuquerque, paper 03. Hauptman, H. (1970). Proceedings of the International Summer School on Crystallographic Computing, Ottawa, August 1969, pp. 41-51. Copenhagen: Munksgaard. Hoppe, W. and Gassmann, J. (1964). Ber. Bunsenges. 68, 808. Karle, J. (1970). Acta Cryst. B26, 1614. Karle, J. (1971). Acta Cryst. B27. 2063. Kennard, 0., Isaacs, N.W., Motherwell, W.D.S., Coppola, J.C., Wampler, D.L., Larson, A.C., and Watson, D. G. (1971). Proc. Roy. Soc. Lond. A325, 401. 62

Luzzati, V., Tardieu, A., and Taupin, D. (1972). J. Mol. Biol. 64, 269. Reeke, G.N. Jr. and Lipscomb, W.N. (1969). Acta Cryst. A25. 2614. Sayre, D. (1972). Acta Cryst. A28, 210. Sobell, H.M., Jain, S.C., Sakore, T.D., and Nordman, C.E. (1971). Nature New Biology 231, 200. Tsoucaris, G. (1970) Acta Cryst. A26, 492. Weeks, C. and Hauptman, H. (1971). Program of ACA Winter Meeting, Columbia, South Carolina, paper H4. DISCUSSION Kartha: While it is quite true that the additional reflections did sharpen up the map, the crucial question is, have you got the phases between 3A and 2A correctly? Did you make any comparison of these phases with those obtained by the use of isomorphous series and anomalous scattering data in this region? Sayre: Until now I have not known any of Jensen's phases except his experimental phases to 3A, and as a consequence I have not been able to make the comparisons you mention. At this meeting, however, Jensen has given me a tape containing his latest set of phases, and I expect to start making comparisons between the two sets within a few days. Kartha: A map using only the 3A-2A would be worth calculating to see how the map with your phases on the one hand and the isomorphous series phases on the other compares. Sayre: Perhaps so. I personally have thought that the most interesting comparison would be with Jensen's final 1.5 A set. Dewar: If direct methods start to involve large amounts of computer time, we should be wary of using criteria like computer costs per 63

structure and computer costs per man year, because it may actually represent a good use of resources to crank those figures up to a much higher level than they are now. We tend to write off other costs to some extent, partly because the cost of estimating other costs is difficult in university environments. We may worry about the computer costs, but personnel costs are extremely difficult to assess. I think you should not pay too much attention to statistics of this type, as this kind of work will tend to change the picture considerably. Hall: I've been assisting Dr. Sobell at Rochester in trying to solve another form of the actinomycin complex you mentioned. So far this has been unsuccessful but it may be of interest to note the computer times involved in the calculation of cosine averages for this par- ticular complex. There are approximately one thousand E's above 1.4, ten thousand reflections altogether. A cosine calculation for one invariant takes about two seconds on the UNIVAC ll08 and when you consider that there are about 80 000 invariants, this represents a sizable amount of computer time. I don't think we can justify that sort of expenditure now. You might, of course, only consider look- ing at a thousand of those invariants. Even so, two thousand seconds on UNIVAC ll08 time is still a significant cost in such an analysis. Sayre: I interpret your remark as pointing out the cost of a structure- determining method that uses most or all of the cosine invariants, even if it permits the avoidance of a step-by-step approach. Doubt- less your point is well taken. I think we might agree that the cen- tral question concerns the expansiveness of the number of cosine invariants relative to that of the solution tree in a step-by-step solution procedure. Hall: I'd like to acknowledge that Dr. Hauptman and the group at the Medical Foundation in Buffalo have been working in collaboration with us on this structure and that it was in fact the MDKS average that I referred to. Sayre: Which is a more complicated one. Hall: Right. And it was just a preliminary calculation to try and get some estimate of the reliability of the invariant relationships. â¢ Donnay: With that beautiful map you have shown us, are you not at a point where you can place atoms and get some idea of an R index? Sayre: I suppose that would be possible, but in fact the map was pro- duced only about 5 days ago. Donnay: But this could eventually be an indication of whether or not you are heading in the right direction? 64

Sayre: Not as yet. In this work I have until now avoided thinking about atom positions, and have been careful to treat the problem entirely as a problem in equation-solving. I plan to continue in this vein for the present. Of course, should the phase-refinement method prove to be a generally valid one, it would as you say be at the present stage that one would try to place the atoms. Kartha: You mention the difficulty of using direct methods as a function of the number of atoms, but in protein crystallography usually there is another problemâthat not only the number increases but the ratio of the number of parameters to number of reflections goes up very much. What is the correlation? Sayre: Regarding the minimum-finding process, suppose there are p phases to be adjusted to produce a minimum. The number of available squaring- method equations, s, will be at least p, and the number of observa- tional equations will therefore be at least 2p, since each squaring- method equation is an equation between complex quantities. Thus the minimization process can always be carried out under conditions of at least 2-fold overdetermination. When data are complete within the CuK sphere, the minimum corresponds to the true phases with an average error that may be only 1 - 2Â°. Under conditions of more severe data-incompleteness, the position of the minimum may move further away from the position corresponding to the true phases. Based on exper- iments with artificial structures, an average error of 8 - 10Â° may be expected from this cause as a consequence of the data incompleteness typical of protein crystallography. In these experiments an average temperature factor of 20 was assumed, and an instrumental sensitivity such that 8 independent reflections per atom (excluding H atoms) would be observed. Xuong: What is the influence of the data accuracy on the results of this kind of refinement? Sayre: I have only preliminary results bearing on that question. It appears, again using artificial structures, that the structure factor magnitudes can be perturbed with random errors up to a standard devia- tion of 30% without as a rule preventing the refinement from locating the correct minimum. Similarly the procedure appears normally to tolerate the introduction of random errors up to a standard deviation of 30Â° in the starting set of phases (to 3 A). 65

The Role of the Minicomputer in the Crystallography Laboratory Robert A. Sparks It is universally recognized that the minicomputer plays an important role in the crystallographic laboratory. The semi-automatic and auto- matically controlled diffractometers have offered welcome relief for crystallographers who previously had to measure large amounts of data manually. As with most computer-controlled instruments, early programs tended to be written to collect data in much the same way that the cry- stal lographer would have used to operate the instrument manually. Later programs have taken advantage of the flexibility of the computer to perform tasks that would have been virtually impossible to perform with the manual or semi-automatic diffractometer. Thus, it is now possible to: 1. Automatically center reflections. 2. Sample the profile for each reflection during data collection. 3. Choose different scan speeds dependent on the intensity of the reflections. 4. Search for peak maxima for all but the weakest reflections. 5. Measure reflections at many azimuthal angles about the dif- fraction vectors. 6. Measure regions of reciprocal space in a three-dimensional fashion to obtain diffuse-scattering information. 7. Automatically redetermine the crystal orientation if the crystal should move during data collection. 8. Obtain information about crystal quality and crystal symmetry. All of this can be done with a slow computer having a minimal amount of core (4000 words). For subsequent processing of the collected data a magnetic tape drive is desirable. Magnetic tape is chosen because it is an inexpensive means of storing large amounts of data in formats that are universally recognized by large and small computers. As collection methods become more complex one soon realizes that the limiting factor is the amount of core available. Thus, it is becoming fairly common for computer-controlled diffractometers to have more than 4000 words of core or to have a disk for program overlays. For reasons of economy, manufacturers of computer-controlled dif- fractometers have chosen the least expensive computers that can easily be interfaced to the many control and acquisition functions of the diffrac- tometer. Almost all commercial instruments are of the one instrument- one minicomputer type. Other configurations, such as several instruments- 66

one medium-size computer or one instrument-one minicomputer-communication link-large computer, have the possible advantage that more computer capa- bility becomes available for the diffractometer for at least part of the time. However, such approaches are expensive because they are almost always one of a kind. The advantage of the one instrument-one minicomputer is that the development cost, of which a large part is for software, can be spread over many identical instruments. Although the diffractometer experiments are slow, the inexpensive minicomputers used to control the diffractometers are not slow. The state of computer technology is such that computers with memory-cycle times of about one microsecond and execution times of one to three microseconds for most commands are no more expensive to build than computers that are one-half or one-tenth as fast. Therefore, the minicomputer is used to perform some calculations that do not use the diffractometer-control features of the computer. Thus, the minicomputer is used to determine indices for reflections, best least-squares unit-cell parameters, and Lorentz-polarization factors. All of these tasks can be performed by the large computer but are more conveniently done by the minicomputer. Some crystallographers have used the small computer for tasks more traditionally performed on the large computer. Thus, Eric Gabe uses the PDP-8 which controls his Picker diffractometer to do structure-factor calculations. Shiono for many years has used the IBM ll30 to do almost all types of crystallographic calculations. For the most part, however, crystallographic computations are done on the most powerful computer available. Why this is so is illustrated in the first two columns of Table 1, which compar_e the characteristics of the Nova 1200, used to control the Syntex Pi Autodiffractometer, with the characteristics of the CDC 6600, one of the most powerful computers used for crystallo- graphic calculations. Although core speeds are not very different, the CDC 6600 can achieve effective speeds of up to 100 nanoseconds because the memory has been divided into independent blocks of 4096 words each. In every other respect the CDC 6600 is a much more powerful computer. Because of the large core memory, all structure factor data and the nor- mal equations of a least-squares program can be resident. Because of the fast instruction registers, tight loops can be executed with no need to continually reference slow core. Because of the many arithmetic units and addressing and indexing registers, many operations can take place simultaneously. Not of least importance is the fact that CDC has an ex- cellent FORTRAN compiler which makes efficient use of all this sophisti- cated hardware. On the other hand, the minimal Nova 1200 configuration does not have enough core and is so slow that many of the important crystallographic programs would be virtually impossible to run for all but the smallest structures. I believe, however, that the most serious limitation is the unavailability of compilers of higher-level languages producing progams to minimize amount of core needed at run time. This deficiency has 67

meant that the crystallographer has not easily been able to tailor his data-collection programs to meet his requirements. The minicomputer industry however, is, advancing rapidly. The industry is extremely competitive and prices for all parts of the hard- ware are decreasing at a phenomenal rate. New innovations - for example, semiconductor memory - are introduced into minicomputers almost simul- taneously with introduction in large computers. Good compilers with full FORTRAN IV capability are available for computers with larger core memories (12 000 words or more for the NOVA computers). Disk operating systems that are as flexible and easy to use as those found on large computers are now arailable. Finally, fast floating-point hardware is now optional from some of the minicomputer manufacturers and also from several independent firms. The third column of Table 1 lists the characteristics of a system that would satisfy all or almost all of the crystallographer's computing needs. In addition to the basic 4000-word NOVA with a magnetic tape drive required for the PT Autodiffractometer, this system has an additional 12 000 words of core, 131 000-word fixed-head disc, and floating-point hardware. Software consists of FORTRAN IV and a Disc Operating System. Crystallographic data-processing programs would be written in FORTRAN IV. Programs would reside on magnetic tape reels and be loaded on to disc when needed. Large programs would consist of several overlays. Large arrays would also reside on disc and be brought in to core one sector at a time. Diffractometer programs would also be written in FORTRAN IV but using machine-language subroutines for driving the goniometer axes, reading the encoders and sealer, opening and closing the shutter, etc. Table 1 Comparison of Nova 1200 and GDC 6600 Nova 1200 with Structure Minimal Nova 1200 Determination Package CDC 6600 Magnetic Tape Drive many 1 1 Core Speed i.Ons 1.2 MS 1.2jis Word Size 60 bits 16 bits 16 bits Core Size 131 000 words 4000 words 16 000 words 131 000-word disk Table 1 continued 68

Table 1 Comparison of Nova 1200 and CDC 6600 CDC 6600 Minimal Nova 1200 Nova 1200 with Structure Determination Package Operand, Addressing, and Indexing Registers Fast Instruction Registers Floating Multiply Arithmetic Units FORTRAN IV & Operating System 24 4 8 (60 bits each) None (60 bits) 2 ms (32 bits) 10 Very Good No None 15.6 MS (32 bits) 24.2 jxs (64 bits) Good Almost all crystallographic programs could be run on such a system. It is hard to justify the cost of a plotter for the infrequent use crystallographers would make of it. Therefore, it would probably be most economical to generate plotting information on magnetic tape on this sys- tem and then have the actual plotting done at central facilities. Fourier maps would be generated on magnetic tape and either printed at central facilities or printed on the slow printer by the NOVA 1200. At ten char- acters per second a large Fourier map could take several hours to print. In many cases, good peak-picking programs exist that eliminate the need to print the maps. There is no question that such a system is feasible for almost all crystallographic calculations. The structure of vitamin B12 was solved and refined on a computer with a configuration closer to the basic NOVA 1200 than to the system proposed here. Indeed much of the philosophy of disc (or drum) usage and external plotting and printing of large files is identical to that used on the large computers of 5 and 10 years ago. Time-sharing of data collection and data processing presents prob- lems not associated with the amount of core or the arithmetic processing speed, but rather with the allocation of peripherals to the two tasks. Data collection must have the magnetic tape drive available for output of the intensity data. Therefore, production of a Fourier map could not be done simultaneously with data collection. However, Fourier calculations are quite fast (except for printing) and interrupting data collection for the few minutes necessary to generate the map and output it to magnetic 69

tape is not a serious limitation. Happily, the least-squares calculation which takes the bulk of time for structure determination requires the magnetic-tape drive only for the brief time necessary to dump all the data onto disc. After this, several operations could be performed with- out using the magnetic-tape drive and could be effectively overlapped with data collection. The proposed system in the crystallographer's laboratory is clearly more convenient than a centralized computing facility. It is, in most cases, also more economical. The inner loop of a least-squares program (namely the generation of the normal equations) was written in FORTRAN and executed on a number of different computers. The program is shown in Figure 1 and the results of the test in Table 2. The FORTRAN compilers that produce the most effi- cient code were used on the CDC 6600 and IBM 370/155. Because the floating- point hardware is fairly new for the NOVA machines, the FORTRAN compiler has not yet been modified to produce code for this feature. Reasonable substitutions were made in the assembly listing generated for the soft- ware floating-point version in order to produce the "FORTRAN-like" code. The hand-optimized version was an assembly language program written to be executed as efficiently as possible. If the matrix is large enough to require that it be stored on disc, the data-channel transfers would in- crease the NOVA 1200 times in this example by about 0.8. If 64-bit floating-point numbers are required, an increase of about 257o is required for the NOVA 1200 times. Table 2 Comparison of Time for Least-Squares Inner Loop time CDC 6600 0.93 s (60-bit words) IBM 370/155 7.5 s (32-bit words) HP 2100 A 150 s (32-bit words) (Hardware multiply/divide Software floating-point) HP 2100 A 29 s (32-bit words) (Hardware floating-point) Nova 800 (Software floating-point) 206 s (32-bit words) Table 2 continued 70

Table 2 Comparison of Time for Least-Squares Inner Loop time Nova 800 (Hardware floating-point) FORTRAN-like code generation 16.8 s (32-bit words) Hand-optimized code 13.2 s (32-bit words) Nova 1200 360 s (32-bit words) (Software floating-point) Nova 1200 (Hardware floating-point) FORTRAN-like code generation 24.2 s (32-bit words)* Hand-optimized code 17.5 s (32-bit words) Calculated from Nova 800 performance. Typically, we at Syntex use about one hour of CDC 6600 computer time for a structure with 40-50 non-hydrogen atoms in the asymmetric unit. If the FORTRAN test in Figure 1 is typical, the "FORTRAN-like" time on the NOVA 1200 would be 26 hours for this same structure. This amount of time is small compared to typical data collection times of one to two weeks. Even without overlap of data collection and data processing there would not be a serious deterioration of diffractometer usage. With simultaneous least-squares calculations and data collection, diffractometer servicing will be negligibly affected. N = 64 NREF = 100 M = N + 1 MM = M + 1 DO 6001 IP = 1, NREF DO 20 I = 1, M 20 DV(I) = I *IP*0.9 K = 1 DO 5001 J = 1, N B = DV (J) IF (B.NE.O) GO TO 5002 K = K + MM - J Figure 1 FORTRAN test program (continued) 71

GO TO 5001 5002 DO 5003 L = J, M A (K) = A (K) + DV (L) *B 5003 K = K + 1 5001 CONTINUE 6001 CONTINUE Figure 1 FORTRAN test program (continued) Even though the NOVA 1200 with floating point hardware is 26 times slower than the CDC 6600 and 3.2 times slower than the IBM 370/155, turn- around time will in many cases favor the dedicated computer because it is located in the crystallographer's laboratory. Another important feature of the small dedicated system compared to the large very fast computer is that it is impossible on the former system to find out one day that an error made by a student has exhausted the year's computer budget. Because of the above arguments, Syntex has decided to make avail- able to customers a Structure Determination Package which would consist of a 131 000-word fixed head disc, 12 000-word core, and floating point hardware for those who already have a Pl Autodiffractometer or AD-l Auto- donsitometer, and a stand-alone unit consisting of a NOVA 1200, a 131 000- word fixed head disc, 16 000-word core, floating point hardware, and a magnetic tape drive for those who do not have the Syntex instruments. Software will consist of a FORTRAN IV compiler modified to make efficient use of the floating point hardware, a Disc Operating System modified to allow time-sharing of data collection with data processing, machine lan- guage subroutines for the diffractometer, FORTRAN versions of the current diffractometer programs, and FORTRAN programs properly broken up into overlays for the basic crystallographic programs. The user will be able to add his own FORTRAN or assembly language programs to the library. At this early stage, it looks quite probably that the selling price would be $30 000 for hardware and software for the attachment to existing in- struments, and about $45 000 for the stand-alone option. First deliveries are scheduled for the second quarter of 1973. The comparison of the cost of the system proposed here compared with existing costs at centralized computing facilities is difficult to make. University computing centers may charge the scientist anywhere from nothing up to the actual cost of the computing service, depending on what other sources of funds are available to the centers. Commercial rates are set to provide a profit for the company providing the service, but are usually complex functions of CPU time, amount of core used, amount of 72

input and output, and job priority. In Palo Alto the Control Data Center provides the most economical service for crystallographic type problems. Syntex pays about $1000 per structure for their service. Clearly, for us, the break-even point would be 30 structures for the $30 000 attachment or 45 structures for the $45 000 stand-alone configuration. In conclusion, whether crystallographers would be inclined to buy the Syntex package or whether they would wish to buy directly from the computer manufacturers and provide their own software, I believe that serious consideration should be given to the small dedicated computer. Not only does it provide the desirable features of FORTRAN data-collection programs and the convenience of having one's own computer, but it also provides, in many cases, a substantial cost saving compared to the cen- tralized computer approach used by most crystallographers today. DISCUSSION Young: If you are doing full-matrix least-squares, what is the maximum number of parameters you can handle with this sort of adorned mini- system? Sparks: It turns out to be the same figure Jim Ibers quoted, 240, be- cause the disc size is 131 000 words. A good suggestion by Mike Murphy is that instead of using a fixed-head disc as we are here, we ought to be using a movable-head disc which costs quite a bit less for the amount of disc space that would be available. Then the capacity would be something like two million words. Young: If you put all those core packages and discs plus an extra arith- metic unit on the mini-computer, why do you bother with putting a diffractometer on it? Sparks: I've given you the choice. $45 000 or $30 000. Young: No. My point is that what you've done is build a separate com- puter system, and the fact that the diffractometer is hooked on is incidental. 73

Sparks: It does give some capability for the collection programs that we do not now have. A couple of years ago you made a strong point that we ought to be writing these collection programs in FORTRAN. Lowrey: Professor S. H. Bauer at Cornell University has an extensive system for electron diffraction that is built around the PDP-8 and he has made extensive use of cathode-ray-tube display. He is able to search his electron diffraction data and his radial distri- butions and look at very fine portions. With respect to Fourier maps, instead of having to print them out you can set up a graphic interaction display for picking out the things you want. Bauer is able to do a great deal of electron diffraction using solely the small computer. He considers the advantage is that not only is it cheap but it is under his direct control so that he can run all night and have a guarantee of getting his programs back, and not have the problems of priorities on commercial computing systems. Sparks: We also sell a three-dimensional display. Ibers: Two points might be kept in mind. (1) It is easier to get com- puting money in a grant than it is to get $45 000 to buy a small computer. (2) It may be possible to sneak small computers into laboratories throughout a campus by claiming that these computers are controlling experiments, but their presence makes computer cen- ter directors very nervous, for good reason. If the small computer proliferates throughout the campus you are in trouble. Suppose we have 20 computers of the type you have discussed. In effect a million dollars has been spent and it has not benefited the central computing facility at all. For the good of the university community it might have been more reasonable to put that million dollars into the central facility. In any event there are obviously political problems that are by no means negligible. Sparks: Yes. I am aware of this. My feeling is that the instrument ought to be treated as having a very special application. It is not by any means a general purpose computer. Fritchie: Do you have any idea what the annual maintenance costs are on this $45 000 system? Computer alone perhaps? Sparks: I do not have that figure. What is it on the diffractometer? Dewar: It will be around 77o. Coppens: What is the capacity of the system? In other words, how many crystallographers can it handle? Sparks: It depends on how productive those crystallographers are. It's 74

better to say, how many structures could you reasonably hope to do on a system like this. We think that for a 40-50 atom struc- ture it would take twenty-six hours for the structure determination. It certainly takes quite a bit longer to collect the data. So really, you are still limited by the amount of time it takes to collect the data. Coppens: So the system has over-capacity for one crystallographic group. Sparks: Yes, it has indeed. Corfield: I think this system is not totally unreasonable, but what makes it reasonable is the availability of inexpensive hardware floating- point arithmetic units. We've had at Ohio State University for the past two or three years a system rather more sophisticated than this but that does not have hardware floating-point arithmetic. Present- ly we do all our least-squares and all our Fourier summations in- house, but once we get up to a couple of hundred variables, it would be worth our while to use a larger computer because of the limita- tions of the software floating-point arithmetic on our in-house machine. Medrud: If this kind of approach is attractive to other crystallographers, there is another encouraging factor in the change in attitude of some of the minicomputer manufacturers. Our first contacts with them, with regard to our application, were met with disdain. The most recent contacts other people in our group have had with them indicate much more interest in systems development. They formerly wanted to hand you a computer and a bag of hardware for interfacing and say "go to it", but now they are willing to discuss a system comparable to yours. 75

Next: What are the Funding Agencies Doing, and What are Their Plans for the Future? »

Computational Needs and Resources in Crystallography: Proceedings of a Symposium, Albuquerque, New Mexico, April 8, 1972. (1973)

Chapter: What New Developments are in the Wind?

Welcome to OpenBook!

Get Email Updates