Molecular Diversity and Combinatorial Chemistry in Drug Discovery
Overview of the Drug Discovery Process
The discovery of new drugs is a time-consuming, risky, and expensive process. This remains true even though the past 15 years have seen a dramatic increase in the number of three-dimensional structures of proteins that can be used as scaffolds for the conceptual and computational aspects of drug design. Once the biological target has been chosen, discovery traditionally moves through several stages (see Science, 1994).
First, a moderately active compound, a "lead," is identified from clues provided by the literature, through "random" screening of many compounds, or through targeted screening of compounds identified by three-dimensional searching or docking. If the three-dimensional structure of the biological target macromolecule is known, then one may use three-dimensional searching to identify existing compounds that are complementary to the binding site in the target (Kuntz, 1992; Martin, 1992). Alternatively, if a number of structurally unique compounds bind to the target, one may propose a three-dimensional pharmacophore and search databases for matches to it (Martin, 1992).
| A pharmacophore is the chemical identity and geometrical arrangement of the key substituents of a molecule that confer biochemical or pharmacological effects. |
Once a lead has been identified, hundreds to thousands of additional compounds are designed and synthesized to optimize the biological profile. The cost and environmental impact of synthesis and patentability are also important issues. If the three-dimensional structure of the biological target is known, then molecular modeling might be used in the design (Erickson and Fesik, 1992). As testing data are accumulated, statistical three-dimensional quantitative structure-activity relationships (three-dimensional QSAR) may help set priorities for synthesis. Any attractive compounds found are tested in more detail through advanced protocols that more reliably forecast therapeutic and toxicity potential.
Lastly, the surviving compounds are prioritized, and the best compound known at that time is prepared for clinical trial.
New mathematical techniques could have an impact on the rate of new compound discovery if the potency of compounds could be forecast more quickly and accurately before their synthesis. Many of the improvements in computational chemistry discussed elsewhere in this report would also impact the ability to forecast affinity based on the structure of the ligand and the macromolecular target. However, additional opportunities exist for cases in which the structure of the macromolecular target is not known, cases for which the forecast is based on three-dimensional QSAR investigations (Kubinyi, 1993). The most explored method is comparative molecular field analysis (CoMFA; Cramer et al., 1988). With CoMFA, molecules are aligned with each other; then for each molecule, the interaction energies with various probes are calculated at intersections of a three-dimensional lattice that encloses all the molecules. The relationships between these thousands of energy values and the potencies of the 10 to 100 molecules are established by the statistical method of partial least squares (PLS) with leave-one-out cross-validation (see Frank and Friedman, 1993, for background on PLS and comparisons of the method to other statistical procedures). When a CoMFA model is found, it generally has quite robust forecasting ability: the average error in forecasting the potency of 85 compounds in eight datasets is 0.55 logs or 0.8 kilocalories per mole (Martin et al., in press).
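The two error figures quoted above are related by the standard conversion between a change in log potency and a change in binding free energy. As a rough check, assuming that potency behaves like an equilibrium constant and taking T = 298 K (the temperature is an assumption; the text does not state one):

```latex
\Delta\Delta G = 2.303\,R\,T\,\Delta(\log K)
  = 2.303 \times \left(1.987\times10^{-3}\ \mathrm{kcal\,mol^{-1}\,K^{-1}}\right) \times 298\ \mathrm{K} \times 0.55
  \approx 0.75\ \mathrm{kcal\,mol^{-1}}
```

This rounds to the 0.8 kilocalories per mole given in the text; using 310 K instead gives about 0.78.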
However, there are indications that one may fail to find a model, even though one exists, because of the coarseness of the lattice spacing (2 Å) and the sensitivity of PLS to noise. PLS can find only linear relationships between properties and biological potency; a method that could detect nonlinear relationships would be an improvement and might model more sets of data. Limited experience with neural nets has shown no improvement over PLS. There might be an optimization method that could select the relevant variables from a pool of thousands. It would have to be roughly as fast as PLS (a minute or so to do leave-one-out cross-validation on 25 compounds), since the analysis involves repeatedly comparing results obtained with different properties calculated at the lattice points, adding whole-molecule properties, comparing alignment rules, investigating outliers, and combining and separating subseries of molecules.
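The PLS procedure with leave-one-out cross-validation referred to above can be sketched in a few lines. This is a minimal single-response (PLS1) illustration in pure Python, not the implementation used in CoMFA software; the function names `pls1_fit`, `pls1_predict`, and `loo_q2` are hypothetical, and a real analysis would have thousands of lattice-energy columns rather than the two used in the test data.

```python
import math

def pls1_fit(X, y, n_components):
    """Fit a PLS1 model by NIPALS-style deflation on mean-centered data."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # X residuals
    f = [v - y_mean for v in y]                                # y residuals
    components = []
    for _ in range(n_components):
        # Weight vector: direction of maximal covariance with the residual y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:      # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        if tt < 1e-12:
            break
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate: remove what this component explains from X and y.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, load, q))
    return x_mean, y_mean, components

def pls1_predict(model, x):
    """Predict y for one new sample by replaying the deflation steps."""
    x_mean, y_mean, components = model
    e = [x[j] - x_mean[j] for j in range(len(x))]
    y_hat = y_mean
    for w, load, q in components:
        t = sum(e[j] * w[j] for j in range(len(e)))
        y_hat += q * t
        e = [e[j] - t * load[j] for j in range(len(e))]
    return y_hat

def loo_q2(X, y, n_components):
    """Cross-validated q^2 = 1 - PRESS / SS, leaving one compound out at a time."""
    n = len(X)
    press = 0.0
    for i in range(n):
        model = pls1_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_components)
        press += (y[i] - pls1_predict(model, X[i])) ** 2
    y_mean = sum(y) / n
    ss = sum((v - y_mean) ** 2 for v in y)
    return 1.0 - press / ss
```

Because every left-out compound requires a full refit, the cost grows with the number of compounds times the cost of one fit, which is why the speed of the underlying regression matters for the interactive style of analysis described above.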
Sources of Molecular Diversity
The weak point in the whole scenario of new drug discovery has been identification of the "lead." There may not be a "good" lead in a company's collection. The wrong choice can doom a project to never finding compounds that merit advanced testing. Using only literature data to derive the lead may mean that the company abandons the project because it cannot patent the compounds found. These concerns have led the industry to focus on the importance of molecular diversity as a key ingredient in the search for a lead. Compared to just 10 years ago, orders of magnitude more compounds can be designed, synthesized, and tested with newly developed strategies. These changes present an opportunity for the imaginative application of mathematics.
Automated testing methods employing simplified assays, mixing strategies, robotics, bar-coding, etc., have led many pharmaceutical and biotechnology companies to test every available compound, perhaps 10^[?] to 10^6 of them, in biological assays of interest (Gallop et al., 1994; Gordon et al., 1994). Testing a collection generally takes approximately six months. This operation presents several challenges: (1) Is it really necessary to test all of the compounds in order to identify the series of compounds that will show the activity? (2) Should a pilot set of compounds be tested first to adjust the assay conditions and forecast how many active compounds will be found? If so, how would this set be selected? (3) What compounds, available from outside vendors, should be selected for purchase to complement the set of in-house compounds? Is there a way to quantify their worth other than the cost to synthesize in-house?
Concurrently, synthetic chemists developed new strategies that provide large numbers of compounds for biological testing, typically as mixtures. Such libraries, synthesized in a few months, can contain 10^[?] to 10^[?] different chemical structures (Baum, 1994). Although this number of compounds seems high, note that it has been estimated that there are 10^[?] stable chemical compounds of molecular weight less than 750 that contain only carbon, hydrogen, nitrogen, oxygen, and sulfur. Even factoring in their possibility of synthesis and realistic chemical and physical properties still leaves on the order of 10^[?] compounds to consider. How, then, does one choose which 10^[?] compounds should be included in the first library, or the second?
A final strategy to enhance molecular diversity results from computer programs that design molecules to meet specified three-dimensional criteria, typically based on the experimental structure of a protein binding site (Rothstein and Murcko, 1993). The programs design molecules to meet geometric criteria and include electrostatic complementarity at the level of force fields such as those used for molecular dynamics. The diversity arises from the combinatorics: a protein binding site usually contains at least four or five hydrogen-bonding or charged groups; a ligand might interact with most or all of them, and many different templates might be able to fit into the binding site and orient polar groups for optimal interaction. Hence, it is expected that a huge number of nicely fitting molecules might be designed. Although design programs could be set up to produce only those molecules that could be synthesized readily, this severely limits the diversity. Hence, it is likely that the designed molecules will have to be made by traditional synthesis. This places a realistic upper limit of 25 molecules to be selected. Even if binding affinity could be forecast precisely, we are a long way from forecasting every type of toxicity or drug metabolism quirk that a molecule might possess. Again, we face the problem of selection of the most diverse sample from a population.
Current Computational Approaches to Compound Selection
There are three aspects to the problem of selecting samples from large collections of molecules: First, what molecular properties will be used to describe the compounds? Second, how will the similarity of these properties between pairs of molecules be quantified? Third, how will the molecules be grouped or clustered?
For datasets of size 10^[?] and higher, the standard method of describing the molecules for clustering encodes the presence or absence of substructural features in a bit-string, typically of length 256-1024 (Willett, 1987; Hodes, 1989). In modern systems, these substructural features are recognized by enumerating all paths of length 0-7 in the molecular graph and using these to populate one or more of the bits (Weininger et al., 1994). It typically takes one to two hours on a modern workstation to generate such fingerprints of a database of 10^[?] compounds. The time required for this process increases linearly with the number of compounds.
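The path-based fingerprinting described above can be illustrated with a small sketch. This is not the Daylight algorithm; it is a simplified stand-in that assumes a molecule is given as a dictionary of atom symbols plus a list of bonded atom pairs, ignores bond orders, and sets one bit per path using a CRC32 hash. The names `enumerate_paths` and `fingerprint` are hypothetical.

```python
import zlib

def enumerate_paths(atoms, bonds, max_bonds=7):
    """Return the set of linear atom paths containing 0..max_bonds bonds.

    atoms: dict mapping atom index -> element symbol
    bonds: list of (index, index) pairs; bond orders are ignored here
    """
    neighbors = {a: [] for a in atoms}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    paths = set()

    def walk(path):
        symbols = tuple(atoms[i] for i in path)
        # A path and its reverse describe the same feature; keep one form.
        paths.add("-".join(min(symbols, symbols[::-1])))
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbors[path[-1]]:
            if nxt not in path:          # simple paths only, no revisits
                walk(path + [nxt])

    for start in atoms:
        walk([start])
    return paths

def fingerprint(atoms, bonds, n_bits=256, max_bonds=7):
    """Hash every path to one position of an n_bits-long bit-string (an int)."""
    bits = 0
    for p in enumerate_paths(atoms, bonds, max_bonds):
        bits |= 1 << (zlib.crc32(p.encode("ascii")) % n_bits)
    return bits
```

For ethanol, written as `{0: "C", 1: "C", 2: "O"}` with bonds `[(0, 1), (1, 2)]`, the sketch finds five distinct paths (C, O, C-C, C-O, C-C-O). Each molecule is processed independently, which is consistent with the linear scaling noted above.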
The second step is to calculate the similarity of every molecule to every other molecule in the dataset. The similarity measure traditionally used, the Tanimoto coefficient, is expressed as
$$\mathrm{Sim}_{ij} = \frac{F_{ij}}{F_i + F_j - F_{ij}}$$
where $\mathrm{Sim}_{ij}$ is the similarity of molecule i to molecule j, $F_{ij}$ is the number of features (bits set to 1) in common between molecule i and molecule j, $F_i$ is the number of bits set in molecule i, and $F_j$ is the number of bits set in molecule j. For the same 10^[?] compounds, this process takes on the order of 24 hours. Since every molecule is compared with every other, it scales as the square of the number of compounds. Lastly, the Jarvis-Patrick clustering method (Jarvis and Patrick, 1973) is used to group the compounds. This method is based on comparing the nearest neighbors of compounds and is very fast, taking only seconds to accomplish. Although each of these steps is feasible, none is optimal.
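The Tanimoto coefficient and the quadratic cost of the all-pairs step can be sketched as follows. The sketch assumes fingerprints are stored as Python integers used as bit-strings; the function names are hypothetical, and returning 1.0 for two empty fingerprints is a convention chosen here, not something the text specifies.

```python
def tanimoto(fp_i, fp_j):
    """Tanimoto similarity of two bit-string fingerprints stored as ints."""
    f_ij = bin(fp_i & fp_j).count("1")   # bits set in both molecules
    f_i = bin(fp_i).count("1")           # bits set in molecule i
    f_j = bin(fp_j).count("1")           # bits set in molecule j
    union = f_i + f_j - f_ij
    return f_ij / union if union else 1.0  # assumed convention for empty inputs

def similarity_matrix(fingerprints):
    """Full symmetric similarity matrix; each pair is computed once."""
    n = len(fingerprints)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = tanimoto(fingerprints[i], fingerprints[j])
    return sim

def n_pairs(n):
    """Number of distinct pairwise comparisons among n molecules."""
    return n * (n - 1) // 2

print(tanimoto(0b1101, 0b1011))  # 2 common bits / (3 + 3 - 2) = 0.5
```

Since `n_pairs(n)` grows as roughly n squared over two, a tenfold increase in the number of compounds means about a hundredfold increase in comparisons.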
Opportunities for Improvements in Computational Approaches to Compound Selection
Molecular fingerprints are not the best descriptor to use to select compounds for bioactivity since the biological properties of compounds depend on their three-dimensional complementarity of shape and electronic properties with those of the target biomolecule. Clearly, we would like to consider the three-dimensional structures of the molecules--shape, the location of intermolecular recognition sites such as hydrogen-bonding or charged groups, and the way in which the position of these features changes with changes in conformation. Speeding up such calculations or representations of the results would be a big help.
However, the problem of how to represent conformational flexibility in this context is a bigger challenge--do we need a totally different way to represent the structures than coordinates or distances between pairs of atoms? It is important to recognize that two-dimensional molecular structure is the basis of both chemical synthesis strategy and patent claims, and so the representation must also include the two-dimensional structure. This is a clear example where the choice of model is significant and must be the result of close collaboration between mathematical and chemical scientists.
Any new molecular descriptor will require that one define a corresponding metric for the similarity or distance between compounds to be used in grouping them. For example, in contrast to substructural features, which are either present or absent, distances are continuous and do not fall so easily into a bit-string. Should distances be binned--if so, should the bins be fuzzy or overlapping? How is similarity evaluated in such cases?
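One of the options raised above, overlapping bins, can be made concrete with a small sketch. It is only an illustration of the question, not a recommended encoding: the bin edges, the `overlap` parameter, and the function name `distance_bits` are all assumptions introduced here.

```python
def distance_bits(distance, edges, overlap=0.0):
    """Encode one interatomic distance as bits, one bit per bin.

    edges:   ascending bin boundaries, e.g. [2.0, 3.0, 4.0, 5.0, 6.0]
    overlap: amount by which each bin is widened on both sides; with
             overlap > 0 a distance near a boundary sets two adjacent bits
    """
    bits = 0
    for k in range(len(edges) - 1):
        if edges[k] - overlap <= distance < edges[k + 1] + overlap:
            bits |= 1 << k
    return bits
```

With crisp bins and edges at whole units, distances of 2.95 and 3.05 fall into different bins and share no bit, although they differ by only 0.1. With `overlap=0.2` both set the same two bits, so a bit-based similarity measure would treat them as alike. The price is that the number of set bits now depends on where a distance falls within a bin.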
On the other hand, one might quantitate the similarity of two molecules by the size and composition of the maximum common substructure. Experience with the Bron-Kerbosch algorithm (Bron and Kerbosch, 1973) has shown a rate of 10^[?] comparisons per hour. For a dataset of size 10^[?], it would take 10^[?] comparisons or 10^[?] hours to prepare the similarity matrix! Similarly, 10^[?] molecules would require 10^[?] comparisons and 10 hours. Although parallelization might allow one to perform the calculations, a better algorithm might accomplish the same thing with greater efficiency. It might be possible to eliminate most of the comparisons while retaining all important ones.
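Maximum common substructure detection is commonly cast as finding cliques in a compatibility graph built from pairs of matching atoms, which is where the Bron-Kerbosch algorithm enters. The sketch below shows only the clique enumeration step, in the variant that uses a pivot vertex to prune branches; building the compatibility graph from two molecules is omitted, and the function name and adjacency representation are assumptions.

```python
def bron_kerbosch(adj):
    """Enumerate all maximal cliques of an undirected graph.

    adj: dict mapping each vertex to the set of its neighbors
    Returns a list of cliques, each as a sorted list of vertices.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already processed
        if not p and not x:
            cliques.append(sorted(r))
            return
        # Pivot on the vertex with the most neighbors among the candidates;
        # only non-neighbors of the pivot need to be tried at this level.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques
```

For a triangle 0-1-2 with a pendant vertex 3 attached to 2, the maximal cliques are [0, 1, 2] and [2, 3]. The worst-case cost of clique enumeration grows exponentially with graph size, which helps explain why each pairwise comparison is expensive and why avoiding unnecessary comparisons is attractive.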
Improvements in the grouping of compounds are sorely needed. The Jarvis-Patrick algorithm performs poorly on sets of very diverse compounds. For example, a typical result produces a few large clusters, each containing very different compounds, and many singletons. There are sometimes clusters that contain compounds with a similarity of 0.2 on a scale of 0.0 to 1.0, with 1.0 the upper limit. Clearly, this is not clustering similar compounds together.
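For reference, the shared-nearest-neighbor idea behind Jarvis-Patrick clustering can be sketched as follows. This is one common variant, stated here as an assumption: two objects are linked when each appears in the other's list of k nearest neighbors and the two lists share at least `k_min` members, and clusters are the connected groups of linked objects. The input is a precomputed distance matrix (for similarities, one minus the similarity could be used).

```python
def jarvis_patrick(dist, k, k_min):
    """Cluster objects from a square distance matrix by shared neighbors."""
    n = len(dist)
    # k nearest neighbors of each object, excluding the object itself;
    # ties are broken by index so the result is deterministic.
    nn = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: (dist[i][j], j))
        nn.append(set(order[:k]))

    parent = list(range(n))          # union-find forest over the objects

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in nn[i] and i in nn[j]
            if mutual and len(nn[i] & nn[j]) >= k_min:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Because membership is decided by neighbor lists rather than by the similarity values themselves, an isolated object ends up as a singleton, and chains of links can join objects that are not very similar to each other, which is consistent with the behavior described above.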
Much better results are found, albeit with datasets of 1000, by using statistically based agglomerative clustering methods. For 1000 compounds, the clustering takes approximately one day and would scale roughly as the square of the number of compounds. Since we typically would expect to investigate no more than 1/100 as many clusters as original compounds, divisive methods might have an advantage because in this approach, clustering starts with one huge cluster, divides clusters into tighter ones, and could stop once the target number of clusters was formed.
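The appeal of a divisive approach, stopping as soon as the target number of clusters is reached, can be illustrated with a simple heuristic. The sketch below is not any published algorithm: it repeatedly takes the cluster with the largest diameter, uses its two most distant members as seeds, and assigns every other member to the nearer seed. The function name and arguments are hypothetical.

```python
def divisive_clusters(items, dist, n_clusters):
    """Split items top-down until n_clusters clusters exist.

    items: list of objects; dist: function(a, b) -> non-negative distance
    Returns a list of clusters, each a list of indices into items.
    """
    if not items:
        return []

    def widest_pair(cluster):
        # (diameter, index_a, index_b) for the most distant pair in a cluster
        best = (0.0, cluster[0], cluster[0])
        for a in range(len(cluster)):
            for b in range(a + 1, len(cluster)):
                d = dist(items[cluster[a]], items[cluster[b]])
                if d > best[0]:
                    best = (d, cluster[a], cluster[b])
        return best

    clusters = [list(range(len(items)))]   # start with one huge cluster
    while len(clusters) < n_clusters:
        spans = [widest_pair(c) for c in clusters]
        k = max(range(len(clusters)), key=lambda idx: spans[idx][0])
        diameter, seed_a, seed_b = spans[k]
        if diameter == 0:                  # nothing left to split
            break
        widest = clusters.pop(k)
        left = [i for i in widest
                if dist(items[i], items[seed_a]) <= dist(items[i], items[seed_b])]
        right = [i for i in widest if i not in left]
        clusters.extend([left, right])
    return clusters
```

This naive version recomputes diameters from scratch and is quadratic in cluster size, so it illustrates only the early-stopping structure, not an efficient method.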
At this time, no method other than Jarvis-Patrick is known to the computational chemistry community that will group 10^[?] or 10^[?] objects in a time scale of less than a week (Willett et al., 1986; Willett, 1987; Whaley and Hodes, 1991). There are unpublished reports that the divisive algorithm of Guenoche et al. (1991) classifies 10^[?] compounds overnight on a personal computer once the pairwise similarities have been calculated. However, it seems possible that there may be better ways to discover the groups of compounds in a dataset.
References
Baum, R.M., 1994, Combinatorial approaches provide fresh leads for medicinal chemistry, Chemical and Engineering News (February 7) 20-26.
Bron, C., and J. Kerbosch, 1973, Algorithm 457. Finding all cliques of an undirected graph. Commun. ACM 16:575.
Cramer III, R.D., D.E. Patterson, and J.D. Bunce, 1988, Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins, J. Am. Chem. Soc. 110:5959-5967.
Erickson, J.W., and S.W. Fesik, 1992, Macromolecular X-ray crystallography and NMR as tools for structure-based drug design, In Annual Reports in Medicinal Chemistry, Vol. 27, M.C. Vernuti, ed., Academic Press, New York, pp. 271-289.
Frank, I.E., and J.H. Friedman, 1993, A statistical view of some chemometrics regression tools, Technometrics 35:109-135.
Gallop, M.A., R.W. Barrett, W.J. Dower, S.P.A. Fodor, and E.M. Gordon, 1994, Applications of combinatorial technologies to drug discovery. 1. Background and peptide combinatorial libraries, J. Medicinal Chem. 37:1233-1251.
Gordon, E.M., R.W. Barrett, W.J. Dower, S.P.A. Fodor, and M.A. Gallop, 1994, Applications of combinatorial technologies to drug discovery. 2. Combinatorial organic synthesis, library screening strategies, and future directions, J. Medicinal Chem. 37:1387-1401.
Guenoche, A., P. Hansen, and B. Jaumard, 1991, Efficient algorithms for divisive hierarchical clustering with diameter criterion, J. Classification 8:5-30.
Hodes, L., 1989, Clustering a large number of compounds. 1. Establishing the method on an initial sample, J. Chem. Inform. Comput. Sciences 29:66-71.
Jarvis, R.A., and E.A. Patrick, 1973, Clustering using a similarity measure based on shared nearest neighbors, IEEE Trans. Comput. C-22:1025-1034.
Kubinyi, H., 1993, 3D QSAR in drug design, Theory, Methods, and Applications, ESCOM, Leiden, 759 pp.
Kuntz, I.D., 1992, Structure-based strategies for drug design and discovery, Science, 257:1078-1082.
Martin, Y.C., 1992, 3D database searching in drug design, J. Medicinal Chem. 35:2145-2154.
Martin, Y.C., K.-H. Kim, and C.T. Lin, in press, Comparative molecular field analysis: CoMFA, in Linear Free Energy Relationships in Biology, M. Charton, ed.
Rothstein, S.H., and M.A. Murcko, 1993, GroupBuild: A fragment-based method for de novo drug design, J. Medicinal Chem. 36:1700-1710.
Science, 1994, Research news: Drug discovery on the assembly line, 264:1399-1401.
Weininger, D., C. James, and J. Yang, 1994, Daylight Chemical Information Systems, Manual to Version 4.34, Daylight Chemical Information Systems, Irvine, Calif.
Whaley, R., and L. Hodes, 1991, Clustering a large number of compounds. 2. Using the connection machine, J. Chem. Inform. Comput. Sciences 31:345-347.
Willett, P., 1987, Similarity and Clustering in Chemical Information Systems, Research Studies Press, Letchworth.
Willett, P., V. Winterman, and D. Bawden, 1986, Implementation of nonhierarchic cluster analysis methods in chemical information systems, J. Chem. Inform. Comput. Sciences 26:109-118.