Cells are composed of molecules, and their properties are largely determined by the chemical reactions those molecules perform and by the structures they form. The molecules range from the small and ubiquitous, such as water and various salts and metal ions, to the very large molecules specific to living systems, such as DNA, the genetic material. In between lies a wide variety of organic molecules necessary for life, including sugars and lipids, vitamins and other enzyme cofactors, and the nucleotide and amino acid monomers required for DNA, RNA, and protein synthesis. The molecules interact in many ways and are capable of recognizing one another; some are active only as parts of larger complexes. These dynamic interactions are the essence of the processes of living cells. Cells in turn have mechanisms for accurately controlling their composition, for obtaining from the environment molecules they can use, and for synthesizing those they need. They can modify their composition depending on their environment and their fate. At every cell division they must duplicate essentially every component, except when cells occasionally undergo asymmetric division so that the two daughter cells can take on different roles.
For the last 50 years, the field of molecular biology has examined the hereditary and information-carrying molecules of DNA, RNA, and proteins; these molecules will be the focus of this chapter. The determination of the double-helical structure of DNA made clear how genetic information is replicated and passed from generation to generation. Since then, much effort has been devoted to constructing a flow chart for the various forms of molecular information. One of the fruits of this labor has been a sense that many of the main information highways are known—for example, the central dogma describing the flow of information from DNA to RNA to proteins. The fact that reality is more complex and that surprises continually emerge, such as the recent appreciation for the variety of roles played by RNA, shows that this enormously successful field still has a long road ahead before it will be “solved” in any comprehensive sense.
One of the major achievements of molecular biology has been a nearly complete catalogue of the DNA sequences underlying a diverse set of information-bearing macromolecules: The genomes of humans and of many microbes, plants, and other animals are now known with considerable completeness and accuracy. There has thus emerged a sense that biological explanation is no longer primarily to be sought in finding new molecular actors but in understanding their individual functions and the patterns of organization and interaction that collectively determine the functions of the cell.
THE MATHEMATICS-BIOLOGY CONNECTION
Molecular biology has always relied heavily on mathematics. From the analysis of sequences to techniques for determining the three-dimensional structures of molecules to studies of the dynamics of entities ranging from individual molecules up to entire networks, mathematical techniques and computational algorithms are critical.
Because of the rapid advances in the technology for DNA sequencing, DNA sequences are now easily obtained, and protein sequences can be inferred with reasonably high accuracy and completeness. Thus, we now have an abundance of those sequences for analysis. DNA, RNA, and proteins are all linear polymers, or strings, made from a small alphabet of residues, 4 for DNA and RNA and 20 for proteins. The specific properties of any molecule, or the functions it serves, are determined by its sequence and its structure (in the appropriate context), though of course the structure is a result of the sequence and the molecule’s environment. One of the mathematical challenges of biology, then, is determining the mapping from sequence space to function space. The set of linear sequences over a small alphabet leads quite naturally to the concept of sequence space and the universe of all possible sequences. The concept of function space is a little harder to imagine, but certainly we could categorize all of the functions we know and consider them to be a partial set of all possible functions.
For proteins, and for some RNAs, function is critically dependent on structure, and knowledge of structure can provide information about function. Hence, the mapping from sequence space to structure space is part of the challenge, and may be part of the solution, of creating a map from sequence space to function space. The structure of a protein provides strong clues about its biochemical function—for example, the mechanism of action of an enzyme—but so far there have been only a few successes in predicting biological function from sequence. The structures of these macromolecules are also important for other research purposes: they are the starting point for predicting biochemical action, for modeling the dynamics of the macromolecules, for suggesting ways to inhibit the action of undesired proteins, for predicting potential chemical inhibitors or activators of a given protein, and for altering a protein’s functionality through its environment or through reengineering its sequence and, consequently, its structure. Therefore, developing the ability to map from sequence space to structure space is a critical challenge that, if met, would have a significant impact on all the biological sciences and on our understanding of life. Currently, inferences about structure and function rely on the simple assumption that sequences that are “close” in sequence space (using metrics derived from studies of evolution) are likely to map to nearby points in structure and function space. That is generally true, but there are complications. Short stretches of a protein can be exceptions to this rule, and larger proteins are composites of folded segments, or domains. In the absence of an experimentally determined structure, such as one from x-ray crystallography, we do not know in any detail how to ascertain where the domain boundaries lie. We also do not know which sequence differences are most critical, which might be most indicative of exceptions, or which might most effectively predict biological function.
Even a catalog of all the components of a cell (a complete “parts list”) detailing not only their sequences but also their structures and functions would not really explain the properties of that cell, because the system is far from equilibrium and in a very dynamic state. The properties of molecules often depend on their dynamics, from the catalytic activities of enzymes to the assembly of multicomponent structures, and many of a cell’s molecules must be transported to specific locations within or outside the cell in order to perform their functions. Cells sense their environment and respond to various stimuli by sending signals throughout the cell and to neighboring cells, modifying their behavior. Metabolic networks are subject to feedback regulation and other kinds of control, and the expression of specific genes is controlled by networks of regulatory factors and their interactions with each other and with cellular signals. Because many cellular processes are due to the actions of a small number of molecules, stochastic fluctuations cannot be ignored. In general, then, understanding the properties of cells depends on modeling the dynamics of the individual molecules and their interactions. These dynamics have both spatial and temporal components and are largely nonlinear and discrete. Mathematical analysis of dynamical systems has been essential for our current understanding, but the field appears to be on the verge of a major expansion. New technologies offer the opportunity to greatly increase what we can measure about cellular components. Using those data to inform dynamical models of the cell is critical to advancing our understanding of biology and is likely to tax existing mathematical techniques, opening up new areas of mathematical and biological research.
AREAS OF MATHEMATICAL APPLICATIONS FOR MOLECULES
The central role of sequences in molecular biology is unquestioned. The discovery that DNA, RNA, and proteins are all composed of linear sequences provided a strong link between active areas of research in molecular biology and in the mathematical sciences. A well-developed body of results for sequence analysis exists in computer science, much of it useful for biological problems (Gusfield, 1997; Waterman, 1995). The most classical string-search problem is to find exact occurrences of one sequence (usually short, which we will call the “pattern”) in another sequence (usually long, called the “text”). Sophisticated algorithms, such as Boyer-Moore and Knuth-Morris-Pratt, solve this problem efficiently (Pevzner, 2000). These techniques make it possible to find exact matches to the pattern in linear time (time proportional to the sum of the pattern and text lengths) or better; the obvious naive method takes quadratic time (time proportional to the product of the pattern and text lengths). Suffix trees also accomplish this task and, more importantly, are useful for more general problems (Gusfield, 1997). Certain biological problems involve exact patterns, an important example being restriction sites (usually four to eight base pairs in length), where particular enzymes cut DNA molecules.
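As a concrete illustration, a linear-time exact search in the Knuth-Morris-Pratt style can be sketched as follows; the DNA fragment and the choice of the EcoRI site GAATTC as the pattern are purely illustrative.

```python
def kmp_search(pattern, text):
    """Find all start positions of pattern in text in O(|pattern| + |text|) time."""
    # Build the failure function: fail[i] is the length of the longest proper
    # prefix of pattern[:i+1] that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text once, reusing already-matched prefixes instead of backtracking.
    hits, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)  # a full match ends at position i
            k = fail[k - 1]
    return hits

# EcoRI restriction site (GAATTC) in a made-up DNA fragment
print(kmp_search("GAATTC", "TTGAATTCGGAATTCAA"))  # -> [2, 9]
```

The failure function is what keeps the scan linear: after a mismatch, the search resumes at the longest prefix of the pattern already known to match, so no text character is examined more than a constant number of times.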
However, sequence problems in biology usually involve approximate matching. Sites where a protein can bind to DNA, for example, are usually quite inexact, in the sense that there may be a few distinct positions where certain bases are strongly preferred but no unique binding sequence. Dynamic programming (DP) is a useful method for many problems of approximate pattern matching (Waterman, 1995). Usually running in quadratic time, various DP algorithms can find the approximate location of patterns in texts, the best relationship between two or more sequences (the alignment problem), and the best overlap between two sequences. These algorithms are indispensable for sequence assembly, the process of inferring a genome sequence from an oversampled set of short, randomly located sequences. One alignment problem is to find the best relationship between the letters of two entire sequences, and another is to find segments that match well. The latter problem, usually called local alignment, is most often used in database searches, where one wants to find domains of one sequence that are similar to domains of another sequence by common ancestry—that is, domains that are homologous. Because the databases of known sequences have become very large, DP is too slow for full searches. Various heuristic methods, such as BLAST (Altschul et al., 1990), have been applied to this problem and can be fairly effective, but greater sensitivity at detecting distant homologies is needed.
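The local-alignment recurrence can be conveyed by a minimal Smith-Waterman sketch. The scoring parameters here (match +2, mismatch −1, linear gap −1) are illustrative, and only the best score, not the traceback, is returned.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b (Smith-Waterman).
    Quadratic time and space; linear gap penalty; scores are illustrative."""
    M = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(0,                     # restart: local alignment may begin anywhere
                          M[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          M[i - 1][j] + gap,     # gap in b
                          M[i][j - 1] + gap)     # gap in a
            best = max(best, M[i][j])
    return best

print(smith_waterman("GATTACA", "GATTACA"))  # -> 14
```

The max-with-zero is the only change from global (Needleman-Wunsch) alignment, but it is what allows the best-matching segments, rather than whole sequences, to be found.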
Critical to the assessment of similarity between sequences is a model for evolutionary processes. Before protein and DNA sequences became available, the information used to classify an organism and infer its history consisted of its observable properties (e.g., wings or gills) and fossil evidence. Once it was realized that DNA and protein sequences contain traces of their history, new avenues to studying evolution—sometimes called paleogenetics—were opened up (Zuckerkandl and Pauling, 1965). From alignments of clearly homologous sequences, it was possible to determine empirically the rates of residue substitution. Those rates allow optimal alignments to be determined over much larger evolutionary distances and inferences of homology to be much more widely applied. From those alignments it is then possible to determine the evolutionary relationships of molecules, and of the species that contain them, at a much higher resolution than was previously possible.
This sort of information is commonly represented by phylogenetic trees (Felsenstein, 2004). The structure of these trees is generally inferred by one of three approaches: parsimony, distance matrices, or likelihood. Parsimony uses the principle of “least evolution” to estimate which sequences are most closely related. Finding the most parsimonious tree is an NP-hard problem, but many heuristic methods have been devised to approach it. When organisms are sufficiently closely related, parsimony is a reasonable model; otherwise, multiple changes may have occurred at the same position, and parsimony can be quite misleading. For the distance matrix approach, a distance is defined between each pair of sequences based, for instance, on pairwise sequence alignment scores from DP or on a position-by-position score relating a pair of sequences sampled from a full multiple alignment. Can a tree generated this way have the property that the distance between any two sequences equals the sum of the distances along the branches that connect them? There is some elegant work on this question, including the celebrated four-point condition, but in practice a distance matrix is almost never exactly additive. Finally, likelihood methods assume a stochastic model for the evolution of the positions along the branches of the tree. Likelihoods for two trees can be compared using classical likelihood ratios, and modern Bayesian methods, implemented with techniques such as simulated annealing and Markov chain Monte Carlo (MCMC), are also employed. Work in this area is quite active, and the likelihood approach is considered by many to be the method of choice.
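The additivity question has a clean characterization: a distance matrix is realizable as path lengths on a tree exactly when every quartet of taxa satisfies the four-point condition, i.e., the two largest of the three pairwise-sum combinations are equal. A sketch, with distances invented to come from a small tree:

```python
from itertools import combinations

def is_additive(d, taxa):
    """Check the four-point condition for a symmetric distance function d,
    given as a dict keyed by frozenset pairs of taxon names."""
    def dist(x, y):
        return d[frozenset((x, y))]
    for i, j, k, l in combinations(taxa, 4):
        sums = sorted([dist(i, j) + dist(k, l),
                       dist(i, k) + dist(j, l),
                       dist(i, l) + dist(j, k)])
        if sums[1] != sums[2]:   # the two largest sums must be equal
            return False
    return True

# Distances realized by a hypothetical tree: A and B join at one internal node,
# C and D at another (leaf branches 1, 2, 3, 4; internal edge 2).
d = {frozenset(p): v for p, v in {
    ("A", "B"): 3, ("A", "C"): 6, ("A", "D"): 7,
    ("B", "C"): 7, ("B", "D"): 8, ("C", "D"): 7}.items()}
print(is_additive(d, "ABCD"))  # -> True
```

Perturbing any one entry (as alignment noise inevitably does) generically destroys additivity, which is why practical distance methods such as neighbor joining only fit a tree approximately.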
All of these methods have significant limitations. The number of possible trees is unmanageable for any reasonable number of sequences. The methods also depend on computed alignments of sequences, which contain uncertainties that are not fully accounted for in the tree-building step. Finally, simplifying assumptions about the independence of positions and the uniformity of substitution rates limit the resolution that can be obtained.
Probabilistic models, such as hidden Markov models (HMMs), have recently made significant contributions to biological sequence analysis (Durbin et al., 1998). In protein comparisons, these approaches allow for position-specific variability in substitution scores. In modeling interacting domains of DNA and RNA, they allow for a more biophysical treatment of the interaction. And in ab initio prediction of gene sequences, they can better capture the statistical characteristics of genes, including both content features and the signals that delineate boundaries between different segments. For protein-coding genes, such methods can be reasonably effective, but for genes that code for RNAs, the problem is much more difficult.
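How an HMM segments a sequence can be illustrated with a toy two-state model, a GC-rich “coding” state versus an AT-rich “noncoding” state, decoded with the Viterbi algorithm. All probabilities here are invented; real gene finders use far richer state structures.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for an observation sequence (log-space Viterbi)."""
    # V[t][s] = (log-probability of the best path ending in state s, that path)
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), [s])
          for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            prob, path = max(
                (V[-2][r][0] + math.log(trans_p[r][s]) + math.log(emit_p[s][o]),
                 V[-2][r][1] + [s])
                for r in states)
            V[-1][s] = (prob, path)
    return max(V[-1].values())[1]

# Toy two-state model (all numbers illustrative)
states = ("coding", "noncoding")
start = {"coding": 0.5, "noncoding": 0.5}
trans = {"coding":    {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
emit = {"coding":    {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}

print(viterbi("ATATGCGCGC", states, start, trans, emit))
```

The sticky transition probabilities (0.9 to stay in a state) are what make the decoded path switch states only where the compositional evidence accumulates, rather than at every base.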
Determining the three-dimensional structure of macromolecules from experimental data, such as x-ray diffraction patterns and nuclear magnetic resonance measurements, is a mathematically demanding task. A typical protein structure must be inferred from enormous amounts of information that indirectly reveal the relative locations of key atoms in a biomolecular structure. The mapping from structure to data is straightforward, but the inverse problem, going from data to structure, is quite complex. The task is made harder because the experimental data are usually still too sparse to make a unique inversion possible. Instead, models must be built to overcome the intrinsic experimental ambiguity. This is done, for instance, by forcing the models to conform to the best current knowledge of molecular forces and/or to adhere to a presumed structural component, such as a peptide backbone in protein chains. Even so, constructing these models leads to demanding optimization problems.
Because one-dimensional sequence data abound but three-dimensional structures are generally more useful, understanding the mapping from sequence to structure is a major goal of molecular biology. Currently, the most reliable methods for predicting the structure of a new protein are based on homology. If the new protein sequence can be inferred to be homologous to a protein of known structure, then it can also be reliably inferred that the structures of the two proteins are quite similar. In general, protein structures are more highly conserved through evolution than their sequences, so the challenge is to identify distantly related homologous proteins; progress in this direction is ongoing, as described in the last section and in Chapter 2. Despite decades of research, there has been only modest progress in predicting protein structures ab initio, from sequence alone. Experiments have shown that, at least for moderate-sized proteins, sequence information alone is sufficient to specify the molecule’s three-dimensional shape. Understanding, and then reproducing in the computer, this transformation of one-dimensional information into three-dimensional information has come to be known as the protein-folding problem. The development of the modeling and computing capability needed to calculate structure from sequence data would clarify many basic issues about protein structure and function. This capability would also be of practical use in the emerging field of protein engineering.
Considerable progress has been made on the protein-folding problem using, on the theoretical side, the statistical mechanics of energy landscapes and, on the experimental side, knowledge from protein engineering and physicochemical kinetics. Some progress can of course be attributed simply to the rapid increase in the power of large-scale computing. The theoretical advances come from understanding how to characterize the universal topological properties of the energy surfaces of molecules; this challenge is deeply connected to the problem of characterizing the computational difficulty of optimization problems. Since proteins fold on their own, and fairly rapidly given the enormous number of possible folding paths and the NP-completeness of the general optimization problem, the search performed by a folding protein must somehow be constrained. The energy landscapes of real proteins must be rather smooth, resembling a funnel leading to the three-dimensional folded structure, rather than the very rugged landscapes that could exist in principle. Rugged energy landscapes have multiple optima very far apart in structure, and this is a handicap to any search for the optimum.
Until recently, the energy functions used to simulate the folding process, built up from the interactions of fragments of protein, were too rugged to yield correctly folded structures. Instead, depending on the fine details of the optimization process, many unrelated structures were predicted to represent the global free-energy minimum. The funnel-landscape idea offers a mathematical tool for developing more accurate energy surfaces by using bioinformatic patterns in the database of known protein structures. Algorithms based on such hybrids of bioinformatics and physical energy functions are beginning to approach the accuracy of structure predictions based on homology. Nonetheless, innovative ideas are still needed before the folded structures of proteins can be predicted reliably.
Useful as the classical protein-folding problem has been in stimulating interest in the physical chemistry of biological macromolecules, it is important to recognize that its “solution,” at least across any broad sampling of protein structures, is unlikely. In cells, the self-assembly processes that build up biological structures from their components—including the folding of many proteins—occur in highly specialized environments. For example, it has become clear during the past 15 years that the folding of many proteins is facilitated by local environments provided by proteins known as chaperones.
To predict the structure of RNAs, efficient algorithms based on thermodynamic parameters and DP methods have been available for over 20 years. Yet the best predictions are not very accurate, especially for long RNAs. The difficulty of making accurate predictions is undoubtedly due to some combination of the following facts: There is an enormous number of possible structures with nearly equal predicted energies; available thermodynamic parameters have limited accuracies; many RNAs require cofactors, such as magnesium, in the medium in order to fold properly; and efficient algorithms eliminate some possible structures, such as pseudoknots, that may be necessary to obtain the correct folding. As with proteins, RNAs can be folded most accurately by using information from homologous RNAs with known structure. But in the case of RNAs, it is also very useful to have related RNAs that, presumably, fold into nearly identical structures even if none of the structures are known in advance. Because base-pairing is the strongest force determining RNA structures, finding base-pairing patterns that are consistent for all of the sample sequences can lead to correct structure prediction. Sankoff (1985) published an algorithm for determining the optimum combination of alignment and structure scores for a collection of RNA sequences. However, that algorithm is computationally too expensive to be useful for typical numbers and lengths of sequences, so heuristic approximations have been developed in recent years to solve this problem.
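The flavor of these DP methods can be conveyed by the Nussinov algorithm, a much-simplified teaching sketch that merely maximizes the number of nested base pairs rather than minimizing free energy; like the thermodynamic methods above, it excludes pseudoknots by construction.

```python
def nussinov(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence (Nussinov DP).
    A teaching sketch: counts pairs instead of using real thermodynamics."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    M = [[0] * n for _ in range(n)]
    # Fill by increasing span; spans up to min_loop stay 0 (minimum hairpin loop).
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = M[i + 1][j]                           # i left unpaired
            if (seq[i], seq[j]) in pairs:
                best = max(best, M[i + 1][j - 1] + 1)    # i pairs with j
            for k in range(i + 1, j):                    # bifurcation into two substructures
                best = max(best, M[i][k] + M[k + 1][j])
            M[i][j] = best
    return M[0][n - 1]

print(nussinov("GGGAAAUCCC"))  # -> 3 (a hairpin stem of three pairs)
```

Replacing the pair count with stacking and loop energies, and adding traceback, yields the minimum-free-energy algorithms described above; the cubic-time DP skeleton is the same.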
Many experiments show that biomolecules do not exist as the single structures envisioned in the simpler forms of structure reconstruction. Instead, they are present in the body as an ensemble of structures, thermally populating a complex energy landscape. Some reconstructions try to take this ensemble character into account, but it is difficult to do so, and there is no mathematical understanding of the limits of application for these algorithms. The problem is no longer one of carrying out a simple optimization but one of achieving a statistically reliable sampling of structures that conforms to our knowledge of molecular forces. Understanding the dynamics of molecules is not simply an exercise in modeling random fluctuations, because in many cases the dynamics of proteins are essential to their function. It is only by allowing some transformations to occur quickly and other thermodynamically possible transformations to occur slowly that a cell can control its behavior.
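Sampling an ensemble rather than optimizing a single structure is usually done with Markov chain Monte Carlo. A minimal Metropolis sketch over three hypothetical discrete conformations (real samplers move in a continuous conformation space; the states and energies here are invented) shows how Boltzmann populations emerge:

```python
import math
import random

def metropolis_sample(energy, states, beta=1.0, steps=20000, seed=0):
    """Sample states with Boltzmann weights exp(-beta * E) via the Metropolis rule.
    energy maps each state to its energy; the proposal is symmetric."""
    rng = random.Random(seed)
    counts = {s: 0 for s in states}
    current = states[0]
    for _ in range(steps):
        proposal = rng.choice(states)                 # symmetric proposal
        dE = energy[proposal] - energy[current]
        if dE <= 0 or rng.random() < math.exp(-beta * dE):
            current = proposal                        # accept downhill always, uphill sometimes
        counts[current] += 1
    return {s: c / steps for s, c in counts.items()}

# Three hypothetical conformations with energies in units of kT
freq = metropolis_sample({"native": 0.0, "inter": 1.0, "unfolded": 3.0},
                         ["native", "inter", "unfolded"])
print(freq)
```

At equilibrium the visit frequencies approach exp(−E/kT)/Z (here roughly 0.70, 0.26, and 0.04), so the chain reports relative populations, not just the single lowest-energy structure.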
Computer simulation has made much progress in illustrating the motions of proteins, yet there are several problems in trying to simulate completely from molecular dynamics how biological molecules function. The first challenge is timescale. While the atoms in proteins carry out the bulk of their motions in less than a picosecond, the fastest characteristic response time for a cell is in the millisecond range. Evolution has tuned and selected molecular properties to deliver results on the latter timescale. Thus, the timescales of importance range over nine orders of magnitude, and one cannot argue, as is done in computational fluid dynamics, that short-timescale behavior is important only at small spatial scales and long-timescale behavior is important only at large spatial scales. The reason for the timescale gap is that functionally important motions are rare: A movie of the atoms in a biomolecule could be mostly a great bore, interrupted by a few dramatic moments. Overcoming the timescale problem will involve understanding the structure of the possible biomolecular motions and developing mathematical ways to survey all the possibilities in a statistically meaningful way.
The high degree of accuracy to which functional dynamics must be computed is also a challenge for direct simulation. For instance, there might be only a small energy difference between one binding state and another, yet the choice of binding state is a critical determinant of which of several signaling cascades takes place. Thus, calculations must be done with great precision. But because even the smallest biological macromolecules contain thousands of atoms, how can the energy be determined this accurately? Cells manage to achieve reliable specificity, so there must be a principle at work, much like the funnel-landscape idea for folding, that simplifies the problem. Discovering this principle is a challenge: Can it be learned with machine learning techniques once enough four-dimensional binding data have been collected? In his 1905 papers on Brownian motion, Einstein noted that at the size scale of macromolecular assemblies in the cell, energies of very small order still reign and determine where things go. As an example, it is now believed that the stability of the chromatin packaging of DNA derives from attractive electrostatic forces between ions of like sign. Understanding the connection of very small forces to the macromolecular interactions occurring within the cell remains a grand computational challenge.
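The precision problem can be made concrete with the Boltzmann factor: an energy gap well below the typical error of molecular force fields already produces a strongly skewed population between two binding states.

```python
import math

def boltzmann_ratio(dE_kcal, T=298.0):
    """Population ratio of two states separated by dE (kcal/mol) at temperature T (K)."""
    R = 1.987e-3  # gas constant in kcal/(mol*K)
    return math.exp(-dE_kcal / (R * T))

# A 1 kcal/mol gap -- smaller than typical force-field errors --
# already biases the populations by a factor of about five.
print(round(boltzmann_ratio(1.0), 2))  # -> 0.18
```

Since RT is about 0.6 kcal/mol at room temperature, an energy model must resolve differences of a fraction of a kcal/mol, across thousands of interacting atoms, to predict which state dominates.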
Interactions between macromolecules play many essential roles. They occur in all combinations, with protein-protein and protein-DNA/RNA being the most studied. Complexes of multiple proteins are important in metabolic reactions where reactants can pass between enzymes to increase efficiency and reduce concentrations of intermediates. They are important in large intracellular structures, such as various types of filaments that provide both rigidity and directed motion within and between cells. Protein-protein interactions are essential in various signal transduction pathways where proteins modify one another to alter their properties. Cascades of such protein modifications are at the heart of processes that transfer information and signals between cellular components, and they allow cells to respond to their environment by altering their behavior.
Experimental approaches to determining the interactions between proteins have greatly expanded in the last few years, and there are now considerable data about specific interactions. However, these experimental methods tend to produce noisy data, and it remains a challenge to extract the true interactions. Some computational methods, such as correlated occurrence in phylogenetic comparisons, have proven useful in detecting interacting proteins. While not currently feasible, computational methods to predict protein interactions based on compatible surfaces—essentially predicting the docking of large molecules—would be a major advance. One practical application of progress in this area would be the design of small-molecule drugs that interfere with specific protein-protein interactions.
Protein interactions with DNA and RNA are the primary mechanisms for controlling gene expression. In specific cells under specific conditions, only a subset of genes is expressed at any given time. Based on environmental signals, a fraction of the regulatory proteins will be expressed and active, and their sequence-specific interactions with regions of DNA will determine the set of genes expressed during the next interval. Statistical methods for predicting DNA target sites for specific proteins, given previous examples of similar sites, can be useful for discovering new regulatory interactions. What is needed is a recognition code that maps from the protein sequence (utilizing the known structures of the transcription factor families) to a pattern that describes the family of DNA binding sites. Statistical methods have been applied to this problem, but current performance is far from adequate. Combinations of statistical methods and biophysical principles appear promising but are still in their infancy (Benos et al., 2002).
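The standard statistical model for such binding-site families is the position weight matrix: log-odds scores built from aligned example sites. A sketch, with invented sites for an imaginary factor and a uniform 0.25 background assumed for each base:

```python
import math

def pwm_from_sites(sites, pseudocount=1.0):
    """Log-odds position weight matrix from aligned example binding sites
    (uniform background of 0.25 per base assumed; pseudocounts avoid log 0)."""
    n = len(sites[0])
    pwm = []
    for i in range(n):
        col = [s[i] for s in sites]
        total = len(col) + 4 * pseudocount
        pwm.append({b: math.log2((col.count(b) + pseudocount) / total / 0.25)
                    for b in "ACGT"})
    return pwm

def score(pwm, seq):
    """Additive log-odds score of a candidate site (assumes independent positions)."""
    return sum(col[b] for col, b in zip(pwm, seq))

# Hypothetical aligned sites for an imaginary transcription factor
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT", "TATAAT"]
pwm = pwm_from_sites(sites)
print(score(pwm, "TATAAT") > score(pwm, "GCGCGC"))  # -> True: consensus scores higher
```

The additivity across positions is exactly the independence assumption criticized in the text; the biophysically motivated refinements model interactions between positions that this matrix cannot capture.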
Many of the interactions between macromolecules define networks, both metabolic and regulatory. Simply knowing the components of the networks and the connections between them is not sufficient to understand or model their behavior. The levels of different components of the networks can vary over space and time and are often small enough that discrete modeling is important. In regulatory networks, each of the standard binary logic gates can be implemented by combinations of small numbers of proteins interacting with DNA; in other cases, continuous output values, sometimes with high gain, can be achieved by a similarly small number of components. Even in the best-studied cells, the network of functional interactions is only now being mapped out at the scale needed. A mathematical description that adequately models the diverse components at suitable resolution while also capturing the stochastic character of intermolecular partnerships and the complexity of individual biomolecular dynamics remains a distant goal, but one that the community of mathematical and physical scientists is beginning to tackle.
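The logic-gate point can be illustrated with a toy synchronous Boolean network; the genes, the wiring, and the AND rule below are all invented for illustration, not drawn from a real circuit.

```python
def step(state, rules):
    """One synchronous update of a Boolean regulatory network.
    state maps gene -> on/off; rules maps gene -> function of the current state."""
    return {gene: rule(state) for gene, rule in rules.items()}

# Hypothetical three-gene circuit: geneC requires both activators (an AND gate,
# e.g., two transcription factors that must co-occupy a promoter)
rules = {
    "actA":  lambda s: s["actA"],                 # input held at its current value
    "actB":  lambda s: s["actB"],                 # input held at its current value
    "geneC": lambda s: s["actA"] and s["actB"],   # AND logic
}

# Truth table for the gate
for a in (False, True):
    for b in (False, True):
        out = step({"actA": a, "actB": b, "geneC": False}, rules)
        print(a, b, "->", out["geneC"])
```

Iterating `step` traces the network's trajectory to a fixed point or cycle; the discrete, small-copy-number character the text emphasizes is what makes such Boolean abstractions, and their stochastic refinements, useful starting points.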
There are many daunting mathematical challenges to understanding and modeling the molecular aspects of biology. Technological advances continue to increase the quantity of data that can be obtained and, more importantly, to enable the collection of data that were not previously available. This flood of data promises enormous advances in our understanding of biology in the coming decades, but that promise depends on the coincident development of mathematical approaches and applications.
Sequence databases will continue to grow at a rapid rate (currently doubling about every 15 months), increasing our ability to identify homology relationships and improving the evolutionary models on which those identifications depend. Increasingly comprehensive data will allow us to abandon some of the simplifying assumptions that were previously required and to develop more realistic models. The great increase in sequence data will also step up the demand for improvements in the mapping from sequence to structure and function.
Improved mathematics is also required for advances in the biophysical modeling of molecules and cells. Predictions of structures and interactions based on physical principles should significantly complement the results from evolutionary studies and are necessary for advances in protein design and other areas of synthetic biology and biological engineering. Improvements in modeling the dynamics of biological systems that take into account the range of important scales of space, time, and number of components—for example, whether discrete or continuous models are needed—are necessary to develop accurate and predictive models of cellular behavior.
Mathematical models at all scales are important in biology and will become increasingly so. One of the critical areas of development will be the interactions between models and experiments. Every scientist develops models from data and uses those models to design the next experiments. But as the data become increasingly numerous and complex and the models ever more mathematical, it will become even more necessary to rely on mathematical and computational methods in the design of experiments. Determining the set of plausible models that explain the current information and then identifying the most informative experiments to distinguish between those models is a challenge even now and will be an increasingly central challenge in the future. For example, expression data can provide information about sets of plausible regulatory networks but are unlikely to define a single best network. By comparing the different implications of plausible networks, it should be possible to design experiments that best distinguish between them (Ideker et al., 2000).
The molecular level deals with the most basic components of the cell. Key molecules include the heritable material that permits the propagation of biological properties from generation to generation, while also displaying the variation that underlies evolution. Interacting sets of biological molecules are required for the behavior of cells and organisms, but they are not sufficient to predict this behavior. Cells, which will be discussed in the next chapter, provide a first example of the large gulf between successive levels of biological organization.
REFERENCES

Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215(3): 403-410.
Benos, P.V., A.S. Lapedes, and G.D. Stormo. 2002. Is there a code for protein-DNA recognition? Probab(ilistical)ly. Bioessays 24(5): 466-475.
Durbin, R., S.R. Eddy, A. Krogh, and G. Mitchison. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, U.K.: Cambridge University Press.
Felsenstein, J. 2004. Inferring Phylogenies. Sunderland, Mass.: Sinauer Associates, Inc.
Gusfield, D. 1997. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge, U.K.: Cambridge University Press.
Ideker, T.E., V. Thorsson, and R.M. Karp. 2000. Discovery of regulatory interactions through perturbation: Inference and experimental design. Pp. 305-316 in Biocomputing 2000: Proceedings of the Pacific Symposium, Vol. 5. Singapore: World Scientific.
Pevzner, P. 2000. Computational Molecular Biology: An Algorithmic Approach. Cambridge, Mass.: MIT Press.
Sankoff, D. 1985. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math. 45: 810-825.
Waterman, M.S. 1995. Introduction to Computational Biology: Maps, Sequences and Genomes. London, U.K.: Chapman and Hall.
Zuckerkandl, E., and L. Pauling. 1965. Molecules as documents of evolutionary history. J. Theor. Biol. 8: 357-366.