3
Understanding Molecules

INTRODUCTION

Cells are composed of molecules, and their properties are largely determined by the chemical reactions the molecules perform and by the structures the chemicals form. The molecules range from the small and ubiquitous, such as water and various salts and metal ions, to the very large molecules that are specific to living systems, such as the genetic material of DNA. In between are a wide variety of organic molecules necessary for life, including sugars and lipids, vitamins and other enzyme cofactors, and the nucleic acid and amino acid monomers required for DNA, RNA, and protein synthesis. The molecules interact in many ways and are capable of recognizing one another; some are active in the form of larger complexes. These dynamic interactions are the essence of the processes in living cells. Cells in turn have mechanisms for accurately controlling their composition, for obtaining from the environment molecules they can use, and for synthesizing those that they need. They can modify their composition depending on their environment and their fate. At every cell division, they must essentially duplicate every component, except that occasionally cells undergo asymmetric division so that the two daughter cells can take on different roles.

For the last 50 years, the field of molecular biology has examined the hereditary and information-carrying molecules of DNA, RNA, and proteins. These molecules will be the focus of this chapter. Determining the double-stranded structure of DNA made clear how genetic information is replicated and passed from generation to generation. Since then, much



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
Terms of Use and Privacy Statement



Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 38
3 Understanding Molecules INTRODUCTION Cells are composed of molecules, and their properties are largely de- termined by the chemical reactions the molecules perform and by the structures the chemicals form. The molecules range from the small and ubiquitous, such as water and various salts and metal ions, to the very large molecules that are specific to living systems, such as the genetic material of DNA. In between are a wide variety of organic molecules nec- essary for life, including sugars and lipids, vitamins and other enzyme cofactors, and the nucleic acid and amino acid monomers required for DNA, RNA, and protein synthesis. The molecules interact in many ways and are capable of recognizing one another; some are active in the form of larger complexes. These dynamic interactions are the essence of the pro- cesses in living cells. Cells in turn have mechanisms for accurately con- trolling their composition, for obtaining from the environment molecules they can use, and for synthesizing those that they need. They can modify their composition depending on their environment and their fate. At ev- ery cell division, they must essentially duplicate every component, except that occasionally cells undergo asymmetric division so that the two daughter cells can take on different roles. For the last 50 years, the field of molecular biology has examined the hereditary and information-carrying molecules of DNA, RNA, and pro- teins. These molecules will be the focus of this chapter. Determining the double-stranded structure of DNA made clear how genetic information is replicated and passed from generation to generation. Since then, much 38

OCR for page 38
39 UNDERSTANDING MOLECULES effort has been devoted to constructing a flow chart for the various forms of molecular information. One of the fruits of this labor has been a sense that many of the main information highways are known—for example, the central dogma that describes the irreversible flow of information from DNA to RNA to proteins. The fact that reality is more complex and that surprises continually emerge, such as the recent appreciation for the vari- ety of roles played by RNA, shows that this enormously successful field still has a long road ahead before it will be “solved” in any comprehen- sive sense. One of the most successful achievements of molecular biology has been a nearly complete catalogue of underlying DNA codes for a diverse set of information-bearing macromolecules: The genomes of humans and many microbes, plants, and other animals are known with considerable completeness and accuracy. There has thus emerged the feeling that bio- logical explanation is no longer primarily to be sought in finding new molecular actors but in understanding their individual functions and the patterns of organization and interaction that collectively determine the functions of the cell. THE MATHEMATICS-BIOLOGY CONNECTION Molecular biology has always relied heavily on mathematics. From the analysis of sequences to techniques for determining the three-dimen- sional structures of molecules to studies of the dynamics of entities rang- ing from individual molecules up to entire networks, mathematical tech- niques and computational algorithms are critical. Because of the rapid advances in the technology for DNA sequenc- ing, DNA sequences are now easily obtained, and protein sequences can be inferred with reasonably high accuracy and completeness. Thus, we now have an abundance of those sequences for analysis. DNA, RNA, and proteins are all linear polymers, or strings, made from a small alphabet of residues, 4 for DNA and RNA and 20 for proteins. The specific proper- ties of any molecule, or the functions it serves, are determined by its se- quence and its structure (in the appropriate context), though of course the structure is a result of the sequence and the molecule’s environment. One of the mathematical challenges of biology, then, is determining the mapping from sequence space to function space. The set of linear se- quences over a small alphabet leads quite naturally to the concept of se- quence space and the universe of all possible sequences. The concept of function space is a little harder to imagine, but certainly we could catego- rize all of the functions we know and consider them to be a partial set of all possible functions. For proteins, and for some RNAs, function is critically dependent on

OCR for page 38
40 MATHEMATICS AND 21ST CENTURY BIOLOGY structure, and knowledge of structure can provide information about function. Hence, the mapping from sequence space to structure space is part of the challenge, and may be part of the solution, of creating a map from sequence space to function space. The structure of a protein pro- vides strong clues about its biochemical function—for example, the mechanism for action by an enzyme—but at the moment, there have been only a few successes in predicting biological function from sequence. The structures of these macromolecules are also important for other research purposes—for example, they are the starting point for predicting bio- chemical action or for modeling the dynamics of the macromolecules, for suggesting ways to inhibit the action of undesired proteins, for predict- ing potential chemical inhibitors or activators of a given protein, or for altering a protein’s functionality through its environment or through reengineering its sequence and, consequently, its structure. Therefore, developing the ability to map from sequence space to structure space is a critical challenge that, if met, would have a significant impact on all bio- logical sciences and on our understanding of life. Currently, inferences about structure and function rely on the simple assumption that se- quences that are “close” in sequence space (using metrics determined from studies of evolution) are likely to map to nearby points in structure and function space. That is generally true, but there are complications. Short stretches of a protein can be exceptions to this general situation, and larger proteins are composites of folded segments or domains. In the absence of experimental determination through an x-ray crystallographic structure determination, we do not know in any detail how to ascertain where the boundaries are for domains. We also do not know which se- quence differences are most critical or might be most indicative of excep- tions or might most effectively predict the biological function. Even a catalog of all the components of a cell (a complete “parts list”) detailing not only their sequences but also their structure and function would not really explain the properties of that cell, because the system is far from equilibrium and in a very dynamic state. The properties of mol- ecules often depend on their dynamics, from the catalytic activities of en- zymes to the assembly of multicomponent structures, and many of a cell’s molecules need to be transported to specific locations within or outside the cell in order to perform their functions. Cells sense their environment and respond to various stimuli by sending signals throughout the cell and to neighboring cells, modifying their behavior. Metabolic networks are subject to feedback regulation and other kinds of control, and the expres- sion of specific genes is controlled by networks of regulatory factors and their interactions with each other and with cellular signals. Because many cellular processes are due to the actions of a small number of molecules, stochastic fluctuations cannot be ignored. In general, then, understanding

OCR for page 38
41 UNDERSTANDING MOLECULES the properties of cells depends on modeling the dynamics of the indi- vidual molecules and their interactions. The dynamics has both spatial and temporal components and is largely nonlinear and discrete. Math- ematical analysis of dynamical systems has been essential for our current understanding, but the field appears to be on the verge of a major expan- sion. New technologies offer the opportunity to greatly increase what we can measure about cellular components. Using those data to inform dy- namical models of the cell is critical to advancing our understanding of biology and is likely to tax existing mathematical techniques, opening up new areas of mathematical and biological research. AREAS OF MATHEMATICAL APPLICATIONS FOR MOLECULES Sequence Analysis The central role of sequences in mathematics is unquestioned. The discovery that DNA, RNA, and proteins are all composed of linear se- quences provided a strong link between active areas of research in both molecular biology and the mathematical sciences. A well-developed set of results for sequence analysis has been developed in computer science, some of which are useful for biological problems (Gusfield, 1997; Waterman, 1995). The most classical problem of string searches is to find exact occurrences of a sequence (usually short, which we will call the “pat- tern”) in another sequence (usually long, called the “text”). The Boyer- Moore and Knuth-Morris-Pratt algorithms are a sophisticated pair of al- gorithms that have been developed to alleviate this problem (Pevzner, 2000). These techniques make it possible to find exact matches to the pat- tern in linear time (time that is proportional to the sum of pattern and text lengths) or better; the obvious naive method takes quadratic time (time proportional to the product of pattern and text lengths). Suffix trees also accomplish this task and, more importantly, are also useful for more gen- eral problems (Gusfield, 1997). Certain biological problems involve exact patterns, an important example being restriction sites (usually four to eight base pairs in length), where certain enzymes cut DNA molecules. However, it is usually the case that sequence problems in biology in- volve approximate matching. Sites where a protein can bind to DNA, for example, usually are very inexact, in the sense that there may be a few distinct positions where certain bases are strongly preferred but no unique binding site. Dynamic programming (DP) is a useful method for many problems of approximate pattern matching (Waterman, 1995). Usually taking quadratic time, various DP algorithms can find the approximate location of patterns in texts, the best relationship between two and more sequences (the alignment problem), and the best overlap between two

OCR for page 38
42 MATHEMATICS AND 21ST CENTURY BIOLOGY sequences. These algorithms are indispensable for sequence assembly, the process of inferring a genome sequence from an oversampled set of short, randomly located sequences. One alignment problem is to find the best relationship between the letters of two sequences, and another is to find the best segments that match well. The latter problem, usually called local alignment, is most often used in database searches, where one wants to find domains of one sequence that are similar to domains of another se- quence by common ancestry—that is, domains that are homologous. Be- cause the databases of known sequences have become very large, DP is too slow for the full searches. Various heuristic methods, such as BLAST (Altschul et al., 1990), have been used for this problem and can be fairly effective, but greater sensitivity at detecting distant homologies is needed. Critical to the assessment of similarity between sequences is a model for evolutionary processes. Before protein and DNA sequences, the infor- mation used to classify an organism and infer its history was its observ- able properties (e.g., wings or gills) as well as fossil evidence. But once it was realized that DNA and protein sequences contain traces of their his- tory, new avenues to studying evolution—paleogenetics—were opened up (Zuckerkandl and Pauling, 1965). From alignments of clearly homolo- gous sequences, it was possible to determine empirically the rates of resi- due substitution. Those rates allow optimum alignments to be determined over much larger evolutionary distances and inferences of homology to be much more widely applied. From those alignments it is then possible to determine the evolutionary relationship of molecules, and the species that contain them, at a much higher resolution than previously possible. This sort of information is commonly represented by phylogenetic trees (Felsenstein, 2004). The structure of these trees is generally inferred by one of three approaches: parsimony, distance matrices, or likelihood. Parsimony uses the principle of “least evolution” to estimate which se- quences are most closely related. Finding the most parsimonious tree is an NP-hard problem, but many heuristic methods have been devised to approach it. When organisms are sufficiently closely related, parsimony is a reasonable model. Otherwise, multiple changes may have occurred, and it can be quite misleading. For the distance matrix approach, a dis- tance is defined between each pair of sequences based, for instance, on pairwise sequence alignment scores from DP or from a position-by-posi- tion score relating a pair of sequences sampled from a full multiple align- ment. Can a tree generated this way have the property that the distance between any two sequences is the same as the sum of distances through the vertices that connect those sequences? There is some elegant work on this problem, which includes the celebrated four-point condition, but it is almost never the case that a distance matrix is additive. Finally, likelihood models assume a stochastic model for evolution of the positions along

OCR for page 38
43 UNDERSTANDING MOLECULES branches of the tree. Likelihoods for two trees can be compared using classical likelihood ratios. Modern Bayesian methods, including simulated annealing, and Markov chain Monte Carlo methods (MCMCs) are em- ployed. Work in this area is quite active, and the approach is considered by many to be the method of choice. All of these methods have significant limitations. The number of pos- sible trees is unmanageable for any reasonable number of sequences. Also, the methods all depend on the calculated alignment of sequences that contain uncertainties that are not all accounted for in the tree-building methods. Finally, simplifying assumptions about the independence of the positions and the uniformity of substitution rates limit the resolution that can be obtained. Probabilistic models, such as hidden Markov models (HMMs), have recently made significant contributions to biological sequence analysis (Durbin et al., 1998). In protein comparisons, these approaches allow for position-specific variability in substitution scores. In modeling interact- ing domains of DNA and RNA, they allow for a more biophysical treat- ment of the interaction. And in ab initio predictions of gene sequences, they can better capture the statistical characteristics of genes, including both content features and signals that delineate boundaries between dif- ferent segments. For protein-coding genes, such methods can be reason- ably effective, but for genes that code for RNAs, the problem is much more difficult and challenging. Structure Analysis Determining the three-dimensional structure of macromolecules from experimental data, such as x-ray diffraction patterns and nuclear mag- netic resonance measurements, is a mathematically demanding task. A typical protein structure must be inferred from enormous amounts of in- formation that indirectly reveal the relative locations of key atoms in a biomolecular structure. The mapping from structure to data is straight- forward, but the inverse problem, going from data to structure, is quite complex. The task is made harder because the experimental data are usu- ally still too sparse to make a unique inversion possible. Instead, models must be built to overcome the intrinsic experimental ambiguity. This is done, for instance, by forcing the models to conform to the best current knowledge of molecular forces and/or to adhere to a presumed structural component, such as a peptide backbone in protein chains. Even so, con- structing these models leads to demanding optimization problems. Because one-dimensional sequence data abound but three-dimen- sional structures are generally more useful, understanding the mapping from sequence to structure is a major goal of molecular biology. Cur-

OCR for page 38
44 MATHEMATICS AND 21ST CENTURY BIOLOGY rently, the most reliable methods for predicting the structure of a new protein are based on homology. If the new protein sequence can be in- ferred to be homologous to a protein with known structure, then it can also be reliably inferred that the structures of the two proteins are quite similar. In general, protein structures are more highly conserved through evolution than their sequences, so the challenge is to identify distantly related homologous proteins, and progress is ongoing in this direction, as described in the last section and Chapter 2. Despite decades of re- search, there has been only modest progress in predicting protein struc- tures from their sequence alone, ab initio. Experiments have shown that at least for moderate-sized proteins, sequence information alone is suffi- cient to specify the molecule’s three-dimensional shape. Understanding and then reproducing in the computer this transformation of one-dimen- sional information to three-dimensional information has come to be known as the protein-folding problem. The development of the model- ing and computing capability needed to enable calculating structure from sequence data would clarify many basic issues about protein structure and function. This capability would also be of practical use in the emerg- ing field of protein engineering. Considerable progress has been made on the protein-folding problem using, on the theoretical side, the statistical mechanics of energy land- scapes and, on the experimental side, knowledge of protein engineering and physicochemical kinetics. Of course some progress can simply be at- tributed to the rapid increase in the power of large-scale computing. The theoretical advances come from understanding how to characterize the universal topological properties of the energy surfaces of molecules; this challenge is deeply connected to the problem of characterizing the com- putational difficulty of optimization problems. Since the proteins fold on their own, and fairly rapidly given the enormous number of possible fold- ing paths and the NP-completeness of the optimization problem, protein folding must somehow be constrained. The energy landscapes of real pro- teins must be rather smooth, resembling a funnel leading to the three- dimensional folded structure rather than the very rugged energy land- scapes that could be present in principle. The rugged energy landscapes have globally different optima very far apart in structure, and this is a handicap to any search for optima. Until recently, the energy functions used to simulate the folding pro- cess, built up from the interactions of fragments of protein, were too rug- ged to yield correctly folded structures. Instead, depending on the fine details of the optimization process, many unrelated structures were pre- dicted to represent the global free energy minimum. The funnel-landscape idea offers a mathematical tool for developing more accurate energy sur- faces by using bioinformatic patterns in the known database of protein

OCR for page 38
45 UNDERSTANDING MOLECULES structures. Algorithms based on such hybrids of bioinformatics and physi- cal energy functions begin to approach the accuracy of structure predic- tions based on homology. Nonetheless, innovative ideas are still needed to enable predicting the folded structure of proteins. Useful as the classical protein-folding problem has been in stimulat- ing interest in the physical chemistry of biological macromolecules, it is important to recognize that its “solution,” at least across any broad sam- pling of protein structures, is unlikely. In cells, the self-assembly pro- cesses that build up biological structures from their components—includ- ing the folding of many proteins—occur in highly specialized environments. For example, it has become clear during the past 15 years that the folding of many proteins is facilitated by local environments pro- vided by proteins known as chaperones. To predict the structure of RNAs, efficient algorithms based on ther- modynamic parameters and DP methods have been available for over 20 years. Yet the best predictions are not very accurate, especially for long RNAs. The difficulty of making accurate predictions is undoubtedly due to some combination of the following facts: There is an enormous num- ber of possible structures with nearly equal predicted energies; available thermodynamic parameters have limited accuracies; many RNAs require cofactors, such as magnesium, in the medium in order to fold properly; and efficient algorithms eliminate some possible structures, such as pseudoknots, that may be necessary to obtain the correct folding. As with proteins, RNAs can be folded most accurately by using information from homologous RNAs with known structure. But in the case of RNAs, it is also very useful to have related RNAs that, presumably, fold into nearly identical structures even if none of the structures are known in advance. Because base-pairing is the strongest force determining RNA structures, finding base-pairing patterns that are consistent for all of the sample se- quences can lead to correct structure prediction. Sankoff (1985) published an algorithm for determining the optimum combination of alignment and structure scores for a collection of RNA sequences. However, that algo- rithm is computationally too expensive to be useful for typical numbers and lengths of sequences, so heuristic approximations have been devel- oped in recent years to solve this problem. Dynamics Many experiments show that biomolecules do not exist as the single structures envisioned in the simpler forms of structure reconstruction. Instead, they are present in the body as an ensemble of structures, ther- mally populating a complex energy landscape. Some reconstructions try to take this ensemble character into account, but it is difficult to do so, and

OCR for page 38
46 MATHEMATICS AND 21ST CENTURY BIOLOGY there is no mathematical understanding of the limits of application for these algorithms. The problem is no longer one of carrying out a simple optimization but one of achieving a statistically reliable sampling of struc- tures that conforms to our knowledge of molecular forces. Understanding the dynamics of molecules is not simply an exercise in modeling random fluctuations, because in many cases the dynamics of proteins are essential to their function. It is only by allowing some transformations to occur quickly and other thermodynamically possible transformations to occur slowly that a cell can control its behavior. Computer simulation has made much progress in illustrating the motions of proteins, yet there are several problems in trying to simulate completely from molecular dynamics how biological molecules function. The first challenge is timescale. While the atoms in proteins carry out the bulk of their motions in less than a picosecond, the fastest characteristic response time for a cell is in the millisecond range. Evolution has tuned and selected molecular properties to deliver results on the latter timescale. Thus, the timescales of importance range over nine orders of magnitude, and one cannot argue, as is done in computational fluid dynamics, that short-timescale behavior is important only at small spatial scales and long- timescale behavior is important only at large spatial scales. The reason for the timescale gap is that functionally important motions are rare: A movie of the atoms in a biomolecule could be mostly a great bore, interrupted by a few dramatic moments. Overcoming the timescale problem will involve understanding the structure of the possible biomolecular motions and developing mathematical ways to survey all the possibilities in a statisti- cally meaningful way. The high degree of accuracy to which functional dynamics must be computed is also a challenge for direct simulation. For instance, there might be only a small amount of energy difference between one binding state and another, yet the choice of binding state is a critical determinant of which of several signaling cascades takes place. Thus, calculations must be done with great precision. However, because even the smallest biologi- cal macromolecules contain thousands of atoms, how can the energy be determined this accurately? Cells manage to achieve reliable specificity, so there must be a principle at work, much like the funnel-landscape idea for folding, that simplifies the problem. But discovering this principle is a challenge. Can it be learned after enough four-dimensional binding data have been determined using machine learning techniques? In his 1905 papers on Brownian motion, Einstein noted that at the large size scale of macromolecular assemblies in the cell, energies of very small order still reign and determine where things go. As an example, it is now believed that the stability of the chromatin packaging of DNA comes from attrac- tive electrostatic forces between ions of like sign. Understanding the con-

OCR for page 38
47 UNDERSTANDING MOLECULES nection of very small forces to the macromolecular interactions occurring within the cell remains a grand computational challenge. Interactions Interactions between macromolecules play many essential roles. They occur in all combinations, with protein-protein and protein-DNA/RNA being the most studied. Complexes of multiple proteins are important in metabolic reactions where reactants can pass between enzymes to increase efficiency and reduce concentrations of intermediates. They are impor- tant in large intracellular structures, such as various types of filaments that provide both rigidity and directed motion within and between cells. Protein-protein interactions are essential in various signal transduction pathways where proteins modify one another to alter their properties. Cascades of such protein modifications are at the heart of processes that transfer information and signals between cellular components, and they allow cells to respond to their environment by altering their behavior. Experimental approaches to determining the interactions between proteins have greatly expanded in the last few years, and there is now considerable data about specific interactions. However, these experimen- tal methods tend to produce noisy data, and it remains a challenge to extract the true information. Some computational methods have proven useful in detecting interacting proteins, such as correlated occurrence in phylogenetic comparisons. While not currently feasible, computational methods to predict protein interactions based on compatible surfaces— essentially predicting the docking of large molecules—would be a major advance. One practical application of progress in this area would be the design of small-molecule drugs that interfere with specific protein-pro- tein interactions. Protein interactions with DNA and RNA are the primary mechanisms for controlling gene expression. In specific cells under specific conditions, only a subset of genes is expressed at any given time. Based on environ- mental signals, a fraction of the regulatory proteins will be expressed and active. Their sequence-specific interactions with regions of DNA will de- termine the set of genes expressed during the next interval. Statistical methods for predicting DNA target sites for specific proteins, given previ- ous examples of similar sites, can be useful for discovering new regula- tory interactions. What is needed is a recognition code that maps from the protein sequence (and utilizes the known structures of the transcription factor families) to a pattern that describes the family of DNA binding sites. Statistical methods have been applied to this problem, but current perfor- mance is far from adequate. Combinations of statistical methods and bio-

OCR for page 38
48 MATHEMATICS AND 21ST CENTURY BIOLOGY physical principles appear promising but are just in their infancy (Benos et al., 2002). Many of the interactions between macromolecules define networks, both metabolic and regulatory. Simply knowing the components of the networks and the connections between them is not sufficient to under- stand or model their behavior. The levels of different components of the networks can vary over space and time and are often small enough that discrete modeling is important. In regulatory networks, each of the stan- dard binary logic gates can be implemented by combinations of small numbers of proteins interacting with DNA; in other cases, continuous output values, sometimes with high gain, can be achieved by a similarly small number of components. Even in the best-studied cells, the network of functional interactions is only now being mapped out at the scale needed. All mathematical descriptions that adequately model the diverse components at suitable resolution while also capturing the stochastic char- acter of intermolecular partnership and the complexity of individual biomolecular dynamics remain a distant goal but one that the community of mathematical and physical scientists is beginning to tackle. FUTURE DIRECTIONS There are many daunting mathematical challenges to understanding and modeling the molecular aspects of biology. Technological advances continue to increase the quantity of data that can be obtained and, more importantly, to enable the collection of data that were not previously avail- able. This flood of data promises enormous advances in our understand- ing of biology in the coming decades, but that promise depends on the coincident development of mathematical approaches and applications. Sequence databases will continue to grow at a rapid rate (currently doubling about every 15 months), increasing our ability to identify ho- mology relationships and improving the evolutionary models on which those identifications depend. Increasingly comprehensive data will allow us to abandon some of the simplifying assumptions that were previously required and to develop more realistic models. The great increase in se- quence data will also step up the demand for improvements in the map- ping from sequence to structure and function. Improved mathematics is also required for advances in the biophysi- cal modeling of molecules and cells. Predictions of structures and interac- tions based on physical principles should significantly complement the results from evolutionary studies and are necessary for advances in pro- tein design and other areas of synthetic biology and biological engineer- ing. Improvements in modeling the dynamics of biological systems that

OCR for page 38
49 UNDERSTANDING MOLECULES take into account the range of important scales of space, time, and num- ber of components—for example, whether discrete or continuous models are needed—are necessary to develop accurate and predictive models of cellular behavior. Mathematical models at all scales are important in biology and will become increasingly so. One of the critical areas of development will be the interactions between models and experiments. Every scientist devel- ops models from data and uses those models to design the next experi- ments. But as the data become increasingly numerous and complex and the models ever more mathematical, it will become even more necessary to rely on mathematical and computational methods in the design of ex- periments. Determining the set of plausible models that explain the cur- rent information and then identifying the most informative experiments to distinguish between those models is a challenge even now and will be an increasingly central challenge in the future. For example, expression data can provide information about sets of plausible regulatory networks but are unlikely to define a single best network. By comparing the differ- ent implications of plausible networks, it should be possible to design experiments that best distinguish between them (Ideker et al., 2000). The molecular level deals with the most basic components of the cell. Key molecules include the heritable material that permits the propagation of biological properties from generation to generation, while also display- ing the variation that underlies evolution. Interacting sets of biological molecules are required for the behavior of cells and organisms, but they are not sufficient to predict this behavior. Cells, which will be discussed in the next chapter, provide a first example of the large gulf between suc- cessive levels of biological organization. REFERENCES Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215(3): 403-410. Benos, P.V., A.S. Lapedes, and G.D. Stormo. 2002. Is there a code for protein-DNA recogni- tion? Probab(ilistical)ly. Bioessays 24(5): 466-475. Durbin, R., S.R. Eddy, A. Krogh, and G. Mitchison. 1998. Biological Sequence Analysis: Proba- bilistic Models of Proteins and Nucleic Acids. Cambridge, U.K.: Cambridge University Press. Felsenstein, J. 2004. Inferring Phylogenies. Sunderland, Mass.: Sinauer Associates, Inc. Gusfield, G. 1997. Algorithms on Strings, Trees and Sequences: Computer Science and Computa- tional Biology. Cambridge, U.K.: Cambridge University Press. Ideker, T.E., V. Thorsson, and R.M. Karp. 2000. Discovery of regulatory interactions through perturbation: Inference and experimental design. Pp. 305-316 in Biocomputing 2000: Pro- ceedings of the Pacific Symposium, Vol. 5. Singapore: World Scientific. Pevzner, P. 2000. Computational Molecular Biology: An Algorithmic Approach. Cambridge, Mass.: MIT Press.

OCR for page 38
50 MATHEMATICS AND 21ST CENTURY BIOLOGY Sankoff, D. 1985. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math. 45: 810-825. Waterman, M.S. 1995. Introduction to Computational Biology: Sequences, Maps and Genomes. London, U.K.: Chapman and Hall. Zuckerkandl, E., and L. Pauling. 1965. Molecules as documents of evolutionary history. J. Theor. Biol. 8: 357-366.