The Role of Theory in Advancing 21st-Century Biology: Catalyzing Transformative Research

7 What Is the Information That Defines and Sustains Life?

Advances in technology have dramatically increased the amount of biological data being collected. For example, DNA sequence data on the genes of many organisms and satellite imagery of many ecosystems are now available. While this proliferation of data presents exciting new opportunities, making good use of it also presents significant challenges. Increasingly powerful computing technology aids data analysis and enables such techniques as shotgun sequencing of whole genomes. However, in many cases using data to come to meaningful conclusions about life requires time-consuming and expensive work. The sharing of data sets between researchers provides opportunities to examine data collected for one purpose to make progress in a different area. For this type of sharing to be most productive, data sets need to be codified, well curated, and well maintained in an accessible format. While data sets are essentially collections of information, the role of information in biology goes far beyond the use of data sets. The refinement and application of theories of information to biology present a deep challenge and an opportunity for furthering our understanding of life. Existing theories of information borrowed from other fields can be difficult to apply to biology, a field in which context is so important, but the conceptual gain may be well worth the challenge.

WHAT IS INFORMATION?

The concept of information is used throughout biology. Biologists study how information is acquired, used, stored, and transferred in living things.
Many biological structures or processes can be thought of as carriers of information. From the sequence of a DNA molecule, to sounds, nerve impulses, signaling molecules, or chemical gradients, scientists find it useful to characterize biology in terms of information. From the critical discovery of the "genetic code" as the coupler between DNA sequence and protein synthesis, to the marvelous ability of bees to convey information about the location and quality of resources through dance, it is intuitively appealing to describe the processes and structures of biology in information terms. Throughout this report, there are numerous examples of the representation and transmission of information. Questions that naturally arise about information in biology include these: Is there a common way to think of the biological information in all of these representations? Is there a consistent and useful way to measure biological information so that it can be dealt with in quantitative descriptions of genetics, evolution, molecular processes, and communication between organisms? In common usage, the word "information" conveys many different notions. It is often used as a synonym for "data" or knowledge, and in most common language uses it is associated with written or spoken numbers or words. This connection is key to a more scientific use of the term, in that it suggests that information can be represented by numbers or letters or more generally by symbols of any form. Indeed, information must have a representation, whether it is as written symbols, bits in computers, or in macromolecules, cells, sounds, or electrical impulses. The informal use of informational terms is widespread in molecular biology. For example, molecular biology uses words that relate to transfer and processing of information as technical terms for biological processes.
The choice of words like code, translation, transcription, messenger, editing, and proofreading reflects how scientists think of these processes. When information is used as the focal concept for thinking about molecular biology, it highlights the sequence properties of the molecules under study, instead of their actual physiochemical forms (Godfrey-Smith, 2007). This prompts a focus on the abstract representational role of these molecules, rather than the nature of the physical processes (e.g., the biochemistry of the translational machinery) that are inevitably required to express the stored information in meaningful form. It is important to think carefully about information at many levels, both below the sequence level in molecular detail and above at higher levels of organization.

INFORMATION IN BIOLOGY

August Weismann appears to have been the first to explicitly use the notion of information transmission in genetics, in 1904, when he referred to the transmission of information in heredity (Weismann, 1904). The metaphor of information represented in written or electronic forms has become widespread, but is it more than a metaphor or an analogy? The concept can be formulated quantitatively for molecular biology, perhaps most strikingly in the operation of the "genetic code" physically embodied in the form of DNA and RNA. In the genetic code, sequences of three nucleotide bases, called codons, are used as symbols for the amino acids in proteins—and the so-called translational machinery of the cell biochemically synthesizes proteins from the coded instructions represented in RNA. In this case, the amount of information transmitted can be calculated, for example, as it goes from the DNA to the RNA and then to the protein, in which process a fraction of the information (about half) is typically unused or lost (Dewey, 1996, 1997). Biological systems differ from nonliving systems in several ways, but the most profound differences might lie in their information content. It can be useful for this purpose to think of biological systems as evolved transducers of information, since organisms accumulate, process, store, and share information of different types and on different time scales. An organism needs information about its internal condition to manage its internal functions. For example, organisms use internal information to maintain homeostasis, to coordinate and regulate development, and to detect potential pathogens. (See Chapter 8 for more examples.) Information about the external world also is critical for an organism to deal effectively with that world—for example, organisms use external information to find shelter, to escape from predators and compete with rivals, and to reproduce and care for offspring.
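The point that a sizable fraction of the information is unused in the passage from nucleic acid to protein can be illustrated with a back-of-envelope sketch. This toy accounting simply compares raw symbol capacities; it is far cruder than Dewey's actual information-theoretic calculation, and its numbers differ accordingly:

```python
from math import log2

# Raw capacity of one codon: 3 bases, each one of 4 possible symbols.
bits_per_codon = 3 * log2(4)        # 6.0 bits

# Identity of one amino acid: one of 20 possibilities.
bits_per_amino_acid = log2(20)      # about 4.32 bits

# Fraction of the codon's raw capacity not reflected in the amino acid.
fraction_unused = 1 - bits_per_amino_acid / bits_per_codon
print(f"{fraction_unused:.2f}")     # about 0.28
```

Even this crude count shows that the genetic code is redundant: synonymous codons mean that a codon carries more raw capacity than the amino acid it specifies.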
Information about the structure and physical function of the organism also is necessary for evolution to proceed, and this information is sequestered in the genome in a variety of ways, some of which are not yet understood. The information described above is represented in a variety of forms, probably none more well known than the digital information in the genomes of organisms. This information is central to biology, for it represents the largest share of information that is passed on in reproduction (Hood and Galas, 2003). The field of genetics investigates the way in which symbolic information in the genetic material is inherited and interpreted as messages about protein and RNA structure or as messages about the timing and levels of gene expression. Cell biology seeks to understand how intracellular components encode and interpret the information necessary to organize cellular structure, maintain homeostasis, and carry out cellular functions. Development can be seen as the study of how these messages are used to extract and interpret the information in the genome in order to turn a single cell into a complex multicellular organism composed of thousands of cells with specialized functions. Neurobiology is the study of how internally generated electrical signals are combined with information about the environment to
allow the animal to generate meaningful behavior. Immunology depends critically on the problem of detection and surveillance—that of distinguishing self from harmful invaders such as bacteria and viruses or aberrant self such as cancerous cells. This detection problem requires exquisite sensitivity and precision, as does the process of mounting and regulating the appropriate responses. The storage and transmission of information are fundamental to living things, but they are not the exclusive properties of life. For inanimate matter, the power of information storage and transmission is decidedly limited, but not entirely absent—for example, crystals, dendritic minerals, snowflakes, and other physical and chemical structures form, and thereby store, information in spontaneous order. In living things, however, the power of information acquisition and transmission is enormous, characteristic, and almost unlimited in potential. The transition from the inanimate to the animate might well be thought of as the acquisition of the singular ability to increase the storage and transmission of information, in quantity and quality. The possibility of this increase of information, well beyond what is ever seen in inanimate matter, is fundamental to the process called evolution. Darwin's marvelous ideas, which embody this concept in a qualitative fashion, can be viewed as the realization that variation and selection are the key characteristics of this potential and that they interact to accumulate information in living lineages. The idea of evolution has, in fact, been recast in modern times in terms of information flow. Evolutionary biology can be thought of as the study of how information enters the genome, persists, and changes over time—the ebb and flow of information, its gain and loss.
Developing a conceptual treatment of information measurement, storage, and transmission in biology will require logical discipline. The process of doing so will elucidate—and raise new—questions about the dynamics of evolution and the processes of physiology, development, and behavior and perhaps will even shed light on the origins and the fate of living systems.

INFORMATION THEORY

While the concept of information in biology makes sense using a common-language perspective on the term "information," and while it captures the symbolic or representational nature of much biological information (Godfrey-Smith, 2000), an adequate definition of the term "information" for formal use in biology remains somewhat elusive. The two fundamental questions are: First, how is a particular kind of biological information represented or encoded? And second, how can the quantity of information in a given representation, biological or not, be usefully defined and, most importantly, how can it be measured? Some guidance is available from the large body of theory and research that deals with information outside of the field of biology. One approach is provided by information theory as founded by Claude Shannon and Norbert Wiener to quantitatively understand communication channels (Shannon, 1948; Wiener, 1948; Shannon and Weaver, 1949) (Figure 7-1).

FIGURE 7-1 Shannon's framework for thinking about information transmission. Information from a source is encoded or represented by a transmitter, which sends that information through a (possibly noisy) communication channel in the form of a signal. The signal is received by a receiver that decodes the message and delivers it to the destination. SOURCE: C.E. Shannon (1948). From The Mathematical Theory of Communication. Copyright 1949, 1998 by the Board of Trustees of the University of Illinois. Used with permission of the University of Illinois Press.

In Shannon's theory, information is essentially that which allows its bearer to distinguish among alternative possibilities—the range of possible messages. By this approach, the more alternatives that can be distinguished among, the more information has been transmitted (Box 7-1). For example, sitting in a windowless office, one cannot distinguish among different possible weather conditions outside. It might be sunny or cloudy. By checking the weather on the Web, one can find out which of these states is actually occurring. Thus, the weather report contains information. If the weather report also gives the temperature, then one can distinguish among even more states: sunny and hot, sunny and cold, cloudy and hot, cloudy and cold. In this case, the report provides more information than if it only indicated current cloud cover.
This view of information is closely related to notions of communication and computation; the amount of information conveyed by a signal is proportional to the bandwidth that would be required to send that signal through a communication channel or the storage space that would be required to record the message in compressed form on a computer.
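This count-of-alternatives view is easy to make concrete. A minimal sketch for the weather example, using base-2 logarithms so that the units are bits:

```python
from math import log2

def info_bits(n_states):
    """Information gained by a report that singles out one of
    n_states equally likely, distinguishable states."""
    return log2(n_states)

print(info_bits(2))  # sunny vs. cloudy: 1.0 bit
print(info_bits(4))  # adding hot vs. cold: 2.0 bits
```

Doubling the number of messages squares the number of distinguishable states, but the logarithm turns that squaring into a doubling of the bit count, which is exactly the additivity the text describes.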
In the Shannon approach to information, the message is the result of a transmitting source sending a signal in some given representation, usually a symbolic alphabet. Rather than assigning an information content to any specific message, Shannon's measure of the amount of information sent through the channel depends only upon the characteristics of the source and the range of possible messages that might be sent. Neuroscientists have an advantage in adapting formal information theory to their work, as spike trains can be easily understood to carry information about sensory inputs. However, in other fields of biology, it can be more difficult to define the properties of the "source" and the symbolic alphabet used in representing biological information in any satisfying way, so this approach presents conceptual problems. For example, calculating the amount of information in the amino acid sequence of a protein requires knowing how many such sequences are possible. The question then is: What does that mean—literally all possible amino acid sequences of that length, or all possible sequences represented in living organisms, or all possible sequences in the currently known database of protein sequences, or some other way of characterizing the possibilities? These different possible "sources" would all yield different measures of information. This is clearly a problematic approach.

Algorithmic Information Theory

An alternative approach to defining information brings out the role of information in computation. Rather than measuring the information content of a statistical source, as Shannon does, algorithmic information theory considers only the message itself and asks what is required to generate or reconstruct just that message. This inherent "complexity" idea comes from a formulation known as Kolmogorov complexity, after the Russian mathematician Andrei Kolmogorov.
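Kolmogorov complexity is uncomputable in general, but off-the-shelf compressors give a rough practical upper bound on it, which is the trick behind compression-based comparisons of sequences. A toy sketch (illustrative only, not the actual method of Li et al.):

```python
import random
import zlib

# A message with an obvious pattern can be regenerated from a short
# description, so it compresses well...
regular = b"ACGT" * 250                       # 1,000 bytes

# ...while a patternless message of the same length cannot.
random.seed(0)
patternless = bytes(random.randrange(256) for _ in range(1000))

print(len(zlib.compress(regular)))      # a few dozen bytes
print(len(zlib.compress(patternless)))  # close to 1,000 bytes
```

The compressed length stands in for "the shortest description that regenerates just this message," which is the quantity Kolmogorov complexity formalizes.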
The key concept was arrived at independently during the 1960s by Ray Solomonoff (1964), Gregory Chaitin (1966), and Kolmogorov (1965). This simple but subtle idea holds considerable promise for biology. It is currently heavily used in image processing, pattern recognition, artificial intelligence techniques, and other engineering applications, but it is just now beginning to be used in biological applications. For example, this powerful approach has been applied to calculating mitochondrial genome phylogeny (Li et al., 2004).

In addition to the difficulties discussed above, it is becoming clear that most biological information depends on the context in which it finds itself—what other information is present in the same system, and how that information influences the range of actions that a protein, cell, or organism can take. If the representation of information cannot be "read" or used when it is out of context, it carries no meaningful information. For example, a pheromone or vocal call of one species commonly conveys little information to another species (or at least a very different kind of information); a common human gene, rich in information for a human cell, is likely to carry no meaningful information in a bacterial cell; a segment of amino acid sequence that folds into a functional protein structure in the context of the sequence of its native protein may be useless and nonfunctional when set in the context of another protein sequence; the structure of an orchid's flower might facilitate pollination only for a single species of insect; and so on. Almost all biological examples have some contextual content. The information measures discussed earlier do not explicitly take context into account, as they were purposefully designed to be context-free. For biology, context is almost always essential, and consistent and useful theoretical tools are needed to describe, measure, and use contextual information in complex biological systems.

Box 7-1 The Mathematical Basis of Shannon's Ideas

Suppose we receive a message m that could take any of four possible forms, A, B, C, or D, each with equal probability. How much information is associated with message m? Because the message allows us to distinguish among four different alternatives (A, B, C, or D), we might be tempted to say that m conveys four units of information. But suppose that we receive two such messages, m1 and m2, one after the other. Intuitively, it would be nice to say that this pair of messages gives us twice as much information as did the single message m. But notice that this pair of messages actually allows us to distinguish among not eight but rather 16 equally likely possibilities. By doubling the number of messages, we have quadrupled the number of alternatives among which we can distinguish:

AA BA CA DA
AB BB CB DB
AC BC CC DC
AD BD CD DD

In a series of early (1917-1928) papers, Harry Nyquist and R. V. L. Hartley pointed out that if we measure information by the logarithm of the number of alternatives that can be distinguished, the problem is resolved. The message m gives us log(4) units of information. The pair of messages m1 and m2 together gives us log(16) = 2 log(4) units of information—exactly twice what we obtained from the single message alone.

Now what happens if the different messages have different probabilities of occurring? Suppose that message A is sent with probability 7/10, while messages B, C, and D occur with probability 1/10 each. In this situation, it seems that if message B comes through, we've learned more than if message A comes through. Each message—A through D—allows us to distinguish among four alternatives, but somehow we seem to have learned more when we receive message B than when we receive message A. After all, in the absence of prior knowledge we would have been "expecting" the signal A anyway, so when A does arrive this doesn't come as a particular surprise. Can we capture this somehow in our definition of information?

Consider another example. Suppose there are 10 possible states of the world: A1, A2, A3, A4, A5, A6, A7, B, C, and D. Then if we receive signal B, this allows us to distinguish among 10 states of the world. Signals C and D are the same; each of these provides us with log(10) units of information. If A1 occurs, this also has information log(10), but if we simply receive the signal A in response to this event, we actually don't find out whether A1, A2, A3, A4, A5, A6, or A7 occurred. Thus we have lost the ability to distinguish among seven alternatives; the net amount of information that we get is then

log(10) − log(7) = log(10/7).

This suggests a measure of the information provided by a signal S that occurs with probability p:

I(S) = log(1/p) = −log(p).

Applying this to our example above: signal A occurs with probability 7/10 and so provides log(10/7) units of information, while signals B, C, and D each occur with probability 1/10 and provide log(10) units.

At last we are in a position to define the expected amount of information transmitted by a signal. Suppose that, as in our previous example, the message m takes the form of one of four signals, A, B, C, and D, with probabilities 7/10, 1/10, 1/10, and 1/10, respectively. Then with probability 7/10 we will get a signal (A) that provides log(10/7) units of information, and with probability 3/10 we will get one of the three signals (B, C, or D) that provides log(10) units of information. The average, or expected, amount of information provided is then 7/10 log(10/7) + 1/10 log(10) + 1/10 log(10) + 1/10 log(10). More generally, we can say that if symbols i = 1, 2, 3, …, m are sent with probabilities p1, p2, p3, …, pm, the average amount of information H in a message is given by

H = p1 log(1/p1) + p2 log(1/p2) + … + pm log(1/pm).
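The quantities in Box 7-1 take only a few lines to compute. A sketch for the four-signal example with probabilities 7/10, 1/10, 1/10, and 1/10, using base-2 logarithms so that the units are bits:

```python
from math import log2

def surprisal(p):
    """Information, in bits, provided by a signal of probability p."""
    return log2(1 / p)

# Box 7-1's example: A with probability 7/10; B, C, D with 1/10 each.
probs = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}

# Expected information: H = sum over i of p_i * log(1/p_i)
H = sum(p * surprisal(p) for p in probs.values())
print(f"H = {H:.3f} bits per message")
```

As the box argues, the rare signals carry more surprise than the common one, and H weights each signal's surprisal by how often it actually arrives.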
Decision Theory

The field of decision theory more directly accounts for context when measuring information. This is a body of theory that is designed to study optimal choice behavior. In decision theory, information allows its bearer to make good choices in an uncertain world. Information is measured not by the bandwidth required to convey it, or its statistical structure, but rather by its value. The value of information is measured by the best payoff one expects to get from a decision based upon that information, minus the best payoff one can expect if one has to make the decision without that information. For example, an investor can gain higher expected returns from the stock market if she knows more about the corporations in which she invests. Information about these corporations is measured by the difference in expected returns. These ideas have found fertile ground in application to biological information problems, particularly in evolutionary ecology and behavioral biology. There, in decision problems and game-theoretic scenarios alike, information is routinely measured by its influence on expected fitness. Evolution establishes a relationship between the quantity of information and its usefulness, but whether this relationship is general, specific, or even expressible in a succinct form is not known at the moment. The need for more theory in this case is evident, but developing those theories will require valid and precise information measures, and probably a great deal more data. Then, perhaps, biologists will be able to construct good quantitative theories that use information as a key measure in biological systems and begin to understand biological complexity in a quantitative, consistent, and useful sense.
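The decision-theoretic measure can be sketched with a toy investment problem; the actions, payoffs, and probabilities here are made-up numbers purely for illustration:

```python
# Hypothetical decision: choose "stock" or "bonds" before learning the
# state of the world. Two equally likely states, with made-up payoffs.
#                 state "good"  state "bad"
payoff = {"stock": (12.0, -4.0),
          "bonds": ( 2.0,  2.0)}
p_good = 0.5

def expected(action):
    good, bad = payoff[action]
    return p_good * good + (1 - p_good) * bad

# Without information: commit to the one action with the best expected payoff.
best_uninformed = max(expected(a) for a in payoff)

# With perfect information: pick the best action separately in each state.
best_informed = (p_good * max(g for g, _ in payoff.values())
                 + (1 - p_good) * max(b for _, b in payoff.values()))

value_of_information = best_informed - best_uninformed
print(value_of_information)  # 3.0
```

The signal's value is the difference between the two expected payoffs, with no reference to bandwidth or statistical structure; the same arithmetic, with payoffs replaced by fitness, underlies the evolutionary applications mentioned above.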
STORING AND EXPRESSING INFORMATION IN THE GENES

The discovery of how biological systems transduce genetic information was one of the most profound triumphs of 20th-century science. Somehow, the cells of an organism contain the hereditary information that—given appropriate interactions with the environment—determines phenotype and behavior. Over the course of a century, researchers in the field of genetics have largely worked out the common set of mechanisms by which all living organisms represent and express the hereditary information in their genes, leading to a detailed understanding of the mechanistic basis of heredity (see also Chapter 9). Several questions remain to be answered, however, in order to fully understand how a system uses this information. First, the information must exist in some physical form; what is the chemical, mechanical, or electrical structure in which it is represented? Second, what does the information
encode, and how are the details encoded and expressed? Third, how is the information in its physical form transduced so that it can be realized in phenotype and behavior? In the case of genetics, the rules of heredity of certain properties or traits of organisms, as discovered by Gregor Mendel, were rediscovered at the beginning of the 20th century. A major initial discovery was to determine exactly where this hereditary information lay. At the turn of the 20th century, Boveri, Sutton, and Morgan realized that the known rules of heredity could be explained if the hereditary information was somehow contained in the chromosomes. Fifty years later, Hershey and Chase devised a stunningly simple experiment that used their knowledge of bacterial viruses and a kitchen blender to provide strong chemical and physical evidence that the DNA component of the chromosomes was the actual information carrier (Hershey and Chase, 1952). Beadle and Tatum (1941) suggested that the information in genes describes how to make proteins: They postulated that genes affect function because each gene encodes a single protein. The next conceptual step was to figure out how the structure of DNA encodes information and how that information can determine the formation of a complex organism. In principle, this could happen in a number of ways. For example, DNA might form some kind of geometric template for complex proteins. It might form some kind of polymeric substrate for driving thousands of different catalytic reactions. Or, as turns out to be true, DNA could be a coded instruction set that is read and decoded by another sort of molecular machinery. Watson and Crick inferred the rules of base pairing and revealed the now-famous structure of the double-helical DNA molecule (Watson and Crick, 1953).
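The coded-instruction-set view can be made concrete with a toy translator. The sketch below uses only a handful of the 64 codons of the standard genetic code, with one-letter amino acid abbreviations and "*" marking a stop codon:

```python
# Illustrative fragment of the standard genetic code (not the full table).
CODON_TABLE = {
    "AUG": "M", "AAA": "K", "UUU": "F", "GGC": "G",
    "UAA": "*", "UAG": "*", "UGA": "*",
}

def translate(mrna):
    """Read an mRNA three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("AUGAAAUUUGGCUAA"))  # prints MKFG
```

The lookup table is the "code" and the three-base reading frame is the decoding machinery in miniature; the triplet structure of that frame is precisely what Crick, Brenner, and colleagues established.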
Crick, Brenner, and colleagues figured out that there was a triplet code in the DNA, so that each three base pairs of DNA determine one amino acid of the resulting protein (Crick et al., 1961). Subsequently, Nirenberg, Khorana, Holley, and others worked out the coding rules by which DNA sequences are ultimately translated into proteins, primarily by using synthetic RNA molecules in biochemical reaction mixtures for making proteins in the test tube. These rules are now known as the "genetic code," even though it is now known that much more than the protein sequence information is contained in the DNA molecule of every organism. Much of the subsequent revolution in molecular biology that unfolded in the last half of the 20th century elaborated biologists' understanding of how each step of this process works: how DNA encodes protein structure, how the cellular machinery translates this code into proteins, and how the rest of the molecule provides information for the control of which proteins to synthesize and when. Thus, a complete picture of DNA has been developed as a uniquely stable molecule that stores complex specifications for building and managing the organism. The specification can be found in a
pattern-based code that depends on the linear sequence arrangement of its monomers (base pairs) rather than the DNA molecule's collective mechanical or chemical properties. From this picture it can be concluded that the complex of reactions catalyzed by proteins can be orchestrated by how and when the information encoded in the DNA is expressed, but how the regulation of DNA expression impacts protein behavior was not at all obvious. The issue of how cells process information and make the computations that control protein expression became a major conceptual problem (Jacob and Monod, 1961). The revelation of the DNA structure and the genetic code opened the door to this problem; biologists now understand the control of gene expression to some extent, but its full complexity remains to be unraveled. Expression of information from the DNA into structural parts, catalytic enzymes, and other macromolecules drives a large part of the complex structures and functions of biological systems—cells, organs, organisms. Biologists are just beginning to figure out the patterns, the rules, and all of the machinery that generates this complexity from the information stored in the DNA. In fact, the theoretical underpinnings of this general problem—the conceptual basis of the global control of gene expression—are among the major modern challenges of biology. Jacob and Monod worked out how a bacterial cell controls the expression of the set of genes it uses to take advantage of a particular energy source (the sugar lactose) that it encounters (Box 7-2). That work illustrates in a simple form how the molecular machinery and information processing of the bacterial cell inform our understanding of the regulation of gene expression. The basic components of the lactose operon are shown in Box 7-2 (part b). The lactose repressor is encoded in a nearby gene, the lacI gene.
This protein is produced at the same low level all the time, independent of the medium or the metabolic state of the cell. It forms a tetramer of identical subunits that recognizes and binds to a specific DNA segment that overlaps the promoter of the lactose-metabolizing genes—when the repressor is bound, transcription is off. Turning on the expression of the genes that metabolize a particular substance is an example of what Jacob and Monod called induction. In this case, lactose is the inducer. The inducer binds directly to the lactose repressor and causes the protein itself to change its conformation, rendering it incapable of binding tightly to the operator. This is the basic induction response of the lactose operon—a disabling of a negative regulatory mechanism that allows transcription of the gene to proceed. Despite its name, this operon is sensitive to factors other than the presence or absence of lactose. The cell does not need to metabolize lactose if other carbon and energy sources are available. Glucose is the preferred energy source in bacteria because it is a highly energy-efficient carbon source.
(Glucose is also one of the products of catabolism of lactose by β-galactosidase.) The ability of glucose to regulate the expression of a range of operons ensures that bacteria will utilize glucose before any other carbon source as a source of energy. This control over a number of different inducible operons is accomplished through a protein called the cAMP-binding protein (CAP; Box 7-2b). A key observation in deciphering this mechanism was the inverse relationship between glucose levels and cAMP levels in E. coli. When glucose levels are high, cAMP levels are low; when glucose levels are low, cAMP levels are high. Biologists now know that this relationship exists because the transport of glucose into the cell directly inhibits the enzyme adenyl cyclase that produces cAMP. The cAMP then binds to CAP in the bacterial cell. The cAMP-CAP complex, but not free CAP protein, binds to a site on the DNA in the promoters of catabolite repression-sensitive operons. The binding of the complex enhances the activity of the promoter, and thus more transcripts are initiated from that promoter; the control is therefore positive. The logic of this regulatory module is then the following: if there is little or no glucose present, and lactose is available, the operon turns on. There are two inputs and one output. The lactose module can be thought of as an integrator of sorts. If the regulatory response is treated as binary, or Boolean (on or off), the operon can be considered an "AND gate." While the lactose operon is complex in the sense that several proteins, specific DNA-protein interactions, induced conformation changes in the repressor and the CAP protein, metabolic sensors, and enzymatic activities are involved, it behaves like a simple "AND gate," as depicted in Box 7-2c, from the point of view of the cellular logic.
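The Boolean reading of this regulatory logic takes only a few lines; this is a deliberately simplified sketch of the two-input gate, not a model of the operon's graded, quantitative behavior:

```python
def lac_operon_on(lactose_present, glucose_present):
    """Boolean sketch of lac operon control: the operon is transcribed
    when lactose disables the repressor AND scarce glucose (hence high
    cAMP) lets the cAMP-CAP complex activate the promoter."""
    repressor_released = lactose_present   # inducer inactivates the repressor
    cap_active = not glucose_present       # low glucose -> high cAMP -> cAMP-CAP
    return repressor_released and cap_active

# Truth table: only lactose present AND glucose absent turns the operon on.
for lactose in (False, True):
    for glucose in (False, True):
        print(lactose, glucose, "->", lac_operon_on(lactose, glucose))
```

Note that one input enters the gate negated: glucose acts through the loss of cAMP-CAP activation, so the gate is really "lactose AND NOT glucose."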
The quantitative aspects of the operon's behavior matter for some aspects of the cell's response, so the Boolean model is insufficient in detail, but the basic response is really very simple. The lactose operon system processes information about the environment of the cell in order to regulate the expression of information stored in the genome of the bacterium. However, it is unclear how to describe and measure this information consistently, whether it comes from the environment or from the genome. The relevant information is difficult to describe and measure because of the complexity of the local environment, the diversity of information types present, and the complexity of the genome. Bacteria typically have a few thousand genes and a few million base pairs of DNA in their genomes, whereas mammals have 25,000 or so genes and a few billion base pairs. Although comprehensive models of gene expression, particularly in simple bacteria and archaea, are being developed, biologists' understanding of the global regulation of gene expression in any multicellular organism is far from comprehensive (Bonneau et al., 2006).
Box 7-2 The Lactose Operon: A Genome-Encoded Network

In the 1950s, researchers noticed that the bacterium E. coli synthesizes the lactose-metabolizing enzyme β-galactosidase only when lactose is present in its growth medium. Jacob and Monod and their colleagues focused on this phenomenon and hypothesized the correct explanation for it. The explanation was elaborated into the "operon model," and the field of molecular gene regulation, still a major research area in biology today, began in earnest. The lac genes (there are three, not just the gene for the lactose-cleaving enzyme β-galactosidase) are transcribed as a single mRNA unit, an arrangement common in bacteria, and control is accomplished at the level of transcription of this single mRNA. The three structural genes that code for the protein enzymes involved in lactose metabolism are lacZ, which codes for β-galactosidase (the enzyme that breaks down lactose into glucose and galactose); lacY, which codes for a permease (involved in the uptake of lactose from the medium into the cell); and lacA, which codes for a galactoside transacetylase. These genes are transcribed from a common promoter into an mRNA, which is translated to yield the three distinct enzymes. Because the critical factors that the cell is responding to are metabolic in nature (the need to use lactose as a carbon source), the genetic regulatory network is coupled to the metabolic network of the cell. The lactose system was a fortunate choice by Jacob and Monod because it turns out to be a very simple system indeed, at least by the standards of genetic regulatory networks.

Structure and function of the lactose operon. (a.) The organization of the lactose operon is shown along the (blue) genomic DNA molecule.
The promoters are shown as red arrows, and the regulatory sites on the lac promoter, the CAP-binding site and the repressor-binding site, or operator, are shown as blue and green boxes, respectively. (b.) The regulatory flow of information is shown in blue (genetic regulation) and orange (metabolic), illustrating the essential components that are coupled across the boundary between the metabolic and genetic domains. (c.) The logical structure of the regulatory relationships is summarized in a diagram that uses symbols common in electronic logic. The simplicity of the basic logic is evident even though the biochemical and genetic interactions underlying it are much more complex. SOURCE: Courtesy of David Galas.

REPRESENTING INFORMATION IN DEVELOPMENT

In the process of development, information from the genome is used to execute a program of cell division and change (differentiation) that creates a multicellular organism from a single cell. The early embryo develops from a single cell, driven by a network specified by information in the genome. This information comes in two forms: the DNA sequences that are binding sites for a number of proteins and the DNA sequences that are actual protein-encoding segments. These proteins bind to sites in the DNA near genes, some of which encode other proteins that bind DNA sites and regulate their expression (transcription factors). The structure of one such network is indicated in Figure 7-2. Davidson and colleagues have mapped out this network for the sea urchin embryo (Bolouri and Davidson, 2003; Howard and Davidson, 2004;
FIGURE 7-2 Sea urchin embryo development network. The structure of a network that executes the program of cell division and differentiation in the early sea urchin embryo. The genes are represented by the short horizontal lines; the control relationships are depicted by the lines extending between these genes. The different modules that control gene expression in different components of the early embryo are indicated in color: the pink segment is the skeletal cell module, the green segment is the mesoderm module, and so on. SOURCE: Reproduced with permission of Eric H. Davidson.
Levine and Davidson, 2005; Istrail and Davidson, 2005). What this static picture does not show is the dynamics of the changing levels of gene expression as the program unfolds in time; the first 30 hours of the embryo's development are driven by this dynamic network. There is also unseen complexity in the batteries of other genes, including the metabolic and structural genes, that are expressed in each cell type, driven by the specific set of transcription factors present in the cells of each type. This example illustrates the nature of the information needed and the degree of complexity involved in early embryogenesis. In many ways, the most remarkable thing about this work and the resulting genetic regulatory network is that it is decipherable and understandable at all. The dynamics
of the gene expression program that leads to the early sea urchin embryo are among the most explicit cases known to date in which the informational definition of a network has such clear biological significance. This advance will soon be only one of many such cases, and the specific and quantitative role of genetic information in embryogenesis will soon be much clearer. Study of gene regulatory networks in different organisms suggests that various subroutines are employed repeatedly. As discussed in Chapter 6, these conserved "modules" provide a circuit that drives a particular kind of outcome. The kind of circuit needed for development differs from the circuits characteristic of physiological regulatory networks like the lac operon described above. Box 7-3 gives examples of a particular circuit used in several different developmental pathways.

SHARING INFORMATION

Much of the accumulation of biological complexity over the history of life on Earth has arisen through major transitions in which previously unassociated entities either joined into a common reproductive fate or developed cooperative associations while maintaining reproductive independence (Maynard Smith and Szathmáry, 1995). These transitions have had a number of effects, such as economies of scale and functional specialization. For example, symbioses that developed into complete cellular dependence occurred at least twice in the history of life: in the acquisition of mitochondria by eukaryotic cells and in the acquisition of chloroplasts by algal cells. It is significant that the incorporated bacterial cells that became mitochondria and chloroplasts retained part of their genomes, and the ability to replicate them, when they joined forces with their eukaryotic hosts. The cell acquired an entire new genome (or two).
After the incorporation and the transfer of many genes into the nuclear genome, the symbiosis between engulfed prokaryote and host eukaryote eventually evolved into a full partnership. Coordinating this symbiosis required the cellular genome to communicate with the organelle genome in ways that ultimately became permanently fixed in the information of their respective genomes. The evolution of mitochondria and chloroplasts illustrates how the complexity of biological information can increase: these intracellular organelles require cells to have a new level of communication and coordination. Information sharing works differently from the sharing of physical resources in that it is not a "zero-sum game," as expressed by the Irish playwright George Bernard Shaw:
Box 7-3 Comparison of Developmental and Physiological Regulatory Networks

Unlike many physiological regulatory networks, whose purpose is to move the cell to a new state in response to the environment (see Box 7-2), developmental regulatory networks are more like sequential computer subroutines in that they drive the unfolding of a defined set of successive steps or stages as the program executes over time. Whereas the dynamical properties of physiological networks enable them to transition back and forth between states in response to a changing environment, developmental networks, while sensing and coordinating with the cellular environment, must drive a regular, irreversible series of transitions through a defined series of states. Developmental programs, at least in the early embryo, probably never get near a steady state. The program inexorably drives itself forward, unfolding each successive stage of gene expression in the appropriate cell types. How do these kinds of programs work? Is there a theme or a repertoire of mechanisms? A number of network mechanisms that drive developmental systems forward are known; a recurring theme is a small set of genes whose regulatory interactions drive the transition of the network to its next stage. Despite the variety of organisms and cell differentiation pathways represented in the four examples shown here (two from the sea urchin (a and d), one from the mouse (b), and one from the fruit fly (c)), all have the following properties in common: input to the regulatory region of one gene (represented by the small black arrow) drives a positive feedback loop that turns on one or more genes in the small module and stabilizes the new state of those downstream genes, which in turn regulate other genes that will change the state of the cell.
Once these circuits are triggered, they switch inexorably to a new state and do not return to their initial state. The small boxes in the figure indicate in simplified binary form (1 is "on," 0 is "off") the initial state of the circuit (upper line) and the final state of the circuit (lower line).
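The irreversible 0-to-1 switching described in this box can be illustrated with a toy discrete-time simulation of a self-activating gene. All quantities here (threshold, gain, decay, pulse timing) are arbitrary illustrative values, not parameters of any real circuit.

```python
def simulate_switch(steps=30, pulse=(5, 8), threshold=0.5, gain=2.0, decay=0.5):
    """Toy discrete-time sketch of a positive-feedback switch.

    x is the expression level of a gene that activates itself.
    A transient input pulse pushes x past the threshold; the
    feedback loop then holds the circuit in the 'on' state even
    after the input is gone (0 -> 1, irreversibly).
    """
    x = 0.0
    history = []
    for t in range(steps):
        inp = 1.0 if pulse[0] <= t < pulse[1] else 0.0
        feedback = gain * (x > threshold)  # self-activation once past threshold
        x = max(0.0, min(1.0, x + 0.3 * inp + 0.3 * feedback - decay * x))
        history.append(x)
    return history

h = simulate_switch()
print("initial state:", h[0])   # off before the pulse
print("final state:  ", h[-1])  # remains on after the pulse ends
```

Running the same simulation with no input pulse leaves the circuit off for the whole run, which is the binary upper-line/lower-line contrast the box describes.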
SOURCE: Figure courtesy of David Galas, based on information contained in Davidson (2003).
If you have an apple and I have an apple and we exchange apples, then you and I will still each have one apple. But if you have an idea and I have an idea and we exchange these ideas, each of us will have two ideas.

Clearly, there will be situations in which sharing information devalues that information; for example, if a person shares the location of a limited food source with others, the sharing is likely to reduce the amount of food that the person gets from that location. But devaluation by sharing is not an inherent property of information itself; rather, it is a consequence of the situation. In other cases, sharing information carries no such costs: if a person warns others about tomorrow's severe weather, the warning does not impinge on the person's own ability to take appropriate precautions. In some cases, sharing information may even increase its value. For example, Marzluff et al. (1996) and Wright et al. (1998) provided compelling evidence that the communal roosts of common ravens (Corvus corax) serve as "information centers" in which individuals share information about the location of food sources. In this case, there are direct benefits to sharing: the members of a communal roost cannot feed unless they arrive at the food source in large enough numbers to displace the local territory holders. Thus, knowledge of the location of a food source is useless unless shared. Moreover, the costs of sharing the information are small or nonexistent; these food resources, typically large carcasses, are often so big that a group of ravens cannot consume one entirely before the resource is lost to snowfall, mammalian scavengers, or other causes.

INFORMATION AND EVOLUTION

The evolutionary process itself can be conceptualized as a process of information acquisition.
The sorts of information that are represented in the genome, and the ways in which this information is extracted from the genome by the living organism, were discussed earlier. But how did this information get into the genome in the first place? The answer is that information accumulates in the genome through the process of evolution by natural selection. Mutation in its many forms provides a wide range of variation, but on its own, mutation does not necessarily add information with respect to the environment (i.e., it does not increase the Shannon mutual information between genome and environment). For example, a "silent" mutation does not immediately change an organism's phenotype. The additional information comes in as a result of the sorting process of natural selection: selection preserves those genotypes that operate more effectively in the environment and discards those that are less effective.
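The parenthetical point about mutual information can be made concrete with a small calculation. The genotypes and environments below are toy placeholders; the sketch only shows how the Shannon mutual information I(G;E) is computed from a joint distribution over genotype-environment pairs.

```python
import math

def mutual_information(joint):
    """Shannon mutual information I(G;E) in bits, given a joint
    distribution p(genotype, environment) as {(g, e): probability}."""
    pg, pe = {}, {}
    for (g, e), p in joint.items():
        pg[g] = pg.get(g, 0.0) + p  # marginal over genotypes
        pe[e] = pe.get(e, 0.0) + p  # marginal over environments
    return sum(p * math.log2(p / (pg[g] * pe[e]))
               for (g, e), p in joint.items() if p > 0)

# Toy case 1: each of two genotypes is perfectly matched to one of
# two equally likely environments -> 1 bit of mutual information.
matched = {("g1", "e1"): 0.5, ("g2", "e2"): 0.5}
print(mutual_information(matched))  # 1.0

# Toy case 2: genotype independent of environment -> 0 bits,
# as with unselected variation.
independent = {(g, e): 0.25 for g in ("g1", "g2") for e in ("e1", "e2")}
print(mutual_information(independent))  # 0.0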
One can quantify this relationship: Haldane (1957) and Kimura (1961) established that information can accumulate in a sexual population at a rate no higher than −log2(1−s) bits per generation, where s is the selective load (basically, the fraction of the population lost to selection). Recent analyses of evolution in fluctuating environments (Bergstrom and Lachmann, 2004; Kussell and Leibler, 2005) further draw out the relation between information-theoretic measures of genomic information and the concept of Darwinian fitness. These analyses hint that the two different ways of measuring information, the Shannon framework and the decision-theoretic framework, could be closely related under special circumstances. Kelly (1956) characterized one such set of circumstances: he established a relationship between the value of side information to a gambler and the entropy rate of the process being wagered upon. Evolution by natural selection appears to provide another example. However, further work is needed to reach a thorough understanding of these relations.

CONCLUSIONS

An attempt to characterize living systems by citing just two essential properties would probably include, first, that they are thermodynamically far from equilibrium and, second, that they store, accumulate, and transmit large amounts of information. While there is still a struggle to shape these concepts in ways that are rigorous and useful for biology, biologists can recognize that information is indeed a valuable way to describe many life processes. Many nonbiological disciplines, including mathematics, computer science, and statistics, face problems similar to some of those that biologists grapple with. The problem of understanding biological information, and of developing fruitful theoretical ideas and useful tools, will likely be aided by this rich vein of ideas and methods.