ROBERT M. HAZEN,* PATRICK L. GRIFFIN,* JAMES M. CAROTHERS,^{†} and JACK W. SZOSTAK^{‡}

Complex emergent systems of many interacting components, including complex biological systems, have the potential to perform quantifiable functions. Accordingly, we define “functional information,” *I*(*E*_{x}), as a measure of system complexity. For a given system and function, *x* (e.g., a folded RNA sequence that binds to GTP), and degree of function, *E*_{x} (e.g., the RNA–GTP binding energy), *I*(*E*_{x}) = −log_{2}[*F*(*E*_{x})]_{,} where *F*(*E*_{x}) is the fraction of all possible configurations of the system that possess a degree of function > *E*_{x}*.* Functional information, which we illustrate with letter sequences, artificial life, and biopolymers, thus represents the probability that an arbitrary configuration of a system will achieve a specific function to a specified degree. In each case we observe evidence for several distinct solutions with different maximum degrees of function, features that lead to steps in plots of information versus degree of function.

Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.

Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 25

2
Functional Information and the
Emergence of Biocomplexity
roBerT M. hAZen,* PATriCK l. GriFFin,*
JAMes M. CAroThers,† and JACK W. sZosTAK‡
Complex emergent systems of many interacting components,
including complex biological systems, have the potential to per-
form quantifiable functions. Accordingly, we define ‘‘functional
information,’’ I(Ex), as a measure of system complexity. For a
given system and function, x (e.g., a folded RNA sequence that
binds to GTP), and degree of function, Ex (e.g., the RNA–GTP
binding energy), I(Ex) = −log2[F(Ex)], where F(Ex) is the fraction
of all possible configurations of the system that possess a degree
of function > Ex. Functional information, which we illustrate with
letter sequences, artificial life, and biopolymers, thus represents
the probability that an arbitrary configuration of a system will
achieve a specific function to a specified degree. In each case
we observe evidence for several distinct solutions with different
maximum degrees of function, features that lead to steps in plots
of information versus degree of function.
*Geophysical laboratory, Carnegie institution, 5251 Broad Branch road nW, Washington,
DC 20015-1305; †California institute for Quantitative Biomedical research and Berkeley
Center for synthetic Biology, University of California, 717 Potter street MC 3224, Berkeley,
CA 94720-3224; and ‡howard hughes Medical institute, Department of Molecular Biology
and Center for Computational and integrative Biology, 7215 simches research Center,
Massachusetts General hospital, Boston, MA 02114-2696.
2

OCR for page 25

2 / Robert M. Hazen et al.
C
omplex emergent systems, in which interactions among numerous
components or ‘‘agents’’ produce patterns or behaviors not obtain-
able by individual components, are ubiquitous at every scale of the
physical universe, for example in neural networks (Deamer and evans,
2006), turbulent fluids (Frisch, 1995), insect colonies (Camazine et al.,
2001), and spiral galaxies (Carlberg, 1992). Complex systems also appear
in a range of artificial symbolic contexts, including genetic algorithms
(Mitchell, 1996), cellular automata (Wolfram, 2002), artificial life (Adami,
1995), and models of market economies (holland, 1995).
life, with its novel collective behaviors at the scale of molecules,
genes, cells, and organisms, is the quintessential emergent complex sys -
tem. Furthermore, the ancient transition from a geochemical world to a
living planet may be modeled as a sequence of emergent events, each of
which increased the chemical complexity of the prebiotic world (De Duve,
1995; Morowitz, 2002; hazen, 2005).
Given this ubiquity and diversity, it is desirable to understand the
characteristics of emergent complex systems, as well as the factors that
might promote complexity in evolving systems. however, complexity has
proven difficult to define or measure with precision (Gell-Mann, 1995;
Adami, 2003; shalizi, 2006). A central objective of this study, therefore,
is to examine ‘‘functional information’’ (szostak, 2003) as a quantitative
measure of complexity that may be applicable to the analysis and predic-
tion of attributes of a wide range of phenomena in physical and symbolic
systems, including evolving biological systems.
An extensive literature explores historical developments and recent
advances in the study of complexity and information (Kåhre, 2002; Gell-
Mann and lloyd, 2003; von Baeyer, 2003; shalizi, 2006) as well as their
application to understanding biological systems (Morowitz, 1978; Bell,
1997; Allen et al., 1998; solé and Goodwin, 2000; Camazine et al., 2001;
Adami, 2003; Avery, 2003; ricard, 2003). Despite this rich literature, pre-
vious discussions of complexity have not generally focused on the rela -
tionship between information content and function (lehman et al., 2000).
We propose to measure the complexity of a system in terms of functional
information, the information required to encode a specific function.
SYSTEMS AND THEIR FUNCTIONS
in this chapter we consider the functional information of both sym -
bolic systems (letter sequences and Avida artificial life genomes) and bio-
polymers (rnA aptamers). These systems share several characteristics:
first, they consist of numerous individual components or ‘‘agents’’; sec -
ond, the agents can combine in a combinatorially large number of differ-
ent configurations; and third, some configurations display functions that

OCR for page 25

Functional Information and the Emergence of Biocomplexity / 2
are not characteristic of the individual agents. Analyses of these systems
address fundamental questions about the relationship between informa -
tion content and function. For example, how much information does it
take to encode a function? Are there multiple distinct solutions? how are
solutions distributed in configuration space? how much more information
does it take to encode a given improvement in function? What environ -
mental factors might influence these relationships?
The function of some emergent systems is obvious: a sequence of let -
ters communicates a specific idea, a computer algorithm performs a spe -
cific computation, and an enzyme catalyzes at least one specific reaction.
less obvious are the functions of systems of many interacting inanimate
particles, such as molecules, sand grains, or stars, but these systems may
also be described quantitatively in terms of function, for example, in terms
of their ability to dissipate energy or to maximize entropy production
through patterning (e.g., Bertalanffy, 1968; nicolis and Prigogine, 1977;
swensen and Turvey, 1991; emanuel, 2006). living systems, by contrast,
typically display multiple essential functions (Allen et al., 1998; Ayala,
1999; Mcshea, 2000). This consideration of complexity in terms of the func -
tion of a system, as opposed to some intrinsic measure of its patterning
or structural intricacy, distinguishes our treatment from many previous
efforts.
QUANTIFYING COMPLEXITY
Development of a quantitative measure of complexity has proven
difficult for at least three reasons, each of which relates to the diversity of
systems that may be labeled ‘‘complex.’’
1. systems may be complex in terms of information content, physical
structure, and/or behavior. Consider three stages in the life cycle of a multi-
cellular organism: a fertilized egg, a live adult, and a postmortem adult.
All three states are complex, but they are complex in different respects.
All three states possess the sequence information (a genome) necessary
to grow a living organism. living and dead adult organisms also display
complex anatomical structures, but only living organisms possess behav-
ioral complexity. Any universal definition of complexity must thus have
the potential to quantify complexity independently in terms of informa -
tion, structure, or behavior.
2. it has been difficult to define complexity in terms of a metric that
applies to all complex systems. no obvious common thread exists in com-
paring the complexity of symbolic systems, such as language, with those
of physical agents, such as cells. Parameters useful in characterizing sym -
bolic systems (e.g., algorithm- or information-based complexity metrics)

OCR for page 25

2 / Robert M. Hazen et al.
generally differ from those used to analyze systems of interacting particles
(e.g., newtonian dynamics or maximum entropy models). Gell-Mann
(1995) concludes, ‘‘A variety of different measures would be required to
capture all our intuitive ideas about what is meant by complexity.’’
3. Complex emergent systems are diverse in terms of their dimension-
ality. sequences of letters, computer code, or bipolymers can be treated as
one-dimensional strings of symbolic information (or as points in a high-
dimensional sequence space). on the other hand, many physical emergent
systems, including those composed of many interacting sand grains, cells,
organisms, or stars, exhibit time-dependent behaviors in two or three
spatial dimensions. it is desirable for a complexity formalism to apply to
this range of dimensionalities.
Despite this diversity, a common thread is present: All complex sys-
tems alter their environments in one or more ways, which we refer to
as functions (Bigelow and Pargetter, 1998). in the words of von Baeyer
(2003), ‘‘information gathering by itself, without observable effects on the
gatherer’s behavior, is a pointless pursuit.’’ Function is thus the essence
of complex systems. Accordingly, we focus on function in our operational
definition of complexity. Therefore, although many previous investigators
have explored aspects of biological systems in terms of information (e.g.,
schneider et al., 1986), we adopt a different approach and explore informa-
tion in terms of the function of a system (including biological systems).
szostak and coworkers (szostak, 2003; Carothers et al., 2004) intro-
duced ‘‘functional information’’ as a measure of complexity. They pro-
posed that the complexity of an information-rich system, such as rnA
aptamers (rnA structures that bind a target molecule), can be quantified
in the context of specific functions of the system, in contrast to prior for-
malisms based on genomic, sequence, or algorithmic information (e.g.,
lenski et al., 1999; Adami, 2003). here we examine applications of this
formalism to letter sequences, the artificial life platform Avida (Adami,
1998), and rnA aptamers.
FUNCTIONAL INFORMATION AS A
MEASURE OF SYSTEM COMPLEXITY
Many emergent systems of interacting agents can be described in
terms of their potential to accomplish one or more quantifiable tasks.
Consider a system that can exist in a combinatorially large number of
different configurations (i.e., a 100-nt rnA strand comprised of four dif-
ferent nucleotides, A, U, G, and C, with 4100 different possible sequences).
Assume that a small fraction of these configurations accomplishes a speci -
fied function x to a high degree (corresponding to a high information

OCR for page 25

Functional Information and the Emergence of Biocomplexity / 2
content). Typically, a significantly greater number of configurations will
prove somewhat less efficient in accomplishing function x (corresponding
to lower information content), whereas the majority of configurations will
display little or no function (lenski et al., 2003; Carothers et al., 2004).
Accordingly, ‘‘degree of function x’’ (Ex) is a measure of a configu-
ration’s ability to perform the function x. For example, in an enzymatic
system Ex might be defined as the increase in a specific reaction rate that
is achieved by the enzyme. in the case of a sequence of letters, Ex might
represent the probability that the sequence conveys a desired message
to a particular recipient. And in a system with water flowing over sand
ripples, Ex might be defined as the rate of energy dissipation by turbu-
lence, compared with flow over a smooth, unpatterned surface. The units
or scale of Ex may be somewhat arbitrary and will depend on the nature
of function x. Thus, for example, catalytic efficiency might be recorded
in terms of rate enhancement or in terms of decreased activation energy
(proportional to the log of the rate enhancement).
in the formalism of szostak (2003; see also Morowitz, 1978, p. 252),
functional information [I(Ex)] is calculated with reference to a specific
degree of function x, designated Ex. Typically, a small fraction, F(Ex), of all
possible configurations of a system achieves at least the specified degree
of function, ≥Ex. Accordingly, we define functional information in terms
of F(Ex):
I(Ex) = −log2[F(Ex)].
Thus, in a system with N possible configurations (e.g., a sequence of n
rnA nucleotides, which has N = 4n discrete possible sequences):
I(Ex) = −log2[M(Ex)/N],
where M(Ex) is the number of different configurations that achieves or
exceeds the specified degree of function x, ≥Ex
in every system, the fraction of configurations, F(Ex), capable of achiev-
ing a specified degree of function will generally decrease with increasing
Ex (szostak, 2003). The largest possible functional information of a system
is exhibited in the case of a single configuration that displays the highest
possible degree of function, Emax:
I(Emax) = −log2[1/N] = log2N,
where I is measured in bits. This maximum functional information is thus
equivalent to the maximum number of bits necessary and sufficient to
specify any particular configuration of the system.

OCR for page 25

0 / Robert M. Hazen et al.
Alternatively, the minimum functional information of a system is zero
for configurations with the lowest degree of function, Emin, because all
possible states have Ex ≥ Emin:
I(Emin) = −log2(N/N) = −log2(1) = 0 bits.
in this formulation, functional information increases with degree of func -
tion, from zero for no function (or minimum function) to a maximum
value corresponding to the number of bits necessary and sufficient to
specify completely any configuration of that system.
Functional information is defined only in the context of a specific
function x. For example, the functional information of a ribozyme may be
greater than zero with respect to its ability to catalyze one specific reaction
but will be zero with respect to many other reactions. Functional informa -
tion therefore depends on both the system and on the specific function
under consideration. Furthermore, if no configuration of a system is able
to accomplish a specific function x [i.e., M(Ex) = 0], then the functional
information corresponding to that function is undefined, no matter how
structurally intricate or information-rich the arrangement of its agents.
it is important to emphasize that functional information, unlike pre -
vious complexity measures, is based on a statistical property of an entire
system of numerous agent configurations (e.g., sequences of letters, rnA
oligonucleotides, or a collection of sand grains) with respect to a specific
function. To quantify the functional information of any given configura -
tion, we need to know both the degree of function of that specific configu -
ration and the distribution of function for all possible configurations in the
system. This distribution must be derived from the statistical properties of
the system as a whole [as opposed, for example, to the statistical proper-
ties of populations evolving in a fitness landscape (Wright, 1942)]. Any
analysis of the functional information of a specific functional sequence or
object, therefore, requires a deep understanding of the system’s agents
and their various interactions.
Three examples (letter sequences, the artificial life platform Avida, and
rnA aptamers) serve to illustrate the concept of functional information.
THE FUNCTIONAL INFORMATION OF LETTER SEQUENCES
systems of many interacting components can occur in a combinato -
rially large number of different configurations. Functional information
depends on the fraction of all possible configurations that achieve at least
a specified degree of function. sequences of letters provide a conceptually
familiar example. Consider various sequences of n letters that convey the
message: ‘‘A fire has just started in a house at the corner of Main street

OCR for page 25

Functional Information and the Emergence of Biocomplexity /
and Maple street.’’ Many different sequences of letters are capable of con-
veying that information. To determine the functional information of any
particular sequence we must specify three parameters:
1. n, the number of letters in the sequence.
2. Ex, the degree of function x of that sequence. in the case of the fire
example cited above, Ex might represent the probability that a local fire
department will understand and respond to the message (a value that
might, in principle, be measured through statistical studies of the responses
of many fire departments). Therefore, Ex is a measure (in this case from 0
to 1) of the effectiveness of the message in invoking a response.
3. M(Ex), the total number of different letter sequences that will achieve
the desired function, in this case, the threshold degree of response, ≥Ex.
The functional information, I(Ex), for a system that achieves a degree
of function, ≥Ex, for sequences of exactly n letters is therefore
I(Ex) = −log2[M(Ex)/26n].
note that 26n is the total number of possible arrangements of a 26-
letter alphabet in a sequence of n letters, and in this treatment we assign
equal probability to all possible sequences. The important more general
case of configurations of unequal probabilities is a straightforward exten -
sion of the treatment of shannon (shannon, 1948; Klir, 2006), as discussed
by Carothers et al. (2004). Greater clarity of expression can be added
through additional characters such as ‘‘space,’’ ‘‘capital,’’ and ‘‘period’’;
however, in this example we use only 26 letters. As in all combinatorially
large emergent systems, most sequences convey no information (i.e., have
no discernable function). Functional information is determined by identi -
fying the fraction of all sequences that achieve a specified outcome.
Consider, for example, sequences of 10 letters that have a high prob -
ability (Ex 1) of evoking a positive response from the fire department.
such sequences might include ‘‘FireonMAin,’’ ‘‘MAinsTFire,’’ or
‘‘MAPlenMAin.’’ Additionally, some messages containing phonetic mis-
spellings (Fyre or MAne), mistakes in grammar or usage (FireoFMAin),
or typing errors (MAZle or nAPle) may also yield a significant but lower
probability of response (0 << Ex < 1). Given these variants, on the order
of 1,000 combinations of 10 letters might initiate a rapid response to the
approximate location of the fire. Thus,
I(1) ≈ −log2[1000/2610] ≈ 36 bits.

OCR for page 25

2 / Robert M. Hazen et al.
numerous additional 10-letter sequences convey some relevant infor-
mation but would result in a lower probability of response (0 < Ex < 1):
‘‘FirehelPMe,’’ ‘‘DAnGerFire,’’ or ‘‘BUrninGnoW.’’ A lower degree
of function, Ex, will generally correspond to a larger number of effective
letter sequences, M(Ex).
The formulation of functional information also applies to systems in
which sequences of varied lengths are combined. For letter sequences of
any length from 1 to n letters,
I(Ex) = −log2{M(Ex)/[∑1 to n (26n)]}.
varying the maximum length, n, of the letter sequence has a significant
effect on the maximum possible degree of function, Ex, as well as the
number of states, M(Ex), that achieve that degree of function. sequences
of 1, 2, or 3 letters are unlikely to convey sufficient information to achieve
any response. With 4 letters, however, a few suggestive configurations
exist (Fire, MAin, or MAPl), although all such sequences possess a high
degree of ambiguity (i.e., Ex << 1).
on the other hand, with longer letter sequences (n >> 10), the number
of messages of a given degree of function increases dramatically, with new
opportunities for explicit instructions (and hence maximum degree of
function, Ex = 1). With a sufficient number of letters, any arbitrary degree
of accuracy and precision in a message can be communicated. note,
however, that arbitrarily long sequences are not necessarily more effec -
tive at conveying information and thus may not increase the functional
information of a system. For example, consider sequences of letters that
begin with the following 22 letters:
FireATMAinsTAnDMAPlesT . . .
such a sequence should invariably summon the fire department, no matter
what or how many additional letters are placed at the end of the sequence.
Thus, for this admittedly contrived fire department scenario, the fraction
of sequences that achieve the desired outcome attains a maximum value
at ≈20 letters. in competitive systems, notably genetic information con-
strained by length-selective pressure in living systems (e.g., Mills et al.,
1967; Andersson and Andersson, 1999; shigenobu et al., 2000; nakabuchi
et al., 2006), longer sequences may prove inefficient and do not necessarily
confer an advantage. (indeed, in the case of reporting a fire, an overly long
and detailed message might delay response time.)
note that in this formulation of functional information the maximum
possible value, I(Emax), arises when a message is so specific that only a
single letter sequence out of all possible letter sequences achieves a desired

OCR for page 25

Functional Information and the Emergence of Biocomplexity /
outcome. in the case of a sequence of n letters, that maximum functional
information occurs when M(Ex) = 1:
I(Emax) = −log2[1/26n] = log226n ≈ 4.7n bits.
Although this conceptual example is qualitative, it introduces key con-
cepts that are required to quantify functional information in any emergent
system with numerous configurations. of special interest is the relation-
ship between information and degree of function. letter sequences point
to the existence of discrete ‘‘classes’’ of functional configurations, based
in this case on the appearance of familiar words (‘‘Fire’’ and ‘‘MAin’’)
as well as their mutations (‘‘Fyre’’ and ‘‘MAne’’). We explore the role
of such multiple classes of solutions in the subsequent sections on Avida
and rnA aptamers.
We conclude that rigorous analysis of the functional information of a
finite system with respect to a specified function x requires knowledge of
two attributes: (i) all possible configurations of the system (e.g., all possible
sequences of a given length in the case of letters or rnA nucleotides) and
(ii) the degree of function x for every configuration.
These two requirements are difficult to meet in many systems. in the
case of letter sequences, for example, the sequence is obvious, but it is diffi-
cult to determine quantitatively the degree of function of many sequences.
By contrast, it is relatively straightforward to determine the degree of
function (for example, the ligand affinity) of any given rnA sequence,
but impossible with present technology to measure all sequences in a
large population, e.g., ≈1014 randomly generated 100-mers as used in some
aptamer evolution studies (although single-molecule methods may ulti -
mately provide a technical solution to this challenge). however, these
concepts may be placed on a firmer footing in the case of computational
systems, such as the artificial life platform Avida.
THE FUNCTIONAL INFORMATION OF AVIDA POPULATIONS
We have adapted the artificial life platform Avida (Adami, 1998;
lenski et al., 2003) to explore the distribution of function in an emer-
gent system. The digital organisms that populate the virtual world of
Avida are ‘‘computer programs that self-replicate, mutate, and adapt by
natural selection’’ (lenski et al., 1999) and as such share many (although
not all) of the attributes ascribed to biological life. Accordingly, artificial
life models have been used as a means of exploring ideas about organic
biology that are not readily amenable to experimentation. here we explore
the functional information of randomly generated populations of Avida
organisms. Understanding the origin and evolution of complex biologi-

OCR for page 25

/ Robert M. Hazen et al.
cal systems motivates this work; however, the first task is to demonstrate
an approach for quantifying the relationship between information and
functional behavior in a well characterized emergent system, whether or
not unambiguous biological insight is immediately revealed.
Avida organisms consist of multiple lines of machine instructions,
termed its ‘‘genome.’’ each organism operates as a formal computer simi-
lar to that outlined by Turing (1936), and the computational properties of
each organism are determined by the sequence of machine instructions
stored in its memory. A population of Avida organisms can be thought
of as a multitude of identical computers running many different simple
programs, where differences between any two members of the population
arise solely from the differences in the programs being run.
This research focuses on the ability of a small fraction of all ran-
domly generated Avida organisms to perform computational tasks that
arise through the coordinated execution of multiple machine instructions
(lenski et al., 2003). none of these computational tasks can be performed
by the execution of a single instruction; indeed, the shortest functional
program requires five instructions. The computational ability (function)
of Avida organisms thus emerges from the interaction of instructions (the
agents), making Avida an ideal model for characterizing complex emer-
gent systems.
in a typical Avida experiment, we generate 107 random instruction
sequences (i.e., 107 different individual genomes), each sequence 100–500
instructions in length, from the default set of 26 different machine instruc-
tions. Although most sequences display no function, a small subset of
sequences code for the ability to compute logic operations (such as ‘‘not’’
or ‘‘and’’) or arithmetic functions (addition and subtraction).
The set of computational tasks Avida organisms can perform allows
for varied solutions, analogous to variations seen in nature. This character-
istic is underscored by the fact that in its evolution apparatus Avida does
not consider how a task is accomplished but only the resulting function,
i.e., whether or not it is executed. The Avida platform does not specify
preferred approaches to problem solving, which allows novel solutions
to appear through evolution. There may be great variety among these
solutions, and they may be very different from those that might have been
arrived at by design (lenski et al., 1999).
MEASURES OF AVIDA FUNCTION
Just as there is no unique measure of function in natural systems, there
is no unique measure of the degree of function in an Avida sequence popu-
lation. We chose to consider three distinct measures of function: (i) the
number of times a sequence is able to compute a specific task, for example,

OCR for page 25

Functional Information and the Emergence of Biocomplexity /
addition or not/and; (ii) the total number of all tasks the sequence is able
to compute, because many sequences can perform multiple distinct opera-
tions; and (iii) the total number of different tasks the sequence is capable
of computing.
each of these measures of function correlates to strategies that bio -
logical organisms employ to increase their fitness. some organisms rely
on the ability to perform one action very well, others rely on the ability to
perform multiple actions moderately well, and still others take advantage
of flexibility, the ability to do many different tasks (Wilson, 1992). how-
ever, unlike with living organisms, quantifying the extent of these traits
in Avida is straightforward and unambiguous. Most of the discussion that
follows, however, focuses on execution of a single task.
Functional sequences constitute a tiny minority of the Avida genome
space. Therefore, to explore fully the distribution of function within a
sequence space, a large number of randomly generated sequences (i.e.,
equal probability) must be surveyed (see Methods). such random explo-
rations of genome space are similar to the strategies used in the directed
evolution of rnA structures (e.g., ellington and szostak, 1990; Wilson
and szostak, 1999). note, however, that this type of random sampling is
not possible with living organisms because the portion of genome space
explored in an evolution experiment will be constrained by the topology
of the underlying fitness landscape and the particular configuration of
the environment maxima (van nimwegen et al., 1999; lehman et al., 2000;
Taverna and Goldstein, 2000; sasaki and nowak, 2003).
AVIDA RESULTS
random sampling of genome space has yielded several interesting
results related to the frequency and distribution of functional configura -
tions. By using Avida’s default set of 26 machine instructions, a randomly
generated sequence with length of a magnitude of ≈102 lines was found
to be functional (i.e., was able to perform at least one logic or arithmetic
operation at least once) with probability P ≈ 10−3. The functional fraction
of a population decreases with decreasing sequence length until it reaches
zero for populations with sequences of a length of four machine instruc-
tions or less.
We observe regular, reproducible structure in the distribution of task
execution frequency, for example, in the number of not/and or addition
operations executed (Ex) versus functional information (Fig. 2.1). This
plot, which illustrates the distribution of function for 107 randomly gen-
erated 300-instruction genomes, is continuous over most values of Ex, for
example, between 2 and 48. however, at several values of Ex, discontinui-
ties appear. At Ex > 73 these discontinuities point to isolated individual

OCR for page 25

/ Robert M. Hazen et al.
FiGUre 2.1 Distribution of the not/and (nAnD) function in 300-line Avida
genomes in a randomly generated sample of 107 genomes. The degree of function,
E, is the number of times nAnD is executed by the genome, whereas functional
information, I (in bits), is −log2 of the fraction of all sequences that achieves at
least that degree of function, F(E). note the discontinuities, which are a recurrent
feature in these experiments.
genomes of high functionality; such outliers always appear, but they may
occur at different values of Ex for repetitions of this experiment. however,
other discontinuities (notably those between 48/49 and 58/59) are robust,
always appearing in experiments on 300-instruction genomes. Thus these
gap-like features reflect an intrinsic behavior of Avida genomes.
We also find that the number and specific location of these gaps, as
well as the maximum values of I(Ex) and Ex, depend on the length of
the sequences being studied (Fig. 2.2). For example, we examined the
number of executions of the addition function for 106 randomly generated
genomes of 100, 200, 300, and 500 instructions. We find that the maximum
number of addition executions, Ex, increases with genome length. We often
observe discrete highly functional genomes, representing outlier solutions,
as well as reproducible gaps. For randomly generated genomes of 100, 200,
300, and 500 instructions, the first significant gap in addition execution
frequency occurs at 19, 39, 59, and 69 executions, respectively.
ISLANDS OF FUNCTION
What is the source of the reproducible discontinuities in Figs. 2.1 and
2.2? We suggest that the population of random Avida sequences contains

OCR for page 25

Functional Information and the Emergence of Biocomplexity /
FiGUre 2.2 The frequency of the ADD function in 100-, 200-, 300-, and 500-line
linear Avida genomes in randomly generated samples of 106 genomes. Degree of
function, E, is the number of times the ADD function is executed by the genome,
whereas functional information, I (in bits), is −log2 of the fraction of all sequences
that achieves at least that degree of function, F(E). note that maximum E increases
with genome length.
multiple distinct classes of solutions, perhaps with conserved sequences
of machine instructions similar to those of words in letter sequences or
active rnA motifs (Knight and yarus, 2003). each class has a maximum
possible degree of function; therefore, the discontinuities occur at degrees
of function below which a major class of sequences is represented and
above which it is not represented.
Fig. 2.3 demonstrates one possible model for this stepped behavior,
based on discrete ‘‘islands’’ of solutions. in Fig. 2.3, the islands, each
of which represents a specific distinct set of solutions to the function
[i.e., fitness (z axis)], are conceptually represented as being close to each
other in sequence space (projected on the x–y plane). note, however,
that these islands are a visual simplification. For example, in the case of
rnA sequences, any given ‘‘island’’ of closely related functional solu -
tions may be more realistically represented by a densely interconnected
network that spans all of sequence space (huynen et al., 1996; lehman et
al., 2000; reidys et al., 2001). similar consideration of function topologies
has been applied to neural network connections (ebner et al., 2002) and
to viroid solutions infecting the same plant host (Codoñer et al., 2006).
Avida may be similar, because the commands relevant to a given solu -
tion do not necessarily need to appear sequentially at a specific location

OCR for page 25

/ Robert M. Hazen et al.
FiGUre 2.3 schematic representation of four discrete functional classes, or
‘‘islands,’’ of solutions that display function. The vertical axis is degree of func -
tion, E, whereas the horizontal plane represents a two-dimensional projection in
sequence space. The number of sequences with degree of function ≥ E corresponds
to the area intersected by the horizontal plane at that height along the E axis.
increasing E above the heights of the flat-topped islands A and B will result in dis -
continuities in the function E versus I, as illustrated in Figs. 2.1 and 2.2. island C is
a cone-shaped distribution, and island D represents a discrete solution of the type
that might not be discovered in random sampling experiments.
in the string but can occur in different registers and can be spread apart
by neutral commands.
Consider a case where four classes of solutions for the same function,
labeled A–D, occur in the population (Fig. 2.3). each class may contain
a normal distribution of degrees of function, but each has a different
topology in sequence space and a different maximum degree of function,
Ex. For relatively low values of Ex, all four islands contribute functional
sequences. As the value of Ex increases to just above the heights of flat-
topped islands A and then B, discontinuities in the plot of Ex versus I(Ex)
will occur (i.e., in Fig. 2.1 the height corresponding to island A would
be Ex = 48 and the height of island B would be Ex = 58). This model also
matches the observation that the continuous stretches of Ex versus I(Ex)
are longest for populations of long sequences: longer sequences allow for
a greater number of distinct solutions whose superposition would serve
to drown out individual discontinuities.

OCR for page 25

Functional Information and the Emergence of Biocomplexity /
This model for generating discontinuities is plausible because mul -
tiple distinct solutions may exist in sequence space for a given task.
For example, the shortest possible sequence (‘‘gene’’) for accomplishing
subtraction is five lines long (lenski et al., 2003). however, an alternative
unrelated subtraction gene 10 lines long can be constructed within the
Avida language using two’s-complement arithmetic (Zarowski, 2004). This
second class of solutions reinforces the concept of ‘‘islands’’ of function in
sequence space, where two or more types of solutions exist that achieve
the same task but do so in an unrelated fashion.
We note, by contrast, that purely random statistical functions do
not display steps. For example, if the degree of function is defined as
the frequency of the appearance of the number ‘‘1’’ in randomly gener-
ated sequences of 100 digits, then functional information follows a well
behaved smooth curve (Fig. 2.4). Maximum functional information arises
for the solitary state with 100 consecutive 1s, whereas an obvious uni-
form distribution follows for lesser degrees of function. This statistically
random case is not stepped. By comparison, the structures depicted in
Figs. 2.1 and 2.2 suggest that the tasks being considered as functions are
neither trivial, nor are they achieved by essentially arbitrary or random,
albeit rare, configurations of the system. The interactions in the Avida
system, and perhaps many other complex systems, lead to distributions
of function that prove far richer than in systems possessing statistically
FiGUre 2.4 I(E) versus E for the statistically random system, where E is the
number of times the digit 1 appears at least that many times in a sequence of
100 digits. This statistically random case is not stepped, in contrast to the topology
of Avida genomes.

OCR for page 25

0 / Robert M. Hazen et al.
trivial function. it remains to be seen, however, whether the observed
stepped relationship between I(Ex) and Ex is a general feature of functional
information or an idiosyncratic characteristic of Avida genomes.
FUNCTIONAL INFORMATION AND RNA POLYMERS
The previous two examples, sequences of letters and Avida machine
commands, illustrate the utility of the functional information formalism
in characterizing the properties of symbolic systems that can occur in
combinatorially large numbers of configurations. Functional information
also has applicability to complex biological and biochemical systems;
indeed, it was originally developed (szostak, 2003; Carothers et al., 2004)
to analyze aptamers (rnA structures that bind target ligands) and ribo-
zymes (rnA structures that catalyze specific reactions). Thus, the degree
of function, Ex, of these linear sequences of rnA letters (A, C, G, and
U) can be defined quantitatively as the binding energy to a particular
molecule or the catalytic increase in a specific reaction rate. We can easily
specify every possible rnA sequence of length n, and we can (at least in
principle) synthesize rnA strands and measure the degree of function of
any given sequence. The behavior of aptamers and ribozymes thus lends
itself to the type of quantitative analysis that we applied previously to
letter sequences and Avida populations (Carothers et al., 2004).
in general, a single rnA nucleotide will display minimal catalytic
or binding function, xmin. it follows that a minimum sequence length
(nmin nucleotides) will be required to achieve any significant degree of
ribozyme or aptamer function, Ex > Emin. increasing the number of nucleo-
tides (n > nmin) will generally lead to many more functional sequences,
some of which will have a greater degree of function. Furthermore, for any
given catalytic or binding function there exists an optimal rnA sequence
of length nopt that attains the maximum possible degree of function,
Emax. That sequence thus possesses the maximum possible functional
information:
Imax(Emax) = −log2{1/[∑1–n (4n)]}.
opt
For degrees of function less than the maximum (Ex < Emax), an intermediate
functional information obtains [I(Ex) < Imax(Emax)].
The in vitro evolution of rnA aptamers (e.g., ellington and szostak,
1990; Wilson and szostak, 1999) provides a dramatic illustration of the evo-
lution and selection of systems with high functional complexity. Aptamer
evolution experiments begin with large populations (up to 1016 randomly
generated rnA sequences), which are subjected to a selective environment,
a test tube coated with a target molecule, for example. A small fraction of

OCR for page 25

Functional Information and the Emergence of Biocomplexity /
the random rnA population will selectively bind to the target molecules.
Those rnA strands are recovered, amplified with mutations (through
reverse transcription, PCr, and transcription), and the process is repeated
several times. each cycle yields a more restricted rnA population with
improved binding specificity (i.e., a higher degree of function, Ex).
Carothers et al. (2004), who analyzed the distribution of functional
rnA aptamers in a random population, provide data on a specific exam-
ple. They identify 11 distinct classes of GTP-binding rnAs, which are
distinguished from each other both by nucleotide sequences (rnA motifs)
(Knight and yarus, 2003) and secondary stem–loop structures. The degree
of function of these aptamers can be defined by a solution dissociation
constant, a measure of the binding strength between GTP and the folded
aptamer. Carothers and coworkers find that a 10-fold increase in GTP
binding strength requires ≈10 additional bits of information (i.e., a 1,000-
fold decrease in abundance in a population of random sequences). such
a finding is in accord with studies of biopolymers (Aronson et al., 1994;
Wang and Unrau, 2005) that show functionally similar peptides with dis-
similar primary structures, as well as reports of many distinct classes of
protease enzymes (rawlings and Barrett, 1993; rawlings et al., 2006).
Furthermore, although the data of Carothers et al. (2004) are too few to
draw definitive conclusions, there is a suggestion of a stepped relationship
between binding strength (Ex) and functional information (I), a relation-
ship analogous to that displayed by populations of Avida organisms (e.g.,
Fig. 2.1). These steps, if real, are likely caused by the existence of separate
classes of GTP-binding solutions. Functional classes with greater numbers
of stems represent a significantly smaller fraction of all rnA sequences,
but they have the potential to display greater GTP-binding affinities.
FUNCTIONAL INFORMATION IN
HIGHER-DIMENSIONAL SYSTEMS
Functional information provides a measure of complexity by quantify-
ing the probability that an arbitrary configuration of a system of numerous
interacting agents (and hence a combinatorially large number of different
configurations) will achieve a specified degree of function. This concept
was originally discussed in the context of biopolymer sequences that
perform specific binding or catalytic functions (szostak, 2003; Carothers
et al., 2004). in the preceding sections we demonstrated that the extension
of functional information analysis to one-dimensional systems of letters
or Avida computer code is conceptually straightforward, requiring only
specification of the degree of function of each possible sequence.
We suggest that the functional information formalism may also be
applicable to complex physical structures in higher-dimensional systems.

OCR for page 25

2 / Robert M. Hazen et al.
of special interest in this regard are biological systems that display com -
plex emergent behavior, for example, through long-range chemical sig -
naling among a collection of cells in social amoebas (Goldbeter, 1996;
Brännström and Dieckmann, 2005; schaap et al., 2006), cooperation among
consortia of host organisms and symbionts (Moran, 2007, Chapter 9, this
volume), or colonies of social insects (solé and Goodwin, 2000; Camazine
et al., 2001; strassmann and Queller, 2007, Chapter 8, this volume). We
propose that functional information can be applied, at least in principle,
to any such emergent system that has the ability to perform a function.
Many emergent systems can be analyzed in terms of their ability to
dissipate energy or maximize entropy production (nicolis and Progogine,
1977; lorenz, 2003; Whitfield, 2005; emanuel, 2006). For example, consider
the functional information of an assemblage of sand grains subjected to a
steady flow of wind or water (e.g., Bagnold, 1988; hansen et al., 2001). The
formation of periodic sand dunes or ripples serves to initiate turbulent
flow and thus increase energy dissipation. Functional information of the
system can thus be measured as the fraction of all possible sand configu-
rations, F(Ex), that achieve at least the corresponding energy dissipation,
Ex. such a problem might be analyzed with Monte Carlo simulations
of numerous gravitationally stable sand configurations. The analytical
challenge remains to determine the degree of function of a statistically
significant random fraction of all possible configurations of the system so
that the relationship between I(Ex) and Ex can be deduced.
CONCLUSIONS
A complexity metric is of little utility unless its conceptual framework
and predictive power result in a deeper understanding of the behavior
of complex systems. Analysis of complex systems in terms of functional
information reveals several characteristics that are important in under-
standing the behavior of systems composed of many interacting agents.
letter sequences, Avida genomes and biopolymers all display degrees of
functions that are not attainable with individual agents (a single letter,
machine instruction, or rnA nucleotide, respectively). in all three cases,
highly functional configurations comprise only a small fraction of all pos -
sible sequences. Furthermore, these three examples reveal that several
discrete classes of functional configurations exist, a situation that can lead
to distinctive step features in plots of information versus function.
The functional information formalism may also point to key factors
in the origin and emergence of biocomplexity. in particular, functional
information quantifies the probability that, for a particular system, a con -
figuration with a specified degree of function will emerge. Furthermore,
analysis of the relationship between information and function may reveal

OCR for page 25

Functional Information and the Emergence of Biocomplexity /
how much more information is required to encode a given improvement
in function. The formalism also points to strategies, such as increasing the
concentration and/or diversity of molecular agents, that might maximize
the effectiveness of chemical experiments that attempt to replicate steps
in the origin of life.
METHODS
Determination of the computational properties of a randomly gener-
ated instruction sequence is accomplished within Avida’s analyze mode.
The trace feature in analyze mode generates detailed information on the
state of the virtual computer at each step in the processing of a genome,
including a notation of when a recognized function has been executed.
An automated script parsed these logs to collect all of the data necessary
to determine the functional properties of each sequence and cataloged the
genomes found to be functional to permit later study. Detailed documen-
tation of the Avida software, including descriptions of the trace function
and analyze mode, can be found online at the Digital evolution labo-
ratory at Michigan state University web site (http://devolab.cse.msu.
edu/software/avida/doc).
ACKNOWLEDGMENTS
We thank John Avise and Francisco Ayala for organizing this sackler
Colloquium; h. J. Cleaves, K. esler, r. lenski, h. Morowitz, C. ofria,
and D. sverjensky for valuable comments and suggestions. This work
was supported in part by the national Aeronautics and space Admin-
istration Astrobiology institute, the national science Foundation, and
the Carnegie institution. J.W.s. is an investigator of the howard hughes
Medical institute.

OCR for page 25