
Evolution of biological complexity

Christoph Adami*, Charles Ofria, and Travis C. Collier


*Kellogg Radiation Laboratory 106-38 and Beckman Institute 139-74, California Institute of Technology, Pasadena, CA 91125; and Division of Organismic
Biology, Ecology, and Evolution, University of California, Los Angeles, CA 90095

Edited by James F. Crow, University of Wisconsin, Madison, WI, and approved February 15, 2000 (received for review December 22, 1999)

To make a case for or against a trend in the evolution of complexity in biological evolution, complexity needs to be both rigorously defined and measurable. A recent information-theoretic (but intuitively evident) definition identifies genomic complexity with the amount of information a sequence stores about its environment. We investigate the evolution of genomic complexity in populations of digital organisms and monitor in detail the evolutionary transitions that increase complexity. We show that, because natural selection forces genomes to behave as a natural "Maxwell Demon," within a fixed environment, genomic complexity is forced to increase.

Darwinian evolution is a simple yet powerful process that requires only a population of reproducing organisms in which each offspring has the potential for a heritable variation from its parent. This principle governs evolution in the natural world, and has gracefully produced organisms of vast complexity. Still, whether or not complexity increases through evolution has become a contentious issue. Gould (1), for example, argues that any recognizable trend can be explained by the "drunkard's walk" model, where progress is due simply to a fixed boundary condition. McShea (2) investigates trends in the evolution of certain types of structural and functional complexity, and finds some evidence of a trend but nothing conclusive. In fact, he concludes that "something may be increasing. But is it complexity?" Bennett (3), on the other hand, resolves the issue by fiat, defining complexity as that which increases when self-organizing systems organize themselves. Of course, to address this issue, complexity needs to be both defined and measurable.

In this paper, we skirt the issue of structural and functional complexity by examining genomic complexity. It is tempting to believe that genomic complexity is mirrored in functional complexity and vice versa. Such an hypothesis, however, hinges upon both the aforementioned ambiguous definition of complexity and the obvious difficulty of matching genes with function. Several developments allow us to bring a new perspective to this old problem. On the one hand, genomic complexity can be defined in a consistent information-theoretic manner [the "physical complexity" (4)], which appears to encompass intuitive notions of complexity used in the analysis of genomic structure and organization (5). On the other hand, it has been shown that evolution can be observed in an artificial medium (6, 7), providing a unique glimpse at universal aspects of the evolutionary process in a computational world. In this system, the symbolic sequences subject to evolution are computer programs that have the ability to self-replicate via the execution of their own code. In this respect, they are computational analogs of catalytically active RNA sequences that serve as the templates of their own reproduction. In populations of such sequences that adapt to their world (inside of a computer's memory), noisy self-replication coupled with finite resources and an information-rich environment leads to a growth in sequence length as the digital organisms incorporate more and more information about their environment into their genome. Evolution in an information-poor landscape, on the contrary, leads to selection for replication only, and a shrinking genome size, as in the experiments of Spiegelman and colleagues (8). These populations allow us to observe the growth of physical complexity explicitly, and also to distinguish distinct evolutionary pressures acting on the genome and analyze them in a mathematical framework.

If an organism's complexity is a reflection of the physical complexity of its genome (as we assume here), the latter is of prime importance in evolutionary theory. Physical complexity, roughly speaking, reflects the number of base pairs in a sequence that are functional. As is well known, equating genomic complexity with genome length in base pairs gives rise to a conundrum (known as the C-value paradox) because large variations in genomic complexity (in particular in eukaryotes) seem to bear little relation to the differences in organismic complexity (9). The C-value paradox is partly resolved by recognizing that not all of DNA is functional: that there is a neutral fraction that can vary from species to species. If we were able to monitor the non-neutral fraction, it is likely that a significant increase in this fraction could be observed throughout at least the early course of evolution. For the later period, in particular the later Phanerozoic Era, it is unlikely that the growth in complexity of genomes is due solely to innovations in which genes with novel functions arise de novo. Indeed, most of the enzyme activity classes in mammals, for example, are already present in prokaryotes (10). Rather, gene duplication events leading to repetitive DNA and subsequent diversification (11), as well as the evolution of gene regulation patterns, appears to be a more likely scenario for this stage. Still, we believe that the Maxwell Demon mechanism described below is at work during all phases of evolution and provides the driving force toward ever increasing complexity in the natural world.

Information Theory and Complexity. Using information theory to understand evolution and the information content of the sequences it gives rise to is not a new undertaking. Unfortunately, many of the earlier attempts (e.g., refs. 12-14) confuse the picture more than clarifying it, often clouded by misguided notions of the concept of information (15). An (at times amusing) attempt to make sense of these misunderstandings is ref. 16.

Perhaps a key aspect of information theory is that information cannot exist in a vacuum; that is, information is physical (17). This statement implies that information must have an instantiation (be it ink on paper, bits in a computer's memory, or even the neurons in a brain). Furthermore, it also implies that information must be about something. Lines on a piece of paper, for example, are not inherently information until it is discovered that they correspond to something, such as (in the case of a map) to the relative location of local streets and buildings. Consequently, any arrangement of symbols might be viewed as potential information (also known as entropy in information theory), but acquires the status of information only when its correspondence, or correlation, to other physical objects is revealed.

This paper was submitted directly (Track II) to the PNAS office.
To whom reprint requests should be addressed. E-mail: adami@krl.caltech.edu.
Present address: Center for Microbial Ecology, Michigan State University, Lansing, MI 48824.


In biological systems the instantiation of information is DNA, but what is this information about? To some extent, it is the blueprint of an organism and thus information about its own structure. More specifically, it is a blueprint of how to build an organism that can best survive in its native environment, and pass on that information to its progeny. This view corresponds essentially to Dawkins' view of "selfish" genes that use their environment (including the organism itself) for their own replication (18). Thus, those parts of the genome that do correspond to something (the non-neutral fraction, that is) correspond in fact to the environment the genome lives in. Deutsch (19) referred to this view by saying that genes "embody knowledge about their niches." This environment is extremely complex itself, and consists of the ribosomes the messages are translated in, other chemicals and the abundance of nutrients inside and outside the cell, and the environment of the organism proper (e.g., the oxygen abundance in the air as well as ambient temperatures), among many others. An organism's DNA thus is not only a "book" about the organism, but is also a book about the environment it lives in, including the species it co-evolves with. It is well known that not all of the symbols in an organism's DNA correspond to something. These sections, sometimes referred to as "junk-DNA," usually consist of portions of the code that are unexpressed or untranslated (i.e., excised from the mRNA). More modern views concede that unexpressed and untranslated regions in the genome can have a multitude of uses, such as for example satellite DNA near the centromere, or the polyC polymerase intron excised from Tetrahymena rRNA. In the absence of a complete map of the function of each and every base pair in the genome, how can we then decide which stretch of code is "about something" (and thus contributes to the complexity of the code) or else is entropy (i.e., random code without function)?

A true test for whether a sequence is information uses the success (fitness) of its bearer in its environment, which implies that a sequence's information content is conditional on the environment it is to be interpreted within (4). Accordingly, Mycoplasma mycoides, for example (which causes pneumonia-like respiratory illnesses), has a complexity of somewhat less than one million base pairs in our nasal passages, but close to zero complexity most everywhere else, because it cannot survive in any other environment, meaning its genome does not correspond to anything there. A genetic locus that codes for information essential to an organism's survival will be fixed in an adapting population, because all mutations of the locus result in the organism's inability to promulgate the tainted genome, whereas inconsequential (neutral) sites will be randomized by the constant mutational load. Examining an ensemble of sequences large enough to obtain statistically significant substitution probabilities would thus be sufficient to separate information from entropy in genetic codes. The neutral sections that contribute only to the entropy turn out to be exceedingly important for evolution to proceed, as has been pointed out, for example, by Maynard Smith (20).

In Shannon's information theory (22), the quantity entropy (H) represents the expected number of bits required to specify the state of a physical object given a distribution of probabilities; that is, it measures how much information can potentially be stored in it.

In a genome, for a site i that can take on four nucleotides with probabilities

p_C(i), p_G(i), p_A(i), p_T(i),   [1]

the entropy of this site is

H_i = -\sum_{j=C,G,A,T} p_j(i) \log p_j(i).   [2]

The maximal entropy per site (if we agree to take our logarithms to base 4: i.e., the size of the alphabet) is 1, which occurs if all of the probabilities are equal to 1/4. If the entropy is measured in bits (take logarithms to base 2), the maximal entropy per site is two bits, which naturally is also the maximal amount of information that can be stored in a site, as entropy is just potential information. A site stores maximal information if, in DNA, it is perfectly conserved across an equilibrated ensemble. Then, we assign the probability p = 1 to one of the bases and zero to all others, rendering H_i = 0 for that site according to Eq. 2. The amount of information per site is thus (see, e.g., ref. 23)

I(i) = H_max - H_i.   [3]

In the following, we measure the complexity of an organism's sequence by applying Eq. 3 to each site and summing over the sites. Thus, for an organism of \ell base pairs the complexity is

C = \ell - \sum_i H_i.   [4]

It should be clear that this value can only be an approximation to the true physical complexity of an organism's genome. In reality, sites are not independent and the probability to find a certain base at one position may be conditional on the probability to find another base at another position. Such correlations between sites are called epistatic, and they can render the entropy per molecule significantly different from the sum of the per-site entropies (4). This entropy per molecule, which takes into account all epistatic correlations between sites, is defined as

H = -\sum_g p(g|E) \log p(g|E)   [5]

and involves an average over the logarithm of the conditional probabilities p(g|E) to find genotype g given the current environment E. In every finite population, estimating p(g|E) using the actual frequencies of the genotypes in the population (if those could be obtained) results in corrections to Eq. 5 larger than the quantity itself (24), rendering the estimate useless. Another avenue for estimating the entropy per molecule is the creation of mutational clones at several positions at the same time (7, 25) to measure epistatic effects. The latter approach is feasible within experiments with simple ecosystems of digital organisms that we introduce in the following section, which reveal significant epistatic effects. The technical details of the complexity calculation including these effects are relegated to the Appendix.
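To make Eqs. 2-4 concrete, the following Python sketch (ours, not part of the original study) estimates per-site entropies from an aligned sample of equally long sequences and subtracts their sum from the sequence length; treating such a sample as the equilibrated ensemble of the text is an assumption of the example.

    from collections import Counter
    from math import log

    def per_site_entropy(column, alphabet_size=4):
        """Entropy of one site (Eq. 2), with logarithms taken to the alphabet size."""
        counts = Counter(column)
        total = sum(counts.values())
        return -sum((c / total) * log(c / total, alphabet_size) for c in counts.values())

    def physical_complexity(sequences, alphabet_size=4):
        """Approximate physical complexity C = L - sum_i H_i (Eq. 4)
        of an aligned sample of equal-length sequences."""
        length = len(sequences[0])
        return length - sum(per_site_entropy(col, alphabet_size)
                            for col in zip(*sequences))

    # Toy example: a fully conserved site contributes ~1 to C, a fully random site ~0.
    print(physical_complexity(["ACGTAA", "ACGTAC", "ACGAAG", "ACGCAT"]))

With the alphabet size set to the number of possible symbols, the same construction applies to any aligned ensemble, digital or biochemical.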
Digital Evolution. Experiments in evolution have traditionally been formidable because of evolution's gradual pace in the natural world. One successful method uses microscopic organisms with generational times on the order of hours, but even this approach has difficulties; it is still impossible to perform measurements with high precision, and the time-scale to see significant adaptation remains weeks, at best. Populations of Escherichia coli introduced into new environments begin adaptation immediately, with significant results apparent in a few weeks (26, 27). Observable evolution in most organisms occurs on time scales of at least years.


Fig. 1. Typical Avida organisms, extracted at 2,991 (A) and 3,194 (B) generations, respectively, into an evolutionary experiment. Each site is color-coded according to the entropy of that site (see color bar). Red sites are highly variable whereas blue sites are conserved. The organisms have been extracted just before and after a major evolutionary transition.

To complement such an approach, we have developed a tool to study evolution in a computational medium: the Avida platform (6). The Avida system hosts populations of self-replicating computer programs in a complex and noisy environment, within a computer's memory. The evolution of these digital organisms is limited in speed only by the computers used, with generations (for populations of the order 10^3-10^4 programs) in a typical trial taking only a few seconds. Despite the apparent simplicity of the single-niche environment and the limited interactions between digital organisms, very rich dynamics can be observed in experiments with 3,600 organisms on a 60 × 60 grid with toroidal boundary conditions (see Methods). As this population is quite small, we can assume that an equilibrium population will be dominated by organisms of a single species, whose members all have similar functionality and equivalent fitness (except for organisms that lost the capability to self-replicate due to mutation). In this world, a new species can obtain a significant abundance only if it has a competitive advantage (increased Malthusian parameter) thanks to a beneficial mutation. While the system returns to equilibrium after the innovation, this new species will gradually exert dominance over the population, bringing the previously dominant species to extinction. This dynamics of innovation and extinction can be monitored in detail and appears to mirror the dynamics of E. coli in single-niche long-term evolution experiments (28).

For these asexual organisms, the species concept is only loosely defined as programs that differ in genotype but only marginally in function.

The complexity of an adapted digital organism according to Eq. 4 can be obtained by measuring substitution frequencies at each instruction across the population. Such a measurement is easiest if genome size is constrained to be constant, as is done in the experiments reported below, although this constraint can be relaxed by implementing a suitable alignment procedure. To correctly assess the information content of the ensemble of sequences, we need to obtain the substitution probabilities p_i at each position, which go into the calculation of the per-site entropy of Eq. 2. Care must be taken to wait sufficiently long after an innovation, to give those sites within a new species that are variable a chance to diverge. Indeed, shortly after an innovation, previously 100% variable sites will appear fixed by hitchhiking on the successful genotype, a phenomenon discussed further below.

We simplify the problem of obtaining substitution probabilities for each instruction by assuming that all mutations are either lethal, neutral, or positive, and furthermore assume that all non-lethal substitutions persist with equal probability. We then categorize every possible mutation directly by creating all single-mutation genomes and examining them independently in isolation. In that case, Eq. 2 reduces to

H_i = \log_{28} N,   [6]

where N is the number of non-lethal substitutions (we count mutations that significantly reduce the fitness among the lethals). Note that the logarithm is taken with respect to the size of the alphabet.
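A minimal sketch of this single-substitution scan is shown below; it follows the assumption just stated (every substitution is either lethal or persists with equal probability), while the instruction names and the fitness callable are placeholders of our own, not the Avida API.

    from math import log

    ALPHABET = [f"inst{i:02d}" for i in range(28)]  # placeholder names for the 28 instructions

    def site_entropies(genome, fitness, lethal_fraction=0.0):
        """Per-site entropies via Eq. 6: H_i = log_28(N_i), where N_i counts the
        substitutions at site i (wild type included) that are not lethal.
        `genome` is a list of instruction names; substitutions whose fitness falls
        to or below lethal_fraction of the wild-type fitness count as lethal."""
        wild_fitness = fitness(genome)
        entropies = []
        for i in range(len(genome)):
            non_lethal = 0
            for instruction in ALPHABET:
                mutant = genome[:i] + [instruction] + genome[i + 1:]
                if fitness(mutant) > lethal_fraction * wild_fitness:
                    non_lethal += 1
            entropies.append(log(non_lethal, len(ALPHABET)))
        return entropies

Summing 1 - H_i over the sites then reproduces the complexity estimate of Eq. 4 for the isolated wild type.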
This per-site entropy is used to illustrate the variability of loci in a genome, just before and after an evolutionary transition, in Fig. 1.

Progression of Complexity. Tracking the entropy of each site in the genome allows us to document the growth of complexity in an evolutionary event. For example, it is possible to measure the difference in complexity between the pair of genomes in Fig. 1, separated by only 203 generations and a powerful evolutionary transition. Comparing their entropy maps, we can immediately identify the sections of the genome that code for the new "gene" that emerged in the transition: the entropy at those sites has been drastically reduced, while the complexity increase across the transition (taking into account epistatic effects) turns out to be about 6 instructions, as calculated in the Appendix.

We can extend this analysis by continually surveying the entropies of each site during the course of an experiment. Fig. 2 does this for the experiment just discussed, but this time the substitution probabilities are obtained by sampling the actual population at each site. A number of features are apparent in this figure. First, the trend toward a "cooling" of the genome (i.e., to more conserved sites) is obvious. Second, evolutionary transitions can be identified by vertical darkened bands, which arise because the genome instigating the transition replicates faster than its competitors, thus driving them into extinction. As a consequence, even random sites that are hitchhiking on the successful gene are momentarily fixed.


Fig. 2. Progression of per-site entropy for all 100 sites throughout an Avida experiment, with time measured in updates (see Methods). A generation corresponds to between 5 and 10 updates, depending on the gestation time of the organism.

Fig. 3. (A) Total entropy per program as a function of evolutionary time. (B) Fitness of the most abundant genotype as a function of time. Evolutionary transitions are identified with short periods in which the entropy drops sharply, and fitness jumps. Vertical dashed lines indicate the moments at which the genomes in Fig. 1 A and B were dominant.

Fig. 4. Complexity as a function of time, calculated according to Eq. 4. Vertical dashed lines are as in Fig. 3.

Hitchhiking is documented clearly by plotting the sum of per-site entropies for the population (as an approximation for the entropy of the genome)

H = \sum_{i=1}^{\ell} H_i   [7]

across the transition in Fig. 3A. By comparing this to the fitness shown in Fig. 3B, we can identify a sharp drop in entropy followed by a slower recovery for each adaptive event that the population undergoes. Often, the population does not reach equilibrium (the state of maximum entropy given the current conditions) before the next transition occurs.

While this entropy is not a perfect approximation of the exact entropy per program in Eq. 5, it reflects the disorder in the population as a function of time. This complexity estimate (4) is shown as a function of evolutionary time for this experiment in Fig. 4. It increases monotonically except for the periods just after transitions, when the complexity estimate (after overshooting the equilibrium value) settles down according to thermodynamics' second law (see below). This overshooting of stable complexity is a result of the overestimate of complexity during the transition due to the hitchhiking effect mentioned earlier. Its effect is also seen at the beginning of evolution, where the population is seeded with a single genome with no variation present.

Such a typical evolutionary history documents that the physical complexity, measuring the amount of information coded in the sequence about its environment, indeed steadily increases. The circumstances under which this is assured to happen are discussed presently.

Maxwell's Demon and the Law of Increasing Complexity. Let us consider an evolutionary transition like the one connecting the genomes in Fig. 1 in more detail. In this transition, the entropy (cf. Fig. 3A) does not fully recover after its initial drop. The difference between the equilibrium level before the transition and after is proportional to the information acquired in the transition, roughly the number of sites that were frozen. This difference would be equal to the acquired information if the measured entropy in Eq. 7 were equal to the exact one given by Eq. 5. For this particular situation, in which the sequence length is fixed along with the environment, is it possible that the complexity decreases? The answer is that in a sufficiently large population this cannot happen [in smaller populations, there is a finite probability of all organisms being mutated simultaneously, referred to as Muller's ratchet (29)], as a consequence of a simple application of the second law of thermodynamics. If we assume that a population is at equilibrium in a fixed environment, each locus has achieved its highest entropy given all of the other sites. Then, with genome length fixed, the entropy can only stay constant or decrease, implying that the complexity (being sequence length minus entropy) can only increase. How is a drop in entropy commensurate with the second law? This answer is simple also: The second law holds only for equilibrium systems, while such a transition is decidedly not of the equilibrium type. In fact, each such transition is best described as a measurement, and evolution as a series of random measurements on the environment. Darwinian selection is a filter, allowing only informative measurements (those increasing the ability for an organism to survive) to be preserved. In other words, information cannot be lost in such an event because a mutation corrupting the information is purged due to the corrupted genome's inferior fitness (this holds strictly for asexual populations only). Conversely, a mutation that corrupts the information cannot increase the fitness, because if it did then the population was not at equilibrium in the first place. As a consequence, only mutations that reduce the entropy are kept while mutations that increase it are purged. Because the mutations can be viewed as measurements, this is the classical behavior of the Maxwell Demon.
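In compact form (our notation, restating the argument above rather than adding to it): with the environment and the sequence length \ell held fixed,

C(t) = \ell - H(t), \qquad \Delta H \le 0 \;\Rightarrow\; \Delta C = -\Delta H \ge 0,

because a population at equilibrium already carries the largest entropy compatible with survival, and selection removes the mutations that would raise it.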
What about changes in sequence length? In an unchanging environment, an increase or decrease in sequence length is always associated with an increase or decrease in the entropy, and such changes therefore always cancel from the physical complexity, as it is defined as the difference. Note, however, that while size-increasing events do not increase the organism's physical complexity, they are critical to continued evolution, as they provide new space ("blank tape") to record environmental information within the genome, and thus to allow complexity to march ever forward.


Methods. For all work presented here, we use a single-niche environment in which resources are isotropically distributed and unlimited, except for central processing unit (CPU) time, the primary resource for this life-form. This limitation is imposed by constraining the average slice of CPU time executed by any genome per update to be a constant (here 30 instructions). Thus, per update, a population of n genomes executes 30n instructions. The unlimited resources are numbers that the programs can retrieve from the environment with the right genetic code. Computations on these numbers allow the organisms to execute significantly larger slices of CPU time, at the expense of inferior ones (see refs. 6 and 8).

A normal Avida organism is a single genome (program) composed of a sequence of instructions that are processed as commands to the CPU of a virtual computer. In standard Avida experiments, an organism's genome has one of 28 possible instructions at each line. The set of instructions (alphabet) from which an organism draws its code is selected to avoid biasing evolution toward any particular type of program or environment. Still, evolutionary experiments will always show a distinct dependence on the ancestor used to initiate experiments, and on the elements of chance and history. To minimize these effects, trials are repeated to gain statistical significance, another crucial advantage of experiments in artificial evolution. In the present experiments, we have chosen to keep sequence length fixed at 100 instructions, by creating a self-replicating ancestor containing mostly non-sense code, from which all populations are spawned. Mutations appear during the copy process, which is flawed with a probability of error per instruction copied of 0.01. For more details on Avida, see ref. 30.

Conclusions. Trends in the evolution of complexity are difficult to argue for or against if there is no agreement on how to measure complexity. We have proposed here to identify the complexity of genomes by the amount of information they encode about the world in which they have evolved, a quantity known as physical complexity that, while it can be measured only approximately, allows quantitative statements to be made about the evolution of genomic complexity. In particular, we show that, in fixed environments, for organisms whose fitness depends only on their own sequence information, physical complexity must always increase.

That a genome's physical complexity must be reflected in the structural complexity of the organism that harbors it seems to us inevitable, as the purpose of a physically complex genome is complex information processing, which can only be achieved by the computer which it (the genome) creates.

That the mechanism of the Maxwell Demon lies at the heart of the complexity of living forms today is rendered even more plausible by the many circumstances that may cause it to fail. First, simple environments spawn only simple genomes. Second, changing environments can cause a drop in physical complexity, with a commensurate loss in (computational) function of the organism, as now meaningless genes are shed. Third, sexual reproduction can lead to an accumulation of deleterious mutations (strictly forbidden in asexual populations) that can also render the Demon powerless. All such exceptions are observed in nature.

Notwithstanding these vagaries, we are able to observe the Demon's operation directly in the digital world, giving rise to complex genomes that, although poor compared with their biochemical brethren, still stupefy us with their intricacy and an uncanny amalgam of elegant solutions and clumsy remnants of historical contingency. It is in no small measure an awe before these complex programs, direct descendants of the simplest self-replicators we ourselves wrote, that leads us to assert that even in this view of life, spawned by and in our digital age, "there is grandeur."

Appendix: Epistasis and Complexity. Estimating the complexity according to Eq. 4 is somewhat limited in scope, even though it may be the only practical means for actual biological genomes for which substitution frequencies are known [such as, for example, ensembles of tRNA sequences (4)]. For digital organisms, this estimate can be sharpened by testing all possible single and double mutants of the wild-type for fitness, and sampling the n-mutants to obtain the fraction of neutral mutants at mutational distance n, w(n). In this manner, an ensemble of mutants is created for a single wild-type, resulting in a much more accurate estimate of its information content. As this procedure involves an evaluation of fitness, it is easiest for organisms whose survival rate is closely related to their organic fitness: i.e., for organisms who are not epistatically linked to other organisms in the population. Note that this is precisely the limit in which Fisher's Theorem guarantees an increase in complexity (21).

For an organism of length \ell with instructions taken from an alphabet of size D, let w(1) be the number of neutral one-point mutants N(1) divided by the total number of possible one-point mutations

w(1) = N(1) / (\ell D).   [8]

Note that N(1) includes the wild-type \ell times, for each site is replaced (in the generation of mutants) by each of the D instructions. Consequently, the worst w(1) is equal to D^{-1}. In the literature, w(n) usually refers to the average fitness (normalized to the wild-type) of n-mutants (organisms with n mutations). While this can be obtained here in principle, for the purposes of our information-theoretic estimate, we assume that all non-neutral mutants are nonviable**. We have found that for digital organisms the average n-mutant fitness closely mirrors the function w(n) investigated here.

**As the number of positive mutants becomes important at higher n, in the analysis below we use in the determination of w(n) the fraction of neutral or positive mutants f(n) + f_+(n).

Other values of w(n) are obtained accordingly. We define

w(2) = N(2) / [\binom{\ell}{2} D^2],   [9]

where N(2) is the number of neutral double mutants, including the wild-type and all neutral single mutations included in N(1), and so forth.
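The counting behind Eqs. 8 and 9 can be sketched as follows. The is_neutral predicate and the instruction names are hypothetical stand-ins, and the double mutants are sampled uniformly rather than enumerated, which is a choice of this sketch rather than the procedure behind Table 1.

    from random import choice, sample

    ALPHABET = [f"inst{i:02d}" for i in range(28)]  # D = 28 placeholder instructions

    def w1(genome, is_neutral):
        """Eq. 8: neutral fraction of all L*D one-point substitutions
        (the wild type itself is generated L times and counts as neutral).
        `genome` is a list of instruction names."""
        L, D = len(genome), len(ALPHABET)
        neutral = sum(is_neutral(genome[:i] + [inst] + genome[i + 1:])
                      for i in range(L) for inst in ALPHABET)
        return neutral / (L * D)

    def w2(genome, is_neutral, trials=100_000):
        """Eq. 9, estimated by sampling pairs of sites and substitutions uniformly;
        the denominator choose(L,2) * D**2 is implicit in the uniform sampling."""
        L = len(genome)
        neutral = 0
        for _ in range(trials):
            i, j = sample(range(L), 2)
            mutant = list(genome)
            mutant[i], mutant[j] = choice(ALPHABET), choice(ALPHABET)
            neutral += is_neutral(mutant)
        return neutral / trials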
For the genome before the transition (pictured on the left in Fig. 1) we can collect N(n) as well as N_+(n) (the number of mutants that result in increased fitness) to construct w(n). In Table 1, we list the fraction of neutral and positive n-mutants of the wild type, as well as the number of neutral or positive found and the total number of mutants tried.

Note that we have sampled the mutant distribution up to n = 8 (where we tried 10^9 genotypes), to gain statistical significance. The function is well fit by a two-parameter ansatz

w(n) = D^{-\alpha n^\beta}   [10]

introduced earlier (8), where α ≤ 1 measures the degree of neutrality in the code (0 ≤ α ≤ 1), and β reflects the degree of epistasis (β > 1 for synergistic deleterious mutations, β < 1 for antagonistic ones). Using this function, the complexity of the wild-type can be estimated as follows.


Table 1. Fraction of mutations that were neutral (first column) or positive (second column); total number of neutral or positive genomes found (fourth column); and total mutants examined (fifth column), as a function of the number of mutations n, for the dominating genotype before the transition

n    f(n)          f_+(n)        Total    Tried
1    0.1418        0.034         492      2,700
2    0.0203        0.0119        225      10,000
3    0.0028        0.0028        100      32,039
4    4.6 × 10^-4   6.5 × 10^-4   100      181,507
5    5.7 × 10^-5   1.4 × 10^-4   100      1.3 × 10^6
6    8.6 × 10^-6   2.9 × 10^-5   100      7.3 × 10^6
7    1.3 × 10^-6   5.7 × 10^-6   100      5.1 × 10^7
8    1.8 × 10^-7   1.1 × 10^-6   34       1.0 × 10^9

From the information-theoretic considerations in the main text, the information about the environment stored in a sequence is

C = H_max - H = \ell - H,   [11]

where H is the entropy of the wild-type given its environment. We have previously approximated it by summing the per-site entropies of the sequence, thus ignoring correlations between the sites. Using w(n), a multisite entropy can be defined as

H_\ell = \log_D [w(\ell) D^\ell],   [12]

reflecting the average entropy of a sequence of length \ell. As D^\ell is the total number of different sequences of length \ell, w(\ell) D^\ell is the number of neutral sequences: in other words, all of those sequences that carry the same information as the wild type. The coarse-grained entropy is just the logarithm of that number. Eq. 12 thus represents the entropy of a population based on one wild type in perfect equilibrium in an infinite population. It should approximate the exact result of Eq. 5 if all neutral mutants have the same fitness and therefore the same abundance in an infinite population.

Naturally, H_\ell is impossible to obtain for reasonably sized genomes, as the number of mutations to test to obtain w(\ell) is of the order D^\ell. This is precisely the reason why we chose to approximate the entropy in Eq. 4 in the first place. However, it turns out that in most cases the constants α and β describing w(n) can be estimated from the first few n. The complexity of the wild-type, using the \ell-mutant entropy (Eq. 12), can be defined as

C_\beta = \ell - H_\ell.   [13]

Using Eq. 10, we find (since H_\ell = \log_D [D^{-\alpha \ell^\beta} D^\ell] = \ell - \alpha \ell^\beta)

C_\beta = \alpha \ell^\beta,   [14]

and, naturally, for the complexity based on single mutations only (completely ignoring epistatic interactions)

C_1 = \alpha \ell.   [15]

Thus, obtaining α and β from a fit to w(n) allows an estimate of the complexity of digital genomes including epistatic interactions. As an example, let us investigate the complexity increase across the transition treated earlier. Using both neutral and positive mutants to determine w(n), a fit to the data in Table 1 using the functional form of Eq. 10 yields β = 0.988(8) [α is obtained exactly via w(1)]. This in turn leads to a complexity estimate C_β = 49.4. After the transition, we analyze the new wild type again and find β = 0.986(8), not significantly different from before the transition [while we found β = 0.996(9) during the transition]. The complexity estimate according to this fit is C_β = 55.0, leading to a complexity increase during the transition of ΔC_β = 5.7, or about 6 instructions. Conversely, if epistatic interactions are not taken into account, the same analysis would suggest ΔC_1 = 6.4, somewhat larger. The same analysis can be carried out taking into account neutral mutations only to calculate w(n), leading to ΔC_β = 3.0 and ΔC_1 = 5.4.
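For illustration, the fit and the resulting complexity estimates can be reproduced approximately from the Table 1 entries alone. The log-log least-squares step below is our own choice (the fitting procedure is not spelled out above); α is fixed exactly by w(1) as stated, and w(n) is taken as the neutral-or-positive fraction f(n) + f_+(n).

    import numpy as np

    D, L = 28, 100                       # alphabet size and genome length
    n = np.arange(1, 9)
    # neutral-or-positive fractions f(n) + f_+(n) from Table 1
    w = np.array([0.1418 + 0.034,  0.0203 + 0.0119, 0.0028 + 0.0028,
                  4.6e-4 + 6.5e-4, 5.7e-5 + 1.4e-4, 8.6e-6 + 2.9e-5,
                  1.3e-6 + 5.7e-6, 1.8e-7 + 1.1e-6])

    alpha = -np.log(w[0]) / np.log(D)    # exact, since w(1) = D**(-alpha)

    # Eq. 10 gives log(-log_D w(n) / alpha) = beta * log(n); fit beta through the origin
    x = np.log(n[1:])
    y = np.log(-np.log(w[1:]) / np.log(D) / alpha)
    beta = float(x @ y / (x @ x))

    C_beta = alpha * L**beta             # Eq. 14
    C_1 = alpha * L                      # Eq. 15 (epistasis ignored)
    print(f"alpha={alpha:.3f} beta={beta:.3f} C_beta={C_beta:.1f} C_1={C_1:.1f}")

With these numbers the fitted β comes out near the 0.988 quoted above and C_β near 49, while C_1 = αℓ discards the epistatic correction entirely.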
We thank A. Barr and R. E. Lenski for discussions. Access to a Beowulf system was provided by the Center for Advanced Computation Research at the California Institute of Technology. This work was supported by the National Science Foundation.

1. Gould, S. J. (1996) Full House (Harmony Books, New York).
2. McShea, D. W. (1996) Evolution (Lawrence, Kans.) 50, 477-492.
3. Bennett, C. H. (1995) Physica D 86, 268-273.
4. Adami, C. & Cerf, N. J. (2000) Physica D 137, 62-69.
5. Britten, R. J. & Davidson, E. H. (1971) Q. Rev. Biol. 46, 111-138.
6. Adami, C. (1998) Introduction to Artificial Life (Springer, New York).
7. Lenski, R. E., Ofria, C., Collier, T. C. & Adami, C. (1999) Nature (London) 400, 661-664.
8. Mills, D. R., Peterson, R. L. & Spiegelman, S. (1967) Proc. Natl. Acad. Sci. USA 58, 217-224.
9. Cavalier-Smith, T. (1985) in The Evolution of Genome Size, ed. Cavalier-Smith, T. (Wiley, New York).
10. Dixon, M. & Webb, E. C. (1964) The Enzymes (Academic, New York).
11. Britten, R. J. & Davidson, E. H. (1969) Science 165, 349-357.
12. Schrödinger, E. (1945) What is Life? (Cambridge Univ. Press, Cambridge, U.K.).
13. Gatlin, L. L. (1972) Information Theory and the Living System (Columbia Univ. Press, New York).
14. Wiley, E. O. & Brooks, D. R. (1982) Syst. Zool. 32, 209-219.
15. Brillouin, L. (1962) Science and Information Theory (Academic, New York).
16. Collier, J. (1986) Biol. Philos. 1, 5-24.
17. Landauer, R. (1991) Phys. Today 44 (5), 23-29.
18. Dawkins, R. (1976) The Selfish Gene (Oxford Univ. Press, London).
19. Deutsch, D. (1997) The Fabric of Reality (Penguin, New York), p. 179.
20. Maynard Smith, J. (1970) Nature (London) 225, 563.
21. Maynard Smith, J. (1972) On Evolution (Edinburgh Univ. Press, Edinburgh), pp. 92-99.
22. Shannon, C. E. & Weaver, W. (1949) The Mathematical Theory of Communication (Univ. of Illinois Press, Urbana, IL).
23. Schneider, T. D., Stormo, G. D., Gold, L. & Ehrenfeucht, A. (1986) J. Mol. Biol. 188, 415-431.
24. Basharin, G. P. (1959) Theory Probab. Its Appl. Engl. Transl. 4, 333-336.
25. Elena, S. F. & Lenski, R. E. (1997) Nature (London) 390, 395-398.
26. Lenski, R. E. (1995) in Population Genetics of Bacteria, Society for General Microbiology, Symposium 52, eds. Baumberg, S., Young, J. P. W., Saunders, S. R. & Wellington, E. M. H. (Cambridge Univ. Press, Cambridge, U.K.), pp. 193-215.
27. Lenski, R., Rose, M. R., Simpson, E. C. & Tadler, S. C. (1991) Am. Nat. 138, 1315-1341.
28. Elena, S. F., Cooper, V. S. & Lenski, R. E. (1996) Nature (London) 387, 703-705.
29. Muller, H. J. (1964) Mutat. Res. 1, 2-9.
30. Ofria, C., Brown, C. T. & Adami, C. (1998) in Introduction to Artificial Life, by Adami, C. (Springer, New York), pp. 297-350.

