Edited by James F. Crow, University of Wisconsin, Madison, WI, and approved February 15, 2000 (received for review December 22, 1999)
COMPUTER SCIENCES
EVOLUTION
SPECIAL FEATURE

This paper was submitted directly (Track II) to the PNAS office.
To whom reprint requests should be addressed. E-mail: adami@krl.caltech.edu.
Present address: Center for Microbial Ecology, Michigan State University, Lansing, MI 48824.

To make a case for or against a trend in the evolution of complexity in biological evolution, complexity needs to be both rigorously defined and measurable. A recent information-theoretic (but intuitively evident) definition identifies genomic complexity with the amount of information a sequence stores about its environment. We investigate the evolution of genomic complexity in populations of digital organisms and monitor in detail the evolutionary transitions that increase complexity. We show that, because natural selection forces genomes to behave as a natural "Maxwell Demon," within a fixed environment, genomic complexity is forced to increase.

Darwinian evolution is a simple yet powerful process that requires only a population of reproducing organisms in which each offspring has the potential for a heritable variation from its parent. This principle governs evolution in the natural world, and has gracefully produced organisms of vast complexity. Still, whether or not complexity increases through evolution has become a contentious issue. Gould (1), for example, argues that any recognizable trend can be explained by the "drunkard's walk" model, where "progress" is due simply to a fixed boundary condition. McShea (2) investigates trends in the evolution of certain types of structural and functional complexity, and finds some evidence of a trend but nothing conclusive. In fact, he concludes that "something may be increasing. But is it complexity?" Bennett (3), on the other hand, resolves the issue by fiat, defining complexity as "that which increases when self-organizing systems organize themselves." Of course, to address this issue, complexity needs to be both defined and measurable.

In this paper, we skirt the issue of structural and functional complexity by examining genomic complexity. It is tempting to believe that genomic complexity is mirrored in functional complexity and vice versa. Such an hypothesis, however, hinges upon both the aforementioned ambiguous definition of complexity and the obvious difficulty of matching genes with function. Several developments allow us to bring a new perspective to this old problem. On the one hand, genomic complexity can be defined in a consistent information-theoretic manner [the "physical" complexity (4)], which appears to encompass intuitive notions of complexity used in the analysis of genomic structure and organization (5). On the other hand, it has been shown that evolution can be observed in an artificial medium (6, 7), providing a unique glimpse at universal aspects of the evolutionary process in a computational world. In this system, the symbolic sequences subject to evolution are computer programs that have the ability to self-replicate via the execution of their own code. In this respect, they are computational analogs of catalytically active RNA sequences that serve as the templates of their own reproduction. In populations of such sequences that adapt to their world (inside of a computer's memory), noisy self-replication coupled with finite resources and an information-rich environment leads to a growth in sequence length as the digital organisms incorporate more and more information about their environment into their genome. Evolution in an information-poor landscape, on the contrary, leads to selection for replication only, and a shrinking genome size as in the experiments of Spiegelman and colleagues (8). These populations allow us to observe the growth of physical complexity explicitly, and also to distinguish distinct evolutionary pressures acting on the genome and analyze them in a mathematical framework.

If an organism's complexity is a reflection of the physical complexity of its genome (as we assume here), the latter is of prime importance in evolutionary theory. Physical complexity, roughly speaking, reflects the number of base pairs in a sequence that are functional. As is well known, equating genomic complexity with genome length in base pairs gives rise to a conundrum (known as the C-value paradox) because large variations in genomic complexity (in particular in eukaryotes) seem to bear little relation to the differences in organismic complexity (9). The C-value paradox is partly resolved by recognizing that not all of DNA is functional: that there is a neutral fraction that can vary from species to species. If we were able to monitor the non-neutral fraction, it is likely that a significant increase in this fraction could be observed throughout at least the early course of evolution. For the later period, in particular the later Phanerozoic Era, it is unlikely that the growth in complexity of genomes is due solely to innovations in which genes with novel functions arise de novo. Indeed, most of the enzyme activity classes in mammals, for example, are already present in prokaryotes (10). Rather, gene duplication events leading to repetitive DNA and subsequent diversification (11), as well as the evolution of gene regulation patterns, appear to be a more likely scenario for this stage. Still, we believe that the Maxwell Demon mechanism described below is at work during all phases of evolution and provides the driving force toward ever increasing complexity in the natural world.

Information Theory and Complexity. Using information theory to understand evolution and the information content of the sequences it gives rise to is not a new undertaking. Unfortunately, many of the earlier attempts (e.g., refs. 12-14) confuse the picture more than they clarify it, often clouded by misguided notions of the concept of information (15). An (at times amusing) attempt to make sense of these misunderstandings is ref. 16.

Perhaps a key aspect of information theory is that information cannot exist in a vacuum; that is, information is physical (17). This statement implies that information must have an instantiation (be it ink on paper, bits in a computer's memory, or even the neurons in a brain). Furthermore, it also implies that information must be about something. Lines on a piece of paper, for example, are not inherently information until it is discovered that they correspond to something, such as (in the case of a map) to the relative location of local streets and buildings. Consequently, any arrangement of symbols might be viewed as potential information (also known as entropy in information theory), but acquires the status of information only when its correspondence, or correlation, to other physical objects is revealed.

In biological systems the instantiation of information is DNA, but what is this information about? To some extent, it is the blueprint of an organism and thus information about its own
the environment it lives in, including the species it co-evolves with. It is well known that not all of the symbols in an organism's DNA correspond to something. These sections, sometimes referred to as "junk-DNA," usually consist of portions of the code that are unexpressed or untranslated (i.e., excised from the mRNA). More modern views concede that unexpressed and untranslated regions in the genome can have a multitude of uses, such as for example satellite DNA near the centromere, or the poly-C polymerase intron excised from Tetrahymena rRNA. In the absence of a complete map of the function of each and every base pair in the genome, how can we then decide which stretch of code is "about something" (and thus contributes to the complexity of the code) or else is entropy (i.e., random code without function)?

A true test for whether a sequence is information uses the success (fitness) of its bearer in its environment, which implies that a sequence's information content is conditional on the environment it is to be interpreted within (4). Accordingly, Mycoplasma mycoides, for example (which causes pneumonia-like respiratory illnesses), has a complexity of somewhat less than one million base pairs in our nasal passages, but close to zero complexity most everywhere else, because it cannot survive in any other environment, meaning its genome does not correspond to anything there. A genetic locus that codes for information essential to an organism's survival will be fixed in an adapting population because all mutations of the locus result in the organism's inability to promulgate the tainted genome, whereas inconsequential (neutral) sites will be randomized by the constant mutational load. Examining an ensemble of sequences large enough to obtain statistically significant substitution probabilities would thus be sufficient to separate information from entropy in genetic codes. The neutral sections that contribute only to the entropy turn out to be exceedingly important for evolution to proceed, as has been pointed out, for example, by Maynard Smith (20).

In Shannon's information theory (22), the quantity entropy (H) represents the expected number of bits required to specify the state of a physical object given a distribution of probabilities; that is, it measures how much information can potentially be stored in it.

In a genome, for a site i that can take on four nucleotides with probabilities

$$p_C(i),\; p_G(i),\; p_A(i),\; p_T(i), \qquad [1]$$

the entropy of this site is

$$H(i) = -\sum_{j} p_j(i)\,\log p_j(i). \qquad [2]$$

$$C = \ell - \sum_{i} H(i). \qquad [4]$$

It should be clear that this value can only be an approximation to the true physical complexity of an organism's genome. In reality, sites are not independent and the probability to find a certain base at one position may be conditional on the probability to find another base at another position. Such correlations between sites are called epistatic, and they can render the entropy per molecule significantly different from the sum of the per-site entropies (4). This entropy per molecule, which takes into account all epistatic correlations between sites, is defined as

$$H = -\sum_{g} p(g|E)\,\log p(g|E) \qquad [5]$$

and involves an average over the logarithm of the conditional probabilities $p(g|E)$ to find genotype g given the current environment E. In every finite population, estimating $p(g|E)$ using the actual frequencies of the genotypes in the population (if those could be obtained) results in corrections to Eq. 5 larger than the quantity itself (24), rendering the estimate useless. Another avenue for estimating the entropy per molecule is the creation of mutational clones at several positions at the same time (7, 25) to measure epistatic effects. The latter approach is feasible within experiments with simple ecosystems of digital organisms that we introduce in the following section, which reveal significant epistatic effects. The technical details of the complexity calculation including these effects are relegated to the Appendix.

Digital Evolution. Experiments in evolution have traditionally been formidable because of evolution's gradual pace in the natural world. One successful method uses microscopic organisms with generational times on the order of hours, but even this approach has difficulties; it is still impossible to perform measurements with high precision, and the time-scale to see significant adaptation remains weeks, at best. Populations of Escherichia coli introduced into new environments begin adaptation immediately, with significant results apparent in a few weeks (26, 27). Observable evolution in most organisms occurs on time scales of at least years.

To complement such an approach, we have developed a tool to study evolution in a computational medium: the Avida platform (6). The Avida system hosts populations of self-replicating computer programs in a complex and noisy environment, within a computer's memory. The evolution of these digital organisms is limited in speed only by the computers used, with generations (for populations of the order $10^3$-$10^4$
apparent simplicity of the single-niche environment and the limited interactions between digital organisms, very rich dynam-
single species, whose members all have similar functionality and equivalent fitness (except for organisms that lost the capability to self-replicate due to mutation). In this world, a new species can obtain a significant abundance only if it has a competitive advantage (increased Malthusian parameter) thanks to a beneficial mutation. While the system returns to equilibrium after the innovation, this new species will gradually exert dominance over the population, bringing the previously dominant species to extinction. This dynamics of innovation and extinction can be monitored in detail and appears to mirror the dynamics of E. coli in single-niche long-term evolution experiments (28).

The complexity of an adapted digital organism according to Eq. 4 can be obtained by measuring substitution frequencies at each instruction across the population. Such a measurement is easiest if genome size is constrained to be constant, as is done in the experiments reported below, although this constraint can be relaxed by implementing a suitable alignment procedure. To correctly assess the information content of the ensemble of sequences, we need to obtain the substitution probabilities $p_i$ at each position, which go into the calculation of the per-site entropy of Eq. 2. Care must be taken to wait sufficiently long after an innovation, to give those sites within a new species that are variable a chance to diverge. Indeed, shortly after an innovation, previously 100% variable sites will appear fixed by "hitchhiking" on the successful genotype, a phenomenon discussed further below.

We simplify the problem of obtaining substitution probabilities for each instruction by assuming that all mutations are either lethal, neutral, or positive, and furthermore assume that all non-lethal substitutions persist with equal probability. We then categorize every possible mutation directly by creating all single-mutation genomes and examining them independently in isolation. In that case, Eq. 2 reduces to

thals). Note that the logarithm is taken with respect to the size of the alphabet.

This per-site entropy is used to illustrate the variability of loci in a genome, just before and after an evolutionary transition, in Fig. 1.

Progression of Complexity. Tracking the entropy of each site in the genome allows us to document the growth of complexity in an evolutionary event. For example, it is possible to measure the difference in complexity between the pair of genomes in Fig. 1, separated by only 203 generations and a powerful evolutionary transition. Comparing their entropy maps, we can immediately identify the sections of the genome that code for the new "gene" that emerged in the transition: the entropy at those sites has been drastically reduced, while the complexity increase across the transition (taking into account epistatic effects) turns out to be $\Delta C \approx 6$, as calculated in the Appendix.

We can extend this analysis by continually surveying the entropies of each site during the course of an experiment. Fig. 2 does this for the experiment just discussed, but this time the substitution probabilities are obtained by sampling the actual population at each site. A number of features are apparent in this figure. First, the trend toward a "cooling" of the genome (i.e., to more conserved sites) is obvious. Second, evolutionary transitions can be identified by vertical darkened "bands," which arise because the genome instigating the transition replicates faster than its competitors, thus driving them into extinction. As a consequence, even random sites that are hitchhiking on the successful gene are momentarily fixed.

Hitchhiking is documented clearly by plotting the sum of per-site entropies for the population (as an approximation for the entropy of the genome)

$$H \approx \sum_{i=1}^{\ell} H(i) \qquad [7]$$

For these asexual organisms, the species concept is only loosely defined as programs that differ in genotype but only marginally in function.
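The per-site bookkeeping described in this section can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the Avida implementation: a population is represented as a list of equal-length strings, the per-site entropy follows Eq. 2 with logarithms taken to the base of the alphabet size D, the entropies are summed as in Eq. 7, and the sum is subtracted from the sequence length, which is taken here to be the per-site approximation of Eq. 4. The second helper covers the simplified case in which all non-lethal substitutions at a site are assumed equally likely; a uniform distribution over N options has entropy log_D N, so Eq. 2 then depends only on the number of non-lethal instructions at each site. All function names are hypothetical.

```python
import math
from collections import Counter


def per_site_entropy(column, alphabet_size):
    """Entropy of one site (Eq. 2), with the logarithm taken to base D."""
    counts = Counter(column)
    total = len(column)
    h = 0.0
    for count in counts.values():
        p = count / total  # observed substitution frequency at this site
        h -= p * math.log(p, alphabet_size)
    return h


def entropy_profile(population, alphabet_size):
    """Per-site entropies for a population of equal-length sequences."""
    length = len(population[0])
    return [
        per_site_entropy([genome[i] for genome in population], alphabet_size)
        for i in range(length)
    ]


def entropies_from_nonlethal_counts(nonlethal_counts, alphabet_size):
    """Per-site entropies when all non-lethal substitutions are equally likely.

    nonlethal_counts[i] is the number of instructions (wild type included)
    that leave the organism viable at site i; a uniform distribution over
    N options has entropy log_D(N).
    """
    return [math.log(n, alphabet_size) for n in nonlethal_counts]


def physical_complexity(profile):
    """Sequence length minus the summed per-site entropies (Eq. 7 sum)."""
    return len(profile) - sum(profile)


if __name__ == "__main__":
    # Toy population: three conserved sites and one fully variable site.
    population = ["ACGT", "ACGA", "ACGC", "ACGG"]
    profile = entropy_profile(population, alphabet_size=4)
    print(profile, physical_complexity(profile))
```

With frequencies sampled from a real population, the estimate is only meaningful once variable sites have had time to diverge after an innovation, for the hitchhiking reason given in the text.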
Methods. For all work presented here, we use a single-niche environment in which resources are isotropically distributed and unlimited except for central processing unit (CPU) time, the primary resource for this life-form. This limitation is imposed by constraining the average slice of CPU time executed by any genome per update to be a constant (here 30 instructions). Thus, per update, a population of n genomes executes $30 \times n$ instructions. The unlimited resources are numbers that the programs can retrieve from the environment with the right genetic code. Computations on these numbers allow the organisms to execute significantly larger slices of CPU time, at the expense of inferior ones (see refs. 6 and 8).

A normal Avida organism is a single genome (program) composed of a sequence of instructions that are processed as commands to the CPU of a virtual computer. In standard Avida experiments, an organism's genome has one of 28 possible instructions at each line. The set of instructions (alphabet) from which an organism draws its code is selected to avoid biasing evolution toward any particular type of program or environment. Still, evolutionary experiments will always show a distinct dependence on the ancestor used to initiate experiments, and on the elements of chance and history. To minimize these effects, trials are repeated to gain statistical significance, another crucial advantage of experiments in artificial evolution. In the present experiments, we have chosen to keep sequence length fixed at 100 instructions, by creating a self-replicating ancestor containing mostly non-sense code, from which all populations are spawned. Mutations appear during the copy process, which is flawed with a probability of error per instruction copied of 0.01. For more details on Avida, see ref. 30.

Conclusions. Trends in the evolution of complexity are difficult to argue for or against if there is no agreement on how to measure complexity. We have proposed here to identify the complexity of genomes by the amount of information they encode about the world in which they have evolved, a quantity known as physical complexity that, while it can be measured only approximately, allows quantitative statements to be made about the evolution of genomic complexity. In particular, we show that, in fixed environments, for organisms whose fitness depends only on their own sequence information, physical complexity must always increase. That a genome's physical complexity must be reflected in the structural complexity of the organism that harbors it seems to us inevitable, as the purpose of a physically complex genome is complex information processing, which can only be achieved by the computer which it (the genome) creates.

That the mechanism of the Maxwell Demon lies at the heart of the complexity of living forms today is rendered even more plausible by the many circumstances that may cause it to fail. First, simple environments spawn only simple genomes. Second, changing environments can cause a drop in physical complexity, with a commensurate loss in (computational) function of the organism, as now meaningless genes are shed. Third, sexual reproduction can lead to an accumulation of deleterious mutations (strictly forbidden in asexual populations) that can also render the Demon powerless. All such exceptions are observed in nature.

Notwithstanding these vagaries, we are able to observe the Demon's operation directly in the digital world, giving rise to complex genomes that, although poor compared with their biochemical brethren, still stupefy us with their intricacy and an uncanny amalgam of elegant solutions and clumsy remnants of historical contingency. It is in no small measure an awe before

Appendix: Epistasis and Complexity. Estimating the complexity according to Eq. 4 is somewhat limited in scope, even though it may be the only practical means for actual biological genomes for which substitution frequencies are known [such as, for example, ensembles of tRNA sequences (4)]. For digital organisms, this estimate can be sharpened by testing all possible single and double mutants of the wild-type for fitness, and sampling the n-mutants to obtain the fraction of neutral mutants at mutational distance n, w(n). In this manner, an ensemble of mutants is created for a single wild-type resulting in a much more accurate estimate of its information content. As this procedure involves an evaluation of fitness, it is easiest for organisms whose survival rate is closely related to their organic fitness: i.e., for organisms who are not epistatically linked to other organisms in the population. Note that this is precisely the limit in which Fisher's Theorem guarantees an increase in complexity (21).

For an organism of length $\ell$ with instructions taken from an alphabet of size D, let w(1) be the number of neutral one-point mutants $N_\nu(1)$ divided by the total number of possible one-point mutations

$$w(1) = \frac{N_\nu(1)}{D\,\ell}. \qquad [8]$$

Note that $N_\nu(1)$ includes the wild-type $\ell$ times, for each site is replaced (in the generation of mutants) by each of the D instructions. Consequently, the worst w(1) is equal to $D^{-1}$. In the literature, w(n) usually refers to the average fitness (normalized to the wild-type) of n-mutants (organisms with n mutations). While this can be obtained here in principle, for the purposes of our information-theoretic estimate, we assume that all non-neutral mutants are nonviable**. We have found that for digital organisms the average n-mutant fitness closely mirrors the function w(n) investigated here.

Other values of w(n) are obtained accordingly. We define

$$w(2) = \frac{N_\nu(2)}{D^2\,\ell(\ell-1)/2}, \qquad [9]$$

where $N_\nu(2)$ is the number of neutral double mutants, including the wild-type and all neutral single mutations included in $N_\nu(1)$, and so forth.

For the genome before the transition (pictured on the left in Fig. 1) we can collect $N_\nu(n)$ as well as $N_+(n)$ (the number of mutants that result in increased fitness) to construct w(n). In Table 1, we list the fraction of neutral and positive n-mutants of the wild type, as well as the number of neutral or positive found and the total number of mutants tried.

Note that we have sampled the mutant distribution up to $n = 8$ (where we tried $10^9$ genotypes), to gain statistical significance. The function is well fit by a two-parameter ansatz

$$w(n) = D^{-\alpha n^{\beta}} \qquad [10]$$

introduced earlier (8), where $1 - \alpha$ measures the degree of neutrality in the code ($0 < \alpha < 1$), and $\beta$ reflects the degree of epistasis ($\beta > 1$ for synergistic deleterious mutations, $\beta < 1$ for antagonistic ones). Using this function, the complexity of the wild-type can be estimated as follows.

**As the number of positive mutants becomes important at higher n, in the analysis below we use in the determination of w(n) the fraction of neutral or positive mutants $f_\nu(n) + f_+(n)$.
1. Gould, S. J. (1996) Full House (Harmony Books, New York).
2. McShea, D. W. (1996) Evolution (Lawrence, Kans.) 50, 477-492.
3. Bennett, C. H. (1995) Physica D 86, 268-273.
4. Adami, C. & Cerf, N. J. (2000) Physica D 137, 62-69.
5. Britten, R. J. & Davidson, E. H. (1971) Q. Rev. Biol. 46, 111-138.
6. Adami, C. (1998) Introduction to Artificial Life (Springer, New York).
7. Lenski, R. E., Ofria, C., Collier, T. C. & Adami, C. (1999) Nature (London) 400, 661-664.
8. Mills, D. R., Peterson, R. L. & Spiegelman, S. (1967) Proc. Natl. Acad. Sci. USA 58, 217-224.
9. Cavalier-Smith, T. (1985) in The Evolution of Genome Size, ed. Cavalier-Smith, T. (Wiley, New York).
10. Dixon, M. & Webb, E. C. (1964) The Enzymes (Academic, New York).
11. Britten, R. J. & Davidson, E. H. (1969) Science 165, 349-357.
12. Schrödinger, E. (1945) What is Life? (Cambridge Univ. Press, Cambridge, U.K.).
13. Gatlin, L. L. (1972) Information Theory and the Living System (Columbia Univ. Press, New York).
14. Wiley, E. O. & Brooks, D. R. (1982) Syst. Zool. 32, 209-219.
15. Brillouin, L. (1962) Science and Information Theory (Academic, New York).
16. Collier, J. (1986) Biol. Philos. 1, 5-24.
17. Landauer, R. (1991) Phys. Today 44 (5), 23-29.
18. Dawkins, R. (1976) The Selfish Gene (Oxford Univ. Press, London).
19. Deutsch, D. (1997) The Fabric of Reality (Penguin, New York), p. 179.
20. Maynard Smith, J. (1970) Nature (London) 225, 563.
21. Maynard Smith, J. (1972) On Evolution (Edinburgh Univ. Press, Edinburgh), pp. 92-99.
22. Shannon, C. E. & Weaver, W. (1949) The Mathematical Theory of Communication (Univ. of Illinois Press, Urbana, IL).
23. Schneider, T. D., Stormo, G. D., Gold, L. & Ehrenfeucht, A. (1986) J. Mol. Biol. 188, 415-431.
24. Basharin, G. P. (1959) Theory Probab. Its Appl. Engl. Transl. 4, 333-336.
25. Elena, S. F. & Lenski, R. E. (1997) Nature (London) 390, 395-398.
26. Lenski, R. E. (1995) in Population Genetics of Bacteria, Society for General Microbiology, Symposium 52, eds. Baumberg, S., Young, J. P. W., Saunders, S. R. & Wellington, E. M. H. (Cambridge Univ. Press, Cambridge, U.K.), pp. 193-215.
27. Lenski, R., Rose, M. R., Simpson, E. C. & Tadler, S. C. (1991) Am. Nat. 138, 1315-1341.
28. Elena, S. F., Cooper, V. S. & Lenski, R. E. (1996) Nature (London) 387, 703-705.
29. Muller, H. J. (1964) Mutat. Res. 1, 2-9.
30. Ofria, C., Brown, C. T. & Adami, C. (1998) in Introduction to Artificial Life, by Adami, C. (Springer, New York), pp. 297-350.