REVIEW
The common patterns of nature
S. A. FRANK
Department of Ecology and Evolutionary Biology, University of California, Irvine, CA, USA and Santa Fe Institute, Santa Fe, NM, USA
Keywords: ecological pattern; maximum entropy; neutral theories; population genetics.

Abstract

We typically observe large-scale outcomes that arise from the interactions of many hidden, small-scale processes. Examples include age of disease onset, rates of amino acid substitutions and composition of ecological communities. The macroscopic patterns in each problem often vary around a characteristic shape that can be generated by neutral processes. A neutral generative model assumes that each microscopic process follows unbiased or random stochastic fluctuations: random connections of network nodes; amino acid substitutions with no effect on fitness; species that arise or disappear from communities randomly. These neutral generative models often match common patterns of nature. In this paper, I present the theoretical background by which we can understand why these neutral generative models are so successful. I show where the classic patterns come from, such as the Poisson pattern, the normal or Gaussian pattern and many others. Each classic pattern was often discovered by a simple neutral generative model. The neutral patterns share a special characteristic: they describe the patterns of nature that follow from simple constraints on information. For example, any aggregation of processes that preserves information only about the mean and variance attracts to the Gaussian pattern; any aggregation that preserves information only about the mean attracts to the exponential pattern; any aggregation that preserves information only about the geometric mean attracts to the power law pattern. I present a simple and consistent informational framework of the common patterns of nature based on the method of maximum entropy. This framework shows that each neutral generative model is a special case that helps to discover a particular set of informational constraints; those informational constraints define a much wider domain of non-neutral generative processes that attract to the same neutral pattern.
Correspondence: Steven A. Frank, Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697-2525, USA and Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA. Tel.: +1 949 824 2244; fax: +1 949 824 2181; e-mail: safrank@uci.edu

In fact, all epistemologic value of the theory of probability is based on this: that large-scale random phenomena in their collective action create strict, nonrandom regularity. (Gnedenko & Kolmogorov, 1968, p. 1)

I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the 'law of frequency of error' [the normal or Gaussian distribution]. Whenever a large sample of chaotic elements is taken in hand and marshaled in the order of their magnitude, this unexpected and most beautiful form of regularity proves to have been latent all along. The law … reigns with serenity and complete self-effacement amidst the wildest confusion. The larger the mob and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of unreason. (Galton, 1889, p. 166)

We cannot understand what is happening until we learn to think of probability distributions in terms of their demonstrable information content … (Jaynes, 2003, p. 198)

Introduction

Most patterns in biology arise from aggregation of many small processes. Variations in the dynamics of complex neural and biochemical networks depend on numerous fluctuations in connectivity and flow through small-scale subcomponents of the network. Variations in cancer onset arise from variable failures in the many individual
checks and balances on DNA repair, cell cycle control and tissue homeostasis. Variations in the ecological distribution of species follow the myriad local differences in the birth and death rates of species and in the small-scale interactions between particular species.

In all such complex systems, we wish to understand how a large-scale pattern arises from the aggregation of small-scale processes. A single dominant principle sets the major axis from which all explanation of aggregation and scale must be developed. This dominant principle is the limiting distribution.

The best known of the limiting distributions, the Gaussian (normal) distribution, follows from the central limit theorem. If an outcome, such as height, weight or yield, arises from the summing up of many small-scale processes, then the distribution typically approaches the Gaussian curve in the limit of aggregation over many processes.

The individual, small-scale fluctuations caused by each contributing process rarely follow the Gaussian curve. But, with aggregation of many partly uncorrelated fluctuations, each small in scale relative to the aggregate, the sum of the fluctuations smooths into the Gaussian curve – the limiting distribution in this case. One might say that the numerous small fluctuations tend to cancel out, revealing the particular form of regularity or information that characterizes aggregation and scale for the process under study.

The central limit theorem is widely known, and the Gaussian distribution is widely recognized as an aggregate pattern. This limiting distribution is so important that one could hardly begin to understand patterns of nature without an instinctive recognition of the relation between aggregation and the Gaussian curve.

In this paper, I discuss biological patterns within a broader framework of limiting distributions. I emphasize that the common patterns of nature arise from distinctive limiting distributions. In each case, one must understand the distinctive limiting distribution in order to analyse pattern and process.

With regard to the different limiting distributions that characterize patterns of nature, aggregation and scale have at least three important consequences. First, a departure from the expected limiting distribution suggests an interesting process that perturbs the typical regularity of aggregation. Such departures from expectation can only be recognized and understood if one has a clear grasp of the characteristic limiting distribution for the particular problem.

Second, one must distinguish clearly between two distinct meanings of 'neutrality'. For example, if we count the number of events of a purely random, or 'neutral', process, we observe a Poisson pattern. However, the Poisson may also arise as a limiting distribution by the aggregation of small-scale, nonrandom processes. So we must distinguish between two alternative causes of neutral pattern: the generation of pattern by neutral processes, or the generation of pattern by the aggregation of non-neutral processes in which the non-neutral fluctuations tend to cancel in the aggregate. This distinction cuts to the heart of how we may test neutral theories in ecology and evolution.

Third, a powerfully attracting limiting distribution may be relatively insensitive to perturbations. Insensitivity arises because, to be attracting over broad variations in the aggregated processes, the limiting distribution must not change too much in response to perturbations. Insensitivity to perturbation is a reasonable definition of robustness. Thus, robust patterns may often coincide with limiting distributions.

In general, inference in biology depends critically on understanding the nature of limiting distributions. If a pattern can only be generated by a very particular hypothesized process, then observing the pattern strongly suggests that the pattern was created by the hypothesized generative process. By contrast, if the same pattern arises as a limiting distribution from a variety of underlying processes, then a match between theory and pattern only restricts the underlying generative processes to the broad set that attracts to the limiting pattern. Inference must always be discussed in relation to the breadth of processes attracted to a particular pattern. Because many patterns in nature arise from limiting distributions, such distributions form the core of inference with regard to the relations between pattern and process.

I recently studied the problems outlined in this Introduction. In my studies, I found it useful to separate the basic facts of probability theory that set the background from my particular ideas about biological robustness, the neutral theories in ecology and evolution, and the causes of patterns such as the age of cancer onset and the age of death. The basic facts of probability are relatively uncontroversial, whereas my own interpretations of particular biological patterns remain open to revision and to the challenge of empirical tests.

In this paper, I focus on the basic facts of probability theory to set the background for future work. Thus, this paper serves primarily as a tutorial to aspects of probability framed in the context of several recent conceptual advances from the mathematical and physical sciences. In particular, I use Jaynes' maximum entropy approach to unify the relations between aggregation and pattern (Jaynes, 2003). Information plays the key role. In each problem, the ultimate pattern arises from the particular information preserved in the face of the combined fluctuations in aggregates that decay all nonpreserved aspects of pattern towards maximum entropy or maximum randomness. My novel contribution is to apply this framework of entropy and information in a very simple and consistent way across the full range of common patterns in nature.
Here is a rough, intuitive way to think about the basin of attraction towards a random, limiting distribution (Jaynes, 2003). Think of each small-scale process as contributing a deviation from random in the aggregate. If many nonrandom and partially uncorrelated small-scale deviations aggregate to form an overall pattern, then the individual nonrandom deviations will often cancel in the aggregate and attract to the limiting random distribution. Such smoothing of small-scale deviations in the aggregate pattern must be rather common, explaining why a random pattern with strict regularity, such as the Poisson, is so widely observed.

Many aggregate patterns will, of course, deviate from the random limiting distribution. To understand such deviations, we must consider them in relation to the basin of attraction for random pattern. In general, the basins of attraction of random patterns form the foundation by which we must interpret observations of nature. In this regard, the major limiting distributions are the cornerstone of natural history.

Exponential and gamma

The waiting time for the first occurrence of some particular event often follows an exponential distribution. The exponential has three properties that associate it with a neutral or random pattern.

First, the waiting time until the first occurrence of a Poisson process has an exponential distribution: the exponential is the time characterization of how long one waits for random events, and the Poisson is the count characterization of the number of random events that occur in a fixed time (or space) interval.

Second, the exponential distribution has the memoryless property. Suppose the waiting time for the first occurrence of an event follows the exponential distribution. If, starting at time t = 0, the event does not occur during the interval up to time t = T, then the waiting time for the first occurrence starting from time T is the same as it was when we began to measure from time zero. In other words, the process has no memory of how long it has been waiting; occurrences happen at random irrespective of history.

Third, the exponential is in many cases the limiting distribution for aggregation of smaller scale waiting time processes (see below). For example, time to failure patterns often follow the exponential, because the failure of a machine or aggregate biological structure often depends on the time to failure of any essential component. Each component may have a time to failure that differs from the exponential, but in the aggregate, the waiting time for the overall failure of the machine often converges to the exponential.

The gamma distribution arises in many ways, a characteristic of limiting distributions. For example, if the waiting time until the first occurrence of a random process follows an exponential, and occurrences happen independently of each other, then the waiting time for the nth occurrence follows the gamma pattern. The exponential is a limiting distribution with a broad basin of attraction. Thus, the gamma is also a limiting distribution that attracts many aggregate patterns. For example, if an aggregate structure fails only after multiple subcomponents fail, then the time to failure may in the limit attract to a gamma pattern. In a later section, I will discuss a more general way in which to view the gamma distribution with respect to randomness and limiting distributions.

Power law

Many patterns of nature follow a power law distribution (Mandelbrot, 1983; Kleiber & Kotz, 2003; Mitzenmacher, 2004; Newman, 2005; Simkin & Roychowdhury, 2006; Sornette, 2006). Consider the distribution of wealth in human populations as an example. Suppose that the frequency of individuals with wealth x is f(x), and the frequency with twice that wealth is f(2x). Then the ratio of those with wealth x relative to those with twice that wealth is f(x)/f(2x). That ratio of wealth is often constant no matter what level of baseline wealth, x, that we start with, so long as we look above some minimum value of wealth, L. In particular,

\frac{f(x)}{f(2x)} = k,

where k is a constant, and x > L. Such relations are called 'scale invariant', because no matter how big or small x, that is, no matter what scale at which we look, the change in frequency follows the same constant pattern.

Scale-invariant pattern implies a power law relationship for the frequencies

f(x) = a x^{-b},

where a is an uninteresting constant that must be chosen so that the total frequency sums to 1, and b is a constant that sets how fast wealth becomes less frequent as wealth increases. For example, a doubling in wealth leads to

\frac{f(x)}{f(2x)} = \frac{a x^{-b}}{a (2x)^{-b}} = 2^{b},

which shows that the ratio of the frequency of people with wealth x relative to those with wealth 2x does not depend on the initial wealth, x, that is, it does not depend on the scale at which we look.

Scale invariance, expressed by power laws, describes a very wide range of natural patterns. To give just a short listing, Sornette (2006) mentions that the following phenomena follow power law distributions: the magnitudes of earthquakes, hurricanes, volcanic eruptions and floods; the sizes of meteorites; and losses caused by business interruptions from accidents. Other studies have documented power laws in stock market fluctuations, sizes of computer files and word frequency in languages (Mitzenmacher, 2004; Newman, 2005; Simkin & Roychowdhury, 2006).
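As a quick numerical check of the scale-invariance argument above, the following sketch (an illustrative Python addition, not part of the original text; the exponent b = 2.5 and the constant a are arbitrary choices) confirms that a power law gives the same frequency ratio at every baseline x.

b = 2.5          # hypothetical tail exponent
a = 1.0          # normalizing constant; its value cancels in the ratio

def f(x):
    return a * x ** (-b)

for x in [1.0, 10.0, 100.0, 1000.0]:
    print(x, f(x) / f(2 * x))    # always 2**b = 5.656..., independent of the scale x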
In biology, power laws have been particularly important in analysing connectivity patterns in metabolic networks (Barabasi & Albert, 1999; Ravasz et al., 2002) and in the number of species observed per unit area in ecology (Garcia Martin & Goldenfeld, 2006).

Many models have been developed to explain why power laws arise. Here is a simple example from Simon (1955) to explain the power law distribution of word frequency in languages (see Simkin & Roychowdhury, 2006). Suppose we start with a collection of N words. We then add another word. With probability p, the word is new. With probability 1 - p, the word matches one already in our collection; the particular match to an existing word occurs with probability proportional to the relative frequencies of existing words. In the long run, the frequency of words that occurs x times is proportional to x^{-[1+1/(1-p)]}. We can think of this process as preferential attachment, or an example in which the rich get richer.

Simon's model sets out a simple process that generates a power law and fits the data. But could other simple processes generate the same pattern? We can express this question in an alternative way, following the theme of this paper: What is the basin of attraction for processes that converge onto the same pattern? The following sections take up this question and, more generally, how we may think about the relationship between generative models of process and the commonly observed patterns that result.
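The following small simulation is an illustrative sketch of the word process just described, written in Python; the step count, the innovation probability p = 0.1 and the random seed are arbitrary choices rather than values from the original analysis.

import random
from collections import Counter

def simon_process(steps=200000, p=0.1, seed=1):
    random.seed(seed)
    words = [0]                 # occurrence history; index 0 is the first word
    next_id = 1
    for _ in range(steps):
        if random.random() < p:
            words.append(next_id)               # brand-new word
            next_id += 1
        else:
            words.append(random.choice(words))  # copy an existing word in proportion to its frequency
    return Counter(words)

counts = simon_process()
freq_of_freq = Counter(counts.values())
for x in [1, 2, 4, 8, 16]:
    print(x, freq_of_freq.get(x, 0))   # the number of words used x times falls off roughly as a power law

Choosing a uniform element of the occurrence list is what makes the copy step proportional to current frequency, which is the 'rich get richer' feature of the process.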
Random or neutral distributions

Much of biological research is reverse engineering. We observe a pattern or design, and we try to infer the underlying process that generated what we see. The observed patterns can often be described as probability distributions: the frequencies of genotypes; the numbers of nucleotide substitutions per site over some stretch of DNA; the different output response strengths or movement directions given some input; or the numbers of species per unit area.

The same small set of probability distributions describe the great majority of observed patterns: the binomial, Poisson, Gaussian, exponential, power law, gamma and a few other common distributions. These distributions reveal the contours of nature. We must understand why these distributions are so common and what they tell us, because our goal is to use these observed patterns to reverse engineer the underlying processes that created those patterns. What information do these distributions contain?

Maximum entropy

The key probability distributions often arise as the most random pattern consistent with the information expressed by a few constraints (Jaynes, 2003). In this section, I introduce the concept of maximum entropy, where entropy measures randomness. In the following sections, I derive common distributions to show how they arise from maximum entropy (randomness) subject to constraints such as information about the mean, variance or geometric mean. My mathematical presentation throughout is informal and meant to convey basic concepts. Those readers interested in the mathematics should follow the references to the original literature.

The probability distributions follow from Shannon's measure of information (Shannon & Weaver, 1949). I first define this measure of information. I then discuss an intuitive way of thinking about the measure and its relation to entropy.

Consider a probability distribution function (pdf) defined as p(y|\theta). Here, p is the probability of some measurement y given a set of parameters, \theta. Let the abbreviation p_y stand for p(y|\theta). Then Shannon information is defined as

H = -\sum p_y \log(p_y),

where the sum is taken over all possible values of y, and the log is taken as the natural logarithm.

The value -\log(p_y) = \log(1/p_y) rises as the probability p_y of observing a particular value of y becomes smaller. In other words, -\log(p_y) measures the surprise in observing a particular value of y, because rare events are more surprising (Tribus, 1961). Greater surprise provides more information: if we are surprised by an observation, we learn a lot; if we are not surprised, we had already predicted the outcome to be likely, and we gain little information. With this interpretation, Shannon information, H, is simply an average measure of surprise over all possible values of y.
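As a minimal illustration of H as average surprise, the following Python fragment (my own example; the two distributions are arbitrary) compares a peaked distribution with the uniform distribution over the same five outcomes.

import math

def shannon_H(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

peaked  = [0.7, 0.1, 0.1, 0.05, 0.05]
uniform = [0.2] * 5

print(shannon_H(peaked))   # smaller: outcomes are predictable, little average surprise
print(shannon_H(uniform))  # log(5) ~ 1.609: the most random distribution on five outcomes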
We may interpret the maximum of H as the least predictable and most random distribution within the constraints of any particular problem. In physics, randomness is usually expressed in terms of entropy, or disorder, and is measured by the same expression as Shannon information. Thus, the technique of maximizing H to obtain the most random distribution subject to the constraints of a particular problem is usually referred to as the method of maximum entropy (Jaynes, 2003).

Why should observed probability distributions tend towards those with maximum entropy? Because observed patterns typically arise by the aggregation of many small-scale processes. Any directionality or nonrandomness caused by each small-scale process tends, on average, to be cancelled in the aggregate: one fluctuation pushes in one direction, another fluctuation pushes in a different direction and so on. Of course, not all observations are completely random. The key is that each problem typically has a few constraints that set the pattern in the aggregate, and all other fluctuations cancel as the nonconstrained aspects tend to the greatest
entropy or randomness. In terms of information, the final pattern reflects only the information content of the system expressed by the constraints on randomness; all else dissipates to maximum entropy as the pattern converges to its limiting distribution defined by its informational constraints (Van Campenhout & Cover, 1981).

The discrete uniform distribution

We can find the most probable distribution for a particular problem by the method of maximum entropy. We simply solve for the probability distribution that maximizes entropy subject to the constraint that the distribution must satisfy any measurable information that we can obtain about the distribution or any assumption that we make about the distribution.

Consider the simplest problem, in which we know that y falls within some bounds a \le y \le b, and we require that the total probability sums to 1, \sum_y p_y = 1. We must also specify what values y may take on between a and b. In the first case, restrict y to the values of integers, so that y = a, a+1, a+2, \ldots, b, and there are N = b - a + 1 possible values for y.

We find the maximum entropy distribution by maximizing Shannon entropy, H, subject to the constraint that the total probability sums to 1, \sum_y p_y = 1. We can abbreviate this constraint as P = \sum_y p_y - 1. By the method of Lagrangian multipliers, this yields the quantity to be maximized as

K = H - wP = -\sum_y p_y \log(p_y) - w\left(\sum_y p_y - 1\right).

Solving \partial K / \partial p_y = 0 for each y gives p_y = e^{-(1+w)}, the same value for every y. The constraint that the sum over all probabilities is 1 thus gives

\sum_{y=a}^{b} p_y = \sum_{y=a}^{b} e^{-(1+w)} = N e^{-(1+w)} = 1,

where N arises because y takes on N different values ranging from a to b. From this equation, e^{-(1+w)} = 1/N, yielding the uniform distribution

p_y = 1/N

for y = a, \ldots, b. This result simply says that if we do not know anything except the possible values of our observations and the fact that the total probability is 1, then we should consider all possible outcomes equally (uniformly) probable. The uniform distribution is sometimes discussed as the expression of ignorance or lack of information.

In observations of nature, we usually can obtain some additional information from measurement or from knowledge about the structure of the problem. Thus, the uniform distribution does not describe well many patterns of nature, but rather arises most often as an expression of prior ignorance before we obtain information.

The continuous uniform distribution

The previous section derived the uniform distribution in which the observations y take on integer values a, a+1, a+2, \ldots, b. In this section, I show the steps and notation for the continuous uniform case. See Jaynes (2003) for technical issues that may arise when analysing maximum entropy for continuous variables.

Everything is the same as the previous section, except that y can take on any continuous value between a and b. We can move from the discrete case to the continuous case by writing the possible values of y as a, a+dy, a+2dy, \ldots, b. In the discrete case above, dy = 1. In the continuous case, we let dy \to 0, that is, we let dy become arbitrarily small. Then the number of steps between a and b is (b - a)/dy.

The analysis is exactly as above, but each increment must be weighted by dy, and instead of writing

\sum_{y=a}^{b} p_y \, dy = 1,

we take the integral over the continuous interval in the limit as dy \to 0. From the prior section, we know \partial K / \partial p_y = 0 leads to p_y = e^{-(1+w)}, thus

\int_a^b p_y \, dy = \int_a^b e^{-(1+w)} \, dy = (b - a)\, e^{-(1+w)} = 1,

where b - a arises because \int_a^b dy = b - a, thus e^{-(1+w)} = 1/(b - a), and

p_y = \frac{1}{b - a},

which is the uniform distribution over the continuous interval between a and b.
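The derivation above can also be checked numerically. The sketch below (an illustrative addition, assuming Python with NumPy and SciPy available; N = 6 and the random starting point are arbitrary) maximizes Shannon entropy subject only to the normalization constraint and recovers the uniform distribution.

import numpy as np
from scipy.optimize import minimize

N = 6

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))            # negative of H; minimizing this maximizes entropy

constraints = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}]   # total probability = 1
bounds = [(0.0, 1.0)] * N
p0 = np.random.dirichlet(np.ones(N))        # arbitrary starting distribution

res = minimize(neg_entropy, p0, bounds=bounds, constraints=constraints)
print(np.round(res.x, 4))                   # approximately [1/6, 1/6, ..., 1/6]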
The binomial distribution

The binomial distribution describes the outcome of a series of i = 1, 2, \ldots, N observations or trials. Each observation can take on one of two values, x_i = 0 or x_i = 1, for the ith observation. For convenience, we refer to an observation of 1 as a success, and an observation of zero as a failure. We assume each observation is independent of the others, and the probability of a success on any trial is a_i, where a_i may vary from trial to trial. The total number of successes over N trials is y = \sum x_i.

Suppose this is all the information that we have. We know that our random variable, y, can take on a series of integer values, y = 0, 1, \ldots, N, because we may have between zero and N total successes in N trials. Define the probability distribution as p_y, the probability that we observe y successes in N trials. We know that the probabilities sum to 1. Given only that information, it may seem, at first glance, that the maximum entropy distribution would be uniform over the possible outcomes for y. However, the structure of the problem provides more information, which we must incorporate.

How many different ways can we obtain y = 0 successes in N trials? Just one: a series of failures on every trial. How many different ways can we obtain y = 1 success? There are N different ways: a success on the first trial and failures on the others; a success on the second trial, and failures on the others; and so on.

The uniform solution by maximum entropy tells us that each different combination is equally likely. Because each value of y maps to a different number of combinations, we must make a correction for the fact that measurements on y are distinct from measurements on the equally likely combinations. In particular, we must formulate a measure, m_y, that accounts for how the uniformly distributed basis of combinations translates into variable values of the number of successes, y. Put another way, y is invariant to changes in the order of outcomes given a fixed number of successes. That invariance captures a lack of information that must be included in our analysis.

This use of a transform to capture the nature of measurement in a particular problem recurs in analyses of entropy. The proper analysis of entropy must be made with respect to the underlying measure. We replace the Shannon entropy with the more general expression

S = -\sum p_y \log\left(\frac{p_y}{m_y}\right),   (1)

a measure of relative entropy that is related to the Kullback–Leibler divergence (Kullback, 1959). When m_y is a constant, expressing a uniform transformation, then we recover the standard expression for Shannon entropy.

In the binomial sampling problem, the number of combinations for each value of y is

m_y = \binom{N}{y} = \frac{N!}{y!\,(N-y)!}.   (2)

Suppose that we also know the expected number of successes in a series of N trials, given as \langle y \rangle = \sum y p_y, where I use the physicists' convention of angle brackets for the expectation of the quantity inside. Earlier, I defined a_i as the probability of success in the ith trial. Note that the average probability of success per trial is \langle y \rangle / N = \langle a \rangle. For convenience, let a = \langle a \rangle, thus the expected number of successes is \langle y \rangle = Na.

What is the maximum entropy distribution given all of the information that we have, including the expected number of successes? We proceed by maximizing S subject to the constraint that all the probabilities must add to 1 and subject to the constraint, C_1 = \sum y p_y - Na, that the mean number of successes must be \langle y \rangle = Na. The quantity to maximize is

K = S - wP - k_1 C_1 = -\sum p_y \log\left(\frac{p_y}{m_y}\right) - w\left(\sum p_y - 1\right) - k_1\left(\sum y p_y - Na\right).

Differentiating with respect to p_y and setting to zero yields

p_y = k \binom{N}{y} e^{-k_1 y},   (3)

where k = e^{-(1+w)}, in which k and k_1 are two constants that must be chosen to satisfy the two constraints \sum p_y = 1 and \sum y p_y = Na. The constants k = (1-a)^N and e^{-k_1} = a/(1-a) satisfy the two constraints (Sivia & Skilling, 2006, pp. 115–120) and yield the binomial distribution

p_y = \binom{N}{y} a^y (1-a)^{N-y}.

Here is the important conclusion. If all of the information available in measurement reduces to knowledge that we are observing the outcome of a series of binary trials and to knowledge of the average number of successes in N trials, then the observations will follow the binomial distribution.

In the classic sampling theory approach to deriving distributions, one generates a binomial distribution by a series of independent, identical binary trials in which the probability of success per trial does not vary between trials. That generative neutral model does create a binomial process – it is a sufficient condition.

However, many distinct processes may also converge to the binomial pattern. One only requires information about the trial-based sampling structure and about the expected number of successes over all trials. The probability of success may vary between trials (Yu, 2008). Distinct aggregations of small-scale processes may smooth to reveal only those two aspects of information – sampling structure and average number of successes – with other measurable forms of information cancelling in the aggregate. Thus, the truly fundamental nature of the binomial pattern arises not from the neutral generative model of identical, independent binary trials, but from the measurable information in observations. I discuss some additional processes that converge to the binomial in a later section on limiting distributions.
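A short numerical check of the binomial result: the sketch below (an illustrative Python fragment; N = 10 and a = 0.3 are arbitrary) confirms that the quoted constants k = (1 - a)^N and e^{-k_1} = a/(1 - a) satisfy the two constraints and reproduce the binomial form.

from math import comb

N, a = 10, 0.3
k = (1 - a) ** N
r = a / (1 - a)                      # r stands for exp(-k1)

p = [k * comb(N, y) * r ** y for y in range(N + 1)]

print(sum(p))                                   # constraint 1: total probability = 1
print(sum(y * py for y, py in enumerate(p)))    # constraint 2: mean = N*a = 3.0
print(all(abs(py - comb(N, y) * a**y * (1 - a)**(N - y)) < 1e-12
          for y, py in enumerate(p)))           # identical to the binomial distribution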
The Poisson distribution

One often observes a Poisson distribution when counting the number of observations per unit time or per unit area. The Poisson occurs so often because it arises from the 'law of small numbers', in which aggregation of various processes converges to the Poisson when the number of counts per unit is small. Here, I derive the Poisson as a maximum entropy distribution subject to a constraint on the sampling process and a constraint on the mean number of counts per unit (Sivia & Skilling, 2006, p. 121).

Suppose the unit of measure, such as time or space, is divided into a great number, N, of very small intervals. For whatever item or event we are counting, each interval contains a count of either zero or one that is independent of the counts in other intervals. This subdivision leads to a binomial process. The measure for the number of different ways a total count of y = 0, 1, \ldots, N can arise in the N subdivisions is given by m_y of eqn 2. With large N, we can express this measure by using Stirling's approximation

N! \approx \sqrt{2\pi N}\,(N/e)^{N},

where e is the base for the natural logarithm. Using this approximation for large N, we obtain

m_y = \binom{N}{y} = \frac{N!}{y!\,(N-y)!} \approx \frac{N^y}{y!}.

Entropy maximization yields eqn 3, in which we can use the large N approximation for m_y to yield

p_y = k\,\frac{x^y}{y!},

in which x = N e^{-k_1}. From this equation, the constraint \sum p_y = 1 leads to the identity \sum_y x^y / y! = e^{x}, which implies k = e^{-x}. The constraint \sum y p_y = \langle y \rangle = \mu leads to the identity \sum_y y x^y / y! = x e^{x}, which implies x = \mu. These substitutions for k and x yield the Poisson distribution

p_y = \frac{\mu^{y} e^{-\mu}}{y!}.

The general solution

All maximum entropy problems have the same form. We first evaluate our information about the scale of observations and the sampling scheme. We use this information to determine the measure m_y in the general expression for relative entropy in eqn 1. We then set n additional constraints that capture all of the available information, each constraint expressed as C_i = \sum_y f_i(y) p_y - \langle f_i(y) \rangle, where the angle brackets denote the expected value. If the problem is continuous, we use integration as the continuous limit of the summation. We always use P = \sum_y p_y - 1 to constrain the total probability to 1. We can use any function of y for the other f_i according to the appropriate constraints for each problem. For example, if f_i(y) = y, then we constrain the final distribution to have mean \langle y \rangle.

To find the maximum entropy distribution, we maximize

K = S - wP - \sum_{i=1}^{n} k_i C_i

by differentiating with respect to p_y, setting to zero and solving. This calculation yields

p_y = k m_y e^{-\sum_i k_i f_i(y)},   (4)

where we choose k so that the total probability is 1: \sum_y p_y = 1 or, in the continuous limit, \int p_y \, dy = 1. For the additional n constraints, we choose each k_i so that \sum_y f_i(y) p_y = \langle f_i(y) \rangle for all i = 1, 2, \ldots, n, using integration rather than summation in the continuous limit.

To solve a particular problem, we must choose a proper measure m_y. In the binomial and Poisson cases, we found m_y by first considering the underlying uniform scale, in which each outcome is assumed to be equally likely in the absence of additional information. We then calculated the relative weighting of each measure on the y scale, in which, for each y, variable numbers of outcomes on the underlying uniform scale map to that value of y. That approach for calculating the transformation from underlying and unobserved uniform measures to observable nonuniform measures commonly solves sampling problems based on combinatorics.

In continuous cases, we must choose m_y to account for the nature of information provided by measurement. Notationally, m_y means a function m of the value y, alternatively written as m(y). However, I use subscript notation to obtain an equivalent and more compact expression.

We find m_y by asking in what ways would changes in measurement leave unchanged the information that we obtain. That invariance under transformation of the measured scale expresses the lack of information obtained from measurement. Our analysis must always account for the presence or absence of particular information.

Suppose that we cannot directly observe y; rather, we can observe only a transformed variable x = g(y). Under what conditions do we obtain the same information from direct measurement of y or indirect measurement of the transformed value x? Put another way, how can we choose m so that the information we obtain is invariant to certain transformations of our measurements?
From the general solution in eqn 4, we use for this problem m_y = 1/y and f_1 = \log(y), yielding

p_y = (k/y)\, e^{-k_1 \log(y)} = (k/y)\, y^{-k_1} = k y^{-(1+k_1)}.

Power law distributions typically hold above some lower bound, L \ge 0. I derive the distribution of 1 \le L < y < \infty as an example. From the constraint that the total probability is 1,

\int_L^{\infty} k y^{-(1+k_1)} \, dy = k L^{-k_1} / k_1 = 1,

yielding k = k_1 L^{k_1}. Next we solve for k_1 by using the constraint on \langle \log(y) \rangle to write

\langle \log(y) \rangle = \int_L^{\infty} \log(y)\, p_y \, dy = \int_L^{\infty} \log(y)\, k_1 L^{k_1} y^{-(1+k_1)} \, dy = \log(L) + 1/k_1.

Using \delta = k_1, we obtain \delta = 1/\langle \log(y/L) \rangle, yielding

p_y = \delta L^{\delta} y^{-(1+\delta)}.   (6)

If we choose L = 1, then

p_y = \delta y^{-(1+\delta)},

where 1/\delta is the geometric mean of y in excess of the lower bound L. Note that the total probability in the upper tail is (L/y)^{\delta}. Typically, one only refers to power law or 'fat tails' for \delta < 2.

Power laws, entropy and constraint

There is a vast literature on power laws. In that literature, almost all derivations begin with a particular neutral generative model, such as Simon's (1955) preferential attachment model for the frequency of words in languages (see above). By contrast, I showed that a power law arises simply from an assumption about the measurement scale and from information about the geometric mean. This view of the power law shows the direct analogy with the exponential distribution: setting the geometric mean attracts aggregates towards a power law distribution; setting the arithmetic mean attracts aggregates towards an exponential distribution. This sort of informational derivation of the power law occurs in the literature (e.g. Kapur, 1989; Kleiber & Kotz, 2003), but appears rarely and is almost always ignored in favour of specialized generative models.

Recently, much work in theoretical physics attempts to find maximum entropy derivations of power laws (e.g. Abe & Rajagopal, 2000) from a modified approach called Tsallis entropy (Tsallis, 1988, 1999). The Tsallis approach uses a more complex definition of entropy but typically applies a narrower concept of constraint than I use in this paper. Those who follow the Tsallis approach apparently do not accept a constraint on the geometric mean as a natural physical constraint, and seek to modify the definition of entropy so that they can retain the arithmetic mean as the fundamental constraint of location. Perhaps in certain physical applications it makes sense to retain a limited view of physical constraints. But from the broader perspective of pattern, beyond certain physical applications, I consider the geometric mean as a natural informational constraint that arises from measurement or assumption. By this view, the simple derivation of the power law given here provides the most general outlook on the role of information in setting patterns of nature.

The gamma distribution

If the average displacement from an arbitrary point captures all of the information in a sample about the probability distribution, then observations follow the exponential distribution. If the average logarithmic displacement captures all of the information in a sample, then observations follow the power law distribution. Displacements are non-negative values measured from a reference point.

In this section, I show that if the average displacement and the average logarithmic displacement together contain all the information in a sample about the underlying probability distribution, then the observations follow the gamma distribution.

No transformation preserves the information in both direct and logarithmic measures apart from uniform scaling, x = ay. Thus, m_y is a constant and drops out of the analysis in the general solution given in eqn 4. From the general solution, we use the constraint on the mean, f_1 = y, and the constraint on the mean of the logarithmic values, f_2 = \log(y), yielding

p_y = k e^{-k_1 y - k_2 \log(y)} = k y^{-k_2} e^{-k_1 y}.

We solve for the three unknowns k, k_1 and k_2 from the constraints on the total probability, the mean, and the mean logarithmic value (geometric mean). For convenience, make the substitutions \mu = k_1 and r = 1 - k_2. Using each constraint in turn and solving for each of the unknowns yields the gamma distribution

p_y = \frac{\mu^{r}}{\Gamma(r)}\, y^{r-1} e^{-\mu y},

where \Gamma is the gamma function, the average value is \langle y \rangle = r/\mu and the average logarithmic value is \langle \log(y) \rangle = -\log(\mu) + \Gamma'(r)/\Gamma(r), where the prime denotes differentiation with respect to r. Note that the gamma distribution is essentially a product of a power law, y^{r-1}, and an exponential, e^{-\mu y}, representing the combination of the independent constraints on the geometric and arithmetic means.

The fact that both linear and logarithmic measures provide information suggests that measurements must be made in relation to an absolute fixed point. The
too large as long as the squared deviation (variance) for that perturbation is not, on average, infinitely large relative to the other fluctuations.

One encounters in the literature many special cases of the central limit theorem. The essence of each special case comes down to information. Suppose some process of aggregation leads to a probability distribution that can be observed. If all of the information in the observations about the probability distribution is summarized completely by the variance, then the distribution is Gaussian. We ignore the mean, because the mean pins the distribution to a particular location, but does not otherwise change the shape of the distribution.

Similarly, suppose the variance is the only constraint we can assume about an unobserved distribution – equivalently, suppose we know only the precision of observations about the location of the mean, because the variance defines precision. If we can set only the precision of observations, then we should assume the observations come from a Gaussian distribution.

We do not know all of the particular generative processes that converge to the Gaussian. Each particular statement of the central limit theorem provides one specification of the domain of attraction – a subset of the generative models that do in the limit take on the Gaussian shape. I briefly mention three forms of the central limit theorem to give a sense of the variety of expressions.

First, for any random variable with finite variance, the sum of independent and identical random variables converges to a Gaussian as the number of observations in the sum increases. This statement is the most common in the literature. It is also the least general, because it requires that each observation in the sum come from the same identical distribution, and that each observation be independent of the other observations.

Second, the Lindeberg condition does not require that each observation come from the same identical distribution, but it does require that each observation be independent of the others and that the variance be finite for each random variable contributing to the sum (Feller, 1971). In practice, for a sequence of n measurements with sum Z_n = \sum X_i / \sqrt{n} for i = 1, \ldots, n, and if \sigma_i^2 is the variance of the ith variable so that V_n = \sum \sigma_i^2 / n is the average variance, then Z_n approaches a Gaussian as long as no single variance \sigma_i^2 dominates the average variance V_n.

Third, the martingale central limit theorem defines a generative process that converges to a Gaussian in which the random variables in a sequence are neither identical nor independent (Hall & Heyde, 1980). Suppose we have a sequence of observations, X_t, at successive times t = 1, \ldots, T. If the expected value of each observation equals the value observed in the prior time period, and the variance in each time period, \sigma_t^2, remains finite, then the sequence X_t is a martingale that converges in distribution to a Gaussian as time increases. Note that the distribution of each X_t depends on X_{t-1}; the distribution of X_{t-1} depends on X_{t-2}; and so on. Therefore each observation depends on all prior observations.

Extension of the central limit theorem remains a very active field of study (Johnson, 2004). A deeper understanding of how aggregation determines the patterns of nature justifies that effort.

In the end, information remains the key. When all information vanishes except the variance, pattern converges to the Gaussian distribution. Information vanishes by repeated perturbation. Variance and precision are equivalent for a Gaussian distribution: the information (precision) contained in an observation about the average value is the reciprocal of the variance, 1/\sigma^2. So we may say that the Gaussian distribution is the purest expression of information or error in measurement (Stigler, 1986).

As the variance goes to infinity, the information per observation about the location of the average value, 1/\sigma^2, goes to zero. It may seem strange that an observation could provide no information about the mean value. But some of the deepest and most interesting aspects of pattern in nature can be understood by considering the Gaussian distribution to be a special case of a wider class of limiting distributions with potentially infinite variance.

When the variance is finite, the Gaussian pattern follows, and observations provide information about the mean. As the variance becomes infinite because of occasional large fluctuations, one loses all information about the mean, and patterns follow a variety of power law type distributions. Thus, Gaussian and power law patterns are part of a single wider class of limiting distributions, the Lévy stable distributions. Before I turn to the Lévy stable distributions, I must develop the concept of aggregation more explicitly.

Aggregation: summation and its meanings

Our understanding of aggregation and the common patterns in nature arises mainly from concepts such as the central limit theorem and its relatives. Those theorems tell us what happens when we sum up random processes.

Why should addition be the fundamental concept of aggregation? Think of the complexity in how processes combine to form the input–output relations of a control network, or the complexity in how numerous processes influence the distribution of species across a natural landscape.

Three reasons support the use of summation as a common form of aggregation. First, multiplication and division can be turned into addition or subtraction by taking logarithms. For example, the multiplication of numerous processes often smooths into a Gaussian distribution on the logarithmic scale, leading to the log-normal distribution.
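The following simulation sketch (an illustrative Python addition assuming NumPy; the exponential components, the number of components and the sample sizes are arbitrary choices) shows both points at once: sums of skewed components lose their skew as they aggregate towards the Gaussian, and products become approximately Gaussian on the logarithmic scale.

import numpy as np

rng = np.random.default_rng(0)
n_components, n_samples = 100, 50000

draws = rng.exponential(scale=1.0, size=(n_samples, n_components))  # strongly skewed components

sums = draws.sum(axis=1)
log_products = np.log(draws).sum(axis=1)     # log of the product of the components

def skew(x):
    x = x - x.mean()
    return np.mean(x**3) / np.mean(x**2) ** 1.5

print(skew(rng.exponential(size=n_samples)))  # a single component: skewness near 2
print(skew(sums))                              # the aggregate sum: skewness near 0 (close to Gaussian)
print(skew(log_products))                      # the product on the log scale: also near 0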
[Figure: (a) FT of sum; (b) Sum; (c) Normalized sum; (d) FT of normalized sum.]

…about the spectral distribution. Thus,

H'(s) = k e^{-k_1 s^2}.

We need to choose k so that \int H'(s) \, ds = 1 and choose k_1 so that \int s^2 H'(s) \, ds = \langle s^2 \rangle. The identical problem arose when

The frequency interpretation via sine and cosine curves arises from the fact that e^{is} = \cos(s) + i \sin(s). Thus one can expand the Fourier transform into a series of sine and cosine components expressed in terms of frequency s.
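A small Monte Carlo check (an illustrative Python sketch assuming NumPy; the particular distributions and frequencies are arbitrary) of the property that underlies this Fourier view of aggregation: for independent variables, the Fourier transform of the distribution of the sum, that is the characteristic function, equals the product of the individual transforms.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200000)
y = rng.exponential(1.0, 200000) - 1.0        # independent of x, shifted to mean zero

def char_fn(sample, s):
    return np.mean(np.exp(1j * s * sample))   # estimate of E[exp(i s X)]

for s in [0.5, 1.0, 2.0]:
    lhs = char_fn(x + y, s)
    rhs = char_fn(x, s) * char_fn(y, s)
    print(s, abs(lhs - rhs))                  # differs only by small Monte Carlo error at each frequency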
The inverse transformation demonstrates the full preservation of information when changing between x and s, given as

f(x) = F^{-1}\{F(s)\} = \frac{1}{2\pi} \int_{-\infty}^{\infty} F(s)\, e^{ixs} \, ds.

The Lévy stable distributions

When we sum variables with finite variance, the distribution of the sum converges to a Gaussian. Summation is one particular generative process that leads to a Gaussian. Alternatively, we may consider the distributional problem from an information perspective. If all we know about a distribution is the variance – the precision of information in observations with respect to the mean – then the most likely distribution derived by maximum entropy is Gaussian. From the more general information perspective, summation is just one particular generative model that leads to a Gaussian. The generative and information approaches provide distinct and complementary ways in which to understand the common patterns of nature.

In this section, I consider cases in which the variance can be infinite. Generative models of summation converge to a more general form, the Lévy stable distributions. The Gaussian is just a special case of the Lévy stable distributions – the special case of finite variance. From an information perspective, the Lévy stable distributions arise as the most likely pattern given knowledge only about the moments of the frequency spectrum in the Fourier domain. In the previous section, I showed that information about the spectral variance leads, by maximum entropy, to the Gaussian distribution. In this section, I show that information about other spectral moments, such as the mean of the spectral distribution, leads to other members from the family of Lévy stable distributions. The other members, besides the Gaussian, have infinite variance.

When would the variance be infinite? Perhaps never in reality. More realistically, the important point is that observations with relatively large values occur often enough that a set of observations provides very little information about the average value.

Large variance and the law of large numbers

Consider a random variable X with mean \mu and variance \sigma^2. The sum

Z_n = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}

is the sample mean, \bar{X}, for a set of n independent observations of the variable X. If \sigma^2 is finite then, by the central limit theorem, we know that \bar{X} has a Gaussian distribution with mean \mu and variance \sigma^2/n. As n increases, the variance \sigma^2/n becomes small, and \bar{X} converges to the mean \mu. We can think of \sigma^2/n, the variance of \bar{X}, as the spread of the estimate \bar{X} about the true mean, \mu. Thus the inverse of the spread, n/\sigma^2, measures the precision of the estimate. For finite \sigma^2, as n goes to infinity, the precision n/\sigma^2 also becomes infinite as \bar{X} converges to \mu.

If the variance \sigma^2 is very large, then the precision n/\sigma^2 remains small even as n increases. As long as n does not exceed \sigma^2, precision is low. Each new observation provides additional precision, or information, about the mean in proportion to 1/\sigma^2. As \sigma^2 becomes very large, the information about the mean per observation approaches zero.

For example, consider the power law distribution for X with pdf 1/x^2 for x > 1. The probability of observing a value of X greater than k is 1/k. Thus, any new observation can be large enough to overwhelm all previous observations, regardless of how many observations we have already accumulated. In general, new observations occasionally overwhelm all previous observations whenever the variance is infinite, because the precision added by each observation, 1/\sigma^2, is zero. A sum of random variables, or a random walk, in which any new observation can overwhelm the information about location in previous observations, is called a Lévy flight.

Infinite variance can be characterized by the total probability in the extreme values, or the tails, of a probability distribution. For a distribution f(x), if the probability of |x| being greater than k is greater than 1/k^2 for large k, then the variance is infinite. By considering large values of k, we focus on how much probability there is in the tails of the distribution. One says that when the total probability in the tails is greater than 1/k^2, the distribution has 'fat tails', the variance is infinite, and a sequence follows a Lévy flight.

Variances can be very large in real applications, but probably not infinite. Below I discuss truncated Lévy flights, in which probability distributions have a lot of weight in the tails and high variance. Before turning to that practical issue, it helps first to gain full insight into the case of infinite variance. Real cases of truncated Lévy flights with high variance tend to fall between the extremes of the Gaussian with moderate variance and the Lévy stable distributions with infinite variance.

Generative models of summation

Consider the sum

Z_n = \frac{1}{n^{1/\alpha}} \sum_{i=1}^{n} X_i   (17)

for independent observations of the random variable X with mean zero. If the variance of X is finite and equal to \sigma^2, then with \alpha = 2 the distribution of Z_n converges to a Gaussian with mean zero and variance \sigma^2 by the central
limit theorem. In the Fourier domain, the distribution of the Gaussian has Fourier transform

H(s) = e^{-c^{\alpha} |s|^{\alpha}}   (18)

with \alpha = 2, as given in eqn 12. If \alpha < 2, then the variance of X is infinite, and the fat tails are given by the total probability 1/|x|^{\alpha} above large values of |x|. The Fourier transform of the distribution of the sum Z_n is given by eqn 18. The shape of the distribution of X does not matter, as long as the tails follow a power law pattern.

Distributions with Fourier transforms given by eqn 18 are called Lévy stable distributions. The full class of Lévy stable distributions has a more complex Fourier transform with additional parameters for the location and skew of the distribution. The case given here assumes distributions in the direct domain, x, are symmetric with a mean of zero. We can write the symmetric Lévy stable distributions in the direct domain only for \alpha = 2, which is the Gaussian, and for \alpha = 1, which is the Cauchy distribution given by

h(x) = \frac{c}{\pi (c^2 + x^2)}.

As (x/c)^2 increases above about 10, the Cauchy distribution approximately follows a pdf with a power law distribution 1/|x|^{1+\alpha}, with total probability in the tails 1/|x|^{\alpha} for \alpha = 1.

In general, the particular forms of the Lévy stable distributions are known only from the forms of their Fourier transforms. The fact that the general forms of these distributions have a simple expression in the Fourier domain occurs because the Fourier domain is the natural expression of aggregation by summation. In the direct domain, the symmetric Lévy stable distributions approximately follow power laws with probability 1/|x|^{1+\alpha} as |x|/c increases beyond a modest threshold value.

These Lévy distributions are called 'stable' because they have two properties that no other distributions have. First, all infinite sums of independent observations from the same distribution converge to a Lévy stable distribution. Second, the properly normalized sum of two or more Lévy stable distributions is a Lévy stable distribution of the same form.

These properties cause aggregations to converge to the Lévy stable form. Once an aggregate has converged to this form, combining with another aggregate tends to keep the form stable. For these reasons, the Lévy stable distributions play a dominant role in the patterns of nature.

Maximum entropy: information and the stable distributions

I showed in eqn 15 that

H(s) = e^{-c^2 s^2}

is the maximum entropy pattern in the Fourier domain given information about the spectral variance. The spectral variance in this case is \int s^2 H'(s) \, ds = \langle s^2 \rangle. If we follow the same derivation for maximum entropy in the Fourier domain, but use the general expression for the \alphath spectral moment for \alpha \le 2, given by \int |s|^{\alpha} H'(s) \, ds = \langle |s|^{\alpha} \rangle, we obtain the general expression for maximum entropy subject to information about the \alphath spectral moment as

H(s) = e^{-c^{\alpha} |s|^{\alpha}}.   (19)

This expression matches the general form obtained by n-fold convolution of the sum in eqn 17 converging to eqn 18 by Fourier analysis. The value of \alpha does not have to be an integer: it can take on any value between 0 and 2. [Here, c^{\alpha} = 1/\alpha \langle |s|^{\alpha} \rangle, and, in eqn 14, k = \alpha c / 2\Gamma(1/\alpha).]

The sum in eqn 17 is a particular generative model that leads to the Fourier pattern given by eqn 18. By contrast, the maximum entropy model uses only information about the \alphath spectral moment and derives the same result. The maximum entropy analysis has no direct tie to a generative model. The maximum entropy result shows that any generative model or any other sort of information or assumption that sets the same informational constraints yields the same pattern.

What does the \alphath spectral moment mean? For \alpha = 2, the moment measures the variance in frequency when we weight frequencies by their intensity of contribution to pattern. For \alpha = 1, the moment measures the average frequency weighted by intensity. In general, as \alpha declines, we weight more strongly the lower frequencies in characterizing the distribution of intensities. Lower frequencies correspond to more extreme values in the direct domain, x, because low frequency waves spread more widely. So, as \alpha declines, we weight more heavily the tails of the probability distribution in the direct domain. In fact, the weighting \alpha corresponds exactly to the weight in the tails of 1/|x|^{\alpha} for large values of x.

Numerous papers discuss spectral moments (e.g. Benoit et al., 1992; Eriksson et al., 2004). I could find in the literature only a very brief mention of using maximum entropy to derive eqn 19 as a general expression of the symmetric Lévy stable distributions (Bologna et al., 2002). It may be that using spectral moments in maximum entropy is not considered natural by the physicists who work in this area. Those physicists have discussed extensively alternative definitions of entropy by which one may understand the stable distributions (e.g. Abe & Rajagopal, 2000). My own view is that there is nothing unnatural about spectral moments. The Fourier domain captures the essential features by which aggregation shapes information.

Truncated Lévy flights

The variance is not infinite in practical applications. Finite variance means that aggregates eventually converge to a Gaussian as the number of the components in the aggregate increases. Yet many observable patterns
have the power law tails that characterize the Lévy distributions that arise as attractors with infinite variance. Several attempts have been made to resolve this tension between the powerful attraction of the Gaussian for finite variance and the observable power law patterns. The issue remains open. In this section, I make a few comments about the alternative perspectives of generative and informational views.

The generative approach often turns to truncated Lévy flights to deal with finite variance (Mantegna & Stanley, 1994, 1995; Voit, 2005; Mariani & Liu, 2007). In the simplest example, each observation comes from a distribution with a power law tail such as eqn 6 with L = 1, repeated here

p_y = d y^{-(1+d)},

for 1 ≤ y < ∞. The variance of this distribution is infinite. If we truncate the tail such that 1 ≤ y ≤ U, and normalize so that the total probability is 1, we get the distribution

p_y = d y^{-(1+d)} / (1 - U^{-d}),

which for large U is essentially the same distribution as the standard power law form, but with the infinite tail truncated. The truncated distribution has finite variance. Aggregation of the truncated power law will eventually converge to a Gaussian. But the convergence takes a very large number of components, and the convergence can be very slow. For practical cases of finite aggregation, the sum will often look somewhat like a power law or a Lévy stable distribution.
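As a rough illustration of this slow convergence (a sketch of my own, not taken from the original article; the exponent d = 1.5, the truncation point U = 10^4 and the replicate counts are arbitrary choices), the fragment below samples the truncated power law by inverting its cumulative distribution and tracks the excess kurtosis of sums of n components. A Gaussian has excess kurtosis zero; here the estimate should stay far above zero even for fairly large n, although the estimates are noisy because the fourth moment is dominated by rare large values.

    import numpy as np

    rng = np.random.default_rng(2)
    d, U = 1.5, 10_000.0              # tail exponent and truncation point (illustrative)

    def truncated_power_law(size):
        """Inverse-cdf sampling of p_y = d*y**-(1+d) / (1 - U**-d) on 1 <= y <= U."""
        u = rng.random(size)
        return (1.0 - u * (1.0 - U**-d))**(-1.0 / d)

    for n in (1, 10, 100, 1000):
        s = truncated_power_law((10_000, n)).sum(axis=1)
        z = (s - s.mean()) / s.std()
        print(n, (z**4).mean() - 3.0)   # excess kurtosis; 0 for a Gaussian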
I showed earlier that power laws arise by maximum entropy when one has information only about ⟨log(y)⟩ and the lower bound, L. Such distributions have infinite variance, which may be unrealistic. The assumption that the variance must be finite means that we must constrain maximum entropy to account for that assumption. Finite variance implies that ⟨y^2⟩ is finite. We may therefore consider the most likely pattern arising simply from maximum entropy subject to the constraints of minimum value, L, geometric mean characterized by ⟨log(y)⟩, and finite second moment given by ⟨y^2⟩. For this application, we will usually assume that the second moment is large to preserve the power law character over most of the range of observations. With those assumptions, our standard procedure for maximum entropy in eqn 4 yields the distribution

p_y = k y^{-(1+d)} e^{-c y^2},   (20)

where I have here used notation for the k constants of eqn 4 with the substitutions k_1 = 1 + d and k_2 = c. No simple expressions give the values of k, d and c; those values can be calculated from the three constraints: ∫ p_y dy = 1, ∫ y^2 p_y dy = ⟨y^2⟩ and ∫ log(y) p_y dy = ⟨log(y)⟩. If we assume large but finite variance, then ⟨y^2⟩ is large and c will be small. As long as values of cy^2 remain much less than 1, the distribution follows the standard power law form of eqn 6. As cy^2 grows above 1, the tail of the distribution approaches zero more rapidly than a Gaussian.

I emphasize this informational approach to truncated power laws because it seems most natural to me. If all we know is a lower bound, L, a power law shape of the distribution set by d through the observable range of magnitudes, and a finite variance, then a form such as eqn 20 is most likely.
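A quick way to see the two regimes of eqn 20 is to normalize it numerically and compare it with the pure power law over a range of magnitudes. The sketch below is illustrative only; the values d = 1.5, c = 10^-6 and L = 1 are arbitrary choices rather than values taken from the text.

    import numpy as np
    from scipy.integrate import quad

    d, c, L = 1.5, 1e-6, 1.0

    def shape(y):
        return y**-(1.0 + d) * np.exp(-c * y * y)

    k = 1.0 / quad(shape, L, np.inf)[0]    # normalization constant for eqn 20

    # Below y ~ 1/sqrt(c) = 1000 the density tracks the pure power law k*y**-(1+d);
    # above that point the exp(-c*y**2) factor cuts the tail off sharply.
    for y in (10.0, 100.0, 1000.0, 3000.0):
        print(y, k * shape(y), k * y**-(1.0 + d))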
Why are power laws so common?

Because spectral distributions tend to converge to their maximum entropy form

H(s) = e^{-c|s|^a}.   (21)

With finite variance, a very slowly tends towards 2, leading to the Gaussian for aggregations in which component perturbations have truncated fat tails with finite variance. If the number of components in the aggregate is not huge, then such aggregates may often closely follow eqn 21 or a simple power law through the commonly measurable range of values, as in eqn 20. Put another way, the geometric mean often captures most of the information about a process or a set of data with respect to the underlying distribution.

This informational view of power laws does not favour or rule out particular hypotheses about generative mechanisms. For example, word usage frequencies in languages might arise by particular processes of preferential attachment, in which each additional increment of usage is allocated in proportion to current usage. But we must recognize that any process, in the aggregate, that preserves information about the geometric mean, and tends to wash out other signals, converges to a power law form. The consistency of a generative mechanism with observable pattern tells us little about how likely that generative mechanism was in fact the cause of the observed pattern. Matching generative mechanism to observed pattern is particularly difficult for common maximum entropy patterns, which are often attractors consistent with many distinct generative processes.
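As an example of one such generative mechanism, the sketch below (not from the original article; the innovation probability p_new = 0.05, the run length and the helper name simon_usage are my own choices) runs a Simon (1955)-style preferential attachment process for word usage and prints rank-frequency values that fall roughly on a straight line on log-log axes. The point of the surrounding argument is that such a match does not single out this mechanism from the many other processes attracted to the same power law form.

    import numpy as np

    rng = np.random.default_rng(3)

    def simon_usage(steps, p_new=0.05):
        """With probability p_new the next token is a new word; otherwise an existing
        token is copied, so each word is repeated in proportion to its current usage."""
        tokens = [0]
        n_words = 1
        for _ in range(steps):
            if rng.random() < p_new:
                tokens.append(n_words)
                n_words += 1
            else:
                tokens.append(tokens[rng.integers(len(tokens))])
        return np.bincount(tokens)

    counts = np.sort(simon_usage(50_000))[::-1]
    for rank in (1, 10, 100, 1000):
        print(rank, counts[rank - 1])   # frequency of the word at this usage rank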
Extreme value theory

The Lévy stable distributions express widespread patterns of nature that arise through summation of perturbations. Summing logarithms is the same as multiplication. Thus, the Lévy stable distributions, including the special Gaussian case, also capture multiplicative interactions. Extreme values define the other great class of stable distributions that shape the common patterns of nature. An extreme value is the largest (or smallest) value from a sample. I focus on largest values. The same logic applies to smallest values.

The cumulative probability distribution function for extreme values, G(x), gives the probability that the
greatest observed value in a large sample will be less than x. Thus, 1 - G(x) gives the probability that the greatest observed value in a large sample will be higher than x.

Remarkably, the extreme value distribution takes one of three simple forms. The particular form depends only on the shape of the upper tail for the underlying probability distribution that governs each observation. In this section, I give a brief overview of extreme value theory and its relation to maximum entropy. As always, I emphasize the key concepts. Several books present full details, mathematical development and numerous applications (Embrechts et al., 1997; Kotz & Nadarajah, 2000; Coles, 2001; Gumbel, 2004).

Applications

Reliability, time to failure and mortality may depend on extreme values. Suppose an organism or a system depends on numerous components. Failure of any component causes the system to fail or the organism to die. One can think of failure for a component as an extreme value in a stochastic process. Then overall failure depends on how often an extreme value arises in any of the components. In some cases, overall failure may depend on breakdown of several components. The Weibull distribution is often used to describe these kinds of reliability and failure problems (Juckett & Rosenberg, 1992). We will see that the Weibull distribution is one of the three general types of extreme value distributions.

Many problems in ecology and evolution depend on evaluating rare events. What is the risk of invasion by a pest species (Franklin et al., 2008)? For an endangered species, what is the risk of rare environmental fluctuations causing extinction? What is the chance of a rare beneficial mutation arising in response to a strong selective pressure (Beisel et al., 2007)?

General expression of extreme value problems

Suppose we observe a random variable, Y, with a cumulative distribution function, or cdf, F(y). The cdf is defined as the probability that an observation, Y, is less than y. In most of this paper, I have focused on the probability distribution function, or pdf, f(y). The two expressions are related by

P(Y < y) = F(y) = ∫_{-∞}^{y} f(x) dx.

The value of F(y) can be used to express the total probability in the lower tail below y. We often want the total probability in the upper tail above y, which is

P(Y > y) = 1 - F(y) = F̂(y) = ∫_{y}^{∞} f(x) dx,

where I use F̂ for the upper tail probability.

Suppose we observe n independent values Y_i from the distribution F. Define the maximum value among those n observations as M_n. The probability that the maximum is less than y is equal to the probability that each of the n independent observations is less than y, thus

P(M_n < y) = [F(y)]^n.

This expression gives the extreme value distribution, because it expresses the probability distribution of the maximum value. The problem is that, for large n, [F(y)]^n → 0 if F(y) < 1 and [F(y)]^n → 1 if F(y) = 1. In addition, we often do not know the particular form of F. For a useful analysis, we want to know about the extreme value distribution without having to know exactly the form of F, and we want to normalize the distribution so that it approaches a limiting value as n increases. We encountered normalization when studying sums of random variables: without normalization, a sum of n observations often grows infinitely large as n increases. By contrast, a properly normalized sum converges to a Lévy stable distribution.

For extreme values, we need to find a normalization for the maximum value in a sample of size n, M_n, such that

P[(M_n - b_n)/a_n < y] → G(y).   (22)

In other words, if we normalize the maximum, M_n, by subtracting a location coefficient, b_n, and dividing by a scaling coefficient, a_n, then the extreme value distribution converges to G(y) as n increases. Using location and scale coefficients that depend on n is exactly what one does in normalizing a sum to obtain the standard Gaussian with mean zero and standard deviation 1: in the Gaussian case, to normalize a sum of n observations, one subtracts from the sum b_n = nμ, and divides the sum by √n σ, where μ and σ are the mean and standard deviation of the distribution from which each observation is drawn. The concept of normalization for extreme values is the same, but the coefficients differ, because we are normalizing the maximum value of n observations rather than the sum of n observations.

We can rewrite our extreme value normalization in eqn 22 as

P(M_n < a_n y + b_n) → G(y).

Next we use the equivalences established above to write

P(M_n < a_n y + b_n) = [F(a_n y + b_n)]^n = [1 - F̂(a_n y + b_n)]^n.

We note a very convenient mathematical identity

[1 - F̂(y)/n]^n → e^{-F̂(y)}

as n becomes large. Thus, if we can find values of a_n and b_n such that

F̂(a_n y + b_n) = F̂(y)/n,   (23)

then we obtain the general solution for the extreme value problem as
G(y) = e^{-F̂(y)},   (24)

where F̂(y) is the probability in the upper tail for the underlying distribution of the individual observations. Thus, if we know the shape of the upper tail of Y, and we can normalize as in eqn 23, we can express the distribution for extreme values, G(y).
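The chain from eqn 22 to eqn 24 can be checked numerically for a specific tail. In the sketch below (illustrative, not part of the original text; the sample size n = 500 and the replicate count are arbitrary), the underlying observations are exponential, so F̂(y) = e^{-y} and the choice a_n = 1, b_n = log(n) satisfies eqn 23; the empirical distribution of the normalized maximum should then approach e^{-e^{-y}}, one of the three limiting forms derived in the next subsection.

    import numpy as np

    rng = np.random.default_rng(4)
    n, reps = 500, 20_000

    # Exponential observations: upper tail F_hat(y) = exp(-y), so eqn 23 holds with
    # a_n = 1 and b_n = log(n); eqn 24 then predicts G(y) = exp(-exp(-y)).
    m = rng.exponential(size=(reps, n)).max(axis=1) - np.log(n)

    for y in (-1.0, 0.0, 1.0, 2.0):
        print(y, (m < y).mean(), np.exp(-np.exp(-y)))   # empirical vs predicted G(y)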
The tail determines the extreme value distribution

I give three brief examples that characterize the three different types of extreme value distributions. No other types exist (Embrechts et al., 1997; Kotz & Nadarajah, 2000). In the first example, let the upper tail of Y decay exponentially, so that F̂(y) = e^{-y}. Then, with a_n = 1 and b_n = log(n), eqn 23 is satisfied and eqn 24 gives

G(y) = e^{-e^{-y}},   (25)

which is called the double exponential or Gumbel distribution. Typically any distribution with a tail that decays faster than a power law attracts to the Gumbel, where, by eqn 6, a power law has a total tail probability in its cumulative distribution function proportional to 1/y^d, with d < 2. Exponential, Gaussian and gamma distributions all decay exponentially with tail probabilities less than power law tail probabilities.

In the second example, let the upper tail of Y decrease like a power law such that F̂(y) = y^{-d}. Then, with a_n = n^{1/d} and b_n = 0, we obtain

G(y) = e^{-y^{-d}},   (26)

which is called the Fréchet distribution.

Finally, if Y has a finite maximum value M such that F̂(y) has a truncated upper tail, and the tail probability near the truncation point is F̂(y) = (M - y)^d, then, with a_n = n^{-1/d} and b_n = M - n^{-1/d}M, we obtain

G(y) = e^{-(M - y)^d},   (27)

which is called the Weibull distribution. Note that G(y) = 1 for y ≥ M, because the extreme value can never be larger than the upper truncation point.
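The truncated-tail case can be checked the same way. In this sketch (again an illustration of my own rather than material from the article; uniform observations and the sample sizes are arbitrary choices), the upper bound is M = 1 with d = 1, so eqn 27 predicts G(y) = e^{-(1 - y)} for y ≤ 1, with a_n = 1/n and b_n = 1 - 1/n.

    import numpy as np

    rng = np.random.default_rng(5)
    n, reps = 500, 20_000

    # Uniform observations on [0, 1]: near the upper bound M = 1 the tail is
    # F_hat(y) = (1 - y), i.e. d = 1 in the notation of eqn 27.
    m = rng.random((reps, n)).max(axis=1)
    z = (m - (1.0 - 1.0 / n)) * n      # normalized maximum, (M_n - b_n)/a_n

    for y in (-2.0, -1.0, 0.0, 1.0):
        print(y, (z < y).mean(), np.exp(-(1.0 - y)))   # empirical vs predicted G(y)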
Maximum entropy: what information determines extreme values?

In this section, I show the constraints that define the maximum entropy patterns for extreme values. Each pattern arises from two constraints. One constraint sets the average location either by the mean, ⟨y⟩, for cases with exponential tail decay, or by the geometric mean measured by ⟨log(y)⟩ for power law tails. To obtain a general form, express this first constraint as ⟨n(y)⟩, where n(y) = y or n(y) = log(y). The other constraint measures the average tail weighting, ⟨F̂(y)⟩.

With the two constraints ⟨n(y)⟩ and ⟨F̂(y)⟩, the maximum entropy probability distribution (pdf) is

g(y) = k e^{-k_1 n(y) - k_2 F̂(y)}.   (28)

We can relate this maximum entropy probability distribution (pdf) to the results for the three types of extreme value distributions. I gave the extreme value distributions as G(y), the cumulative distribution function (cdf). We can obtain the pdf from the cdf by differentiation, because g(y) = dG(y)/dy.

From eqn 25, the pdf of the Gumbel distribution is

g(y) = e^{-y} e^{-e^{-y}},

which, from eqn 28, corresponds to a constraint ⟨n(y)⟩ = ⟨y⟩ on the mean and a constraint ⟨F̂(y)⟩ = ⟨e^{-y}⟩ for exponentially weighted tail shape. Here, k = 1, k_1 = 1 and k_2 = 1.

From eqn 26, the pdf of the Fréchet distribution is

g(y) = d y^{-(1+d)} e^{-y^{-d}},

which, from eqn 28, corresponds to a constraint ⟨n(y)⟩ = ⟨log(y)⟩ for the geometric mean and a constraint ⟨F̂(y)⟩ = ⟨y^{-d}⟩ for power law weighted tail shape. Here, k = d, k_1 = 1 + d and k_2 = 1.

From eqn 27, the pdf of the Weibull distribution is

g(y) = d(M - y)^{d-1} e^{-(M - y)^d},

which, from eqn 28, corresponds to a constraint ⟨n(y)⟩ = ⟨log(M - y)⟩ for the geometric mean measured from the upper bound and a constraint ⟨F̂(y)⟩ = ⟨(M - y)^d⟩ that weights extreme values by a truncated tail form. Here, k = d, k_1 = 1 - d and k_2 = 1.

In summary, eqn 28 provides a general form for extreme value distributions. As always, we can think of that maximum entropy form in two complementary ways. First, aggregation by repeated sampling suppresses weak signals and enhances strong signals until the only remaining information is contained in the location and in the weighting function constraints. Second, independent of any generative process, if we use measurements or extrinsic information to estimate or assume location and weighting function constraints, then the most likely distribution given those constraints takes on the general extreme value form.
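The correspondence between the cdfs of eqns 25-27 and the pdfs quoted above is just differentiation, and it can be confirmed symbolically. The fragment below is a convenience check of my own, not part of the original article.

    import sympy as sp

    y, d, M = sp.symbols('y d M', positive=True)

    # Extreme value cdfs from eqns 25-27; their derivatives should reproduce the pdfs
    # quoted in the text, which in turn fit the maximum entropy form of eqn 28.
    G_gumbel = sp.exp(-sp.exp(-y))
    G_frechet = sp.exp(-y**(-d))
    G_weibull = sp.exp(-(M - y)**d)

    for name, G in [('Gumbel', G_gumbel), ('Frechet', G_frechet), ('Weibull', G_weibull)]:
        print(name, sp.simplify(sp.diff(G, y)))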
Generative models vs. information constraints

The derivations in the previous section followed a generative model in which one obtains n independent observations from the same underlying distribution. As n increases, the extreme value distributions converge to one of three forms, the particular form depending on the tail of the underlying distribution.

We have seen several times in this paper that such generative models often attract to very general maximum entropy distributions. Those maximum entropy distributions also tend to attract a wide variety of other
generative processes. In the extreme value problem, any underlying distributions that share similar probability weightings in the tails fall within the domain of attraction to one of the three maximum entropy extreme value distributions (Embrechts et al., 1997).

In practice, one often first discovers a common and important pattern by a simple generative model. That generative model aggregates observations drawn independently from a simple underlying distribution that may be regarded as purely random or neutral. It is, however, a mistake to equate the neutral generative model with the maximum entropy pattern that it creates. Maximum entropy patterns typically attract a very wide domain of generative processes. The attraction to simple maximum entropy patterns arises because those patterns express simple informational constraints and nothing more. Aggregation inevitably washes out most information by the accumulation of partially uncorrelated perturbations. What remains in any aggregation is the information in the sampling structure, the invariance to changes in measurement scale, and the few signals retained through aggregation. Those few bits of information define the common patterns of nature.

Put another way, the simple generative models can be thought of as tools by which we discover important maximum entropy attractor distributions. Once we have found such distributions by a generative model, we may extract the informational constraints that define the pattern. With that generalization in hand, we can then consider the broad scope of alternative generative processes that preserve the information that defines the pattern. The original generative model no longer has special status: our greatest insight resides with the informational constraints that define the maximum entropy distribution.

The challenge concerns how to use knowledge of the common patterns to draw inferences about pattern and process in biology. This paper has been about the first step: to understand clearly how information defines the relations between generative models of process and the consequences for pattern. I only gave the logical structure rather than direct analyses of important biological patterns. The next step requires analysis of the common patterns in biology with respect to sampling structure, informational constraints and the domains of attraction for generative models to particular patterns. What range of non-neutral generative models attract to the common patterns, because the extra non-neutral information gets washed out with aggregation?

With the common patterns of biology better understood, one can then analyse departures from the common patterns more rationally. What extra information causes departures from the common patterns? Where does the extra information come from? What is it about the form of aggregation that preserves the extra information? How are evolutionary dynamics and biological design influenced by the tendency for aggregates to converge to a few common patterns?

Acknowledgments

I came across the epigraph from Gnedenko and Kolmogorov in Sornette (2006) and the epigraph from Galton in Hartl (2000). My research is supported by National Science Foundation grant EF-0822399, National Institute of General Medical Sciences MIDAS Program grant U01-GM-76499 and a grant from the James S. McDonnell Foundation.

References

Abe, S. & Rajagopal, A.K. 2000. Justification of power law canonical distributions based on the generalized central-limit theorem. Europhys. Lett. 52: 610-614.
Barabasi, A.L. & Albert, R. 1999. Emergence of scaling in random networks. Science 286: 509-512.
Beisel, C.J., Rokyta, D.R., Wichman, H.A. & Joyce, P. 2007. Testing the extreme value domain of attraction for distributions of beneficial fitness effects. Genetics 176: 2441-2449.
Benoit, C., Royer, E. & Poussigue, G. 1992. The spectral moments method. J. Phys. Condens. Matter 4: 3125-3152.
Bologna, M., Campisi, M. & Grigolini, P. 2002. Dynamic versus thermodynamic approach to non-canonical equilibrium. Physica A 305: 89-98.
Bracewell, R.N. 2000. The Fourier Transform and its Applications, 3rd edn. McGraw Hill, Boston.
Coles, S. 2001. An Introduction to Statistical Modeling of Extreme Values. Springer, New York.
Cramer, H. & Leadbetter, M.R. 1967. Stationary and Related Stochastic Processes: Sample Function Properties and their Applications. Wiley, New York.
Embrechts, P., Kluppelberg, C. & Mikosch, T. 1997. Modeling Extremal Events: For Insurance and Finance. Springer Verlag, Heidelberg.
Eriksson, E.J., Cepeda, L.F., Rodman, R.D., Sullivan, K.P.H., McAllister, D.F., Bitzer, D. & Arroway, P. 2004. Robustness of spectral moments: a study using voice imitations. Proceedings of the Tenth Australian International Conference on Speech Science and Technology 1: 259-264.
Feller, W. 1968. An Introduction to Probability Theory and its Applications, Vol. I, 3rd edn. Wiley, New York.
Feller, W. 1971. An Introduction to Probability Theory and its Applications, Vol. II, 2nd edn. Wiley, New York.
Franklin, J., Sisson, S.A., Burgman, M.A. & Martin, J.K. 2008. Evaluating extreme risks in invasion ecology: learning from banking compliance. Divers. Distrib. 14: 581-591.
Galton, F. 1889. Natural Inheritance. MacMillan, London/New York.
Garcia Martin, H. & Goldenfeld, N. 2006. On the origin and robustness of power-law species-area relationships in ecology. Proc. Natl. Acad. Sci. USA 103: 10310-10315.
Gnedenko, B.V. & Kolmogorov, A.N. 1968. Limit Distributions for Sums of Independent Random Variables. Addison-Wesley, Reading, MA.
Gumbel, E.J. 2004. Statistics of Extremes. Dover Publications, New York.
Hall, P. & Heyde, C.C. 1980. Martingale Limit Theory and its Application. Academic Press, New York.
Hartl, D.L. 2000. A Primer of Population Genetics, 3rd edn. Sinauer, Sunderland, MA.
Jaynes, E.T. 2003. Probability Theory: The Logic of Science. Cambridge University Press, New York.
Johnson, O. 2004. Information Theory and the Central Limit Theorem. Imperial College Press, London.
Johnson, R.W. & Shore, J.E. 1984. Which is the better entropy expression for speech processing: -S log S or log S? IEEE T. Acoust. Speech. 32: 129-137.
Johnson, N.L., Kotz, S. & Balakrishnan, N. 1994. Continuous Univariate Distributions, 2nd edn. Wiley, New York.
Johnson, N.L., Kemp, A.W. & Kotz, S. 2005. Univariate Discrete Distributions, 3rd edn. Wiley, Hoboken, NJ.
Juckett, D.A. & Rosenberg, B. 1992. Human disease mortality kinetics are explored through a chain model embodying principles of extreme value theory and competing risks. J. Theor. Biol. 155: 463-483.
Kapur, J.N. 1989. Maximum-entropy Models in Science and Engineering. Wiley, New York.
Kleiber, C. & Kotz, S. 2003. Statistical Size Distributions in Economics and Actuarial Sciences. Wiley, New York.
Kotz, S. & Nadarajah, S. 2000. Extreme Value Distributions: Theory and Applications. World Scientific, Singapore.
Kullback, S. 1959. Information Theory and Statistics. Wiley, New York.
Mandelbrot, B.B. 1983. The Fractal Geometry of Nature. WH Freeman, New York.
Mantegna, R.N. & Stanley, H.E. 1994. Stochastic process with ultraslow convergence to a Gaussian: the truncated Levy flight. Phys. Rev. Lett. 73: 2946-2949.
Mantegna, R.N. & Stanley, H.E. 1995. Scaling behaviour in the dynamics of an economic index. Nature 376: 46-49.
Mariani, M.C. & Liu, Y. 2007. Normalized truncated Levy walks applied to the study of financial indices. Physica A 377: 590-598.
Mitzenmacher, M. 2004. A brief history of generative models for power law and lognormal distributions. Internet Math. 1: 226-251.
Newman, M.E.J. 2005. Power laws, Pareto distributions and Zipf's law. Contemp. Phys. 46: 323-351.
Ravasz, E., Somera, A.L., Mongru, D.A., Oltvai, Z.N. & Barabasi, A.L. 2002. Hierarchical organization of modularity in metabolic networks. Science 297: 1551-1555.
Shannon, C.E. & Weaver, W. 1949. The Mathematical Theory of Communication. University of Illinois Press, Urbana, IL.
Simkin, M.V. & Roychowdhury, V.P. 2006. Re-inventing Willis. Arxiv:physics/0601192.
Simon, H.A. 1955. On a class of skew distribution functions. Biometrika 42: 425-440.
Sivia, D.S. & Skilling, J. 2006. Data Analysis: A Bayesian Tutorial. Oxford University Press, New York.
Sornette, D. 2006. Critical Phenomena in Natural Sciences: Chaos, Fractals, Selforganization, and Disorder: Concepts and Tools. Springer, New York.
Stigler, S.M. 1986. The History of Statistics: The Measurement of Uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge, MA.
Tribus, M. 1961. Thermostatics and Thermodynamics: An Introduction to Energy, Information and States of Matter, with Engineering Applications. Van Nostrand, New York.
Tsallis, C. 1988. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 52: 479-487.
Tsallis, C. 1999. Nonextensive statistics: theoretical, experimental and computational evidences and connections. Braz. J. Phys. 29: 1-35.
Van Campenhout, J.M. & Cover, T.M. 1981. Maximum entropy and conditional probability. IEEE Trans. Inf. Theory 27: 483-489.
Voit, J. 2005. The Statistical Mechanics of Financial Markets, 3rd edn. Springer, New York.
Yu, Y.M. 2008. On the maximum entropy properties of the binomial distribution. IEEE Trans. Inf. Theory 54: 3351-3353.

Received 15 March 2009; revised 23 April 2009; accepted 5 May 2009