Physics and Astronomy
springeronline.com
Advanced Texts in Physics
This program of advanced texts covers a broad spectrum of topics that are of
current and emerging interest in physics. Each book provides a comprehensive and
yet accessible introduction to a field at the forefront of modern research. As such,
these texts are intended for senior undergraduate and graduate students at the M.S.
and Ph.D. levels; however, research scientists seeking an introduction to particular
areas of physics will also benefit from the titles in this collection.
Philippe Réfrégier

Noise Theory and Application to Physics: From Fluctuations to Information

With 80 Figures

Springer

Translated by Stephen Lyle, Andebu, 09240 Alzen, France

Philippe Réfrégier
Institut Fresnel
D.U. St Jérôme
13397 Marseille cedex 20
France
philippe.refregier@fresnel.fr
To the memory of my Father
Foreword
I had great pleasure in reading Philippe Refregier's book on the theory of noise
and its applications in physics. The main aim of the book is to present the
basic ideas used to characterize these unwanted random signals that obscure
information content. To this end, the author devotes a significant part of his
book to a detailed study of the probabilistic foundations of fluctuation theory.
Following a concise and accurate account of the basics of probability theory, the author includes a detailed study of stochastic processes, emphasizing
the idea of the correlation function, which plays a key role in many areas of
physics.
Physicists often assume that the noise perturbing a signal is Gaussian.
This hypothesis is justified if one can consider that the noise results from the
superposition of a great many independent random perturbations. It is this
fact that brings the author to discuss the theory underlying the addition of
random variables, accompanied by a wide range of illustrative examples.
Since noise affects information, the author is naturally led to consider
Shannon's information theory, which in turn brings him to the altogether
fundamental idea of entropy. This chapter is completed with a study of complexity according to Kolmogorov. This idea is not commonly discussed in
physics and the reader will certainly appreciate the clear presentation within
these pages.
In order to explain the nature of noise from thermal sources, Philippe
Refregier then presents the essential features of statistical physics. This allows
him to give a precise explanation of temperature. The chapter is very complete
and omits none of the key ideas.
To conclude the work, the author devotes an important chapter to problems of estimation, followed by a detailed discussion of the examples presented
throughout the book.
I am quite certain that this book will be highly acclaimed by physicists
concerned with the problems raised by information transmission. It is well
presented, rigorous without excess, and richly illustrated with examples which
bring out the significance of the ideas under discussion.
Preface

This book results from work carried out over the past few years as a member
of the Physics and Image Processing team at the Fresnel Institute in the
École Nationale Supérieure de Physique de Marseille and the University of
Aix Marseille III. In particular, it relates to the MSc programme in Optics,
Image and Signal.
I would like to thank all my colleagues for many stimulating scientific
discussions over the past years. Naturally, this involves for the main part the
permanent academic staff of the Physics and Image Processing team, who
are too numerous to list here. However, I wish to extend particular thanks to
François Goudail for his invaluable help, both on the scientific level and in
terms of the technical presentation of this book, and to Pierre Chavel for his
judicious remarks and advice.
Nino Boccara, whose teachings have been extremely useful to me, agreed to write the foreword to this book, for which I am sincerely grateful.
Our understanding is fashioned and refined during scientific discussions
with colleagues of the same or similar disciplines. I would therefore like to
thank the GDR ISIS and the Société Française d'Optique who created such a
favourable context for exchange.
Finally, I would like to acknowledge my debt towards Marie-Hélène and
Nina who supported me when I undertook the task of writing this book.
Contents

2 Random Variables ............................................ 5
   2.1 Random Events and Probability ......................... 6
   2.2 Random Variables ...................................... 7
   2.3 Means and Moments ..................................... 10
   2.4 Median and Mode of a Probability Distribution ......... 12
   2.5 Joint Random Variables ................................ 13
   2.6 Covariance ............................................ 16
   2.7 Change of Variables ................................... 18
   2.8 Stochastic Vectors .................................... 19
   Exercises ................................................ 22
Introduction
The aim of this book is to present the statistical basis for theories of noise
in physics. More precisely, the intention is to cover the essential elements
required to characterize noise (also referred to as fluctuations) and to describe
optimization techniques for measurements carried out in the presence of such
perturbations. Although this is one of the main concerns of any engineer or
physicist, these ideas tend not to be tackled in a global manner. The approach
developed here thus aims to provide the reader with a consistent view of
this concept, in such a way that the various physical interpretations are not
obscured by mathematical difficulties.
In the book, fluctuations are placed at the center of attention. In order to
analyze them, our approach is based upon probability theory, the physics of
linear systems, statistical physics, information theory and statistics.
[Figure: the analysis of fluctuations draws on probability theory, stochastic processes, statistical physics, information theory and statistics.]
Probability Theory
The principal mathematical tool used to describe noise and fluctuations is
the theory of probability. Chapter 2 provides a brief presentation of the main
ideas concerning random variables. Probability is not an easy subject and we
strongly recommend the reader to check that he or she has a good grasp of
the ideas in Chapter 2 before proceeding to the following chapters. There is
perhaps a certain ambiguity in speaking of a mathematical tool and it might
be better to speak rather of a language when the interaction between the
physical situation under investigation and the mathematical concept chosen
to describe it is so vast.
Statistical Physics
Statistics
In a typical practical context, the physicist or engineer may be faced with quite
the opposite problem to the one posed in the probabilistic approach. For the
problem is often one of inferring quantities not considered as random, such
as a mean particle flux, the mean voltage across a resistor, the covariance
functions of a process, and so on, from a finite number of observations or
measurements. This is therefore no longer a simple problem of probability,
for we must now appeal to statistics. It is worth noting in this context that
statistical physics is more probabilistic than genuinely statistical.
Statistics is a vast field and we shall only consider the rather restricted
aspect of statistical inference whose aim is precisely to determine efficient
methods of estimation from fluctuating measurements, i.e., from measure-
ments made in the presence of noise. An important feature of statistics is
the possibility of characterizing the accuracy of our estimates of measured
quantities. The ultimate attainable accuracy cannot be arbitrarily small for a
finite number of measurements and various lower bounds have been known for
some time now. In Chapter 7 we shall only consider the best known amongst these, the Cramér-Rao bound, and we shall study the conditions under which it may be attained.
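The Cramér-Rao bound can be made concrete with a small simulation. This is our own illustration, not an example from the book: for N independent Gaussian samples of known variance σ², the bound on the variance of any unbiased estimator of the mean is σ²/N, and the sample mean attains it.

```python
import random
import statistics

# Sketch (ours, not the book's): for n i.i.d. Gaussian samples of known
# variance sigma^2, the Fisher information for the mean is n/sigma^2, so
# the Cramer-Rao bound for an unbiased estimator is sigma^2/n.  The
# sample mean attains this bound; we check it by Monte Carlo.
random.seed(0)
sigma, n_samples, n_trials = 2.0, 25, 4000
crb = sigma**2 / n_samples  # Cramer-Rao lower bound = 0.16

estimates = []
for _ in range(n_trials):
    data = [random.gauss(0.0, sigma) for _ in range(n_samples)]
    estimates.append(sum(data) / n_samples)  # sample-mean estimator

# Empirical variance of the estimator: it should sit near the bound.
empirical_var = statistics.pvariance(estimates)
```

With 4000 trials the empirical variance of the sample mean lands within a few percent of σ²/N, illustrating that the bound is attained here.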
Applications
In case there should be any doubt about it, this book is mainly oriented
toward applications. The last chapter reconsiders the various examples used as
illustrations throughout the rest of the book, turning to the specific problem
of estimation. Characterizing, predicting and optimizing measurements are
permanent concerns of the physicist or engineer. However, these goals cannot
be achieved efficiently by applying rules of thumb. A deeper understanding of
noise is essential. We hope that this book will help the reader by achieving a
good compromise between theory and application.
2 Random Variables

P_1 + P_2 + ... + P_n + ... = 1 ,   or   ∑_{i=1}^{∞} P_i = 1 .
of beating that one." The race itself is a single event and the probability of
3/4 mentioned here can in no way correspond to a frequency of occurrence.
But this quantity may nevertheless prove useful to a gambler.
The set in question may be infinite. Consider a monkey typing on the keyboard of a computer. We may choose as random events the various possible words, i.e., sequences of letters that are not separated by a space. The set Ω of possible words is clearly infinite. To see this, we may imagine that the monkey typing on the computer keyboard never once presses on the space bar. One might object, quite rightly, that the animal has a finite lifespan, so that the set Ω must also be finite. Rather than trying to refine the example by finding a way around this objection, let us just say that it may be simpler to choose an infinite size for Ω than to estimate the maximal possible size. What matters in the end is the quality of the results obtained with the chosen model and the simplicity of that model.
Generalizing a little further, it should be noted that the set Ω of possible events may not only be infinite; it may actually be uncountable. In other words, it may be that the elements of the set Ω cannot be put into a one-to-one correspondence with the positive integers. To see this, suppose that we choose at random a real number between 0 and 1. In this case, we may identify Ω with the interval [0, 1], and this is indeed uncountable in the above sense. This is a classic problem in mathematics. Let us outline our approach when the given set is uncountable. We consider the set of all subsets of Ω and we associate with every subset ω ⊆ Ω a positive number P(ω). We then apply Kolmogorov's axioms to equip Ω with a probability law. To do so, the following conditions must be satisfied:

P(Ω) = 1 and P(∅) = 0 (where ∅ is the empty set),

if the subsets A_1, A_2, ..., A_n, ... are pairwise disjoint, so that no pair of sets contains common elements, we must have

P(A_1 ∪ A_2 ∪ ... ∪ A_n ∪ ...) = P(A_1) + P(A_2) + ... + P(A_n) + ... .
and we associate with each of these events λ a value X_λ. If the possible values of X_λ are real numbers, we speak of a real random variable, whereas if they are complex numbers, we have a complex random variable. In the rest of this chapter we shall be concerned mainly with real- or integer-valued random variables. In the latter case, X_λ will be a whole number. In order to define the probability of a random variable, we proceed in two stages. We consider first the case where Ω is countable. The uncountable case will then lead to the idea of probability density.
If Ω is countable, we can define P_i = P(A_i) with ∑_{i=1}^{∞} P_i = 1. The latter is also written ∑_{λ∈Ω} P(λ) = 1, which simply means that the sum of the probabilities P(λ) of each element of Ω must equal 1. Let x be a possible value of X_λ. Then P(x) denotes the probability that X_λ is equal to x. We obtain this value by summing the probabilities of all random events in Ω such that X_λ = x. In the game of heads or tails, we may associate the value 0 with tails and 1 with heads. We thereby construct an integer-valued random variable. The probability P(0) is thus 1/2, as is P(1). For a game with a six-sided die, we would have P(1) = P(2) = ... = P(6) = 1/6. If our die is such that the number 1 appears on one side, the number 2 on two sides, and the number 3 on three sides, we then set P(1) = 1/6, P(2) = 1/3, and P(3) = 1/2.
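The loaded-die model above is easy to check numerically. The following sketch is our own illustration: it verifies that the assigned probabilities sum to 1 and that sampled frequencies approach them.

```python
import random

# Minimal check (ours, not the book's) of the loaded-die model:
# face 1 on one side, face 2 on two sides, face 3 on three sides.
probs = {1: 1/6, 2: 2/6, 3: 3/6}
assert abs(sum(probs.values()) - 1.0) < 1e-12  # probabilities sum to 1

# Sample the die and compare empirical frequencies with P(x).
random.seed(42)
n = 60_000
counts = {1: 0, 2: 0, 3: 0}
for _ in range(n):
    face = random.choices([1, 2, 3], weights=[1, 2, 3])[0]
    counts[face] += 1
freqs = {x: counts[x] / n for x in counts}
```

With 60 000 throws the empirical frequencies agree with 1/6, 1/3 and 1/2 to within about one percent.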
Letting D_X denote the set of possible values of X_λ, we see that we must have

∑_{x∈D_X} P(x) = 1 .
n! = n · (n − 1) · (n − 2) · · · 2 · 1 .
As we shall discover later, the Poisson law is a simple model which allows us
to describe a great many physical phenomena.
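As a quick numerical illustration of our own (not taken from the book), the Poisson law P(n) = e^{−μ} μⁿ/n! indeed sums to 1 and has both mean and variance equal to μ:

```python
from math import exp, factorial

# Poisson law P(n) = exp(-mu) * mu**n / n! : normalization, mean and
# variance computed by direct summation (the tail beyond n = 59 is
# negligible for mu = 3.5).
mu = 3.5
pmf = [exp(-mu) * mu**n / factorial(n) for n in range(60)]

total = sum(pmf)                                               # ~ 1
mean = sum(n * p for n, p in enumerate(pmf))                   # ~ mu
variance = sum((n - mean)**2 * p for n, p in enumerate(pmf))   # ~ mu
```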
P_X(x) = dF_X(x)/dx .

As F_X(∞) = 1 and F_X(x) = ∫_{−∞}^{x} P_X(y) dy, we deduce that

∫_{−∞}^{∞} P_X(x) dx = 1 .
[Table 2.1 (extract): the Cauchy law has density P_X(x) = a / [π(a² + x²)] with a > 0; densities with bounded support are 0 outside it.]
∫ f(x) δ(x − n) dx = f(n) .

We thus find

∫ f(x) P_X(x) dx = ∑_n P_n f(n) .
The moments of a probability law play an important role in physics. They are defined by

⟨X_λ^r⟩ = ∫ x^r P_X(x) dx .
Moments are generally only considered for positive integer values of r. However, we may equally well choose any positive real value of r. It should be emphasized that there is a priori no guarantee of the existence of the moment of order r. For example, although all moments with r ≥ 0 exist for the Gaussian law, only those moments ⟨X_λ^r⟩ with r ∈ [0, 1) exist for the Cauchy law. Indeed, in the latter case, the moment ⟨X_λ⟩ is not defined because
The first moments play a special role, because they are very often considered in physical problems. For r = 1, we obtain the mean value of the random variable:

m_X = ⟨X_λ⟩ = ∫ x P_X(x) dx .
The physical interpretation is simple. σ_X, also known as the standard deviation, represents the width of the probability density function, whilst m_X is the value on which the probability density is centered (see Fig. 2.2). Quite generally, we define the central moment of order r by ⟨(X_λ − ⟨X_λ⟩)^r⟩. Table 2.2 shows the mean and variance of various probability laws.
Table 2.2. Mean and variance of various probability laws

Law        Mean        Variance
Bernoulli  q           q(1 − q)
Poisson    μ           μ
Uniform    (a + b)/2   (b − a)²/12
Gaussian   m           σ²
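The Uniform row of the table can be cross-checked numerically; the sketch below is our own illustration, computing the mean and variance of the uniform density 1/(b − a) on [a, b] by midpoint-rule integration.

```python
# Numerical cross-check (ours) of the Uniform row of Table 2.2:
# for P_X(x) = 1/(b - a) on [a, b], the mean is (a + b)/2 and the
# variance is (b - a)**2 / 12.
a, b = 2.0, 7.0
steps = 100_000
h = (b - a) / steps
xs = [a + (i + 0.5) * h for i in range(steps)]   # midpoint rule
density = 1.0 / (b - a)

mean = sum(x * density * h for x in xs)
var = sum((x - mean) ** 2 * density * h for x in xs)
# Expected: mean = (2 + 7)/2 = 4.5 and var = 25/12.
```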
The moments just defined correspond to mean values involving the probability density function. They are referred to as expected values or expectation values, or simply expectations. These quantities are defined from deterministic variables x rather than random variables X_λ. They are clearly deterministic quantities themselves. Although they characterize the probability density function, they cannot be directly ascertained by experiment. In fact, they can only be estimated. We shall return to the idea of estimation in Chapter 7. Let us content ourselves here with a property which illustrates the practical significance of moments. This property follows from the weak law of large numbers.
lim_{a→−∞} lim_{b→∞} ∫_a^b x P_X(x) dx
does not converge. Note, however, that it is symmetrical about the origin,
so that the value x = 0 must play some special role. This is indeed what
we conclude when we introduce the notions of mode and median for this
probability law.
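The divergence of the Cauchy first moment can be seen numerically; the following sketch is our own illustration (not from the book). The one-sided integral of x P_X(x) has the closed form (a/2π) ln(1 + b²/a²), which grows without bound as b increases, so no finite mean exists.

```python
from math import log, pi

# Sketch (ours) of why <X> fails for the Cauchy law
# P_X(x) = a / (pi * (a**2 + x**2)): the one-sided integral of x*P(x)
# grows like log(b) and never settles as the cutoff b increases.
a = 1.0

def half_moment(b, steps=100_000):
    """Trapezoidal estimate of the integral of x*P_X(x) over [0, b]."""
    h = b / steps
    f = lambda x: x * a / (pi * (a * a + x * x))
    return h * (0.5 * f(0.0) + sum(f(i * h) for i in range(1, steps))
                + 0.5 * f(b))

cutoffs = (10.0, 100.0, 1000.0)
vals = [half_moment(b) for b in cutoffs]
# Closed form: (a / (2*pi)) * log(1 + b**2/a**2) -- unbounded in b.
exact = [(a / (2 * pi)) * log(1 + b * b) for b in cutoffs]
```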
When it exists, the median x_M is defined as the value of x such that

∫_{−∞}^{x_M} P_X(x) dx = ∫_{x_M}^{∞} P_X(x) dx .
[Fig. 2.2: probability density P_X(x), showing the positions of the mode and the median.]
The mode x_P, when it exists, is defined by

x_P = argmax_x [P_X(x)] ,
which simply means that it is the value of x which maximizes Px(x), see
Fig. 2.2. It is then clear that, if Px(x) is differentiable, the mode satisfies the
relation
dP_X(x)/dx |_{x=x_P} = 0 ,

or equivalently,

∂/∂x ln [P_X(x)] |_{x=x_P} = 0 .
It is thus easy to see that the mode of a Gaussian distribution is equal to its
mean.
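This can be checked numerically; the sketch below is our own illustration, locating the mode of a Gaussian density by maximizing P_X(x) on a fine grid and verifying that it coincides with the mean m.

```python
from math import exp, pi, sqrt

# Locate the mode of a Gaussian density by a grid argmax (our check).
m, sigma = 1.7, 0.8

def gauss(x):
    return exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

grid = [m - 4 + i * 0.001 for i in range(8001)]  # covers [m-4, m+4]
mode = max(grid, key=gauss)                      # argmax of P_X
# The maximizing grid point sits at the mean m, as expected.
```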
P(A|B) = P(A, B)/P(B) ,

where we adopt the standard notation P(A, B) = P(A ∩ B). It follows immediately that

P(A|B) P(B) = P(B|A) P(A) .
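A tiny discrete example of our own makes the definition P(A|B) = P(A, B)/P(B) concrete: for a fair die, take A = "the outcome is even" and B = "the outcome is at least 4".

```python
# Conditional probability on a fair die (our illustration):
# A = even outcome, B = outcome >= 4.
outcomes = range(1, 7)
p = 1.0 / 6.0

P_B = sum(p for k in outcomes if k >= 4)                   # 3/6 = 1/2
P_AB = sum(p for k in outcomes if k % 2 == 0 and k >= 4)   # {4, 6} -> 1/3
P_A_given_B = P_AB / P_B                                   # 2/3

# Consistency with P(A|B) P(B) = P(B|A) P(A):
P_A = sum(p for k in outcomes if k % 2 == 0)               # 1/2
P_B_given_A = P_AB / P_A                                   # 2/3
```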
The joint distribution function F_{X,Y}(x, y) relative to the two random variables X_λ and Y_λ is defined as the probability that simultaneously X_λ is less than x and Y_λ less than y. Let I_x denote the set of random events in Ω such that X_λ is less than x. Likewise, let I_y denote the set of random events in Ω such that Y_λ is less than y. Then it follows that F_X(x) = P(I_x) and F_Y(y) = P(I_y), and also that

P_{X,Y}(x, y) = ∂²F_{X,Y}(x, y) / ∂x∂y .
P_{X|Y}(x|y) = P_{X,Y}(x, y) / P_Y(y) ,

and symmetrically,

P_{Y|X}(y|x) = P_{X,Y}(x, y) / P_X(x) .
To complete these relations, note that F_{X,Y}(x, ∞) = F_X(x) and hence,

P_X(x) = ∫_{−∞}^{∞} P_{X,Y}(x, y) dy ,

and likewise,

P_Y(y) = ∫_{−∞}^{∞} P_{X,Y}(x, y) dx .
We say that Px,Y(x, y) is the joint probability law and that Px(x) and Py(y)
are the marginal probability laws.
2.6 Covariance
Two random variables X_λ and Y_λ are said to be independent if

P_{X,Y}(x, y) = P_X(x) P_Y(y) ,
which then implies that P_{X|Y}(x|y) = P_X(x) and P_{Y|X}(y|x) = P_Y(y). In other words, knowing the value of a realization of Y_λ tells us nothing about the value of X_λ since P_{X|Y}(x|y) = P_X(x), and likewise, knowing the value of X_λ tells us nothing about the value of a realization of Y_λ.
The second extreme situation corresponds to the case where there is a perfectly deterministic relationship between X_λ and Y_λ, which we denote by Y_λ = g(X_λ). Clearly, in this case, when the value of a realization of X_λ is known, only the value g(X_λ) is possible for Y_λ, and we write

P_{Y|X}(y|x) = δ[y − g(x)] .
It can be shown that |Γ_XY|² ≤ σ_X² σ_Y². Indeed, consider the quadratic form (a δX_λ − δY_λ)², where δX_λ = X_λ − ⟨X_λ⟩ and δY_λ = Y_λ − ⟨Y_λ⟩. Since this form is positive, its expectation value must also be positive. Expanding out this expression, we obtain
The random variables X_λ and Y_λ are therefore uncorrelated. However, they are not independent, since (X_λ)² + (Y_λ)² = 1.
³ Introducing once again the centered variables δX_λ = X_λ − ⟨X_λ⟩ and δY_λ = Y_λ − ⟨Y_λ⟩, it is easy to see that ⟨δX_λ⟩ = 0 = ⟨δY_λ⟩ and that Γ_XY = ⟨δX_λ δY_λ⟩, i.e.,

Γ_XY = ∬ x y P_{δX,δY}(x, y) dx dy ,

noting that we are considering the probability density functions of the centered variables δX_λ and δY_λ. Clearly, we have P_{δX}(x) = P_X(x + ⟨X_λ⟩) and P_{δY}(y) = P_Y(y + ⟨Y_λ⟩). Since by hypothesis P_{X,Y}(x, y) = P_X(x)P_Y(y), we can deduce from the above that P_{δX,δY}(x, y) = P_{δX}(x) P_{δY}(y). It thus follows that Γ_XY = ∬ x y P_{δX}(x) P_{δY}(y) dx dy and hence

Γ_XY = ∫ x P_{δX}(x) dx ∫ y P_{δY}(y) dy = 0 .
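The factorization argument can be checked exactly on a discrete example of our own: when the joint law is a product of two centered marginals, the covariance sum factorizes into a product of two zero means.

```python
# Exact discrete check (ours) that independent centered variables are
# uncorrelated: with P_XY(x, y) = P_X(x) * P_Y(y), the covariance sum
# factorizes into the product of two zero means.
px = {-1: 0.25, 0: 0.5, 1: 0.25}   # centered marginal law for X
py = {-2: 0.5, 2: 0.5}             # centered marginal law for Y

cov = sum(x * y * px[x] * py[y] for x in px for y in py)
mean_x = sum(x * p for x, p in px.items())
mean_y = sum(y * p for y, p in py.items())
# Both means vanish, and so does the covariance.
```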
g'(x) = dg(x)/dx ,

we obtain

P_Y(y) = (1 / g'(x)) P_X(x) .

Noting that g'(x) = dy/dx, the above expression can also be written in the more memorable form (see Fig. 2.4)

P_Y(y) dy = P_X(x) dx .
If the relation y = g(x) is not bijective, the above argument can be applied
to intervals where it is bijective, adding the contributions from the various
intervals for each value of y.
Considering the case where the probability density function P_A(a) of the amplitude A_λ (assumed to be real-valued) of the electric field is Gaussian with zero mean and variance σ², let us determine the probability density function P_I(I) of the intensity I_λ = |A_λ|². To do so, we begin with the positive values of a. Hence,
P_I^+(I) dI = (1 / √(2π) σ) exp(−a² / 2σ²) da .

Now dI = 2a da and a = √I, which implies
[Fig. 2.4: the transformation y = g(x) maps the probability P_X(x) dx carried by the interval dx onto P_Y(y) dy in the corresponding interval dy.]
P_I^+(I) = (1 / 2σ√(2πI)) exp(−I / 2σ²) .

In the same manner we obtain for negative values of a

P_I^−(I) = (1 / 2σ√(2πI)) exp(−I / 2σ²) .

For each value of I, we may have a = √I or a = −√I, and we thus deduce that P_I(I) = P_I^+(I) + P_I^−(I). Hence,

P_I(I) = (1 / σ√(2πI)) exp(−I / 2σ²) .
where the symbol T indicates that we consider the transposed vector. We thus see that a stochastic vector is simply equivalent to a system of N random variables. The distribution function F_X(x) is the joint probability that X_λ(j) is less than or equal to x_j for all j in the range from 1 to N, with x = (x_1, x_2, ..., x_N)^T. In other words,

F_X(x) = Prob[X_λ(1) ≤ x_1, X_λ(2) ≤ x_2, ..., X_λ(N) ≤ x_N] .
In the complex case, let X_λ(j) = X_λ^R(j) + i X_λ^I(j), where X_λ^R(j) and X_λ^I(j) are the real and imaginary parts of the component X_λ(j). The distribution function is then

F_X(x) = Prob[X_λ^R(1) ≤ x_1^R, X_λ^I(1) ≤ x_1^I, X_λ^R(2) ≤ x_2^R, X_λ^I(2) ≤ x_2^I, ..., X_λ^R(N) ≤ x_N^R, X_λ^I(N) ≤ x_N^I] ,

and the probability density function is

P_X(x) = ∂^{2N} F_X(x) / ∂x_1^R ∂x_1^I ∂x_2^R ... ∂x_N^R ∂x_N^I .
Note that if a and b are two N-component vectors, a†b is a scalar, since it is in fact the scalar product of a and b, whilst b a† is an N × N tensor with ij th component b_i a_j*. This formulation is sometimes useful for simplifying certain proofs. For example, we can show that the covariance matrices are positive. For simplicity, we assume here that the mean value of X_λ is zero. If it is not, we can consider Y_λ = δX_λ = X_λ − ⟨X_λ⟩. For any vector a, the modulus squared of the scalar product a†X_λ is positive or zero, i.e.,

⟨|a† X_λ|²⟩ = a† Γ a ≥ 0 .

Furthermore, we see immediately that we have a Hermitian matrix, i.e., Γ† = Γ, since

⟨X_λ(j) [X_λ(i)]*⟩* = ⟨X_λ(i) [X_λ(j)]*⟩ .
where |Γ| is the determinant of Γ. This expression can be written in the form

P_X(x) = (2π)^{−N/2} |Γ|^{−1/2} exp[ −Q(x_1, x_2, ..., x_N)/2 ] ,

where

Q(x_1, x_2, ..., x_N) = ∑_{i=1}^{N} ∑_{j=1}^{N} (x_i − m_i) K_{ij} (x_j − m_j) ,

and K = Γ^{−1}.
Exercises
Exercise 2.1. Probability and Probability Density Function
Let X_λ be a random variable with probability density function P_X(x). Consider the new variable Y_λ obtained from X_λ in the following manner:
Exercise 2.5
Let G(x, y) be the probability that the random variable X_λ lies between x and y. Determine the probability density of X_λ as a function of G(x, y).

Consider the complex random variable defined by Z_λ = X_λ + iY_λ, where i² = −1, and X_λ and Y_λ are independent Gaussian random variables with the same variance. Give an expression for the probability density of Z_λ.
Determine the probability density function of Y_λ obtained from X_λ by the transformation Y_λ = (X_λ)^β, where β > 0 and X_λ is a random variable distributed according to the Gamma probability law. Analyze the special case where the Gamma distribution is exponential.
variables B_i are statistically independent of one another. The sum of all the measurements is evaluated, thereby producing a new random variable

(1) Calculate the probability density function of the random variable Y, assuming it obeys a Gaussian distribution.
(2) Why can we say that measurement of g using Y is more 'precise' than measurement using a single value F_i?
Consider two independent random variables X_λ and Y_λ, identically distributed according to a Gaussian probability law with zero mean. Determine the probability density function of the quotient random variable Z_λ = X_λ / Y_λ.
3
Fluctuations and Covariance
In this chapter, we shall discuss random functions and fields, generally known
as stochastic processes and stochastic fields, respectively, which can simply
be understood as random variables depending on a parameter such as time
or space. We may then consider new means with respect to this parameter
and hence study new properties. We shall concentrate mainly on second order
properties, i.e., properties of the first two moments of these random functions
and fields.
[Figure: a sensor followed by an amplifier with gain G and internal noise.]
This result shows that the covariance function is a quantity which characterises the mean variation of realizations of the stochastic process between times t_1 and t_2.
The most general case occurs when the fields depend on both space and time, as happens, for example, when we measure an electric field at a point r in space at a given time t. The stochastic field is then written E_λ(r, t). We can define the covariance function for the two random variables which represent the field at point r_1 at time t_1 and the field at point r_2 at time t_2:
for the random variables X_λ(t_1) and X_λ(t_2). We could of course generalize this definition to an arbitrary number N of times to consider

Note that the same random event occurs throughout the latter expression. As a special case, we recover the first two moments arising in the expression for the covariance introduced in Section 3.2, viz.,
(1 / (T_2 − T_1)) ∫_{T_1}^{T_2} X_λ(t) dt .
The obvious problem with this definition is that it depends on the choice of
Tl and T2. To get round this difficulty, we take the limit of the above mean
when T2 tends to infinity and Tl tends to minus infinity. Note, however, that
there is no guarantee that such a quantity actually exists, i.e., that the limit
exists. When it does, it will be called the time average, written
Clearly, \overline{X_λ(t)} and \overline{X_λ(t) X_λ(t + τ)} cannot depend on t. However, they may depend on λ and τ. A stochastic process is said to be weakly ergodic or ergodic in the wide sense if it is ergodic up to second order moments, i.e., if \overline{X_λ(t)} and \overline{X_λ(t) X_λ(t + τ)} do not depend on λ.
Note that this definition is a kind of dual to the definition of stationarity.
A stochastic process is (weakly) stationary if the expectation relative to A
removes the dependence on t. A stochastic process is (weakly) ergodic if the
average with respect to t removes the dependence on A. It is common in physics
books to define ergodicity only in the case of stationary stochastic processes.
However, this approach tends to hide the symmetry between the definitions.
Let us now illustrate these ideas with two simple examples.
Consider first the case where X_λ(t) = A_λ and A_λ is a real random variable. This stochastic process is clearly stationary. (When there is no risk of ambiguity, although we speak simply of stationarity and ergodicity, it should be understood that we are referring to weak stationarity and weak ergodicity, up to second order moments.) Indeed, it is easy to see that ⟨X_λ(t)⟩ and ⟨X_λ(t) X_λ(t + τ)⟩ are independent of t, since we have ⟨X_λ(t)⟩ = ⟨A_λ⟩ and ⟨X_λ(t) X_λ(t + τ)⟩ = ⟨(A_λ)²⟩. On the other hand, it is not ergodic because the time average does not eliminate the dependence on λ. Indeed, \overline{X_λ(t)} = A_λ and \overline{X_λ(t) X_λ(t + τ)} = (A_λ)². These results are easy to interpret. As the process is
time-independent, it is invariant under time translations and hence obviously
stationary. Now ergodicity means that time averages should "rub out" any
dependence on the particular realization of the stochastic process that we are
analyzing. In other words, each realization should be representative (up to
the second order moment) of the family of functions defining the stochastic
process. It is clear that this cannot be the case when X_λ(t) = A_λ (unless of course we have the trivial situation in which A_λ is constant as a function of λ, i.e., A_λ is a fixed value).
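This non-ergodic example is easy to simulate; the sketch below is our own illustration. Each realization of X_λ(t) = A_λ is constant in time, so its time average reproduces A_λ and still depends on the realization, while the ensemble average is the same at every time t.

```python
import random

# Simulation (ours) of the non-ergodic process X_lambda(t) = A_lambda:
# every realization is constant in time.
random.seed(7)
n_realizations, n_times = 2000, 50
realizations = [[random.gauss(0.0, 1.0)] * n_times
                for _ in range(n_realizations)]

# Time average of each realization: it equals A_lambda and therefore
# varies from one realization to the next (no ergodicity).
time_avg = [sum(traj) / n_times for traj in realizations]
spread = max(time_avg) - min(time_avg)

# Ensemble average at two different times: identical (stationarity).
ensemble_avg_t0 = sum(traj[0] for traj in realizations) / n_realizations
ensemble_avg_t9 = sum(traj[9] for traj in realizations) / n_realizations
```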
We now discuss the case where X_λ(t) = A cos(ωt + φ_λ), with A a real parameter and φ_λ a real random variable taking values between 0 and 2π with probability density P(φ).
⟨X_λ(t)⟩ = ∫_0^{2π} A cos(ωt + φ) P(φ) dφ ,

and

⟨X_λ(t) X_λ(t + τ)⟩ = (A²/2) cos(ωτ) + (A²/2) ∫_0^{2π} cos(2ωt + ωτ + 2φ) P(φ) dφ .

The second term is not generally independent of time t.
Regarding the question of ergodicity, let us examine the time averages. We have

\overline{X_λ(t)} = lim_{T_1→−∞, T_2→∞} (1/(T_2 − T_1)) ∫_{T_1}^{T_2} A cos(ωt + φ_λ) dt = 0 ,

and

\overline{X_λ(t) X_λ(t + τ)} = lim_{T_1→−∞, T_2→∞} (1/(T_2 − T_1)) ∫_{T_1}^{T_2} (A²/2) [cos(2ωt + ωτ + 2φ_λ) + cos(ωτ)] dt = (A²/2) cos(ωτ) .
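These time averages can be evaluated numerically; the sketch below is our own illustration. Averaging over many whole periods, the time average of A cos(ωt + φ) vanishes and the time-averaged correlation tends to (A²/2) cos(ωτ), independently of the particular phase realization φ.

```python
from math import cos, pi

# Time averages (ours) for X(t) = A*cos(w*t + phi) over whole periods.
A, w = 2.0, 2 * pi               # angular frequency 2*pi -> period 1
tau, phi = 0.3, 1.234            # an arbitrary delay and phase realization
steps_per_period, n_periods = 1000, 200
dt = 1.0 / steps_per_period
n = n_periods * steps_per_period
T = n * dt                        # total averaging window

mean_t = sum(A * cos(w * i * dt + phi) for i in range(n)) * dt / T
corr_t = sum(A * cos(w * i * dt + phi) * A * cos(w * (i * dt + tau) + phi)
             for i in range(n)) * dt / T

expected = (A ** 2 / 2) * cos(w * tau)
```

Changing φ leaves both averages unchanged, which is exactly the ergodicity statement for this process.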
The state of the physical system can thus be represented by a point in a space,
generally of very high dimension, known as the phase space. In reality, the
variables for each particle can only take values within a bounded set and the
state Qt of the system belongs to a bounded subset of phase space which we
shall call S. A physical system is then said to be ergodic if the only invariant
subspaces of S under action of the operator XT [ 1 are the set S itself and
the empty set (see Fig. 3.3). In other words, during its evolution, an ergodic
system visits all its possible states. This would not be the case if, during its
Fig. 3.3. In physics, a system is said to be ergodic if the only subspaces of the phase space that remain invariant under the action of the evolution operator X^τ are the whole space and the empty set
notions of stationarity and ergodicity in the wide sense, since this approach
has the advantage of being less restrictive.
Γ_XX(r_1, r_2) = Γ_XX(R_ω[r_1], R_ω[r_2]) .
The question of ergodicity is a little more delicate and we shall limit our
discussion to the effects of translation. We define spatial averages by
where |V| is the volume of the subspace V and V → ∞ means that the volume increases to cover the whole space. Note that (x, y, z) represents the
components of r.
where the function mod_N[n] is defined by mod_N[n] = n − pN and p is a whole number chosen so that n − pN ∈ [1, N] (see Fig. 3.4).
[Fig. 3.4: a realization of the random sequence X_λ(n) and its periodized version.]
We can calculate the discrete Fourier transform of X_λ(n), since it is a finite sequence. (There is thus no problem of non-convergence as might happen with a continuous signal, or a signal with unbounded temporal support.) The result is
X_λ(n) = (1/N) ∑_{ν=0}^{N−1} X̂_λ(ν) exp(i 2πνn/N) ,

where

X̂_λ(ν) = ∑_{n=1}^{N} X_λ(n) exp(−i 2πνn/N) .
Hence,
.27rVIn) [.27rV2(n+m')]
xexp ( -l~ exp 1 N '
or
if v = 0,
otherwise.
Finally, we obtain

⟨X̂_λ(ν_1) [X̂_λ(ν_2)]*⟩ = N² r̂(ν_1) δ_{ν_1, ν_2} ,   (3.1)

where

r̂(ν) = (1/N) ∑_{m=0}^{N−1} r(m) exp(i 2πνm/N) .
X̂_λ(ν) = ∑_{n=1}^{N} X_λ(n) exp(−i 2πνn/N) ,

and hence,

⟨X̂_λ(ν)⟩ = 0 .

Indeed,

⟨X̂_λ(ν)⟩ = ∑_{n=1}^{N} ⟨X_λ(n)⟩ exp(−i 2πνn/N) ,

and

⟨X_λ(n)⟩ = ⟨X_λ(1)⟩ = m_X = 0 ,

so that

⟨X̂_λ(ν)⟩ = m_X N δ_{ν,0} ,
central limit theorem, as we shall see in the next chapter.] We shall show that X̂_λ(ν) is then a complex random variable with probability density

where α and β are real and represent the real and imaginary parts of the complex variable x = α + iβ. In particular, note that the phase of X̂_λ(ν) is random and uniformly distributed over the interval from 0 to 2π.
Given that the probability density of X̂_λ(ν) is Gaussian with zero mean, we need only calculate the second order moments. Considering nonzero frequencies, we set X̂_λ(ν) = X̂_λ^R(ν) + i X̂_λ^I(ν), where X̂_λ^R(ν) and X̂_λ^I(ν) are the real and imaginary parts of X̂_λ(ν), respectively. Hence,

X̂_λ^R(ν) = ∑_{n=1}^{N} X_λ(n) cos(2πνn/N) ,

and
Since

∑_{n=1}^{N} cos[2πν(2n + m')/N] = 0 ,

we obtain

∑_{m'=1}^{N} r(m') cos(2πνm'/N) = N r̂(ν) ,

and hence,

⟨[X̂_λ^R(ν)]²⟩ = (N²/2) r̂(ν) .
⟨X̂_λ^R(ν) X̂_λ^I(ν)⟩ = −(1/2) ∑_{n=1}^{N} ∑_{m'=1}^{N} r(m') { sin[2πν(2n + m')/N] + sin(2πνm'/N) } .

Now

∑_{n=1}^{N} sin[2πν(2n + m')/N] = 0 ,

and since r(m) is an even function, we have

∑_{m'=1}^{N} r(m') sin(2πνm'/N) = 0 ,

and

⟨X̂_λ^R(ν) X̂_λ^I(ν)⟩ = 0 .
This explains the form of the probability density function mentioned above.
³ For real cyclostationary sequences, we have

⟨X_λ(t)⟩ = \overline{X_λ(t)} ,
If we assume that we can change the order of the integrals in the expectation values and the time averages,⁴ we obtain the fundamental relations

\overline{X_λ(t)} = ⟨X_λ(t)⟩ ,

and

\overline{X_λ(t) X_λ(t + τ)} = ⟨X_λ(t) X_λ(t + τ)⟩ .

We thus obtain the fundamental result that the ensemble average ⟨X_λ(t)⟩ and the covariance function Γ_XX(τ) can be obtained by calculating the time averages \overline{X_λ(t)} and \overline{X_λ(t) X_λ(t + τ)}. The latter are more easily estimated than the expectation values, which require us to carry out independent experiments. [Note that the covariance function Γ_XX(t, t + τ) is then equal to the centered correlation function C_{XX}(t, t + τ) = \overline{X_λ(t) X_λ(t + τ)} − \overline{X_λ(t)} \overline{X_λ(t + τ)} .]
In the case of homogeneous and ergodic real stochastic fields,

⟨X_λ(r)⟩ = \overline{X_λ(r)} ,

and

⟨X_λ(r) X_λ(r + d)⟩ = \overline{X_λ(r) X_λ(r + d)} ,

whilst the spatial averages are defined as before, by integrating over a volume V and letting V → ∞.
The electric field of an optical wave oscillates about a zero value and we thus see that m_E(r) is zero.
To begin with, consider the case of a point light source and assume that the
light is able to follow two different paths, as happens in the Mach-Zehnder
interferometer shown in Fig. 3.5.
The dependence on the space variable r is irrelevant so, to simplify the notation, we write the field before the beam splitter simply as E_λ(t). We also assume that ⟨E_λ(t)⟩ = 0. The effect of the two arms of the interferometer is to introduce different delays, denoted τ_1 in the first arm and τ_2 in the second arm. The electric field at the detector is proportional to E_λ(t − τ_1) + E_λ(t − τ_2) and the intensity is thus proportional to |E_λ(t − τ_1) + E_λ(t − τ_2)|², which can be written
4fV\\___:C_
- ~~ _./_.-
71---
Emitter i i Receiver
'\ 7
. .
. 'r .
. -.-.-~-.-.-.
If we assume that the electric field is stationary and ergodic, a good approximation for the intensity can be found by integrating over a sufficiently long time interval to obtain
  2σ² + Γ_EE(τ₂ − τ₁) + Γ_EE(τ₁ − τ₂) ,
where σ² = ⟨|E_λ(t)|²⟩ and Γ_EE(τ₂ − τ₁) = ⟨[E_λ(t − τ₁)]* E_λ(t − τ₂)⟩. We thus find that, if the field is temporally incoherent for large differences |τ₁ − τ₂|, the intensity at the detector will be independent of τ₁ and τ₂. On the other hand, if the field is coherent for certain values of |τ₁ − τ₂|, it will vary as a function of τ₁ − τ₂. It is sometimes possible to define a coherence time τ_c. This may be taken as the time τ_c for which Γ_EE(τ_c) = Γ_EE(0)/e. For example, if we have Γ_EE(τ) = Γ₀ exp(−|τ|/a) cos(ωτ), the coherence time will be defined by τ_c = a.
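A numerical sketch of the interferometer formula may help here (the covariance model is the one just quoted; the values of a and ω, and the dropped proportionality constant, are assumptions for illustration):

```python
import numpy as np

# Model covariance Gamma_EE(tau) = G0 * exp(-|tau|/a) * cos(w*tau), for
# which the coherence time is tau_c = a; a and w are arbitrary choices.
G0, a, w = 1.0, 1e-12, 2 * np.pi * 5e14

def gamma_ee(tau):
    return G0 * np.exp(-np.abs(tau) / a) * np.cos(w * tau)

def intensity(t1, t2, sigma2=G0):
    # I is proportional to 2*sigma^2 + Gamma_EE(t2 - t1) + Gamma_EE(t1 - t2).
    return 2 * sigma2 + gamma_ee(t2 - t1) + gamma_ee(t1 - t2)

i_coherent = intensity(0.0, 0.05 * a)   # delays well inside tau_c: fringes
i_incoherent = intensity(0.0, 50 * a)   # |t1 - t2| >> tau_c: I ~ 2*sigma^2
```

For delay differences well beyond the coherence time the interference terms vanish and the intensity no longer depends on the delays.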
In an analogous manner, if we consider the electric field at the same time
but at two different points, we can sometimes define a coherence length. For
example, if the covariance function decays with the separation d of the two points, the coherence length can be defined by analogy with the coherence time.
In vacuum, the electric field satisfies the wave equation
  ΔE(r, t) − (1/c²) ∂²E(r, t)/∂t² = 0 ,
which is of the general form
  H[X(r, t)] = 0 .
Let us assume that the evolution of the relevant field X (r, t) is described by a
linear partial differential equation. Recall that a partial differential equation is
said to be linear if, for all fields X₁(r, t) and X₂(r, t) satisfying H[X₁(r, t)] = 0 and H[X₂(r, t)] = 0, and for all scalars a and b, we have H[aX₁(r, t) + bX₂(r, t)] = 0.
The field and its covariance evolve according to the same partial differential
equation.
If we also assume that the stochastic field is stationary and homogeneous,
we have
  Γ_XX(r, t) = ⟨X_λ(r₁, t₁) X_λ(r₁ + r, t₁ + t)⟩ − ⟨X_λ(r₁, t₁)⟩ ⟨X_λ(r₁ + r, t₁ + t)⟩ ,
and hence,
  H[Γ_XX(r, t)] = 0 ,
where H[ ] applies to the variables r and t.
Let us consider the particular case where an electromagnetic wave propagates in vacuum. The covariance function of the electric field thus satisfies the equation
  Δ₁ Γ_EE(r₁, t₁, r₂, t₂) − (1/c²) ∂²Γ_EE(r₁, t₁, r₂, t₂)/∂t₁² = 0 ,

44 3 Fluctuations and Covariance

where
  Δ₁ = ∂²/∂x₁² + ∂²/∂y₁² + ∂²/∂z₁²
and r₁ = (x₁, y₁, z₁)ᵀ. In the particular case where the electric field of an electromagnetic wave is stationary and homogeneous, its covariance function evolves according to the equation
  Δ Γ_EE(r, t) − (1/c²) ∂²Γ_EE(r, t)/∂t² = 0 ,
where
  Δ = ∂²/∂x² + ∂²/∂y² + ∂²/∂z²
and r = (x, y, z)ᵀ. In this way, we can describe the evolution of coherence in optics.
optics.
does not generally exist, because the phase of ∫_{T₁}^{T₂} X_λ(t) e^{−i2πνt} dt may not converge. We thus define the power spectral density of X_λ(t), also called the spectrum of the signal, by
  S_XX(ν) = lim_{T₁→−∞, T₂→∞} (1/(T₂ − T₁)) ⟨ | ∫_{T₁}^{T₂} X_λ(t) e^{−i2πνt} dt |² ⟩ .
The total power of the signal is then
  P_X = ∫_{−∞}^{∞} S_XX(ν) dν .
For stochastic fields, the spectral density S_XX(k) is defined in an analogous way as a function of the spatial frequency k, and we also have
  P_X = ∫_{−∞}^{∞} S_XX(k) dk .
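This definition can be checked numerically. The sketch below (segment sizes and sampling step are arbitrary choices) estimates S_XX(ν) for unit-variance white noise by averaging |∫ X_λ(t) e^{−i2πνt} dt|²/(T₂ − T₁) over independent segments, and recovers the total power P_X:

```python
import numpy as np

rng = np.random.default_rng(1)

# Averaged-periodogram estimate of S_XX(nu) for unit white noise sampled
# at dt = 1, with the normalization |FT|^2 / (segment duration).
n_seg, seg_len, dt = 400, 1024, 1.0
spectra = np.zeros(seg_len)
for _ in range(n_seg):
    x = rng.normal(0.0, 1.0, seg_len)
    xf = np.fft.fft(x) * dt
    spectra += np.abs(xf) ** 2 / (seg_len * dt)
s_xx = spectra / n_seg          # flat spectrum, close to 1 at every frequency

# Total power P_X = integral of S_XX over frequency: recovers the variance.
d_nu = 1.0 / (seg_len * dt)
p_x = s_xx.sum() * d_nu
```

Averaging over segments is essential: a single periodogram fluctuates as much as its mean, which is why the expectation appears in the definition.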
Here we find once again the property of time translation invariance used to
define stationarity. It should be remembered, however, that we are now con-
cerned with the idea of a stationary system rather than a stationary stochastic
process. Stationarity of a system means that it possesses no internal clock and
that it therefore reacts in the same way at whatever instant of time the input
signal is applied. Most stationary linear systems possess a relation between input and output that can be written in the form of a convolution:
  s(t) = ∫_{−∞}^{∞} χ(t − t′) e(t′) dt′ ,
where χ(t) represents the convolution kernel (see Fig. 3.6). In this case we speak of a convolution filter, or more simply, a linear filter. It is a well-known mathematical result that the Fourier transforms ê(ν) and ŝ(ν) of e(t) and s(t) are related by
  ŝ(ν) = χ̂(ν) ê(ν) ,
where χ̂(ν) is the Fourier transform of χ(t). In physics, χ(t) is called the susceptibility or impulse response, since it is the response of the system if a Dirac impulse is applied as input, viz., e(t) = δ(t).
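A discrete analogue of this result is easy to verify (the input signal and kernel below are arbitrary): circular convolution with a kernel corresponds, after a discrete Fourier transform, to multiplication of the transforms.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 256
e = rng.normal(size=n)                  # input signal e(t)
chi = np.exp(-np.arange(n) / 10.0)      # causal impulse response chi(t)
chi /= chi.sum()                        # normalize the DC gain to 1

# Output by explicit circular convolution: s(t) = sum_k chi(k) e(t - k).
s = np.array([np.sum(chi * e[(t - np.arange(n)) % n]) for t in range(n)])

# Its DFT equals the product of the DFTs: the discrete form of s = chi.e.
lhs = np.fft.fft(s)
rhs = np.fft.fft(chi) * np.fft.fft(e)
```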
In physics, one usually measures the response function σ(t), which is defined (see Fig. 3.7) as the response to an input of the form e(t) = e₀[1 − θ(t)]. Here the Heaviside step function is defined by θ(t) = 1 if t ≥ 0 and θ(t) = 0 if t < 0. We then have
  σ(t) = ∫_{−∞}^{∞} χ(t′) e(t − t′) dt′ = e₀ ∫_{t}^{∞} χ(t′) dt′ ,
leading to
  χ(t) = −(1/e₀) dσ(t)/dt .
3.10 Filters and Fluctuations 49
Let Y_λ(t) denote the result of filtering the stationary stochastic process X_λ(t) with the kernel χ(t). Its covariance function is
  Γ_YY(τ) = ⟨ ∫∫ χ*(t₁) χ(t₂) X*_λ(t − t₁) X_λ(t + τ − t₂) dt₁ dt₂ ⟩ .
If we assume that we can change the order of the various integrals, we then find that
  Γ_YY(τ) = ∫∫ χ*(t₁) χ(t₂) ⟨X*_λ(t − t₁) X_λ(t + τ − t₂)⟩ dt₁ dt₂ = ∫∫ χ*(t₁) χ(t₂) Γ_XX(τ + t₁ − t₂) dt₁ dt₂ ,
and hence, taking Fourier transforms,
  S_YY(ν) = |χ̂(ν)|² S_XX(ν) .
It is clearly more convenient to use a vector notation and define the spatial Fourier transform by
  X̂(k) = ∫ X(r) e^{−i2πk·r} dr ,
which reads, for the filtering relation,
  Ŷ(k) = χ̂(k) X̂(k) ,
i.e.,
  S_YY(k) = |χ̂(k)|² S_XX(k) .
In this section, we shall illustrate the above ideas in the context of optical
imaging. We therefore consider an imaging system between the plane PI of
the object and the plane P2 on which the image is formed (see Fig. 3.8).
Fig. 3.8. Imaging system between the object plane P₁ and the image plane P₂; A_λ(x, y, t) denotes the field in the image plane

The field amplitude in the image plane is given by
  A_λ(x, y, t) = ∫∫ h(x − ξ, y − η) E_λ(ξ, η, t) dξ dη ,
or more simply
  A_λ(r, t) = ∫ h(r − u) E_λ(u, t) du .
Note that, in this section, we are neglecting delays due to propagation of
optical signals. It is easy to show that they have little effect on the results of
our analysis in this context.
In order to keep the notation as simple as possible, we shall not take
into account the magnification factors present in most optical systems. The
received intensity I_R(r, t) = |A_λ(r, t)|² is thus
  I_R(r, t) = I₀(t) | ∫ h(r − u) F(u) du |² .
The situation is therefore very different from the totally incoherent case. The
relation is in fact linear and stationary in the field amplitude, and hence non-
linear in the intensity, in contrast to the case of the totally incoherent field.
The result that we have just established in optics is encountered in many
different areas of physics. A convolution relation in amplitude in the case of
correlated fields becomes a convolution relation in intensity in the case of
uncorrelated fields.
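A 1-D numerical sketch makes the contrast concrete (the Gaussian kernel and the two source positions are arbitrary choices): for two point sources, the coherent image adds amplitudes before squaring, whereas the incoherent image adds intensities.

```python
import numpy as np

n = 101
x = np.arange(n)
sigma = 4.0
h = np.exp(-((x - 50) ** 2) / (2 * sigma ** 2))  # amplitude spread function

h_at = lambda c: np.roll(h, c - 50)              # kernel centred on pixel c

amp = h_at(45) + h_at(55)                        # coherent: amplitudes add
i_coherent = np.abs(amp) ** 2
i_incoherent = h_at(45) ** 2 + h_at(55) ** 2     # incoherent: intensities add

# At the midpoint the two equal amplitudes a add constructively:
# (a + a)^2 / (a^2 + a^2) = 2, so the coherent image is twice as bright there.
mid_ratio = i_coherent[50] / i_incoherent[50]
```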
" ( ) 1 8 2 E(x, y, z, t)
uE x, y, z, t - 2" 8 2 = 0,
c t
where
c is the speed of light in vacuum, and x, y, z are the coordinates of the point r.
In the following chapter, we shall see that the diffusion equation is a partial
differential equation describing macroscopic phenomena such as the diffusion
of particles through a solvent. In a homogeneous medium, this equation is
"N( ) _ ~ 8N(x, y, z, t) = 0
u x,y,z,t 2 8 '
X t
where N(x, y, z, t) represents the concentration of particles at time t at the
point with coordinates x, y, z, and X is the diffusion coefficient.
In this section we shall therefore assume that the field X (r, t) that interests
us evolves according to a linear partial differential equation which we shall
write in the form
H[X(r,t)] =0.
We use the notation r = (x, y, z)ᵀ to shorten the equations. We can now introduce the Green function G(r, t, r′, t′) (where t > t′), which solves the partial differential equation
  H[G(r, t, r′, t′)] = 0
with the initial condition G(r, t′, r′, t′) = δ(r − r′).
Symbolically, we write
  X_λ(r, t) = G(r, t, r′, t′) X_λ(r′, t′) ,
or in long-hand,
  G(r, t, r′, t′) F(r′) = ∫∫∫ G(x, y, z, t, x′, y′, z′, t′) F(x′, y′, z′) dx′ dy′ dz′ .
We thus obtain X_λ(r₁, t₁) = G(r₁, t₁, r, t) X_λ(r, t), where t₁ ≥ t, and hence,
  ⟨X*_λ(r₁, t₁) X_λ(r₂, t₂)⟩ = G*(r₁, t₁, r′₁, t′₁) G(r₂, t₂, r′₂, t′₂) ⟨X*_λ(r′₁, t′₁) X_λ(r′₂, t′₂)⟩ ,
where the Green functions act on the primed variables. Let X_F(r, t) denote the solution of
  H[X(r, t)] = 0
with initial conditions X(r, t′) = F(r). This partial differential equation is said to be stationary if, for any initial conditions X(r, t′) = F(r) and any T, the solution with initial conditions X(r, t′ + T) = F(r) is simply X_F(r, t + T). In this case the Green function can be written in the form G(r, r′, t − t′).
3.12 Green Functions and Fluctuations 55
and hence, for a stationary equation,
  ⟨X*_λ(r₁, t₁) X_λ(r₂, t₂)⟩ = G*(r₁, r′₁, t₁ − t′₁) G(r₂, r′₂, t₂ − t′₂) ⟨X*_λ(r′₁, t′₁) X_λ(r′₂, t′₂)⟩ .

Fig. 3.9. Stationarity: the fields X_λ(r′₁, t′₁) and X_λ(r′₂, t′₂) are propagated toward X_λ(r₁, t₁) and X_λ(r₂, t₂) by the same Green function
  Γ_XX(x₁, y₁, z₁, x₂, y₂, z₂, τ) = ∫∫∫ G(x₂ − x′₂, y₂ − y′₂, z₂ − z′₂, τ − τ′) Γ_XX(x₁, y₁, z₁, x′₂, y′₂, z′₂, τ′) dx′₂ dy′₂ dz′₂ .
We thus note that, as time goes by, there is a spatial filtering of the covariance function Γ_XX(r₁, r₂, τ′) by the convolution kernel G(r₂, τ − τ′) which represents the Green function.
It is no surprise that the covariance function and the field itself are filtered
by the same kernel in the Green function formulation, since they obey the same
partial differential equation (see Section 3.8).
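As a sketch of this filtering effect, take the 1-D diffusion equation (the initial covariance profile, the diffusion constant and the elapsed time below are arbitrary choices): its Green function is the heat kernel, and convolving a covariance profile with it lowers and broadens the profile while preserving its integral.

```python
import numpy as np

# The heat kernel G(x, t) = exp(-x^2/(4*chi*t)) / sqrt(4*pi*chi*t) is the
# Green function of the 1-D diffusion equation; convolving a covariance
# profile with it reproduces the spatial filtering described above.
chi, t = 1.0, 0.5
x = np.linspace(-20.0, 20.0, 2001)
dx = x[1] - x[0]

gamma0 = np.exp(-x ** 2)                   # initial covariance profile
g = np.exp(-x ** 2 / (4 * chi * t)) / np.sqrt(4 * np.pi * chi * t)

gamma_t = np.convolve(gamma0, g, mode="same") * dx

area0 = gamma0.sum() * dx                  # filtering preserves the area...
area_t = gamma_t.sum() * dx                # ...but lowers and widens the peak
```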
with
  Γ_ij(r₁, t₁, r₂, t₂) = ⟨X*_{i,λ}(r₁, t₁) X_{j,λ}(r₂, t₂)⟩ − ⟨X*_{i,λ}(r₁, t₁)⟩ ⟨X_{j,λ}(r₂, t₂)⟩ .
Several special cases can now be studied on the basis of this definition. For
example, we may be concerned with only two coordinates, or we may wish
to study the covariance matrix at a single point r or at the same times. This
is what happens classically when we analyze the polarization properties of
electromagnetic waves, as we shall see in Section 3.14.
We consider the electric field E_λ(r, t) of light. (We could also consider the magnetic field.) In vacuum, if the light propagates along the Oz axis, the vector E_λ(r, t) lies in the plane Ox, Oy. We may therefore write
  E_λ(r, t) = E^x_λ(r, t) e_x + E^y_λ(r, t) e_y ,
where e_x and e_y are unit vectors along the orthogonal axes Ox and Oy. We shall assume that the wave is stationary and homogeneous. We shall not be concerned with properties that depend on the space coordinates and we shall no longer include the dependence on r. Letting ν₀ denote the central frequency of the optical wave and writing U^x_λ(t) and U^y_λ(t) for the components of the complex amplitude, the covariance matrix is
  Γ(τ) = ( ⟨[U^x_λ(t)]* U^x_λ(t + τ)⟩   ⟨[U^x_λ(t)]* U^y_λ(t + τ)⟩
           ⟨[U^y_λ(t)]* U^x_λ(t + τ)⟩   ⟨[U^y_λ(t)]* U^y_λ(t + τ)⟩ ) .
In practice, in the field of optics, one often defines the coherency matrix, which is the covariance matrix when τ = 0. This matrix provides interesting
information about the polarization state of the light. We shall illustrate this
point using two concrete examples and in the two limiting cases of perfectly
coherent and perfectly incoherent light.
We begin with the case of perfectly coherent light. If the light is linearly polarized along the Ox axis, we can write
  J = ⟨|U^x_λ(t)|²⟩ ( 1  0
                      0  0 ) .

Fig. 3.10. Examples of polarization states: linear horizontal, linear vertical, and right elliptic
If the incoherent light is totally unpolarized, this means that we can write
  J = (I/2) ( 1  0
              0  1 ) .
We can also define different polarization states for incoherent light which are intermediate between the two cases we have just described. The general coherency matrix is
  J = ( I_x  ρ
        ρ*   I_y ) .
Like any covariance matrix, the coherency matrix is Hermitian. It can therefore be diagonalized and has orthogonal eigenvectors. Like any covariance matrix (see Section 2.8), it is positive and its eigenvalues are therefore positive. We denote them by λ₁ and λ₂, where λ₁ ≥ λ₂. (The eigenvalues of partially polarized light are represented schematically in Fig. 3.11.) As the eigenvectors are orthogonal, the change of basis matrix M used to diagonalize the coherency matrix (i.e., such that M J M† is diagonal, where M† is the conjugate transpose of M) is therefore unitary, i.e., it satisfies the relation M M† = M† M = Id₂, where Id₂ is the 2 × 2 identity matrix. It is common practice to define the degree of polarization of light by
  P = (λ₁ − λ₂)/(λ₁ + λ₂) .
There are two invariants under orthonormal basis change (i.e., when the change of basis matrix is unitary), viz., the trace T and the determinant D of the matrix. Since T = λ₁ + λ₂ and D = λ₁λ₂, it is a straightforward matter to deduce that
  P = √(1 − 4D/T²) .
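The equivalence of the eigenvalue form and the invariant form of the degree of polarization can be checked on any Hermitian coherency matrix; the matrix below is an arbitrary example, not taken from the text.

```python
import numpy as np

# Degree of polarization of a coherency matrix J, computed two ways: from
# the eigenvalues, P = (l1 - l2)/(l1 + l2), and from the two unitary
# invariants, P = sqrt(1 - 4 det(J) / trace(J)^2).
J = np.array([[1.0, 0.3 + 0.2j],
              [0.3 - 0.2j, 0.5]])          # arbitrary Hermitian example

l2, l1 = np.linalg.eigvalsh(J)             # ascending order: l2 <= l1
p_eig = (l1 - l2) / (l1 + l2)
p_inv = np.sqrt(1.0 - 4.0 * np.linalg.det(J).real / np.trace(J).real ** 2)
```

Both expressions agree because (λ₁ − λ₂)² = T² − 4D for a 2 × 2 matrix.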
3.15 Ergodicity and Polarization of Light 61

The coherency matrix of light that is linearly polarized along an axis making an angle θ to the Ox axis is given by
  J = ⟨|U^x_λ(t)|²⟩ ( cos²θ        cos θ sin θ
                      cos θ sin θ  sin²θ ) .
It is easy to check that this has zero determinant and hence that P = 1.
The same is true for circularly polarized light, whose coherency matrix is proportional to
  (  1  −i
     i   1 ) ,
where the sign of the off-diagonal terms is fixed by the handedness.
We shall now bring out another aspect which clearly illustrates the phe-
nomenological nature of the idea of stochastic process. To begin with, we note
that in the above discussion there is no conceptual difference on the mathe-
matical level between perfectly coherent linearly polarized light and perfectly
incoherent linearly polarized light. We have just seen that, in the first case, the direction of polarization is the same, e.g. along Ox, for every realization of the field.

Fig. 3.12. Light with linear polarization along an axis making a random angle θ_μ with the Ox axis; the angle is uniformly distributed between 0 and 2π

For light that is linearly polarized along an axis making a random angle θ_μ with the Ox axis (see Fig. 3.12), we can write
  A_{λ,μ}(t) = A_λ(t) (cos θ_μ e_x + sin θ_μ e_y) ,
where θ_μ is a random variable uniformly distributed between 0 and 2π, and the time-averaged coherency matrix of a given realization is
  J_μ = ⟨|A_λ(t)|²⟩ ( cos²θ_μ           cos θ_μ sin θ_μ
                      cos θ_μ sin θ_μ   sin²θ_μ ) .
The above time averages do not therefore remove the dependence on the random event μ, and this shows clearly that A_{λ,μ}(t) is not ergodic.
To conclude, we find the same coherency matrices for incoherent light that is completely unpolarized and for incoherent light that is completely polarized along an axis at a uniformly distributed angle between 0 and 2π relative to some reference axis. The difference between these two cases is indeed related to their ergodicity property. Generally speaking, caution is required when dealing with covariance matrices of non-ergodic processes.
or
  ∫∫ F(μ, τ) dμ dτ = ∫∫ G(t₁, t₂) dt₁ dt₂ .
We now define
  Λ_T(t) = 1 − |t|/T  if |t| < T ,
  Λ_T(t) = 0          otherwise.
Provided we assume that we can change the order of the limit and the integral, with the above notation we then have
  S_XX(ν) = lim_{T→∞} ∫_{−∞}^{∞} Λ_T(τ) Γ_XX(τ) e^{−i2πντ} dτ ,
and hence,
  S_XX(ν) = ∫_{−∞}^{∞} Γ_XX(τ) e^{−i2πντ} dτ .
This is precisely the Wiener-Khinchine theorem. It says that the power spectral density S_XX(ν) of a stochastic process which is stationary to second order is equal to the Fourier transform of its covariance Γ_XX(τ).
Exercises
Exercise 3.1.
X_λ and Y_λ are two random variables with variances σ_X² and σ_Y², respectively. Denoting their correlation coefficient by Γ_XY, show that Γ_XY ≤ (σ_X² + σ_Y²)/2. Use the fact that
  ⟨(X′_λ − Y′_λ)²⟩ ≥ 0 ,
where X′_λ = X_λ − ⟨X_λ⟩ and Y′_λ = Y_λ − ⟨Y_λ⟩.
  P_B(b) = b^{r−1} / [a^r Γ(r)] exp(−b/a)  if b > 0 ,
  P_B(b) = 0                                otherwise,
where Γ(r) is the Gamma function. Setting Z_λ(t) = ln Y_λ(t), calculate the probability density function of the fluctuations in Z_λ(t).
Let h(t) be a periodic function with period T. Using h(t), we construct the stochastic process
  Ω → ℝ : λ ↦ X_λ(t) = h(t − τ_λ) ,
where Ω is the space of random events λ, and ℝ is the set of real numbers. We assume that the probability density function for τ_λ is constant in the interval [0, T]. We will be interested in the ergodicity and stationarity in the sense of first and second order moments.
(1) Determine whether or not this stochastic process is stationary.
(2) Determine whether or not it is ergodic.
where λ is a random variable with values in the interval [0, 2π] and i² = −1.
(1) Let f̂_λ(ν) be the Fourier transform of f_λ(t). Determine the phase of f̂_λ(ν).
(2) Is f_λ(t) weakly stationary?
(3) What can you deduce concerning ⟨f̂_λ(ν)⟩, where ⟨ ⟩ denotes the mean with respect to λ?
(4) Determine the Fourier transform of a signal of the same form in the case where the φ_{n,λ} are independent random variables uniformly distributed over the interval [0, 2π].
(5) What can you deduce concerning ⟨f̂*_λ(ν₁) f̂_λ(ν₂)⟩?
(1) Calculate the autocorrelation function of the noise after filtering X_λ(t) by h(t) when B → +∞.
(2) Deduce the total power of the fluctuations after filtering.
(3) What happens if a → +∞?
Let X_λ(t) be a weakly stationary real stochastic process such that ⟨X_λ(t)⟩ = 0. We define
  Y_λ(t) = ∫_t^{t+T} X_λ(ξ) dξ .
(1) Express the spectral density of Y_λ(t) in terms of the spectral density of X_λ(t).
(2) What happens if the spectral density S_XX(ν) of X_λ(t) is such that S_XX(ν) = σ²δ(ν − n/T), where n is a nonzero natural number and δ(x) is the Dirac distribution?
(3) If now ⟨X_λ(t₁)X_λ(t₂)⟩ = δ(t₁ − t₂), what happens to the spectral density of Y_λ(t)?
(4) How does the power of Y_λ(t) vary?
Let X_λ(t) and Y_λ(t) be two real random signals, both stationary with finite power, where Y_λ(t) is the result of filtering X_λ(t) by a convolution filter (hence linear and stationary) with impulse response h(t). We write Γ_XY(τ) = ⟨X_λ(t)Y_λ(t + τ)⟩ and Γ_XX(τ) = ⟨X_λ(t)X_λ(t + τ)⟩, where ⟨ ⟩ represents the mean with respect to outcomes of random events λ.
We wish to estimate h(t) from Γ_XY(τ) and Γ_XX(τ).
(1) Determine h(t) in terms of Γ_XY(τ) and Γ_XX(τ).
(2) Write down the Fourier transform of this relation.
(3) What condition can you deduce on the spectral density of X_λ(t) in order to determine h(t)?
(4) What happens if X_λ(t) is white noise in the frequency band between −B and B?
4
Limit Theorems and Fluctuations
Sums of random variables are a fascinating subject for they lead to certain
universal types of behavior. More precisely, if we add together independent random variables distributed according to the same probability density function, and not too widely scattered in the sense that they have a finite second
moment, then the new random variable obtained in this way will be described
to a good approximation by a Gaussian random variable. This property has
a great many applications in physics. We shall describe a certain number of
them: the random walk, speckle in coherent imaging, particle diffusion, and
Gaussian noise, which is a widely used model in physics. The Gaussian distri-
bution is not the only one to appear as a limiting case. The Poisson distribu-
tion can be introduced by analogous arguments and it is also very important
because it provides simple models of fluctuations resulting from detection of
low particle fluxes.
where λ = [λ(1), λ(2), ..., λ(n)]. Let us determine the mean and variance of S_λ(n). We have
  ⟨S_λ(n)⟩ = ∫ (x₁ + x₂ + ⋯ + x_n) P(x₁, x₂, ..., x_n) dx₁ dx₂ ⋯ dx_n ,
where P(x₁, x₂, ..., x_n) is the joint probability density function of the random variables X_{λ(1)}, X_{λ(2)}, ..., X_{λ(n)}. Since
  ∫ x_j P(x₁, x₂, ..., x_n) dx₁ dx₂ ⋯ dx_n = ⟨X_{λ(j)}⟩ = m_j ,
we deduce that
  ⟨S_λ(n)⟩ = Σ_{j=1}^{n} m_j .
In order to analyze the behavior of [S_λ(n)]², we introduce the covariance Γ_ij = ⟨X_{λ(i)}X_{λ(j)}⟩ − m_i m_j, which can also be written
  Γ_ij = ⟨(X_{λ(i)} − m_i)(X_{λ(j)} − m_j)⟩ .
Given that [⟨S_λ(n)⟩]² = (Σ_{j=1}^{n} m_j)² = Σ_{i=1}^{n} Σ_{j=1}^{n} m_i m_j, it is easy to see that the variance of S_λ(n) can be written
  ⟨[S_λ(n)]²⟩ − [⟨S_λ(n)⟩]² = Σ_{i=1}^{n} Σ_{j=1}^{n} Γ_ij .
The second moments ⟨[X_{λ(i)}]²⟩ are not simply additive. However, if the random variables X_{λ(1)}, X_{λ(2)}, ..., X_{λ(n)} are uncorrelated, then we have by definition that Γ_ij = σ_i² if i = j and Γ_ij = 0 otherwise. We thus obtain
  ⟨[S_λ(n)]²⟩ − [⟨S_λ(n)⟩]² = Σ_{j=1}^{n} σ_j² .
Consider now the case of n successive measurements of a physical quantity. If the characteristics of the noise change little during the measurement time, we can describe X_{λ(1)}, X_{λ(2)}, ..., X_{λ(n)} by random variables distributed according to the same probability density function with mean m and variance σ². We can define the empirical mean of the n measurements by μ_λ(n) = S_λ(n)/n, usually referred to as the sample mean. The expectation value and standard deviation (denoted by σ_μ) of μ_λ(n) are thus simply ⟨μ_λ(n)⟩ = m and σ_μ = σ/√n.
In other words, the mathematical expectation of the sample mean of the n uncorrelated measurements is just the statistical mean of X_{λ(j)}, whilst the standard deviation, and therefore the spread about the mean, is reduced by a factor of √n. This is why it is useful to carry out several measurements and take the average. In practice, the situation is often as follows. From n independent measurements X_{λ(1)}, X_{λ(2)}, ..., X_{λ(n)}, we can estimate the mean by μ_λ(n) = S_λ(n)/n and the variance by
  η_λ(n) = (1/n) Σ_{j=1}^{n} [X_{λ(j)} − μ_λ(n)]² .
(We shall see in Chapter 7 that this estimator is biased, but that for large n the bias is low.) We should thus retain from this that η_λ(n) is an estimate of the variance of X_{λ(j)} and that the variance of μ_λ(n) is rather of the order of η_λ(n)/n. This is indeed a useful result for plotting error bars in experimental measurements.
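A short simulation illustrates these orders of magnitude (the Gaussian noise model and all parameter values below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# n measurements with mean m and variance sigma^2, repeated over many
# simulated experiments: the sample mean has expectation m and standard
# deviation sigma/sqrt(n).
m, sigma, n, trials = 10.0, 2.0, 100, 5000
x = rng.normal(m, sigma, size=(trials, n))

mu = x.mean(axis=1)             # sample mean of each simulated experiment
spread = mu.std()               # close to sigma/sqrt(n) = 0.2
eta = x.var(axis=1, ddof=0)     # biased variance estimate eta(n)
```

The spread of the sample means is indeed about √n times smaller than the spread of a single measurement, which is what error bars of size √(η_λ(n)/n) express.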
We can determine the probability density function of a sum of independent random variables. Consider first the case of two random variables X_λ and Y_λ distributed according to probability density functions P_X(x) and P_Y(y), respectively. From two independent realizations X_{λ(1)} and Y_{λ(2)}, we define a new random variable Z_μ = X_{λ(1)} + Y_{λ(2)}, where we have set μ = (λ(1), λ(2)). We shall now investigate the probability density function P_Z(z) of Z_μ, using F_Z(z) to denote its distribution function. To simplify the argument, we assume that P_X(x) is a continuous function. For a fixed value x of X_{λ(1)}, the probability that Z_μ is less than z is equal to F_Y(z − x). Now the probability that X_{λ(1)} lies between x − dx/2 and x + dx/2 is equal to P_X(x)dx. The probability that X_{λ(1)} lies between x − dx/2 and x + dx/2 and that Z_μ is simultaneously less than z is then F_Y(z − x)P_X(x)dx. The probability that Z_μ is less than z independently of the value of X_{λ(1)} is thus
  F_Z(z) = ∫_{−∞}^{∞} F_Y(z − x) P_X(x) dx ,
and differentiating with respect to z,
  P_Z(z) = ∫_{−∞}^{∞} P_Y(z − x) P_X(x) dx .
We thus deduce that the probability density function of the sum variable is obtained by convolving the probability density functions of each of the random variables in the sum. Note, however, that this result is no longer true if the summed variables are not independent.
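The convolution result is easy to check numerically; with two independent uniform variables on [0, 1] (an arbitrary choice), the convolution of the two rectangular densities is the triangle density on [0, 2], which the histogram of sums should match.

```python
import numpy as np

rng = np.random.default_rng(4)

# Sum of two independent uniform [0, 1] variables: the pdf of the sum is
# the convolution of two rectangles, i.e. the triangle density on [0, 2].
z = rng.random(200_000) + rng.random(200_000)
hist, edges = np.histogram(z, bins=40, range=(0.0, 2.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
triangle = np.where(centers < 1.0, centers, 2.0 - centers)
max_err = np.abs(hist - triangle).max()
```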
We now introduce the characteristic function Ψ_X(ν) associated with the probability density function:
  Ψ_X(ν) = ∫_{−∞}^{∞} P_X(x) e^{iνx} dx .
For small ν, we can write
  Ψ_X(ν) = 1 + iν⟨X_λ⟩ − (ν²/2)⟨X_λ²⟩ + o(ν²) ,
where o(ν²) tends to 0 more quickly than ν² when ν tends to 0. This is a consequence of the expansion
  exp(iνx) = Σ_{n=0}^{∞} (iνx)ⁿ/n! .
The most common cases are summarized below, giving the characteristic function Ψ_X(ν) for each probability density function:

  Uniform on [0, 1]:  P_X(x) = 1 if 0 ≤ x ≤ 1, and 0 otherwise;
                      Ψ_X(ν) = e^{iν/2} sin(ν/2) / (ν/2) .
  Gaussian:           P_X(x) = [1/(√(2π) σ)] exp[−(x − m)²/(2σ²)] ;
                      Ψ_X(ν) = e^{imν − σ²ν²/2} .
  Exponential:        P_X(x) = β e^{−βx} if x ≥ 0, and 0 otherwise;
                      Ψ_X(ν) = (1 − iν/β)^{−1} .
  Gamma:              P_X(x) = β^α x^{α−1} e^{−βx} / Γ(α) if x ≥ 0, and 0 otherwise;
                      Ψ_X(ν) = (1 − iν/β)^{−α} .
1 When there is no risk of ambiguity, we will use this abbreviated manner of speak-
ing to indicate that a random variable has an exponential probability density
function.
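Any such pairing can be verified by a Monte-Carlo estimate of ⟨exp(iνX_λ)⟩; here the exponential case, with arbitrary values of β and ν:

```python
import numpy as np

rng = np.random.default_rng(5)

# Monte-Carlo check of the exponential case: for the density
# beta*exp(-beta*x) (x >= 0), Psi_X(nu) = <exp(i*nu*X)> = 1/(1 - i*nu/beta).
beta, nu = 2.0, 1.5
x = rng.exponential(scale=1.0 / beta, size=1_000_000)

psi_mc = np.mean(np.exp(1j * nu * x))
psi_th = 1.0 / (1.0 - 1j * nu / beta)
err = abs(psi_mc - psi_th)
```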
We now consider a sequence of random variables X_{λ(1)}, X_{λ(2)}, ..., X_{λ(n)} which we shall assume to be independent with finite mean and second moment. The mean and variance of X_{λ(n)} are m_n and σ_n², respectively. We define the sum random variable
  S_λ(n) = Σ_{j=1}^{n} X_{λ(j)} ,
where λ = [λ(1), λ(2), ..., λ(n)]. We have seen that the mean of S_λ(n) is M_n = Σ_{j=1}^{n} m_j, and we write V_n² = Σ_{j=1}^{n} σ_j² for its variance. Moreover, suppose that for any j between 1 and n, σ_j²/V_n² tends to 0 as n tends to infinity. The condition that each random variable has finite second moment tells us that it is not too widely scattered about its mean. The condition that, for any j between 1 and n, σ_j²/V_n² tends to 0 as n tends to infinity ensures that the fluctuations of one random variable do not dominate the others.
The central limit theorem tells us that the random variable Z_λ(n) = [S_λ(n) − M_n]/V_n converges in law toward a reduced Gaussian random variable (i.e., with zero mean and unit variance). Convergence in law toward a reduced Gaussian distribution means that, as n tends to infinity, the distribution function F_{Z(n)}(z) of Z_λ(n) tends pointwise to the distribution function F_RG(z) of a reduced Gaussian law:
  lim_{n→∞} F_{Z(n)}(z) = F_RG(z) = ∫_{−∞}^{z} (1/√(2π)) e^{−ξ²/2} dξ .
The result is shown using the characteristic functions of each random variable. The proof is simpler if we introduce the centered variables Y_{λ(j)} = X_{λ(j)} − m_j and define U_λ(n) = Σ_{j=1}^{n} Y_{λ(j)}. We then note that Z_λ(n) = U_λ(n)/V_n. Let Ψ_{Y,j}(ν), Ψ_{U,n}(ν) and Ψ_{Z,n}(ν) be the characteristic functions of Y_{λ(j)}, U_λ(n) and Z_λ(n), respectively. Then we have
  Ψ_{Z,n}(ν) = Ψ_{U,n}(ν/V_n) = Π_{j=1}^{n} Ψ_{Y,j}(ν/V_n) ,
or
  ln[Ψ_{Z,n}(ν)] = Σ_{j=1}^{n} [ −σ_j²ν²/(2V_n²) + o(σ_j²ν²/V_n²) ] ,
or again,
  ln[Ψ_{Z,n}(ν)] = −ν²/2 + ε_n(ν) ,
where ε_n(ν) tends to 0 as n tends to infinity. We thus obtain
  lim_{n→∞} Ψ_{Z,n}(ν) = exp(−ν²/2) ,
and hence,
  lim_{n→∞} F_{Z,n}(z) = ∫_{−∞}^{z} (1/√(2π)) exp(−ξ²/2) dξ ,
which is the distribution function of a reduced Gaussian law. In other words, for large n the sample mean S_λ(n)/n behaves like a Gaussian random variable with mean lim_{n→∞} M_n/n, i.e., with mean equal to lim_{n→∞} Σ_{j=1}^{n} m_j/n, and standard deviation lim_{n→∞} V_n/n, i.e., with variance equal to lim_{n→∞} Σ_{j=1}^{n} σ_j²/n².
Note also that, if the random variables are distributed according to the same probability density function with mean m and variance σ², then Σ_{j=1}^{n} X_{λ(j)}/n tends toward a Gaussian random variable with mean m and variance σ²/n.
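This convergence is easy to observe numerically (the uniform law and the value n = 50 are arbitrary choices):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)

# Standardized sum of n i.i.d. uniform variables (mean 1/2, variance 1/12):
# by the central limit theorem it is close to a reduced Gaussian.
n, trials = 50, 100_000
s = rng.random((trials, n)).sum(axis=1)
z = (s - n * 0.5) / np.sqrt(n / 12.0)

# Compare the empirical distribution function with F_RG at z = 1.
f_emp = np.mean(z <= 1.0)
f_rg = 0.5 * (1.0 + erf(1.0 / sqrt(2.0)))
```

Even for a flat, distinctly non-Gaussian starting law, fifty terms are enough for the distribution function of the standardized sum to match the Gaussian one to well below a percent.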
We shall see that the central limit theorem is a very important result
because it can help us to understand many physical phenomena. It proves the
existence of a universal type of behavior which arises whenever we are dealing
with a sum of independent random variables with comparable fluctuations
(the latter being characterized by their variance). At this level, it is difficult
to obtain a simple interpretation for this result, which raises at least two
questions:
Why is there a unique law?
Why is this law Gaussian, or normal, as it is sometimes called?
We shall see in Chapter 5 that arguments based on information theory will
help us to elucidate this problem.
The theorem can be extended in various ways to independent real stochastic vectors. Consider a sequence of real N-component stochastic vectors X_{λ(1)}, X_{λ(2)}, ..., X_{λ(n)}, with identical distribution. Let m = ⟨X_{λ(i)}⟩ denote the mean vector and define the covariance matrix Γ by
  Γ = ⟨(X_{λ(i)} − m)(X_{λ(i)} − m)ᵀ⟩ .
The characteristic function of the vectors is
  Ψ_X(ν) = ∫_{−∞}^{+∞} P_X(x) exp(iνᵀx) dx ,
and for the centered vectors Y_{λ(i)} = X_{λ(i)} − m,
  Ψ_Y(ν) = ∫_{−∞}^{+∞} P_Y(y) exp(iνᵀy) dy .
Then setting
  Z_λ(n) = (1/√n) Σ_{j=1}^{n} Y_{λ(j)} ,
we obtain
  Ψ_{Z,n}(ν) = [Ψ_Y(ν/√n)]ⁿ .
Now,
  Ψ_Y(ν/√n) = ∫_{−∞}^{∞} P_Y(y) { 1 + (i/√n) νᵀy − (1/2n)(νᵀy)² + o[(νᵀy/√n)²] } dy .
We have ∫_{−∞}^{∞} P_Y(y) dy = 1, and
  ∫_{−∞}^{∞} P_Y(y) νᵀy dy = νᵀ ∫_{−∞}^{∞} P_Y(y) y dy = 0 ,
since the vectors Y_{λ(j)} are centered, and
  ∫_{−∞}^{∞} P_Y(y) (1/2n)(νᵀy)² dy = (1/2n) ∫_{−∞}^{∞} P_Y(y) νᵀ[y yᵀ]ν dy = (1/2n) νᵀΓν ,
where o[(νᵀy/√n)²] is a scalar tending to zero faster than 1/n. Since Ψ_{Z,n}(ν) = [Ψ_Y(ν/√n)]ⁿ, we obtain
  Ψ_{Z,n}(ν) = [ 1 − (1/2n) νᵀΓν + o(1/n) ]ⁿ ,
and hence,
  lim_{n→∞} Ψ_{Z,n}(ν) = exp( −½ νᵀΓν ) .
We recognize the characteristic function of a probability density of Gaussian stochastic vectors, with zero mean and covariance matrix Γ, viz.,
  P_Z(z) = (2π)^{−N/2} [det Γ]^{−1/2} exp( −½ zᵀΓ⁻¹z ) .
The situation is quite different when the second moment does not exist. Consider, for example, Cauchy random variables X_{λ(j)}, each with characteristic function exp(−a|ν|). The sample mean S_λ(n) = Σ_{j=1}^{n} X_{λ(j)}/n, where λ = [λ(1), λ(2), ..., λ(n)], will have characteristic function [exp(−a|ν|/n)]ⁿ = exp(−a|ν|). Then S_λ(n) will be a Cauchy variable with probability density function P_{S(n)}(s) = a/[π(a² + s²)], that is, with the same parameter as the summed variables. The mean S_λ(n) therefore fluctuates as much as each of the summed random variables.
When we add together two identically distributed and independent Gaussian variables X_{λ(1)} and X_{λ(2)}, the sum is still Gaussian. We thus say that the Gaussian law is stable. More precisely, a probability density function represents a stable probability law if, when we add two independent variables X_{λ(1)} and X_{λ(2)} identically distributed according to P_X(x), there exist two numbers a and b such that [X_{λ(1)} + X_{λ(2)}]/a + b is distributed according to P_X(x). We have just seen that the Cauchy probability law is stable. The central limit theorem guarantees that if ⟨(X_λ)^r⟩ converges for any value of r greater than 2, the only stable probability laws are Gaussian. However, if ⟨(X_λ)^r⟩ does not converge for any value of r greater than α, where α is strictly less than 2, there may be other stable laws. This is the case for the Cauchy distribution, for which α = 1.
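The contrast between the stable Cauchy case and the Gaussian case can be seen directly in simulation (the sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)

# Sample means of n Cauchy variables keep the full spread of a single
# variable, while Gaussian sample means concentrate as 1/sqrt(n).
n, trials = 1000, 5000
cauchy_means = rng.standard_cauchy((trials, n)).mean(axis=1)
gauss_means = rng.normal(0.0, 1.0, (trials, n)).mean(axis=1)

# Interquartile ranges: about 2 for a standard Cauchy (quartiles at -1, +1),
# against about 1.35/sqrt(n) for the Gaussian sample mean.
iqr_cauchy = np.subtract(*np.percentile(cauchy_means, [75, 25]))
iqr_gauss = np.subtract(*np.percentile(gauss_means, [75, 25]))
```

The interquartile range is used rather than the standard deviation because the Cauchy law has no finite second moment.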
The study of stable distributions is very interesting but goes somewhat
beyond the scope of this book. However, if the noise is described by a random
variable which is not too scattered, that is, which has finite second moment,
then the Gaussian hypothesis may be acceptable, provided that a large number
of independent phenomena add together to make up the noise. On the other
hand, if the phenomena causing the fluctuations undergo large deviations
which prevent us from defining a finite second moment, then we are compelled
to reject the Gaussian noise hypothesis.
Suppose that a monochromatic wave of amplitude
  A(t) = (A₀/|S|) exp(iωt)
Fig. 4.1. Image of an agricultural area in the Caucasus acquired using the synthetic
aperture radar (SAR) aboard the European Remote Sensing satellite ERS 1. The
image shows characteristic speckle noise. (Image provided by the CNES, courtesy of
the ESA)
illuminates a rough surface S (see Fig. 4.2) with constant reflectivity. The surface S corresponds to the illuminated surface which forms the wave producing the field measured in a given pixel.
The factor |S|, which represents the measure of the surface S, is introduced for reasons of homogeneity with regard to physical units, as we shall see shortly. Other conventions could have been chosen, but this one is perhaps the simplest. At the detector, the amplitude of the field can be written
4.5 A Simple Model of Speckle 83
  A_R(t) = ∫∫_S (A₀/|S|) ρ exp[iω(t − t_{x,y})] dx dy ,
where ρ is the square root of the reflection coefficient and t_{x,y} describes the retardation of the ray leaving the emitter and arriving at the point with coordinates (x, y) on the surface before converging on the detector. ωt_{x,y} is thus a phase term φ_{x,y}, which can be chosen to lie between 0 and 2π. To be precise, we set φ_{x,y} = ωt_{x,y} − 2πn, where n is a natural number chosen so that φ_{x,y} lies between 0 and 2π. If the depth fluctuations on the surface are large, we may expect a significant spread of values for ωt_{x,y} relative to 2π, so that φ_{x,y} is likely to be well described by a random variable uniformly distributed between 0 and 2π. We thus write
  A_R(t) = ∫∫_S (A₀/|S|) ρ exp(iωt − iφ_{x,y}) dx dy = A₀ ρ Z_λ exp(iωt) ,
where
  Z_λ = (1/|S|) ∫∫_S exp(−iφ_{x,y}) dx dy .
Note that A_R(t) is a random variable but that its dependence on λ is not mentioned, to simplify the notation. Speckle thus amounts to multiplying the reflected amplitude A₀ρ by Z_λ. It should be pointed out that the model is multiplicative because we assumed that the reflectivity ρ is constant, i.e., independent of x and y. If this were not so, the model would not necessarily be simply multiplicative. It is therefore important to specify the model precisely.
To proceed with this calculation, we now make a simplifying hypothesis, namely that we may cut the surface S up into N parts, each of which introduces an independent phase difference φ_j, for j = 1, ..., N. We can then write
  Z_λ = (1/N) Σ_{j=1}^{N} exp(−iφ_j) .
Decomposing Z_λ into real and imaginary parts Z_λ = X_λ + iY_λ, we find that X_λ and Y_λ are sums of independent random variables with finite second moment. In fact, we shall only show that X_λ and Y_λ are uncorrelated and that they have the same variance, and we shall then assume that they are independent. We have
  X_λ Y_λ = −(1/N²) Σ_{i=1}^{N} sin φ_i Σ_{j=1}^{N} cos φ_j = −(1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} sin φ_i cos φ_j .
If we assume that the φ_j are uniformly distributed between 0 and 2π, we obtain
  ⟨X_λ Y_λ⟩ = −(1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} ⟨sin φ_i cos φ_j⟩
and
  ⟨sin φ_i cos φ_j⟩ = 0 .
Hence, finally, ⟨X_λ Y_λ⟩ = 0, which shows that the variables X_λ and Y_λ are uncorrelated. In the same way, it can be shown that
  ⟨(X_λ)²⟩ = (1/N²) Σ_{j=1}^{N} ⟨cos² φ_j⟩  and  ⟨(Y_λ)²⟩ = (1/N²) Σ_{j=1}^{N} ⟨sin² φ_j⟩ .
Now ⟨cos² φ_j⟩ = ⟨sin² φ_j⟩ since the φ_j are uniformly distributed between 0 and 2π, so that ⟨(X_λ)²⟩ = ⟨(Y_λ)²⟩. This shows that the variables X_λ and Y_λ have the same variance.
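The whole construction can be simulated (the number of scatterers and of trials are arbitrary choices): summing unit phasors with uniform random phases yields uncorrelated quadratures of equal variance, and an intensity whose fluctuations are those of an exponential law.

```python
import numpy as np

rng = np.random.default_rng(8)

# Fully developed speckle: Z = (1/N) sum_j exp(-i*phi_j), uniform phases.
n_scat, trials = 100, 20_000
phi = rng.uniform(0.0, 2 * np.pi, (trials, n_scat))
z = np.exp(-1j * phi).mean(axis=1)

corr = np.mean(z.real * z.imag)        # X and Y are uncorrelated
intensity = np.abs(z) ** 2             # mean intensity 1/N

# For an exponential law the standard deviation equals the mean.
ratio = intensity.std() / intensity.mean()
```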
When several elementary scatterers contribute, i.e., when N is large, X_λ and Y_λ are therefore Gaussian random variables, with zero mean and the same variance 1/(2N).
Fig. 4.3. Probability density function for the intensity of a scalar monochromatic
wave with mean value 1, in the presence of fully developed speckle for a single look
In other words, the multiplicative noise B_λ will be a Gamma variable of order n and mean 1.
For electromagnetic waves, and in particular, light, the wave is not scalar
but vectorial, since it can be polarized (see Section 3.14). The above calcula-
tion is only valid for one of the two polarization components. In the general
case, we have
AR(t) = Ali (t)ex + Ak (t)e y .
We deduce that IR = IAR(t)12 = IAIi(t)12+IAk(t)12. Suppose that the surface
is illuminated by a completely unpolarized wave and that the field AR(t) is
86 4 Limit Theorems and Fluctuations
1 X 12 )=(Ao(t)
10x =(Ao(t) 1 y 12 )=10y and
Hence, for a single measurement, IAR(t)12 is the sum of two random variables
with probability density function
2 (21 )
10lpl2 exp - 10lpl2 .
Since the two intensities IAif (t)12 and IA6 (tW are independent, the measured
quantity is the sum of two independent measurements and we have
Fig. 4.4. Probability density function for the intensity of an unpolarized monochro-
matic wave with mean value equal to unity, in the presence of fully developed speckle
for a single look
Let us return to the case of a scalar wave or, equivalently, consider one
component of the electric field. The model presented above describes the prob-
ability density function for a given reflectivity. To be precise, the random
event λ is associated with different surfaces of the same reflectivity but different roughness. In other words, experimentally, we would obtain an intensity
histogram for the field at the detector that would be close to an exponential
probability density function by calculating an ensemble average over several
realizations of surfaces with the same statistical characteristics. If we are in-
terested in an image such as the one in Fig. 4.1, the intensity histogram for the
field at the detector due to a homogeneous region, i.e., a single cultivated field,
will be close to this exponential probability density function if the reflectivity
is constant over the whole region. Each pixel for a homogeneous region must
therefore correspond to a region of the ground with the same reflectivity if the
histogram of gray levels of pixels in this region is to correspond to an expo-
nential probability density. We have represented this situation schematically
in Fig. 4.5.
If the ground reflectivity is not constant, the gray level histogram for a
homogeneous region will no longer correspond to an exponential distribution.
We have represented this situation schematically in Fig. 4.6.
The random experiment we are referring to in order to define a probability
density function now results from a combination of two phenomena. The first
relates to speckle and results from interference as discussed above. Let $X_\lambda$ be
the random variable representing the value of the intensity for unit reflectivity.
$X_\lambda$ is thus distributed according to an exponential law $P_X(x)$ with mean $I_0$,
which we shall assume to be 1. The second random phenomenon relates to the
value of the reflectivity of the zone projected onto a pixel in the image. Let $R_\mu$
be the random variable which represents the value of this reflectivity, which
Fig. 4.6. Variable reflectivity model which no longer leads to an exponential prob-
ability density function for the intensity in the presence of speckle for a single look
For a given value of $R$, the probability density function for $I(\lambda,\mu)$ is
$$P_I(I) = \int_0^{+\infty} \frac{1}{R}\exp\left(-\frac{I}{R}\right)\frac{(\nu/\beta)^{\nu} R^{\nu-1}}{\Gamma(\nu)}\exp\left(-\frac{\nu R}{\beta}\right) dR .$$
There is no simpler explicit form for this law, which is known as the K dis-
tribution with parameter $\nu$. A typical example is illustrated in Fig. 4.7 for a
scalar monochromatic wave of mean 1 in the presence of speckle for a single
look, described by a K distribution with parameter $\nu = 1$.
Fig. 4.7. Probability density function (continuous curve) for a K distribution with
parameter 1/ = 1. The scale on the ordinate is logarithmic. The dotted curve shows
an exponential distribution with the same mean
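The K distribution is easy to sample numerically. The sketch below (parameter values, sample size and seed are illustrative assumptions, not taken from the text) draws a Gamma-distributed reflectivity of mean $\beta$ and shape $\nu$, then an exponential intensity conditioned on it, and checks the first two moments of the compound law.

```python
import random

# Sampling sketch of the K distribution: R ~ Gamma(shape nu, mean beta),
# then I given R is exponential with mean R.
random.seed(1)
nu, beta = 1.0, 1.0
n = 200_000

def sample_k_intensity():
    r = random.gammavariate(nu, beta / nu)  # reflectivity, mean beta
    return random.expovariate(1.0 / r)      # speckle intensity given R

xs = [sample_k_intensity() for _ in range(n)]
m1 = sum(xs) / n
m2 = sum(x * x for x in xs) / n
# For this compound law: <I> = beta and <I^2> = 2*beta^2*(1 + 1/nu),
# i.e. 1 and 4 for nu = beta = 1.
print(m1, m2)
```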
$P(r + dr, t + \tau \mid r, t)$ .
This random walk has no memory since the displacement at a given time t
is independent of the positions occupied by the particle before it reached the
point r at time t.
$$P(r + dr, t + \tau \mid r, t) = W(dr, \tau) .$$
$$R_{n,\lambda} = R_0 + \sum_{j=1}^{n} \delta_{\lambda_j} ,$$
where $\lambda = (\lambda_1, \lambda_2, \dots, \lambda_n)$ and $\delta_{\lambda_j}$ represents the step of the walk made at
time $t = j\tau$. $\delta_{\lambda_j}$ is therefore a realization of a stochastic vector distributed
according to $W(\delta, \tau)$. $R_{n,\lambda}$, $R_0$ and $\delta_{\lambda_j}$ are vectors in the 3-dimensional
space. We project this equation onto the $Ox$ axis to give
$$X_{n,\lambda} = X_0 + \sum_{j=1}^{n} \delta x_{\lambda_j} .$$
$$P_{X,t}(x) = \frac{1}{\chi_x\sqrt{2\pi t}}\exp\left[-\frac{(x - v_x t)^2}{2\chi_x^2\, t}\right] .$$
Fig. 4.9. Evolution of the probability density function for a 1-dimensional random
walk
that the probability density function for $R_{n,\lambda}$ is simply the product of these
marginal distributions with respect to each component:
$$P_{R,t}(x,y,z) = \frac{1}{\chi_x \chi_y \chi_z}\left(\frac{1}{\sqrt{2\pi t}}\right)^{3} \exp\left[-\frac{Q(x,y,z)}{2t}\right] ,$$
where
$$Q(x,y,z) = \frac{(x - v_x t)^2}{\chi_x^2} + \frac{(y - v_y t)^2}{\chi_y^2} + \frac{(z - v_z t)^2}{\chi_z^2} .$$
Figure 4.9 shows how the probability density evolves in the case of a 1-
dimensional random walk.
In the case of isotropic diffusion, for which $\chi_x = \chi_y = \chi_z = \chi$ (only the
diffusion part is assumed to be isotropic, for we may have $v_x \neq v_y \neq v_z$), we
obtain simply
$$P_{R,t}(x,y,z) = \left(\frac{1}{\chi\sqrt{2\pi t}}\right)^{3} \exp\left[-\frac{A(x,y,z)}{2\chi^2 t}\right] ,$$
with
$$A(x,y,z) = (x - v_x t)^2 + (y - v_y t)^2 + (z - v_z t)^2 .$$
Putting
$$\boldsymbol m = (m_1, m_2, \dots, m_D)^T , \quad \boldsymbol v = (v_1, v_2, \dots, v_D)^T , \quad \|\boldsymbol R\|^2 = \sum_{j=1}^{D} r_j^2 ,$$
we then observe the fundamental results for isotropic diffusion with finite
second moment:
$$\left\langle \|\boldsymbol R_{n,\lambda} - \boldsymbol v t\|^2 \right\rangle = \chi^2 D\, t ,$$
or
$$\sqrt{\left\langle \|\boldsymbol R_{n,\lambda} - \boldsymbol v t\|^2 \right\rangle} = \chi\sqrt{Dt} .$$
Before concluding this section, it is worth noting that speckle noise cor-
responds to the special case of an isotropic random walk without drift (i.e.,
with v = 0) in two dimensions.
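The $\sqrt{Dt}$ behaviour discussed above can be checked with a short simulation; the Gaussian step law, the drift and all parameter values below are illustrative assumptions rather than anything prescribed by the text.

```python
import random

# Isotropic D-dimensional walk with Gaussian steps of standard deviation chi
# and drift v per unit time step (tau = 1).  We measure the mean square
# deviation about the drift after n_steps steps.
random.seed(2)
chi, v, dims, n_steps, n_walks = 1.0, 0.5, 2, 100, 4000
total = 0.0
for _ in range(n_walks):
    pos = [0.0] * dims
    for _ in range(n_steps):
        for d in range(dims):
            pos[d] += random.gauss(v, chi)
    total += sum((pos[d] - v * n_steps) ** 2 for d in range(dims))
msd = total / n_walks
# Expected value: chi^2 * D * t = 1 * 2 * 100 = 200
print(msd)
```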
Let $P(x,y,z,t)$ be the probability density function for finding the
particle at the point with coordinates $(x,y,z)$ at time $t$ and $W(\delta_x, \delta_y, \delta_z, \tau)$
the probability density function for observing a displacement of amplitude
$(\delta_x, \delta_y, \delta_z)^T$ during the time interval $\tau$. We will assume that this displace-
ment results from a great many small, independent, random displacements
distributed according to stationary laws with finite second moments and in-
dependent along each of the three axes. $W(\delta_x, \delta_y, \delta_z, \tau)$ is a transition rate
density and we may thus write
$$P(x', y', z', t+\tau) = \iiint P(x', y', z', t+\tau \mid x, y, z, t)\, P(x, y, z, t)\, dx\, dy\, dz .$$
We have assumed that the transition rate density $W(\delta_x, \delta_y, \delta_z, \tau)$ results from
a great many small, independent, random displacements distributed according
to stationary laws with finite second moments and independent along each of
the three axes. We therefore choose to denote the various moments as follows:
$$\iiint \delta_x\, W(\delta_x, \delta_y, \delta_z, \tau)\, d\delta_x\, d\delta_y\, d\delta_z = v_x \tau$$
and
$$\iiint (\delta_x - v_x\tau)^2\, W(\delta_x, \delta_y, \delta_z, \tau)\, d\delta_x\, d\delta_y\, d\delta_z = \chi_x^2\, \tau ,$$
and likewise for the other coordinates. Integrating both sides of the truncated
expansion with respect to $\delta_x$, $\delta_y$ and $\delta_z$,
we deduce that
$$\begin{aligned}
\Delta(x,y,z,t+\tau,t) \approx{}& -\frac{\partial}{\partial x} P(x,y,z,t)\, v_x\tau - \frac{\partial}{\partial y} P(x,y,z,t)\, v_y\tau - \frac{\partial}{\partial z} P(x,y,z,t)\, v_z\tau \\
&+ \frac{1}{2}\frac{\partial^2}{\partial x^2} P(x,y,z,t)\, \chi_x^2\tau + \frac{1}{2}\frac{\partial^2}{\partial y^2} P(x,y,z,t)\, \chi_y^2\tau + \frac{1}{2}\frac{\partial^2}{\partial z^2} P(x,y,z,t)\, \chi_z^2\tau \\
&+ \text{terms in } \tau^2 ,
\end{aligned}$$
where $\Delta(x,y,z,t+\tau,t) = P(x,y,z,t+\tau) - P(x,y,z,t)$. Moreover,
$$P(x,y,z,t+\tau) - P(x,y,z,t) \approx \frac{\partial}{\partial t} P(x,y,z,t)\, \tau .$$
4.7 Application to Diffusion
$$\begin{aligned}
\frac{\partial}{\partial t} P(x,y,z,t) ={}& -v_x \frac{\partial}{\partial x} P(x,y,z,t) - v_y \frac{\partial}{\partial y} P(x,y,z,t) - v_z \frac{\partial}{\partial z} P(x,y,z,t) \\
&+ \frac{\chi_x^2}{2}\frac{\partial^2}{\partial x^2} P(x,y,z,t) + \frac{\chi_y^2}{2}\frac{\partial^2}{\partial y^2} P(x,y,z,t) + \frac{\chi_z^2}{2}\frac{\partial^2}{\partial z^2} P(x,y,z,t) .
\end{aligned}$$
The two groups of terms in this equation correspond to two different physical
phenomena. The first, containing the first-order derivatives, describes transport
at the drift velocity $(v_x, v_y, v_z)$; the second, containing the second-order
derivatives, describes diffusion, which in the isotropic case reads
$$\frac{\partial P(x,y,z,t)}{\partial t} = \frac{\chi^2}{2}\, \Delta P(x,y,z,t) ,$$
where $\Delta = \partial^2/\partial x^2 + \partial^2/\partial y^2 + \partial^2/\partial z^2$ is the Laplacian operator. The
underlying microscopic phenomena enter only through the appearance of
$(v_x, v_y, v_z)$ and $(\chi_x, \chi_y, \chi_z)$. It is important to grasp
the phenomenological nature of this equation. Indeed, if the diffusion equation
is applied to periods of time that are too short or to distances that are too large,
certain paradoxes arise. For suppose that, at time $t = 0$, the molecule concentration
is represented by a Dirac distribution $\delta(x,y,z)$ centered at the origin. The
solution of the diffusion equation shows us that, for any positive time $\varepsilon$, no
matter how small, there is a nonzero probability of finding molecules at a dis-
tance $L$ as large as we wish. This equation is thus compatible with molecular
displacements at speeds faster than the speed of light. The diffusion equation
must therefore be applied only within its field of validity. It should thus be
remembered that, in order to justify taking only first and second order terms
in our expansion, we assumed that the time step was not too small and the
displacements not too large.
Fig. 4.11. Green function for the diffusion equation in the 1-dimensional case
We saw in Section 3.12 that, given the Green function for a partial dif-
ferential equation, it is easy to determine the solution for arbitrary initial
conditions in an infinite medium. For the case of an infinite, homogeneous
and isotropic medium without drift, the Green function of the diffusion equa-
tion (see Fig. 4.11) can be written
$$G(\boldsymbol r, t \mid \boldsymbol r', t') = \left[\frac{1}{\chi\sqrt{2\pi(t-t')}}\right]^{3} \exp\left[-\frac{\|\boldsymbol r - \boldsymbol r'\|^2}{2\chi^2(t-t')}\right] ,$$
where we assume that $t > t'$ and $\|\boldsymbol r - \boldsymbol r'\|^2 = (x - x')^2 + (y - y')^2 + (z - z')^2$.
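As a numerical cross-check (the grid, time step and diffusion constant below are illustrative choices, selected only so that the explicit scheme is stable), one can evolve the 1-dimensional diffusion equation from a narrow spike and compare the result with the corresponding Gaussian Green function.

```python
import math

# Explicit finite-difference sketch of dP/dt = (chi^2/2) d2P/dx2,
# started from a discrete approximation of delta(x).
chi = 1.0
dx, dt = 0.1, 0.004          # dt < dx^2 / chi^2 ensures stability
nx, nt = 401, 500
p = [0.0] * nx
p[nx // 2] = 1.0 / dx        # unit mass concentrated in one cell
r = 0.5 * chi * chi * dt / (dx * dx)
for _ in range(nt):
    p = [p[i] + r * (p[i - 1] - 2 * p[i] + p[i + 1]) if 0 < i < nx - 1 else 0.0
         for i in range(nx)]
t = nt * dt
green = [math.exp(-((i - nx // 2) * dx) ** 2 / (2 * chi * chi * t))
         / (chi * math.sqrt(2 * math.pi * t)) for i in range(nx)]
err = max(abs(a - b) for a, b in zip(p, green))
print(t, err)
```

The numerical solution and the Gaussian agree closely, and the total mass stays equal to 1, as diffusion conserves probability.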
4.8 Random Walks and Space Dimensions
Just as for the random walk, we can immediately generalize the last result to
the case of diffusion in a space of dimension D. We consider the homogeneous
and isotropic case without drift. Putting $\boldsymbol R = (r_1, r_2, \dots, r_D)^T$, we obtain
$$\frac{\partial P(\boldsymbol R, t)}{\partial t} = \frac{\chi^2}{2}\, \Delta P(\boldsymbol R, t) ,$$
with
$$\Delta = \sum_{j=1}^{D} \frac{\partial^2}{\partial r_j^2} .$$
The probability density function $P(\boldsymbol R, t)$ for finding a particle at the point
with coordinates $\boldsymbol R = (r_1, r_2, \dots, r_D)^T$ at time $t$, given the initial condition
$P(\boldsymbol R, 0) = \delta(\boldsymbol R)$, is
$$P(\boldsymbol R, t) = \left(\frac{1}{\chi\sqrt{2\pi t}}\right)^{D} \exp\left(-\frac{\|\boldsymbol R\|^2}{2\chi^2 t}\right) ,$$
where
$$\|\boldsymbol R\|^2 = \sum_{j=1}^{D} (r_j)^2 .$$
This is the approximation obtained at the end of the section on random walks.
It leads to the same result as the Green function of the diffusion equation.
There is nothing surprising about this, since the underlying physical model
chosen to describe diffusion was indeed the random walk model. We will re-
consider this approximation below.
Let us analyze a random walk experiment starting out from the origin
and in which we take a snapshot of the particle in its current position every
dt seconds, where the time interval dt is chosen arbitrarily (see Fig. 4.12).
Imagine now that we repeat this experiment N times, where N is some large
number. We therefore expect the probability Pn of finding the particle in the
neighborhood dV (assumed small) of the origin on one of the N photos at a
time $n\,dt$ to be approximately
$$P_n \approx dV \left(\frac{1}{\sqrt{2\pi\chi^2\, n\, dt}}\right)^{D} .$$
Fig. 4.12. Random walk experiment starting at the origin and taking photos of the
particle position at time intervals dt
Summing over all the photos taken at times $j\,dt \geq t_0$ (where $t_0 = n\,dt$), the
number of times $N(t_0)$ the particle is found in $dV$ is
$$N(t_0) = \sum_{j=n}^{\infty} N P_j \approx \frac{N\, dV}{dt} \int_{t_0}^{\infty} \left(\frac{1}{\sqrt{2\pi\chi^2 t}}\right)^{D} dt ;$$
we then obtain
$$N(t_0) \approx \frac{N\, dV}{dt \left(\sqrt{2\pi\chi^2}\right)^{D}} \int_{t_0}^{\infty} t^{-D/2}\, dt .$$
Setting
$$I_D(t_0) = \int_{t_0}^{\infty} t^{-D/2}\, dt ,$$
we see finally that there are two cases:
if $D \leq 2$, then $I_D(t_0)$ diverges, i.e., $I_D(t_0)$ tends to infinity,
if $D > 2$, then $I_D(t_0)$ converges and we have $\lim_{t_0\to\infty} N(t_0) = 0$.
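The dichotomy between $D \leq 2$ and $D > 2$ can be seen numerically from partial sums approximating $I_D(t_0)$; the truncation points used below are arbitrary illustrative choices.

```python
# Partial sums approximating I_D(t0) = integral of t^(-D/2) from t0 to
# infinity: they keep growing for D <= 2 but saturate for D > 2.
def partial_sum(d, n_terms, t0=1.0):
    return sum((t0 + k) ** (-d / 2.0) for k in range(n_terms))

s2_small, s2_large = partial_sum(2, 10_000), partial_sum(2, 1_000_000)
s3_small, s3_large = partial_sum(3, 10_000), partial_sum(3, 1_000_000)
# D = 2: harmonic-like growth (logarithmic divergence);
# D = 3: the extra terms beyond 10_000 change almost nothing.
print(s2_small, s2_large, s3_small, s3_large)
```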
We therefore conclude that the number of times $N(t_0)$ that the particle returns
to the origin for times $t$ greater than or equal to $t_0$ diverges if $D \leq 2$ and
converges if $D > 2$. So this feature depends on the dimension of the space
in which the random walk takes place. Consequently, we expect qualitatively
and quantitatively different results if $D \leq 2$ or $D > 2$. It can indeed be shown
more rigorously that, for a random walk on a discrete lattice in a space of
dimension $D$, the probability that the particle returns to the origin is equal
to 1 if $D \leq 2$ and is strictly less than 1 if $D > 2$.
Fig. 4.13. A mass m is placed at each step of a random walk and the total mass
Mn then varies as the square of the characteristic size Ln of the cluster
The fact that the dimension D = 2 plays a special role can also be un-
derstood in a slightly different way. Consider a homogeneous and isotropic
random walk in discrete time steps with no drift. Suppose that a mass m is
placed at each site occupied during the random walk. Suppose also that suc-
cessively occupied sites are linked by a line segment. After n steps, the mass
Mn of the chain formed in this way is therefore nm and its characteristic
length is
$$L_n \approx \sqrt{\left\langle \|\boldsymbol R_{n,\lambda}\|^2 \right\rangle} = a\sqrt{n} ,$$
where $a$ is a constant (see Fig. 4.13). We thus obtain
$$M_n = \frac{m}{a^2}\,(L_n)^2 .$$
In a space of dimension $D$, the mass of a homogeneous cube of density $\rho$ and
side $L$ is $\rho L^D$. For the random walk, setting $\rho = m/a^2$, we have $M_n = \rho (L_n)^2$.
From this standpoint, we can thus say that the intrinsic dimension of the
random walk is 2, regardless of the dimension of the space in which it takes
place (if D > 2).
We may thus put forward the following picture of the results we have just
described. If the dimension of the space is strictly greater than the intrinsic
dimension of the random walk, then this random walk intersects itself much
less often than in the other case.
We may now determine the probability PT (n) that n particles are detected
in a time interval of length T. Indeed, we divide this interval into N segments
of very short length dt, so that T = N dt. The number of particles detected in
the interval of duration T is the sum of the Bernoulli variables which describe
the number of particles detected in each time interval of length dt. In the limit
as $dt$ tends to zero, the characteristic function of $P_T(n)$ is therefore
$$\Phi_T(\nu) = \lim_{N\to\infty} \left[1 + \frac{\mu}{N}\left(\mathrm e^{i\nu} - 1\right)\right]^{N} ,$$
or
$$\Phi_T(\nu) = \exp\left[-\mu\left(1 - \mathrm e^{i\nu}\right)\right] .$$
4.9 Rare Events and Particle Noise
Fig. 4.14. The time interval is chosen small enough to ensure that there is a negli-
gible probability of the number of detection events being greater than 1
by 2. In this case, we would no longer have a Poisson variable, since the result
would not necessarily be integer-valued.
However, when $\mu$ is large, a good approximation to the characteristic func-
tion $\Phi_T(\nu) = \exp[-\mu(1 - \mathrm e^{i\nu})]$ is $\exp(i\mu\nu - \mu\nu^2/2)$, which corresponds to the
characteristic function of a Gaussian variable with mean and variance equal to
$\mu$. Indeed, if $\mu$ is large, $\exp[-\mu(1 - \mathrm e^{i\nu})]$ is only non-negligible if $\nu$ is close to
0. Now in this case, $\mathrm e^{i\nu} \approx 1 + i\nu - \nu^2/2$ and hence $\Phi_T(\nu) = \exp[-\mu(1 - \mathrm e^{i\nu})] \approx
\exp[-\mu(-i\nu + \nu^2/2)]$. A sum of Poisson variables thus converges to a Gaussian
variable. In other words, in high fluxes, Poisson noise is equivalent to
Gaussian noise. Note, however, that the variance of this Gaussian distribu-
tion will be equal to its mean, as always happens with a Poisson distribution.
It is interesting to compare this result with the random walk where the vari-
ance and mean are also proportional.
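This convergence is easy to visualize numerically; the value $\mu = 400$ and the range scanned below are arbitrary illustrative choices.

```python
import math

# Compare the Poisson law of large mean mu with the Gaussian law
# of the same mean and variance.
mu = 400.0

def poisson_pmf(n, mu):
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))

def gauss_pdf(x, m, var):
    return math.exp(-(x - m) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

peak = gauss_pdf(mu, mu, mu)
max_diff = max(abs(poisson_pmf(n, mu) - gauss_pdf(n, mu, mu))
               for n in range(300, 501))
# The pointwise difference is a small fraction of the peak value.
print(peak, max_diff)
```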
$$P_I(I) = \frac{1}{I_0}\exp\left(-\frac{I}{I_0}\right) .$$
Suppose further that the photon flux is stationary and very low. In Section 4.9,
we saw that the number of photons measured over a time interval $T$ is a random
variable $N_\mu$ whose probability distribution can be accurately described by a
Poisson law, viz.,
$$P_T(n) = \mathrm e^{-pT}\,\frac{(pT)^n}{n!} .$$
We now write this law with the notation $pT = \gamma I$, where $I$ is the mean
intensity received at the detector and $\gamma$ is a coefficient depending on the
surface properties, the efficiency of the detector and the time interval $T$. We
now observe that, if we consider that the speckle pattern is projected onto
a detector whose position is random, then the intensity $I$ is itself a random
variable that we shall write in the form $I_\lambda$. The random variables $N_\mu$ and $I_\lambda$
arise from very different phenomena. The Poisson noise reflects fluctuations
4.10 Low Flux Speckle
$$P_T(n) = \int_0^{+\infty} P_T(n \mid I)\, P_I(I)\, dI ,$$
where the argument here is analogous to the one in Section 4.5. We can then
write
$$P_T(n) = \int_0^{+\infty} \mathrm e^{-\gamma I}\, \frac{(\gamma I)^n}{n!}\, \frac{1}{I_0} \exp\left(-\frac{I}{I_0}\right) dI .$$
It is now easy to obtain
$$P_T(n) = \frac{1}{1 + \gamma I_0}\left(\frac{\gamma I_0}{1 + \gamma I_0}\right)^{n} .$$
Indeed, setting $\alpha = \gamma I_0$ and substituting $x = I/I_0$, we have
$$P_T(n) = \frac{\alpha^n}{n!} \int_0^{+\infty} \exp(-\alpha x)\, x^n \exp(-x)\, dx .$$
If we put $I_n = \int_0^{+\infty} \exp[-(1+\alpha)x]\, x^n\, dx$, we find that
$$P_T(n) = \frac{\alpha^n}{n!}\, I_n ,$$
and since $I_n = n!/(1+\alpha)^{n+1}$, we recover the distribution above,
as required.
A simple calculation² is enough to show that
$$\langle N \rangle = \gamma I_0 ,$$
and that
$$\sigma_N^2 = \langle N^2 \rangle - \langle N \rangle^2 = \gamma I_0 (1 + \gamma I_0) .$$
At very low fluxes, we have $\gamma I_0 \ll 1$ and hence $\sigma_N^2 \approx \gamma I_0$, which corre-
sponds to the variance of the Poisson noise. The main source of fluctuations is
thus the Poisson noise. At high fluxes, we have $\gamma I_0 \gg 1$ and hence $\sigma_N^2 \approx (\gamma I_0)^2$,
which corresponds to the variance of the speckle noise. The Poisson noise is
then negligible in comparison with the fluctuations due to speckle.
² A more complete notation would be $N_{\mu,\lambda}$. In order to simplify, we have not
indicated the random events $\mu$ and $\lambda$.
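The compound law derived in this section can be checked by a short Monte Carlo sketch; the value of $\gamma I_0$, the sample size and the inversion sampler are all illustrative choices, not taken from the text.

```python
import math
import random

# Low-flux speckle: draw I/I0 from an exponential law, then a photon count
# from a Poisson law of mean gamma*I.
random.seed(3)
gamma_i0 = 0.2     # the product gamma * I0 (illustrative)
n = 200_000

def poisson_sample(mu):
    # Poisson sampling by inversion of the cumulative distribution
    u, p, k = random.random(), math.exp(-mu), 0
    c = p
    while u > c:
        k += 1
        p *= mu / k
        c += p
    return k

counts = [poisson_sample(gamma_i0 * random.expovariate(1.0)) for _ in range(n)]
mean_n = sum(counts) / n
var_n = sum((x - mean_n) ** 2 for x in counts) / n
p0 = counts.count(0) / n
# theory: <N> = gamma*I0 = 0.2, variance = 0.2*(1 + 0.2) = 0.24,
# and P_T(0) = 1/(1 + gamma*I0) = 1/1.2
print(mean_n, var_n, p0)
```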
Exercises
Let $B_i$ be noise samples, where $i \in \mathbb Z$ and $\mathbb Z$ is the set of positive and negative
integers. Assume that the $B_i$ form a sequence of independent random numbers with
zero mean and values uniformly distributed over the interval $[-a, a]$. Hence,
$$\langle B_i B_j \rangle = \sigma_B^2\, \delta_{ij} ,$$
where $\delta_{ij}$ is the Kronecker symbol and where $\langle\,\cdot\,\rangle$ represents the expectation value operator with respect to the
various realizations $B_i$ of the noise.
(1) Calculate $\sigma_B^2$ as a function of $a$.
Suppose that another sequence Si is obtained by averaging the Bi according
to the rule
$$S_i = \frac{1}{\sqrt N}\sum_{j=1}^{N} B_{i+j} .$$
(2) What is the covariance function of $S_i$?
(3) What is the probability density function of $S_i$ when $N = 2$? Express the
result as a function of $\sigma_B^2$.
(4) What is the probability density function of $S_i$ as $N \to +\infty$?
Consider a point optical detector which can measure the intensity in the ver-
tical or horizontal polarization states of a light signal. We assume that these
two intensities are described by independent random variables $I_{\mathrm{hor}}$ and $I_{\mathrm{ver}}$
with the same exponential probability density function, viz.,
$$P(I) = \frac{1}{I_0}\exp\left(-\frac{I}{I_0}\right) .$$
Consider a continuous random walk in discrete time steps. Let $P(r)$ be the
probability of taking a step of amplitude $r$ and $R_n$ the position at step $n$.
Discuss the difference in the asymptotic behavior for large $n$ when
(1) $P(r) = \frac{1}{2}\exp(-|r|)$, where $|r|$ represents the absolute value of $r$,
(2) $P(r) = \frac{1}{\pi}\,(1 + r^2)^{-1}$.
(1) Determine the probability density as a function of time, but without cal-
culating the sums.
(2) What would be the probability density as a function of time if the particles
diffused over a circle of radius R and the initial conditions were
$$P(x, 0) = \delta(x) .$$
The aim here is to model random walks in which large jumps may occur
sporadically. For example, we might think of a flea which walks for a while,
then takes a jump, walks a bit more, then takes another jump, and so on. To
simplify, we consider this random walk in 1 dimension. We write simply
$$Z_i = X_i + \sum_{l=1}^{L} Y_{i,l} ,$$
where the Xi are random variables distributed according to the Cauchy prob-
ability density function
$$p_X(x) = \frac{a}{\pi}\,\frac{1}{x^2 + a^2} ,$$
and where the Yi,l are random variables distributed according to the Gaussian
probability density function
$$p_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{y^2}{2\sigma^2}\right) .$$
each equipped with a probability law $p_j^{(1)}$ for the events $A_j^{(1)}$ and $p_k^{(2)}$ for
the events $A_k^{(2)}$. As the events are assumed independent, the probability of
observing $A_j^{(1)}$ and $A_k^{(2)}$ simultaneously is simply $p_j^{(1)} p_k^{(2)}$. The information
content of the joint event $(A_j^{(1)}, A_k^{(2)})$ is thus
$$Q_{j,k} = -\ln\left(p_j^{(1)} p_k^{(2)}\right) .$$
Since the events are independent, we expect the total information content to
be the sum of the information contents of the two components, i.e., $Q_{j,k} =
Q_j + Q_k$, and this is indeed the case with Shannon's definition.
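A two-line numerical check of this additivity, with made-up probabilities:

```python
import math

# Shannon information Q = -ln p is additive over independent events.
p1, p2 = 0.25, 0.1                  # arbitrary example probabilities
q_joint = -math.log(p1 * p2)        # information of the joint event
q_sum = -math.log(p1) - math.log(p2)
print(q_joint, q_sum)
```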
It is worth noting that the information content carried by an event which is
certain in the probabilistic sense, so that $p_i = 1$, is actually zero, i.e., $Q_i = 0$.
In contrast, the information content carried by a very unlikely event can be
arbitrarily large, i.e.,
$$\lim_{p_i \to 0} Q_i = +\infty .$$
At this stage, the definition of Shannon information may look somewhat ar-
bitrary. Here again, if we gauge the interest of a definition by the relevance of
the results it generates, there can be no doubt that this definition is extremely
productive. It underlies the mathematical theory of information transmission
and coding systems. It can be used to set up highly efficient techniques for
optimizing such systems. However, since our objective is to characterize the
fluctuations in physical systems and estimate physical quantities in the pres-
ence of fluctuations, we shall not emphasize this interesting and important
aspect of information theory. We shall instead focus on applications in classi-
cal physics.
5.2 Entropy
It is easy to see that the entropy is a positive quantity. It is zero if only one
event has a nonzero probability. Indeed, we have $\lim_{p_j\to 0}(p_j \ln p_j) = 0$ and
$1 \ln 1 = 0$. Note also that, if the events are equiprobable, we have $p_j = 1/N$
for each value of $j$ and hence
$$S(\Omega) = \ln N .$$
$$S(\Omega_T) = -\sum_{A\in\Omega_T} P(A)\ln P(A) .$$
If $\Omega_T$ is generated by two independent sources $\Omega_1$ and $\Omega_2$, this becomes
$$S(\Omega_T) = -\sum_{\lambda\in\Omega_1}\sum_{\mu\in\Omega_2} P(\lambda,\mu)\ln P(\lambda,\mu) ,$$
or, since $P(\lambda,\mu) = P(\lambda)P(\mu)$ for independent events,
$$S(\Omega_T) = -\sum_{\lambda\in\Omega_1}\sum_{\mu\in\Omega_2} P(\lambda,\mu)\left[\ln P(\lambda) + \ln P(\mu)\right] .$$
Using
$$P(\lambda) = \sum_{\mu\in\Omega_2} P(\lambda,\mu) \quad\text{and}\quad P(\mu) = \sum_{\lambda\in\Omega_1} P(\lambda,\mu) ,$$
we then have
$$S(\Omega_T) = S(\Omega_1) + S(\Omega_2) ,$$
where $S(\Omega_1) = -\sum_{\lambda\in\Omega_1} P(\lambda)\ln P(\lambda)$ and $S(\Omega_2) = -\sum_{\mu\in\Omega_2} P(\mu)\ln P(\mu)$.
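The additivity of entropy over independent sources can be verified directly; the two example laws below are arbitrary.

```python
import math

# Entropy of a product law of two independent sources equals the sum
# of the individual entropies.
def entropy(ps):
    return -sum(p * math.log(p) for p in ps if p > 0)

law1 = [0.5, 0.3, 0.2]
law2 = [0.7, 0.2, 0.1]
joint = [a * b for a in law1 for b in law2]   # P(lambda, mu) = P(lambda) P(mu)
print(entropy(joint), entropy(law1) + entropy(law2))
```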
We can now give a simple interpretation of the Shannon entropy by con-
sidering experiments in which we observe the realization of $L$ independent
random events arising from $\Omega$. We thus form a new random event denoted
$A_L$, which takes its values in the set $\Theta = \Omega\times\Omega\times\cdots\times\Omega$, where $\Omega\times\Omega$
is the Cartesian product of $\Omega$ with itself and where $\Omega$ appears $L$ times in
5 Information and Fluctuations
the expression for $\Theta$. In other words, and to put it more simply, the new
random events are the sequences $A_L = \{A(1), A(2), \dots, A(L)\}$, where $A(n)$ is
the $n$th independent realization of an event in $\Omega$. [It is important to distin-
guish the notation $A(j)$ and $A_j$. Indeed, $A_j$ is the $j$th element of $\Omega$, where
$\Omega = \{A_1, A_2, \dots, A_N\}$, whereas $A(j)$ is the $j$th realization in the sequence
$A_L = \{A(1), A(2), \dots, A(L)\}$.] The number $N_L$ of different sequences $A_L$ for
which each event $A_j$, representing the $j$th event of $\Omega$, appears $\ell_j$ times is
$$N_L = \frac{L!}{\ell_1!\,\ell_2!\cdots\ell_N!} .$$
Indeed, there are
$$L(L-1)\cdots(L-\ell_1+1) = \frac{L!}{(L-\ell_1)!}$$
ordered ways of choosing positions for the events $A_1$, i.e., $L!/[(L-\ell_1)!\,\ell_1!]$
distinct placements. There are now $L-\ell_1$ places left for the events $A_2$. For the
given positions of the events $A_1$, there are therefore
$$\frac{L!}{(L-\ell_1)!\,\ell_1!}\;\frac{(L-\ell_1)!}{(L-\ell_1-\ell_2)!\,\ell_2!}$$
possibilities for placing the events $A_1$ and $A_2$. Repeating this argument, it is
easy to convince oneself that the number of different sequences is
$$N_L = \frac{L!}{\ell_1!\,\ell_2!\cdots\ell_N!} .$$
We now analyze the case where $L$ is very large. For this we shall need the
simplified Stirling approximation $\ln L! \approx L\ln L - L$,
whence
$$\ln N_L \approx L\ln L - L - \sum_{j=1}^{N}\left(\ell_j \ln \ell_j - \ell_j\right) .$$
Since $\sum_{j=1}^{N} \ell_j = L$, it follows that
$$\lim_{L\to\infty} \frac{\ln N_L}{L} = \lim_{L\to\infty}\left[-\sum_{j=1}^{N} \frac{\ell_j}{L}\ln\left(\frac{\ell_j}{L}\right)\right] .$$
In this case, the law of large numbers allows us to assert that the event $A_j$
will occur approximately $p_j L$ times, where $p_j$ is the probability of the event
$A_j$ occurring. We will thus have
$$\lim_{L\to\infty} \frac{\ln N_L}{L} = -\sum_{j=1}^{N} p_j \ln p_j ,$$
or
$$\lim_{L\to\infty} \frac{\ln N_L}{L} = S(\Omega) ,$$
or
$$N_L \approx \exp\left[S(\Omega)\, L\right] .$$
We now consider two extreme cases. Suppose to begin with that only
the event $A_1$ has nonzero probability of occurring. We have seen that the
entropy is then minimal and zero. In this case, the number $N_L$ of different
sequences of random events is obviously equal to 1, since only the sequence
$A_1, A_1, A_1, \dots, A_1$ then has nonzero probability of occurring. Note that this
result is consistent with what was said before, since $\ln N_L = 0$ and $S(\Omega) = 0$.
In contrast, if the set of random events is made up of $n$ equiprobable
events, we have $p_i = 1/n$ and the number $N_L$ of different sequences of random
events is then $N_L = n^L$. The entropy is easy to determine, and we obtain
$S(\Omega) = -\sum_{i=1}^{n} (1/n)\ln(1/n) = \ln n$, which is in good agreement with $N_L \approx
\exp[S(\Omega)L]$, since $N_L = n^L = \exp(L\ln n)$.
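The relation $N_L \approx \exp[S(\Omega)L]$ can be checked numerically using the logarithm of the multinomial coefficient; the probability law and the values of $L$ below are arbitrary illustrative choices.

```python
import math

# Compare ln N_L, computed exactly via lgamma, with S*L as L grows.
probs = [0.5, 0.25, 0.25]
entropy = -sum(p * math.log(p) for p in probs)

def log_multinomial(counts):
    total = sum(counts)
    return math.lgamma(total + 1) - sum(math.lgamma(c + 1) for c in counts)

ratios = []
for L in (100, 1000, 10_000):
    counts = [int(p * L) for p in probs]   # typical occupation numbers p_j * L
    ratios.append(log_multinomial(counts) / (entropy * L))
# The ratio ln(N_L) / (S*L) approaches 1 from below as L grows.
print(ratios)
```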
To sum up, we observe that the entropy defined as Shannon's mean in-
formation content is indeed a characteristic measure of the potential disorder
of $\Omega$, for it directly determines the number of different potential realizations
that $\Omega$ can generate. We thus see that Shannon's choice for the information
content leads to a useful definition of entropy. Indeed, the entropy is defined as
the mean quantity of information that the source $\Omega$ of random events can gen-
erate. Moreover, we have just shown that it is directly related to the number
of different sequences of independent realizations that we can observe.
$$A_1 = 101010101010101010101010101010101010101010101010$$
and
$$A_2 = 110010001110101010011100010101000111010011010011$$
5.3 Kolmogorov Complexity
have the same probability $q^{24}(1-q)^{24}$. However, it would be much easier
to memorize the first sequence ($A_1$) than the second ($A_2$). This is simply
because there is a very simple algorithm for generating the first sequence.
This algorithm might be, for example,
Write 48 figures alternating between 0 and 1 and starting
with 1.
It is on the basis of this observation that we can define the notion of Kol-
mogorov complexity for a sequence of characters. The Kolmogorov complexity
$K(A)$ of a sequence $A$ is the length of the shortest program able to generate it.
We may thus reasonably expect to obtain $K(A_1) < K(A_2)$, whereas if $Q(A_1)$
and $Q(A_2)$ denote the information contents of the sequences $A_1$ and $A_2$,
we have $Q(A_1) = Q(A_2)$.
The Kolmogorov complexity of a sequence is maximal when there is no
algorithm simpler than a full description of it term by term. In this case, if
the length of the sequence A is n, we have K(A) = n + Const., where the
constant is independent of n. The complexity of a password corresponding to
the first name of one of your friends and relations will certainly rate lower
than Dr45k;D. It is because a hacker assumes that you will have chosen a
password of low complexity that he or she will begin by trying the first names
of those dear to you. The algorithm applied by the potential intruder could
consist in trying all the first names of your friends and relations, together with
their family names and dates of birth. If the hacker is well organized, he or
she will run through all the nouns in a dictionary.
We also observe that, if we find a program of length $p$ which can generate a
sequence of length $n$ with $p < n$, this means that we have managed to compress
the sequence by a factor of $n/p$. For a given length $n$, the compression is all
the more efficient as $p$ is small.
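Although the Kolmogorov complexity itself is not computable, a general-purpose compressor gives a crude, computable analogue of this idea. The sketch below (sequence lengths and seed are arbitrary choices) compresses a periodic and an irregular binary sequence of equal length.

```python
import random
import zlib

# A periodic sequence compresses far better than an irregular one of the
# same length, hinting at its much lower algorithmic complexity.
random.seed(4)
periodic = ("10" * 2400).encode()
irregular = "".join(random.choice("01") for _ in range(4800)).encode()
c_periodic = len(zlib.compress(periodic, 9))
c_irregular = len(zlib.compress(irregular, 9))
print(len(periodic), c_periodic, c_irregular)
```

The compressed size of the irregular sequence stays close to its entropy bound of one bit per symbol, while the periodic one shrinks to a few tens of bytes.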
A good example of a situation in which we must seek a simple algorithm
is provided by the so-called logic tests which consist in completing a series of
numbers. To begin with, the algorithm must reproduce the given series of numbers,
whereupon we may use it to predict the following numbers. Imagine now a more
sophisticated test in which, as well as predicting the rest of the sequence, we
first ask whether this will even be possible. The fact that we have not found a
solution does not mean that no solution exists. Hence, for these series, we
cannot assert at first sight that there does not exist a solution.
It can be shown that finding the program of minimal length is a problem
with no simple solution in general. Unfortunately, this significantly reduces
the practical relevance of Kolmogorov complexity. However, it is worth noting
that this definition still retains several points of interest. In the next section,
we will be able to use it to define a degree of randomness, which is interesting
even if we cannot always calculate it.
The other significant feature is the "philosophical" implication contained
in the definition of Kolmogorov complexity. Indeed, we may consider that
the aim of a physical theory is to sum up in the most concise manner
Fig. 5.1. Either of the two functions $X_\lambda(t)$ and $Y_\lambda(t)$ could represent a deterministic
function or a stochastic process
$Y_\lambda(t)$ is centered on a time $T_\lambda$ which varies in an unpredictable way from one
observation to another, it is more useful to model this family of functions by
a stochastic process. (In reality, it is enough not to be trying to predict it.
Nothing requires there to be anything unpredictable about it.) Consequently,
we note once again that the idea of stochastic process corresponds to a certain
standpoint we have adopted, rather than some intrinsic property of the signals
we observe. Since the property of stationarity is defined in terms of the mean,
i.e., the expectation value, of the various possible realizations, it is of course
quite impossible to decide whether or not a stochastic process is stationary
on the basis of a single realization.
It is nevertheless clear that $X_\lambda(t)$ seems much more irregular than $Y_\lambda(t)$,
and we might be tempted to declare that $X_\lambda(t)$ is "more random" than $Y_\lambda(t)$.
The power of $X_\lambda(t)$ is more widely spread out across the observation period
than the power of $Y_\lambda(t)$, and this might suggest that $X_\lambda(t)$ is "more station-
ary" than $Y_\lambda(t)$. In order to analyze this kind of intuition in more detail,
we shall examine the case of binary-valued sampled functions. The sampling
theorem asserts that, provided we are dealing with signals having a bounded
spectrum, there is no loss of generality in considering only sampled signals.
Binary-valued processes nevertheless constitute a less general class. Later, we
shall analyze the problems raised by continuously varying random variables.
In the present case, the stochastic processes are simply random binary
sequences, which we shall denote by $X_\lambda(n)$ and $Y_\lambda(n)$, where $n \in [1, N]$. Let
us examine two random sequences analogous to the functions represented in
Fig. 5.1.
We thus define $X_\lambda(n)$ as a sequence of 0s and 1s drawn randomly and in-
dependently from each other with probability 1/2. The sequence $Y_\lambda(n)$ will be
identically zero except for one sample $j$, where it will equal 1. In other words,
$Y_\lambda(n) = 0$ if $n \neq j$ and $Y_\lambda(n) = 1$ if $n = j$. It is fairly clear that a realization
of $X_\lambda(n)$ will generally have a much greater Kolmogorov complexity than a
realization of $Y_\lambda(n)$. In the latter case, the algorithm to construct $Y_\lambda(n)$ is
rather simple:
Write $N$ 0s and replace the $j$th term by 1.
We may then say that the sequence $Y_\lambda(n)$ is not algorithmically random.
A more precise definition of this idea consists in considering a sequence as
algorithmically random if its Kolmogorov complexity is equal to its length.
This definition is particularly attractive but unfortunately turns out to be
rather impractical owing to the difficulty we have already mentioned, namely
the difficulty in determining the program of minimal length able to describe
the sequence.
The Kolmogorov complexity characterizes a given realization and this is
indeed what interests us here. We may also analyze the complexity of each
of the random sequences $X_\lambda(n)$ and $Y_\lambda(n)$, i.e., the set of all possible realiza-
tions. The approach adopted here then consists in calculating their entropy.
With regard to $X_\lambda(n)$, there are $2^N$ different possible sequences, all of which
5.5 Maximum Entropy Principle
are equally probable. The entropy is thus $S(X) = N\ln 2$. There are only $N$
different possible sequences for $Y_\lambda(n)$, once again assumed equiprobable. Its
entropy is therefore $S(Y) = \ln N$. We thus have $S(Y) \ll S(X)$, which means
that, from the entropy standpoint, $Y_\lambda(n)$ is also simpler than the sequence
$X_\lambda(n)$.
We now analyze the stationarity properties of these two sequences. We can
make no assertions on the basis of a single realization. However, it is very easy
to show that, if we neglect edge effects,¹ the two sequences both possess first
and second moments that are invariant under time translations. Indeed, we
have $\langle X_\lambda(n)\rangle = 1/2$ and $\langle Y_\lambda(n)\rangle = 1/N$ and, in addition, $\langle X_\lambda(n)X_\lambda(m)\rangle =
(1 + \delta_{n-m})/4$ and $\langle Y_\lambda(n)Y_\lambda(m)\rangle = \delta_{n-m}/N$, where $\delta_{n-m}$ is the Kronecker
symbol. These two sequences are therefore stationary.
the maximum entropy solution satisfies the stationarity condition
$$\frac{\partial}{\partial p_j}\, F(\boldsymbol p) = 0 ,$$
where $F$ denotes the entropy supplemented by the Lagrange constraint terms,
which leads to $p_j = \exp\left(-\mu_1 x_j - \mu_2 x_j^2\right)/Z(\mu_1, \mu_2)$,
where $Z(\mu_1, \mu_2) = \sum_{j=1}^{N} \exp\left(-\mu_1 x_j - \mu_2 x_j^2\right)$. It is a more delicate matter to
identify the parameters $\mu_1$ and $\mu_2$ than to identify $\mu_0$. Note, however, that the
mathematical form obtained is analogous to a Gaussian distribution (although
not the same, because we have been discussing a discrete probability law).
It is interesting to relate this result to the central limit theorem. Suppose
that $N$ is very large, that the $x_j$ are regularly spaced ($x_j = jd$) and that the
index $j$ runs over the set of integers (see Fig. 5.2). We will thus have $p_j =
\exp\left(-\mu_1 jd - \mu_2 j^2 d^2\right)/Z(\mu_1, \mu_2)$ and in the limit as $d$ becomes very small, we
will obtain a good approximation to the Gaussian distribution $p_j = P_G(jd)\,d$
with
$$P_G(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x - m)^2}{2\sigma^2}\right] ,$$
where $m$ and $\sigma$ are the mean and standard deviation of the distribution,
respectively.
the probability density function of the sum variable must finally converge.
We obtain here a new interpretation of this result. For given mean and vari-
ance, the Gaussian distribution is the one which generates the largest number
of different sequences in which the frequency of occurrence of each random
event is equal to its probability during independent realizations. Indeed, it
has maximum entropy relative to all other distributions with the same mean
and variance. Note that we must be given a resolution d for distinguishing
two values before we can speak of the probability of each random event. For
a given variance, it thus contains the maximal potential disorder. We might
say that it is the most complex law from the stochastic point of view. In other
words, the universal character of the Gaussian distribution in the context of
the central limit theorem corresponds to convergence toward the probability
distribution containing the maximal potential disorder, where the measure of
disorder is the number of different sequences a law can generate during inde-
pendent realizations. We may say schematically that, when we sum random
variables with finite second moment, the result has maximal complexity or
maximal disorder. However, it is important to understand the exact meaning
attributed to the notions of complexity and disorder when we make such a
claim.
Finally, for a given power, the assumption that the fluctuations in a phys-
ical quantity are Gaussian can be understood as a hypothesis of maximal
disorder or a priori minimal knowledge. The entropy is the mean information
content that the source of random events can supply during independent re-
alizations. In the case of a source with high entropy, each realization will tend
to bring a lot of information, and this is compatible with the interpretation
whereby the information available to us a priori, i.e., before the trials, is itself
minimal.
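As a quick numerical illustration (a sketch of my own, not from the text; the grid bounds and the comparison laws are arbitrary choices), we can discretize several laws with the same variance at resolution d and check that the Gaussian has the largest entropy:

```python
import math

def entropy(p):
    """Discrete Shannon entropy -sum p ln p (natural logarithm)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

# Resolution d on the interval [-10, 10]; all three laws have variance 1.
d = 0.01
xs = [i * d for i in range(-1000, 1001)]

gauss = normalize([math.exp(-x * x / 2) for x in xs])
b = 1 / math.sqrt(2)                      # Laplace: var = 2 b^2 = 1
laplace = normalize([math.exp(-abs(x) / b) for x in xs])
uniform = normalize([1.0 if abs(x) <= math.sqrt(3) else 0.0 for x in xs])

S_g, S_l, S_u = entropy(gauss), entropy(laplace), entropy(uniform)
print(S_g > S_l and S_g > S_u)  # True: the Gaussian dominates
```

At fixed resolution the discrete entropies differ from the continuous ones only by the common shift −ln d, so the ordering is the same as for the differential entropies.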
Let Y_λ be the discrete random variable obtained from the continuous variable X_λ at resolution δ, so that
$$P_j = \int_{j\delta-\delta/2}^{j\delta+\delta/2} P_X(x)\,dx\,,$$
and, for small δ, $P_j \approx P_X(j\delta)\,\delta$. We thus have
$$S(Y) \approx -\int P_X(x)\ln\big[P_X(x)\big]\,dx - \ln\delta\,.$$
We thus note that, when 8 tends to 0, the entropy S(Y) diverges. This result is
easily understood if we remember that the entropy is a measure of the number
N L of different sequences of length L that can be generated from independent
realizations in such a way that the frequency of occurrence of each random
event is equal to its probability. As the value of 8 decreases, the number of
different sequences increases, until it diverges as 8 tends to 0.
In the limit as δ tends to 0, X_λ and Y_λ become identical. We may thus say that the entropy of any continuous random variable is formally infinite. This is hardly a practical result! We therefore define the entropy of continuous random variables in terms of the probability density function by
$$S(X) = -\int P_X(x)\ln\big[P_X(x)\big]\,dx\,.$$
124 5 Information and Fluctuations
For the Gaussian probability density function
$$P_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x-m)^2}{2\sigma^2}\right],$$
we obtain
$$S = \frac{1}{2}\ln\big(2\pi e\,\sigma^2\big)\,.$$
In this section we shall study the dynamical evolution of the entropy in two
very similar, but nevertheless different cases, namely, propagation and diffu-
sion. To keep the discussion simple, we restrict to the case of 1-dimensional
signals (see Fig. 5.4).
We begin with the evolution of the entropy during propagation of optical
signals. In the 1-dimensional case, the equation obeyed by the field A(x, t) is
simply
$$\frac{\partial^2 A(x,t)}{\partial x^2} - \frac{1}{c^2}\frac{\partial^2 A(x,t)}{\partial t^2} = 0\,.$$
We denote the intensity of the field A(x, t) by I(x, t) at the point with coordinate x and at time t, so that I(x, t) = |A(x, t)|². A simple model consists in considering that, at the point with coordinate x and at time t, the number of photons that can be detected is proportional to I(x, t). Suppose that the light pulse we are interested in has finite energy E, where $E = \int_{-\infty}^{+\infty} I(x,t)\,dx$. Moreover, if we assume that there is no absorption during propagation, the energy will be constant in time. We may thus consider that, for a pulse with one photon, the probability density of detecting a photon at the point with coordinate x and at time t is given by
5.7 Entropy, Propagation and Diffusion 125
Fig. 5.4. Comparing the dynamical evolution of probability densities for systems
undergoing diffusion and propagation
$$P_{X,t}(x) = \frac{I(x,t)}{E}\,.$$
Consider a wave of the form A(x, t) = A₀(x − ct). Then
$$\frac{\partial^2}{\partial x^2}A(x,t) = \frac{\partial^2}{\partial x^2}A_0(x-ct)
\quad\text{and}\quad
\frac{\partial^2}{\partial t^2}A_0(x-ct) = c^2\,\frac{\partial^2}{\partial x^2}A_0(x-ct)\,,$$
so we can deduce that, whatever the function A₀(x),
$$\frac{\partial^2 A_0(x-ct)}{\partial x^2} - \frac{1}{c^2}\frac{\partial^2 A_0(x-ct)}{\partial t^2} = 0\,.$$
In the same way, it can be shown that A(x, t) = Ao(x + ct) is also a solution
of the propagation equation. In this case, the wave moves toward negative x
values, whereas with A(x, t) = Ao(x - ct), the wave moves toward positive x
values. We consider only the last solution. We thus have I(x, t) = Io(x - ct),
where Io(x) = IAo(x)12, and PX,t(x) = Px,o(x - ct) = Io(x - ct)jE.
Let us now investigate how the entropy changes in time. At a given time t,
the definition of the entropy for a continuous probability distribution implies
that
$$S_t = -\int P_{X,t}(x)\ln\big[P_{X,t}(x)\big]\,dx\,.$$
Making the change of variable u = x − ct, we obtain
$$S_t = -\int P_{X,0}(u)\ln\big[P_{X,0}(u)\big]\,du = S_0\,.$$
The entropy thus remains constant during propagation.
Consider now a probability density evolving according to the diffusion equation
$$\frac{\partial P_{X,t}(x)}{\partial t} - \frac{\chi^2}{2}\frac{\partial^2 P_{X,t}(x)}{\partial x^2} = 0\,.$$
To bring out the analogy with the propagation equation, we can also write
$$\frac{\partial^2 P_{X,t}(x)}{\partial x^2} - \frac{2}{\chi^2}\frac{\partial P_{X,t}(x)}{\partial t} = 0\,.$$
Differentiating the entropy with respect to time and substituting the diffusion equation for $\partial P_{X,t}(x)/\partial t$, we can write
$$\frac{\partial S_t}{\partial t} = -\frac{\chi^2}{2}\big[F_1(t) + F_2(t)\big]\,,$$
where
$$F_1(t) = \left[\frac{\partial P_{X,t}(x)}{\partial x}\big(1 + \ln P_{X,t}(x)\big)\right]_{x=-\infty}^{x=+\infty}$$
and
$$F_2(t) = -\int \frac{1}{P_{X,t}(x)}\left[\frac{\partial P_{X,t}(x)}{\partial x}\right]^2 dx\,.$$
If we assume that P_{X,t}(x) decreases monotonically as x → +∞ and also as x → −∞, the first term is then zero. Indeed, let us examine the limit of this term when x → +∞. If we set f(x) = P_{X,t}(x) and f′(x) = df(x)/dx, we have
$$f'(x)\big[1 + \ln f(x)\big] = \frac{d}{dx}\big[f(x)\ln f(x)\big]\,.$$
We thus obtain
$$\frac{\partial S_t}{\partial t} = \frac{\chi^2}{2}\int \frac{1}{P_{X,t}(x)}\left[\frac{\partial P_{X,t}(x)}{\partial x}\right]^2 dx > 0\,.$$
The entropy thus increases with time during a diffusion process. This result
would also be true in dimensions greater than 1. Diffusion is therefore an irre-
versible process and, unlike the propagation equation, the diffusion equation
is not invariant under time reversal.
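The two behaviors can be checked numerically. The following sketch (my own; the grid, time step and initial law are arbitrary choices) integrates the diffusion equation with an explicit finite-difference scheme and verifies that the discrete estimate of the entropy never decreases:

```python
import math

# Explicit finite-difference scheme for the diffusion equation
#   dP/dt = (chi^2 / 2) d2P/dx2.
nx, dx, dt, chi2 = 200, 0.1, 0.001, 1.0
coef = 0.5 * chi2 * dt / dx ** 2   # = 0.05, stable (must be <= 0.5)

P = [0.0] * nx
for k in range(95, 105):           # narrow normalized initial law
    P[k] = 1.0                     # 10 cells * 1.0 * dx = total mass 1

def entropy(P, dx):
    """Discrete estimate of S = -integral of P ln P dx."""
    return -sum(p * math.log(p) * dx for p in P if p > 0)

S_prev = entropy(P, dx)
increasing = True
for _ in range(500):
    new = P[:]
    for i in range(1, nx - 1):
        new[i] = P[i] + coef * (P[i + 1] - 2 * P[i] + P[i - 1])
    P = new
    S = entropy(P, dx)
    if S < S_prev - 1e-12:
        increasing = False
    S_prev = S

print(increasing)   # the entropy never decreases
print(S_prev > 0)   # and has grown from its initial value 0
```

Each update step is a local averaging of the probability values, which is precisely why the disorder, and hence the entropy, can only grow.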
where |Γ| is the determinant of Γ. In the multidimensional case, the entropy of this probability density function is simply
$$S = \frac{1}{2}\big\langle Q_X(X_1, X_2, \ldots, X_N)\big\rangle\,,$$
where
$$Q_X(x_1, x_2, \ldots, x_N) = \sum_{i=1}^{N}\sum_{j=1}^{N} x_i K_{ij} x_j + \ln\big[(2\pi)^N|\Gamma|\big]\,,$$
so that
$$S = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\Gamma_{ij}K_{ij} + \ln\Big[\big(\sqrt{2\pi}\big)^N|\Gamma|^{1/2}\Big]\,.$$
Now
$$\sum_{i=1}^{N}\sum_{j=1}^{N}\Gamma_{ij}K_{ij} = N\,,$$
and hence,
$$S = \frac{1}{2}\ln\big[(2\pi e)^N|\Gamma|\big]\,.$$
We shall now express this entropy in terms of the spectral density of the sequence X_λ(n). To this end, we must consider stationary random sequences. However, it is difficult to define stationarity for a finite-dimensional random sequence. Our task is made easier by constructing an infinitely long periodic sequence from X_λ(n), viz.,
5.8 Multidimensional Gaussian Case 129
We saw in Section 3.5 that the power spectral density Γ̂(ν) of X_λ(n) satisfies the relations
$$\hat{\Gamma}(\nu) = \frac{1}{N}\sum_{m=0}^{N-1}\Gamma(m)\exp\left(\frac{i2\pi\nu m}{N}\right)$$
and
$$\Gamma(m) = \sum_{\nu=0}^{N-1}\hat{\Gamma}(\nu)\exp\left(-\frac{i2\pi\nu m}{N}\right).$$
Let F denote the corresponding discrete Fourier matrix. This matrix is unitary, i.e., $\bar{F}^{t}F = F\bar{F}^{t} = \mathrm{Id}_N$, where $\bar{F}^{t}$ is the transposed complex conjugate of F and Id_N the identity matrix in N dimensions. We thus find that
noting that the quantities NΓ̂(ν) are the eigenvalues of Γ, hence real and positive, as explained in Section 2.8. Now
$$\big|\bar{F}^{t}\,\Gamma\,F\big| = \big|\bar{F}^{t}\big|\,|\Gamma|\,|F| = |\Gamma|\,,$$
and we thus deduce that
$$|\Gamma| = N^N\prod_{\nu=0}^{N-1}\hat{\Gamma}(\nu)\,.$$
We have
$$S = \frac{1}{2}\ln\left[(2\pi e N)^N\prod_{\nu=0}^{N-1}\hat{\Gamma}(\nu)\right],$$
so that
$$S = \frac{1}{2}\sum_{\nu=0}^{N-1}\ln\big[2\pi e N\,\hat{\Gamma}(\nu)\big]\,.$$
Consider the trivial case of a white sequence, i.e., an uncorrelated sequence, with power σ². We thus have Γ̂(ν) = σ²/N, which implies that
$$S = \frac{N}{2}\ln\big(2\pi e\,\sigma^2\big)\,.$$
This result is consistent with the entropy value (1/2) ln(2πeσ²) of a scalar Gaussian variable. (The entropy of N independent scalar Gaussian variables is simply the sum of the entropies of each random variable.)
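The determinant and spectral-density expressions of the entropy can be compared numerically. The sketch below (my own; the covariance row is an arbitrary choice) builds a small circulant Gaussian covariance matrix and checks that $\frac{1}{2}\ln[(2\pi e)^N|\Gamma|]$ coincides with $\frac{1}{2}\sum_\nu\ln[2\pi e N\hat{\Gamma}(\nu)]$:

```python
import cmath, math

N = 4
gamma = [2.0, 0.5, 0.1, 0.5]  # symmetric circulant covariance row

# Circulant covariance matrix Gamma_ij = gamma[(j - i) mod N]
G = [[gamma[(j - i) % N] for j in range(N)] for i in range(N)]

def det(M):
    """Determinant by elimination (no pivoting needed: SPD matrix)."""
    M = [row[:] for row in M]
    d = 1.0
    for k in range(len(M)):
        d *= M[k][k]
        for i in range(k + 1, len(M)):
            f = M[i][k] / M[k][k]
            for j in range(k, len(M)):
                M[i][j] -= f * M[k][j]
    return d

# Spectral density: hat_Gamma(nu) = (1/N) sum_m gamma(m) exp(i 2 pi nu m / N)
hatG = [sum(gamma[m] * cmath.exp(2j * math.pi * nu * m / N)
            for m in range(N)).real / N for nu in range(N)]

S_det  = 0.5 * math.log((2 * math.pi * math.e) ** N * det(G))
S_spec = 0.5 * sum(math.log(2 * math.pi * math.e * N * g) for g in hatG)
print(abs(S_det - S_spec) < 1e-10)  # True
```

The agreement reflects the fact that the eigenvalues of a circulant covariance matrix are exactly NΓ̂(ν), so its determinant is $N^N\prod_\nu\hat{\Gamma}(\nu)$.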
We now analyze the evolution of the entropy when noise is transformed by a convolution filter. To keep this simple, we assume once again that the noise is sampled and cyclostationary. The noise before filtering is denoted by X_λ(n) and after filtering by Y_λ(n). We have seen that we must have
$$\hat{\Gamma}_{YY}(\nu) = \big|\hat{X}(\nu)\big|^2\,\hat{\Gamma}_{XX}(\nu)\,,$$
where Γ̂_XX(ν) and Γ̂_YY(ν) are the spectral densities of X_λ(n) and Y_λ(n), respectively. X̂(ν) is the transfer function characterizing the convolution filter, i.e., the discrete Fourier transform of the impulse response (or kernel) of the convolution filter. Let S(X) and S(Y) be the entropies of X_λ(n) and Y_λ(n), respectively. We have
$$S(Y) = \frac{1}{2}\sum_{\nu=0}^{N-1}\ln\big[2\pi e N\,\big|\hat{X}(\nu)\big|^2\,\hat{\Gamma}_{XX}(\nu)\big]\,,$$
and hence,
$$S(Y) = S(X) + \sum_{\nu=0}^{N-1}\ln\big|\hat{X}(\nu)\big|\,.$$
Using the simplified Stirling approximation when L is very large, we can carry out the same analysis for the entropy. We then obtain
$$\lim_{L\to\infty}\frac{\ln W_L}{L} = -\sum_{j=1}^{N} p_j\ln\frac{p_j}{q_j} = -K_u(P\|Q)\,.$$
The probability of finding a sequence such that the event A_j occurs approximately p_jL times is then
$$W_L \approx \exp\big[-L\,K_u(P\|Q)\big]\,.$$
The main point is this: the larger the value of the Kullback-Leibler measure, the smaller the probability of observing a sequence with frequencies of occurrence p_j if the probability law happens to be q_j. In addition, it is clear that the approximation $W_L \approx \exp[-L\,K_u(P\|Q)]$ is valid when L is large, and thus that W_L tends exponentially to 0 unless K_u(P‖Q) = 0. The Kullback-Leibler measure then characterizes the rate of decrease of the exponential.
The total number of different sequences is N^L. The number N_L of different sequences A_L for which each event A_j occurs approximately p_jL times is therefore approximately W_L N^L. When L is large, we have
$$S \approx L\big[\ln N - K_u(P\|Q)\big]\,,$$
where S = ln N_L.
" In Pj
tP(P) = '~Pj ~ - f-l '~Pj
" .
j=l qJ j=l
Now 8tP(P)/8pj = 0 implies that 1 + lnpj - lnqj - f-l = O. The constraint
2:f=l Pj = 1 leads to f-l = 1 and hence Pj = qj , Vj = 1, ... ,N. In this case, we
find immediately that Ku (P IIQ) = O. To check that this is indeed a minimum,
we note that 8 2 tP(P)/8p] = l/pj ~ 0 and that 8 2 tP(P)/8pj 8Pi = 0 if i -I- j.
This shows that K_u(P‖Q) characterizes the separation between the laws (p_j)_{j∈[1,N]} and (q_j)_{j∈[1,N]}. For this reason it is often referred to as the Kullback-Leibler distance. However, it should be noted that, from a mathematical point of view, this quantity does not satisfy the axioms required of a true distance: it is not symmetric and does not satisfy a triangle inequality.
It is interesting to determine the Kullback-Leibler measure of a law (p_j)_{j∈[1,N]} with respect to the uniform distribution q_j = 1/N, recalling that it is the uniform probability law that maximizes the entropy. We have
$$K_u(P\|Q_{\mathrm{unif}}) = \sum_{j=1}^{N} p_j\ln(Np_j)\,,$$
or
$$K_u(P\|Q_{\mathrm{unif}}) = \ln N - S(P)\,,\quad\text{where}\quad S(P) = -\sum_{j=1}^{N} p_j\ln p_j\,.$$
This is a special case of the result stated above:
$$S \approx L\big[\ln N - K_u(P\|Q)\big]\,,$$
recalling that, if S is the entropy of a given probability law, the entropy of the
law associated with the sequences made up of L independent observations is
LS.
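The identity relating the Kullback-Leibler measure with respect to the uniform law and the entropy is easy to verify numerically (a small sketch of my own; the example law is arbitrary):

```python
import math

def kl(p, q):
    """Kullback-Leibler measure sum p ln(p/q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

N = 5
p = [0.4, 0.3, 0.15, 0.1, 0.05]
q_unif = [1.0 / N] * N

lhs = kl(p, q_unif)
rhs = math.log(N) - entropy(p)
print(abs(lhs - rhs) < 1e-12)  # True: K(P||Q_unif) = ln N - S(P)
```

Since K_u is non-negative, the identity also re-proves that no law on N events can have entropy larger than ln N.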
5.10 Appendix: Lagrange Multipliers 133
We often seek the probability law which maximizes the entropy under certain constraints. For example, we may be looking for the probability law which maximizes the entropy $S(P) = -\sum_{j=1}^{N} p_j\ln p_j$ under the constraint that $\sum_{j=1}^{N} p_j = 1$. To achieve this, we can use the Lagrange multiplier technique. Many mathematical works specializing in optimization rigorously establish the situations in which this technique can be applied and in which it guarantees the existence and relevance of the solutions produced. We shall now show how to use this technique, whilst proposing a non-rigorous interpretation which nevertheless allows us to obtain a simple physical intuition.
We consider a function F(X) of the vector variable X = (x₁, x₂, …, x_N)ᵀ. Let us suppose that we seek the value for which this function reaches its maximum when the variable X satisfies the constraint g(X) = 0. In order to apply the Lagrange multiplier technique, we define the Lagrange function Ψ_μ(X) = F(X) − μg(X), where μ is a real parameter, also known as the Lagrange multiplier. We then seek the value of X which maximizes Ψ_μ(X), denoted symbolically by
$$X_\mu^{\mathrm{opt}} = \arg\max_X \Psi_\mu(X)\,.$$
We next determine the value μ₀ of μ for which $X_{\mu_0}^{\mathrm{opt}}$ satisfies the constraint, i.e., for which $g(X_{\mu_0}^{\mathrm{opt}}) = 0$. It is easy to see that, if we find a value $X_{\mu_0}^{\mathrm{opt}}$ such that $g(X_{\mu_0}^{\mathrm{opt}}) = 0$, it must correspond to the solution which maximizes F(X) under the constraint g(X) = 0. Indeed, suppose that there were a value X̃ such that F(X̃) > F($X_{\mu_0}^{\mathrm{opt}}$) and g(X̃) = 0. Then it is clear that we would have Ψ_μ(X̃) > Ψ_μ($X_{\mu_0}^{\mathrm{opt}}$), which would contradict the hypothesis that $X_{\mu_0}^{\mathrm{opt}}$ maximizes Ψ_μ(X).
This technique is easily generalized to the case where there are several constraints. For example, let these constraints be g₁(X) = 0 and g₂(X) = 0. We then define the Lagrange function
$$\Psi_{\mu_1,\mu_2}(X) = F(X) - \mu_1 g_1(X) - \mu_2 g_2(X)\,,$$
where μ₁ and μ₂ are the two Lagrange multipliers. We then seek the value of X which maximizes Ψ_{μ₁,μ₂}(X), denoted symbolically by
$$X_{\mu_1,\mu_2}^{\mathrm{opt}} = \arg\max_X \Psi_{\mu_1,\mu_2}(X)\,.$$
We then determine the values of μ₁ and μ₂ which satisfy the constraints $g_1(X_{\mu_1,\mu_2}^{\mathrm{opt}}) = 0$ and $g_2(X_{\mu_1,\mu_2}^{\mathrm{opt}}) = 0$. It is clear that this approach can be generalized to an arbitrary number of constraints.
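The steps above can be sketched for the entropy-maximization example (my own illustration; N and the random sanity checks are arbitrary choices). Setting ∂Ψ_μ/∂p_j = −ln p_j − 1 − μ = 0 gives p_j = exp(−1 − μ), and the constraint fixes μ₀ = ln N − 1:

```python
import math, random

random.seed(0)
N = 6

# Lagrange function Psi_mu(p) = S(p) - mu * (sum(p) - 1), S = -sum p ln p.
# Stationarity gives p_j = exp(-1 - mu); the constraint fixes mu0 = ln N - 1.
mu0 = math.log(N) - 1
p_opt = [math.exp(-1 - mu0)] * N
print(abs(sum(p_opt) - 1) < 1e-12)                  # constraint satisfied
print(all(abs(p - 1 / N) < 1e-12 for p in p_opt))   # uniform law

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

# Sanity check: no random law on the simplex beats the uniform law.
S_max = entropy(p_opt)
ok = all(entropy([w / s for w in ws]) <= S_max + 1e-12
         for ws in ([random.random() for _ in range(N)] for _ in range(200))
         for s in [sum(ws)])
print(ok)  # True
```

This mirrors the logic of the appendix: maximize the unconstrained Lagrange function first, then choose the multiplier so that the constraint holds.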
Exercises
Exercise 5.1
Consider a system with N states e₀, e₁, …, e_{N−1}. The probability of finding the system in the state e₀ is
$$P_0 = \frac{1}{N} + (N-1)a\,,$$
whilst for the other states e₁, e₂, …, e_{N−1}, this probability is
$$P = \frac{1}{N} - a\,.$$
Using the results of the last exercise, determine the Kullback-Leibler distances
between the following continuous probability distributions:
(1) scalar Gaussian distributions with different means and variances,
(2) Gamma distributions with different means and orders,
(3) Poisson distributions,
(4) geometric distributions.
Consider two probability laws P_a(n) and P_b(n), where n is a positive integer. The aim here will be to determine the probability law P_s(n) which lies at equal Kullback-Leibler distance from both P_a(n) and P_b(n) and which is the closest to them, where
$$K_u(P_s\|P_a) = \sum_{n=1}^{+\infty} P_s(n)\ln\left[\frac{P_s(n)}{P_a(n)}\right]$$
and
$$K_u(P_s\|P_b) = \sum_{n=1}^{+\infty} P_s(n)\ln\left[\frac{P_s(n)}{P_b(n)}\right].$$
(1) Among all those probability laws that possess a given Kullback-Leibler measure K_u(P_s‖P_a) with respect to P_a(n), show that the one closest to P_b(n) can be written in the form
$$P_s(n) = \frac{P_a^{s^*}(n)\,P_b^{1-s^*}(n)}{C(s^*)}\,,$$
where
$$C(s) = \sum_{n=1}^{+\infty} P_a^{s}(n)\,P_b^{1-s}(n)$$
and
$$s^* = \arg\min_s C(s)\,.$$
Define
$$C(s) = \sum_{n=1}^{+\infty} P_b^{s}(n)\,P_a^{1-s}(n)\,.$$
(1) Express
$$\frac{d}{ds}\ln[C(s)]\Big|_{s=0}
\quad\text{and}\quad
\frac{d}{ds}\ln[C(s)]\Big|_{s=1}\,.$$
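The geometric-mixture construction of the preceding exercise can be explored numerically. The sketch below (my own; it assumes the form P_s ∝ P_a^s P_b^{1−s} discussed above, and uses truncated geometric laws as arbitrary examples) locates s* by grid search and checks that the resulting law is at equal Kullback-Leibler distance from both endpoints:

```python
import math

def geometric(q, K):
    """Truncated geometric law P(n) = (1-q) q^(n-1), n = 1..K, renormalized."""
    w = [(1 - q) * q ** (n - 1) for n in range(1, K + 1)]
    s = sum(w)
    return [x / s for x in w]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

K = 100
Pa, Pb = geometric(0.3, K), geometric(0.6, K)

def C(s):
    return sum(a ** s * b ** (1 - s) for a, b in zip(Pa, Pb))

# s* = argmin C(s), located here by a simple grid search on (0, 1)
s_star = min((i / 10000 for i in range(1, 10000)), key=C)

Ps = [a ** s_star * b ** (1 - s_star) / C(s_star) for a, b in zip(Pa, Pb)]
# At s*, P_s lies at (numerically) equal Kullback-Leibler distance
# from Pa and from Pb.
print(abs(kl(Ps, Pa) - kl(Ps, Pb)) < 1e-3)  # True
```

The equality at s* follows from d ln C(s)/ds = 0, which is exactly the condition K_u(P_s‖P_a) = K_u(P_s‖P_b).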
Thermodynamic Fluctuations
For a macroscopic system, any physical quantity fluctuates in space and time.
These fluctuations are due to thermal agitation and we shall see that it is
possible to analyze some of their properties without having to determine the
exact configuration of all the particles in the system. For this purpose, when
the physical system is in thermodynamic equilibrium, we must first determine
the probability law for finding it in a given state. We shall then focus more
closely on the fluctuations of macroscopic quantities associated with thermo-
dynamic systems, although we shall restrict the discussion to cases described
by classical physics.
discussed.) We also assume that the system can exchange energy with the
thermostat and that its instantaneous energy content can therefore fluctu-
ate. The number of particles N in the system remains constant, however. At
thermodynamic equilibrium, we assume that the thermostat and the system
have been in contact for a sufficiently long time to ensure that the average
macroscopic properties are no longer time dependent. In other words, the
macroscopic properties are stationary quantities. In particular, although the energy in Γ can fluctuate, its mean value, understood in the sense of the expectation value, will be assumed constant and fixed by the thermal energy of the thermostat.
A state of the system, denoted by X, is defined by giving the whole set of coordinates of all the particles included in the system. The set Ω of states of Γ constitutes the phase space of the system. For example, if we are interested in the magnetic properties of a solid, we can set up a simple model by assuming that the solid is made up of N magnetic moments. There exist materials in which it is reasonable to assume that the projection of this magnetic moment along a given axis can only take on a finite number of values, which we shall denote by m_jμ for the j th magnetic moment, where m_j is a whole number and μ is a constant determined by microscopic properties. A state of the system is then defined by giving the whole set of values of the numbers m_j, i.e., X = {m₁, m₂, …, m_N}. We can thus see that Ω is generally embedded in a space of very high dimension indeed (see Fig. 6.1).
Fig. 6.1. Schematic illustration of the phase space when there are three particles
and the state of each particle is represented by a magnetic moment which can only
take on two values
$$\sum_{X\in\Omega} H(X)P(X) = U\,.$$
The problem now is simply to maximize the Shannon entropy
$$S = -k_B\sum_{X\in\Omega} P(X)\ln P(X)\,,$$
so that at equilibrium
$$P_\beta(X) = \frac{\exp[-\beta H(X)]}{Z_\beta}
\quad\text{and}\quad
\frac{1}{k_B}S = \beta\langle H(X)\rangle + \ln Z_\beta\,.$$
We thus find that
$$\frac{1}{k_B}\frac{\partial S}{\partial\beta} = \langle H(X)\rangle + \beta\frac{\partial\langle H(X)\rangle}{\partial\beta} + \frac{\partial}{\partial\beta}\ln Z_\beta\,.$$
Since $\partial\ln Z_\beta/\partial\beta = -\langle H(X)\rangle$, this reduces to $(1/k_B)\,\partial S/\partial\beta = \beta\,\partial U/\partial\beta$. Moreover,
$$\frac{\partial S}{\partial\beta} = \frac{\partial S}{\partial U}\frac{\partial U}{\partial\beta} = \frac{1}{T}\frac{\partial U}{\partial\beta}\,,$$
and hence,
1 It is common practice to speak of Gibbs statistics, although it is not a statistic,
but rather a family of probability laws.
6.2 Free Energy 141
$$\frac{\partial S}{\partial\beta} = \frac{1}{T}\frac{\partial U}{\partial\beta} = k_B\beta\frac{\partial U}{\partial\beta}\,,$$
which implies finally that $k_B\beta = 1/T$.
The absolute temperature T_a is equal to k_B times the temperature in degrees kelvin, i.e., T_a = k_BT. At thermodynamic equilibrium and at the temperature 1/β = T_a = k_BT, we denote the thermodynamic entropy by S_β and the mean energy by U_β. The latter is also called the internal energy. The free energy is defined by
$$F_\beta = U_\beta - TS_\beta\,,$$
where T is the temperature, S_β the entropy, and U_β the internal energy. At thermodynamic equilibrium, we have S = k_Bβ⟨H(X)⟩ + k_B ln Z_β, which clearly implies
$$F_\beta = -k_BT\ln Z_\beta = -\frac{1}{\beta}\ln Z_\beta\,.$$
Consider now a system Γ out of equilibrium and let P(X) denote the probability law of its states X. The definition of the entropy is still applicable, so that $S = -k_B\sum_{X\in\Omega} P(X)\ln P(X)$, and the mean energy is $U = \sum_{X\in\Omega} H(X)P(X)$. The free energy of the system out of equilibrium is then simply F = U − TS. (In this case, T is chosen equal to the temperature of the thermostat.) The Kullback-Leibler measure between the law P(X) and the equilibrium law $P_\beta(X) = \exp[-\beta H(X)]/Z_\beta$ is
$$K_u(P\|P_\beta) = \sum_{X\in\Omega} P(X)\ln\frac{P(X)}{P_\beta(X)} = \beta\,(F - F_\beta)\,.$$
$$U_\beta = -\frac{\partial\ln Z_\beta}{\partial\beta}\,.$$
The internal energy and entropy can also be expressed in terms of the free
energy:
$$U_\beta = \frac{\partial(\beta F_\beta)}{\partial\beta}
\quad\text{and}\quad
S_\beta = -\frac{\partial F_\beta}{\partial T}\,.$$
In Section 3.10, we gave several examples of conjugate intensive and ex-
tensive quantities. Two quantities are said to be conjugate if their product
has units of energy and if they appear in the various thermodynamic energy
functions. To be more precise, let Ho(X) be the energy of the configuration X
of the system r when there is no applied external field. Consider an applied
field h which corresponds to the intensive quantity conjugate to the exten-
sive quantity M(X). To fix ideas, we consider the example in which M(X) is
the total magnetization for the configuration X and h is an applied magnetic
field. We assume that, in the presence of the field, the energy H_h(X) of the configuration X can be written
$$H_h(X) = H_0(X) - hM(X)\,,$$
and hence,
$$\frac{\partial Z_\beta}{\partial h} = \sum_{X\in\Omega}\beta M(X)\exp\big[-\beta H_0(X) + \beta hM(X)\big]\,.$$
(To simplify the notation, the dependence on the field h will be omitted when there is no risk of ambiguity.) Since
$$\frac{\partial\ln Z_\beta}{\partial h} = \frac{1}{Z_\beta}\frac{\partial Z_\beta}{\partial h}\,,$$
we obtain
$$\langle M(X)\rangle = \frac{1}{\beta}\frac{\partial\ln Z_\beta}{\partial h}\,.$$
The susceptibility $\chi_\beta = \partial\langle M(X)\rangle/\partial h$ then satisfies
$$\chi_\beta = \frac{\partial^2}{\partial h^2}\left(\frac{1}{\beta}\ln Z_\beta\right)
\quad\text{and}\quad
\chi_\beta = -\frac{\partial^2 F_\beta}{\partial h^2}\,.$$
In Section 6.4, we will show that the susceptibility χ_β is positive. We then have
$$\frac{\partial^2 F_\beta}{\partial h^2} \le 0\,,$$
which means that the free energy is a concave function of its argument h.
Table 6.1 sums up the main properties we have just shown.
Table 6.1. The main relations between the partition function and thermodynamic quantities

  Free energy      F_β = −(1/β) ln Z_β
  Internal energy  U_β = −∂ ln Z_β/∂β
  Entropy          S_β = k_B β U_β + k_B ln Z_β
  Magnetization    ⟨M(X)⟩ = (1/β) ∂ ln Z_β/∂h
  Susceptibility   χ_β = (1/β) ∂² ln Z_β/∂h²
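These relations can be checked on a small exactly-summable system. The sketch below (my own; the ring of four Ising spins, the coupling J and the values of β and h are arbitrary choices) compares finite differences of ln Z_β with the exact moments of M:

```python
import math
from itertools import product

# Toy system: N Ising spins s_i = +/-1 on a ring,
# H_h(X) = -J sum_i s_i s_{i+1} - h M(X), with M(X) = sum_i s_i.
N, J, beta = 4, 0.5, 1.3

def energy(spins, h):
    ring = sum(spins[i] * spins[(i + 1) % N] for i in range(N))
    return -J * ring - h * sum(spins)

def lnZ(h):
    return math.log(sum(math.exp(-beta * energy(s, h))
                        for s in product((-1, 1), repeat=N)))

def moments(h):
    """Exact <M> and <M^2> by direct summation over the 2^N states."""
    Z, m1, m2 = 0.0, 0.0, 0.0
    for s in product((-1, 1), repeat=N):
        w = math.exp(-beta * energy(s, h))
        M = sum(s)
        Z, m1, m2 = Z + w, m1 + M * w, m2 + M * M * w
    return m1 / Z, m2 / Z

h, dh = 0.2, 1e-4
m1, m2 = moments(h)

# <M> = (1/beta) d lnZ/dh, checked by a centered finite difference
grad = (lnZ(h + dh) - lnZ(h - dh)) / (2 * dh) / beta
print(abs(grad - m1) < 1e-6)   # True

# chi = (1/beta) d2 lnZ/dh2, and <M^2> - <M>^2 = chi / beta
curv = (lnZ(h + dh) - 2 * lnZ(h) + lnZ(h - dh)) / dh ** 2 / beta
print(abs(curv / beta - (m2 - m1 ** 2)) < 1e-4)   # True
```

The second check anticipates the total fluctuation theorem derived in Section 6.4.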
obtain information about the total magnetization M(t). Let L be the lattice on which the magnetic atoms are located. The total magnetization is then $M(t) = \sum_{r\in L} m(r,t)$. We thus see that M(t) is a sum of random variables and so, according to our investigations in Chapter 4, it is no surprise that we can obtain interesting information about this quantity.
In order to simplify our analysis, we assume that the system is homogeneous and ergodic. Moreover, we assume that it is in thermodynamic equilibrium. The quantities characterising it are therefore stationary. We define the total covariance function by
$$\Gamma_{mm}(r,t,r+d,t+\tau) = \big\langle m(r,t)\,m(r+d,t+\tau)\big\rangle - \big\langle m(r,t)\big\rangle\big\langle m(r+d,t+\tau)\big\rangle\,.$$
Once again, there is an abuse of notation here, using the same symbol for
functions of one, two or four variables. This simplifies things, provided that
no ambiguity is thereby introduced.
The homogeneity hypothesis implies that ⟨m(r)⟩ = ⟨m(r + d)⟩ and hence that
$$\Gamma_{mm}(d) = \big\langle m(r)\,m(r+d)\big\rangle - \big\langle m(r)\big\rangle^2\,.$$
Because this spatial covariance does not depend on the time, it can be simply expressed in terms of Gibbs statistics. For this purpose, let m(r, X) be the magnetization of the atom located at site r when the system is in state X. Then,
6.4 Covariance of Fluctuations 145
where
$$Z_\beta = \sum_{X\in\Omega}\exp\big[-\beta H_0(X) + \beta hM(X)\big]\,.$$
The details of the calculation are perfectly analogous to those given in Sec-
tion 6.6. We thus observe that
and we thus deduce that the total power of the fluctuations has the following
simple expression in terms of the susceptibility:
$$\big\langle[M(X)]^2\big\rangle - \big[\langle M(X)\rangle\big]^2 = \frac{\chi_\beta}{\beta}\,.$$
This result, which we shall call the total fluctuation theorem, shows above all
that the susceptibility is a positive quantity.
Let us write this relation as a function of the Boltzmann constant and the absolute temperature in kelvins. We have β = 1/(k_BT), so that
$$\big\langle[M(X)]^2\big\rangle - \big[\langle M(X)\rangle\big]^2 = k_BT\,\chi_\beta\,.$$
The power of the fluctuations in the extensive quantity M(X) is thus proportional to T and χ_β. The dependence on T is easily understood. The higher the value of T, the greater the thermal agitation, and this is what favors large fluctuations. A large value of χ_β implies that the system is "flexible" in the sense that it can react vigorously to the application of the intensive field h conjugate to M(X). This flexibility in the system also leads to a high reactivity with regard to thermal agitation and so to big fluctuations. Note, however, that nothing proves that χ_β will be an increasing function of temperature. It can be shown that the susceptibility is in fact a decreasing function of temperature in the case of perfect paramagnetism, which corresponds to a system of magnetic moments without interactions. We shall see later that the susceptibility is also a decreasing function of temperature when it is slightly above the critical temperature during a second order phase transition. It may happen that the product Tχ_β is a decreasing function of T, in which case the same is true of the fluctuation power. Note finally that an analogous calculation leads to
$$\big\langle[H(X)]^2\big\rangle - \big[\langle H(X)\rangle\big]^2 = -\frac{\partial U_\beta}{\partial\beta}\,.$$
We may also examine temporal correlations:
$$\Gamma_{MM}(\tau) = \big\langle M(t)M(t+\tau)\big\rangle - \big\langle M(t)\big\rangle\big\langle M(t+\tau)\big\rangle\,.$$
Let M(t) be the magnetization in the state X_t at time t, so that M(t) = M(X_t), and hence $M(t) = \sum_{r\in L} m(r,t)$. If $P_{\beta,t,t+\tau}(X,Y)$ is the joint probability that the system is in the state X at time t and in the state Y at time t + τ, we can write
Although the situation here is analogous to the one discussed just previously, it is in fact slightly more involved, because we do not yet have a simple expression for $P_{\beta,t,t+\tau}(X,Y)$. This point will be analyzed further in Section 6.6.
Fig. 6.2. Schematic representation of a system of particles with two energy levels
Here H(X) is simply $\sum_{i=1}^{N}\big[(1-n_i)E_1 + n_iE_2\big]$, where n_i is equal to 0 if the i th particle is in the state with energy E₁ and 1 if it is in the state with energy E₂ (see Fig. 6.2).
We can thus write
$$Z_\beta = \sum_{(n_1,n_2,\ldots,n_N)}\exp\left\{-\beta\sum_{i=1}^{N}\big[(1-n_i)E_1 + n_iE_2\big]\right\},$$
or alternatively,
$$Z_\beta = \prod_{i=1}^{N}\,\sum_{n_i=0}^{1}\exp\big\{-\beta\big[(1-n_i)E_1 + n_iE_2\big]\big\}\,.$$
Putting
$$z_\beta = \sum_{n=0}^{1}\exp\big\{-\beta\big[(1-n)E_1 + nE_2\big]\big\}\,,$$
we find that $Z_\beta = (z_\beta)^N$. Now $z_\beta = \exp(-\beta E_1) + \exp(-\beta E_2)$, and we can deduce that
$$\ln Z_\beta = N\ln\big[\exp(-\beta E_1) + \exp(-\beta E_2)\big]\,.$$
Let P_n(0) and P_n(1) be the probabilities of finding a particle in the states with energy E₁ and E₂, respectively. Writing
$$P_n(n_1) = \sum_{n_2=0}^{1}\sum_{n_3=0}^{1}\cdots\sum_{n_N=0}^{1} P(n_1, n_2, \ldots, n_N)\,,$$
we then have
$$P_n(0) = \frac{\exp(-\beta E_1)}{\exp(-\beta E_1) + \exp(-\beta E_2)}$$
and
$$P_n(1) = \frac{\exp(-\beta E_2)}{\exp(-\beta E_1) + \exp(-\beta E_2)}\,.$$
The mean numbers of particles ⟨N₁⟩ and ⟨N₂⟩ in states with energies E₁ and E₂ are therefore
$$\langle N_1\rangle = N\,\frac{\exp(-\beta E_1)}{\exp(-\beta E_1) + \exp(-\beta E_2)}
\quad\text{and}\quad
\langle N_2\rangle = N\,\frac{\exp(-\beta E_2)}{\exp(-\beta E_1) + \exp(-\beta E_2)}\,.$$
Since $F_\beta = -\beta^{-1}\ln Z_\beta$, the free energy is
$$F_\beta = -\frac{N}{\beta}\ln\big[\exp(-\beta E_1) + \exp(-\beta E_2)\big]\,,$$
which gives
$$U_\beta = N\,\frac{E_1\exp(-\beta E_1) + E_2\exp(-\beta E_2)}{\exp(-\beta E_1) + \exp(-\beta E_2)}\,.$$
Note that we can also write that
$$\frac{\big\langle[H(X)]^2\big\rangle - \big[\langle H(X)\rangle\big]^2}{(\Delta E)^2}$$
is a function of ΔE/(k_BT) alone, where ΔE = E₂ − E₁. In other words, energy fluctuations become significant if k_BT > ΔE.
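For independent particles the variance of H is N(ΔE)²p(1 − p), where p is the occupation probability of the upper level, and this can be compared with the fluctuation relation ⟨H²⟩ − ⟨H⟩² = −∂U_β/∂β. A minimal numerical sketch (my own; N, the energies and β are arbitrary choices):

```python
import math

# Two-level system: N independent particles with energies E1 and E2.
N, E1, E2 = 1000, 0.0, 1.0
dE = E2 - E1

def p_upper(beta):
    return math.exp(-beta * E2) / (math.exp(-beta * E1) + math.exp(-beta * E2))

def U(beta):
    p = p_upper(beta)
    return N * ((1 - p) * E1 + p * E2)

beta = 2.0
p = p_upper(beta)
# Independent particles: Var(H) = N (dE)^2 p (1 - p)
var_direct = N * dE ** 2 * p * (1 - p)
# Fluctuation relation: Var(H) = -dU/dbeta (centered finite difference)
db = 1e-6
var_fluct = -(U(beta + db) - U(beta - db)) / (2 * db)
print(abs(var_direct - var_fluct) / var_direct < 1e-6)  # True
# Note: var_direct / (N * dE**2) = p(1-p) depends only on beta * dE.
```

Since p(1 − p) depends only on βΔE, the normalized fluctuation power is indeed a function of ΔE/(k_BT) alone, and it collapses to zero when k_BT ≪ ΔE.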
Fig. 6.3. Illustration of the relaxation function or response function in the linear
case with Mo = 0
Fig. 6.4. The covariance function characterizes to second order the spontaneous
fluctuations in an extensive quantity. In this case, Mo = 0
Let P_t(X) be the probability of finding the system in the state X at time t > 0 in the absence of any applied field, given that, for negative times t < 0, a field of amplitude h₀ was applied. P_t(X) is not therefore representative of equilibrium, because the field was suddenly brought to zero at t = 0. On the other hand, we have
6.6 Fluctuation-Dissipation Theorem 151
$$P_{\beta,h_0}(Y) = \frac{\exp[-\beta H(Y)]}{Z_\beta}\,,$$
where
$$H(Y) = H_0(Y) - h_0M(Y)\,.$$
According to Bayes' relation, we can also write the mean value at time t in terms of the conditional probability $P_{t,0}(X\mid Y)$. Therefore,
$$R_{h_0}(t) = \sum_{X\in\Omega}\sum_{Y\in\Omega} M(X)\,P_{t,0}(X\mid Y)\,P_{\beta,h_0}(Y)\,,$$
and, differentiating with respect to h₀ at h₀ = 0,
$$a(t) = \sum_{X\in\Omega}\sum_{Y\in\Omega} M(X)\,P_{t,0}(X\mid Y)\left[\frac{\partial}{\partial h_0}P_{\beta,h_0}(Y)\right]_{h_0=0}\,.$$
Now
$$\frac{\partial}{\partial h_0}\exp\big[-\beta H_0(Y) + \beta h_0M(Y)\big] = \beta M(Y)\exp\big[-\beta H(Y)\big]\,,$$
and hence,
Thus,
Now
$$\sum_{X\in\Omega}\sum_{Y\in\Omega} M(X)\,P_{\beta,t,0}(X,Y) = \langle M(X)\rangle\,,$$
and hence,
$$a(t) = \beta\left\{\sum_{X\in\Omega}\sum_{Y\in\Omega} M(X)M(Y)\,P_{\beta,t,0}(X,Y) - \big[\langle M(X)\rangle\big]^2\right\}.$$
This relation can simply be written a(t) = βΓ_MM(t). It has been established for positive times t. Let us now consider what happens for negative t. Clearly, Γ_MM(t) is an even function, whilst a(t) is only defined for positive t. We make the convention that a(t) = 0 for negative t. We then obtain the fluctuation-dissipation theorem, which simply says that
$$a(t) = \beta\,\Gamma_{MM}(t)\quad\text{for } t\ge 0\,,\qquad a(t) = 0\quad\text{for } t<0\,.$$
This relates the relaxation function a(t) and the covariance function Γ_MM(t) of the fluctuations at thermodynamic equilibrium in the absence of an applied field.
The Wiener-Khinchine theorem says that the Fourier transform of the covariance function of stationary stochastic processes is equal to the spectral density of the fluctuations Ŝ_MM(ν). The impulse response is related to the relaxation function by χ(t) = −da(t)/dt, so that in Fourier space, χ̂(ν) = −i2πν â(ν). The susceptibility χ̂(ν) is traditionally expressed in physics in terms of its real and imaginary parts, viz., χ̂(ν) = χ′(ν) − iχ″(ν). Noting that a_s(t) = a(|t|), we immediately obtain χ″(ν) = πν â_s(ν). Then recalling that β = 1/(k_BT), where k_B is Boltzmann's constant and T the absolute temperature in kelvins, we find that
$$\hat{S}_{MM}(\nu) = k_BT\,\frac{\chi''(\nu)}{\pi\nu}\,.$$
This relation is the fluctuation-dissipation theorem in Fourier space.
6.7 Noise at the Terminals of an RC Circuit 153
The solution to this differential equation is Q(t) = CV₀ exp(−t/RC). The response function is therefore a(t) = C exp(−t/RC) and the impulse response
Fig. 6.5. Series RC circuit. Q_λ(t) represents the charge fluctuations and I_λ(t) the fluctuations in dQ_λ(t)/dt
is χ(t) = (1/R) exp(−t/RC), since we know from Section 3.10 that χ(t) = −da(t)/dt. It is easy to deduce the a.c. susceptibility by Fourier transform:
$$\hat{\chi}(\nu) = \frac{C}{1 + i2\pi RC\nu}\,.$$
Let χ̂(ν) = χ′(ν) − iχ″(ν) with χ′(ν) and χ″(ν) real. Then
$$\chi'(\nu) = \frac{C}{1 + (2\pi RC\nu)^2}
\quad\text{and}\quad
\chi''(\nu) = \frac{2\pi RC^2\nu}{1 + (2\pi RC\nu)^2}\,,$$
directly from the fluctuation-dissipation theorem. The total power of the fluctuations is thus k_BTC. The spectral density Ŝ_QQ(ν) is equal to k_BT χ″(ν)/(πν), or
$$\hat{S}_{QQ}(\nu) = 2k_BT\,\frac{RC^2}{1 + (2\pi RC\nu)^2}\,.$$
Since I_λ(t) = dQ_λ(t)/dt, the spectral density of the current fluctuations is
$$\hat{S}_{II}(\nu) = 2k_BT\,\frac{(2\pi\nu)^2RC^2}{1 + (2\pi RC\nu)^2}\,.$$
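The consistency of the two expressions for Ŝ_QQ(ν) is easy to verify numerically (a sketch of my own; the component values and temperature are arbitrary choices):

```python
import math

# RC circuit: a.c. susceptibility chi_hat(nu) = C / (1 + i 2 pi R C nu).
R, C, kBT = 1e3, 1e-9, 4.0e-21   # ohms, farads, joules (roughly room temp.)

def chi(nu):
    return C / complex(1, 2 * math.pi * R * C * nu)

def S_QQ_fdt(nu):
    """Fluctuation-dissipation theorem: S_QQ = kBT chi''(nu) / (pi nu)."""
    chi2 = -chi(nu).imag             # chi_hat = chi' - i chi''
    return kBT * chi2 / (math.pi * nu)

def S_QQ_direct(nu):
    return 2 * kBT * R * C ** 2 / (1 + (2 * math.pi * R * C * nu) ** 2)

for nu in (1e3, 1e5, 1e7):
    assert abs(S_QQ_fdt(nu) - S_QQ_direct(nu)) / S_QQ_direct(nu) < 1e-12
print("both expressions agree")
```

The agreement holds at every frequency because χ″(ν)/(πν) = 2RC²/[1 + (2πRCν)²] is an algebraic identity for this circuit.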
In electronics, all the power of the fluctuations is often concentrated on the positive frequencies by setting $\hat{F}_{II}(\nu) = 2\hat{S}_{II}(\nu)\theta(\nu)$, where θ is the Heaviside step function, whereupon
$$\hat{F}_{II}(\nu) = 4k_BT\,\frac{(2\pi\nu)^2RC^2}{1 + (2\pi RC\nu)^2}\,\theta(\nu)\,.$$
Note, however, that for this circuit the integral $\int_{-\infty}^{+\infty}\hat{S}_{II}(\nu)\,d\nu$ diverges, which is not physically acceptable. The difficulty disappears if we take into account an inductance L, always present in a real circuit, and hence,
$$\hat{S}_{II}(\nu) = 2k_BT\,\frac{4\pi^2\nu^2RC^2}{\big[1-(2\pi\nu)^2LC\big]^2 + (2\pi RC\nu)^2}\,.$$
We deduce that $\int_{-\infty}^{+\infty}\hat{S}_{II}(\nu)\,d\nu$ is finite. The power dissipated in the resistance R is indeed bounded. In fact it is equal to $R\langle[I_\lambda(t)]^2\rangle$, or $R\int_{-\infty}^{+\infty}\hat{S}_{II}(\nu)\,d\nu$.
This result can be represented in a more memorable way. The charge on the
plates of the capacitor fluctuates due to thermal agitation. Let us see whether
it is possible to represent these charge fluctuations as being produced by a
random voltage generator. We saw in Section 6.6 that the spectral density of
the fluctuations in the intensive quantity would then be
Let Z(ν) denote the electrical impedance of the circuit. We have û(ν) = Z(ν)Î(ν) and Î(ν) = i2πνQ̂(ν), where Î(ν) is the Fourier transform of the current I(t), and hence Q̂(ν) = −iû(ν)/[2πνZ(ν)], or
$$\hat{\chi}(\nu) = \frac{-iZ^*(\nu)}{2\pi\nu\,|Z(\nu)|^2}\,.$$
We thus have
$$\chi'(\nu) = \frac{-Z_I(\nu)}{2\pi\nu\,|Z(\nu)|^2}
\quad\text{and}\quad
\chi''(\nu) = \frac{Z_R(\nu)}{2\pi\nu\,|Z(\nu)|^2}\,,$$
where Z_R(ν) is the real part and Z_I(ν) the imaginary part of Z(ν). We deduce that the spectral density of the voltage fluctuations that would produce the same spectral density of current fluctuations as those due to thermal agitation would be given by
$$\hat{S}_{VV}(\nu) = 2k_BT\,Z_R(\nu)\,.$$
This result shows that we can represent the fluctuations due to thermal agitation by a random voltage generator with spectral density given by Ŝ_VV(ν) = 2k_BT Z_R(ν). The real part of the electrical impedance of the RLC circuit is equal to the resistance R, and we therefore find that
$$\hat{S}_{VV}(\nu) = 2k_BTR\,,$$
or equivalently,
$$\hat{F}_{VV}(\nu) = 4k_BTR\,.$$
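As a concrete order of magnitude (my own worked example; the resistance, bandwidth and temperature are arbitrary choices), the rms voltage across a resistor over a bandwidth δν is $\sqrt{4k_BTR\,\delta\nu}$:

```python
import math

# Johnson-Nyquist noise: one-sided spectral density F_VV = 4 kB T R,
# so the rms voltage over a bandwidth delta_nu is sqrt(4 kB T R delta_nu).
kB = 1.380649e-23                  # J/K (exact SI value)
T, R, delta_nu = 300.0, 1e6, 1e4   # 300 K, 1 Mohm resistor, 10 kHz bandwidth

v_rms = math.sqrt(4 * kB * T * R * delta_nu)
print(f"{v_rms * 1e6:.2f} microvolts rms")  # prints "12.87 microvolts rms"
```

A few tens of microvolts is small but easily measurable, which is why this thermal noise floor matters in low-level instrumentation.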
We now illustrate this result in the case of an optical detector. We assume that we may treat this detector as a perfect current generator placed in series with a resistance R₁ and a capacitance C, as shown in Fig. 6.6. We will also assume that the detector is mounted on a resistance R₂. We begin by finding the potential difference u⁽²⁾(t) across the resistance R₂ when a current i(t) is generated by the detector. The result is simply u⁽²⁾(t) = R₂ i(t).
Now consider the fluctuations in the potential difference V_λ(t) across the resistance R₂. We have $\hat{F}_{II}(\nu) = 4k_BT(2\pi\nu)^2RC^2/[1 + (2\pi RC\nu)^2]$, where R = R₁ + R₂. We thus deduce that
Fig. 6.7. Equivalent circuit with voltage generators associated with each resistor
$$\hat{F}_{VV}(\nu) \approx 4k_BT\,\frac{R_2^2}{R_1+R_2}\,.$$
The noise power in a small spectral band δν will thus be
$$P_N \approx \frac{4k_BT\,R_2^2\,\delta\nu}{R_1+R_2}\,,$$
whilst the signal power will be $P_S \approx R_2^2|\hat{\imath}(\nu)|^2\delta\nu$, where î(ν) is the Fourier transform of i(t). The signal-to-noise ratio will thus be
$$\frac{P_S}{P_N} \approx \frac{(R_1+R_2)\,|\hat{\imath}(\nu)|^2}{4k_BT}\,.$$
Fig. 6.8). Still in the absence of an applied field, cooling the material from a temperature above T_c down to a temperature below T_c, the symmetry is broken, since either a positive or a negative magnetization will be obtained. The choice of sign occurs at the critical temperature, a temperature where the thermodynamic properties of the system are singular.
In the low temperature phase, states evolve in such a way that the magnetization keeps the same sign. Let X_t be the state of the system at a given time t. For T < T_c, we observe that M(X_t) remains strictly positive or strictly negative as time goes by. Hence if X_t is in D₊, it will stay in D₊ as the system evolves with time. Likewise if X_t is in D₋, it will stay in D₋. The two subspaces D₊ and D₋ are invariant under the action of the evolution operator and they characterize the ergodicity breaking (see Fig. 6.10).
In practice, because of effects due to the finite size of the thermodynamic
system, this ergodicity breaking is never perfect. The magnetization can then
change sign. We will not analyze this aspect here and will simply assume that
the ergodicity breaking is total.
6.9 Critical Fluctuations 161
where the proportionality sign ∝ indicates that we are interested in the asymptotic behavior and will not describe any multiplicative factors. We may proceed in the same way with the other quantities. The specific heat varies as $C_V \propto |t_R|^{-\alpha}$ and the free energy as $F_\beta \propto |t_R|^{\lambda}$. The parameters α, λ and γ are the critical exponents.
As T approaches T_c, the susceptibility χ_β diverges and the same is therefore true for the power of the magnetization fluctuations ⟨[M(X)]²⟩ − [⟨M(X)⟩]². This is quite characteristic of second order phase transitions. Near the critical temperature, the value of the susceptibility χ_β increases and so the response of the system to an external excitation is greater. However, this leads to a growth in the fluctuations in the macroscopic quantity since their power diverges (see Fig. 6.11).
From the standpoint of an engineer, we could say that if the signal increases
(the signal being the response to some external excitation here), the same is
true of the noise (i.e., fluctuations in the quantity). From a more fundamental
standpoint, the divergence of the fluctuations lies at the origin of several
fascinating types of thermodynamic behavior that can be observed close to
the transition temperature, especially with regard to the critical exponents.
Indeed, these parameters manifest universality properties, in the sense that
their values do not depend on the details of interparticle interactions, but only
on the dimension3 of the system under investigation and symmetry properties
of the order parameter. We shall briefly discuss this in the following.
3 Two for a surface material and three for a bulk material.
Fig. 6.11. Divergence in the power of the magnetization fluctuations at the critical temperature T_c for a second order transition of paramagnetic-ferromagnetic type
For homogeneous, stationary systems, Γ_mm(r, t, r + d, t + τ) depends only on d and τ, and we can thus simplify this expression:
We also note that, for T slightly above T_c and slightly below T_c, the correlation length ξ diverges as a power law in the reduced temperature: ξ ∝ |t_R|^{−ν}, where ν can be considered as a new critical exponent.
Exercises
Exercise 6.1. Introduction
Consider a system of N particles, each of which has two energy levels E₁ and E₂, with E₁ < E₂. Let q be the probability that a particle is in the state e₁ with energy E₁ and p the probability that it is in the state e₂ with energy E₂. We then have p + q = 1.
(1) What is the average energy U of the system expressed in terms of E_1, E_2
and p?
(2) What is the variance σ_E^2 of the total energy expressed in terms of E_1, E_2
and p?
(3) For what value of p is σ_E^2 maximal?
(4) Express the entropy S of the system in terms of p.
(5) For what value of p is the entropy maximal?
Assume now that the system is in thermodynamic equilibrium so that
p = exp(−βE_2) / [ exp(−βE_2) + exp(−βE_1) ] .
(6) What is the value of σ_E^2 when the temperature tends to 0?
(7) What is the value of σ_E^2 when the temperature tends to +∞?
(8) How does the entropy S evolve with temperature?
Let E(n) be the energy of particle n, where E(n) = E_1 or E(n) = E_2. The
energy H(X) for a given configuration X is thus
H(X) = Σ_{n=1}^N E(n) .
(9) What is the probability distribution of the state X and what happens
when the temperature tends to O?
⟨m⟩ = (1/(Nβ)) ∂ln Z_{β,h}/∂h .
(2) Determine (1/(Nβ^2)) ∂^2 ln Z_{β,h}/∂h^2. What is the physical meaning of this
quantity?
Show that the variance of the energy fluctuations in a system can be written
and determine a.
Imagine that we have measured the number of particles detected during a time
interval T. In Section 4.9, we showed that the Poisson distribution provides a
simple model for describing this number of particles. If N(t, t + dt) represents
the mean number (in the sense of expectation values) of particles detected
over the time interval [t, t + dt], the flux at time t is
φ(t) = lim_{dt→0} N(t, t + dt)/dt .
The mean ⟨n⟩ and variance ⟨n^2⟩ − ⟨n⟩^2 of the Poisson distribution defined by
P(n) = exp(−μ) μ^n/n! are both equal to μ, and we thus deduce that φT can be
estimated either from the empirical mean or from the empirical variance of the
measured counts:

φ̂T = (1/P) Σ_{j=1}^P n_j    or    φ̂T = (1/P) Σ_{j=1}^P n_j^2 − [ (1/P) Σ_{j=1}^P n_j ]^2 .
unrelated to the quality of the estimate itself. Such arguments may concern
such features as computation time or the possibility of making the estimate
using analog techniques, which are sometimes decisive factors when choosing
estimation methods. We may then say that the processing structure prevails.
On the other hand, when it is the quality of the estimate alone which is taken
into account, we must specify how it can be quantified. This is what we intend
to examine here.
We have made considerable use of the notion of expectation value in pre-
vious chapters and we shall consider this type of average once again here,
still with the notation ⟨ ⟩. We can thereby define the expectation value (also
known as the statistical mean) of any statistic T(X_λ). To simplify the notation,
we have written λ_T = λ, although we shall keep the explicit mention of λ
to emphasize the fact that we are now considering P potential measurements
and hence that X_λ is indeed a set of random variables. To determine this
expectation value, we must define the probability L_θ(x_1, x_2, …, x_P) of observing
the sample X. In the problem situation specific to estimation, the true value
θ_0 of the parameter θ is unknown. The probability L_θ(x_1, x_2, …, x_P) should
therefore be considered as a function of θ. It is often called the likelihood of
the hypothesis which attributes the value θ to the unknown parameter when
the observed sample is X = {x_1, x_2, …, x_P}, and this explains why the symbol
L is used to denote it. The expected value of the statistic T(X_λ) is then
To begin with, and in order to simplify the discussion, we consider the case
where θ is a scalar parameter. If T(X_λ) is an estimator of θ_0, we clearly hope
that T(X_λ) will be as close as possible to θ_0. To make this idea more precise,
we now present the main features of a statistical estimator.
The bias of an estimator T(X_λ) for the parameter θ_0 is defined to be the
difference between the expected value of T(X_λ) and the true value θ_0. More
precisely, the bias b_T of the estimator T(X_λ) for θ_0 is defined as
Let us apply this definition to the example in Section 7.1. To simplify, we will
consider θ_0 = φT and thus P_{θ_0}(n) = e^{−θ_0} θ_0^n / n!. We can study the bias of the
estimator θ̂(X_λ) defined by
Then,
7.3 Characterizing an Estimator 171
or
Now,
and hence,
The bias is zero and we thus say that the estimator is unbiased.
Consider now the estimator θ̂′(X_λ) built from the empirical variance. Then

⟨θ̂′(X_λ)⟩ = ((P − 1)/P) θ_0 .

This result shows that, unlike θ̂(X_λ), θ̂′(X_λ) is a biased estimator of θ_0.
In the last example, we observe that the bias of the estimator is due to
the finiteness of the number of measurements in the sample X. To be precise,
when P tends to infinity, θ̂′(X_λ) becomes an unbiased estimator of θ. More
generally, we say that an estimator T(X_λ) of a parameter θ is asymptotically
unbiased if
Fig. 7.3. Schematic illustration of the standard deviation (or equivalently, the vari-
ance) and the bias of an estimator
Fig. 7.4. Values of the estimators obtained with θ̂(X) (white squares) and θ̂′(X)
(black squares)
with minimal variance do indeed play a key role in the context of estimation
theory.
Let us return to the example of Poisson noise. We have generated several
samples X_j containing 100 Poisson variables with θ_0 = 2.5. For each sample
X_j, we have determined the values of the two estimators θ̂(X_j) and θ̂′(X_j). The
values of θ̂(X_j) and θ̂′(X_j) are shown in Fig. 7.4 for 20 samples (j = 1, …, 20).
They thus correspond to the mean and variance of each set of independent
realizations of 100 Poisson variables with parameter θ_0 = 2.5.
174 7 Statistical Estimation
Fig. 7.5. Histograms of θ̂(X) (continuous line) and θ̂′(X) (dashed line) determined
from 20 000 samples of 100 Poisson variables each with θ_0 = 2.5
Figure 7.5 shows the histograms of θ̂(X) and θ̂′(X) determined from 20 000
samples containing 100 Poisson variables each. The means obtained with the
estimators θ̂(X) and θ̂′(X) are 2.5002 and 2.4979, respectively. The variances
differ much more in the two cases:
0.0254 for θ̂(X), i.e., a standard deviation of 0.1595,
0.1510 for θ̂′(X), i.e., a standard deviation of 0.3886.
We thus observe that, in this particular case, the estimator θ̂(X) is better
than θ̂′(X). This raises the question as to whether θ̂(X) is actually the best
possible estimator, or more precisely, whether it is the estimator with the
lowest possible variance amongst all unbiased estimators.
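This comparison can be reproduced with a short simulation (a sketch of our own, standard library only; the Poisson sampler uses Knuth's method and the seed is arbitrary):

```python
import random, math

random.seed(0)
theta0, P, trials = 2.5, 100, 2000

def poisson(mu):
    """Knuth's Poisson sampler, adequate for small mu."""
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

means, variances = [], []
for _ in range(trials):
    x = [poisson(theta0) for _ in range(P)]
    m = sum(x) / P
    means.append(m)                                       # estimator theta_hat
    variances.append(sum(v * v for v in x) / P - m * m)   # estimator theta_hat'

def empirical_var(a):
    mu = sum(a) / len(a)
    return sum((v - mu) ** 2 for v in a) / len(a)

# Both estimators are centred near theta0 = 2.5, but theta_hat' (the empirical
# variance) fluctuates far more than theta_hat (the empirical mean).
```

With these settings the two estimator variances come out of the same order as the 0.0254 and 0.1510 quoted above.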
L(X|θ) = Π_{i=1}^P e^{−θ} θ^{n_i} / n_i! .
Quite generally, when the realizations are independent and identically dis-
tributed, we have
ℓ(X|θ) = Σ_{i=1}^P ln P_θ(x_i) ,    ⟨ℓ(X|θ)⟩ = Σ_{i=1}^P ⟨ln P_θ(x_i)⟩ ,

or

⟨ℓ(X|θ)⟩ = P ⟨ln P_θ(x)⟩ = P Σ_x P_θ(x) ln P_θ(x) .
The value of the parameter θ which maximizes the likelihood is the maximum
likelihood estimator. As the logarithm function is an increasing function, maximizing
the likelihood is equivalent to maximizing the log-likelihood. This in
turn amounts to seeking the value of θ that minimizes the information contained
in the realization of X.
Let us examine the result obtained for the Poisson distribution. To find
the maximum of ℓ(X|θ), we require the value of θ that makes ∂ℓ(X|θ)/∂θ zero:

∂ℓ(X|θ)/∂θ = Σ_{i=1}^P ( −1 + n_i/θ ) = 0 .

We then obtain θ̂_ML(X) = (1/P) Σ_{i=1}^P n_i .
ℓ(X|θ) = Σ_{i=1}^P ( −x_i/θ − ln θ ) ,
Note, however, that we can also consider the expectation value as a mathematical
operator. The idea is simply to calculate the mean of T(x_1, x_2, …, x_P)
with a probability law L(x_1, x_2, …, x_P|θ). To emphasize the dependence on θ,
we will write ⟨T(X_λ)⟩_θ = h(θ). This is the mean of the statistic T(X_λ) which
we would obtain for random samples X_λ that would be generated with the
probability law L(X|θ). To simplify the formulas, we use the notation

⟨T(X_λ)⟩_θ = ∫ T(X) L(X|θ) dX .
Let us examine in detail the case where X takes continuous values, which will
justify writing the above relations in integral form. When X takes discrete
values, the integrals are simply replaced by discrete sums.
In Section 7.12, we show that if the domain of definition of X_λ does not depend
on θ, the variance of the statistic T(X_λ) cannot be less than a certain limiting
value:

σ_T^2(θ) ≥ |∂h(θ)/∂θ|^2 / [ −∫ (∂^2 ln L(X|θ)/∂θ^2) L(X|θ) dX ] .
Naturally, we must also assume that the logarithm of the likelihood does have
a second derivative. This inequality, which provides a lower bound for the
variance σ_T^2(θ) of the estimator T(X_λ), is a classical result in statistics, known
as the Cramer-Rao bound. We note immediately that this inequality holds
whatever value we take for θ.
If the estimator is unbiased, the expression for the Cramer-Rao bound
simplifies. Indeed, in this case, we have h(θ) = ⟨T(X_λ)⟩_θ = θ and hence
∂h(θ)/∂θ = 1, which implies

σ_T^2(θ) ≥ 1 / I_F(θ) ,

where

I_F(θ) = −∫ (∂^2 ln L(X|θ)/∂θ^2) L(X|θ) dX .
We see from this expression that the variance of an unbiased estimator cannot
be less than a certain lower bound. In the case of unbiased estimators, this
bound does not depend on the estimator chosen. It only depends on the mean
value of the curvature of the logarithm of the likelihood (see Fig. 7.6).
In the neighborhood of a maximum, the second derivative of the likelihood
is negative, and the first derivative is decreasing. Its absolute value is all the
greater as the curvature is large. In other words, the more sharply peaked the
likelihood is as a function of the parameter we wish to estimate, the more
precisely we may hope to estimate that parameter (see Fig. 7.7).
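The link between expected curvature and precision can be checked numerically for the Poisson case of this chapter (a sketch of our own; the sampler is the same Knuth construction as before):

```python
import random, math

random.seed(1)
theta0, P, trials = 2.5, 100, 3000

def poisson(mu):
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# For the Poisson law, the log-likelihood is
#   l(X|theta) = -P*theta + T(X)*ln(theta) - sum_i ln(x_i!),  T(X) = sum_i x_i,
# so -d^2 l / d theta^2 = T(X)/theta^2 and I_F = <T(X)>/theta^2 = P/theta.
curvatures = []
for _ in range(trials):
    T = sum(poisson(theta0) for _ in range(P))
    curvatures.append(T / theta0 ** 2)   # minus the second derivative at theta0

I_F = sum(curvatures) / trials           # should approach P/theta0 = 40
crb = 1.0 / I_F                          # Cramer-Rao bound, about theta0/P = 0.025
```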
Note, however, that it is the expectation value of the second derivative of
the log-likelihood which comes into the expression for the Cramer-Rao bound,
since

⟨∂^2 ln L(X|θ)/∂θ^2⟩ = ∫ (∂^2 ln L(X|θ)/∂θ^2) L(X|θ) dX .
The quantity

I_F(θ) = −⟨∂^2 ln L(X|θ)/∂θ^2⟩

is the Fisher information.
Fig. 7.6. The role played by the curvature in the Cramer-Rao bound, where θ_M is
simply the value of θ maximizing the log-likelihood ℓ(X|θ) = ln L(X|θ)
Fig. 7.7. Relation between the Cramer-Rao bound (CRB) and the shape of the
log-likelihood
where Z(θ) = ∫ exp[ a(θ)t(x) + f(x) ] dx. Statistics which attain the Cramer-Rao
bound when we observe independent and identically distributed realizations
are then proportional to
T(X) = Σ_{i=1}^P t(x_i) .
The variance of the estimator T′(X) = T(X)/P is then simply the variance of T(X)
divided by P^2. The above probability density functions define the family of exponential
probability densities. We also speak more simply of the exponential family.
Note that the log-likelihood of X is simply ℓ(X|θ) = a(θ)T(X) + F(X) + P b(θ),
and that the variance of the statistic T(X) is then

σ_T^2(θ) = [ ∂h(θ)/∂θ ] / a_0(θ) ,  where a_0(θ) = ∂a(θ)/∂θ .

When a(θ) = θ, we say that the law is in the canonical or natural form and θ is then
the canonical or natural parameter of the law. Let us examine the Cramer-Rao bound
when a law is written in the canonical form. We then have a(θ) = θ and
hence a_0(θ) = ∂a(θ)/∂θ = 1. Moreover, T(X) = Σ_{i=1}^P t(x_i) and we have
⟨T(X)⟩_θ = P⟨t(x)⟩_θ. We thus obtain
where Z(θ) = e^θ, f(x) = −ln(x!), t(x) = x and a(θ) = ln(θ). Let us determine
the variance of the estimator m̂(X_λ) = (1/P) Σ_{j=1}^P x_λ(j), which is
unbiased. According to the last section this statistic reaches the Cramer-Rao
bound. We observe that T(X_λ) = Σ_{j=1}^P x_λ(j). It is easy to see that
h(θ) = ∫ T(X)L(X|θ)dX = Pθ, and hence ∂h(θ)/∂θ = P. Moreover, since
a(θ) = ln θ, we have a_0(θ) = ∂a(θ)/∂θ = 1/θ and hence σ_T^2(θ) = Pθ. The
variance of the estimator m̂(X_λ) of θ is therefore

σ_m^2(θ) = θ/P .
We estimated the variance of this estimator in Section 7.3. We had P = 100
and θ = 2.5 and it was found that σ_m^2(θ) = 0.0254, which is indeed of the
order of θ/P. There is no surprise here, since m̂(X_λ) is an efficient estimator
of θ.
The Gaussian case is particularly interesting because it is often a good
model when measurements are perturbed by additive noise. Suppose we carry
out P measurements corresponding to the model
x_i = θ + y_i ,

where i ∈ [1, P] and y_i is a random variable with zero mean and variance b^2.
If we seek to estimate θ from the P measurements x_i with i ∈ [1, P], we can
consider the estimator m̂(X_λ) = (1/P) Σ_{j=1}^P x_λ(j), which is unbiased, and set
T(X_λ) = Σ_{j=1}^P x_λ(j). The variance of this estimator is easily determined. Note
first that x_i belongs to the exponential family. We find

σ_m^2(θ) = b^2/P .
It is easy to show that the probability laws in Table 7.1 belong to the expo-
nential family. We leave it to the reader to reformulate these laws in order to
show that they belong to the exponential family.
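For the additive Gaussian model above, the bound b^2/P is easy to verify by simulation (a sketch of our own, standard library only):

```python
import random

random.seed(2)
theta0, b, P, trials = 1.0, 0.5, 50, 4000

estimates = []
for _ in range(trials):
    x = [theta0 + random.gauss(0.0, b) for _ in range(P)]  # x_i = theta + y_i
    estimates.append(sum(x) / P)                           # unbiased estimator m_hat

mean_est = sum(estimates) / trials
var_est = sum((e - mean_est) ** 2 for e in estimates) / trials
# var_est should be close to the Cramer-Rao bound b^2 / P = 0.005.
```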
Table 7.1. Some common probability laws and their parameters:
Bernoulli:  (1 − q)δ(x) + qδ(x − 1),  parameter q
Poisson:  Σ_{n=0}^∞ exp(−μ) (μ^n/n!) δ(x − n),  parameter μ
Gaussian:  (1/(√(2π)σ)) exp[ −(x − m)^2/(2σ^2) ],  parameters m and σ
where δT_i(X) = T_i(X) − ⟨T_i(X)⟩_θ. In the case where T(X_λ) is an unbiased
estimator of θ, whatever the complex vector u ∈ C^n, we have

u†Γu ≥ u†J̄^{−1}u ,

where u† is the transposed complex conjugate of u and J̄^{−1} is the matrix
inverse to J̄. This is proved in Section 7.14 at the end of this chapter.
Let us illustrate this result in the case where we wish to estimate the mean
of two-dimensional Gaussian vectors. We have
P_{m_1,m_2}(x) = (1/(2π√|C|)) exp[ −(1/2)(x − m)^T C^{−1}(x − m) ] ,

where m = (m_1, m_2)^T and |C| is the determinant of the covariance matrix C.
The log-likelihood is thus
7.9 Likelihood and the Exponential Family 183
and
In particular, we can have Γ_12 ≠ 0, which implies that, if the covariance matrix
of the fluctuations is not diagonal, there may be correlations between the
joint estimation errors of m_1 and m_2.
It is shown in Section 7.14 that probability laws with statistics which attain
the Cramer-Rao bound all belong to the exponential family. In the vectorial
case, these laws have the form

L(X|θ) = exp[ a(θ)^T T(X) + F(X) ] / Z_P(θ) ,

with T(X) = Σ_{n=1}^P t(x_n), F(X) = Σ_{n=1}^P f(x_n) and Z_P(θ) = [Z(θ)]^P.
It is interesting to observe to begin with that the likelihood can be written
in the form
L(X|θ) = g(T(X)|θ) h(X) .

To see this, we set g(T(X)|θ) = exp[ a(θ)^T T(X) ]/Z_P(θ) and h(X) = exp[F(X)].
If, for a given probability law, the likelihood can be decomposed into a product
L(X|θ) = g(T(X)|θ)h(X), we say that T(X) is a sufficient statistic of the law
for θ. Although this concept is very important in statistics, we shall limit
ourselves to a few practical results in the present context.
First of all, the conditional probability of observing X given T(X) is independent
of θ. To show this, consider the case where T(X) has discrete values.
We have

P_θ(X|T(X)) = P_θ(X, T(X)) / P_θ(T(X)) .

Now when we know X, we automatically know T(X) and therefore P_θ(X, T(X))
= P_θ(X) = L(X|θ). Moreover, P_θ(T(X)) is obtained by summing the probability
P_θ(X) over all samples X which have the same value for the statistic T(X).
Therefore,

P_θ(T(X) = T) = Σ_{X|T(X)=T} P_θ(X) .
Now L(X|θ) = g(T(X)|θ)h(X), and hence,

P_θ(T(X) = T) = g(T|θ) Σ_{X|T(X)=T} h(X) .
Defining
H(T) = Σ_{X|T(X)=T} h(X) ,
−b′(θ̂_ML(X)) / a′(θ̂_ML(X)) = T(X)/P .

In the case of a canonical parametrization, a′(θ) = 1 and the maximum likelihood
estimator then simplifies to −b′(θ̂_ML(X)) = T(X)/P. If it is unbiased, it will
have minimal variance. Note, however, that since only T(X)/P is efficient, i.e.,
only T(X)/P attains the Cramer-Rao bound, −b′(θ)/a′(θ) is the only function
of θ that can be efficiently estimated.
The maximum likelihood estimator corresponds to the equality
T(X) = ⟨T(X)⟩_{θ̂_ML(X)}. Indeed,

J(θ) = ∫ exp[ a(θ)T(X) + F(X) + P b(θ) ] dX = 1 ,

and hence dJ(θ)/dθ = 0, so that we do indeed obtain

T(X) = ⟨T(X)⟩_{θ̂_ML(X)} .
For independent realizations, we have T(X) = Σ_{n=1}^P t(x_n). We deduce that
⟨T(X)⟩_θ = P⟨t(x)⟩_θ and hence,
In this section, we illustrate the results of the last few sections with five exam-
ples from the exponential family. We will consider the Poisson distribution,
the Gamma distribution, two examples of the Gaussian distribution, and the
Weibull distribution. We use the notation of the last section and we assume
that the P-sample X corresponds to independent realizations.
7.10 Examples in the Exponential Family 187
P_N(n) = exp(−θ) θ^n / n! ,
where θ is the parameter to be estimated. When we observe a P-sample X =
{n_1, n_2, …, n_P}, the log-likelihood is

ℓ(X) = −Pθ + T(X) ln θ − Σ_{i=1}^P ln(n_i!) ,

where the sufficient statistic T(X) is simply T(X) = Σ_{i=1}^P n_i. We thus have

a(θ) = ln θ , a′(θ) = 1/θ ;  b(θ) = −θ , b′(θ) = −1  ⟹  −b′(θ)/a′(θ) = θ .
According to the results of the last section, this estimator is therefore efficient.
Let us return for a moment to the example discussed at the beginning of this
chapter. We see that we now obtain an unambiguous answer concerning the
best way to estimate the parameter in the Poisson distribution, and hence the
particle flux, if the relevant criterion is the variance of the estimator when
there is no bias.
P_X(x) = ( x^{α−1} / (θ^α Γ(α)) ) exp(−x/θ) ,

where θ is the parameter to be estimated and we assume that α is given.
When we observe a P-sample X = {x_1, x_2, …, x_P}, the log-likelihood is

ℓ(X) = −Pα ln θ − (1/θ) T(X) + (α − 1) Σ_{i=1}^P ln x_i − P ln Γ(α) ,

where the sufficient statistic T(X) is simply T(X) = Σ_{i=1}^P x_i. We thus have
where the sufficient statistic T(X) is still simply T(X) = Σ_{i=1}^P x_i. We thus
have
P_X(x) = (1/(√(2π) θ)) exp[ −(x − m)^2/(2θ^2) ] ,

where the sufficient statistic T(X) is now T(X) = Σ_{i=1}^P (x_i − m)^2. We thus
have

a(θ) = −1/(2θ^2) , a′(θ) = 1/θ^3 ;  b(θ) = −ln θ , b′(θ) = −1/θ  ⟹  −b′(θ)/a′(θ) = θ^2 .
We see that the maximum likelihood estimator of θ^2 (and not θ) leads to θ̂^2 = T(X)/P.
P_X(x) = ( α x^{α−1} / θ^α ) exp[ −(x/θ)^α ] .

Identifying the first moment ⟨x⟩ = θ Γ(1/α + 1) leads to the moment estimator

θ̂_moment(X) = (1/(P Γ(1/α + 1))) Σ_{i=1}^P x_i .
The log-likelihood is

ℓ(X) = −Pα ln θ − (1/θ^α) T(X) + (α − 1) Σ_{i=1}^P ln x_i + P ln α ,

where the sufficient statistic is now T(X) = Σ_{i=1}^P x_i^α.
This estimator is efficient for θ^α (but not for θ). It should also be noted that
the moment method and the maximum likelihood method do not lead to the
same estimator. The Weibull distribution belongs to the exponential family
so it is better to consider the maximum likelihood estimator.
To illustrate the differences that are effectively obtained with the moment
and maximum likelihood methods, we have displayed the results of several
numerical simulations in Table 7.2. Figure 7.8 shows histograms of the parameter
θ estimated from 1000 independent samples of 5000 realizations each.
The value of the parameter α is 0.25 and the true value of θ is 10.
The continuous curve shows the histogram of values obtained using the
maximum likelihood method, whilst the dotted curve shows the same obtained
using the moment method. The superiority of the maximum likelihood method
is clear. We estimated θ from 100 independent samples of variable size P and
Table 7.2 shows the means and variances of the values obtained. We thus
Fig. 7.8. Histograms of estimated values of the parameter in the Weibull distribution
using the maximum likelihood method (continuous line) and the moment
method (dotted line)
observe that to obtain the same estimation variance with the method which
involves identifying the first moment as with the maximum likelihood method,
we would need a sample roughly four times larger.
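This comparison is easy to reproduce (a sketch of our own; Weibull samples are drawn by inversion, x = θ(−ln u)^{1/α}, and the seed is arbitrary):

```python
import random, math

random.seed(3)
alpha, theta0, P, trials = 0.25, 10.0, 5000, 200

g = math.gamma(1.0 + 1.0 / alpha)     # E[x] = theta * Gamma(1 + 1/alpha)
ml, mom = [], []
for _ in range(trials):
    x = [theta0 * (-math.log(1.0 - random.random())) ** (1.0 / alpha)
         for _ in range(P)]
    ml.append((sum(v ** alpha for v in x) / P) ** (1.0 / alpha))  # maximum likelihood
    mom.append(sum(x) / (P * g))                                  # first-moment method

def empirical_var(a):
    m = sum(a) / len(a)
    return sum((v - m) ** 2 for v in a) / len(a)

# The ML estimator comes out with the smaller variance; the ratio is roughly
# the factor of four in sample size mentioned above.
```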
We can find the Cramer-Rao bound which is attained by the statistic
T(X) = Σ_{i=1}^P x_i^α. If we consider this statistic, it constitutes an unbiased
estimator of θ^α, but a biased estimator of θ. Let us show that it is indeed an
unbiased estimator of θ^α. Setting y = x^α, we thus have dy = α x^{α−1} dx. Now
P_Y(y)dy = P_X(x)dx, which implies
I_F = −∫ ( ∂^2 ln[L(X|θ)]/∂θ^2 ) L(X|θ) dX .
We should not end this chapter without discussing the robustness problems
associated with estimation techniques using the maximum likelihood method.
An estimator θ̂_ML(X) is optimal in the sense of maximum likelihood for a
given probability law (or probability density function) P_θ(x). In other words,
we are concerned here with parameter estimation, since we assume that the
observed data obey a law of a form that is known a priori. However, it may
be that the observed data are distributed according to a law P̃_θ(x) which is
slightly different from those in the family P_θ(x). An estimator is said to be
robust if its variance changes only very slightly when it is evaluated for a
sample arising from P̃_θ(x) rather than from P_θ(x).
Let us illustrate with an example. The maximum likelihood estimator
for the mean of a Gaussian distribution has already been determined to be
7.11 Robustness of Estimators 193
Fig. 7.9. Example of 100 realizations of Gaussian variables with mean 0 and vari-
ance 1
θ̂_ML(X) = (Σ_{i=1}^P x_i)/P. We have also seen that this estimator is efficient. Suppose
now that the P-sample X = {x_1, x_2, …, x_P} arises from the probability
law P̃_θ(x) rather than P_θ(x), where P̃_θ(x) = (1 − ε)N(x) + εC(x), with N(x) a
Gaussian law and C(x) a Cauchy law.
Fig. 7.11. Histogram of values obtained with θ̂_ML(X) = (1/P) Σ_{i=1}^P x_i when the
samples are generated by pure Gaussian variables
Fig. 7.12. Histogram of values obtained with θ̂_ML(X) = (1/P) Σ_{i=1}^P x_i when the
samples are distributed according to P̃_θ(x) = (1 − ε)N(x) + εC(x)
this end, we have estimated θ for 100 independent samples made up of 1000
realizations each. Figures 7.11 and 7.12 show the histograms of values obtained
with θ̂_ML(X) = (Σ_{i=1}^P x_i)/P when the samples are generated by pure Gaussian
variables with mean 0 and variance 1 or by variables distributed according to
P̃_θ(x) = (1 − ε)N(x) + εC(x) with ε = 10^{−2}.
It should be observed that, although the realizations seem similar for pure
Gaussian variables and variables distributed according to P̃_θ(x), there are
spurious peaks in the second case for large values of |θ̂|. The variances of
the estimator are also very different, as can be seen from Table 7.4, where
the values have been estimated for various configurations. [To be perfectly
rigorous, the mean and the variance of a Cauchy random variable do not
exist. The same is therefore true for our own problem as soon as ε ≠ 0. The
figures mentioned only have a meaning for the numerical experiments carried
out.] It is quite clear then that the estimator θ̂_ML(X) = (Σ_{i=1}^P x_i)/P is not
robust.
Table 7.5. Variances estimated with the median in the presence of ε% Cauchy
variables
Fig. 7.13. Base 10 logarithm of the variances of several estimators for P = 1000 as
a function of ε: mean (continuous line), median (dotted line), 4% truncated mean
(dot-dashed line)
ments in X_a. We call this the a% truncated mean estimator. Figure 7.13 shows
the base 10 logarithm of the variances, i.e., log_10(σ_θ^2), of these estimators as
a function of ε and for a = 4%.
Note that the truncated mean performs extremely well. It is easy to gen-
eralize this method to the estimation of parameters other than the mean. We
can say that the estimator has been robustified. This is an important point
in applications as soon as there is any risk of atypical data. In particular, it
is very important if the atypical data can exhibit large deviations, even if the
probability of this happening is extremely low.
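The robustification is easy to demonstrate (a sketch of our own; the Cauchy variables are generated by inversion of the Cauchy distribution function):

```python
import random, math

random.seed(4)
eps, P, trials = 0.01, 1000, 300

def contaminated():
    """(1 - eps) N(0,1) + eps standard Cauchy."""
    if random.random() < eps:
        return math.tan(math.pi * (random.random() - 0.5))
    return random.gauss(0.0, 1.0)

def truncated_mean(x, a=0.04):
    """a-truncated mean: drop the a/2 smallest and a/2 largest fractions."""
    s = sorted(x)
    k = int(len(s) * a / 2.0)
    s = s[k:len(s) - k]
    return sum(s) / len(s)

def empirical_var(a):
    m = sum(a) / len(a)
    return sum((v - m) ** 2 for v in a) / len(a)

means, medians, truncs = [], [], []
for _ in range(trials):
    x = [contaminated() for _ in range(P)]
    means.append(sum(x) / P)
    medians.append(sorted(x)[P // 2])
    truncs.append(truncated_mean(x))

# The plain mean is destabilized by the rare Cauchy outliers, while the
# median and the truncated mean keep a variance of order 1/P.
```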
[ ⟨δT(X_λ)U(X_λ)⟩_θ ]^2 ≤ σ_T^2(θ) ⟨[U(X_λ)]^2⟩_θ .
⟨∂ℓ(X_λ|θ)/∂θ⟩_θ = ∫ (∂L(X|θ)/∂θ) dX = ∂[ ∫ L(X|θ) dX ]/∂θ = 0 ,

and hence,
⟨δT(X_λ)U(X_λ)⟩_θ = ∫ T(X) ( ∂ln[L(X|θ)]/∂θ ) L(X|θ) dX ,

or

⟨δT(X_λ)U(X_λ)⟩_θ = ∫ T(X) ( ∂L(X|θ)/∂θ ) dX = ∂/∂θ ∫ T(X) L(X|θ) dX .
[ ⟨δT(X_λ)U(X_λ)⟩_θ ]^2 ≤ σ_T^2(θ) ⟨[U(X_λ)]^2⟩_θ ,
we finally obtain
where
We can obtain a new expression for the Fisher information once again using
the fact that the likelihood defines a probability density function on X. Indeed,
we have J L(xl())dX = 1. We have already seen that
∫ ( ∂ln[L(X|θ)]/∂θ ) L(X|θ) dX = 0 .
Differentiating with respect to θ, we obtain

∫ ( ∂^2 ln[L(X|θ)]/∂θ^2 ) L(X|θ) dX + ∫ ( ∂ln[L(X|θ)]/∂θ ) ( ∂L(X|θ)/∂θ ) dX = 0 ,
and hence,
I_F = −∫ ( ∂^2 ln[L(X|θ)]/∂θ^2 ) L(X|θ) dX .
The random variables δT(X_λ) and U(X_λ) in X_λ are therefore equal in quadratic
mean. Since α_0 can depend on θ, we write α_0 = α_0(θ) and hence U(X_λ) =
α_0(θ)δT(X_λ), or
∂ln[L(X_λ|θ)]/∂θ = α_0(θ) δT(X_λ) .
We put β(θ) = −⟨α_0(θ)T(X_λ)⟩_θ, and consider a given sample X. We have
∂ln[L(X|θ)]/∂θ = α_0(θ)T(X) + β(θ) and if we integrate this expression with
respect to θ, we obtain ln[L(X|θ)] = a(θ)T(X) + F(X) + P b(θ), or
with Z(θ) = exp[−b(θ)] = ∫ exp[ a(θ)t(x) + f(x) ] dx. We now determine the
variance of the estimator. Since the Cramer-Rao bound is attained, we have

σ_T^2(θ) = |∂h(θ)/∂θ|^2 / I_F(θ) ,

where h(θ) = ∫ T(X)L(X|θ)dX. We have ∂ln[L(X|θ)]/∂θ = α_0(θ)δT(X_λ) and
hence

∫ [ ∂ln[L(X|θ)]/∂θ ]^2 L(X|θ) dX = α_0^2(θ) σ_T^2(θ) ,
We use the notation θ = (θ_1, θ_2, …, θ_n)^t, δT(X) = T(X) − ⟨T(X)⟩_θ, where
the vector statistic T(X_λ) has the same dimension as θ. We will also assume
that T(X) is an unbiased estimator of θ. The covariance matrix of T(X_λ) is Γ,
and ℓ(X|θ) = ln[L(X|θ)]. We also write U(X_λ) for the vector with components
U_i(X) = ∂ℓ(X|θ)/∂θ_i. Then,

⟨U(X_λ)[δT(X_λ)]^t⟩_θ = Id_n ,
or
or
Now,
as required.
An analogous calculation to the one carried out previously shows that
where δ_{i−j} is the Kronecker symbol. Indeed, we have ∫ T_j(X)L(X|θ) dX = θ_j
and

∂/∂θ_i ∫ T_j(X)L(X|θ) dX = 1 if i = j , and 0 if i ≠ j .
This equation can be written in matrix form, viz., ⟨U(X_λ)[δT(X_λ)]^t⟩_θ = Id_n,
and we deduce that
∂/∂θ_i ∫ L(X|θ) dX = 0 ,

or alternatively,

∫ ( ∂ℓ(X|θ)/∂θ_i ) L(X|θ) dX = 0 .
Differentiating a second time,
or
and hence,
As we saw above,
so that
u^t J̄ u = Σ_{j=1}^n Σ_{i=1}^n u_j ⟨ (∂ℓ(X|θ)/∂θ_j)(∂ℓ(X|θ)/∂θ_i) ⟩_θ u_i ,
and hence,
7.14 Appendix: Vectorial Cramer-Rao Bound 203
which implies
and finally,
--1
with F(X)..) = u t] U(n) and G(X)..) = [8T(X)..)]t u. This implies that
IUt),-l ul 2
:::; (IF(n)1 2 )1I(IG(n)1 2 )1I ,
where

⟨|F(X_λ)|^2⟩_θ = ⟨ u^t J̄^{−1} U(X_λ) [U(X_λ)]^t J̄^{−1} u ⟩_θ ,

and

⟨|G(X_λ)|^2⟩_θ = ⟨ u^t δT(X_λ) [δT(X_λ)]^t u ⟩_θ .
We can analyze each term on the right-hand side. We have for the first term
and therefore,
We now analyze the second term on the right-hand side of the above equation:
Now
so that
The inequality thus becomes

u^t J̄^{−1} u ≤ u^t Γ u .

This inequality is true ∀u ∈ C^n. Equality requires J̄^{−1}U(X_λ) = a(θ)T(X_λ) + c(θ), or
alternatively, U(X_λ) = A(θ)T(X_λ) + β(θ), where A(θ) = J̄a(θ) and β(θ) =
J̄c(θ). Expanding out this equation, we obtain
L(X|θ) = Π_{i=1}^P P(x_i|θ) .
[If we have T_j(X) = (1/P) Σ_{i=1}^P t_j(x_i) rather than T_j(X) = Σ_{i=1}^P t_j(x_i), this is
irrelevant here, because the parameters in the probability law are only defined
up to a multiplicative constant.] We thus have

T_j(X) = Σ_{i=1}^P t_j(x_i) ,

and F(X) = Σ_{i=1}^P f(x_i). The probability or density of the law thus has the form
Exercises
Exercise 7.1. Cramer-Rao Bound
By analyzing the general expression for the Cramer-Rao bound in the case
where the estimator may be biased, explain qualitatively why the variance
of this estimator might actually be less than the Cramer-Rao bound of an
unbiased estimator for the same parameter.
P_A(x) = (1/(2a)) exp(−|x|/a) ,
Consider a random variable X_λ taking real values in the interval [0, 1] with a
beta probability law of type I:

P_X(x) = (1/B(n, p)) x^{n−1} (1 − x)^{p−1} ,
Consider a real random variable X_λ with uniform probability distribution over
the interval [0, θ].
(1) Write down the probability density function of X_λ.
(2) Find the estimator of θ obtained by identifying the first order moment.
(3) Find the maximum likelihood estimator for θ.
(4) Can it be asserted that this estimator has minimal variance?
Consider now a real random variable X_λ with uniform probability distribution
over the interval [−θ, θ].
(5) Write down the probability density function of X_λ.
(6) Suggest an estimator for θ in the sense of moments.
(7) Find the maximum likelihood estimator for θ.
The real random variable X_λ is assumed to have a probability density function
of the form
where c ≥ 0.
(1) Calculate the Cramer-Rao bound for the estimator of the empirical mean.
(2) Compare the Cramer-Rao bounds when c = 0 and when c > 0.
8
Examples of Estimation in Physics
This is just what our intuition would have suggested, i.e., to use the method
in which we identify the first moment. We have also seen that these esti-
mators have minimal variance. Indeed, in the exponential family, the max-
imum likelihood estimators have minimal variance when they have no bias.
Let us determine their Cramer-Rao bounds. We know that, when we observe
a P-sample X = {x_1, x_2, …, x_P}, we have σ^2(θ) ≥ 1/I_F(X), where
I_F = −⟨∂^2 ℓ(X)/∂θ^2⟩ and where σ^2(θ) is the variance of the sufficient statistic
θ̂_ML(X) = (Σ_{i=1}^P x_i)/P for the Gauss and Poisson distributions and
θ̂_ML(X) = (Σ_{i=1}^P x_i)/(αP) is the sufficient statistic for the Gamma distribution.
Since the estimators we are considering here are efficient, the last
inequality is actually an equality and hence σ^2(θ) = 1/I_F(X).
The log-likelihood of the Poisson distribution is
ℓ(X) = −Pθ + T(X) ln θ − Σ_{i=1}^P ln(x_i!) ,

where T(X) = Σ_{i=1}^P x_i. (In contrast to the last chapter, we denote discrete
and continuous random variables in the same way here, for reasons of simplicity.)
We have ⟨T(X)⟩ = Pθ and, as the Fisher information is given by
I_F = ⟨T(X)⟩/θ^2, we deduce that I_F = P/θ. We thus obtain

σ_T^2(θ) = θ/P .
ℓ(X) = −Pα ln θ − (1/θ) T(X) + (α − 1) Σ_{i=1}^P ln x_i − P ln Γ(α) .

Now ⟨T(X)⟩ = αPθ, and hence, I_F = Pα/θ^2. Since the estimator θ̂_ML(X) =
(Σ_{i=1}^P x_i)/(αP) is unbiased, we deduce that

σ_T^2(θ) = θ^2/(Pα) .
ratio, which can be defined as the square of the mean over the variance, is
then equal to ρ = Pθ for Poisson noise and ρ = Pα for noise associated with
a Gamma distribution. The signal-to-noise ratio is therefore independent of
the mean for Gamma noise, whereas it increases linearly with the mean in the
case of Poisson noise. For additive Gaussian noise, we would have ρ = Pθ^2/a^2,
which shows that the signal-to-noise ratio would increase in this case as the
square of the signal mean.
Let us now analyze what happens if we make an estimate of the relaxation
parameter of a time-varying flux using the maximum likelihood method.
To simplify the analysis, we assume that the signal is measured at discrete
time intervals t = 1, 2, …, P and denote the P-sample as usual by
X = {x_1, x_2, …, x_P}. When there is no noise, the signal would be equal
to {s_i^θ}_{i=1,…,P}, and we will assume that it is equal to the mean value of
the measured signal: s_i^θ = ⟨x_i⟩. {s_i^θ}_{i=1,…,P} then constitutes a parametric
model of the signal. For example, for an exponential relaxation of the flux,
we would have s_i^θ = s_0 exp(−i/θ) and for a flux varying sinusoidally with
time, s_i^θ = s_0 sin(ωi + θ), where we have assumed that the phase of the signal
is an unknown parameter. We will assume that the measurement noise is
uncorrelated in such a way that we can write the log-likelihood in the form

ℓ(X) = Σ_{i=1}^P ln[ P_{s_i^θ}(x_i) ] .
First of all, the relation between the parameter θ and the measurements X =
{x_1, x_2, …, x_P} is no longer as simple as in the last case and we cannot use
the previous arguments to guarantee that the maximum likelihood technique
will lead to the estimator with minimal variance, even if it is unbiased. Let
us consider the three cases of Poisson, Gamma and additive Gaussian noise
in turn.
For Poisson noise, the log-likelihood is

ℓ(X) = −Σ_{i=1}^P s_i^θ + Σ_{i=1}^P x_i ln s_i^θ − Σ_{i=1}^P ln(x_i!) .

The maximum likelihood estimate of θ is thus obtained by minimizing

E(X, θ) = Σ_{i=1}^P ( s_i^θ − x_i ln s_i^θ ) .
For Gamma noise, the log-likelihood is

ℓ(X) = (α − 1) Σ_{i=1}^P ln x_i − α Σ_{i=1}^P ln s_i^θ − Σ_{i=1}^P x_i/s_i^θ − P ln Γ(α) .

The maximum likelihood estimate of θ is thus obtained by minimizing
We see that it is only in the case of additive Gaussian noise that the least
squares criterion is optimal as far as likelihood is concerned. Table 8.1 gives
the various quantities we must minimize to achieve the maximum likelihood
estimate in the case of an exponential relaxation sf
= So exp( -i/O).
Table 8.1. Quantities to be minimized for the maximum likelihood estimate:
Poisson:  Σ_{i=1}^P [ s_i^θ − x_i ln s_i^θ ]
Gamma:    Σ_{i=1}^P [ α ln s_i^θ + x_i/s_i^θ ]
Gauss:    Σ_{i=1}^P ( x_i − s_i^θ )^2
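As an illustration, here is a rough sketch (our own; s_0 is assumed known and a crude grid search stands in for a proper optimizer) of the Poisson maximum likelihood fit of an exponential relaxation:

```python
import random, math

random.seed(5)
s0, theta0, P = 100.0, 8.0, 50

def poisson(mu):
    """Knuth's Poisson sampler (fine for mu up to ~100)."""
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# Noisy measurements of the relaxation s_i = s0 * exp(-i/theta0).
x = [poisson(s0 * math.exp(-i / theta0)) for i in range(1, P + 1)]

def s(i, theta):
    return s0 * math.exp(-i / theta)

def poisson_criterion(theta):
    """E(X, theta) = sum_i (s_i^theta - x_i * ln s_i^theta)."""
    return sum(s(i, theta) - x[i - 1] * math.log(s(i, theta))
               for i in range(1, P + 1))

# Grid minimization over theta in [2, 20].
grid = [t / 100.0 for t in range(200, 2001)]
theta_ml = min(grid, key=poisson_criterion)
```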
where n_i is a Gaussian variable with zero mean and unknown variance \sigma_0^2. x_i is thus a Gaussian variable with mean a and unknown variance. We saw in Section 7.10 that the maximum likelihood estimator of the mean of a Gaussian distribution is unbiased and efficient:
\[ \hat{a}_{ML} = \frac{1}{P}\sum_{i=1}^{P} x_i . \]
Let us now calculate the variance of this estimator, viz., \langle [\hat{a}_{ML} - \langle \hat{a}_{ML} \rangle]^2 \rangle, where, as in previous chapters, \langle\ \rangle represents the expectation value. If a_0 denotes the true value of a, since the estimator is unbiased, we have \langle \hat{a}_{ML} \rangle = a_0. We obtain
\[ \left\langle |\delta \hat{a}_{ML}|^2 \right\rangle = \frac{1}{P^2}\sum_{i=1}^{P}\sum_{j=1}^{P} \left\langle (x_j - a_0)(x_i - a_0) \right\rangle . \]
Now \langle (x_j - a_0)(x_i - a_0) \rangle = \sigma_0^2 \delta_{i-j}, where \delta_n is the Kronecker delta, and therefore,
\[ \left\langle |\delta \hat{a}_{ML}|^2 \right\rangle = \frac{\sigma_0^2}{P} . \]
We may therefore say that the standard deviation \sigma_a of the estimator \hat{a}_{ML} is
\[ \sigma_a = \frac{\sigma_0}{\sqrt{P}} . \]
Figure 8.1 shows the variances of the random variable and the estimator of the mean.
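The 1/\sqrt{P} behavior is easy to verify numerically; the values below are illustrative only:

```python
import numpy as np

# Illustrative check that the estimator of the mean has standard
# deviation sigma_0 / sqrt(P): repeat the estimation on many
# independent P-samples and compare empirical and predicted spreads.
rng = np.random.default_rng(1)
a0, sigma0, P, trials = 3.0, 2.0, 100, 20000
samples = rng.normal(a0, sigma0, size=(trials, P))
a_ml = samples.mean(axis=1)             # one estimate per P-sample
print(a_ml.std(), sigma0 / np.sqrt(P))  # both close to 0.2
```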
To estimate the accuracy of the estimate of a, we need to know \sigma_0. Let us therefore turn to the problem of estimating the variance of the x_i. This will allow us to deduce the standard deviation \sigma_a and hence the accuracy of the estimate of a. We will then be able to plot an error bar of plus or minus \sigma_a on either side of \hat{a}_{ML}. We do this in two stages. First of all, we consider the case
Fig. 8.1. Comparing the variances of the random variable and the estimator of the mean
where the mean is assumed to be known, and then we turn to the situation when it too is unknown. In the first case, only the variance is unknown and the likelihood is
\[ L(x|\sigma) = \prod_{i=1}^{P} \left\{ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x_i - a_0)^2}{2\sigma^2} \right] \right\} . \]
This leads to a log-likelihood
\[ \ell(x|\sigma) = -\frac{1}{2\sigma^2}\sum_{i=1}^{P} (x_i - a_0)^2 - P\ln\left( \sqrt{2\pi}\,\sigma \right) , \]
and hence,
\[ \hat{\sigma}^2_{ML} = \frac{1}{P}\sum_{i=1}^{P} (x_i - a_0)^2 . \]
We have already seen in Section 7.10 that this maximum likelihood estimator of the variance is unbiased and efficient for \sigma^2.
Needless to say, this simple scenario does not correspond to the one usu-
ally encountered in reality. Indeed, the mean is generally unknown, precisely
because it is the very thing we seek to estimate. In this case, we must write
8.2 Measurement Accuracy in the Presence of Gaussian Noise 215
\[ \ell(x|a,\sigma) = -\frac{1}{2\sigma^2}\sum_{i=1}^{P} (x_i - a)^2 - P\ln\left( \sqrt{2\pi}\,\sigma \right) , \]
and
\[ \hat{\sigma}^2_{ML} = \frac{1}{P}\sum_{i=1}^{P} \left( x_i - \hat{a}_{ML} \right)^2 , \]
or alternatively,
\[ \hat{\sigma}^2 = \frac{1}{P-1}\sum_{i=1}^{P} \left( x_i - \hat{a}_{ML} \right)^2 \]
is an unbiased estimator. We thus see that we can estimate the error bar on a by
and hence,
\[ \hat{\sigma}_a = \frac{\hat{\sigma}}{\sqrt{P}} . \]
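The bias of the 1/P estimator when the mean is also estimated, and the unbiasedness of the 1/(P-1) version, can be checked by simulation (illustrative numbers):

```python
import numpy as np

# Illustrative comparison of the 1/P (ML) and 1/(P-1) (unbiased)
# variance estimators when the mean is also estimated.
rng = np.random.default_rng(2)
sigma0, P, trials = 1.0, 5, 200000
x = rng.normal(0.0, sigma0, size=(trials, P))
m = x.mean(axis=1, keepdims=True)
var_ml = ((x - m) ** 2).sum(axis=1) / P         # biased low by (P-1)/P
var_unb = ((x - m) ** 2).sum(axis=1) / (P - 1)  # unbiased
print(var_ml.mean(), var_unb.mean())  # about 0.8 and 1.0
```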
We can calculate the Fisher information matrix in this case where we make a joint estimate of the mean a and the variance of the distribution. We have seen that the log-likelihood is
\[ \ell(x|a,\sigma) = -\frac{1}{2\sigma^2}\sum_{i=1}^{P} (x_i - a)^2 - P\ln\left( \sqrt{2\pi}\,\sigma \right) . \]
Setting T_1(X) = \sum_{i=1}^{P}(x_i - a) and T_2(X) = \sum_{i=1}^{P}(x_i - a)^2, we deduce that
\[ \frac{\partial \ell}{\partial a}(x|a,\alpha) = \frac{1}{\alpha}\, T_1(X) , \qquad \frac{\partial \ell}{\partial \alpha}(x|a,\alpha) = \frac{1}{2\alpha^2}\, T_2(X) - \frac{P}{2\alpha} , \]
where we have put \alpha = \sigma^2. We thus obtain
\[ \left\langle \frac{\partial^2 \ell}{\partial \alpha^2}(x|a,\alpha) \right\rangle = -\frac{1}{\alpha^3} \left\langle T_2(X) \right\rangle + \frac{P}{2\alpha^2} . \]
When a = a_0 and \alpha = \alpha_0, we have \langle T_1(X) \rangle = 0 and \langle T_2(X) \rangle = P\alpha_0. The Fisher information matrix is thus
\[ \underline{\underline{I}} = \begin{pmatrix} P/\alpha_0 & 0 \\ 0 & P/(2\alpha_0^2) \end{pmatrix} , \]
and therefore,
\[ \underline{\underline{I}}^{-1} = \begin{pmatrix} \alpha_0/P & 0 \\ 0 & 2\alpha_0^2/P \end{pmatrix} . \]
8.3 Estimating a Detection Efficiency 217
[Fig. 8.2: schematic of the emission and detection of particles]
\[ X = \begin{cases} 1 & \text{with probability } \eta_0 , \\ 0 & \text{with probability } 1 - \eta_0 . \end{cases} \]
We write the Bernoulli law in a form which shows that it does belong to the
exponential family. For this purpose, we use the fact that x is binary-valued:
so that
and hence,
\[ [\hat{\eta}_D]_{ML} = \frac{1}{P}\sum_{i=1}^{P} x_i , \]
which finally turns out to be very simple. It is easy to check that this estimator
is unbiased. Let us now consider its variance. To this end we set
We then have
where \eta_0 is the true but unknown value of \eta_D. Expanding out, we obtain
and so finally,
\[ \sigma_D^2 = \frac{\eta_0(1-\eta_0)}{P} . \]
In other words, the standard deviation \sigma_D of the estimator of \eta_D is
8.4 Estimating the Covariance Matrix 219
\[ \sigma_D = \sqrt{\frac{\eta_0(1-\eta_0)}{P}} . \]
This immediately raises the problem that we do not know \eta_0, and if we replace this value directly by its estimate [\hat{\eta}_D]_{ML}, we may be led to underestimate the error bar when [\hat{\eta}_D]_{ML} is large. Indeed, if we find [\hat{\eta}_D]_{ML} = 1, we will conclude that \sigma_D = 0 and attribute a zero error bar to this estimate of \eta_D. A very pessimistic view would lead us to choose an upper bound for this error bar, and such a thing can be obtained by setting \sigma_D^2 \simeq 0.25/P, or \sigma_D \simeq 1/(2\sqrt{P}).
A less radical solution can be implemented when \eta_0 \approx 1. If we find [\hat{\eta}_D]_{ML} = 1, we can consider that this value is situated at the maximum of the error bar. In other words, we will assume that, for the estimator \hat{\eta}_D of the parameter \eta_D, it is reasonable to choose the value such that 1 = \hat{\eta}_D + \sigma_G, where \sigma_G^2 = \hat{\eta}_D(1 - \hat{\eta}_D)/P. We then find \sigma_G \approx 1/(P+1). Note that, although these two approaches are somewhat arbitrary, they are nevertheless better than an over-optimistic attitude as far as the conclusions that can be drawn from our experiments are concerned.
We now consider a numerical example. Assume that we have made 100 measurements and find [\hat{\eta}_D]_{ML} = 0.8. Taking \sigma_{MG}^2 = [\hat{\eta}_D]_{ML}(1 - [\hat{\eta}_D]_{ML})/P, we obtain \sigma_{MG} = 0.04, corresponding to an accuracy of 4%. Note that if we had estimated \sigma_G with \sigma_D \simeq 1/(2\sqrt{P}), we would have obtained \sigma_D \simeq 5%. On the other hand, if we find [\hat{\eta}_D]_{ML} = 0.99, where \sigma_{MG}^2 = [\hat{\eta}_D]_{ML}(1 - [\hat{\eta}_D]_{ML})/P, we obtain \sigma_{MG} = 0.01, corresponding to an accuracy of 1%, which is indeed of the order of 1/(P+1). We have just seen that the most cautious attitude leads to \sigma_D \simeq 5%. We find that \eta_D could be of the order of 0.94. This would then lead to \sigma_{MG} \approx 0.02, or 2%, which is more reasonable.
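The two error-bar choices discussed above can be written as a small helper (the function name is ours, and the numbers reproduce the example in the text, P = 100):

```python
import math

# Error bars for a detection-efficiency estimate eta_ml from P trials.
def error_bars(eta_ml, P):
    sigma_mg = math.sqrt(eta_ml * (1.0 - eta_ml) / P)  # plug-in estimate
    sigma_max = 1.0 / (2.0 * math.sqrt(P))             # pessimistic upper bound
    return sigma_mg, sigma_max

print(error_bars(0.8, 100))   # about (0.04, 0.05)
print(error_bars(0.99, 100))  # about (0.01, 0.05)
```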
\[ \Gamma = \sum_{j=1}^{n} \mu_j \left( u_j u_j^\dagger \right) , \]
and hence,
where u_j^\dagger u_j = 1. Let us first estimate the eigenvalues. For this purpose, we put
= 0,
or
(8.1)
To determine the eigenvectors u_j, we must maximize \ell(X|\Gamma) with the constraints \|u_j\| = 1. To do so, we introduce the Lagrange function, whose stationarity condition reads
\[ \frac{1}{2}\sum_{i=1}^{P}\sum_{m=1}^{n} \left\{ [x_i]_k [x_i]_m [u_j]_m + [u_j]_m [x_i]_m [x_i]_k \right\} = \mu_j \alpha_j [u_j]_k . \qquad (8.2) \]
In order to analyze (8.1) and (8.2), we introduce the covariance matrix of the
measurements:
8.5 Application to Coherency Matrices 221
or
and
\[ \hat{C} u_j = \mu_j \alpha_j u_j / P . \]
We can deduce from these last two equations that u_j is the j th eigenvector of \hat{C}, corresponding to the eigenvalue \mu_j. In other words, we have
\[ \hat{C} = \sum_{j=1}^{n} \mu_j \left( u_j u_j^\dagger \right) . \]
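The estimation of the covariance matrix and its eigendecomposition can be sketched as follows (real 2-component data and the true covariance matrix are hypothetical choices):

```python
import numpy as np

# Empirical covariance C_hat = (1/P) sum_i x_i x_i^T and its
# eigendecomposition; the true covariance [[2, 1], [1, 2]] has
# eigenvalues 1 and 3.
rng = np.random.default_rng(3)
P = 50000
L = np.linalg.cholesky(np.array([[2.0, 1.0], [1.0, 2.0]]))
x = rng.standard_normal((P, 2)) @ L.T  # zero-mean correlated samples
C_hat = x.T @ x / P                    # (1/P) sum of x x^T
mu, U = np.linalg.eigh(C_hat)          # eigenvalues mu_j, eigenvectors u_j
print(mu)  # close to [1, 3]
```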
\[ P(u) = \frac{1}{\pi^2 |\Gamma|} \exp\left( -u^\dagger \Gamma^{-1} u \right) , \]
where |\Gamma| is the determinant of \Gamma and u = u_x \mathbf{i} + u_y \mathbf{j}.
Suppose now that we have a sample X = \{u_1, u_2, \ldots, u_P\} comprising P measurements. The log-likelihood is thus
\[ \ell(X|\Gamma) = -\sum_{n=1}^{P} \left( u_n^\dagger \Gamma^{-1} u_n \right) - P\ln|\Gamma| - 2P\ln\pi , \]
\[ S_2 = U_x^* U_y + U_y^* U_x , \qquad S_3 = i\left( U_y^* U_x - U_x^* U_y \right) , \]
using upper case letters because we must consider the field components as
random variables whose expectation values we seek to determine. It is easy
to express the coherency matrix in terms of these parameters and conversely.
Here, we simply note that the maximum likelihood estimates of the Stokes
parameters are
\[ [\hat{S}_k]_{ML} = \frac{1}{P}\sum_{n=1}^{P} s_k^{(n)} , \qquad k = 0, 1, 2, 3 . \]
\[ \Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1} e^{-x}\, dx , \]
and
\[ \frac{\partial^2}{\partial\alpha\,\partial\beta}\,\ell(X|\boldsymbol{\theta}) = \frac{P}{\beta} . \]
The Fisher matrix is therefore
\[ \underline{\underline{I}} = P \begin{pmatrix} \partial^2 \ln\Gamma(\alpha)/\partial\alpha^2 & -1/\beta \\ -1/\beta & \alpha/\beta^2 \end{pmatrix} . \]
We thus obtain
8.7 Fluctuation- Dissipation and Estimation 225
Estimation results
Fig. 8.3. Intuitive meaning of the covariance matrix for joint estimation of \alpha and \beta
\[ I_F = -\frac{\partial^2 \left\langle \ell(X|\theta) \right\rangle}{\partial\theta^2} = P\frac{\alpha}{\beta^2} , \]
and hence \sigma_\theta^2 \geq \beta^2/(P\alpha). Since (\alpha A_0 - 1)/\beta^2 is the determinant of the Fisher matrix, which is positive, as has been shown in Section 7.14, we have \alpha A_0 > 1 and we can deduce that \alpha A_0/(\alpha A_0 - 1) > 1. We thus see that the bound is greater when \alpha is unknown than when \alpha is known.
This is an important result because it shows that the introduction of ex-
tra parameters can lead to an increase in the variance of their estimator. In
other words, there is a price to pay for having complex models in which there
are a large number of parameters to be estimated, namely, the difficulty in
accurately estimating those parameters.
In this section, we shall study the analogies between probability laws in the exponential family and the Gibbs distributions discussed in Chapter 6. For this purpose, we consider the canonical form of the laws in the exponential family, using the notation X = \{x_1, x_2, \ldots, x_P\}, M(X) = \sum_{i=1}^{P} t(x_i), h = -\theta, and H_0(X) = \sum_{i=1}^{P} f(x_i), and setting \beta = -1. It is then quite clear that the laws in the exponential family can be written in the form of the Gibbs distributions:
\[ P_\theta(X) = \frac{\exp\left\{ -\beta\left[ H_0(X) - h M(X) \right] \right\}}{Z_\beta} , \]
where X{3 is the susceptibility defined by X{3 = 8M{3/8h with M{3 = (M(x)).
Let us examine what this says for probability laws in the exponential family.
To this end, we return to the canonical notation, whereupon
Furthermore,
Exercises
Exercise 8.1. Speckle at Low Fluxes
An intensity measurement is made in the presence of speckle and at low flux.
We saw in Section 4.10 that the measured intensity is proportional to a random
variable N taking positive integer values according to a probability law
\[ P(n) = A a^n . \]
Consider a partially polarized coherent light source. Assume that the intensi-
ties along the horizontal and vertical axes are described by independent ran-
dom variables with exponential probability density functions, so that the co-
herency matrix is diagonal. P intensity measurements Xi and Yi (i = 1, ... , P)
have been made along the horizontal and vertical axes, respectively. These
measurements correspond to independent speckle realizations. The degree of
polarization of the light for measurement number i is
Xi - Yi
Pi = Xi + Yi
\[ I = \int_{-\infty}^{+\infty} \frac{dx}{\left[ \exp(x/2) + \exp(-x/2) \right]^2} . \]
In this exercise, we shall use the maximum likelihood method to estimate the variation of a flux that is assumed to vary linearly with time. We assume that the signal is measured at discrete time intervals t = 1, 2, ..., P and denote the P-sample by X = \{x_1, x_2, \ldots, x_P\} as usual.
The signal without noise is assumed to evolve according to the model s_i^\theta = i\theta. The parameter \theta is estimated in the presence of independent noise for each measurement. In other words, the measurement noise is uncorrelated, in such a way that we can write the log-likelihood in the form
\[ \ell(X) = \sum_{i=1}^{P} \ln\left[ P_{s_i^\theta}(x_i) \right] . \]
Let P_X(x) be the probability density function of X_\lambda. For -a/2 < X_\lambda < a/2, we have P_Y(y) = P_X(x). The probability of having -a \leq X_\lambda \leq -a/2 is 1/4, as is the probability that a/2 \leq X_\lambda \leq a. We thus see that, for Y_\lambda, we must
consider a joint probability density and discrete probability law. This can be
simply achieved using the Dirac distribution 8(y). We can then write
\[ P_Y(y) = \frac{1}{4}\delta(y + a/2) + \frac{1}{2a}\,\mathrm{Rect}_{-a/2,a/2}(y) + \frac{1}{4}\delta(y - a/2) , \]
where
\[ \mathrm{Rect}_{-a/2,a/2}(y) = \begin{cases} 1 & \text{if } -a/2 < y < a/2 , \\ 0 & \text{otherwise.} \end{cases} \]
We thus observe that simple transformations can lead to mixtures of discrete
and continuous probability distributions. The Dirac distribution then provides
an extremely useful tool.
Clearly, the range of variation of Y>. is [0, 1]. The function 9 is increasing and we
can therefore apply the relation Py(y)dy = Px(x)dx. Now it is immediately
clear that dy = Px(x)dx, and we deduce that Py(y) = 1, corresponding to
a uniform probability density between 0 and 1. The above transformation
y = g(x) is often used in data processing to obtain a good distribution of the
values of the random variables over a given region.
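The transformation y = g(x) with g the cumulative distribution function can be sketched as follows (Gaussian input is an illustrative choice):

```python
import math
import random

# Applying the cumulative distribution function to Gaussian samples
# yields values uniformly distributed on [0, 1].
random.seed(5)

def gauss_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

ys = [gauss_cdf(random.gauss(0.0, 1.0)) for _ in range(100000)]
mean_y = sum(ys) / len(ys)
frac_low = sum(1 for y in ys if y < 0.25) / len(ys)
print(mean_y, frac_low)  # close to 0.5 and 0.25
```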
We have
\[ P_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x-m)^2}{2\sigma^2} \right] , \]
and hence,
Note first that, by symmetry, we must have ((x - m)n) = 0 when n is odd.
Setting u = (x-m)/\sigma, we find
\[ \frac{\left\langle (x-m)^n \right\rangle}{\sigma^n} = \int_{-\infty}^{\infty} \frac{u^n}{\sqrt{2\pi}} \exp\left( -\frac{u^2}{2} \right) du . \]
If we put
\[ J(a) = \int_{-\infty}^{\infty} \exp\left( -a\frac{u^2}{2} \right) du = \sqrt{2\pi}\, a^{-1/2} , \]
we then have
We deduce that
n= 1:
n=2 :
n=3 :
n:
Finally,
\[ \left\langle (x-m)^{2n} \right\rangle = 1 \cdot 3 \cdot 5 \cdots (2n-1)\, \sigma^{2n} . \]
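This odd-double-factorial formula is easy to confirm numerically (m = 0, \sigma = 1 below):

```python
import numpy as np

# Numerical check of <(x-m)^{2n}> = 1*3*5*...*(2n-1) sigma^{2n}
# (m = 0, sigma = 1) by summation on a fine, wide grid.
u = np.linspace(-12.0, 12.0, 400001)
du = u[1] - u[0]
pdf = np.exp(-u * u / 2.0) / np.sqrt(2.0 * np.pi)

def odd_double_factorial(n):
    out = 1
    for k in range(1, 2 * n, 2):
        out *= k
    return out

for n in (1, 2, 3):
    moment = np.sum(u ** (2 * n) * pdf) * du
    print(n, moment, odd_double_factorial(n))  # 1, 3, 15
```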
Using the covariance matrix r, the probability density function can be written
\[ P_Z(z) = \frac{1}{\pi E} \exp\left( -\frac{|z - M|^2}{E} \right) . \]
234 9 Solutions to Exercises
The probability density function of the Gamma probability law, defined for
x 2: 0, is
\[ P_X(x) = \frac{x^{\alpha-1}}{m^\alpha \Gamma(\alpha)} \exp\left( -\frac{x}{m} \right) . \]
The function y = x^\beta is increasing and we can therefore apply the relation P_Y(y)dy = P_X(x)dx. We deduce that dy = \beta x^{\beta-1} dx and x = y^{1/\beta}. We then obtain
\[ P_Y(y) = \frac{y^{\alpha/\beta - 1}}{\beta\, m^\alpha \Gamma(\alpha)} \exp\left( -\frac{y^{1/\beta}}{m} \right) . \]
When a = 1, we obtain
\[ P_Y(y) = \frac{\gamma\, y^{\gamma-1}}{\mu^\gamma} \exp\left[ -\left( \frac{y}{\mu} \right)^\gamma \right] , \]
where \gamma = 1/\beta and \mu = m^\beta. P_Y(y) is the Weibull probability density function.
(1) We have
\[ P_B(b) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{b^2}{2\sigma^2} \right) , \]
\[ P_Y(y) = \frac{\sqrt{N}}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{N}{2\sigma^2}(y - g)^2 \right] , \]
\[ \theta(u) = \begin{cases} 1 & \text{if } u \geq 0 , \\ 0 & \text{otherwise,} \end{cases} \]
\[ P_{X,Y}(x,y) = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{x^2 + y^2}{2\sigma^2} \right) . \]
Since dF_Z(z)/dz = P_Z(z) and d\theta(u)/du = \delta(u), where \delta(u) is the Dirac distribution, we have
\[ P_Z(z) = 2\int_0^{+\infty} \left[ \int_{-\infty}^{+\infty} \delta\left( z - \frac{x}{y} \right) P_{X,Y}(x,y)\, dx \right] dy . \]
If we put v = x/y, we can then write
\[ P_Z(z) = 2\int_0^{+\infty} y \left[ \int_{-\infty}^{+\infty} \delta(z - v)\, P_{X,Y}(yv, y)\, dv \right] dy = 2\int_0^{+\infty} \frac{y}{2\pi\sigma^2} \exp\left( -\frac{y^2 + z^2 y^2}{2\sigma^2} \right) dy . \]
A direct calculation then leads to
\[ P_Z(z) = \frac{1}{\pi(1 + z^2)} , \]
which corresponds to a Cauchy variable.
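The result is easy to confirm empirically: the ratio of two independent zero-mean Gaussians has the Cauchy quartiles at -1 and +1.

```python
import numpy as np

# Z = X/Y with X, Y independent zero-mean Gaussians is Cauchy:
# median 0, quartiles at -1 and +1.
rng = np.random.default_rng(6)
N = 400000
z = rng.standard_normal(N) / rng.standard_normal(N)
q1, q2, q3 = np.quantile(z, [0.25, 0.5, 0.75])
print(q1, q2, q3)  # close to -1, 0, 1
```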
We have
\[ P_{X,t}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{x^2}{2\sigma^2} \right) . \]
Furthermore, P_{Y,t}(y)dy = P_{X,t}(x)dx and Y_\lambda(t) = g(t)X_\lambda(t). As g(t) is strictly positive, the transformation y = g(t)x is bijective and dy = g(t)dx. We thus have
\[ P_{Y,t}(y) = \frac{1}{\sqrt{2\pi}\,g(t)\sigma} \exp\left[ -\frac{y^2}{2g(t)^2\sigma^2} \right] . \]
\[ \left\langle h(t - T_\lambda) \right\rangle = \int_0^T h(t - \tau) P_T(\tau)\, d\tau = \frac{1}{T}\int_0^T h(t - \tau)\, d\tau , \]
and so
\[ \left\langle h(t - T_\lambda) \right\rangle = \frac{1}{T}\int_0^T h(\xi)\, d\xi , \]
which leads to a result independent of t. Likewise,
and hence,
and hence,
It is enough for X_\lambda(t) to be stationary and ergodic up to second order moments. The proof is immediate.
\[ \left\langle Y_\lambda(t) Y_\lambda(t+\tau) \right\rangle = \left\langle \left\{ a_1 X_\lambda(t) + a_2 \left[ X_\lambda(t) \right]^2 \right\} \left\{ a_1 X_\lambda(t+\tau) + a_2 \left[ X_\lambda(t+\tau) \right]^2 \right\} \right\rangle , \]
or
and hence,
We thus see that weak stationarity is not enough. On the other hand, if X_\lambda(t) is stationary up to fourth order moments, the quantities
\[ \left\langle X_\lambda(t) \right\rangle , \quad \left\langle X_\lambda(t)X_\lambda(t+\tau_1) \right\rangle , \quad \left\langle X_\lambda(t)X_\lambda(t+\tau_1)X_\lambda(t+\tau_2) \right\rangle , \quad \left\langle X_\lambda(t)X_\lambda(t+\tau_1)X_\lambda(t+\tau_2)X_\lambda(t+\tau_3) \right\rangle \]
are independent of t, and in this case Y_\lambda(t) is stationary up to second order moments.
Let us now address the question of weak ergodicity. We have Y_\lambda(t) = a_1 X_\lambda(t) + a_2\left[ X_\lambda(t) \right]^2 and
We thus see that weak ergodicity is not sufficient. However, if X_\lambda(t) is ergodic up to fourth order moments, the quantities
(1) We have
\[ \hat{f}_\lambda(\nu) = a\,\delta\left( \nu - \frac{1}{T} \right) \exp(-i\varphi_\lambda) , \]
where \delta(x) is the Dirac distribution. The phase of \hat{f}_\lambda(\nu) is thus -\varphi_\lambda.
(2) We have \langle f_\lambda(t) \rangle = a\exp(2i\pi t/T)\langle \exp(-i\varphi_\lambda) \rangle. f_\lambda(t) is not therefore weakly stationary in general. However, if \langle \exp(-i\varphi_\lambda) \rangle = 0, we have \langle f_\lambda(t) \rangle = 0 and it is then worth taking the analysis to second order.
In this case, \langle f_\lambda^*(t) f_\lambda(t+\tau) \rangle = a^2 \exp(2i\pi\tau/T), and we thus observe that f_\lambda(t) is stationary up to second order moments. The condition \langle \exp(-i\varphi_\lambda) \rangle = 0 is equivalent to
9.2 Chapter Three. Fluctuations and Covariance 239
\[ \int_0^{2\pi} \exp(-i\phi)\, P_\phi(\phi)\, d\phi = 0 . \]
This condition is fulfilled, for example, if \phi_\lambda is a random variable uniformly distributed over the interval [0, 2\pi], i.e., P_\phi(\phi) = 1/2\pi in the interval [0, 2\pi].
(3) \langle \hat{f}_\lambda(\nu) \rangle = a\,\delta(\nu - 1/T)\langle \exp(-i\phi_\lambda) \rangle. When f_\lambda(t) is weakly stationary, we thus have \langle \hat{f}_\lambda(\nu) \rangle = 0. \hat{f}_\lambda(\nu) is therefore a complex random variable with zero mean. In particular, if \phi_\lambda is a random variable distributed uniformly over the interval [0, 2\pi], \hat{f}_\lambda(\nu) is an isotropic complex random variable, i.e., the probability density functions of \hat{f}_\lambda(\nu) and \exp(-i\varphi)\hat{f}_\lambda(\nu) are equal, for all \varphi \in [0, 2\pi].
(4) We have
n=-oo
and we can therefore generalize the last result for each frequency \nu = n/T.
(5) We have
\[ \hat{f}_\lambda^*(\nu_1)\,\hat{f}_\lambda(\nu_2) = \sum_{n=-\infty}^{\infty}\sum_{m=-\infty}^{\infty} a_n^* a_m\, \exp\left[ i(\phi_{n,\lambda} - \phi_{m,\lambda}) \right]\, \delta\left( \nu_1 - \frac{n}{T} \right) \delta\left( \nu_2 - \frac{m}{T} \right) , \]
and hence,
We first analyze the case when \nu_1 \neq \nu_2. Since the \phi_{n,\lambda} are independent random variables distributed uniformly over the interval [0, 2\pi], when n \neq m, we have \langle \exp[i(\phi_{n,\lambda} - \phi_{m,\lambda})] \rangle = 0. We thus have
\[ \left\langle \hat{f}_\lambda^*(\nu_1)\,\hat{f}_\lambda(\nu_2) \right\rangle = \sum_{n=-\infty}^{\infty} |a_n|^2\, \delta\left( \nu_1 - \frac{n}{T} \right) \delta\left( \nu_2 - \frac{n}{T} \right) = 0 , \]
since \nu_1 \neq \nu_2, by hypothesis. \hat{f}_\lambda(\nu_1) and \hat{f}_\lambda(\nu_2) are then uncorrelated. If \nu_1 = \nu_2 = \nu, \hat{f}_\lambda^*(\nu)\hat{f}_\lambda(\nu) is not defined because \delta(\nu - n/T)\,\delta(\nu - n/T) is not a distribution. However, the coefficient in front of this term will be |a_n|^2.
(1) Let f(t) \otimes g(t) be the convolution of f(t) with g(t) defined by
or
\[ S_{ss}(\nu) = \frac{\sigma_B^2}{2\nu_B}\left\{ \left[ (1-a) + a\cos(2\pi\nu\tau) \right]^2 + \left[ a\sin(2\pi\nu\tau) \right]^2 \right\} \]
if \nu \in [-\nu_B, \nu_B] and 0 otherwise.
\[ \left\langle Y_\lambda^*(t)Y_\lambda(t+\tau) \right\rangle = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \left\langle X_\lambda^*(t-\xi_1)\, X_\lambda(t+\tau-\xi_2) \right\rangle h^*(\xi_1)h(\xi_2)\, d\xi_1 d\xi_2 = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \Gamma_{XX}(\xi_1+\tau-\xi_2)\, h^*(\xi_1)h(\xi_2)\, d\xi_1 d\xi_2 , \]
and hence,
\[ \left\langle Y_\lambda^*(t)Y_\lambda(t+\tau) \right\rangle = \frac{\sigma^2}{B}\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \delta(\xi_1+\tau-\xi_2)\, h^*(\xi_1)h(\xi_2)\, d\xi_1 d\xi_2 . \]
Now, if \tau is positive,
\[ \int_{-\infty}^{\infty} h^*(\zeta)h(\zeta+\tau)\, d\zeta = \frac{a}{2}\exp(-a\tau) . \]
If \tau is negative, we have
\[ \int_{-\infty}^{\infty} h^*(\zeta)h(\zeta+\tau)\, d\zeta = \frac{a}{2}\exp(-a|\tau|) , \]
and hence,
\[ \left\langle Y_\lambda^*(t)Y_\lambda(t+\tau) \right\rangle = \frac{a\sigma^2}{2B}\exp(-a|\tau|) . \]
(2) The total power of the fluctuations after filtering is thus
\[ P_Y = \left\langle Y_\lambda^*(t)Y_\lambda(t) \right\rangle = \frac{a\sigma^2}{2B} . \]
where
\[ Y_\lambda(t) = X_\lambda(t) \otimes \mathrm{Rect}_{0,T}(t) . \]
(1) Let f(t) \otimes g(t) be the convolution of f(t) with g(t). We first note that
and hence,
\[ S_{YY}(\nu) = \left[ \frac{\sin(\pi\nu T)}{\pi\nu} \right]^2 S_{XX}(\nu) . \]
\[ S_{XX}(\nu) = 1 , \]
and hence,
\[ S_{YY}(\nu) = \left[ \frac{\sin(\pi\nu T)}{\pi\nu} \right]^2 . \]
(4) The inverse transform of \left[ (1/\pi\nu)\sin(\pi\nu T) \right]^2 is the autocorrelation function of \mathrm{Rect}_{0,T}(t). The value of this autocorrelation function at 0 is therefore T. We deduce that \int_{-\infty}^{\infty} S_{YY}(\nu)\, d\nu = T. The power of Y_\lambda(t) is then proportional to T. This result should be compared with the one which says that the variance of the sum of N independent and identically distributed random variables is proportional to N.
(1) We have
whereupon
or
\[ \Gamma_{XY}(\tau) = \int_{-\infty}^{\infty} \Gamma_{XX}(\tau - \xi)\, h(\xi)\, d\xi . \]
(2) The last equation is a convolution relation, so Fourier transforming yields
\[ S_{XY}(\nu) = S_{XX}(\nu)\,\hat{h}(\nu) , \]
where we have assumed that the Fourier transforms S_{XY}(\nu) and S_{XX}(\nu) of \Gamma_{XY}(\tau) and \Gamma_{XX}(\tau) exist.
9.3 Chapter Four. Limit Theorems and Fluctuations 243
\[ \sum_{m=1}^{N} \delta_{i+n-j-m} = \begin{cases} 1 & \text{if } 0 < i+n-j \leq N , \\ 0 & \text{otherwise,} \end{cases} \]
and hence,
(3) The probability density function of the sum of two independent random
variables is obtained by convolution of the probability density functions
of each random variable. The convolution of the two indicator functions
\mathrm{Rect}_{-a,a}(b), defined by
\[ \mathrm{Rect}_{-a,a}(b) = \begin{cases} 1 & \text{if } |b| \leq a , \\ 0 & \text{otherwise,} \end{cases} \]
is
\[ P_S(s) = \begin{cases} \dfrac{1}{2a}\left( 1 - \dfrac{|s|}{2a} \right) & \text{if } |s| \leq 2a , \\ 0 & \text{otherwise.} \end{cases} \]
(4) If N \to +\infty, the central limit theorem is applicable and we deduce that the probability density of S_i will tend toward the normal probability density function. It has zero mean and variance \langle S_i^2 \rangle = \sigma_B^2.
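The triangular density above can be verified directly against a histogram (a = 1 in this sketch):

```python
import numpy as np

# Histogram check of the triangular density of the sum of two
# independent uniform variables on [-a, a], here with a = 1.
rng = np.random.default_rng(7)
a, N = 1.0, 1000000
s = rng.uniform(-a, a, N) + rng.uniform(-a, a, N)
hist, edges = np.histogram(s, bins=40, range=(-2 * a, 2 * a), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
expected = (1.0 / (2 * a)) * (1.0 - np.abs(centers) / (2 * a))
print(np.max(np.abs(hist - expected)))  # small
```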
(5) The central limit theorem is still applicable and we deduce that the proba-
bility density function of Si will tend toward the normal distribution. The
mean is still zero and its variance can be found as follows. We have
\[ \left\langle S_i^2 \right\rangle = \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{N} a_j a_k \left\langle B_{i+j} B_{i+k} \right\rangle = \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{N} a_j a_k\, \sigma_B^2\, \delta_{k-j} . \]
Hence,
(2) Consider the last model and let X_i be the binary random variable which is equal to 1 if an electron has passed during the i th time interval of length \delta T and 0 otherwise. We thus have m = \sum_{i=1}^{N} X_i and therefore \langle m \rangle = \sum_{i=1}^{N} \langle X_i \rangle = Np and
\[ \left\langle m^2 \right\rangle = \sum_{i=1}^{N}\sum_{j=1}^{N} \left\langle X_i X_j \right\rangle = Np + N(N-1)p^2 . \]
\[ P_X(x) = \int_{-\infty}^{\infty} \theta(x-\xi)\,\theta(\xi)\,\frac{1}{a}\exp\left( -\frac{x-\xi}{a} \right)\frac{1}{a}\exp\left( -\frac{\xi}{a} \right) d\xi , \]
where
\[ \theta(u) = \begin{cases} 1 & \text{if } u > 0 , \\ 0 & \text{otherwise.} \end{cases} \]
We thus obtain
\[ P_X(x) = \theta(x)\,\frac{x}{a^2}\exp\left( -\frac{x}{a} \right) . \]
The probability density function of the difference of two independent random variables is obtained by correlation of the probability density functions for each random variable. We have
\[ P_Y(y) = \int_{-\infty}^{\infty} \theta(y+\xi)\,\theta(\xi)\,\frac{1}{a}\exp\left( -\frac{y+\xi}{a} \right)\frac{1}{a}\exp\left( -\frac{\xi}{a} \right) d\xi , \]
or
\[ P_Y(y) = \frac{1}{a^2}\exp\left( -\frac{y}{a} \right)\int_{-\infty}^{\infty} \theta(y+\xi)\,\theta(\xi)\exp\left( -\frac{2\xi}{a} \right) d\xi . \]
Let us consider the two cases y \geq 0 and y \leq 0 separately. When y \geq 0,
\[ P_Y(y) = \frac{1}{a^2}\exp\left( -\frac{y}{a} \right)\int_0^{\infty} \exp\left( -\frac{2\xi}{a} \right) d\xi = \frac{1}{2a}\exp\left( -\frac{y}{a} \right) . \]
This result can be written in the form
\[ P_Y(y) = \frac{1}{2a}\exp\left( -\frac{|y|}{a} \right) \]
for any value of y.
(1) Put P_r(r) = p\,\delta(r-1) + s\,\delta(r) + q\,\delta(r+1), where \delta(x) is the Dirac distribution. The characteristic function is
\[ \hat{P}_r(\nu) = p\exp(i\nu) + s + q\exp(-i\nu) , \]
and since the jumps are independent, we have
\[ \hat{P}_{R_n}(\nu) = \left[ p\exp(i\nu) + s + q\exp(-i\nu) \right]^n . \]
(2) First method: direct calculation. We have R_n = \sum_{i=1}^{n} r_i and hence \langle R_n \rangle = \sum_{i=1}^{n} \langle r_i \rangle. Now \langle r_i \rangle = p - q, whereupon \langle R_n \rangle = n(p-q). Moreover, \langle R_n^2 \rangle = \sum_{i=1}^{n}\sum_{j=1}^{n} \langle r_i r_j \rangle. Now
\[ \left\langle r_i r_j \right\rangle = \left[ p + q - (p-q)^2 \right]\delta_{i-j} + (p-q)^2 , \]
and thus
\[ \left\langle R_n^2 \right\rangle = \sum_{i=1}^{n}\sum_{j=1}^{n} \left\{ \left[ p + q - (p-q)^2 \right]\delta_{i-j} + (p-q)^2 \right\} . \]
We then obtain
and therefore
\[ \frac{\partial^2}{\partial\nu^2}\hat{P}_{R_n}(\nu) = n\left[ -p\exp(i\nu) - q\exp(-i\nu) \right]\left[ p\exp(i\nu) + s + q\exp(-i\nu) \right]^{n-1} + n(n-1)\left[ ip\exp(i\nu) - iq\exp(-i\nu) \right]^2\left[ p\exp(i\nu) + s + q\exp(-i\nu) \right]^{n-2} , \]
and hence,
\[ \frac{\partial^2}{\partial\nu^2}\hat{P}_{R_n}(0) = -n(p+q) - n(n-1)(p-q)^2 . \]
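The moment formulas are easy to check by simulation (the values of p, s, q are illustrative):

```python
import numpy as np

# Simulation check of <R_n> = n(p - q) and
# Var(R_n) = n[p + q - (p - q)^2] for the three-valued jump law.
rng = np.random.default_rng(8)
p, s, q, n, trials = 0.5, 0.2, 0.3, 50, 200000
steps = rng.choice([1, 0, -1], p=[p, s, q], size=(trials, n))
R = steps.sum(axis=1)
print(R.mean(), n * (p - q))                # both close to 10
print(R.var(), n * (p + q - (p - q) ** 2))  # both close to 38
```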
(1) P(r) has finite first and second moments. We deduce that R_n/\sqrt{n} will be normally distributed. From the symmetry of P(r), we have \langle r_i \rangle = 0 and so \langle R_n \rangle = 0. Further, \langle r_i^2 \rangle = 2\int_0^{\infty} r^2 P(r)\, dr, or \langle r_i^2 \rangle = 2, and hence \langle R_n^2 \rangle = 2n.
(2) P(r) does not have finite first and second moments. The characteristic
function of P(r) is (see Section 4.2)
\[ \hat{P}(\nu) = \exp(-|\nu|) . \]
(1) We have
\[ P(x,t) = \sum_{n=-\infty}^{\infty} \frac{1}{\sqrt{2\pi t}\,\sigma} \exp\left[ -\frac{(x - na)^2}{2\sigma^2 t} \right] . \]
(2) The restriction of the last solution to the interval [0,1] is a solution of
the partial differential equation which describes diffusion, with boundary
condition that the derivatives of the concentration should be equal at 0
and 1 (for the problem is invariant under translation by whole numbers n).
Consider a circle of unit circumference and let x be curvilinear coordinates
defined on [-1/2,1/2]. From the symmetry of the problem, the concen-
trations are equal at -1/2 and 1/2 and the derivative is continuous. We
thus have two problems governed by the same partial differential equation
with the same boundary conditions and the same initial conditions. The
solutions must therefore be the same. We deduce that, for a circle of radius
R,
\[ P(x,t) = \sum_{n=-\infty}^{\infty} \frac{1}{\sqrt{2\pi t}\,\sigma} \exp\left[ -\frac{(x - 2n\pi R)^2}{2\sigma^2 t} \right] . \]
\[ \hat{P}_Z(\nu) = \exp\left( -a|\nu| - \frac{\sigma^2\nu^2}{2} \right) . \]
This result shows that the asymptotic behavior of the random walk is
totally conditioned by the Cauchy distribution, i.e., by the large deviations
corresponding here to the flea's jumps.
(1) As the X_n are strictly positive, we can set Z_n = \ln Y_n. We then find that
\[ Z_n = \sum_{i=1}^{n} \ln X_i . \]
If m_{\log} = \int_{-\infty}^{\infty} P_X(x)\ln x\, dx and \sigma_{\log}^2 = \int_{-\infty}^{\infty} P_X(x)(\ln x)^2\, dx - m_{\log}^2 exist, we can apply the central limit theorem. For large n, the probability density function of Z_n is approximately
\[ P_Z(z) = \frac{1}{\sqrt{2\pi}\,\sigma_{\log}} \exp\left[ -\frac{(z - m_{\log})^2}{2\sigma_{\log}^2} \right] , \]
\[ P_Y(y) = \frac{1}{\sqrt{2\pi}\, y\, \sigma_{\log}} \exp\left[ -\frac{(\ln y - m_{\log})^2}{2\sigma_{\log}^2} \right] , \]
and
\[ P_Y(y) = \frac{1}{2\sqrt{2\pi}\,|y|\,\sigma_{\log}} \exp\left[ -\frac{(\ln|y| - 2m_{\log})^2}{2\sigma_{\log}^2} \right] . \]
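The log-normal limit of a product of positive variables can be checked numerically (uniform factors on [0.5, 1.5] are an illustrative choice):

```python
import numpy as np

# Z_n = ln Y_n is a sum of i.i.d. terms, hence approximately Gaussian
# by the central limit theorem (so Y_n is approximately log-normal).
rng = np.random.default_rng(9)
n, trials = 50, 200000
x = rng.uniform(0.5, 1.5, size=(trials, n))
z = np.log(x).sum(axis=1)
# analytic E[ln X] for X uniform on [0.5, 1.5]
m1 = (1.5 * np.log(1.5) - 1.5) - (0.5 * np.log(0.5) - 0.5)
print(z.mean(), n * m1)  # close to each other
skew = np.mean(((z - z.mean()) / z.std()) ** 3)
print(skew)              # small: z is nearly Gaussian
```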
(2) We have
\[ S = -\left[ \frac{1}{N} + (N-1)\alpha \right]\ln\left[ \frac{1}{N} + (N-1)\alpha \right] - \left[ 1 - \frac{1}{N} - (N-1)\alpha \right]\ln\left( \frac{1}{N} - \alpha \right) . \]
(3) We know that the entropy is maximal when there is equiprobability, i.e., when \alpha = 0. This can be checked by setting \partial S/\partial\alpha = 0.
(1) We have
\[ P_E(e) = \frac{1}{\pi^2 \det\Gamma} \exp\left( -e^\dagger \Gamma^{-1} e \right) , \]
where \det\Gamma is the determinant of \Gamma.
(2) The entropy of the system is S = -\int P_E(e)\ln P_E(e)\, de, and hence,
\[ S = \ln\left( \pi^2 \det\Gamma \right) + \int \left( e^\dagger \Gamma^{-1} e \right) P_E(e)\, de . \]
9.4 Chapter Five. Information and Fluctuations 251
Now,
and therefore,
\[ 1 - \mathcal{P}^2 = \frac{4\det\Gamma}{\mathrm{tr}^2\,\Gamma} , \]
where \mathrm{tr}\,\Gamma is the trace of \Gamma. We thus have 1 - \mathcal{P}^2 = 4\det\Gamma/\mathrm{tr}^2\,\Gamma. Now the total intensity I_0 is equal to \mathrm{tr}\,\Gamma and so
\[ P_E(e) = \frac{1}{\pi^d \det\Gamma} \exp\left( -e^\dagger \Gamma^{-1} e \right) , \]
and
\[ S_d = \ln\left( \pi^d e^d \det\Gamma \right) . \]
We now introduce the dimensionless quantity a = \det\Gamma/\mathrm{tr}^d\,\Gamma, whereupon we obtain
\[ S_d = \ln\left( \pi^d e^d a I_0^d \right) . \]
This expression can be expanded to bring out a term more directly related
to the disorder, since it turns out to be the same for all situations which
differ only by a multiplicative factor:
\[ S_d = d + d\ln(\pi I_0) + \ln a . \]
There are several ways to define a degree of polarization. In order to push the analysis a little further, it is interesting to express the last term of the entropy in terms of the eigenvalues \lambda_1, \lambda_2, \ldots, \lambda_d of the covariance matrix:
In 3 dimensions, we obtain
and hence,
\[ P_a(n) = \int_{n_0-\delta/2}^{n_0+\delta/2} p_a^{(c)}(x)\, dx , \qquad P_b(n) = \int_{n_0-\delta/2}^{n_0+\delta/2} p_b^{(c)}(x)\, dx . \]
The Kullback measure of the two probability laws for N). is
and we obtain
where ( )a indicates the expectation value with the probability law Pa(x).
It is easy to show that
and consequently,
\[ K_u(P_a\|P_b) = \sum_{n} P_a(n)\ln\frac{P_a(n)}{P_b(n)} = \left\langle \ln P_a(n) - \ln P_b(n) \right\rangle_a = \lambda_a\left( \frac{\lambda_b}{\lambda_a} - \ln\frac{\lambda_b}{\lambda_a} - 1 \right) . \]
Here again, we can check that this measure is always positive.
We have
\[ \langle n \rangle_a = \frac{a_a}{1 - a_a} , \]
and
\[ K_u(P_a\|P_b) = \ln\frac{1-a_a}{1-a_b} + \frac{a_a}{1-a_a}\ln\frac{a_a}{a_b} . \]
(1) We seek the probability law Ps(n) which minimizes Ku(PsIIPb) for a given
value of Ku(PsIIPa). The problem can thus be formulated as minimizing
We then find
and hence,
\lambda is the Lagrange parameter associated with the constraint \sum_{n} P_s(n) = 1, and we can therefore say that
with
\[ C(s) = \sum_{n} P_b^s(n)\, P_a^{1-s}(n) , \]
and s = 1/(1 - \mu).
(2) We have
\[ K_u(P_s\|P_a) = \sum_{n} P_s(n)\left[ \ln P_s(n) - \ln P_a(n) \right] , \]
and
\[ K_u(P_s\|P_b) = \sum_{n} P_s(n)\left[ \ln P_s(n) - \ln P_b(n) \right] . \]
The value of s^* which corresponds to K_u(P_{s^*}\|P_a) = K_u(P_{s^*}\|P_b) thus satisfies
\[ \sum_{n} P_{s^*}(n)\ln P_a(n) = \sum_{n} P_{s^*}(n)\ln P_b(n) . \]
The value of s which minimizes C(s) satisfies dC(s)/ds = 0, i.e.,
\[ \sum_{n} \frac{\partial}{\partial s}\left[ P_b^s(n)\, P_a^{1-s}(n) \right] = 0 , \]
and hence,
\[ \sum_{n} P_s(n)\left[ \ln P_b(n) - \ln P_a(n) \right] = 0 . \]
It is thus clear that the value of s^* which minimizes C(s) is the same as that for which we have K_u(P_{s^*}\|P_a) = K_u(P_{s^*}\|P_b). [One can show that it is indeed a minimum by checking that \partial^2 C/\partial s^2 \geq 0.]
(3) We have
\[ K_u(P_{s^*}\|P_a) = \sum_{n} P_{s^*}(n)\left[ \ln P_{s^*}(n) - \ln P_a(n) \right] , \]
and
\[ P_{s^*}(n) = P_b^{s^*}(n)\, P_a^{1-s^*}(n)/C(s^*) . \]
Hence,
or
whereupon,
Now,
so that
\mathcal{C}(P_a\|P_b) = -\ln C(s^*) is the Chernov measure for P_a(n) and P_b(n), where
(1) We have
\[ \frac{d}{ds}\ln C(s) = \frac{d}{ds}\ln\left[ \sum_{n} P_b^s(n)\, P_a^{1-s}(n) \right] = \frac{1}{C(s)}\frac{d}{ds}\sum_{n} \exp\left[ s\ln P_b(n) + (1-s)\ln P_a(n) \right] = \frac{1}{C(s)}\sum_{n} \left[ \ln P_b(n) - \ln P_a(n) \right] P_b^s(n)\, P_a^{1-s}(n) , \]
and hence,
\[ \left. \frac{d}{ds}\ln C(s) \right|_{s=0} = \sum_{n} P_a(n)\ln\frac{P_b(n)}{P_a(n)} . \]
Now,
\[ K_u(P_a\|P_b) = \sum_{n} P_a(n)\ln\frac{P_a(n)}{P_b(n)} , \]
and hence,
(2) We set
\[ \ln C(s) \approx A + Bs + Cs^2 . \]
Now \ln C(0) = 0, so A = 0. Moreover, \ln C(1) = 0, whereupon B + C = 0, and
\[ \ln C(s) \approx Bs - Bs^2 = Bs(1-s) . \]
9.5 Chapter Six. Statistical Physics 259
where
Z{j =
When the temperature tends to zero, each particle tends to be in the state with the lowest energy. Only the state with E(n) = E_1 for all n = 1, \ldots, N then has nonzero probability.
(1) We have
If we set
(2) We have
But
\[ \ln Z_\beta = -\beta N E_0 + N\ln 2 + N\ln\cosh(\alpha\beta h) , \]
and hence,
\[ U_\beta = N E_0 - N\alpha h\, \frac{e^{\alpha\beta h} - e^{-\alpha\beta h}}{e^{\alpha\beta h} + e^{-\alpha\beta h}} = N\left[ E_0 - \alpha h\tanh(\alpha\beta h) \right] , \]
where |h| is the absolute value of h. We thus find that U_\beta < N E_0.
(4) We have \beta h = h/k_B T. If T \to 0, it follows that the particles will be located for the main part in the state with lowest energy, viz., E_0 - \alpha|h|, and the total energy is of the order of N(E_0 - \alpha|h|). There is an analogous situation if, at finite temperature, the modulus |h| of the field tends to +\infty.
(1) We have
\[ P_\ell = \frac{\exp\left[ -\beta(E_0 - h m_\ell) \right]}{\displaystyle\sum_{\ell=-n}^{n} \exp\left[ -\beta(E_0 - h m_\ell) \right]} , \]
and hence,
and
\[ \ln Z_{\beta,h} = N\ln\left\{ \sum_{\ell=-n}^{n} \exp\left[ -\beta(E_0 - h m_\ell) \right] \right\} , \qquad \langle m \rangle = \frac{1}{N\beta}\,\frac{\partial \ln Z_{\beta,h}}{\partial h} . \]
\[ \frac{1}{N}\,\frac{\partial^2 \ln Z_{\beta,h}}{\partial h^2} = \beta^2\left[ \sum_{\ell=-n}^{n} m_\ell^2 P_\ell - \left( \sum_{\ell=-n}^{n} m_\ell P_\ell \right)^2 \right] , \]
or
\[ \frac{1}{N\beta^2}\,\frac{\partial^2 \ln Z_{\beta,h}}{\partial h^2} = \left\langle m^2 \right\rangle - \left\langle m \right\rangle^2 . \]
\[ z_\beta = \sum_{n=0}^{+\infty} \exp(-\beta m g p\, n) = \frac{1}{1 - \exp(-\beta m g p)} , \]
and
\[ S_\beta = -N k_B \ln\left[ 1 - \exp(-\beta m g p) \right] + \frac{m g p N}{T}\,\frac{\exp(-\beta m g p)}{1 - \exp(-\beta m g p)} . \]
(2) When T \to 0, \exp(-\beta m g p) \to 0 and (1/T)\exp(-\beta m g p) \to 0. Therefore, S_\beta \to 0, because
\[ S_\beta \approx \frac{m g p N}{T}\exp(-\beta m g p) . \]
(3) When T \to +\infty, \beta \to 0, \exp(-\beta m g p) \approx 1 - \beta m g p and hence,
\[ S_\beta \approx -N k_B\ln(\beta m g p) + \frac{m g p N}{T}\,\frac{1 - \beta m g p}{\beta m g p} \approx N k_B\left[ \ln(k_B T) + 1 - \ln(m g p) \right] . \]
(4) When p \to 0, S_\beta \to +\infty. This is understandable because, when p \to 0, the number of degrees of freedom diverges as we tend toward a continuum of states.
where x_j describes the state of the j th site: x_j is equal to 0 if the adsorbed particle is of type A and 1 if it is of type B. The Gibbs distribution leads to
where
\[ Z_\beta = \sum_{x_j=0,1} \exp\left\{ -\beta\left[ x_j E_B + (1-x_j)E_A \right] \right\} , \]
and where \beta = 1/k_B T.
(1) We have
\[ z_\beta = \sum_{n=0}^{+\infty} e^{-\beta n E_0} = \sum_{n=0}^{+\infty} a^n , \]
where a = e^{-\beta E_0}. We deduce that \ln z_\beta = -\ln(1-a), and since Z_\beta = z_\beta^N,
(3) When T \to +\infty, we have \beta E_0 \to 0 and hence e^{-\beta E_0} \approx 1 - \beta E_0. We can then write
\[ S_\beta \approx -k_B N\ln(\beta E_0) \approx k_B N\ln\frac{k_B T}{E_0} . \]
When T \to 0, we have \beta E_0 \to +\infty and hence e^{-\beta E_0} \to 0. It follows that
\[ P(0) = \frac{1}{1 + e^{-\beta E_0}} \quad\text{and}\quad P(1) = \frac{e^{-\beta E_0}}{1 + e^{-\beta E_0}} . \]
We have
\[ U_\beta = \frac{\sum_X H(X)\exp\left[ -\beta H(X) \right]}{\sum_X \exp\left[ -\beta H(X) \right]} , \]
whereupon
\[ \frac{\partial U_\beta}{\partial\beta} = -\left( \left\langle H^2(X) \right\rangle - \left\langle H(X) \right\rangle^2 \right) . \]
\[ X(t) = \begin{cases} x_0\,\dfrac{t}{\tau_2}\exp\left( -\dfrac{t^2}{2\tau_2^2} \right) & \text{if } t \geq 0 , \\ 0 & \text{if } t < 0 . \end{cases} \]
(1) We saw in Section 6.7 that the spectral density of charges S_{QQ}(\nu) at the terminals of an RC circuit is given by
\[ S_{QQ}(\nu) = 2k_B T\,\frac{RC^2}{1 + (2\pi RC\nu)^2} . \]
(2) We have
\[ \lim_{R\to 0} S_{QQ}(\nu) = 0 . \]
(3) We saw in Section 6.4 that the total power of the fluctuations in an extensive quantity M(X) is given by
(4) We observe that the total power of the fluctuations on the capacitor plates
is independent of the value of the resistance. However, we must have
\[ \lim_{R\to 0}\int_{-\infty}^{+\infty} S_{QQ}(\nu)\, d\nu = k_B T C , \]
and hence,
\[ \lim_{R\to 0}\int_{-\infty}^{+\infty} S_{QQ}(\nu)\, d\nu \neq \int_{-\infty}^{+\infty} \lim_{R\to 0} S_{QQ}(\nu)\, d\nu . \]
If the domain of definition of X_\lambda does not depend on \theta, then the variance of the statistic T(X_\lambda) satisfies the Cramer-Rao inequality
\[ \sigma_T^2(\theta) \geq \frac{\left| \partial h(\theta)/\partial\theta \right|^2}{-\displaystyle\int \frac{\partial^2 \ln L(x|\theta)}{\partial\theta^2}\, L(x|\theta)\, dx} , \]
while for an unbiased estimator,
\[ \sigma_T^2(\theta) \geq \frac{1}{-\displaystyle\int \frac{\partial^2 \ln L(x|\theta)}{\partial\theta^2}\, L(x|\theta)\, dx} . \]
We see that if |\partial h(\theta)/\partial\theta| < 1, in the case of a biased estimator, the Cramer-Rao bound may actually be less than the Cramer-Rao bound of an unbiased estimator. A trivial example of a biased estimator for which the bound is zero is provided by the choice of statistic T(X_\lambda) = 0.
(3) Considering a sample X = \{x_1, x_2, \ldots, x_P\}, the log-likelihood can be written
\[ \ell(X) = \sum_{i=1}^{P} \ln p(x_i) = P\ln b + T_2(X)\ln\frac{1-b}{b} , \]
where T_2(X) = \sum_{i=1}^{P} x_i^2. It is clear that p(x) belongs to the exponential family and that T_2(X) is its sufficient statistic. The maximum likelihood estimator is obtained from
\[ \frac{\partial}{\partial b}\ell(X) = 0 = \frac{P}{b} + T_2(X)\left( -\frac{1}{1-b} - \frac{1}{b} \right) . \]
This leads to
9.6 Chapter Seven. Statistical Estimation 267
where T(X) = \sum_{i=1}^{P} |x_i|. It is clear that P_A(x) belongs to the exponential family and that T(X) is its sufficient statistic.
For PB(x), the log-likelihood can be written
p p
This estimator is unbiased and depends only on the sufficient statistic T(X) of a probability density function belonging to the exponential family. It thus attains the minimal variance. Moreover, since the estimator is proportional to the sufficient statistic, which can be efficiently estimated, it is itself efficient.
(2) We have
and hence,
\[ \frac{\partial}{\partial n}\ln B(n,p) = T_1(X)/P . \]
Likewise for p, we find that
\[ \frac{\partial}{\partial p}\ln B(n,p) = T_2(X)/P . \]
\[ x = \frac{y}{1+y} \quad\text{and}\quad \frac{dx}{dy} = \frac{1}{(1+y)^2} , \]
and hence,
\[ P_Y(y) = \frac{1}{B(n,p)}\,\frac{y^{n-1}}{(1+y)^{n+p}} . \]
(4) Consider now the sample X' = \{y_1, y_2, \ldots, y_P\}. We have
\[ \frac{\partial}{\partial n}\ln B(n,p) = \left[ T_3(X') - T_4(X') \right]/P . \]
Likewise for p, we obtain
\[ \frac{\partial}{\partial p}\ln B(n,p) = -T_4(X')/P . \]
(2) For concreteness, consider a sample X = \{x_1, x_2, \ldots, x_P\}. In this case,
\[ \hat{\theta}_{MM}(X) = \frac{2}{P}\sum_{j=1}^{P} x_j , \]
where the notation means that we must choose the largest value of the x_j for \hat{\theta}.
(4) The uniform distribution is not in the exponential family and, in contrast
to the situation where the probability law does belong to this family, we
cannot assert that this estimator attains the minimal variance.
(5) We now have
\[ P_X(x) = \begin{cases} 1/(2\theta) & \text{if } x \in [-\theta, \theta] , \\ 0 & \text{otherwise.} \end{cases} \]
(6) Consider once again a sample X = \{x_1, x_2, \ldots, x_P\}. In this case, the estimator
\[ \hat{\theta}_{MM}(X) = \frac{1}{P}\sum_{j=1}^{P} x_j \]
whence
where |x_j| is the absolute value of x_j. The choice between \hat{\theta}_{MM'}(X) and \hat{\theta}_{MM''}(X) can be made by comparing the bias and variance of each estimator.
where the notation means that we must choose the largest value of the |x_j| for \hat{\theta}.
Solution to Exercise 7.6
(1) The probability density function of X_\lambda is
where
\[ I_F = -P\left\langle \frac{\partial^2}{\partial\theta^2}\ln P_X(x) \right\rangle . \]
We have
\[ \frac{\partial^2}{\partial\theta^2}\ln P_X(x) = -\frac{1}{\sigma_0^2} - 12c(x-\theta)^2 , \]
and therefore,
\[ I_F = P\left[ \frac{1}{\sigma_0^2} + 12c\left\langle (x-\theta)^2 \right\rangle \right] , \quad\text{or}\quad I_F = P\left( \frac{1}{\sigma_0^2} + 12c\,\sigma^2 \right) , \]
(2) When c = 0, we have \sigma^2 = \sigma_0^2. When c > 0, we then have CRB < \sigma_0^2/P. We can interpret this result by observing that, as c increases, the probability density of X_\lambda concentrates around \theta, and this leads to a lower CRB.
(1) We must have \sum_{n=0}^{+\infty} P(n) = 1. P(n) \geq 0 implies that a \geq 0, and \sum_{n=0}^{+\infty} P(n) = 1 then implies a > 0. However, \sum_{n=0}^{+\infty} P(n) < \infty implies that a < 1. We deduce that
\[ 0 < a < 1 . \]
Further, in this case, \sum_{n=0}^{+\infty} a^n = 1/(1-a) and thus A = 1 - a.
(2) We can write
\[ P(n) = \exp\left[ n\ln a + \ln(1-a) \right] , \]
which shows that P(n) belongs to the exponential family. Considering a sample X = \{n_1, n_2, \ldots, n_P\}, the log-likelihood can be written accordingly, and the maximum likelihood condition is
\[ \frac{\partial \ell}{\partial a}(X) = 0 = \sum_{j=1}^{P} \frac{n_j}{a} - \frac{P}{1-a} , \]
or
\[ \left[ \frac{\hat{a}}{1-\hat{a}} \right]_{ML} = \frac{1}{P}\sum_{j=1}^{P} n_j . \]
T(X) = L~=1 nj is the sufficient statistic for the probability law. Only
statistics proportional to this one can be efficiently estimated. It is easy
to show that
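A quick Monte Carlo sketch of this solution, assuming the law P(n) = (1 − a)aⁿ for n = 0, 1, 2, ... and hypothetical values of a and P: the ML estimate solves a/(1 − a) = sample mean, i.e. â = n̄/(1 + n̄).

```python
import numpy as np

# Monte Carlo check of the ML estimator for P(n) = (1 - a) a^n, n >= 0.
# The values of a and P are hypothetical, chosen for illustration.
rng = np.random.default_rng(0)
a, P = 0.6, 100000

# numpy's geometric law counts trials (support 1, 2, ...); subtract 1 for n >= 0
n = rng.geometric(p=1.0 - a, size=P) - 1
nbar = n.mean()                 # estimates a/(1 - a)
a_ml = nbar / (1.0 + nbar)      # ML estimate of a
```

With P large, â should be close to the true a, consistent with T(X)/P being an unbiased estimator of a/(1 − a).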
Let x_i^ℓ be the gray level of the i-th pixel in the region under consideration and in the image labeled by ℓ. We have
P_θ(x_i^ℓ) = [ α^α (x_i^ℓ)^{α−1} / ( θ^α Γ(α) ) ] exp( −α x_i^ℓ / θ ) .
The log-likelihood is thus
ℓ(X) = −(α/θ) Σ_{ℓ=1}^M Σ_{i=1}^P x_i^ℓ + (α − 1) Σ_{ℓ=1}^M Σ_{i=1}^P ln x_i^ℓ + MP [ α ln α − α ln θ − ln Γ(α) ] .
The maximum likelihood estimator for θ is therefore
θ̂_ML = (1/MP) Σ_{ℓ=1}^M Σ_{i=1}^P x_i^ℓ .
Let x_i^ℓ be the gray level of the i-th pixel in the region under consideration and in the image labeled by ℓ. Proceeding in the same way for the new probability law, we have
θ̂_ML = (1/MP) Σ_{ℓ=1}^M Σ_{i=1}^P (x_i^ℓ)^α .
We see that, rather than averaging the gray levels of the M images, we must average the gray levels to the power of α.
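The first, plain-average estimator above can be sketched by Monte Carlo, assuming Gamma speckle of order α with mean θ (shape α, scale θ/α); the values of α, θ, M and P are hypothetical:

```python
import numpy as np

# Monte Carlo sketch of the Gamma-speckle ML estimator: for M images of P
# pixels with Gamma-distributed gray levels of order alpha and mean theta,
# the ML estimate of theta is the plain average over all M*P gray levels.
# alpha, theta, M, P are hypothetical, chosen for illustration.
rng = np.random.default_rng(0)
alpha, theta, M, P = 3.0, 2.0, 10, 10000

# Gamma law with shape alpha and scale theta/alpha has mean theta
x = rng.gamma(shape=alpha, scale=theta / alpha, size=(M, P))
theta_ml = x.mean()             # (1/MP) * sum over images and pixels
```

The estimate concentrates around θ as MP grows, consistent with the unbiasedness of the sample mean here.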
9.7 Chapter Eight. Examples of Estimation in Physics 273
Considering a sample X = {x₁, x₂, ..., x_P} as usual, the log-likelihood can be written
ℓ(X) = −Σ_{j=1}^P (ln x_j − m)²/(2σ²) − Σ_{j=1}^P ln( x_j σ √(2π) ) .
This result shows that P_X(x) belongs to the exponential family and that T(X) = Σ_{j=1}^P ln x_j is a sufficient statistic for m. The maximum likelihood estimator of m is obtained from
∂ℓ/∂m (X) = (1/σ²) Σ_{j=1}^P (ln x_j − m) = 0 ,
and hence,
m̂_ML(X) = (1/P) Σ_{j=1}^P ln x_j .
We thus have
⟨m̂_ML(X)⟩ = ⟨ln x_j⟩ = ∫₀^{+∞} [ ln x / ( x σ √(2π) ) ] exp[ −(ln x − m)²/(2σ²) ] dx = m ,
so that the estimator is unbiased.
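A Monte Carlo sketch of this log-normal solution, with hypothetical values of m, σ and P:

```python
import numpy as np

# Monte Carlo check that m_hat = (1/P) sum(ln x_j) is unbiased for
# log-normal data, since <ln x> = m.  m, sigma and P are hypothetical.
rng = np.random.default_rng(0)
m, sigma, P = 0.5, 0.8, 200000

x = rng.lognormal(mean=m, sigma=sigma, size=P)
m_ml = np.log(x).mean()         # ML estimator of m
```

Note that `lognormal(mean, sigma)` in NumPy parametrizes the underlying normal law, i.e. exactly the m and σ of the solution.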
where ω = 2π/T and t_i = iT/N. With X = {x₁, x₂, ..., x_N}, the log-likelihood is
ℓ(X) = −(1/2σ²) Σ_{i=1}^N [ x_i − a sin(ωt_i + φ) ]² − N ln( √(2π) σ ) .
We thus have
∂ℓ/∂a (X) = −(1/σ²) Σ_{i=1}^N sin(ωt_i + φ) [ a sin(ωt_i + φ) − x_i ] ,
∂ℓ/∂φ (X) = −(1/σ²) Σ_{i=1}^N a cos(ωt_i + φ) [ a sin(ωt_i + φ) − x_i ] ,
∂²ℓ/∂φ² (X) = (1/σ²) Σ_{i=1}^N a sin(ωt_i + φ) [ a sin(ωt_i + φ) − x_i ] − (1/σ²) Σ_{i=1}^N a² cos²(ωt_i + φ) ,
∂²ℓ/∂φ∂a (X) = −(1/σ²) Σ_{i=1}^N cos(ωt_i + φ) [ a sin(ωt_i + φ) − x_i ] − (1/σ²) Σ_{i=1}^N a cos(ωt_i + φ) sin(ωt_i + φ) .
However, we have
Σ_{i=1}^N sin²(ωt_i + φ) = Σ_{i=1}^N sin²(2πi/N + φ) = N/2 ,
Σ_{i=1}^N cos²(ωt_i + φ) = Σ_{i=1}^N cos²(2πi/N + φ) = N/2 ,
and
Σ_{i=1}^N cos(ωt_i + φ) sin(ωt_i + φ) = Σ_{i=1}^N cos(2πi/N + φ) sin(2πi/N + φ) = 0 ,
which leads, on taking expectation values, to
⟨∂²ℓ/∂a²⟩ = −N/(2σ²) , ⟨∂²ℓ/∂φ²⟩ = −N a²/(2σ²) , ⟨∂²ℓ/∂φ∂a⟩ = 0 ,
with δa(x) = â(x) − a and δφ(x) = φ̂(x) − φ, because we have assumed that the estimators are unbiased and hence ⟨â(x)⟩ = a and ⟨φ̂(x)⟩ = φ. We deduce that
⟨δa(x)²⟩ ≥ 2σ²/N and ⟨δφ(x)²⟩ ≥ 2σ²/(N a²) .
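The amplitude bound can be probed by Monte Carlo. A minimal sketch, assuming (as an equivalent formulation for Gaussian noise) that the ML fit of x_i = a sin(ωt_i + φ) + noise is a linear least-squares fit on the sin/cos basis; N, a, φ, σ and the trial count are hypothetical:

```python
import numpy as np

# Monte Carlo sketch: the variance of the ML amplitude estimate should
# approach the Cramer-Rao bound 2 sigma^2 / N derived above.
# N, a, phi, sigma and trials are hypothetical, chosen for illustration.
rng = np.random.default_rng(0)
N, a, phi, sigma, trials = 64, 1.0, 0.7, 0.3, 5000

i = np.arange(1, N + 1)
wt = 2 * np.pi * i / N                         # w t_i with t_i = i T / N
H = np.column_stack([np.sin(wt), np.cos(wt)])  # a sin(wt+phi) = c sin(wt) + s cos(wt)
a_hat = np.empty(trials)
for k in range(trials):
    x = a * np.sin(wt + phi) + sigma * rng.standard_normal(N)
    c, s = np.linalg.lstsq(H, x, rcond=None)[0]
    a_hat[k] = np.hypot(c, s)                  # amplitude estimate

crb_a = 2 * sigma**2 / N                       # Cramer-Rao bound for a
var_a = a_hat.var()
```

At this signal-to-noise ratio the empirical variance sits very close to the bound, illustrating the (asymptotic) efficiency of the ML estimator.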
(1) We have
P_Z(z) = ∫₀^{+∞} ∫₀^{+∞} δ(z − x/y) P_X(x) P_Y(y) dx dy .
Substituting v = x/y,
P_Z(z) = ∫₀^{+∞} ∫₀^{+∞} y δ(z − v) P_X(yv) P_Y(y) dv dy = ∫₀^{+∞} y P_X(yz) P_Y(y) dy .
For exponential laws of means I_x and I_y, this gives
P_Z(z) = [ 1/(I_x I_y) ] ∫₀^{+∞} y exp( −yz/I_x − y/I_y ) dy = I_x I_y / ( I_y z + I_x )² .
In terms of the variable ρ, this becomes
P_ρ(ρ) = 2 I_x I_y / [ I_x + I_y + ρ(I_y − I_x) ]² ,
which can be written more simply as a function of u = (I_x − I_y)/(I_x + I_y):
P_ρ(ρ) = (1 − u²) / [ 2(1 − uρ)² ] .
The log-likelihood of a P-sample X = {ρ₁, ρ₂, ..., ρ_P} is
ℓ(X) = P ln[ (1 − u²)/2 ] − 2 Σ_{i=1}^P ln(1 − uρ_i) .
We thus see that this probability density function does not belong to the exponential family when u is the parameter of the probability law.
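The density of ρ can be checked by Monte Carlo. A minimal sketch, assuming ρ = (x − y)/(x + y) and u = (I_x − I_y)/(I_x + I_y), with hypothetical values of I_x, I_y and the sample size:

```python
import numpy as np

# Monte Carlo check of P_rho(rho) = (1 - u^2) / (2 (1 - u rho)^2):
# compare the empirical mean of rho with the mean of the analytic density.
# Ix, Iy and n are hypothetical, chosen for illustration.
rng = np.random.default_rng(0)
Ix, Iy, n = 3.0, 1.0, 400000
u = (Ix - Iy) / (Ix + Iy)

x = rng.exponential(scale=Ix, size=n)
y = rng.exponential(scale=Iy, size=n)
rho = (x - y) / (x + y)

r = np.linspace(-1 + 1e-6, 1 - 1e-6, 200001)
dr = r[1] - r[0]
p = (1 - u**2) / (2 * (1 - u * r)**2)
norm = p.sum() * dr                  # should be ~= 1
mean_theory = (r * p).sum() * dr
mean_mc = rho.mean()
```

Agreement of the two means (and a unit normalization) is a cheap consistency check on the derived density.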
(2) We have β = ln z, which is a bijective transformation, so that we can write
P_β(β) dβ = P_Z(z) dz .
We deduce that
P_β(β) = I_x I_y e^β / ( I_y e^β + I_x )² .
Setting ξ = ln(I_x/I_y), the log-likelihood can be written
ℓ(X) = −2 Σ_{i=1}^P ln[ φ(β_i − ξ) ] ,
where φ(x) = exp(x/2) + exp(−x/2). We thus have
∂ℓ/∂ξ (X) = 2 Σ_{i=1}^P φ′(β_i − ξ) / φ(β_i − ξ) ,
and
∂²ℓ/∂ξ² (X) = −2 Σ_{i=1}^P 1 / [ exp( (β_i − ξ)/2 ) + exp( −(β_i − ξ)/2 ) ]² .
Taking the expectation value, I_F = 2P ⟨ 1/φ(β − ξ)² ⟩, whereupon
CRB = 1/(2PI) , with I = ⟨ 1/φ(β − ξ)² ⟩ .
(1) For Gaussian noise, the log-likelihood is
ℓ(X) = −(1/2σ²) Σ_{i=1}^P (x_i − iθ)² − P ln( √(2π) σ ) .
The derivative is then
∂ℓ/∂θ (X) = (1/σ²) Σ_{i=1}^P i (x_i − iθ) ,
which vanishes for θ̂_ML(X) = Σ_{i=1}^P i x_i / Σ_{i=1}^P i². We have ⟨θ̂_ML(X)⟩ = θ, which shows that the maximum likelihood estimator is unbiased. The second derivative of the log-likelihood is ∂²ℓ/∂θ² (X) = −(1/σ²) Σ_{i=1}^P i², which gives the Fisher information I_F = ( Σ_{i=1}^P i² )/σ², whilst the Cramér–Rao bound for estimation of θ is σ² / Σ_{i=1}^P i².
(2) For Poisson noise, the maximum likelihood estimator is
θ̂_ML(X) = Σ_{i=1}^P x_i / Σ_{i=1}^P i .
We have ⟨θ̂_ML(X)⟩ = θ, which shows that the maximum likelihood estimator is unbiased. The second derivative of the log-likelihood is
∂²ℓ/∂θ² (X) = −(1/θ²) Σ_{i=1}^P x_i ,
which gives the Fisher information I_F = ( Σ_{i=1}^P i )/θ, whilst the Cramér–Rao bound for estimation of θ is
θ / Σ_{i=1}^P i .
(3) The noise is now Gamma noise and we can write
ℓ(X) = −(α/θ) Σ_{i=1}^P x_i/i + (α − 1) Σ_{i=1}^P ln x_i − Pα ln θ − α Σ_{i=1}^P ln i + Pα ln α − P ln Γ(α) .
The derivative is then
∂ℓ/∂θ (x) = (α/θ²) Σ_{i=1}^P x_i/i − Pα/θ ,
which vanishes for θ̂_ML(X) = (1/P) Σ_{i=1}^P x_i/i. We have ⟨θ̂_ML(X)⟩ = θ, which shows that the maximum likelihood estimator is unbiased. The second derivative of the log-likelihood is
∂²ℓ/∂θ² (x) = −(2α/θ³) Σ_{i=1}^P x_i/i + Pα/θ² ,
and hence, I_F = Pα/θ² and the Cramér–Rao bound is θ²/(αP).
(4) The various maximum likelihood estimators can be written as linear combinations of the measurements. We consider the three cases below.
Gaussian Case:
θ̂_G(X) = Σ_{i=1}^P a_i^G x_i , where a_i^G = i / Σ_{i=1}^P i² .
Poisson Case:
θ̂_P(X) = Σ_{i=1}^P a_i^P x_i , where a_i^P = 1 / Σ_{i=1}^P i .
Gamma Case:
θ̂_Γ(X) = Σ_{i=1}^P a_i^Γ x_i , where a_i^Γ = 1/(P i) .
The variance of each estimator is ⟨δθ̂(X)²⟩ = Σ_{i=1}^P a_i² ⟨δx_i²⟩, where δθ̂(X) = θ̂(X) − ⟨θ̂(X)⟩ and δx_i = x_i − ⟨x_i⟩. We consider the three cases in turn.
Gaussian Case:
⟨δθ̂_G(X)²⟩_G = σ² Σ_{i=1}^P (a_i^G)² ,
where ⟨ ⟩_G indicates that averages are taken under the assumption that the variables are Gaussian. We then obtain
⟨δθ̂_G(X)²⟩_G = σ² / Σ_{i=1}^P i² .
The variance of the estimator is equal to the Cramér–Rao bound and the estimator is therefore efficient.
Poisson Case:
⟨δθ̂_P(X)²⟩_P = Σ_{i=1}^P (a_i^P)² ⟨δx_i²⟩_P ,
where ⟨ ⟩_P indicates that averages are taken under the assumption that we have Poisson variables. We thus obtain
⟨δθ̂_P(X)²⟩_P = θ / Σ_{i=1}^P i .
The variance of the estimator is equal to the Cramér–Rao bound and the estimator is therefore efficient.
Gamma Case:
⟨δθ̂_Γ(X)²⟩_Γ = (1/(α P²)) Σ_{i=1}^P θ² ,
where ⟨ ⟩_Γ indicates that averages are taken under the assumption that we have Gamma variables. We thus obtain
⟨δθ̂_Γ(X)²⟩_Γ = θ²/(αP) .
The variance of the estimator is equal to the Cramér–Rao bound and the estimator is therefore efficient.
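The efficiency claims can be sketched by Monte Carlo; here is the Poisson case, assuming ⟨x_i⟩ = iθ and hypothetical values of θ, P and the trial count:

```python
import numpy as np

# Monte Carlo sketch of the Poisson case: the ML estimator sum(x_i)/sum(i)
# is unbiased and its variance attains the Cramer-Rao bound theta/sum(i).
# theta, P and trials are hypothetical, chosen for illustration.
rng = np.random.default_rng(0)
theta, P, trials = 2.0, 20, 50000

i = np.arange(1, P + 1)
x = rng.poisson(lam=i * theta, size=(trials, P))
theta_p = x.sum(axis=1) / i.sum()          # theta_hat_P = sum x_i / sum i

crb = theta / i.sum()                      # Cramer-Rao bound
var_p = theta_p.var()
```

The empirical variance matches the bound up to Monte Carlo error, as expected for an efficient estimator.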
(5) It is easy to check that θ̂_G(X) remains unbiased if the data are distributed according to a Poisson law. We have ⟨δx_i²⟩_P = iθ, and the variance of the estimator is therefore
⟨δθ̂_G(X)²⟩_P = θ Σ_{i=1}^P i³ / ( Σ_{i=1}^P i² )² .
We thus have
⟨δθ̂_G(X)²⟩_P / ⟨δθ̂_P(X)²⟩_P = ( Σ_{i=1}^P i³ )( Σ_{i=1}^P i ) / ( Σ_{i=1}^P i² )² .
It is easy to show that
( Σ_{i=1}^P i³ )( Σ_{i=1}^P i ) / ( Σ_{i=1}^P i² )² ≥ 1 ,
and we thus deduce that the least squares estimator is not efficient, unlike the maximum likelihood estimator, because its variance is greater.
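The inequality above is Cauchy–Schwarz applied to i^{3/2} and i^{1/2}; it is immediate to evaluate the ratio for a concrete (hypothetical) sample size:

```python
import numpy as np

# Evaluate (sum i^3)(sum i) / (sum i^2)^2 for P = 20 (hypothetical value):
# the ratio exceeds 1, quantifying the excess variance of least squares
# relative to ML under Poisson noise.
P = 20
i = np.arange(1, P + 1)
ratio = (i**3).sum() * i.sum() / (i**2).sum() ** 2
```

For P = 1 the ratio equals 1 and the two estimators coincide; it grows above 1 as P increases.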
(6) In the Gamma case, ⟨x_i⟩_Γ = αiθ and hence ⟨θ̂_G(X)⟩_Γ = αθ. It is better to consider the unbiased estimator θ̂_G(X)/α. We thus have
⟨δθ̂_G(X)²⟩_Γ / ⟨δθ̂_Γ(X)²⟩_Γ = P Σ_{i=1}^P i⁴ / ( Σ_{i=1}^P i² )² ≥ 1 ,
and we may thus deduce that the least squares estimator is not efficient, unlike the maximum likelihood estimator, because its variance is greater.
References