
ENTROPY AND THE MCMILLAN THEOREM

Let X = {x_1, . . . , x_r} be a finite set. Let µ be a probability measure on X, i.e. µ(x_j) = p_j ≥ 0 with ∑_{j=1}^{r} p_j = 1. For a finite probability distribution {p_j}, the entropy of the distribution is

H(µ) = − ∑_{j=1}^{r} p_j log p_j ,

where log denotes the natural logarithm.

Note that h(x) = −x log x is non-negative and concave for x ∈ [0, 1] and h(0) = h(1) = 0.
The intuition from information theory is that the entropy H(µ) of a discrete probability
measure µ is a measure of the degree to which µ is uniform. The larger the entropy, the
more uniform the measure.

Exercise: Show that if X = {x_1, . . . , x_r} is a finite set, then the probability measure of maximal entropy on X is the uniform measure assigning probability 1/r to each point. Thus the maximal entropy is log r.
If µ = δ_{x_{j_0}} is the measure assigning probability 1 to a single point x_{j_0}, then H(µ) = 0 and µ is a probability measure of minimal entropy – it is the least uniform measure.
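
As a quick numerical illustration (not part of the original notes): a minimal Python sketch, assuming numpy is available, that computes H(µ) and checks that the uniform measure attains log r while a point mass gives 0. The particular distributions below are arbitrary illustrative choices.

import numpy as np

def entropy(p):
    """H(mu) = -sum_j p_j log p_j, with the convention 0 * log 0 = 0 (natural log)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

r = 4
print(entropy(np.full(r, 1.0 / r)))    # uniform measure: equals log r
print(np.log(r))                       # maximal value log r, about 1.386
print(entropy([1.0, 0.0, 0.0, 0.0]))   # point mass delta_{x_1}: 0.0
print(entropy([0.5, 0.25, 0.15, 0.1])) # an intermediate value between 0 and log r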

The McMillan theorem states an AEP (asymptotic equipartition property). Roughly speaking, it asserts that if a probability measure µ on a finite set X has entropy H(µ), then as N → ∞ the product measure concentrates on a (specified) set G_N ⊂ Ω_N of cardinality ≃ e^{N H(µ)} and is roughly uniform on that set. We now state the result precisely.

0.0.1. Notation. Let Ω_n = X^{{1,...,n}} be the n-fold Cartesian product of X. Thus,

Ω_n = {ω⃗ = (a_1, . . . , a_n) : a_j ∈ {1, . . . , r}}.

We refer to elements of Ω_n as ‘paths’. Henceforth we write ω⃗ = ω. Let π_k : Ω_n → X be π_k(ω) = ω_k. This is often written as X_k as well, so that product measure is equivalent to the X_k being i.i.d. ∼ µ.
Let µ_n = µ × µ × · · · × µ be product measure. Thus,

µ_n(ω) = ∏_{j=1}^{r} p_j^{ν_j(ω)} ,

where
ν_j(ω) = #{i : ω_i = j}.
The product measure can therefore be re-written in the form:

Lemma 0.1.
µ_n(ω) = e^{−n ∑_{k=1}^{r} (−(ν_k(ω)/n) log p_k)} .
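
To see Lemma 0.1 concretely, here is a small hedged Python sketch (assuming numpy; the measure p and the path length n are arbitrary illustrative choices, not from the notes). It computes µ_n(ω) three ways: directly as a product over coordinates, via the counts ν_j, and via the exponential form of the Lemma.

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])      # a probability measure on X = {1, 2, 3}
n = 20
omega = rng.choice(len(p), size=n, p=p)        # a path in Omega_n, coordinates i.i.d. ~ mu
nu = np.bincount(omega, minlength=len(p))      # nu_j(omega) = #{i : omega_i = j}

direct    = np.prod(p[omega])                               # mu_n(omega) = prod_i p_{omega_i}
by_counts = np.prod(p ** nu)                                # prod_j p_j^{nu_j(omega)}
exponent  = np.exp(-n * np.sum(-(nu / n) * np.log(p)))      # the form in Lemma 0.1
print(direct, by_counts, exponent)   # all three agree (up to rounding)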

Date: January 23, 2019.



0.1. Some important random variables and sets. For ω ∈ Ω_n let

P_n(ω_1, . . . , ω_n) = − (1/n) log µ_n(ω_1, . . . , ω_n) = − (1/n) ∑_{j=1}^{n} log µ(ω_j).    (1)

The second family of important random variables are the ν_j. For each j we introduce i.i.d. random variables on Ω_n:

ξ_k^j(ω) := 1_{π_k(ω) = j} = { 1 if ω_k = j, 0 if ω_k ≠ j },    (k = 1, . . . , n).

Then,

∑_{k=1}^{n} ξ_k^j(ω) = ν_j(ω).

We must actually consider the random variables ξ_k^j as defined on Ω_∞ in order to have them defined on a single probability space as n → ∞. If π_k : Ω_∞ → X is the projection to the kth coordinate, then ξ_k^j := 1_{{j}} ◦ π_k where 1_{{j}} ∈ C[X].
Define

G_N(δ) := {ω ∈ Ω_N : | −(1/N) log µ_N(ω) − H(µ) | ≤ δ} ⊂ Ω_N,
                                                                        (2)
C_N(δ) := {ω ∈ Ω_N : | ν_j(ω)/N − p_j | < δ for all j = 1, . . . , r}.

0.2. Weak law of large numbers. The weak law of large numbers says that P(|S_n/n − E_P X| > δ) → 0 for all δ > 0. Here S_n = X_1 + · · · + X_n, where the X_i are i.i.d. and E_P X is their common expectation. We would like to apply it to show that µ_N(G_N(δ)) → 1 and µ_N(C_N(δ)) → 1.

Proposition 0.2. Let X = {1, . . . , r}. Then, for any ε > 0, µ_N(C_N(ε)) → 1.
Proof. Note that

µ_n(C_n(ε)) ≥ 1 − ∑_{j=1}^{r} µ_n( ω : | ν_j(ω)/n − p_j | > ε ).

We want to show that the sum on the RHS tends to zero. Since the number of summands is fixed at r, it suffices to show that each term tends to zero.

Lemma 0.3. For each j ∈ {1, . . . , r} define ν_j as above. Then

µ_∞{ ω : | ν_j/n − p_j | > ε } → 0   as n → ∞.
Proof:
First note that Eν_j = np_j. Indeed, the ξ_k^j above are i.i.d. and

∑_{k=1}^{n} ξ_k^j(ω) = ν_j(ω).

Hence, E[ν_j/n] = Eξ_1^j = E 1_{{j}} = p_j.

Since for each fixed j the ξ_k^j are i.i.d. on Ω_∞, the WLLN gives

ν_j/n = (1/n) ∑_{k=1}^{n} ξ_k^j → p_j   in probability

as n → ∞.
This proves the Lemma and Proposition 0.2.
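
A minimal Monte Carlo sketch of Proposition 0.2 (Python with numpy; the measure p, the tolerance ε, and the trial counts are illustrative assumptions, not from the notes). It estimates µ_n(C_n(ε)) by sampling paths and checking whether all empirical frequencies ν_j/n fall within ε of p_j.

import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])
eps, trials = 0.05, 2000

def prob_C(n):
    """Monte Carlo estimate of mu_n(C_n(eps)) = P(|nu_j/n - p_j| < eps for all j)."""
    hits = 0
    for _ in range(trials):
        omega = rng.choice(len(p), size=n, p=p)
        nu = np.bincount(omega, minlength=len(p))
        hits += np.all(np.abs(nu / n - p) < eps)
    return hits / trials

for n in (50, 200, 1000):
    print(n, prob_C(n))   # the estimates increase toward 1 as n grows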


The next Proposition concerns G_N(ε).

Proposition 0.4.

lim_{N→∞} µ_N( | −(1/N) log µ_N(ω_1, . . . , ω_N) − H(µ) | > ε ) = 0,   (∀ ε > 0).
Proof.

Lemma 0.5. For any N, E P_N(ω_1, . . . , ω_N) = H(µ).

Proof:
E P_N(ω_1, . . . , ω_N) = −(1/N) E log µ_N(ω_1, . . . , ω_N)
   = −(1/N) ∑_{j=1}^{N} E log µ(ω_j) = −E log µ(ω_1)
   = ∑_{ℓ=1}^{r} (−p(x_ℓ) log p(x_ℓ)) = H(µ).

Note further that P_N = (1/N) ∑_{j=1}^{N} (−log µ(ω_j)) is an average of i.i.d. random variables. Hence the WLLN implies that P_N − E P_N → 0 in probability, which together with Lemma 0.5 proves Proposition 0.4.
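
The same concentration can be seen numerically. The following hedged Python sketch (numpy assumed; the measure and sample sizes are illustrative choices) samples paths and evaluates P_N(ω) = −(1/N) log µ_N(ω); the sample mean stays at H(µ) and the spread shrinks as N grows.

import numpy as np

rng = np.random.default_rng(2)
p = np.array([0.5, 0.3, 0.2])
H = float(-(p * np.log(p)).sum())      # the entropy H(mu)

def P_N(omega):
    """P_N(omega) = -(1/N) log mu_N(omega) for a path omega."""
    return -np.log(p[omega]).mean()

for N in (10, 100, 10000):
    samples = [P_N(rng.choice(len(p), size=N, p=p)) for _ in range(500)]
    print(N, H, np.mean(samples), np.std(samples))   # mean near H, spread shrinking with N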


1. McMillan’s theorem
Theorem 1.1. Write h = H(µ). For any α, β > 0 there exists N_0(α, β) such that for N > N_0(α, β) (with δ = δ(β) sufficiently small):
(1) µ_N(C_N(δ)) ≥ 1 − α and µ_N(G_N(δ)) ≥ 1 − α.

(2) For every ω ∈ C_N(δ),

e^{−N(h+β)} ≤ p(ω) ≤ e^{−N(h−β)}.
(3)
e^{N(h−β)} ≤ |C_N(δ)| ≤ e^{N(h+β)}.
This says that C_N is almost all of Ω_N from a probabilistic viewpoint, but if h ≪ log r it is very small from a cardinality viewpoint (since |Ω_N| = r^N = e^{N log r}); and p(ω) is almost e^{−Nh} for every element of C_N.
That is, Ω_N = G_N ∪ B_N (a good set and a bad set). In the good set, the probabilities are almost uniform and roughly equal to e^{−NH}; its elements are ‘typical sequences’. The bad set has small probability.
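
A back-of-the-envelope comparison of the two viewpoints (a hedged Python sketch, not part of the notes; the biased measure is an arbitrary choice): log |C_N| grows like Nh while log |Ω_N| = N log r, so the typical set is an exponentially small fraction of Ω_N whenever h < log r.

import numpy as np

p = np.array([0.9, 0.1])                  # a biased measure on a 2-letter alphabet
H = float(-(p * np.log(p)).sum())         # h ≈ 0.325, while log r = log 2 ≈ 0.693
for N in (100, 1000, 10000):
    # log of the typical-set size ~ N*h versus log |Omega_N| = N*log 2;
    # the gap N*(log 2 - h) is the exponential rate at which |C_N|/|Omega_N| -> 0
    print(N, N * H, N * np.log(2))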

2. Proof
2.1. Proofs of (2)-(3). Now write,

µ_n(ω) = e^{−n ∑_{k=1}^{r} (−(ν_k(ω)/n) log p_k)} .    (3)
Lemma 2.1. Suppose that ω ∈ C(n, ε). Then,

| ∑_{k=1}^{r} (−ν_k(ω)/n) log p_k − H(µ) | ≤ − ∑_{k=1}^{r} | ν_k(ω)/n − p_k | log p_k ≤ −ε ∑_{k=1}^{r} log p_k .
The right inequality is immediate from the definition of C(n, ε). The left inequality is just the triangle inequality once one writes out the definition of H(µ).
Note: we did not assume that p_k > 0, so this inequality can be rather weak (if some p_k = 0 the right-hand side is infinite).
If ω ∈ C(n, ε) then, for 1 ≤ k ≤ r,

np_k − εn ≤ ν_k(ω) ≤ np_k + εn.

Hence, using log p_k ≤ 0 (and assuming all p_k > 0),

µ_n(ω) = e^{∑_k ν_k(ω) log p_k} ≤ e^{n ∑_k p_k log p_k − εn ∑_k log p_k} = e^{−n(H(µ) − ε_2)},

where ε_2 := −ε ∑_{k=1}^{r} log p_k is the error term from Lemma 2.1. Similarly,

µ_n(ω) ≥ e^{−n(H(µ) + ε_2)}.

Since ε_2 → 0 as ε → 0, choosing ε small enough that ε_2 ≤ β proves (2).

For (3), write N(C(n, ε)) := |C(n, ε)| for the cardinality of C(n, ε), and note that

P(C(n, ε)) ≥ N(C(n, ε)) min_{ω ∈ C(n,ε)} µ_n(ω).

Hence,

N(C(n, ε)) ≤ P(C(n, ε)) / min_{ω ∈ C(n,ε)} µ_n(ω) ≤ P(C(n, ε)) / e^{−n(H(µ)+ε_2)} ≤ e^{n(H(µ)+ε_2)}.

And similarly,

N(C(n, ε)) ≥ P(C(n, ε)) / max_{ω ∈ C(n,ε)} µ_n(ω) ≥ P(C(n, ε)) / e^{−n(H(µ)−ε_2)}.

By Proposition 0.2, for all ε_1 > 0 there exists N(ε_1) so that for n ≥ N(ε_1), P(C(n, ε)) ≥ 1 − ε_1. Then,

N(C(n, ε)) ≥ (1 − ε_1) / e^{−n(H(µ)−ε_2)} = e^{n(H(µ)−ε_2) + log(1−ε_1)} ≥ e^{n(H(µ)−β)}

for n large, once ε is chosen so small that ε_2 < β. This proves (3).

3. Additional remarks
Suppose that r = 2, so that Ω_n consists of strings of n 0's and 1's. Let µ{1} = p, µ{0} = 1 − p. Then µ_n(ω) = p^{ν_1(ω)} (1 − p)^{ν_0(ω)}. A typical element (X_1, . . . , X_n) has probability p^{np} (1 − p)^{n(1−p)} ≃ e^{−nH}.
Typical elements: the number of 1's is close to np and the number of 0's is close to n(1 − p).
Recall,
ν_j(ω) = #{i : ω_i = j}.
Thus,
ν_j : Ω_n → {0, 1, . . . , n}
are non-negative, integer-valued random variables.
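
For this Bernoulli example one can check both counts directly. Below is a hedged Python sketch using only the standard library (the values of p and n are illustrative): the number of strings with exactly k = np ones is the binomial coefficient C(n, k) ≈ e^{nH} by Stirling's formula, and each such string has probability p^k (1 − p)^{n−k}, which equals e^{−nH} exactly when np is an integer.

from math import comb, log

p, n = 0.3, 200
k = round(n * p)                               # typical number of 1's (here np is an integer)
H = -p * log(p) - (1 - p) * log(1 - p)         # entropy of the Bernoulli(p) measure

log_count = log(comb(n, k))                    # log of the number of strings with exactly k ones
log_prob  = k * log(p) + (n - k) * log(1 - p)  # log mu_n(omega) for any such string

print(log_count / n, H)    # ≈ H: there are roughly e^{nH} typical strings
print(-log_prob / n, H)    # = H: each has probability e^{-nH}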
