
Information-Theoretic Identities, Part 1

Prapun Suksompong
ps92@cornell.edu
May 6, 2008

Abstract

This article reviews some basic results in information theory. It is based on information-theoretic entropy, developed by Shannon in the landmark paper [22].

Contents

1 Mathematical Background and Notation 1
2 Entropy: H 3
  2.1 MEPD: Maximum Entropy Probability Distributions 6
  2.2 Stochastic Processes and Entropy Rate 8
3 Relative Entropy / Informational Divergence / Kullback Leibler "distance": D 8
4 Mutual Information: I 11
5 Functions of random variables 13
6 Markov Chain and Markov Strings 14
  6.1 Homogeneous Markov Chain 15
7 Independence 15
8 Convexity 15
9 Continuous Random Variables 16
  9.1 MEPD 20
  9.2 Stochastic Processes 21
10 General Probability Space 21
11 Typicality and AEP (Asymptotic Equipartition Properties) 22
12 I-measure 26
13 MATLAB 27

1 Mathematical Background and Notation

1.1. Based on continuity arguments, we shall assume that 0 ln 0 = 0, 0 ln(0/q) = 0 for q > 0, p ln(p/0) = ∞ for p > 0, and 0 ln(0/0) = 0.

1.2. log x = (log e)(ln x), d/dx (x log x) = log(ex) = log x + log e, and d/dx [g(x) log g(x)] = g′(x) log(e g(x)).

1.3. Fundamental Inequality: 1 − 1/x ≤ ln(x) ≤ x − 1 with equality if and only if x = 1. Note that the first inequality follows from the second one via replacing x by 1/x. See Figure 1.

Figure 1: Fundamental Inequality (plots of x − 1, ln(x), and 1 − 1/x).

Proof. To show the second inequality, consider f(x) = ln x − x + 1 for x > 0. Then f′(x) = 1/x − 1, and f′(x) = 0 iff x = 1. Also, f″(x) = −1/x² < 0. Hence, f is concave (∩) and attains its maximum value when x = 1. Since f(1) = 0, we have ln x − x + 1 ≤ 0. It is also clear that equality holds iff x = 1. To show the first inequality, use the second inequality with x replaced by 1/x.

• For x > 0 and y ≥ 0, y − y ln y ≤ x − y ln x, with equality if and only if x = y.
• For x ≥ 0, x − x² ≤ −x ln x ≤ 1 − x.
• For x ∈ [0, 1], x − x² ≤ −(1 − x) ln(1 − x) ≤ x. So, for small x, −(1 − x) ln(1 − x) ≈ x.

1.4. Figure 2 shows plots of −p(x) log p(x) and i(x) = −log p(x) on [0, 1]. The first one has its maximum at 1/e. Here i(x) = i(x, x) = log(p(x|x)/p(x)) = log(1/p(x)) = −log p(x) is the self-information of the value x.

Figure 2: −p(x) log p(x) and i(x) = −log p(x).

1.5. Log-sum inequality: For positive numbers a1, a2, . . . and nonnegative numbers b1, b2, . . . such that ∑_i a_i < ∞ and 0 < ∑_i b_i < ∞,

  ∑_i a_i log(a_i/b_i) ≥ (∑_i a_i) log( (∑_i a_i) / (∑_i b_i) ),

with the convention that log(a_i/0) = ∞. Moreover, equality holds if and only if a_i/b_i = constant for all i. Note that x log x is convex ∪.

The proof follows from rewriting the inequality as ∑_i a_i log(a_i/b_i) − ∑_i a_i log(A/B) ≥ 0, where A = ∑_i a_i and B = ∑_i b_i. Then, combine the sums and apply ln(x) ≥ 1 − 1/x.

1.6. The function x ln(x/y) on [0, 1] × [0, 1] is convex ∪ in the pair (x, y).

• For fixed y, it is a convex ∪ function of x, starting at 0, decreasing to its minimum −y/e < 0, then increasing to −ln y > 0.
• For fixed x, it is a decreasing, convex ∪ function of y.

See also Figure 3.

Figure 3: The plots of x ln(x/y) (as a function of x for fixed y, and as a function of y for fixed x).

Proof. d/dx [x ln(x/y)] = ln(x/y) + 1 and d²/dx² [x ln(x/y)] = 1/x > 0, so for fixed y the function is convex ∪ in x. d/dy [x ln(x/y)] = −x/y < 0 and d²/dy² [x ln(x/y)] = x/y² > 0, so for fixed x it is decreasing and convex ∪ in y. Convexity in the pair (x, y), that is, for λ ∈ [0, 1],

  (λx1 + (1 − λ)x2) ln( (λx1 + (1 − λ)x2) / (λy1 + (1 − λ)y2) ) ≤ λ x1 ln(x1/y1) + (1 − λ) x2 ln(x2/y2),

follows by applying the log-sum inequality in the form (a1 + a2) log((a1 + a2)/(b1 + b2)) ≤ a1 log(a1/b1) + a2 log(a2/b2). In fact, this last part already implies the first two.

1.7. For a measure space (X, F, µ), suppose non-negative f and g are µ-integrable. Consider A ∈ F and assume g > 0 on A.

• Divergence/Gibbs Inequality: Suppose ∫_A f dµ = ∫_A g dµ < ∞. Then

  ∫_A f log(f/g) dµ ≥ 0,

or, equivalently,

  −∫_A f log f dµ ≤ −∫_A f log g dµ,

with equality if and only if f = g µ-a.e. on A.

• Log-sum inequality: Suppose ∫_A f dµ, ∫_A g dµ < ∞. Then

  ∫_A f log(f/g) dµ ≥ (∫_A f dµ) log( (∫_A f dµ) / (∫_A g dµ) ).

In fact, the log can be replaced by any function h : [0, ∞] → R such that h(y) ≥ c(1 − 1/y) for some c > 0.

• Pinsker's inequality: Suppose ∫ f dµ = ∫ g dµ = 1 (A = X). Then

  ∫ f log(f/g) dµ ≥ ((log e)/2) (∫ |f − g| dµ)².   (1)

The three inequalities here include (1.5), (3.3), (9.11), and (3.11) as special cases. The log-sum inequality implies the Gibbs inequality. However, it is easier to prove the Gibbs inequality first using the fundamental inequality ln(x) ≥ 1 − 1/x. Then, to prove the log-sum inequality, we let α = ∫_A f dµ, β = ∫_A g dµ and normalize f, g to f̂ = f/α, ĝ = g/β. Note that f̂ and ĝ satisfy the condition for the Gibbs inequality. The log-sum inequality is proved by substituting f = αf̂ and g = βĝ into the LHS.

For Pinsker's inequality, partition the LHS integral into A = [f ≤ g] and Aᶜ. Applying the log-sum inequality to each term gives α ln(α/β) + (1 − α) ln((1 − α)/(1 − β)), which we denote by r(β). Note also that ∫ |f − g| dµ = 2(β − α). Now, r(β) = r(α) + ∫_α^β r′(t) dt where r(α) = 0 and r′(t) = (t − α)/(t(1 − t)) ≥ 4(t − α) for t ∈ [0, 1].

1.8. For n ∈ N = {1, 2, . . .}, we define [n] = {1, 2, . . . , n}.
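As a quick numerical sanity check of the fundamental inequality (1.3) and the log-sum inequality (1.5), the short MATLAB sketch below evaluates both sides on a grid of points and on arbitrarily chosen positive vectors a and b (the particular numbers carry no significance).

% Fundamental inequality (1.3): 1 - 1/x <= ln(x) <= x - 1 on a grid of x > 0
x = logspace(-2, 2, 1000);
assert(all(1 - 1./x <= log(x) + 1e-12))
assert(all(log(x) <= x - 1 + 1e-12))

% Log-sum inequality (1.5) for arbitrary positive a and b
a = rand(1,10) + 0.1;  b = rand(1,10) + 0.1;
lhs = sum(a .* log(a ./ b));
rhs = sum(a) * log(sum(a) / sum(b));
fprintf('log-sum: LHS = %.4f >= RHS = %.4f\n', lhs, rhs)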
We denote random variables by capital letters, e.g., X, and their realizations by lower case letters, e.g., x. The probability mass function (pmf) of the (discrete) random variable X is denoted by pX(x). When the subscript is just the capitalized version of the argument in the parentheses, we will often write simply p(x) or px. A similar convention applies to the (cumulative) distribution function (cdf) FX(x) and the (probability) density function (pdf) fX(x). The distribution of X will be denoted by P^X or L(X).

1.9. Suppose I is an index set. When the Xi's are random variables, we define a random vector X_I by X_I = (Xi : i ∈ I). Then, for disjoint A, B, X_{A∪B} = (X_A, X_B). If I = [n], then we write X_I = X_1^n.
When the Ai's are sets, we define the set A_I by the union ∪_{i∈I} Ai.

We think of information gain (I) as the removal of uncertainty. The quantification of information then necessitates the development of a way to measure the level of uncertainty (H). In what follows, although the entropy (H), relative entropy (D), and mutual information (I) are defined in terms of random variables, their definitions extend to random vectors in a straightforward manner. Any collection X_1^n of discrete random variables can be thought of as a discrete random variable itself.

2 Entropy: H

We begin with the concept of entropy, which is a measure of uncertainty of a random variable [6, p 13]. Let X be a discrete random variable which takes values in an alphabet X.
The entropy H(X) of a discrete random variable X is a functional of the distribution of X defined by

  H(X) = −∑_{x∈X} p(x) log p(x) = −E[log p(X)]
    ≥ 0, with equality iff ∃x ∈ X with p(x) = 1 (X is deterministic);
    ≤ log |X|, with equality iff ∀x ∈ X, p(x) = 1/|X| (X is uniform).

In summary,

  0 ≤ H(X) ≤ log |X|.

The base of the logarithm used in defining H can be chosen to be any convenient real number b > 1. If the base of the logarithm is b, denote the entropy as H_b(X). When using b = 2, the unit for the entropy is [bit]. When using b = e, the unit is [nat].

Remarks:

• The entropy depends only on the (unordered) probabilities (pi) and not the values {x}. Therefore, sometimes, we write H(P^X) instead of H(X) to emphasize that the entropy is a functional of the probability distribution.
• H(X) is 0 if and only if there is no uncertainty, that is, when one of the possible values is certain to happen.
• If the sum in the definition diverges, then the entropy H(X) is infinite.
• 0 ln 0 = 0, so any x whose p(x) = 0 does not contribute to H(X).
• H(X) = log |X| − D(P^X ‖ U) where U is the uniform distribution with the same support.

2.1. Entropy is both shift and scale invariant, that is, ∀a ≠ 0 and ∀b: H(aX + b) = H(X). In fact, for any injective (1-1) function g on X, we have H(g(X)) = H(X).
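The following MATLAB sketch (the pmf p is an arbitrary example) computes H(X) directly from the definition, checks the bounds 0 ≤ H(X) ≤ log|X|, and illustrates the invariance in 2.1: since the entropy depends only on the multiset of probabilities, any one-to-one relabeling of the alphabet leaves it unchanged.

p = [0.5 0.25 0.125 0.125];                 % arbitrary pmf on a 4-symbol alphabet
H = @(p) -sum(p(p > 0) .* log2(p(p > 0)));  % entropy in bits; 0 log 0 treated as 0
fprintf('H(X) = %.4f bits, bounds: 0 and log2|X| = %.4f\n', H(p), log2(numel(p)))
perm = randperm(numel(p));                  % a one-to-one relabeling g
fprintf('H(g(X)) = %.4f bits (unchanged)\n', H(p(perm)))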
2.2. H({p(x)}) is concave (convex ∩) in {p(x)}. That is, ∀λ ∈ [0, 1] and any two p.m.f.'s {p1(x), x ∈ X} and {p2(x), x ∈ X}, we have

  H(p*) ≥ λH(p1) + (1 − λ)H(p2),

where p*(x) ≡ λp1(x) + (1 − λ)p2(x) ∀x ∈ X.

2.3. Asymptotic value of the multinomial coefficient [6, Q11.21 p 406]: Fix a p.m.f. P = (p1, p2, . . . , pm). For i = 1, . . . , m − 1, define a_i = ⌊np_i⌋. Set a_m = n − ∑_{j=1}^{m−1} ⌊np_j⌋ so that ∑_{i=1}^m a_i = n. Then,

  lim_{n→∞} (1/n) log (n choose a1, a2, . . . , am) = lim_{n→∞} (1/n) log ( n! / ∏_{i=1}^m a_i! ) = H(P).   (2)

2.4. (Differential Entropy Bound on Discrete Entropy) For X on {a1, a2, . . .}, let p_i = pX(a_i). Then

  H(X) ≤ (1/2) log( 2πe ( Var^(∗)[X] + 1/12 ) ),

where

  Var^(∗)[X] = ∑_{i∈N} i²p_i − ( ∑_{i∈N} ip_i )²,

which is not the variance of X itself but of an integer-valued random variable with the same probabilities (and hence the same entropy). Moreover, for every permutation σ, we can replace Var^(∗)[X] above by

  Var^(σ)[X] = ∑_{i∈N} i²p_{σ(i)} − ( ∑_{i∈N} ip_{σ(i)} )².

2.5. Example:

• If X ∼ P(λ), then H(X) ≡ H(P(λ)) = λ log e − λ log λ + E[log X!]. Figure 4 plots H(P(λ)) as a function of λ. Note that, by the CLT, we may approximate H(P(λ)) by h(N(λ, λ)) when λ is large (see the numerical comparison below).
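As a numerical companion to 2.5 (λ = 10 is an arbitrary choice and the truncation of the sum is an assumption that the tail is negligible), the MATLAB sketch below computes H(P(λ)) by direct summation and compares it with h(N(λ, λ)) = (1/2) log2(2πeλ).

lambda = 10;                                          % arbitrary example value
k = 0:200;                                            % truncation; the tail is negligible here
p = exp(-lambda + k*log(lambda) - gammaln(k+1));      % Poisson pmf, computed stably
Hpois = -sum(p(p > 0) .* log2(p(p > 0)));             % discrete entropy [bits]
Happrox = 0.5 * log2(2*pi*exp(1)*lambda);             % h(N(lambda,lambda)) [bits]
fprintf('H(P(lambda)) = %.4f bits, Gaussian approximation = %.4f bits\n', Hpois, Happrox)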
X ∼ | Support set X | pX(k) | H(X)
Uniform U_n | {1, 2, . . . , n} | 1/n | log n
Bernoulli B(1, p) | {0, 1} | 1 − p for k = 0; p for k = 1 | hb(p)
Binomial B(n, p) | {0, 1, . . . , n} | (n choose k) p^k (1 − p)^{n−k} |
Geometric G0(p) | N ∪ {0} | (1 − p) p^k | (1/(1 − p)) hb(p) = EX log(1 + 1/EX) + log(1 + EX)
Geometric G1(p) | N | (1 − p)^{k−1} p | (1/p) hb(p)
Poisson P(λ) | N ∪ {0} | e^{−λ} λ^k / k! | λ log e − λ log λ + E[log X!]

Table 1: Examples of probability mass functions and corresponding discrete entropies. Here, p, β ∈ (0, 1) and λ > 0. hb(p) = −p log p − (1 − p) log(1 − p) is the binary entropy function.

X ∼ | fX(x) | h(X)
Uniform U(a, b) | (1/(b − a)) 1_[a,b](x) | log(b − a) = (1/2) log(12σ_X²) ≈ 1.79 + log2 σ_X [bits]
Exponential E(λ) | λ e^{−λx} 1_[0,∞)(x) | log(e/λ) = log(e σ_X) ≈ 1.44 + log2 σ_X [bits]
Shifted Exponential on [s0, ∞), mean µ (µ > s0) | (1/(µ − s0)) e^{−(x−s0)/(µ−s0)} 1_[s0,∞)(x) | log(e(µ − s0))
Bounded Exponential | (α/(e^{−αa} − e^{−αb})) e^{−αx} 1_[a,b](x) | log((e^{−αa} − e^{−αb})/α) + (1 + α(a e^{−αa} − b e^{−αb})/(e^{−αa} − e^{−αb})) log e
Laplacian L(α) | (α/2) e^{−α|x|} | log(2e/α) = (1/2) log(2e²σ_X²) ≈ 1.94 + log2 σ_X [bits]
Normal N(µ, σ²) | (1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)²} | (1/2) log(2πeσ²) ≈ 2.05 + log2 σ [bits]
Gamma Γ(q, λ) | (λ^q x^{q−1} e^{−λx}/Γ(q)) 1_(0,∞)(x) | q(log e) + (1 − q)ψ(q) + log(Γ(q)/λ)
Pareto Par(α) | α x^{−(α+1)} 1_[1,∞)(x) | −log α + (1/α + 1) log e
Par(α, c) = c·Par(α) | (α/c)(c/x)^{α+1} 1_(c,∞)(x) | log(c/α) + (1/α + 1) log e
Beta β(q1, q2) | (Γ(q1+q2)/(Γ(q1)Γ(q2))) x^{q1−1}(1 − x)^{q2−1} 1_(0,1)(x) | log B(q1, q2) − (q1 − 1)(ψ(q1) − ψ(q1+q2)) − (q2 − 1)(ψ(q2) − ψ(q1+q2))
Beta prime | (Γ(q1+q2)/(Γ(q1)Γ(q2))) x^{q1−1}/(x + 1)^{q1+q2} 1_(0,∞)(x) |
Rayleigh | 2αx e^{−αx²} 1_[0,∞)(x) | log(1/(2√α)) + (1 + γ/2) log e
Standard Cauchy | (1/π)(1/(1 + x²)) | log(4π)
Cau(α) | (1/π)(α/(α² + x²)) | log(4πα)
Cau(α, d) | (Γ(d)/(√π α Γ(d − 1/2))) (1/(1 + (x/α)²))^d |
Log Normal e^{N(µ,σ²)} | (1/(σx√(2π))) e^{−(ln x − µ)²/(2σ²)} 1_(0,∞)(x) | (1/2) log(2πeσ²) + µ log e

Table 2: Examples of probability density functions and their entropies. Here, c, α, q, q1, q2, σ, λ are all strictly positive and d > 1/2. γ = −ψ(1) ≈ 0.5772 is the Euler constant. ψ(z) = d/dz log Γ(z) = (log e) Γ′(z)/Γ(z) is the digamma function. B(q1, q2) = Γ(q1)Γ(q2)/Γ(q1+q2) is the beta function.
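Two rows of Table 2 can be checked by numerical integration; the MATLAB sketch below does this for N(0, σ²) and E(λ) with arbitrarily chosen σ and λ (an illustration of the closed forms, not a derivation).

sigma = 2; lambda = 3;                                        % arbitrary parameters
fN = @(x) exp(-x.^2/(2*sigma^2)) / (sigma*sqrt(2*pi));        % N(0,sigma^2) density
hN = integral(@(x) -fN(x).*log2(fN(x)), -40, 40);
fprintf('Normal: numeric %.4f vs closed form %.4f bits\n', hN, 0.5*log2(2*pi*exp(1)*sigma^2))

fE = @(x) lambda*exp(-lambda*x);                              % E(lambda) density
hE = integral(@(x) -fE(x).*log2(fE(x)), 0, 50);
fprintf('Exponential: numeric %.4f vs closed form %.4f bits\n', hE, log2(exp(1)/lambda))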
Figure 4: H(P(λ)) and its approximation by h(N(λ, λ)) = (1/2) log(2πeλ).

Figure 5: Binary Entropy Function.

2.6. Binary Entropy Function: We define hb(p), h(p), or H(p) to be −p log p − (1 − p) log(1 − p), whose plot is shown in Figure 5. The concavity of this function can be regarded as an example of 2.2, 8.1. Some properties of hb are:

(a) H(p) = H(1 − p).

(b) dh/dp (p) = log((1 − p)/p).

(c) d/dx h(g(x)) = (dg/dx)(x) log((1 − g(x))/g(x)).

(d) 2^{h(p)} = p^{−p}(1 − p)^{−(1−p)}.

(e) 2^{−h(p)} = p^p (1 − p)^{(1−p)}.

(f) h(1/b) = log b − ((b − 1)/b) log(b − 1).

(g) (1/(n + 1)) 2^{n hb(r/n)} ≤ (n choose r) ≤ 2^{n hb(r/n)} [5, p 284-286].

(h) lim_{n→∞} (1/n) log (n choose ⌊αn⌋) = lim_{n→∞} (1/n) log ∑_{i=0}^{⌊αn⌋} (n choose i) = hb(α) [6, Q11.21 p 406] (the single coefficient is the "surface"/shell term; the sum is the "volume" term). See also (2).

(i) Quadratic approximation: hb(p) ≈ 4p(1 − p).

There are two bounds for H(p) (here q = 1 − p); a numerical check of both is given below.

• Logarithmic bounds:

  (ln p)(ln q) ≤ (log e) H(p) ≤ (1/ln 2)(ln p)(ln q).

• Power-type bounds:

  (ln 2)(4pq) ≤ (log e) H(p) ≤ (ln 2)(4pq)^{1/ln 4}.
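The MATLAB sketch below (purely illustrative) checks both bounds for p between 0.001 and 0.999, writing (log e)H(p) = −p ln p − q ln q for the binary entropy in nats.

p = linspace(0.001, 0.999, 999);  q = 1 - p;
Hnats = -p.*log(p) - q.*log(q);                            % (log e) H(p): binary entropy in nats
assert(all(log(p).*log(q) <= Hnats + 1e-12))               % logarithmic lower bound
assert(all(Hnats <= log(p).*log(q)/log(2) + 1e-12))        % logarithmic upper bound
assert(all(log(2)*4*p.*q <= Hnats + 1e-12))                % power-type lower bound
assert(all(Hnats <= log(2)*(4*p.*q).^(1/log(4)) + 1e-12))  % power-type upper bound
disp('binary entropy bounds hold on the grid')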
2.7. For two random variables X and Y with a joint p.m.f. p(x, y) and marginal p.m.f.'s p(x) and p(y), the conditional entropy is defined as

  H(Y|X) = −E[log p(Y|X)] = −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x) = ∑_{x∈X} p(x) H(Y|X = x),

where

  H(Y|X = x) = −E[log p(Y|x) | X = x] = −∑_{y∈Y} p(y|x) log p(y|x).

• Note that H(Y|X) is a function of p(x, y), not just p(y|x).

Example 2.8. Thinned Poisson: Suppose we have X ∼ P(λ) and, conditioned on X = x, we define Y to be a binomial r.v. with size x and success probability s:

  X → s → Y.

Then, Y ∼ P(sλ) with H(Y) = H(P(sλ)). Moreover,

  p(x|y) = e^{−λ(1−s)} ((λ(1 − s))^{x−y} / (x − y)!) 1[x ≥ y],

which is simply a Poisson r.v. shifted by y. Consequently,

  H(X|y) = H(P(λ(1 − s))) = H(X|Y).

2.9. 0 ≤ H(Y|X) ≤ H(Y), where the first inequality becomes equality when Y = g(X) and the second when X and Y are independent.

2.10. The discussion above for entropy and conditional entropy is still valid if we replace the random variables X and Y in the discussion above with random vectors. For example, the joint entropy for random variables X and Y is defined as

  H(X, Y) = −E[log p(X, Y)] = −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x, y).

More generally, for a random vector X_1^n,

  H(X_1^n) = −E[log p(X_1^n)] = ∑_{i=1}^n H(Xi | X_1^{i−1}) ≤ ∑_{i=1}^n H(Xi),

with equality iff the Xi's are independent.
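To make 2.7 and 2.10 concrete, the MATLAB sketch below (the joint pmf P is an arbitrary example) computes H(Y|X) in two ways: directly as ∑_x p(x)H(Y|X = x), and as H(X, Y) − H(X).

P = [0.30 0.10; 0.20 0.15; 0.05 0.20];       % arbitrary joint pmf p(x,y); rows index x
H = @(p) -sum(p(p > 0).*log2(p(p > 0)));     % entropy of a pmf (in bits)

HXY = H(P(:));                               % joint entropy H(X,Y)
px  = sum(P, 2);                             % marginal p(x)
HYgX = HXY - H(px);                          % H(Y|X) = H(X,Y) - H(X)

HYgX_direct = 0;                             % H(Y|X) = sum_x p(x) H(Y|X=x)
for i = 1:numel(px)
    HYgX_direct = HYgX_direct + px(i) * H(P(i,:)/px(i));
end
fprintf('H(Y|X): %.4f (difference) vs %.4f (direct)\n', HYgX, HYgX_direct)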
2.11. Chain rule:

  H(X1, X2, . . . , Xn) = ∑_{i=1}^n H(Xi | Xi−1, . . . , X1),

or simply

  H(X_1^n) = ∑_{i=1}^n H(Xi | X_1^{i−1}).

Note that the term in the sum when i = 1 is H(X1). Moreover,

  H(X_1^n) = ∑_{i=1}^n H(Xi | X_{i+1}^n).

In particular, for two variables,

  H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).

The chain rule is still true with conditioning:

  H(X_1^n | Z) = ∑_{i=1}^n H(Xi | X_1^{i−1}, Z).

In particular,

  H(X, Y | Z) = H(X|Z) + H(Y|X, Z) = H(Y|Z) + H(X|Y, Z).

2.12. Conditioning only reduces entropy:

• H(Y|X) ≤ H(Y) with equality if and only if X and Y are independent. That is, H({p(x)p(y)}) = H({p(y)}) + H({p(x)}).
• H(X|Y) ≥ H(X|Y, Z) with equality if and only if, given Y, X and Z are independent, i.e., p(x, z|y) = p(x|y)p(z|y).

2.13. H(X|X) = 0.

2.14. H(X, Y) ≥ max{H(X), H(Y|X), H(Y), H(X|Y)}.

2.15. H(X1, X2, . . . , Xn | Y) ≤ ∑_{i=1}^n H(Xi | Y) with equality if and only if X1, X2, . . . , Xn are independent conditioning on Y (that is, p(x_1^n | y) = ∏_{i=1}^n p(xi | y)).

2.16. Suppose the Xi's are independent.

• Let Y = X1 + X2. Then, H(Y|X1) = H(X2) ≤ H(Y) and H(Y|X2) = H(X1) ≤ H(Y).
• If two finite index sets satisfy J ⊂ I, then H(∑_{j∈J} Xj) ≤ H(∑_{i∈I} Xi). Also, H((1/|J|) ∑_{j∈J} Xj) ≤ H((1/|I|) ∑_{i∈I} Xi).
• More specifically, H((1/n) ∑_{i=1}^n Xi) is an increasing function of n, as the sketch below illustrates.
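The sketch below (illustrative; the Xi are taken to be fair Bernoulli trials, so the sum is Binomial(n, 1/2) and, by 2.1, scaling by 1/n does not change the entropy) computes H(X1 + · · · + Xn) for a few n and shows that it increases.

for n = 1:2:9
    k = 0:n;
    p = exp(gammaln(n+1) - gammaln(k+1) - gammaln(n-k+1)) * 0.5^n;  % Binomial(n,1/2) pmf
    fprintf('n = %d:  H(X_1 + ... + X_n) = %.4f bits\n', n, -sum(p.*log2(p)))
end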
2.17. Suppose X is a random variable on X. Consider A ⊂ X. Let P_A = P[X ∈ A]. Then, H(X) ≤ h(P_A) + (1 − P_A) log(|X| − |A|) + P_A log |A|.

2.18. Fano's Inequality: Suppose U and V are random variables on a common alphabet set of cardinality M. Let E = 1_{U≠V} and Pe = P[E = 1] = P[U ≠ V]. Then,

  H(U|V) ≤ h(Pe) + Pe log(M − 1)

with equality if and only if

  P[U = u, V = v] = P[V = v] × { Pe/(M − 1), if u ≠ v;  1 − Pe, if u = v }.

Note that if Pe = 0, then H(U|V) = 0.

2.19. Extended Fano's Inequality: Let U_1^L, V_1^L ∈ U^L = V^L where |U| = |V| = M. Define P_{e,ℓ} = P[Vℓ ≠ Uℓ] and P̄e = (1/L) ∑_{ℓ=1}^L P_{e,ℓ}. Then,

  (1/L) H(U_1^L | V_1^L) ≤ h(P̄e) + P̄e log(M − 1).

2.20 (Han's Inequality).

  H(X_1^n) ≤ (1/(n − 1)) ∑_{i=1}^n H(X_{[n]\{i}}).

2.21. Independent addition (in an additive group G) increases entropy in a sublinear way: For two random variables X and Z,

  H(X) ≤ H(X ⊕ Z) ≤ H(X) + H(Z)

[4]. To see this, note that conditioning only reduces entropy; therefore,

  H(X) = H(X ⊕ Z | Z) ≤ H(X ⊕ Z).

For the upper bound, X ⊕ Z is simply a function of (X, Z), and we know that H(g(X, Z)) ≤ H(X, Z) ≤ H(X) + H(Z). The independence between X and Z is only used in the first equality above. See also Figure 6.

Figure 6: Information diagram for X, Z, and X + Z. The atoms H(X | X+Z, Z), H(Z | X+Z, X), and H(X+Z | X, Z) are all zero; the remaining atoms are labeled a, b, c, d, with b = −a if X and Z are independent.

2.22. Majorization: Given two probability distributions p = (p0 ≥ p1 ≥ · · · ≥ pm) and q = (q0 ≥ q1 ≥ · · · ≥ qm), we say that p majorizes q if, for all k, ∑_{i≤k} p_i ≥ ∑_{i≤k} q_i. In that case, H(p) ≤ H(q). [17]
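A small MATLAB check of Fano's inequality (2.18); the joint pmf of (U, V) below is an arbitrary example on an alphabet of size M = 3 and is meant only as an illustration.

P = [0.30 0.03 0.02;                        % arbitrary joint pmf P(U=u, V=v)
     0.04 0.25 0.03;
     0.02 0.05 0.26];
M  = size(P,1);
Pe = 1 - sum(diag(P));                      % P[U ~= V]
pV = sum(P, 1);                             % marginal of V
H  = @(p) -sum(p(p > 0).*log2(p(p > 0)));
HUgV = 0;                                   % H(U|V) = sum_v P(V=v) H(U|V=v)
for j = 1:M
    HUgV = HUgV + pV(j) * H(P(:,j)/pV(j));
end
bound = H([Pe 1-Pe]) + Pe*log2(M-1);        % h(Pe) + Pe log(M-1)
fprintf('H(U|V) = %.4f <= %.4f = h(Pe) + Pe log(M-1)\n', HUgV, bound)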
2.1 MEPD: Maximum Entropy Probability Distributions

Consider fixed (1) a countable (finite or infinite) S ⊂ R and (2) functions g1, . . . , gm on S. Let C be a class of probability mass functions pX which are supported on S (pX = 0 on Sᶜ) and satisfy the moment constraints E[g_k(X)] = µ_k, for 1 ≤ k ≤ m. Let the PMF p* be the MEPD for C. Define a PMF q on S by q_x = c0 e^{∑_{k=1}^m λ_k g_k(x)}, where c0 = (∑_{x∈S} e^{∑_{k=1}^m λ_k g_k(x)})^{−1} and λ1, . . . , λm are chosen so that q ∈ C. Note that p* and q may not exist.

(a) If p* exists (and ∃p0 ∈ C such that p0 > 0 on S), then q exists and q = p*.

(b) If q exists, then p* exists and p* = q.

Note also that c0 > 0.

Example 2.23.

(a) When there is no constraint, if X is finite, then the MEPD is uniform with p_i = 1/|X|. If X is countable, then the MEPD does not exist.

(b) If we require EX = µ, then the MEPD is geometric (or truncated geometric): p_i = c0 βⁱ. If, in addition, S = N ∪ {0}, then p_i = µⁱ/(1 + µ)^{i+1} = (1/(1 + µ))(µ/(1 + µ))ⁱ with corresponding entropy

  (1/(1 − β)) hb(β) = (1/(1 − β)) hb(1 − β)   (3)
  = −µ log µ + (1 + µ) log(1 + µ)   (4)
  = µ log(1 + 1/µ) + log(1 + µ).   (5)

Observe that

(i) (4) is similar to the expression defining the binary entropy function hb in (2.6),
(ii) H(X) is a strictly increasing function of µ, and
(iii) H(X)/EX = H(X)/µ is a strictly decreasing function of µ.

See [14] for more examples.

Figure 7: Entropy of the Geometric distribution as a function of its mean EX: the solid curve is H(X), the dash-dot curve is H(X)/EX. The plot is generated by

m = 0:0.01:5;
h = m.*log2(1+1./m) + log2(1+m);
set(plot(m,h,'k',m,h./m,'k-.'),'LineWidth',1.5)
axis([0 5 0 5])

2.24. The Poisson distribution P(λ) maximizes entropy within the class of Bernoulli sums of mean λ. Formally speaking, define S(λ) to be the class of Bernoulli sums Sn = X1 + · · · + Xn, for Xi independent Bernoulli, with ESn = λ. Then, the entropy of a random variable in S(λ) is dominated by that of the Poisson distribution:

  sup_{S∈S(λ)} H(S) = H(P(λ)).

Note that P(λ) is not in the class S(λ).

2.25. Most of the standard probability distributions can be characterized as being MEPD. Consider any pmf on S of the form q_X(x) = c0 e^{∑_{k=1}^m λ_k g_k(x)} with finite E[g_k(X)] for k = 1, . . . , m. Then q_X is the MEPD for the class of pmf's p on S under the constraints given by the E[g_k(X)]'s. Hence, to see which constraints characterize a pmf as an MEPD, rewrite the pmf in the exponential form.

S | Constraints | MEPD
[n] | No constraint | U_n
N ∪ {0} | EX | G0(p)
N | EX | G1(p)
{0, 1, . . . , n} | EX, E ln (n choose X) | B(n, p)
N ∪ {0} | EX = λ, E ln X! | P(λ)

Table 3: MEPD: Discrete cases. Most of the standard probability distributions can be characterized as being MEPD when the values of one or more moments are prescribed [14, p 359].
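The exponential form in 2.23(b) and 2.25 can also be constructed numerically. The MATLAB sketch below is only an illustration: the support {0, . . . , 20} and the target mean are arbitrary choices, and β is tuned so that the truncated geometric q_x ∝ β^x meets the mean constraint; its entropy is then compared with that of another pmf having the same mean.

N = 20; mu = 4;                                 % support {0,...,N}, target mean (arbitrary)
x = (0:N)';
meanOf = @(beta) sum(x .* beta.^x) / sum(beta.^x);
beta = fzero(@(b) meanOf(b) - mu, [0.01 0.99]); % tune beta so that E X = mu
q = beta.^x / sum(beta.^x);                     % truncated geometric MEPD

H = @(p) -sum(p(p > 0).*log2(p(p > 0)));

% any other pmf on {0,...,N} with the same mean has smaller entropy;
% e.g. a two-point pmf on {0, N} with mean mu:
r = zeros(N+1,1);  r(1) = 1 - mu/N;  r(N+1) = mu/N;
fprintf('H(MEPD) = %.4f bits >= H(two-point) = %.4f bits\n', H(q), H(r))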
2.2 Stochastic Processes and Entropy Rate

In general, the uncertainty about the values of X(t) on the entire t axis, or even on a finite interval, no matter how small, is infinite. However, if X(t) can be expressed in terms of its values on a countable set of points, as in the case of bandlimited processes, then a rate of uncertainty can be introduced. It suffices, therefore, to consider only discrete-time processes. [21]

2.26. Consider a discrete stationary source (DSS) {U(k)}. Let U be the common alphabet set. By stationarity, H(U_1^n) = H(U_k^{k+n−1}). Define

• the per-letter entropy of an L-block:

  H_L = H(U_1^L)/L = H(U_k^{k+L−1})/L;

• the incremental entropy change:

  h_L = H(U_L | U_1^{L−1}) = H(U_1^L) − H(U_1^{L−1}).

It is the conditional entropy of a symbol when the preceding ones are known. Note that for a stationary Markov chain, h_{L,markov} = H(U_L | U_1^{L−1}) = H(U_L | U_{L−1}) = H(U2 | U1) ∀L ≥ 2.

Then,

• h_1 = H_1 = H(U1) = H(U_k).
• h_L ≤ H_L.
• h_L = L H_L − (L − 1) H_{L−1}.
• Both h_L and H_L are non-increasing (↘) functions of L, converging to the same limit, denoted H.

The entropy rate of a stationary source is defined as

  H(U) = H({U_ℓ}) = H_U = lim_{L→∞} H(U_1^L)/L = lim_{L→∞} H(U_L | U_1^{L−1}).

Remarks:

• Note that H(U_1^L) is an increasing function of L. So, for a stationary source, the entropy H(U_1^L) grows (asymptotically) linearly with L at rate H_U.
• For a stationary Markov chain of order r, H_U = H(U_{r+1} | U_1^r) = h_{r+1}.
• For a stationary Markov chain of order 1, H_U = H(U2|U1) = h_2 < H(U1) = H(U2). In particular, let {Xi} be a stationary Markov chain with stationary distribution u and transition matrix P. Then, the entropy rate is H = −∑_{ij} u_i P_{ij} log P_{ij}, where P_{ij} = P[next state is j | current state is i] = P[X2 = j | X1 = i] and u_i = P[X1 = i]. Also, H_U = ∑_i u_i H(U2 | U1 = i), where the H(U2 | U1 = i)'s are computed, for each i, from the transition probabilities P_{ij} out of state i. When there is more than one communicating class, H_U = ∑_i P[class i] H(U2 | U1, class i). A small numerical example follows.
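A minimal MATLAB sketch of the entropy-rate formula above (the two-state transition matrix is an arbitrary example): the stationary distribution is obtained from the eigenvector of Pᵀ for eigenvalue 1, and the rate is −∑ u_i P_ij log P_ij.

P = [0.9 0.1; 0.4 0.6];                      % arbitrary transition matrix, P(i,j) = P[j | i]
[V, D] = eig(P');                            % stationary distribution solves u P = u
[~, k] = min(abs(diag(D) - 1));
u = V(:,k) / sum(V(:,k));

Hrate = 0;
for i = 1:2
    row = P(i,:);
    Hrate = Hrate - u(i) * sum(row(row > 0) .* log2(row(row > 0)));
end
fprintf('entropy rate H = %.4f bits per symbol\n', Hrate)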
The statement of convergence of the entropy at time n of a random process, divided by n, to a constant limit called the entropy rate of the process is known as the ergodic theorem of information theory or the asymptotic equipartition property (AEP). Its original version, proven in the 1950s for ergodic stationary processes with a finite state space, is known as the Shannon-McMillan theorem for the convergence in mean and as the Shannon-McMillan-Breiman theorem for the a.s. convergence.

2.27. Weak AEP: For U_k i.i.d. ∼ p_U(u),

  −(1/L) log p_{U_1^L}(U_1^L) → H(U) in probability.

See also section (11).

2.28. Shannon-McMillan-Breiman theorem [5, Sec 15.7]: If the source U_k is stationary and ergodic, then

  −(1/L) log p_{U_1^L}(U_1^L) → H_U a.s.

This is also referred to as the AEP for ergodic stationary sources/processes.

3 Relative Entropy / Informational Divergence / Kullback Leibler "distance": D

3.1. Let p and q be two pmf's on a common countable alphabet X. The relative entropy (Kullback Leibler "distance", cross-entropy, directed divergence, informational divergence, or information discrimination) between p and q (from p to q) is defined as

  D(p‖q) = ∑_{x∈X} p(x) log(p(x)/q(x)) = E[log(p(X)/q(X))],   (6)

where the distribution of X is p.

• It is not a true distance since symmetry and the triangle inequality fail. However, it can be regarded as a measure of the difference between the distributions of two random variables. Convergence in D is stronger than convergence in L1 [see (3.11)].
• If q is uniform, i.e., q = u = 1/|X|, then D(p‖u) = log|X| − H(X). A brief numerical illustration of these points follows.
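The MATLAB sketch below uses arbitrary example pmf's p, q, r: it shows a strictly positive divergence, the lack of symmetry, and the identity D(p‖u) = log|X| − H(X) when q is uniform.

p = [0.5 0.3 0.1 0.1];  q = [0.25 0.25 0.25 0.25];  r = [0.1 0.2 0.3 0.4];
D = @(p,q) sum(p(p > 0) .* log2(p(p > 0)./q(p > 0)));   % assumes supp(p) is inside supp(q)
H = @(p) -sum(p(p > 0).*log2(p(p > 0)));

fprintf('D(p||r) = %.4f, D(r||p) = %.4f (not symmetric)\n', D(p,r), D(r,p))
fprintf('D(p||u) = %.4f = log2|X| - H(p) = %.4f\n', D(p,q), log2(4) - H(p))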
h i
◦ By convexity of D, this also shows that H(X) is 3.8. D (p (x |z ) kq (x |z ) ) ≥ 0. In fact, it is E log p(X|Z )
=
concave ∩ w.r.t. p. h i h i q(X|Z )
p (z) E log p(X|z ) p(X|z )
P
q(X|z ) z where E log q(X|z ) z ≥ 0 ∀z.

◦ Maximizing the entropy subject to any given set of z
constraints is identical to minimizing the relative
3.9. Chain rule for relative entropy:
entropy from the uniform distribution subject to the
same constraints [13, p 160]. D (p (x, y) kq (x, y) ) = D (p (x) kq (x) )+D (p (y |x ) kq (y |x ) ) .
ˆ If ∃x such that p(x) > 0 but q(x) = 0 (that is support of 3.10. Let p and q be two probability distributions on a
p is not a subset of support of q), then D(pkq) = ∞. common alphabet X . The variational distance between p and
q is defined by
3.2. Although relative entropy is not a proper metric, it is
natural measure of “dissimilarity” in the context of statistics 1 X
dT V (p, q) = |p (x) − q (x)|.
[5, Ch. 12]. Some researchers regard it as an information- 2
x∈X
theoretic distance measures.
It is the metric induced by the `1 norm. See also 10.5.
3.3. Divergence Inequality: D (p kq ) ≥ 0 with equality if
and only if p = q. This inequality is also known as Gibbs 3.11. Pinsker’s Inequality [6, lemma 11.6.1 p 370],[5,
Inequality. Note that this just means if we have two vectors lemma 12.6.1 p 300]:
u, v with the same length, P each have nonnegative elements D (p kq ) ≥ 2 (logK e) d2T V (p, q) ,
which summed to 1. Then, ui log uvii ≥ 0.
i where K is the base of the log used to define D. In particular,
3.4. D (p kq ) is convex ∪ in the pair (p, q). That is if (p1 , q1 ) if we use ln when defining D (p kq ), then
and (p2 , q2 ) are two pairs of probability mass functions, then D (p kq ) ≥ 2d2T V (p, q) .
D (λp1 + (1 − λ) p2 kλq1 + (1 − λ) q2 ) Pinsker’s Inequality shows that convergence in D is stronger
≤ λD (p1 kq1 ) + (1 − λ) D (p2 kq2 ) ∀ 0 ≤ λ ≤ 1. than convergence in L1 . See also 10.6

This follows directly from the convexity ∪ of x ln xy . 3.12. Suppose X1 , X2 , . . . , Xn are independent and
Y1 , Y2 , . . . , Yn are independent, then
ˆ For fixed p, D (p kq ) is convex ∪ functional of q. That is n
 X
D pX1n pY1n = D (pXi kpYi ).
D (λq1 + (1 − λ) q2 kp ) ≤ λD (q1 kp ) + (1 − λ) D (q2 kp ) .
i=1

ˆ For fixed q, D (p kq ) is convex ∪ functional of p. Combine with (3.14), we have


! n
3.5. For binary random variables: X
D pP p P ≤ D (pXi kpYi ).

n n
Xi Yi
p 1−p i=1 i=1 i=1
D (p kq ) = p log + (1 − p) log .
q 1−q
3.13 (Relative entropy expansion). Let (X, Y ) have joint
Note that D (p kq ) is convex ∪ in the pair (p, q); hence it is distribution pX,Y (x, y) on X × Y with marginals pX (x) and
convex ∪ in p for fix q and convex ∪ in q for fix p. pY (y), respectively. Let qX̃ (x) and qỸ (y) denote two arbitrary
marginal probability distributions on X and Y, respectively.
3.6. For p = (p1 , . . . , pn ) and q = (q1 , . . . , qn ), Then,
ˆ D(pkq) is a continuous function of p1 , . . . , pn and of D (pX,Y kqX̃ qỸ ) = D (pX kqX̃ ) + D (pY kqỸ ) + I (X; Y ) .
(q1 , . . . , qn ) [13, p 155];
More generally, Let X1n have joint distribution pX1n (xn1 ) on
ˆ D(pkq) is permutationally symmetric, i.e., the value of
X1 × X2 × · · · × Xn with marginals pXi (xi )’s. Let qX̃i (xi )’s
this measure does not change if the outcomes are labeled
denote n arbitrary marginal distributions. Then,
differently if the pairs (p1 , q1 ), . . . , (pn , qn ) are permuted n !
among themselves [13, p 155]. Y
D pX1n qX̃1

3.7. The conditional relative entropy is defined by
i=1
n
 n−1
 
p (Y |X ) X X
n

D (p (y |x ) kq (y |x ) ) = E log = D pXi qX̃i + I Xi ; Xi+1
q (Y |X )
i=1 i=1
X X p (y |x ) n n
= p (x) p (y |x ) log . X X
H (Xi ) − H (X1n ) .

q (y |x ) = D pXi qX̃i +
x y
i=1 i=1

9
Note that the final equation follows easily from Note that the first term quantifies how small the pi ’s
n
Q are. The second term quantifies the degree of dependence.
pX (xi ) (See also (7.3).)
p (xn1 ) p (xn1 ) i=1 i
ln Q
n = ln Q
n n
Q Also see (2.24).
qX̃i (xi ) qX̃i (xi ) pXi (xi )
i=1 i=1 i=1
n n
3.16 (Han’s inequality for relative entropies). Suppose
X pXi (xi ) n
X Y1 , Y2 , . . . are independent. Then,
= ln + log p (x1 ) − log pXi (xi ).
i=1
qX̃i (xi ) i=1 n
1 X
D (X1n kY1n ) ≥

D X[n]\{i} Y[n]\{i} ,
See also (7.3). n − 1 i=1
3.14 (Data Processing Inequality for relative entropy). Let
X1 and X2 be (possibly dependent) random variables on X . or equivalently,
Q(y|x) is a channel. Yi is the output
g(X )
of the channel when the n
X
n n
D (X1n kY1n ) − D X[n]\{i} Y[n]\{i} .

input is Xi . Then,
X D (X1 kY1 ) ≤
0
i=1
D (Y1≥ kY
0 2 ) ≤ D (X1 kX2 ) .
0 0 3.17. Minimum cross-entropy probability distribution
In particular, (MCEPD) [13, p 158, 160][6, Q12.2, p 421]:
D (g (X1 ) kg (X2 ) ) ≤ D (X1 kX2 ) . Consider fixed (1) finite S ⊂ R, (2) functions g1 , . . . , gm
Z on S, and (3) pmf q on S. Let C be a class of probability
The inequality follows from applying the log-sum inequality mass function pX which are supported on S (pX = 0 on
p (y)
to pY1 (y) ln pYY1 (y) where pYi (y) = Q (y |x ) pXi (x).
P
S c ) and satisfy the moment constraints E [gk (X)] = µk , for
2
x
1 ≤ k ≤ m. Let pmf p∗ be the probability distribution that
X1
minimizes the relative entropy D( · kq) for C. Define PMF p̂
Q ( y x)
m
Y1
P
λk gk
on S by p̂ = qeλ0 ek=1 where λ0 , λ1 , . . . , λm are chosen so
that p̂ ∈ C. Note that p∗ and p̂ may not exist.
Q ( y x) Y2 Suppose p̂ exists.
X2
(a) Fix r ∈ C.
p̂(x)
Figure 8: Data Processing Inequality for relative entropy ˆ
P
x∈S r(x) log q(x) = D(p̂kq); that is
for any Y ∼ r ∈ C,
3.15. Poisson Approximation for sums of binary ran-  
dom variables [15, 12]: p̂(Y )
E log = D(p̂kq).
q(Y )
ˆ Given a random variable X with corresponding pmf p
whose support is inside N ∪ {0}, the relative entropy ˆ D(rkq) − D(p̂kq) = D(rkp̂) ≥ 0.
D (pkP(λ)) is minimized over λ at λ = EX [12, Lemma
7.2 p. 131]. (b) p∗ exists and p∗ = p̂.
Pm
◦ In fact, (c) D(p∗ kq) = D(p̂kq) = (λ0 + k=1 λk µk ) log e.

X Example 3.18. Suppose pmf q = P(b) and C is the class of
D (p kq ) = λ − EX ln (λ) + p (x) ln (p (x) x!). pmf with mean λ. Then, the MCEPD is P(λ) [13, p 176–177].
i=0

and 3.19. Alternating Minimization Procedure: Given two


d 1 convex sets A and B of PMF p and q, the minimum relative
D (p kq ) = 1 − EX.
dλ λ entropy between this two sets of distributions is defined as
ˆ Let X1 , X2 , . . . , Xn denote n possibly dependent binary
dmin = min D(pkq).
randomPvariables with parameters pi = P [Xi = 1]. Let p∈A,q∈B
n
Sn = i=1 Xi . Let Λn be a Poisson random variable
Pn Suppose we first take any distribution p(1) in A, and find a
with mean λ = pi . Then, distribution q (1) in B that is closet to it. Then fix this q (1)
i=1
n n
! and find the closet distribution in A. Repeating this process,
Sn Λn
 X
2
X
n then the relative entropy converges to dmin [6, p 332]. (See
D P P ≤ log e pi + H (Xi ) − H (X1 ) .
also [8].)
i=1 i=1

10
3.20. Let p(x)Q(y|x) be a given joint distribution with cor- ˆ The name mutual information and the notation I(X; Y )
responding distributions q(y), P (x, y), and P (x|y). was introduced by [Fano 1961 Ch 2].
(a) argmin D (p(x)Q(y|x)kp(x)r(y)) = q(y). ˆ Mutual information is a measure of the amount of infor-
r(y) mation one random variable contains about another [6, p
P r(x|y)
13]. See (13) and (14).
(b) argmax p(x)Q(y|x) log = P (x|y).
r(x|y)
x,y p(x)
ˆ By (11), mutual information is the (Kullback-Leibler)
divergence between the joint and product-of-marginal
[6, Lemma 10.8.1 p 333] distributions. Hence, it is natural to think of I(X; Y ) as
3.21. Related quantities a measure of how far X and Y are from being independent.

(a) J-divergence ˆ If we define


 
Q (Y |x )
ˆ J(p, q) = 1
2 (D(pkq) + D(qkp)). I (x) = E log X=x
q (Y )
ˆ Average of two K-L distances.  
P (x|Y )
ˆ Symmetric. = E log X=x
p (x)
(b) Resistor average Alternative Proof. = E [ − log q (Y )| X = x] − H (Y |x ) ,
D(pkq)D(qkp) I ( X ; Y ; Z ) = I ( X , Y ; Z ) − I ( X ; Z Y ) − I (Y ; Z X ) = 0 − 0 − 0
ˆ R(p, q) = D(pkq)+D(qkp) . then
X Y
ˆ Parallel resistor formula: Y I(X; XY ) = E [I(X)] .
1 1 1 0
= + . 4.2.
0 0I(X; Y ) ≥ 0 with equality if and only if X and Y are
R(p, q) D(pkq) D(qkp) independent.
ˆ Does not satisfy the triangle inequality. ˆ ZIf X or Y is deterministic,
Z
then I(X; Y ) = 0.
• Suppose I is an index set. Define a random vector X I = ( X i : i ∈ I ) .
• 4.3.
Suppose we have nonempty I (X;
disjoint indexX)
sets A,= H (X).
B. Then, Hence, entropy is the self-
4 Mutual Information: I ∀ nonempty A ⊂ A ∀ nonempty B ⊂ B I ( X A ; X B ) = 0 .
I ( X A ; X B ) = 0 iff information.

4.1. The mutual information I(X; Y ) between Proof.two is obvious because A ⊂ A and B ⊂ B . “⇒” We can write
“⇐” ran-
4.4. I(X; Y ) ≤ min {H(X), H(Y )}.
I ( X A ; X B ) = I ( X A ; X B , X B \ B ) . The above result (*) then gives
dom variables X and Y is defined as
  I ( X A ; X B ) = 0Example
. Now write I ( 4.5. = I ( X A , X A \ Aagain
X A ; X B )Consider ; X B ) . Then,
the (*) gives
thinned Poisson example
p (X, Y )
I (X; Y ) = E log I ( A (7)
X ; X B)
= 0in (2.8). The mutual information is
.
p (X) q (Y ) • X 1 , X 2 ,… , X n are independent iff
 
P (X |Y ) I (X; Y ) = H (P (λ)) − H (P ((1 − s)λ)) .
• H ( X 1n ) = (8)
n
= E log
p (X) ∑i =1
H ( Xi ) .
 
• I ( X i ; X 1i −1 ) = 0 ∀i
Example 4.6. Binary Channel. Let
Q (Y |X )
= E log (9)
ˆn HX( X=) =Y∑n = H ({0,
Proof. 0 = H ( X 1n ) − ∑ 1 ) − ∑ H ( Xi )
n
q (Y ) X i X1}; i −1
i
i =1 i =1 i =1
XX p (x, y) ˆ p (1) = P [Xn = 1] = p = 1 − P [X = 0] = 1 − p (0);
= p (x, y) log = ∑ ( H ( X i X 1i −1 ) − H ( X i ) ) = ∑ I ( X i ; X 1i −1 )
(10) n

p (x) q (y) = i =1

1−a a
 
ā a

x∈X y∈Y i 1
ˆ T = i −1 [Y = j |X = i ]] =
[P
This happens iff I ( X i ; X 1 ) = 0 ∀i
= .
= D ( p (x, y)k p (x) q (y)) (11) b 1−b b b̄
Alternative Proof. ˆ p̄ = 1 − p and q̄ = 1 − q.
= H (X) + H (Y ) − H (X, Y ) (12)
This is obvious from X 1 , X 2 ,… , X n are independent iff ∀i X i and
ˆ The distribution vectors of X and Y : p = p̄ p and
 
= H (Y ) − H (Y |X ) 1
i −1
X(13) are independent.  
= H (X) − H (X |Y ) , Example (14) q = q̄ q .
• Binary Channel:
where p (x, y) = P [X = x, Y = y] , p (x) = P [X = x] , q (y) =
P [Y = y] , P (x |y ) = P [X = x |Y = y ] , and Q (y |x{ ) } =
X = Y = 0,1 , 1− a
0 0
P [Y = y |X = x ]. p (1) = Pr { X = 1} = p = 1 − Pr { X = 0} = 1 − p ( 0 ) , a
X Y
⎡1 − a a ⎤ ⎡a a ⎤ b
ˆ I(X; Y ) = I(Y ; X). T = ⎡⎣ Pr {Y = j X = i}⎤⎦ = ⎢ ⎥=⎢ ⎥.
⎣ b 1 − b⎦ ⎣ b b ⎦ 1
1− b
1
ˆ The mutual information quantifies the reduction in the
uncertainty of one random variable due to the knowl- Figure 9: Binary Channel
edge of the other. It can be regarded as the information
contained in one random variable about the other. Then,

11
 
p̄ā p̄a 4.10. Conditional v.s. unconditional mutual information:
ˆ P = [P [X = i, Y = j]] = .
pb pb̄
ˆ If X, Y , and Z forms a Markov chain (any order is OK),
ˆ q = pT =
 
p̄ā + pb p̄a + pb̄ . then I(X; Y ; Z) ≥ 0 and conditioning only reduces mu-
" p̄ā # tual information: I (X; Y |Z ) ≤ I (X; Y ) , I (X; Z |Y ) ≤
pb
I (X; Z) , and I (Y ; Z |X ) ≤ I (Y ; Z).
ˆ T̃ = [P [X = j |Y = i ]] = p̄ā+pb
p̄a
p̄ā+pb
pb̄ .
p̄a+pb̄ p̄a+pb̄ Furthermore, if, for example, X and Z are not indepen-
dent, then I(X; Z) > 0 and I (X; Y ) > I (X; Y |Z ). In
ˆ H (X) = h (p).
particular, let X has nonzero entropy, and X = Y = Z,
ˆ H(Y ) = h (p̄ā + pb). then I (X; Y ) = h (X) > 0 = I (X; Y |Z ).
ˆ If any of the two r.v.’s among X, Y, and Z are indepen-
ˆ H (Y |X ) = p̄h (a) + ph (b) = p̄h (a) + ph b̄ .

dent, then I(X; Y ; Z) ≤ 0 and conditioning only increases
ˆ I (X; Y ) = h p̄a + pb̄ − p̄h (a) + ph b̄ . mutual information: I (X; Y ) ≤ I (X; Y |Z ) , I (X; Z) ≤
 
I (X; Z |Y ) , and I (Z; Y ) ≤ I (Z; Y |X ) .
Recall that h is concave ∩.
Each case above has one inequality which is easy to see. If
For binary symmetric channel (BSC), we set a = b = α.
X − Y − Z forms a Markov chain, then, I(X; Z|Y ) = 0. We
4.7. The conditional mutual information is defined as know that I(X; Z) ≥ 0. So, I(X; Z|Y ) ≤ I(X; Z). On the
other hand, if X and Z are independent, then I(X; Z) = 0.
I (X; Y |Z ) = H (X |Z ) − H (X |Y, Z )
  We know that I(X; Z|Y ) ≥ 0. So, I(X; Z|Y ) ≥ I(X; Z).
p (X, Y |Z )
= E log 4.11 (Additive Triangle Inequality). Let X, Y , Z be three
P (X |Z ) p (Y |Z )
X real- or discrete-valued mutually independent random vari-
= p (z) I (X; Y |Z = z ), ables, and let the “+” sign denote real or modulo addition.
z Then
h i
p(X,Y |z )
where I (X; Y |Z = z ) = E log P (X|z )p(Y |z ) z . I(X; X + Z) ≤ I(X; X + Y ) + I(Y ; Y + Z) (15)

ˆ I (X; Y |z ) ≥ 0 with equality if and only if X and Y are [27]. This is similar to triangle inequality if we define
independent given Z = z. d(X, Y ) = I(X; X + Y ), then (15) says
ˆ I (X; Y |Z ) ≥ 0 with equality if and only if X and Y are d(X, Z) ≤ d(X, Y ) + d(Y, Z).
conditionally independent given Z; that is X − Y − Z
form a Markov chain. 4.12. Given processes X = (X1 , X2 , . . .) and Y =
(Y1 , Y2 , . . .), the information rate between the processes
4.8. Chain rule for information: X and Y is given by
n
X 1
I (X1 , X2 , . . . , Xn ; Y ) = I (Xi ; Y |Xi−1 , Xi−1 , . . . , X1 ), I(X; Y ) = lim I (X1n ; Y1n ) .
n→∞ n
i=1
4.13 (Generalization of mutual information.). There isn’t
or simply
n
X really a notion of mutual information common to three random
I (X1n ; Y ) = I Xi ; Y X1i−1 .

variables [6, Q2.25 p 49].
i=1
(a) Venn diagrams [1]:
In particular, I (X1 , X2 ; Y ) = I (X1 ; Y ) + I (X2 ; Y |X1 ). Sim-
ilarly, I(X1 ; X2 ; · · · ; Xn ) = µ∗ (X1 ∩ X2 ∩ · · · ∩ Xn )
n X
(−1)|S|+1 H(XS ).
X
I (X1n ; Y |Z ) = I Xi ; Y X1i−1 , Z .

=
i=1 S⊂[n]

4.9. Mutual information (conditioned or not) between sets See also section 12 on I-measure.
of random variables can not be increased by removing random  n 
X1n
Q X
variable(s) from either set: (b) D P P
i
.
i=1
I ( X1 , X2 ; Y | Z) ≥ I ( X1 ; Y | Z) .
4.14. The lautum information [20] is the divergence be-
See also (5.6). In particular, tween the product-of-marginal and joint distributions, i.e.,
swapping the arguments in the definition of mutual informa-
I (X1 , X2 ; Y1 , Y2 ) ≥ I (X1 ; Y1 ) . tion.

12
H ( f ( X ) ) = −∑ p f ( X ) ( y ) log p f ( X ) ( y )
y

⎛ ⎞
= −∑ ⎜ ∑ p X ( x ) ⎟ log p f ( X ) ( y )

y ⎝ x: f ( x ) = y


⎛ ⎞
= −∑
(a) L(X; Y )⎜ =∑ D(ppX X ( x ) log p f ( X ) ( y ) ⎟
pY kpX,Y ) ⎟ 5.5. I (X, f (X, Z) ; Y, g (Y, Z) |Z ) = I (X; Y |Z ).

y ⎝ x: f ( x ) = y ⎠
(b) Lautum (“elegant” in Latin) is the reverse spelling of 5.6. Compared to X, g(X) gives us less information about
⎛ ⎞
≤ −∑ ⎜ ∑ p X ( x ) log p X ( x ) ⎟
mutual. Y.

y ⎝ x: f ( x ) = y


= −∑ p X ( x ) log p X ( x ) = H ( X ) ˆ H (Y |X, g (X) ) = H (Y |X ) ≤ H (Y |g (X) ).
5 Functions of random variables
x
ˆ I (X; Y ) ≥ I (g (X) ; Y ) ≥ I (g(X); f (Y )). Note that this
H ( X , g ( X )) = H ( X ) agrees with the data-processing theorem (6.1 and 6.4)
The are several occasions where we have to deal with functions
of random variables. In 0fact, for those who knows I-measure, using the chain f (X) − X − Y − g (Y ).
Proof. H ( X , g ( X ) ) = H ( X ) + H ( g ( X ) X )
the diagram in Figure 10 already summarizes almost all iden- (
p Φ ( g ( X ) ) = z, g ( X ) = y =) ∑ p (Φ ( y ) = z ) p ( x)
x
5.7. I (X; g (X) ; Y ) = I (g (X) ; Yg ()x ) =≥y 0.
H ( g ( X ) tities
X , Y ) =of0 ,our
I ( ginterest.
( X ) ;Y X ) = 0 .
5.8. If X = g1 (W ) and ∑ ),( then H (X |Y )) ≤= p Φ ( g ( X ) ) = z, X = x
g(X )  Ŵ = gg( x2x) =(Y
y
H (W |Y ) ≤ H W Ŵ . The statement is also true when

X
0 Hence,

≥0
( Remark: If) W − X − Y − Ŵ forms a Markov chain, we have
W − X − Y − Ŵ forms a Markov chain and X = g1 (W ).
H Φ ( g ( X )) X

∑∑|Y p) (≤
Φ (H ) ) = Ŵ
g ( XW ) logitp (isΦ (not ) = z in
g ( X )true x)
 
0 H =(W z, X =
, xbut X =general that
z x

∑∑ ∑ p ( Φ ( g ( X ) ) = z, X = x ) log p ( Φ ( g ( X ) ) = z X = x )
H =(X |Y ) ≤ H (W |Y ).
z y x

Y 5.9. Forg ( xthe


) = y following chain

= ∑∑ = x ) log p ( Φ ( g ( X ) ) = z g ( X ) = y )
∑ p ( Φ ( g ( X ) ) = zY, X=g(X)
Proof. 1) H ( g ( X ) X ) = 0 , H ( g ( X ) X ) ≥ H ( g ( X ) X , Y ) , and H ( ⋅) ≥ 0 ⇒ z y x
g ( x) = y − g (·) −−−−−→ Q (· |· ) →
X→ − Z
H ( g ( X ) Figure
X , Y ) = 010:
. Information diagram for X and g(X) ⎛ ⎞
2) I ( g ( X ) ; Y X ) = H ( g ( X ) X ) − H ( g ( X ) X , Y ) = 0 − 0 = 0 . = ∑∑
where (
p ( Φ ( g ( X ) ) = and
g islogdeterministic z g ( XQ) =

)⎜
is ya) probabilistic ) ) = z, X = x ) ⎟
∑ p ( Φ ( g ( Xchannel, ⎟
5.1. I(X; g(X)) = H(g(X)). z y ⎜ g ( xx) = y ⎟
⎝ |X, g (X) ). ⎠
Or, can use I ( Z ;Y ) ≤ H ( Z ) . Hence, 0 ≤ I (Y ; g ( X ) X ) ≤ H ( g ( X ) X ) = 0 . ˆ H (Z |X ) = H (Z |g (X) ) = H (Z
5.2. When X is given, we can simply disregard g(X). ( ( )) ( (
= ∑∑ log p Φ ( g ( X ) ) = z g ( X ) = y p Φ ( g ( X ) ) = z , g ( X ) = y ))
Note that can also prove that H ( g ( X ) X , Y ) = 0 and I ( g ( X ) ;Y X ) = 0 togetherˆbyIz (Z;
y X) = I (Z; g (X)) = I (X; g (X) ; Z) ≥ 0.
ˆ H (g (X) |X ) = 0. In fact, ∀x ∈ X , H (g (X) |X = x) =
argue that H g X X = H g X X , Y + I g X ; Y X = 0 . Because both of =
( ( is,) given
) (X,( g(X)) is) completely
( ( ) determined
) ( )
theH Φ ( g ( X ) ) g ( X )
0. That and Again, the diagram in Figure 11 summarizes the above results:
→ g ( ⋅) ⎯⎯⎯⎯ ( )
→ Q ( ⋅ ⋅) ⎯⎯
=
summands are nonnegative, they both have to be 0. • Consider X ⎯⎯ → Z . Then,
Y g X
hence has no uncertainty.
• (
H Y X , gˆ ( H
X )(g=(X))
H (Y|X,
X ) Y≤ H (
) =Y 0.g ( X ) ) g(X )
ˆ I (g (X) ; Y |X ) = I (g (X) ; Y |X, Z ) = 0. X
0
5.3. g(X) has less uncertainty than X. H (g (X)) ≤ H (X)
with equality if and only if g is one-to-one (injective). That ≥0
is deterministic function only reduces the entropy. Similarly, 0 0
H (X |Y ) ≥ H (g (X) |Y ).
5.4. We can “attach” g(X) to X. Suppose f, g, v are deter-
Z
ministic function on appropriate domain.
ˆ H (X, g (X)) = H (X).
• ( )
H (Z X ) = H Z g ( X ) = H Z X , g ( X ) . ( )
Figure 11: Information diagram for Markov chain where the
ˆ H (Y |X, g (X) ) = H (Y |X ) and • I ( Z ;first I ( Z ; g ( X ) ) =isI a
X ) = transition ( Xdeterministic
; g ( X ) ; Z ) ≥ 0 . function
H (X, g (X) |Y ) = H (X |Y ).
An expanded version is H (X, g (X) |Y, f (Y ) ) =
• Let X = g1 (W ) , and Ŵ = g 2 (Y ) , then H ( X Y ) ≤ H (W Y ) ≤ H W Wˆ . ( )
H (X |Y, f (Y ) ) = H (X, g (X) |Y ) = H (X |Y ). 5.10. If ∀y g(x, y) is invertible as a function of x,
then H(X|Y ) = H(g(X, Y )|Y ). In fact,∀y H(X|y) =
ˆ H (X, g (X) , Y ) = H (X, Y ).
H(g(X, Y )|y).
An expended version is
H (X, Y, v (X, Y )) = H (X, g (X) , Y, f (Y )) = ˆ For example, g(x, y) = x − y or x + y. So, H(X|Y ) =
H (X, g (X) , Y ) = H (X, Y, f (Y )) = H (X, Y ). H(X + Y |Y ) = H(X − Y |Y ).
ˆ I (X, g (X) ; Y ) = I (X; Y ).
An expanded version is I (X, g (X) ; Y, f (Y )) = 5.11. If T (Y), a deterministic function of Y , is a sufficient
I (X, g (X) ; Y ) = I (X; Y, f (Y )) = I (X; Y ). statistics for X. Then, I (X; Y) = I (X; T (Y)).

13
5.12. Data Processing Inequality for relative entropy: ˆ I (X; Y, Z |W ) = I (X; Y |W )
Let Xi be a random variable on X , and Yi be a random variable ˆ I (X; Z |Y, W ) = 0
on Y. Q (y |x ) is a channel whose input and output are Xi ˆ X − (Y, W ) − Z forms a Markov chain.
and Yi , respectively. Then,
(c) Term moved to condition: The following statements are
D (Y1 kY2 ) ≤ D (X1 kX2 ) . equivalent:

In particular, ˆ I (X; Y, Z) = I (X; Y |Z )


ˆ I (X; Z) = 0
D (g (X1 ) kg (X2 ) ) ≤ D (X1 kX2 ) .
ˆ X and Z are independent.
Remark: The same channel Q or, in the second case, same (d) Term moved to condition: The following statements are
deterministic function g, are applied to X1 and X2 . See also equivalent:
(3.14).
ˆ I (X; Y, Z |W ) = I (X; Y |Z, W )
ˆ I (X; Z| W ) = 0
6 Markov Chain and Markov Strings ˆ X − W − Z forms a Markov chain.
6.1. Suppose X − Y − Z form a Markov chain. 6.3. Suppose U − X − Y − V form a Markov chain, then
ˆ Data processing theorem: I (X; Y ) ≥ I (U ; V ).
6.4. Consider a (possibly non-homogeneous) Markov chain
I (X; Y ) ≥ I (X; Z) .
(Xi ).
The interpretation is that no clever manipulation of the ˆ If k1 ≤ k2 < k3 ≤ k4 , then I (Xk1 ; Xk4 ) ≤ I (Xk2 ; Xk3 ).
data (received data) can improve the inferences that can (Note that Xk1 − Xk2 − Xk3 − Xk4 is still a Markov chain.)
be made from the data.
ˆ Conditioning only reduces mutual information when the
ˆ I (Z; Y ) ≥ I (Z; X). random variables involved are from the Markov chain.
X Y Z 6.5. Consider two Markov chains governed by pi (xn0 ) =
Qn−1
pi (x0 ) k=0 pi (xk+1 |xk ) for i = 1, 2. The relative entropy
D (p1 (xn0 )kpi (xn0 )) is given by
n−1
X
D (p1 (x0 ) kp2 (x0 ) ) + D (p1 (xk+1 |xk ) kp2 (xk+1 |xk ) ).
I ( X ;Y Z ) I ( X ;Y ) I ( X;Z ) k=0

Figure 12: Data processing inequality 6.6. Consider two (possibly non-homogeneous) Markov chains
with the same transition probabilities. Let p1 (xn ) and p2 (xn )
be two p.m.f. on the state space Xn of a Markov chain at time
ˆ I (X; Y ) ≥ I (X; Y |Z ); that is the dependence of X and
n. (They comes possibly from different initial distributions.)
Y is decreased (or remain unchanged) by the observation
Then,
of a ”downstream” random variable Z. In fact, we also
have I (X; Z |Y ) ≤ I (X; Z) , and I (Y ; Z |X ) ≤ I (Y ; Z)
{ p1 ( x0 )} { p1 ( xn −1 )} { p1 ( xn )}
Pn ( i j )
(see 4.10). That is in this case conditioning only reduces
mutual information.
6.2. Markov-type relations
(a) Disappearing term: The following statements are equiva-
lent: Pn ( i j )
{ p ( x )}{ p ( x )}
2 0 2 n −1 { p ( x )}
2 n
ˆ I (X; Y, Z) = I (X; Y )
ˆ I (X; Z |Y ) = 0.
ˆ X − Y − Z forms a Markov chain. ˆ The relative entropy D (p1 (xn ) kp2 (xn ) ) decreases with
n:
(b) Disappearing term: The following statements are equiva-
lent: D (p1 (xn+1 ) kp2 (xn+1 ) ) ≤ D (p1 (xn ) kp2 (xn ) )

14
 n  n
ˆ D (p1 (xI ) kp2 (xI ) ) = D (p1 (xmin I ) kp2 (xmin I ) ) where ˆ D P X1n
Q X
P i

P
I (Xi ; Y ) − I (X1n ; Y ).
I is some index set and pi (xI ) is the distribution for the
i=1 i=1
random vector XI = (Xk : k ∈ I) of chain i.  n   n 
X1n
ˆ D P
Q X P
P i
≤ H (X ) − maxi H (Xi ) .
ˆ For homogeneous Markov chain, if the initial distribu-
i=1 i=1
i

tion p2 (x0 ) of the second chain is the stationary dis-


tribution p̃, then ∀n and x we have p2 (xn = x) = p̃ (x) 7.4. Suppose we have a collection of random variables
and D (p1 (xn ) kp̃ (xn ) ) is a monotonically decreasing non- X1 , X2 , . . . , Xn , then the following statements are equivalent:

negative function of n approaching some limit. The limit ˆ X1 , X2 , . . . , Xn are independent.


is actually 0 if the stationary distribution is unique. n
ˆ n
P
Consider homogeneous Markov chain. Let {p (x )} be the H (X 1 ) = H (Xi ).
n i=1
p.m.f. on the state space at time n. Suppose the stationary
ˆ I X1i ; Xi+1 = 0 ∀i ∈ [n − 1].

distribution p̃ exists.
ˆ I Xi ; Xi+1
n

= 0 ∀i ∈ [n − 1].
ˆ Suppose the stationary distribution is non-uniform and  
n
we set the initial distribution to be uniform, then the ˆ D P X1
n Q Xi
P = 0.
entropy H (Xn ) = H ({p (xn )}) decreases with n.

i=1

ˆ Suppose the stationary distribution is uniform, then, 7.5. The following statements are equivalent:
H (Xn ) = H ({p (xn )}) = log |X | − D ({p (xn )} ku ) is
monotone increasing. (a) X1 , X2 , . . . , Xn are mutually independent conditioning
on Y (a.s.).
n
6.1 Homogeneous Markov Chain (b) H (X1 , X2 , . . . , Xn |Y ) =
P
H (Xi |Y ).
i=1
For this subsection, we consider homogeneous Markov chain. n
(c) p (xn1 |y ) =
Q
6.7. H(X0 |Xn ) is non-decreasing with n; that is H(X0 |Xn ) ≥ p (xi |y ).
i=1
H(X0 |Xn+1 ) ∀n.
(d) ∀i ∈ [n] \ {1} p xi xi−1

1 , y = p (xi |y ).
6.8. For a stationary Markov process (Xn ), 
(e) ∀i ∈ [n] I Xi ; X[n]\{i} |Y = 0.
ˆ H(Xn ) = H(X1 ), i.e. is a constant ∀n. (f) ∀i ∈ [n] Xi and the vector (Xj )[n]\{i} are independent
ˆ H Xn X n−1 = H (Xn |Xn−1 ) = H (X1 |X0 ).

1 conditioning on Y .
ˆ H(Xn |X1 ) increases with n. That is H (Xn |X1 ) ≥ Pn 
H (Xn−1 |X1 ) . (g) H ( X1 , X2 , . . . , Xn | Y ) = H Xi | X[n]\{i} , Y .
i=1
 n  n
n Q
(h) D P X1 Xi
I (Xi ; Y ) − I (X1n ; Y ). (See
P
7 Independence P =
i=1 i=1
also (7.3).)
7.1. I (Z; X, Y ) = 0 if and only if Z and (X, Y ) are inde-
pendent. In which case,
8 Convexity
ˆ I (Z; X, Y ) = I (Z; X) = I (Z; Y ) = I ( Z; Y | X) =
I ( Z; X| Y ) = 0. 8.1. H ({p (x)}) is concave ∩ in {p (x)}.
ˆ I (X; Y |Z ) = I (X; Y ).
8.2. H (Y |X ) is a linear function of {p(x)} for fixed
ˆ I (X; Y ; Z) = 0. {Q(y|x)}.
7.2. Suppose we have nonempty disjoint index sets A, B.
8.3. H (Y ) is a concave function of {p(x)} for fixed {Q(y|x)}.
Then, I (XA ; XB ) = 0 if and only if ∀ nonempty à ⊂ A and
∀ nonempty B̃ ⊂ B we have I (XÃ ; XB̃ ) = 0. 8.4. I(X;Y ) = H(Y ) − H(Y |X) is a concave function of

n Q
n
  n
 {p(x)} for fixed {Q(y|x)}.
7.3. D P X1 Xi n
P
P = H (Xi ) − H (X1 ) =
8.5. I(X;Y ) = D (p(x, y)kp(x)q(y)) is a convex function of

i=1 i=1
n−1 n−1
P
I Xi; X

=
P 
I X ; X n . Notice that this {Q(y|x)} for fixed {p(x)}.
1 i+1 i i+1
i=1 i=1
function is symmetric w.r.t. its n arguments. It admits a 8.6. D (p kq ) is convex ∪ in the pair (p, q).
natural interpretation as a measure of how far the Xi are For fixed p, D (p kq ) is convex ∪ functional of q. For fixed
from being independent [15]. q, D (p kq ) is convex ∪ functional of p.

15
9 Continuous Random Variables (c) N m, σ 2 : h(X) = 12 log(2πeσ 2 ) bits. Note that h(X) <

1
0 if and only1 if σ < √2πe
• Gaussian: h ( X ) = log ( 2π eσ )
. Let Z = X1 + X2 where
2
N
The differential entropy h(X) or h (f ) of an absolutely 2

continuous random variable X with a density f (x) is defined 4


as Z
h(X) = − f (x) log f (x)dx = −E [log f (X)] , 3

log ( 2π eσ 2 )
S 1
2
where S is the support set of the random variable. 2

ˆ It is also known as Boltzmann entropy or Boltzmann’s 1


H-function.
ˆ Differential entropy is the “entropy” of a continuous ran- 0

dom variable. It has no fundamental physical meaning,


-1
but occurs often enough to have a name [3]. 1
≈ 0.242
2π e
ˆ As in the discrete case, the differential entropy depends -2
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
only on the probability density of the random variable, σ
and hence the differential entropy is sometimes written • Among Y such that EY 2 = a , one that maximize differential entropy is N ( 0, a ) .
Figure 13: h (X ) = 1 log 2πeσ 2

as h(f ) rather than h(X). Proof. Y − EY has 0 mean. So, we already N know that
2
ˆ As in every example involving an integral, or even a ( )
h (Y − EY ) ≤ h N ( 0, Var [Y ]) . Now, h (Y − EY ) = h (Y ) and

Var [Y X
X1 and ] = 2a − are a . So, h (Y ) ≤ h ( N ( 0,normal
( EY ) ≤independent Var [Y ]) ) ≤ h (random
N ( 0, a ) ) . variables
2

density, we should include the statement “if it exists”.


with Note
mean that forµZi ~andN ( 0, avariance
) , EZ = a . So,σthei2 ,upper
2
i =bound 1, is2. Then,
achieved by h (Z) =
It is easy to construct examples of random variables for 1 N ( 0, a ) . 2 2

which a density function does not exist or for which the 2 log 2πe σ1 + σ2
• Among Y such that EY ≤ a , one that maximize differential entropy is N ( 0, a ) .
2

above integral does not exist. (d) Γ(q,


Proof. λ):
Among those with EY = b , the maximum is achieve by N ( 0,b ) . For
2

9.1. Unlike discrete entropy, the differential entropy h(X) Gaussian, larger variance means
 larger differential entropy. 
⎛X⎞
• Suppose ⎜ ⎟ is a jointly Gaussian random vector with covariance matrix
Γ (q)
ˆ can be negative. For example, consider the uniform dis- h⎝ Y(X)
⎠ = (log e) q + (1 − q) ψ (q) + ln
⎛ det ( Λ ) det ( Λ ) ⎞
λ
tribution on [0, a]. h(X) can even be −∞ [12, lemma ⎛Λ Λ ⎞ 1
Λ=⎜ ⎟ . Then, I ( X ;Y ) = log ⎜⎜
X XY
⎟.
X Y

Λ ⎠ = h̃ (q) + 2log⎝ σ det ( Λ ) ⎟


⎝Λ X ⎠
1.3]; YX Y

Proof. I ( X ;Y ) = h ( X ) + h (Y ) − h ( X , Y ) .
ˆ is not invariant under invertible coordinate transformation where 1
• Exponential family: Let f ( x ) =
θ ⋅T ( x )
e where c (θ ) = ∫ eθ ⋅T ( x ) dx is a
[3]. See also 9.5. c (θ )
X

Γ0 (z)
ˆ ψ (z) = d
ln Γ (z) =
normalizing constant. Then, is the digamma function;
dz Γ(z)
9.2. For any one-to-one differentiable g, we have  
h(g(X)) = h(X) + E log |g 0 (X)|. ˆ h̃ (q) = (log e) q + (1 − q) ψ (q) + ln Γ(q)

q is a
strictly increasing function. h̃(1) = log e which
Interesting special cases are as followed:
agrees the exponential case. By CLT, lim h̃ (q) =
q→∞
ˆ Differential entropy is translation invariant: 1
2 log 2πe ≈ 2.0471 which agrees with the Gaussian
h(X + c) = h(X). case.
m
In fact,
P
λk g(x)
h(aX + b) = h(X) + log |a| . (e) Note that the entropy of any pdf of the from ce k=1

with support S (e.g. R or [a, b] or [s0 , ∞)) is


ˆ h eX = h (X) + EX

m
X
See also 9.5. h (X) = − log c − λk E [g (x)] log e.
k=1
9.3. Examples:
(a) Uniform distribution on [a, b]: h(X) = log(b − a). Note The MEPDs in 9.22 are of this form.
that h(X) < 0 if and only if 0 < b − a < 1. ˆ For bounded exponential on [a, b], the entropy is
(b) Triangular-shape pdf with support on [a, b] with height − log c + αµ log e where simple manipulation shows
µα = 1 + c ae−αa − be−αb .

2
A = b−a :
 
1 1 More examples can be found in Table 2 and [25].
h(X) = − ln(A) = ln(b − a) − ln(2) − [nats]
2 2
9.4. Relation of Differential Entropy to Discrete En-
≈ log2 (b − a) − 0.28 [bits]. tropy:

16
9.4. Relation of Differential Entropy to Discrete Entropy:

Differential entropy is not the limiting case of the discrete entropy. Consider a random variable X with density f(x). Suppose we divide the range of X into bins of length ∆. Let us assume that the density is continuous within the bins. Then, by the mean value theorem, there exists a value xi within each bin such that f(xi)∆ = ∫_{i∆}^{(i+1)∆} f(x)dx. Define a quantized (discrete) random variable X^∆ by X^∆ = xi if i∆ ≤ X ≤ (i + 1)∆.

(a) If the density f(x) of the random variable X is Riemann integrable, then, as ∆ → 0, H(X^∆) + log ∆ → h(X). That is, H(X^∆) ≈ h(X) − log ∆.

(b) When ∆ = 2^{−n}, we call X^∆ the n-bit quantization of X. The entropy of an n-bit quantization is approximately h(X) + n.

(c) h(X) + n is the number of bits on the average required to describe X to n-bit accuracy.

(d) H(X^∆ | Y^∆) ≈ h(X|Y) − log ∆.

[6, Section 8.3].

Another interesting relationship is that of Figure 4, where we approximate the entropy of a Poisson r.v. by the differential entropy of a Gaussian r.v. with the same mean and variance. Note that in this case ∆ = 1 and hence log ∆ = 0.

The differential entropy of a set X1, . . . , Xn of random variables with density f(x_1^n) is defined as

  h(X_1^n) = −∫ f(x_1^n) log f(x_1^n) dx_1^n.

If X, Y have a joint density function f(x, y), we can define the conditional differential entropy h(X|Y) as

  h(X|Y) = −∫ f(x, y) log f(x|y) dx dy
         = −E[log f_{X|Y}(X|Y)]
         = h(X, Y) − h(Y)
         = ∫ f(y) h(X|y) dy,

where h(X|y) = −∫ f_{X|Y}(x|y) log f_{X|Y}(x|y) dx.

9.5. Let X and Y be two random vectors, both in R^k, such that Y = g(X), where g is a one-to-one differentiable transformation. Then,

  f_Y(y) = f_X(x) / |det(dg(x))|

and hence

  h(Y) = h(X) + E[log |det(dg(X))|].

In particular,

  h(AX + B) = h(X) + log |det A|.

Note also that, for general g,

  h(Y) ≤ h(X) + E[log |det(dg(X))|].

9.6. Examples [9]:

• Let X_1^n have a multivariate normal distribution with mean µ and covariance matrix Λ. Then, h(X_1^n) = (1/2) log((2πe)^n |Λ|) bits, where |Λ| denotes the determinant of Λ. In particular, if the Xi's are independent normal r.v. with the same variance σ², then h(X_1^n) = (n/2) log 2πeσ².

  ◦ For any random vector X = X_1^n, we have E[(X − µ)^T Λ_X^{−1} (X − µ)] = ∫ f_X(x) (x − µ)^T Λ_X^{−1} (x − µ) dx = n.

• Exponential family: Suppose f_X(x) = (1/c(θ)) e^{θ^T T(x)}, where the real-valued normalization constant is c(θ) = ∫ e^{θ^T T(x)} dx. Then

  h(X) = ln c(θ) − (1/c(θ)) θ^T (∇_θ c(θ)) = ln c(θ) − θ^T (∇_θ ln c(θ)).

  In the 1-D case, we have f_X(x) = (1/c(θ)) e^{θ·T(x)}, where c(θ) = ∫ e^{θ·T(x)} dx, and

  h(X) = ln c(θ) − (θ/c(θ)) c′(θ) = ln c(θ) − θ (d/dθ) ln c(θ).

  See also (9.22).

• Let Y = (Y1, . . . , Yk) = (e^{X1}, . . . , e^{Xk}). Then h(Y) = h(X) + (log e) Σ_{i=1}^k EXi. Note that if X is jointly Gaussian, then Y is lognormal.

9.7. h(X|Y) ≤ h(X) with equality if and only if X and Y are independent.

9.8. Chain rule for differential entropy:

  h(X_1^n) = Σ_{i=1}^n h(Xi | X_1^{i−1}).

9.9. h(X_1^n) ≤ Σ_{i=1}^n h(Xi), with equality if and only if X1, X2, . . . , Xn are independent.

9.10 (Hadamard's inequality). |Λ| ≤ ∏_{i=1}^n Λii with equality iff Λij = 0 for i ≠ j, i.e., with equality iff Λ is a diagonal matrix.
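As a quick numerical illustration of 9.6, 9.9 and 9.10 (a sketch only; the covariance matrix below is generated at random rather than taken from the text): for a Gaussian vector, the joint differential entropy (1/2) log2((2πe)^n det Λ) never exceeds the sum of the marginal entropies (1/2) log2(2πe Λii), which is Hadamard's inequality in disguise.

n = 3;
B = randn(n); Lambda = B*B' + eye(n);             % a random positive definite covariance (assumption)
hJoint = 0.5*log2((2*pi*exp(1))^n * det(Lambda)); % joint entropy, cf. 9.6
hSum   = sum(0.5*log2(2*pi*exp(1)*diag(Lambda))); % sum of marginal entropies
[hJoint hSum]                                     % hJoint <= hSum, cf. 9.9 and 9.10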
The relative entropy (Kullback Leibler distance) D(f‖g) between two densities f and g (with respect to the Lebesgue measure m) is defined by

  D(f‖g) = ∫ f log (f/g) dm.    (16)

• It is finite only if the support set of f is contained in the support set of g. (I.e., it is infinite if ∃x0 with f(x0) > 0 and g(x0) = 0.)

• D(f‖g) is infinite if for some region R, ∫_R g(x) dx = 0 and ∫_R f(x) dx ≠ 0.

• For continuity, assume 0 log 0/0 = 0. Also, for a > 0, a log a/0 = ∞ and 0 log 0/a = 0.

• D(f_X‖f_Y) = E[log (f_X(X)/f_Y(X))] = −h(X) − E[log f_Y(X)].

• Relative entropy usually does not satisfy the symmetry property. In some special cases, it can be symmetric, e.g. (23).

9.11. D(f‖g) ≥ 0 with equality if and only if f = g almost everywhere (a.e.).

9.12. Relative entropy is invariant under invertible coordinate transformations such as scale changes and rotation of coordinate axes.

9.13. Relative entropy and uniform random variable: Let U be uniformly distributed on a set S. For any X with the same support,

  −E log f_U(X) = h(U)

and

  D(f_X‖f_U) = h(U) − h(X) ≥ 0.

This is a special case of (24).

9.14. Relative entropy and exponential random variable: Consider X on [0, ∞). Suppose X_E is exponential with the same mean as X. Then

  −E log f_{X_E}(X) = h(X_E)

and

  D(f_X‖f_{X_E}) = h(X_E) − h(X) ≥ 0.

This is a special case of (24).

The mutual information I(X; Y) between two random variables with joint density f(x, y) is defined as

  I(X; Y) = ∫ f(x, y) log [f(x, y)/(f(x)f(y))] dx dy    (17)
          = h(X) − h(X|Y) = h(Y) − h(Y|X)    (18)
          = h(X) + h(Y) − h(X, Y)    (19)
          = D(f_{X,Y} ‖ f_X f_Y)    (20)
          = lim_{∆→0} I(X^∆; Y^∆).    (21)

Hence, knowing how to find the differential entropy, we can find the mutual information from (19). Also, from (21), the mutual information between two random variables is the limit of the mutual information between their quantized versions. See also 9.4.

9.15. Mutual information is invariant under invertible coordinate transformations; that is, for random vectors X1 and X2 and invertible functions g1 and g2, we have

  I(g1(X1); g2(X2)) = I(X1; X2).

See also 9.12.

9.16. I(X; Y) ≥ 0 with equality if and only if X and Y are independent.

9.17. Gaussian Random Variables and Vectors

(a) Gaussian Upper Bound for Differential Entropy: For any random vector X,

  h(X) ≤ (1/2) log((2πe)^n det(Λ_X))

with equality iff X ∼ N(m, Λ_X) for some m. See also 9.22. Thus, among distributions with the same variance, the normal distribution maximizes the entropy.

In particular,

  h(X) ≤ (1/2) log(2πeσ_X²).

(b) For any random variable X and Gaussian Z,

  −E log f_Z(X) = (1/2) log(2πσ_Z²) + (log e) [σ_X² + (EX − EZ)²] / (2σ_Z²).

(c) Suppose X_N is Gaussian with the same mean and variance as X. Then,

  −E log f_{X_N}(X) = −E log f_{X_N}(X_N) = h(X_N) = (1/2) log(2πeσ_X²)

and

  h(X_N) − h(X) = D(f_X‖f_{X_N}) ≥ 0.    (22)

This is a special case of (24).

(d) The relative entropy D(f_X‖f_Y) between n-dimensional random vectors X ∼ N(m_X, Λ_X) and Y ∼ N(m_Y, Λ_Y) is given by

  (1/2) log (e^{−n} det Λ_Y / det Λ_X) + (1/2)(log e) [tr(Λ_Y^{−1} Λ_X) + (∆m)^T Λ_Y^{−1} (∆m)],
where ∆m = mX − mY . In 1-D, we have ˆ For Y ∈ [s0 , ∞), by the same reasoning, we have

1 σ2 1 2
σX

∆m
2 ! I (X; Y ) logK (e(µ − s0 )) − h(N ) log K e
log e−1 Y2 + (log e) + , ≤ ≤ h(N )
2 σX 2 σY2 σY EY µ − s0 K
but in this case shifted exponential on [S0 , ∞) maximize
or equivalently, h(Y ) for fixed EY = µ. Second inequality use µ =
2 ! s0 + K h(N ) which maximize the middle term.
2

σY 1 σX ∆m ˆ Suppose for Y ∈ [s0 , ∞), we now want to maximize I(X;Y )
log + (log e) + −1 EY +r
σX 2 σY2 σY for some r ≥ 0. Then, we can use the same technique but
will have to solve for optimal µ∗ of µ numerically. The
[24, p 1025]. In addition, when σX = σY = σ, the relative upper bound in this case is µlog K e
∗ −s .
0
entropy is simply
9.19. Additive Gaussian Noise [11, 18, 19]: Suppose N
2
is a (proper complex-valued multidimensional) Gaussian noise

1 mX − mY
D (fX kfY ) = (log e) . (23) which is independent of (a complex-valued random vector)
2 σ
X. Here the distribution of X is not required to be Gaussian.
Then, for √
(e) Monotonic decrease of the non-Gaussianness of the sum
Y = SNRX + N,
of independent random variables [23]: Consider i.i.d. ran-
Pn (n) we have
dom variables X1 , X2 , . . .. Let S (n) = k=1 Xk and SN
be a Gaussian random variable with the same mean and d h
2
i
(n)
variance as S . Then, I(X; Y ) = E |X − E [X|Y ]| .
dSNR
or equivalently, in an expanded form,
   
(n) (n−1)
D S (n) ||SN ≤ D S (n−1) ||SN .
√ h √
 i 2 
d
I(X; SNRX + N ) = E X − E X SNRX + N ,

dSNR
(f) Suppose X

Y is a jointly Gaussianrandom vector with
covariance matrix Λ = ΛΛYXX ΛΛXY Y
. The mutual infor- where the RHS is the MMSE corresponding to the best
mation between two jointly Gaussian vectors X and Y estimation of X upon the observation Y for a given signal-to-
is noise ratio (SNR). Here, the mutual information is in nats.
Furthermore, for a deterministic matrix A, suppose
 
1 (det ΛX )(det ΛY )
I(X; Y ) = log .
2 det Λ
Y = AX + N.
In particular, for jointly Gaussian random variables X, Y ,
we have Then, the gradient of the mutual information with respect to
the matrix A can be expressed as
 2 !
1 Cov (X, Y )
I (X; Y ) = − log 1 − . ∇A I (X; Y ) = Cov [X − E [X|Y ]]
2 σX σY
= AE (X − E [X|Y ])(X − E [X|Y ])H ),
 

9.18. Additive Channel: Suppose Y = X + N . Then, where the RHS is the covariance matrix of the estimation error
h(Y |X) = h(N |X) and thus I(X; Y ) = h(Y ) − h(N |X) be- vector, also known as the MMSE matrix. Here, the complex
cause h is translation invariant. In fact, h(Y |x) is always derivative of a real-valued scalar function f is defined as
h(N |x).  
Furthermore, if X and N are independent, then h(Y |X) = df 1 ∂f ∂f
= +j
h(N ) and I(X; Y ) = h(Y ) − h(N ). In fact, h(Y |x) is always dx∗ 2 ∂Re {x} ∂Im {x}
h(N ). ∂f
and the
h complex
i gradient matrix is defined as ∇A f = ∂A ∗

ˆ For nonnegative Y , ∂f
where ∂A∗ ∂f
= ∂[A∗ ]ij .
ij
I (X; Y ) logK (eEY ) − h (N ) log K e 9.20. Gaussian Additive Channel: Suppose X and N
≤ ≤ h(N )
EY EY K are independent Gaussian random vectors and
Y = X + N , then
where the first inequality is because exponential Y max-
imizes h(Y ) for fixed EY and the second inequality is 1 det (ΛX + ΛN )
because EY = K h(N ) maximizes the middle term. I (X; Y ) = log .
2 det ΛN

In particular, for one-dimensional case, ˆ Normal distribution is the law with maximum entropy
among all distributions with finite variances:
2
 
1 σX
I (X; Y ) = log 1 + 2 . If the constraints is on (1) EX 2 , or (2) σX
2
, then f ∗ has
2 σN the same form as a normal distribution. So, we just have
to find a Normal random variable satisfying the condition
9.21. Let Y = X + Z where X and Z are independent and (1) or (2).
X is Gaussian. Then,
ˆ Exponential distribution has maximum entropy among
ˆ among Z with fixed mean and variance, Gaussian Z all distributions concentrated on the positive halfline and
minimizes I(X; Y ). possessing finite expectations:
If S = [0, ∞) and EX = µ > 0, then
ˆ among Z with fixed EZ 2 , zero-mean Gaussian Z mini- x
f ∗ (x) = µ1 e− µ 1[0,∞) (x) (exponential) with correspond-
mizes I(X; Y ).
ing h (X ∗ ) = log (eµ).
If S = [s0 , ∞) and EX = µ > s0 , then f ∗ (x) =
9.1 MEPD 1
x−s
− µ−s0
µ−s0 e 1[s0 ,∞) (x) (shifted exponential) with corre-
0

9.22. Maximum Entropy Distributions: sponding h (X ∗ ) = log (e (µ − s0 )).


Consider fixed (1) closed S ⊂ R and (2) measurable
9.23. Most of the standard probability distributions can
functions g1 , . . . , gm on S. Let C be a class of probability
be characterized as being MEPD. Consider any pdf on S
density f (of a random variable X) which are supported m
P
k k λ g (x)
on S (f = R0 on S c ) and satisfy the moment constraints of the form fˆX (x) = c0 ek=1 with finite E [gk (X)] for
E [gi (X)] ≡ f (x) gi (x)dx = αi , for 1 ≤ i ≤ m. f ∗ is an k = 1, . . . , m. fˆX is the MEPD for the class of pdf f on S
MEPD for C if ∀f ∈ C, h(f ∗ ) ≥ h(f ). Define fλ on S by under the constraints by the E [gk (X)]’s. Hence, to see which
m m !−1
constraints characterize a pdf as an MEPD, rewrite the pdf
P P
λk gk (x) R λi gi (x)
fλ (x) = c0 ek=1 where c0 = ei=1 dx and in the exponential form.
S
λ1 , . . . , λm are chosen so that fλ ∈ C. Note that f ∗ and fλ
may not exist.

(a) If f* exists (and ∃ f̂ ∈ C such that f̂ > 0 on S), then fλ exists and fλ = f*.

(b) If fλ exists, then there exists a unique MEPD f* and f* = fλ.

In which case, ∀ f ∈ C,

  −E log fλ(X) = −∫ f(x) log fλ(x) dx = −∫ fλ(x) log fλ(x) dx = h(fλ)

and

  D(f‖fλ) = h(fλ) − h(f) ≥ 0.    (24)

(See also (9.6).)

• c0 > 0, or equivalently c0 = e^{λ0} for some λ0.

• h(fλ) = −(λ0 + Σ_{k=1}^m λk αk) log e.

  S            | Constraints                  | MEPD
  (a, b)       | No constraint                | U(a, b)
  [0, ∞) or R  | No constraint                | N/A
  [0, ∞)       | EX = µ > 0                   | E(1/µ)
  [s0, ∞)      | EX = µ > s0                  | Shifted exp.
  [a, b]       | EX = µ ∈ [a, b]              | Truncated exp.
  R            | EX = µ, Var X = σ²           | N(µ, σ²)
  [0, ∞)       | EX and E ln X                | Γ(q, λ)
  (0, 1)       | E ln X, E ln(1 − X)          | β(q1, q2)
  [0, ∞)       | E ln X, E ln(1 + X)          | Beta prime
  R            | E ln(1 + X²)                 | Cauchy
  R            | E ln(1 + X²) = 2 ln 2        | Std. Cauchy
  (0, ∞)       | E ln X = µ, Var ln X = σ²    | e^{N(µ,σ²)}
  R            | E|X| = w                     | Laplace L(1/w)
  (c, ∞)       | E ln X                       | Par(α, c)

Table 4: MEPD: Continuous cases. Most of the standard probability distributions can be characterized as being MEPD when values of one or more of the following moments are prescribed: EX, E|X|, EX², E ln X, E ln(1 − X), E ln(1 + X), E ln(1 + X²) [14].
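A small numerical illustration of the exponential row of Table 4 (a sketch; the mean µ = 2 and the two competing densities are arbitrary choices, not taken from the text): among densities on [0, ∞) with the same mean, the exponential attains the largest differential entropy, namely log(eµ).

mu = 2;
fExp = @(x) (1/mu)*exp(-x/mu);                        % exponential with mean mu
sig  = mu*sqrt(pi/2);                                 % half-normal scaled to have mean mu
fHN  = @(x) sqrt(2/(pi*sig^2))*exp(-x.^2/(2*sig^2));
fUni = @(x) (1/(2*mu))*(x <= 2*mu);                   % uniform on [0,2*mu], also mean mu
h = @(f,b) integral(@(x) -f(x).*log(max(f(x),realmin)), 0, b);   % entropy in nats
[h(fExp,Inf) h(fHN,Inf) h(fUni,2*mu) log(exp(1)*mu)]  % first entry is the largest and equals log(e*mu)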
ˆ Uniform distribution has maximum entropy among all
distributions with bounded support:
Let S = [a, b], with no other constraints. Then, the 9.24. Let X1 , X2 , . . ., Xn be n independent, symmetric
maximum entropy distribution is the uniform distribution random variables supported on thePinterval [−1, 1]. The differ-
n
over this range. ential entropy of their sum Sn = k=1 Xi is maximized when

d
X1 , . . . , Xn−1 are Bernoulli taking on +1 or −1 with equal where α = dβ log M (β) with equality if and only if
probability and Xn is uniformly distributed [17]. We list below
the properties of this maximizing distribution. Let Sn∗ be the f (x) = f ∗ (x) = c0 f˜ (x) eβg(x)
corresponding maximum differential entropy distribution. −1
Pn−1 where c0 = (M (β)) . Note that f ∗ is said to generate an
(a) The sum k=1 Xk is binomial on exponential family of distribution.
{2k − (n − 1) : k = 0, 1, . . . , n − 1} where the prob-
ability at the point 2j − (n − 1) is n−1
 n−1
j 2 . There is 9.2 Stochastic Processes
no close form expression for the differential entropy of
this binomial part. 9.27. Gaussian Processes: The entropy rate of the Gaussian
process (Xk ) with power spectrum S(w) is
(b) The maximum differential entropy is sum of the binomial
√ 1
Z π
part and the uniform part. The differential entropy of H ((Xk )) = ln 2πe + ln S(ω)dω
the uniform part is logK 2. 4π −π

(c) For n = 1 and n = 2, Sn∗ is uniformly distributed on [21, eq (15-130) p 568].


[−1, 1] and [−2, 2], respectively. 9.28. (Inhomogeneous) Poisson Processes: Let Πi be Pois-
9.25. Stationary Markov Chain: Consider the joint dis- son process with rate λi (t) on [0, T ]. The relative entropy
tribution fX,Y of (X, Y ) on S × S such that the marginal D(Π 1 kΠ2 ) between the two processes is given by
distributions fX = fY ≡ f satisfy the moment constraints Z T Z T
λ1 (t)
Z − λ1 (t) − λ2 (t)dt log e + λ1 (t) log dt
0 0 λ2 (t)
f (x) gi (x)dx = αi , for 1 ≤ i ≤ m.
[24, p 1025]. This is the same as D (P1 kP2 ) where
Note also that h (X) = h (Y ) = h (f ) . Let fλ be the n n
(mi (T )) Y λi (uk )
MEPD for the a class of probability density which are sup- Pi (n, un1 ) = e−mi (T ) .
ported on S and satisfy the moment constraints. Define | {z n! } k=1 mi (T )
(λ) Probability of having n | {z }
fX,Y (x, y) = fλ (x)fλ (y) with corresponding conditional den- points in [0,T ] Conditional pdf of
(λ) unordered times
sity fY |X (y|x) = fλ (y). Then,

(a) fX,Y = fX,Y maximizes h(X, Y ) and h(Y |X) with corre- 10 General Probability Space
(λ)

sponding maximum values 2h(fλ ) and h(fλ );


    10.1. Consider probability measures P and Q on a common
(λ) (λ) measurable space (Ω, F).
(b) D fX,Y fX,Y = h fX,Y − h (fX,Y ) = 2h (fλ ) −
h (X, Y ); ˆ If P is not absolutely continuous with respect to Q, then
D (P ||Q) = ∞.
(c) the conditional relative entropy can be expressed as
  ˆ If P  Q, then D (P ||Q) < ∞, the Radon-Nikodym
(λ) dP
derivative δ = dQ exists, and
D fY |X fY |X
fY |X (y |x )
Z Z Z Z
≡ f (x) fY |X (y |x ) log (λ) dydx D (P kQ ) = log δdP = δ log δdQ.
fY |X (y |x )
= h (fλ ) − h (Y |X ) The quantity log δ (if it exists) is called the entropy density
or relative entropy density of P with respect to Q [10,
So, for stationary (first-order) Markov chain X1 , X2 , . . . with lemma 5.2.3].
moment constraint(s), the entropy rate is maximized when If P and Q are discrete with corresponding pmf’s p and
Xi ’s are independent with marginal distribution being the dP
q, then dQ = pq and we have (6). If P and Q are both
corresponding MEPD distribution fλ . The maximum entropy absolutely continuous with respect to (σ-finite) measure M
rate is h(fλ ). (e.g. M = (P + Q)/2) with corresponding densities (Radon-
dP dQ
9.26. Minimum relative entropy from a reference dis- Nikodym derivatives) dM = δP and dM = δQ respectively,
dP δP
tribution [16]: Fix a pdf f˜Rand a measurable g. Consider then dQ = δQ and
any pdf f such that α = g (x) f (x)dx exists. Suppose Z
δP
M (β) = f˜ (x) eβg(x) dx exists for β in some interval. Then
R
D (P kQ ) = δP log dM.
δQ
 
D f f˜ ≥ αβ − log M (β)

If M is the Lebesgue measure m, then we have (16).

n
where the supre- and the supremum in (25) is achieved by the set B =
P (Ak )
P
10.2. D (P kQ ) = sup P (Ak ) log Q(Ak )
k=1 [δP > δQ ].
mum is taken on all finite partitions of the space. dT V is a true metric. In particular, 1) dT V (µ1 , µ2 ) ≥ 0 with
10.3. For random variables X and Y on common probability equality if and only if µ1 = µ2 , 2) dT V (µ1 , µ2 ) = dT V (µ2 , µ1 ),
space, we define and 3) d T V (µ1 , µ2 ) ≤ dT V (µ1 , ν) + dT V (ν, µ2 ). Furthermore,
because µi (A) ∈ [0, 1], we have |µ1 (A) − µ2 (A)| ≤ 1 and thus
I(X; Y ) = D P X,Y kP X × P Y dT V (µ1 , µ2 ) ≤ 1.


and 10.6 (Pinsker’s inequality).


H(X) = I(X; X). D(P ||Q) ≥ 2(log e)d2T V (P, Q).
10.4. More directly, we define mutual information in terms This is exactly (1). In other words, if P and Q are both
of finite partitions of the range of the the random variable absolutely continuous with respect to some measure M
[6, p 251–252]. Let X be the range of a random variable X ( i.e. P, Q  M ), and have corresponding densities (Radon-
and P = {Xi } be a finite partition of X . The quantization Nikodym derivatives) dM dP
= δP and dMdQ
= δQ respectively,
of X by P (denoted by [X]P ) is the discrete random variable then
defined by P [X ∈ Xi ]. For two random variables X and Y Z 2
with partitions P and Q, the mutual information between X
Z
δP
2 δP log dM ≥ (log e) |δP − δQ |dM .
and Y is given in terms of the its discrete version by δQ
I(X; Y ) = sup I ([X]P ; [Y ]Q ) , See [10, lemma 5.2.8] for detailed proof.
P,Q

where the supremum is over all finite partitions P and Q.


11 Typicality and AEP (Asymptotic
ˆ By continuing to refine the partitions P and Q, one finds Equipartition Properties)
a monotonically increasing sequence I ([X]P ; [Y ]Q ) % I.
The material in this section is based on (1) chapter 3 and
10.5. Let (X , A) be any measurable space. The total varia-
section 13.6 in [5], (2) chapter 5 in [26]. Berger [2] introduced
tion distance dT V between two probability measures P and
strong typicality which was further developed into the method
Q on X is defined to be
of types in the book by Csiszár and Körner [7]. First, we
dT V (P, Q) = sup |P (A) − Q (A)| . (25) consider discrete random variables.
A∈A
11.1. Weak Typicality: Consider a sequence {Xk : k ≥ 1}
The total variation distance between two random variables where Xk are i.i.d. with distribution pX (x). The quantity
n
X and Y is denoted by dT V (L (X) , L (Y )) where L (X) is
− n1 log p (xn1 ) = − n1
P
log p (xk ) is called the empirical en-
the distribution or law of X. We sometimes simply write k=1
dT V (X, Y ) with the understanding that it is in fact a function tropy of the sequence xn1 . By weak law of large number,
of the marginal distributions and not the joint distribution. If − n1 log p (X1n ) → H (X) in probability as n → ∞. That is
X and Y are discrete random variables, then
ˆ ∀ε > 0, lim P − n1 log p (X1n ) − H (X) < ε = 1;
 
n→∞
1X
dT V (X, Y ) = |pX (k) − pY (k)|. (1) ∀ε > 0, for n sufficiently large,
2
k∈X
 
1 n

If X and Y are absolutely continuous random variables with P − log p (X1 ) − H (X) < ε > 1 − ε.

n
densities fX (x) and fY (y), then
Z (n)
1 The weakly typical set Aε (X) w.r.t. p (x) is the set
dT V (X, Y ) = |fX (x) − fY (x)|dx.
2 of sequence x1 ∈ X such that − n1 log p (xn1 ) − H (X) ≤ ε,
n n
x∈X
or equivalently, 2−n(H(X)+ε) ≤ p (xn1 ) ≤ 2−n(H(X)−ε) , where
More generally, if P and Q are both absolutely continuous ε is an arbitrarily small positive real number. The sequence
(n)
with respect to some measure M ( i.e. P, Q  M ), and have xn1 ∈ Aε (X) are called weakly ε-typical sequences.The
dP
corresponding densities (Radon-Nikodym derivatives) dM = following hold ∀ε > 0:
dQ
δP and dM = δQ respectively, then
(2) For n sufficiently large,
Z Z
1 h i h i
dT V (P, Q) = |δP − δQ |dM = 1 − min (δP , δQ )dM P A (n)
ε (X) = P X n
1 ∈ A (n)
ε (X) > 1 − ε;
2

equivalently, (2) ∀ε0 > 0 ∃N > 0 such that ∀n > N
h c i h i
P A(n) ε (X) = P X1n ∈
/ A(n)
ε (X) < ε. (1 − ε0 ) 2n(H(X,Y )−ε) < A(n)

ε ≤2
n(H(X,Y )+ε)
,

(3) For n sufficiently large, where the second inequality is true ∀n ≥ 1.


(3) If X̃1n and Ỹ1n are independent with the same
(1 − ε) 2n(H(X)−ε) ≤ A(n) (X) ≤ 2n(H(X)+ε) . n n 0 −n(I(X;Y )+3ε)

ε marginals
h as P (x1 ,iy1 ), then (1 − ε ) 2 ≤
n n (n) −n(I(X;Y )−3ε)
P X̃1 , Ỹ1 ∈ Aε ≤2 where the second
Note that the second inequality in fact holds ∀n ≥ 1.
inequality is true ∀n ≥ 1.
Remark: This does not say that most of the sequences
in X n are weakly typical. In fact, when X is not uniform, 11.3. Weak Typicality:
|A(n)
ε (X)| Let (X1 , X2 , . . . , Xk ) denote a finite collection of discrete
|X |n → 0 as n → ∞. (If X is uniform, then all sequence 
random variables with some fixed joint distribution p xk1 ,
is typical.) Although the size of the weakly typical set may be k
(i)
xk1 ∈
Q
insignificant compared with the size of the set of all sequences, Xi . Let J ⊂ [k]. Suppose XJ is drawn i.i.d.
the former has almost all the probability. The most likely i=1  
 (n) (1) (n)
sequence is in general not weakly typical. Roughly speaking, p xk1 . We consider sequence xiJ i=(1) = xJ , . . . , xJ ∈
probability-wise, for n large, we only have to focus on ≈ 2nH(X)  k n k  (n)
Xin . Note that xiJ i=(1) is in fact a matrix
Q Q
typical sequences, each with probability ≈ 2−nH(X) . Xi =
i=1 i=1
k
11.2. Jointly Weak Typicality: A pair of sequences with n element. For conciseness, we shall denote it by s.
(xn1 , y1n ) ∈ X n × Y n is said to be (weakly) δ-typical w.r.t.
ˆ By law of large number, the
 empirical entropy
the distribution P (x, y) if 1 1

i (n)

− n log(S) = − n log XJ i=(1) → H (XJ ).
log p(xn )
(a) − n 1 − H (X) < ε,

(n) (n)
The set Aε of ε-typical n-sequence is Aε (XJ ) =
log q(y1n )

(b) − n − H (Y ) < ε, and

s : − n1 log p (s) − H (XT ) < ε, ∀ T ⊂ J .
6=∅
log P (xn1 ,y1n )
(c) − − H (X, Y ) < ε.

(n)
n
ˆ By definition, if J ⊂ K ⊂ [k], then Aε (XJ ) ⊂
(n) (n)
The set Aε (X, Y ) is the collection of all jointly typical se- Aε (XK ).
quences with respect to the distribution P (x, y). It is the set h
(n)
i
of n-sequences with empirical entropies ε- close to the true (1) ∀ε > 0 and n large enough P A ε (X J ) ≥ 1 − ε.
(n)
entropies. Note that if (xn1 , y1n ) ∈ Aε (X, Y ), then (n) .
(2) If s ∈ Aε (XJ ), then p (s) = 2−n(H(XJ )∓ε) .
ˆ 2 −n(H(X,Y )+ε) n n
< P (x1 , y1 ) < 2 −n(H(X,Y )−ε)
. n(H(XJ )±2ε)

(n)
(n) (3) ∀ε > 0 and n large enough (X J =2
) .
ˆ xn1 ∈ Aε (X) . That is A ε

2−n(H(X)+ε) < p (xn1 ) < 2−n(H(X)−ε) .


(n)
Consider disjoint Jr ⊂ [k]. Let sr = xiJr (1) .
(n)
ˆ y1n ∈ Aε (Y ) . That is (n)
(4) If (s1 , s2 ) ∈ Aε (XJ1 , XJ2 ), then, p (s1 |s2 )
.
=
2−n(H(Y )+ε) < q (y1n ) < 2−n(H(Y )−ε) . 2−n(H (XJ1 |XJ2 )±2ε) .
(n)
For any ε > 0, define Aε (XJ1 |s2 ) to be the set of sequences
Suppose that (Xi , Yi ) is drawn i.i.d. ∼ {P (x, y)}. Then,
s1 that are jointly ε-typical with a particular s2 sequence,
h i (n)
(1) lim P (X1n , Y1n ) ∈ Aε
(n)
= 1. i.e. the elements of the form(s1 , s2 ) ∈ Aε (XJ1 , XJ2 ). If
n→∞ (n)
s2 ∈ Aε (XJ2 ), then
Equivalently, ∀ε0 > 0 ∃N > 0 such that ∀n > N
h i (5.1) For sufficiently large n,
0 ≤ P (X1n , Y1n ) ∈
/ A(n)
ε < ε0
Aε (XJ1 |s2 ) ≤ 2n(H (XJ1 |XJ2 )+2ε) .
(n)
which is equivalent to
h i
1 − ε0 < P (X1n , Y1n ) ∈ A(n) (5.2) (1 − ε) 2n(H (XJ1 |XJ2 )−2ε) ≤
≤ 1.
P (n)
p (s2 ) Aε (XJ1 |s2 ) .

ε
s2

Suppose X̃Ji 1 , X̃Ji 2 , X̃Ji 3 is drawn i.i.d. according to 11.5. Jointly Strong Typicality:
p (xJ1 |xJ3 ) p (xJ2 |xJ3 ) p (xJ3 ), that is X̃Ji 1 , X̃Ji 2 are condition- Let {P (x, y) = p (x) Q (y |x ) , x ∈ X , y ∈ Y} be the joint
i
ally independent given X̃J3 but otherwise share the same pair- p.m.f. over X × Y. Denote the number of occurrences of
 (n) the point (x, y) in the pair of sequences (xn1 , y1n )
wise marginals of X̃J1 , X̃J2 , X̃J3 . Let s̃r = x̃iJr (1) . Then
N (a, b |xn1 , y1n ) = |{k : 1 ≤ k ≤ n, (xk , yk ) = (a, b)}|
.
h i
(n) n
(6) P S̃1 , S̃2 , S̃3 ∈ Aε (XJ1 , XJ2 , XJ3 ) = X
= 1 [xk = a] 1 [yk = b].
2−n(I (XJ1 ;XJ2 |XJ3 )∓6ε) . k=1

n
N (x, y |xn1 , y1n ) and N (y |y1n ) =
P P
11.4. Strong typicality Suppose X is finite. For a sequence Then, N (x |x1 ) =
y∈Y x∈X
∀xn1 ∈ X n and a ∈ X , define
N (x, y |xn1 , y1n ).
N (a |xn1 ) = |{k : 1 ≤ k ≤ n, xk = a}| A pair of sequences (xn1 , y1n ) ∈ X n × Y n is said to be
Pn strongly δ-typical w.r.t. {P (x, y)} if
= 1 [xk = a].
N (a, b |xn1 , y1n )

k=1 δ
∀a ∈ X , ∀b ∈ Y − P (a, b) < .
n
Then, N (a |x1 ) is the number of occurrences of the symbol a n |X | |Y|
in the sequence xn1 . Note that ∀xn1 ∈ X n N (a |xn1 ) = n. The set of all strongly typical sequences is called the strongly
P
x∈X
typical set and is denoted by
ˆ For i.i.d. Xi ∼ {p (x)}, ∀x1 ∈ X n n
Tδ = Tδn (pQ) = Tδn (X, Y )
N (x|xn = {(xn1 , y1n ) : (xn1 , y1n ) is δ - typical of {P (x, y)}} .
Y
1 )
pX1n (xn1 ) = p (x) ;
x∈X
Suppose (Xi , Yi ) is drawn i.i.d. ∼ {P (x, y)}.
A sequence xn1 ∈ X n is said to be δ-strongly typical
w.r.t. (1) lim P [(X1n , Y1n ) ∈
/ Tδn (X, Y )] = 0. That is ∀α > 0 for n
n→∞
N (a|xn )
{p (x)} if (1) ∀a ∈ X with p (a) > 0, n 1 − p (a) < |Xδ | , sufficiently large,

and (2) ∀a ∈ X with p (a) = 0, N (a |xn1 ) = 0. P [(X1n , Y1n ) ∈


/ Tδn (X, Y )] > 1 − α.
ˆ This implies
P 1 n

n N (a |x1 ) − p (a) < δ which is the typ-

x (2) Suppose (xn1 , y1n ) ∈ Tδn (X, Y ), then 2−n(H(X,.Y )+εδ ) <
icality condition used in [26, Yeung]. P (xn1 , y1n ) < 2−n(H(X,Y )−εδ ) where εδ = δ |log Pmin | and
Pmin = min {P (x, y) > 0 : x ∈ X , y ∈ Y}.
The set of all δ-strongly typical sequences above is called the
ˆ Suppose (xn1 , y1n ) ∈ Tδn (X, Y ), then xn1 ∈ Tδ (X) , y1n ∈
strongly typical set and is denoted Tδ = Tδn (p) = Tδn (X) =
Tδ (Y ). That is joint typicality ⇒ marginal typicality.
{xn1 ∈ X n : xn1 is δ-typical}.
Suppose Xi ’s are drawn i.i.d. ∼ pX (x). This further implies

(1) ∀δ > 0 lim P [X1n ∈ Tδ ] = 1 and lim P [X1n ∈


/ Tδ ] = 2−n(H(X)+εδ ) < pX1n (xn1 ) < 2−n(H(X)−εδ )
n→∞ n→∞
0. Equivalently, ∀α < 1, for n sufficiently large, and
P [X1n ∈ Tδ ] > 1 − α.
2−n(H(Y )+εδ ) < qY1n (y1n ) < 2−n(H(Y )−εδ ) .
Define pmin = min {p (x) > 0 : x ∈ X }, εδ = δ |log pmin |.
Note that pmin gives maximum |log pmin | and εδ > 0 ˆ Consider (xn1 , y1n ) ∈ Tδn (X, Y ). Suppose g : X × Y → R.
δ
can
 be made  arbitrary small by making δ small enough. Then, g (xn1 , y1n ) = Eg(U, V ) ± gmax |X ||Y| . where gmax =
maxu,v g(u, v).
lim εδ = 0 . Then,
δ→0
(3) ∀α > 0 for n sufficiently large, we have
log p n (xn )
(2) For xn1 ∈ Tδ (p), we have −
X1 1
− H (X) < εδ
(1 − α) 2n(H(X,Y )−εδ ) ≤ |Tδ (P )| ≤ 2n(H(X,Y )+εδ ) .
n
which is equivalent to 2−n(H(X)+εδ ) < pX1n (xn1 ) <   n
(4) If X̃1n , Ỹ1n ∼
Q
2−n(H(X)−εδ ) . Hence, Tδn (X) ⊂ Anεδ (X). (pX (xi ) qY (yi )) that is X̃i and Ỹi are
i=1
(3) ∀α < 1, for n sufficiently large, (1 − α) 2n(H(X)−εδ ) ≤ independent with the same marginals as Xi and Yi . Then,
|Tδ (X)| ≤ 2n(H(X)+εδ ) where the second inequality holds ∀αh> 0 for n sufficiently large,
i (1 − α) 2−n(I(X;Y )+3εδ ) ≤
∀n ≥ 1. P X̃1n , Ỹ1n ∈ Tδn (X, Y ) ≤ 2−n(I(X;Y )−3εδ ) .

For any xn1 ∈ Tδn (X), define implies that an and bn are equal to the first order in the
exponent.
Tδn (Y |X ) (xn1 ) = {y1n : (xn1 , y1n ) ∈ Tδn (X, Y )} .
The volume of the smallest set that contains most of
(5) For any xn1 such that ∃y1n with (xn1 , y1n ) ∈ Tδn (X, Y ), the probability is approximately 2nh(X) . This is an n-
dimensional volume, so the corresponding side length
1
. 0 is (2nh(X) ) n = 2h(X) . Differential entropy is then the
|Tδn (Y |X ) (xn1 )| = 2n(H(Y |X )±εδ ) logarithm of the equivalent side length of the smallest
where ε0δ → 0 as δ → 0 and n → ∞. set that contains most of the probability. Hence, low
entropy implies that the random variable is confined to
ˆ xn1 ∈ Tδn (X) combined with the condition of the state-
a small effective volume and high entropy indicates that
ment above is equivalent to |Tδn (Y |X ) (xn1 )| ≥ 1.
the random variable is widely dispersed.
(6) Let Yi be drawn i.i.d. ∼ qY (y), then Remark: Just as the entropy is related to the volume of
.
P [(xn1 , Y1n ) ∈ Tδn (X, Y )] = P [Y1n ∈ Tδn (Y |X ) (xn1 )] = the typical set, there is a quantity called Fisher informa-
00
2−n(I(X;Y )∓εδ ) , where ε00δ → 0 as δ → 0 and n → ∞. tion which is related to the surface area of the typical
set.
Now, we consider the continuous random variables.
11.7. Jointly Typical Sequences
11.6. The AEP for continuous random variables: (n)  
The set Aε of jointly typical sequences x(n) , y (n) with
n
Let (Xi )i=1 be a sequence of random variables drawn i.i.d. respect to the distribution f
X,Y (x, y) is the set of n-sequences
according to the density f (x). Then (n)
with empirical entropies ε-close to the true entropies, i.e., Aε
ˆ − 1 log f (X n ) → E [− log f (X)] = h(X) in probability. is the set of x(n) , y (n) ∈ X n × Y n such that
n 1

For ε > 0 and any n, we define the typical set with
(n)
Aε (a) − n1 log fX1n (xn1 ) − h (X) < ε,
respect to f (x) as follows:
(b) − n1 log fY1n (y1n ) − h (Y ) < ε, and
 
(n) n n 1 n

Aε = x1 ∈ S : − log f (x1 ) − h (X) ≤ ε ,
(c) − n1 log fX1n ,Y1n (xn1 , y1n ) − h (X, Y ) < ε.
n

where S is the support set of the random variable X, and Note that the followings give equivalent definition:
f (xn1 ) = Πni=1 f (xi ). Note that the condition is equivalent to
(a) 2−n(h(X)+ε) < fX1n (xn1 ) < 2−n(h(X)−ε) ,
2−n(h(X)+ε) ≤ f (xn1 ) ≤ 2−n(h(X)−ε) .
(b) 2−n(h(Y )+ε) < fY1n (y1n ) < 2−n(h(Y )−ε) , and
R We also define the volume Vol (A) of a set A to be Vol (A) = (c) 2−n(h(X,Y )+ε) < fX1n ,Y1n (xn1 , y1n ) < 2−n(h(X,Y )−ε) .
A
dx1 dx2 · · · dxn .
h
(n)
i Let (X1n , Y1n ) be sequences of length n drawn i.i.d. according
(1) P Aε > 1 − ε for n sufficiently large.
to fXi ,Yi (xi , yi ) = fX,Y (xi , yi ). Then
   
(n) (n)
(2) ∀n, Vol Aε ≤ 2n(h(X)+ε) , and Vol Aε ≥ (1 − h
(n)
i h
(n)
i
n n
(1) P A ε = P (X1 , Y1 ) ∈ A ε → 1 as n → ∞.
ε)2n(h(X)−ε) for n sufficiently large. That is for  n suffi-

(n)
ciently large, we have (1 − ε)2n(h(X)−ε) ≤ Vol Aε ≤ (2) A(n)
ε ≤ 2
n(H(X,Y )+ε)
, and for sufficiently large n,
n(h(X)+ε)

2 . (n)
A ≥ (1 − ε) 2n(h(X,Y )−ε) .

ε
(n)
ˆ The set Aε is the smallest volume set with probability
≥ 1 − ε to first order in the exponent. More specifically, (3) If (U1n , V1n ) ∼ fX1n (un1 ) fY1n (v1n ), i.e., U1n and V1n are
(n) independent with the same marginals as fX1n ,Y1n (xn1 , y1n ),
forh eachi n = 1, 2, . . ., let Bδ ⊂ S n be any set with
(n)
then
P Bδ ≥ 1 − δ. Let X1 , . . . , Xn be i.i.d. ∼ p(x). For h i
P (U1n , V1n ) ∈ A(n) ≤ 2−n(I(X;Y )−3ε) .
 
(n)
δ < and any δ 0 > 0, n1 log Vol Bδ
1
2 > h(X) − δ 0 for ε

n sufficiently large. Also, for sufficiently large n,


. .
   
(n) (n)
Equivalently, for δ < 12 , Vol Bδ = Vol Aε = 2nH . h i
.
The notation an = bn means lim n1 log abnn = 0, which P (U1n , V1n ) ∈ A(n)
ε ≥ (1 − ε) 2−n(I(X;Y )+3ε) .
n→∞

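The concentration behind the typicality results above is easy to see in simulation. The following sketch (plain MATLAB, in the spirit of Section 13; the three-symbol pmf is an arbitrary choice, not taken from the text) draws i.i.d. blocks and compares the empirical entropy −(1/n) log2 p(X1, . . . , Xn) of 11.1 with H(X).

p = [0.7 0.2 0.1];                              % source pmf (assumption)
H = -sum(p.*log2(p));                           % true entropy, about 1.157 bits
n = 10000; trials = 5; empH = zeros(1,trials);
for t = 1:trials
    u = rand(n,1);                              % toolbox-free i.i.d. sampling
    x = sum(bsxfun(@gt, u, cumsum(p)), 2) + 1;  % symbol indices in {1,2,3}
    empH(t) = -mean(log2(p(x)));                % empirical entropy of the block
end
[empH; H*ones(1,trials)]                        % the two rows should nearly agree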
12 I-measure In fact, µ∗ is the unique signed measure on Fn which is
consistent with all Shannon’s information measures. We then
In this section, we present theories which establish one-to- have substitution of symbols as shown in table (5).
one correspondence between Shannon’s information measures
and set theory. The resulting theorems provide alternative H, I ↔ µ∗
approach to information-theoretic equalities and inequalities. , ↔ ∪
Consider n random variables X1 , X2 , . . . , Xn . For any ran- ; ↔ ∩
dom variable X, let X̃ be aSset corresponding to X. Define | ↔ -
the universal set Ω to be X̃i . The field Fn generated
i∈[n] Table 5: Substitution of symbols
by sets X̃1 , X̃2 , . . . , X̃n is the collection of sets which can be
obtained by any sequence of usual set operations (union, inter-
section, complement, and difference) on X̃1 , X̃2 , . . . , X̃n . The Motivated by the substitution of symbols, we
n
 

will write µ X̃ ∩ X̃ ∩ · · · ∩ X̃ − X̃ as
T
atoms of Fn are sets of the form Yi , where Yi is either G1 G2 Gm F
i=1  
that all atoms in Fn are disjoint. The set I X̃G1 ; X̃G2 ; · · · ; X̃Gm X̃F .

c
X̃i or X̃
Ti . Note
A0 = X̃ic = ∅ is called the empty atom of Fn . All the
i∈Nn 12.3. If there is no constraint on X1 , X2 , . . . , Xn , then µ∗
atoms of Fn other than A0 are called nonempty atoms. Let can take any set of nonnegative values on the nonempty atoms
A be the set of all nonempty atoms of Fn . Then, |A|, the of Fn .
cardinality of A, is equal to 2n − 1.
12.4. Because of the one-to-one correspondence between
ˆ Each set in Fn can be expressed uniquely as the union of Shannon’s information measures and set theory, it is valid
a subset of the atoms of Fn . to use an information diagram, which is a variation of a
Venn diagram, to represent relationship between Shannon’s
ˆ Any signed measure µ on Fn is completely specified by information measures. However, one must be careful. An
the values of µ on the nonempty atoms of Fn . I-measure µ∗ can take negative values. Therefore, when we
see in an information diagram that A is a subset of B, we
12.1. We define the I -measure µ∗ on Fn by cannot conclude from this fact alone that µ∗ (A) ≤ µ∗ (B)
  unless we know from the setup of the problem that µ∗ is
µ∗ X̃G = H (XG ) for all nonempty G ⊂ [n] . nonnegative. For example, µ∗ is nonnegative if the random
variables involved form a Markov chain.
12.2. For all (not necessarily disjoint) subsets G, G0 , G00 of
[n] n
ˆ For a given n, there are n
P 
k nonempty atoms that do
    k=3
(a) µ∗ X̃G ∪ X̃G00 = µ∗ X̃G∪G00 = H (XG∪G00 ) not correspond to Shannon’s information measures and
  hence can be negative.
(b) µ∗ X̃G ∩ X̃G0 − X̃G00 = I (XG ; XG0 |XG00 )
ˆ For n ≥ 4, it is not possible to display an information
diagram perfectly in two dimensions. In general, an
(c) µ∗ (A0 ) = 0
information diagram for n random variables, needs n − 1
Note that (2) is the necessary and sufficient condition for dimensions to be displayed perfectly.
µ∗ to be consistent with all Shannon’s information measures ˆ In information diagram, the universal set Ω is not shown
because explicitly.
ˆ When 0
 G and G are nonempty,
 ˆ When µ∗ takes the value zero on an atom A of Fn , we do

µ X̃G ∩ X̃G0 − X̃G00 = I (XG ; XG0 |XG00 ). not need to display A in an information diagram because
  A does not contribute to µ∗ (B) for any set B ∈ Fn
ˆ When G00 = ∅, we have µ∗ X̃G ∩ X̃G0 = I (XG ; XG0 ). containing the atom A.

ˆ When 0 12.5. Special cases:


 G = G, we have
µ∗ X̃G − X̃G00 = I (XG ; XG |XG00 ) = H (XG |XG00 ). ˆ When we are given that X and Y are independent, we
can’t draw X̃ and Ỹ as disjoint sets because we can’t
ˆ When 0 00
 G = G, and G = ∅, we have guarantee that all subsets (or atoms which are subsets
µ∗ X̃G = I (XG ; XG ) = H (XG ). of) of X̃ ∩ Ỹ has µ∗ = 0.

◦ For example, try X and Y Bernoulli(0.5) on {0, 1}. Then, conditioned on another random variable Z = X + Y mod 2, they are no longer independent. In fact, conditioned on Z, knowing Y completely specifies X and vice versa. More explicitly, (1) H(X) = H(Y) = H(Z) = 1; (2) any pair of the variables is independent: I(X; Y) = I(Z; Y) = I(X; Z) = 0; (3) given any two, the last one is determined: H(X|Y, Z) = H(Y|X, Z) = H(Z|X, Y) = 0; and (4) I(X; Y|Z) = H(X|Z) − H(X|Y, Z) = 1 − 0 = 1, and by symmetry I(X; Y|Z) = I(Z; Y|X) = I(Z; X|Y) = 1. In particular, this is a case in which I(X; Y) < I(X; Y|Z).

• When Y = g(X), we can draw Ỹ as a subset of X̃. That is, any atom Ṽ which is a subset of Ỹ \ X̃ = Ỹ ∩ X̃^c satisfies µ∗(Ṽ) = 0. In fact, let I1 and I2 be disjoint index sets. Then, for any set of the form

  Ṽ = Ỹ ∩ (∩_{i∈I1} Z̃i) ∩ X̃^c ∩ (∩_{j∈I2} Z̃j^c),

we have µ∗(Ṽ) = 0. In other words,

  ◦ H(g(X) | X, Z1, . . . , Zn) = 0;

  ◦ I(g(X); V1; V2; · · · ; Vm | X, Z1, . . . , Zn) = 0.

12.6. For two random variables, µ∗ is always nonnegative. The information diagram is shown in Figure 14.

Proof. For n = 2, the three nonempty atoms of F2 are X̃1 ∩ X̃2, X̃1 − X̃2, and X̃2 − X̃1. The values of µ∗ on these atoms are I(X1; X2), H(X1|X2), and H(X2|X1), respectively. These quantities are Shannon's information measures and hence nonnegative by the basic inequalities. µ∗ for any set in F2 is a sum of µ∗ on atoms and hence is always nonnegative.

Figure 14: Information diagram for n = 2.

12.7. For n = 3, µ∗(X̃1 ∩ X̃2 ∩ X̃3) = I(X1; X2; X3) can be negative. µ∗ is always nonnegative on the other nonempty atoms.

Proof. We will give an example which has I(X1; X2; X3) < 0. Let X1 ⊕ X2 ⊕ X3 = 0. Then, for distinct i, j, and k, Xi = f(Xj, Xk), and H(Xi | Xj, Xk) = 0. So, for all i ≠ j, H(X1, X2, X3) = H(Xi, Xj). Furthermore, let X1, X2, X3 be pairwise independent. Then, for all i ≠ j, I(Xi; Xj) = 0. Hence, I(X1; X2; X3) = I(X1; X2) − I(X1; X2|X3) = 0 − H(X1) < 0.

Figure 15: Information diagram for n = 3.

12.8. Information diagram for the Markov chain X1 → X2 → X3:

12.9. For four random variables (or random vectors), the atoms colored in sky-blue in Figure 16 can be negative.

Figure 16: Information diagram for n = 4.

12.10. For a Markov chain X1 − X2 − · · · − Xn, the information diagram can be displayed in two dimensions. One such construction is Figure 17. The I-measure µ∗ for a Markov chain X1 → · · · → Xn is always nonnegative. This facilitates the use of the information diagram because, if B ⊂ B′ in the information diagram, then µ∗(B′) ≥ µ∗(B).

Figure 17: The information diagram for the Markov chain X1 → · · · → Xn.
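The example above is easy to check in numbers. The sketch below (plain MATLAB, in the style of Section 13) evaluates the triple mutual information by inclusion-exclusion, I(X; Y; Z) = H(X) + H(Y) + H(Z) − H(X, Y) − H(X, Z) − H(Y, Z) + H(X, Y, Z), for the binary XOR example and returns −1 bit, confirming that the central atom can indeed be negative (12.7).

Hf = @(p) -sum(p(p>0).*log2(p(p>0)));   % entropy (in bits) of a pmf vector
HX = Hf([.5 .5]); HY = HX; HZ = HX;     % each marginal is Bernoulli(1/2)
HXY = Hf(ones(1,4)/4);                  % (X,Y) is uniform on four pairs
HXZ = HXY; HYZ = HXY;                   % same for (X,Z) and (Y,Z)
HXYZ = Hf(ones(1,4)/4);                 % the triple takes only four equally likely values
IXYZ = HX+HY+HZ - HXY-HXZ-HYZ + HXYZ    % = -1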
13 MATLAB

In this section, we provide some MATLAB codes for calculating information-theoretic quantities.

13.1. Function entropy2 calculates the entropy H(X) = H(p) (in bits) of a pmf pX specified by a row vector p where the ith element is the probability of xi.

function H = entropy2(p)
% ENTROPY2 accepts probability mass function
% as a row vector, calculate the corresponding
% entropy in bits.
p=p(find(abs(sort(p)-1)>1e-8));
p=p(find(abs(p)>1e-8));
if length(p)==0
H = 0;
else
H = -sum(p.*log(p))/log(2);
end

13.2. Function information calculates the mutual information I(X; Y) = I(p, Q) where p is the row vector describing pX and Q is a matrix defined by Qij = P[Y = yj | X = xi].
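For instance, assuming entropy2 and information from this section are on the MATLAB path, a quick sanity check on a binary symmetric channel with crossover probability 0.1 and a uniform input (both values are arbitrary choices for the example):

p = [0.5 0.5];                    % uniform input pmf
Q = [0.9 0.1; 0.1 0.9];           % BSC transition matrix, Q(i,j) = P[Y = yj | X = xi]
I = information(p,Q)              % about 0.531 bits
Icheck = 1 - entropy2([0.9 0.1])  % the familiar BSC formula 1 - H(0.1); should match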

function I = information(p,Q) function [ps C] = capacity fmincon(p,Q)
X = length(p); % CAPACITY FMINCON accepts initial input
q = p*Q; % probability mass function p = pX and
HY = entropy2(q); % transition matrix Q = fY |X ,
temp = []; % calculate the corresponding capacity.
for i = 1:X % Note that p is a column vector.
temp = [temp entropy2(Q(i,:))]; mi = @(p) −information(p.',Q);
end sp = size(p);
HYgX = sum(p.*temp); onep = ones(sp); zerop = zeros(sp);
I = HY−HYgX; [ps Cm] = fmincon(mi,p,[],[],onep.',1,zerop);
%The 5th and 6th arguments force the sum of
%elements in p to be 1. The 7th argument forces
13.3. Function capacity calculates the pmf p∗ which achieves %the elements to be nonnegative.
ps; C = −Cm;
capacity C = maxp I(p, Q) using Blahut-Arimoto algorithm
[6, Section 10.8]. Given a DMC with transition probabilities
13.4. The following script demonstrates how to use the symbolic
Q(y|x) and any input distribution p0 (x), define a sequence
toolbox to calculate the mutual information between two
pr (x), r = 0, 1, . . . according to the iterative prescription
continuous random variables.
pr (x) cr (x)
pr+1 (x) = P ,
pr (x) cr (x) syms x y
x %Define the densities fX and fY |X
fX = 1/sqrt(2*pi*4)*exp(−1/2*xˆ2/4);
where fYcX = 1/sqrt(2*pi)*exp(−1/2*(y−x)ˆ2);
X Q (y |x ) %Support for X and Y
log cr (x) = Q (y |x ) log (26) rX = [−inf, inf];
y
qr (y)
rY = [−inf, inf];
%Calculate mutual information
and X fY = int(fX*fYcX,x,rX(1),rX(2));
qr (y) = pr (x) Q (y |x ). hY = −int(fY*log2(fY),y,rY(1),rY(2));
x hYcx = −int(fYcX*log2(fYcX),y,rY(1),rY(2));
hYcX = int(fX*hYcx,x,rX(1),rX(2));
Then, IXY = hY−hYcX;
eval(IXY)
!
X  
log pr (x) cr (x) ≤ C ≤ log max cr (x) .
x
x
References

Note that (26) is D PY |X=x | PY when PX = pr .
[1] N. Abramson. Information theory and coding. McGraw-
Hill, New York, 1963. 1
function ps = capacity(pT,Q,n)
%n = number of iteration [2] T. Berger. Multiterminal source coding. In Lecture notes
for k = 1:n presented at the 1977 CISM Summer School, Udine, Italy,
X = size(Q,1); July 18-20 1977. 11
Y = size(Q,2);
qT = pT*Q; [3] Richard E. Blahut. Principles and practice of information
CT = [];
theory. Addison-Wesley Longman Publishing Co., Inc.,
for i = 1:X
sQlq = Q(i,:).*log2(qT); Boston, MA, USA, 1987. 9, 9.1
temp = −entropy2(Q(i,:))−sum(sQlq);
CT = [CT 2ˆ(temp)]; [4] A. S. Cohen and R. Zamir. Entropy amplification property
end and the loss for writing on dirty paper. Information
CT; Theory, IEEE Transactions on, 54(4):1477–1487, April
temp = sum(pT.*CT);
2008. 2.21
pT = 1/temp*(pT.*CT);
plt = false; % set to true to plot the input pmf at each iteration
if(plt)
figure
plot(pT)
end
end 11
ps = pT;
[6] Thomas M. Cover and Joy A. Thomas. Elements of
Alternatively, the following code uses the MATLAB function
fmincon to find p∗ . 3.11, 3.17, 3.19, 3.20, 4.1, 4.13, 9.4, 10.4, 13.3

[7] I. Csiszár and J. Körner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, 1981. 11

[8] I. Csiszár and G. Tusnády. Information geometry and alternating minimization procedures. Recent results in estimation theory and related topics, 1984. 3.19

[9] G. A. Darbellay and I. Vajda. Entropy expressions for multivariate continuous distributions. IEEE Transactions on Information Theory, 46:709–712, 2000. 9.6

[10] R. M. Gray. Entropy and Information Theory. Springer-Verlag, New York, 1990. 10.1, 10.6

[11] D. Guo, S. Shamai (Shitz), and S. Verdú. Mutual information and minimum mean-square error in Gaussian channels. IEEE Transactions on Information Theory, 51:1261–1282, 2005. 9.19

[12] Oliver Johnson. Information Theory and the Central Limit Theorem. Imperial College Press, 2004. 3.15, 9.1

[13] J. N. Kapur and H. K. Kesavan. Entropy Optimization Principles With Applications. Academic Press, 1992. 3.1, 3.6, 3.17, 3.18

[14] Jagat Narain Kapur. Maximum Entropy Models in Science and Engineering. John Wiley & Sons, New York, 1989. 2.23, 3, 4

[15] I. Kontoyiannis, P. Harremoës, and O. Johnson. Entropy and the law of small numbers. IEEE Transactions on Information Theory, 51:466–472, 2005. 3.15, 7.3

[16] S. Kullback. Information Theory and Statistics. Peter Smith, Gloucester, 1978. 9.26

[17] E. Ordentlich. Maximizing the entropy of a sum of independent bounded random variables. IEEE Transactions on Information Theory, 52(5):2176–2181, May 2006. 2.22, 9.24

[18] D. P. Palomar and S. Verdú. Gradient of mutual information in linear vector Gaussian channels. IEEE Transactions on Information Theory, 52:141–154, 2006. 9.19

[19] D. P. Palomar and S. Verdú. Representation of mutual information via input estimates. IEEE Transactions on Information Theory, 53:453–470, 2007. 9.19

[20] D. P. Palomar and S. Verdú. Lautum information. IEEE Transactions on Information Theory, 54(3):964–975, March 2008. 4.14

[21] Athanasios Papoulis. Probability, Random Variables and Stochastic Processes. McGraw-Hill Companies, 1991. 2.2, 9.27

[22] Claude E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27(3):379–423, July 1948. Continued 27(4):623–656, October 1948. (document)

[23] A. M. Tulino and S. Verdú. Monotonic decrease of the non-Gaussianness of the sum of independent random variables: A simple proof. IEEE Transactions on Information Theory, 52:4295–4297, 2006. 5

[24] S. Verdú. On channel capacity per unit cost. IEEE Transactions on Information Theory, 36(5):1019–1030, September 1990. 4, 9.28

[25] A. C. G. Verdugo Lazo and P. N. Rathie. On the entropy of continuous probability distributions. IEEE Transactions on Information Theory, 24:120–122, 1978. 9.3

[26] Raymond W. Yeung. First Course in Information Theory. Kluwer Academic Publishers, 2002. 11, 11.4

[27] R. Zamir and U. Erez. A Gaussian input is not too bad. IEEE Transactions on Information Theory, 50(6):1362–1367, June 2004. 4.11