Prapun Suksompong
ps92@cornell.edu
May 6, 2008
4 Mutual Information: I . . . . . . . . . . . . . . . . 11
5 Functions of random variables . . . . . . . . . . . . 13
6 Markov Chain and Markov Strings . . . . . . . . . . . 14
  6.1 Homogeneous Markov Chain . . . . . . . . . . . . . 15
12 I-measure . . . . . . . . . . . . . . . . . . . . . . 26
13 MATLAB . . . . . . . . . . . . . . . . . . . . . . . 27

[Figure 1: plots of ln(x) and x − 1.]

Note: for small x, −(1 − x) ln(1 − x) ≈ x.

1.5. Log-sum inequality: For positive numbers a_1, a_2, . . . and nonnegative numbers b_1, b_2, . . . such that ∑_i a_i < ∞ and 0 < ∑_i b_i < ∞,

  ∑_i a_i log(a_i/b_i) ≥ (∑_i a_i) log( (∑_i a_i) / (∑_i b_i) ),

with the convention that log(a/0) = ∞. Moreover, equality holds if and only if a_i/b_i = constant ∀i.

Note: x log x is convex ∪.

The self-information of an outcome x is
  i(x) = i(x; x) = log( p(x|x)/p(x) ) = log( 1/p(x) ) = −log p(x).
1.7. For a measure space (X, F, µ), suppose nonnegative f and g are µ-integrable. Consider A ∈ F and assume g > 0 on A.

• Divergence/Gibbs Inequality: Suppose ∫_A f dµ = ∫_A g dµ < ∞. Then

  ∫_A f log(f/g) dµ ≥ 0,

with equality if and only if f = g µ-a.e. on A.

• Log-sum inequality: Suppose ∫_A f dµ, ∫_A g dµ < ∞. Then

  ∫_A f log(f/g) dµ ≥ (∫_A f dµ) log( (∫_A f dµ) / (∫_A g dµ) ).

In fact, the log can be replaced by any function h : [0, ∞] → R such that h(y) ≥ c(1 − 1/y) for some c > 0.

• Pinsker's inequality: Suppose ∫ f dµ = ∫ g dµ = 1 (A = X). Then

  ∫ f log(f/g) dµ ≥ (log e)/2 ( ∫ |f − g| dµ )².   (1)

(Proof fragment: with r(β) = α ln(α/β) + (1 − α) ln((1 − α)/(1 − β)), note that ∫ |f − g| dµ = 2(β − α) and r(β) = r(α) + ∫_α^β r′(t) dt.)

Proof of the discrete log-sum inequality: it suffices to show the two-term case
  a_1 log(a_1/b_1) + a_2 log(a_2/b_2) ≥ (a_1 + a_2) log( (a_1 + a_2)/(b_1 + b_2) ).
The proof follows from rewriting the inequality as ∑_i a_i log(a_i/b_i) ≥ ∑_i a_i log(A/B), where A = ∑_i a_i and B = ∑_i b_i, combining the sums, and applying ln(x) ≥ 1 − 1/x.

H(X) ≤ log|X| with equality iff ∀x ∈ X, p(x) = 1/|X| (X has a uniform distribution over X).

Proof. H(X) − log|X| = E[−log p(X)] − log|X| = E[ log( 1/(|X| p(X)) ) ] ≤ (log e) E[ 1/(|X| p(X)) − 1 ] = (log e) ( ∑_{x∈X} p(x) · 1/(|X| p(x)) − 1 ) = (log e) ( ∑_{x∈X} 1/|X| − 1 ) = 0.

Proof. Alternatively, let q(x) = 1/|X| ∀x ∈ X. Then D(p‖q) = E[ log( p(X)/q(X) ) ] = −H(X) + log|X|. We know that D(p‖q) ≥ 0 with equality iff p(x) = q(x) ∀x ∈ X, i.e., p(x) = 1/|X| ∀x ∈ X.

Proof. (Lagrange multipliers) Let p_i = p(i) and set G(p) = −∑_i p_i ln p_i + λ (∑_i p_i). Then ∂G/∂p_i = −(ln p_i + 1) + λ; setting this to 0 gives p_i = e^{λ−1}, so all the p_i are the same.

Mutual information
• A measure of the amount of information that one random variable contains about another random variable: H(X|Y) = H(X) − I(X; Y).
• The reduction in the uncertainty of one random variable due to the knowledge of the other.
• A special case of relative entropy.
• On average we need H({p(x, y)}) bits to describe (x, y). If instead X and Y are assumed independent, then we would need on average H({p(x, y)}) + D( p(x, y) ‖ p(x)p(y) ) bits to describe (x, y).
• In view of relative entropy, it is natural to think of I(X; Y) as a measure of how far X and Y are from being independent.
• Average mutual information:
  I(X; Y) = ∑_{x∈X} ∑_{y∈Y} p(x, y) log( p(x, y) / (p(x) p(y)) ) = E[ i(X; Y) ],
which equals 0 iff X and Y are independent.

Figure 2: p log p and i(x) = −log p(x).

1.8. For n ∈ N = {1, 2, . . .}, we define [n] = {1, 2, . . . , n}.
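As a numerical sanity check of the log-sum inequality above, the following minimal Python sketch (the function names and test values are illustrative, not from the notes) compares the two sides and confirms the equality case a_i/b_i = constant:

```python
import math

def log_sum_lhs(a, b):
    # sum_i a_i * log2(a_i / b_i), with the convention log(a/0) = +inf
    return sum(ai * math.log2(ai / bi) if bi > 0 else math.inf
               for ai, bi in zip(a, b) if ai > 0)

def log_sum_rhs(a, b):
    # (sum_i a_i) * log2(sum_i a_i / sum_i b_i)
    A, B = sum(a), sum(b)
    return A * math.log2(A / B)

a = [0.2, 1.3, 0.7]
b = [0.9, 0.4, 1.1]
print(log_sum_lhs(a, b) >= log_sum_rhs(a, b))   # True

# equality case: b_i = 2*a_i, i.e., a_i/b_i constant
b_eq = [2 * x for x in a]
print(abs(log_sum_lhs(a, b_eq) - log_sum_rhs(a, b_eq)) < 1e-9)  # True
```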
1.9. Suppose I is an index set. When the X_i's are random variables, we define a random vector X_I by X_I = (X_i : i ∈ I). Then, for disjoint A, B, X_{A∪B} = (X_A, X_B). If I = [n], then we write X_I = X_1^n.
When the A_i's are sets, we define the set A_I by the union ∪_{i∈I} A_i.

We think of information gain (I) as the removal of uncertainty. The quantification of information then necessitates the development of a way to measure the level of uncertainty (H). In what follows, although the entropy (H), relative entropy (D), and mutual information (I) are defined in terms of random variables, their definitions extend to random vectors in a straightforward manner. Any collection X_1^n of discrete random variables can be thought of as a discrete random variable itself.

2 Entropy: H

We begin with the concept of entropy, which is a measure of the uncertainty of a random variable [6, p 13]. Let X be a discrete random variable which takes values in an alphabet X.

The entropy H(X) of a discrete random variable X is a functional of the distribution of X defined by

  H(X) = −∑_{x∈X} p(x) log p(x) = −E[log p(X)]
       ≥ 0, with equality iff ∃x ∈ X with p(x) = 1;
       ≤ log|X|, with equality iff ∀x ∈ X, p(x) = 1/|X|.

2.1. Entropy is both shift and scale invariant; that is, ∀a ≠ 0 and ∀b: H(aX + b) = H(X). In fact, for any injective (1-1) function g on X, we have H(g(X)) = H(X).

2.2. H({p(x)}) is concave (convex ∩) in {p(x)}. That is, ∀λ ∈ [0, 1] and any two p.m.f.s {p_1(x), x ∈ X} and {p_2(x), x ∈ X}, we have
  H(p*) ≥ λ H(p_1) + (1 − λ) H(p_2),
where p*(x) ≡ λ p_1(x) + (1 − λ) p_2(x) ∀x ∈ X.

2.3. Asymptotic value of multinomial coefficient [6, Q11.21 p 406]: Fix a p.m.f. P = (p_1, p_2, . . . , p_m). For i = 1, . . . , m − 1, define a_{n,i} = ⌊n p_i⌋. Set a_{n,m} = n − ∑_{j=1}^{m−1} ⌊n p_j⌋ so that ∑_{i=1}^m a_{n,i} = n. Then,

  lim_{n→∞} (1/n) log ( n over a_{n,1}, a_{n,2}, · · · , a_{n,m} ) = lim_{n→∞} (1/n) log ( n! / ∏_{i=1}^m a_{n,i}! ) = H(P).   (2)

2.4. (Differential Entropy Bound on Discrete Entropy) For X on {a_1, a_2, . . .}, let p_i = p_X(a_i); then

  H(X) ≤ (1/2) log( 2πe ( Var[X] + 1/12 ) ),

where Var[X] = ∑_{i∈N} i² p_i − ( ∑_{i∈N} i p_i )².
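The bound in 2.4 can be checked numerically. A minimal Python sketch (names and the example pmf are ours, not from the notes), treating the support as {1, 2, . . .}:

```python
import math

def H(p):
    # discrete entropy in bits of a pmf given as a list of probabilities
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def discrete_entropy_bound(p):
    # 2.4: H(X) <= 0.5*log2(2*pi*e*(Var[X] + 1/12)), X supported on {1, 2, ...}
    m1 = sum(i * pi for i, pi in enumerate(p, start=1))
    m2 = sum(i * i * pi for i, pi in enumerate(p, start=1))
    var = m2 - m1 * m1
    return 0.5 * math.log2(2 * math.pi * math.e * (var + 1 / 12))

p = [0.5, 0.25, 0.125, 0.125]
print(H(p))                                   # 1.75
print(H(p) <= discrete_entropy_bound(p))      # True
```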
X ∼                | Support X       | p_X(k)                      | H(X)
Uniform U_n        | {1, 2, . . ., n} | 1/n                        | log n
Bernoulli B(1, p)  | {0, 1}          | 1 − p, k = 0;  p, k = 1     | h_b(p)
Binomial B(n, p)   | {0, 1, . . ., n} | (n over k) p^k (1 − p)^{n−k} |
Geometric G(p)     | N ∪ {0}         | (1 − p) p^k                 | (1/(1−p)) h_b(p) = (1/(1−p)) h_b(1 − p) = EX log( (1 + EX)/EX ) + log(1 + EX)
Geometric G′(p)    | N               | (1 − p)^{k−1} p             | (1/p) h_b(p)
Poisson P(λ)       | N ∪ {0}         | e^{−λ} λ^k / k!             | λ log e − λ log λ + E[log X!]

Table 1: Examples of probability mass functions and corresponding discrete entropies. Here, p ∈ (0, 1) and λ > 0. h_b(p) = −p log p − (1 − p) log(1 − p) is the binary entropy function.
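As a quick check of one row of Table 1, the sketch below (our own function names; a truncated series stands in for the infinite sum) verifies the geometric entropy H(G(p)) = h_b(p)/(1 − p):

```python
import math

def hb(p):
    # binary entropy function in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def geometric_entropy(p, terms=500):
    # pmf pX(k) = (1-p) p^k on k = 0, 1, 2, ...; the tail beyond `terms`
    # is negligible for moderate p
    return -sum((1 - p) * p**k * math.log2((1 - p) * p**k)
                for k in range(terms))

p = 0.4
print(math.isclose(geometric_entropy(p), hb(p) / (1 - p), rel_tol=1e-9))  # True
```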
X ∼                        | f_X(x)                                                       | h(X)
Uniform U(a, b)            | (1/(b−a)) 1_[a,b](x)                                         | log(b − a) = (1/2) log(12 σ_X²) ≈ 1.79 + log₂ σ_X [bits]
Exponential E(λ)           | λ e^{−λx} 1_[0,∞)(x)                                         | log(e/λ) = log(e σ_X) ≈ 1.44 + log₂ σ_X [bits]
Shifted Exponential on [s_0, ∞), mean µ (µ > s_0) | (1/(µ−s_0)) e^{−(x−s_0)/(µ−s_0)} 1_[s_0,∞)(x) | log( e(µ − s_0) )
Bounded Exp.               | ( α/(e^{−αa} − e^{−αb}) ) e^{−αx} 1_[a,b](x)                 | log( (e^{1−αa} − e^{1−αb})/α ) + α ( (a e^{−αa} − b e^{−αb})/(e^{−αa} − e^{−αb}) ) log e
Laplacian L(α)             | (α/2) e^{−α|x|}                                              | log(2e/α) = (1/2) log(2e² σ_X²) ≈ 1.94 + log₂ σ_X [bits]
Normal N(µ, σ²)            | (1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)²}                            | (1/2) log(2πeσ²) ≈ 2.05 + log₂ σ [bits]
Gamma Γ(q, λ)              | ( λ^q x^{q−1} e^{−λx}/Γ(q) ) 1_(0,∞)(x)                      | q log e + (1 − q) ψ(q) + log( Γ(q)/λ )
Pareto Par(α)              | α x^{−(α+1)} 1_[1,∞)(x)                                      | −log α + (1/α + 1) log e
Par(α, c) = c Par(α)       | (α/c) (c/x)^{α+1} 1_(c,∞)(x)                                 | log(c/α) + (1/α + 1) log e
Beta β(q_1, q_2)           | ( Γ(q_1+q_2)/(Γ(q_1)Γ(q_2)) ) x^{q_1−1} (1−x)^{q_2−1} 1_(0,1)(x) | log B(q_1, q_2) − (q_1 − 1)(ψ(q_1) − ψ(q_1+q_2)) − (q_2 − 1)(ψ(q_2) − ψ(q_1+q_2))
Beta prime                 | ( Γ(q_1+q_2)/(Γ(q_1)Γ(q_2)) ) x^{q_1−1}/(x+1)^{q_1+q_2} 1_(0,∞)(x) |
Rayleigh                   | 2αx e^{−αx²} 1_[0,∞)(x)                                      | log( 1/(2√α) ) + (1 + γ/2) log e
Standard Cauchy            | (1/π) 1/(1+x²)                                               | log(4π)
Cau(α)                     | (1/π) α/(α²+x²)                                              | log(4πα)
Cau(α, d)                  | ( Γ(d)/(√π α Γ(d − 1/2)) ) 1/(1 + (x/α)²)^d                  |
Log Normal e^{N(µ,σ²)}     | (1/(σx√(2π))) e^{−(1/2)((ln x − µ)/σ)²} 1_(0,∞)(x)           | (1/2) log(2πeσ²) + µ log e

Table 2: Examples of probability density functions and their differential entropies. Here, c, α, q, q_1, q_2, σ, λ are all strictly positive and d > 1/2. γ = −ψ(1) ≈ 0.5772 is the Euler–Mascheroni constant. ψ(z) = (d/dz) log Γ(z) = (log e) Γ′(z)/Γ(z) is the digamma function. B(q_1, q_2) = Γ(q_1)Γ(q_2)/Γ(q_1+q_2) is the beta function.
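The Normal row of Table 2, including its numeric approximation, can be checked directly (a minimal sketch; the function name is ours):

```python
import math

def gaussian_h(sigma):
    # differential entropy of N(mu, sigma^2) in bits: 0.5*log2(2*pi*e*sigma^2)
    return 0.5 * math.log2(2 * math.pi * math.e * sigma * sigma)

for sigma in (0.5, 1.0, 4.0):
    approx = 2.05 + math.log2(sigma)     # approximation quoted in Table 2
    print(abs(gaussian_h(sigma) - approx) < 0.01)   # True
```

The gap is a constant ≈ 0.003 bits for every σ, since h(X) − log₂ σ = ½ log₂(2πe) ≈ 2.047.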
[Figure 4: H(P(λ)) and its approximation by h(N(λ, λ)) = (1/2) log(2πeλ).]

There are two kinds of bounds for the binary entropy function H(p), with q = 1 − p:

• Logarithmic bounds:
  (log e)(ln p)(ln q) ≤ H(p) ≤ (log e / ln 2)(ln p)(ln q).

• Power-type bounds:
  (log e)(ln 2)(4pq) ≤ H(p) ≤ (log e)(ln 2)(4pq)^{1/ln 4}.

2.7. For two random variables X and Y with a joint p.m.f. p(x, y) and marginal p.m.f.s p(x) and p(y), the conditional entropy is defined as

  H(Y|X) = −E[log p(Y|X)] = −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x) = ∑_{x∈X} p(x) H(Y|X = x),

where

  H(Y|X = x) = −E[ log p(Y|x) | X = x ] = −∑_{y∈Y} p(y|x) log p(y|x).

Note that H(Y|X) is a function of p(x, y), not just p(y|x).
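Both pairs of bounds on H(p) are easy to verify on a grid; in bits the power-type pair simplifies to 4pq ≤ H(p) ≤ (4pq)^{1/ln 4}. A minimal sketch (our own names, base-2 logs):

```python
import math

def hb(p):
    # binary entropy in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

ok = True
for i in range(1, 100):
    p = i / 100
    q = 1 - p
    log_lo = math.log2(math.e) * math.log(p) * math.log(q)
    log_hi = log_lo / math.log(2)
    pow_lo = 4 * p * q                       # (log2 e)(ln 2)(4pq) = 4pq
    pow_hi = (4 * p * q) ** (1 / math.log(4))
    ok = ok and (log_lo <= hb(p) <= log_hi) and (pow_lo <= hb(p) <= pow_hi)
print(ok)  # True
```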
Example 2.8. Thinned Poisson: Suppose we have X ∼ P(λ) and, conditioned on X = x, we define Y to be a binomial r.v. with size x and success probability s:

  X → s → Y.

Then, Y ∼ P(sλ) with H(Y) = H(P(sλ)). Moreover, conditioned on Y = y, X is distributed as y + P(λ(1 − s)), which is simply a Poisson r.v. shifted by y. Consequently,

  H(X|y) = H(P(λ(1 − s))) = H(X|Y).

[Figure 5: Binary Entropy Function.]

2.6. Binary Entropy Function: We define h_b(p), h(p), or H(p) to be the entropy of a binary random variable with P[X = 1] = p:
  h_b(p) = −p log p − (1 − p) log(1 − p).

(a) h(p) = h(1 − p).
(b) (dh/dp)(p) = log( (1 − p)/p ).
(c) (d/dx) h(g(x)) = (dg/dx)(x) log( (1 − g(x))/g(x) ).
(d) 2^{h(p)} = p^{−p} (1 − p)^{−(1−p)}.
(e) 2^{−h(p)} = p^p (1 − p)^{1−p}.
(f) h(1/b) = log b − ((b − 1)/b) log(b − 1).
(g) (1/(n + 1)) 2^{n h_b(r/n)} ≤ (n over r) ≤ 2^{n h_b(r/n)} [5, p 284–286].
(h) lim_{n→∞} (1/n) log (n over ⌊αn⌋) = h_b(α) [6, Q11.21 p 406]. See also (2).
(i) Quadratic approximation: h_b(p) ≈ 4p(1 − p).

2.9. 0 ≤ H(Y|X) ≤ H(Y), where the first inequality becomes an equality when Y = g(X), and the second when X and Y are independent.

2.10. The discussion above for entropy and conditional entropy remains valid if we replace the random variables X and Y with random vectors. For example, the joint entropy for random variables X and Y is defined as

  H(X, Y) = −E[log p(X, Y)] = −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x, y).

More generally, for a random vector X_1^n,

  H(X_1^n) = −E[log p(X_1^n)] = ∑_{i=1}^n H(X_i | X_1^{i−1}) ≤ ∑_{i=1}^n H(X_i),

with equality iff the X_i's are independent.
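The sandwich 0 ≤ H(Y|X) ≤ H(Y) from 2.9 can be confirmed on a concrete joint pmf, computing H(Y|X) as H(X, Y) − H(X). A minimal sketch (the joint table is an illustrative example, not from the notes):

```python
import math

def H(probs):
    # entropy in bits of any iterable of probabilities
    return -sum(v * math.log2(v) for v in probs if v > 0)

# joint pmf p(x, y) as a nested list, rows indexed by x
pxy = [[0.30, 0.10],
       [0.05, 0.25],
       [0.10, 0.20]]
py = [sum(row[y] for row in pxy) for y in range(2)]      # marginal of Y
H_joint = H(v for row in pxy for v in row)               # H(X, Y)
H_Y_given_X = H_joint - H(sum(row) for row in pxy)       # H(X,Y) - H(X)
print(0 <= H_Y_given_X <= H(py))  # True
```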
2.11. Chain rule:

  H(X_1, X_2, . . . , X_n) = ∑_{i=1}^n H(X_i | X_{i−1}, . . . , X_1),

or simply H(X_1^n) = ∑_{i=1}^n H(X_i | X_1^{i−1}). Note that the term in the sum when i = 1 is H(X_1). Moreover,

  H(X_1^n) = ∑_{i=1}^n H(X_i | X_{i+1}^n).

In particular, for two variables,

  H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).

The chain rule is still true with conditioning:

  H(X_1^n | Z) = ∑_{i=1}^n H(X_i | X_1^{i−1}, Z).

In particular,

  H(X, Y | Z) = H(X|Z) + H(Y|X, Z) = H(Y|Z) + H(X|Y, Z).

2.12. Conditioning only reduces entropy:
• H(Y|X) ≤ H(Y), with equality if and only if X and Y are independent; in that case H({p(x)p(y)}) = H({p(y)}) + H({p(x)}).
• H(X|Y) ≥ H(X|Y, Z), with equality if and only if, given Y, X and Z are independent, i.e., p(x, z|y) = p(x|y) p(z|y).

2.13. H(X|X) = 0.

2.14. H(X, Y) ≥ max{ H(X), H(Y|X), H(Y), H(X|Y) }.

2.15. H(X_1, X_2, . . . , X_n | Y) ≤ ∑_{i=1}^n H(X_i | Y), with equality if and only if X_1, X_2, . . . , X_n are independent conditioning on Y.

2.17. Suppose X is a random variable on X. Consider A ⊂ X. Let P_A = P[X ∈ A]. Then, H(X) ≤ h(P_A) + (1 − P_A) log(|X| − |A|) + P_A log|A|.

2.18. Fano's Inequality: Suppose U and V are random variables on a common alphabet set of cardinality M. Let E = 1_{U≠V} and P_e = P[E = 1] = P[U ≠ V]. Then,

  H(U|V) ≤ h(P_e) + P_e log(M − 1),

with equality if and only if

  P[U = u, V = v] = P[V = v] × { P_e/(M − 1), u ≠ v;  1 − P_e, u = v }.

Note that if P_e = 0, then H(U|V) = 0.

2.19. Extended Fano's Inequality: Let U_1^L, V_1^L ∈ U^L = V^L, where |U| = |V| = M. Define P_{e,ℓ} = P[V_ℓ ≠ U_ℓ] and P̄_e = (1/L) ∑_{ℓ=1}^L P_{e,ℓ}. Then,

  (1/L) H(U_1^L | V_1^L) ≤ h(P̄_e) + P̄_e log(M − 1).

2.20 (Han's Inequality).

  H(X_1^n) ≤ (1/(n − 1)) ∑_{i=1}^n H(X_{[n]\{i}}).

2.21. Independent addition (in an additive group G) increases entropy in a sublinear way: for two independent random variables X and Z,

  H(X) ≤ H(X ⊕ Z) ≤ H(X) + H(Z)

[4]. To see this, note that conditioning only reduces entropy; therefore,

  H(X) = H(X ⊕ Z | Z) ≤ H(X ⊕ Z).

The last item X ⊕ Z is simply a function of (X, Z), and we know that H(g(X, Z)) ≤ H(X, Z) ≤ H(X) + H(Z).
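Fano's inequality (2.18) can be checked on a small joint pmf. The sketch below (names and the 3-symbol joint table are illustrative) computes H(U|V) directly and compares it to the Fano bound:

```python
import math

def H_cond(joint):
    # H(U|V) in bits for a square joint pmf joint[u][v]
    M = len(joint)
    pv = [sum(joint[u][v] for u in range(M)) for v in range(M)]
    return -sum(joint[u][v] * math.log2(joint[u][v] / pv[v])
                for u in range(M) for v in range(M) if joint[u][v] > 0)

def hb(x):
    return 0.0 if x in (0, 1) else -x*math.log2(x) - (1-x)*math.log2(1-x)

# 3-symbol example: joint pmf P[U = u, V = v]
joint = [[0.25, 0.05, 0.02],
         [0.03, 0.30, 0.02],
         [0.02, 0.06, 0.25]]
Pe = 1 - sum(joint[i][i] for i in range(3))     # P[U != V] = 0.2
fano = hb(Pe) + Pe * math.log2(3 - 1)
print(H_cond(joint) <= fano)  # True
```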
[Figure 6: Information diagram for X, Z, and X + Z. The diagram encodes H(X | X+Z, Z) = 0, H(Z | X+Z, X) = 0, and H(X+Z | X, Z) = 0; the atom labeled b satisfies b = −a if X and Z are independent.]
2.2 Stochastic Processes and Entropy Rate

In general, the uncertainty about the values of X(t) on the entire t axis, or even on a finite interval, no matter how small, is infinite. However, if X(t) can be expressed in terms of its values on a countable set of points, as in the case of bandlimited processes, then a rate of uncertainty can be introduced. It suffices, therefore, to consider only discrete-time processes. [21]

For a Markov source, HU = ∑_i u_i H(U_2 | U_1 = i), where the H(U_2 | U_1 = i)'s are computed, for each i, from the transition probabilities P_{ij} out of state i. When there is more than one communicating class, HU = ∑_i P[class i] H(U_2 | U_1, class i).

The statement of convergence of the entropy at time n of a random process, divided by n, to a constant limit called the entropy rate of the process is known as the ergodic theorem of information theory or the asymptotic equipartition property (AEP). Its original version, proven in the 50's for ergodic stationary processes with a finite state space, is known as the Shannon-McMillan theorem for the convergence in mean and as the Shannon-McMillan-Breiman theorem for the a.s. convergence.

2.26. Consider a discrete stationary source (DSS) {U(k)}. Let U be the common alphabet set. By stationarity, H(U_1^n) = H(U_k^{k+n−1}). Define

• Per-letter entropy of an L-block:
  H_L = H(U_1^L)/L = H(U_k^{k+L−1})/L;
• Incremental entropy change:
  h_L = H(U_L | U_1^{L−1}) = H(U_1^L) − H(U_1^{L−1}).

2.27. Weak AEP: For U_k i.i.d. ∼ p_U(u),

  −(1/L) log p_{U_1^L}(U_1^L) →(in probability) H(U).
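The weak AEP in 2.27 is easy to see empirically for an i.i.d. binary source: the per-letter log-likelihood of a long sample concentrates around H(U). A minimal Monte-Carlo sketch (parameters and seed are our choices):

```python
import math
import random

random.seed(0)
p1 = 0.3                       # P[U = 1] for an i.i.d. binary source
H = -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

L = 10000
u = [1 if random.random() < p1 else 0 for _ in range(L)]
# empirical per-letter value  -(1/L) log2 p(U_1^L)
aep = -sum(math.log2(p1 if x == 1 else 1 - p1) for x in u) / L
print(abs(aep - H))            # small: concentrates near 0 as L grows
```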
◦ By convexity of D, this also shows that H(X) is concave ∩ w.r.t. p.
◦ Maximizing the entropy subject to any given set of constraints is identical to minimizing the relative entropy from the uniform distribution subject to the same constraints [13, p 160].
◦ If ∃x such that p(x) > 0 but q(x) = 0 (that is, the support of p is not a subset of the support of q), then D(p‖q) = ∞.

3.2. Although relative entropy is not a proper metric, it is a natural measure of "dissimilarity" in the context of statistics [5, Ch. 12]. Some researchers regard it as an information-theoretic distance measure.

3.3. Divergence Inequality: D(p‖q) ≥ 0, with equality if and only if p = q. This inequality is also known as the Gibbs Inequality. Note that this just means that if we have two vectors u, v of the same length, each with nonnegative elements that sum to 1, then ∑_i u_i log(u_i/v_i) ≥ 0.

3.4. D(p‖q) is convex ∪ in the pair (p, q). That is, if (p_1, q_1) and (p_2, q_2) are two pairs of probability mass functions, then

  D( λp_1 + (1 − λ)p_2 ‖ λq_1 + (1 − λ)q_2 ) ≤ λ D(p_1‖q_1) + (1 − λ) D(p_2‖q_2)  ∀ 0 ≤ λ ≤ 1.

This follows directly from the convexity ∪ of x ln(x/y). In particular, for fixed p, D(p‖q) is a convex ∪ functional of q:

  D( p ‖ λq_1 + (1 − λ)q_2 ) ≤ λ D(p‖q_1) + (1 − λ) D(p‖q_2).

3.8. D( p(x|z) ‖ q(x|z) ) ≥ 0. In fact, it is E[ log( p(X|Z)/q(X|Z) ) ] = ∑_z p(z) E[ log( p(X|z)/q(X|z) ) | z ], where E[ log( p(X|z)/q(X|z) ) | z ] ≥ 0 ∀z.

3.9. Chain rule for relative entropy:

  D( p(x, y) ‖ q(x, y) ) = D( p(x) ‖ q(x) ) + D( p(y|x) ‖ q(y|x) ).

3.10. Let p and q be two probability distributions on a common alphabet X. The variational distance between p and q is defined by

  d_TV(p, q) = (1/2) ∑_{x∈X} |p(x) − q(x)|.

It is the metric induced by the ℓ1 norm. See also 10.5.

3.11. Pinsker's Inequality [6, lemma 11.6.1 p 370], [5, lemma 12.6.1 p 300]:

  D(p‖q) ≥ 2 (log_K e) d_TV²(p, q),

where K is the base of the log used to define D. In particular, if we use ln when defining D(p‖q), then

  D(p‖q) ≥ 2 d_TV²(p, q).

Pinsker's Inequality shows that convergence in D is stronger than convergence in L1. See also 10.6.

3.12. Suppose X_1, X_2, . . . , X_n are independent and Y_1, Y_2, . . . , Y_n are independent. Then,

  D( p_{X_1^n} ‖ p_{Y_1^n} ) = ∑_{i=1}^n D( p_{X_i} ‖ p_{Y_i} ).
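Pinsker's inequality (3.11) with base-2 logs, D(p‖q) ≥ 2(log₂ e) d_TV², can be spot-checked directly (a minimal sketch; p and q are illustrative):

```python
import math

def D(p, q):
    # relative entropy D(p||q) in bits
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def d_tv(p, q):
    # variational (total-variation) distance
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
pinsker = 2 * math.log2(math.e) * d_tv(p, q) ** 2
print(D(p, q) >= pinsker)  # True
```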
Note that the final equation follows easily from

  ln( p(x_1^n) / ∏_{i=1}^n q_{X̃_i}(x_i) ) = ∑_{i=1}^n ln( p_{X_i}(x_i) / q_{X̃_i}(x_i) ) + ln( p(x_1^n) / ∏_{i=1}^n p_{X_i}(x_i) ).

Note that the first term quantifies how far the marginals are from the q_{X̃_i}; the second term quantifies the degree of dependence. (See also (7.3).) Also see (2.24).

3.14 (Data Processing Inequality for relative entropy). Let X_1 and X_2 be (possibly dependent) random variables on X. Q(y|x) is a channel; Y_i is the output of the channel when the input is X_i. Then,

  D(Y_1 ‖ Y_2) ≤ D(X_1 ‖ X_2).

In particular,

  D( g(X_1) ‖ g(X_2) ) ≤ D(X_1 ‖ X_2).

The inequality follows from applying the log-sum inequality to p_{Y_1}(y) ln( p_{Y_1}(y)/p_{Y_2}(y) ), where p_{Y_i}(y) = ∑_x Q(y|x) p_{X_i}(x).

[Figure 8: Data Processing Inequality for relative entropy: X_1 → Q(y|x) → Y_1 and X_2 → Q(y|x) → Y_2.]

3.15. Poisson Approximation for sums of binary random variables [15, 12]:

Given a random variable X with corresponding pmf p whose support is inside N ∪ {0}, the relative entropy D( p ‖ P(λ) ) is minimized over λ at λ = EX [12, Lemma 7.2 p. 131].

◦ In fact,

  D( p ‖ P(λ) ) = λ − (EX) ln λ + ∑_{x=0}^∞ p(x) ln( p(x) x! ).

3.16 (Han's inequality for relative entropies). Suppose Y_1, Y_2, . . . are independent. Then,

  D( X_1^n ‖ Y_1^n ) ≥ (1/(n − 1)) ∑_{i=1}^n D( X_{[n]\{i}} ‖ Y_{[n]\{i}} ),

or equivalently,

  D( X_1^n ‖ Y_1^n ) ≤ ∑_{i=1}^n [ D( X_1^n ‖ Y_1^n ) − D( X_{[n]\{i}} ‖ Y_{[n]\{i}} ) ].

3.17. Minimum cross-entropy probability distribution (MCEPD) [13, p 158, 160][6, Q12.2, p 421]:
Consider fixed (1) a finite S ⊂ R, (2) functions g_1, . . . , g_m on S, and (3) a pmf q on S. Let C be the class of probability mass functions p_X which are supported on S (p_X = 0 on S^c) and satisfy the moment constraints E[g_k(X)] = µ_k, for 1 ≤ k ≤ m. Let pmf p* be the probability distribution that minimizes the relative entropy D(·‖q) over C. Define the pmf p̂ on S by p̂ = q e^{λ_0} e^{∑_{k=1}^m λ_k g_k}, where λ_0, λ_1, . . . , λ_m are chosen so that p̂ ∈ C. Note that p* and p̂ may not exist.

Suppose p̂ exists.

(a) Fix r ∈ C. Then ∑_{x∈S} r(x) log( p̂(x)/q(x) ) = D(p̂‖q); that is, for any Y ∼ r ∈ C,

  E[ log( p̂(Y)/q(Y) ) ] = D(p̂‖q).

Consequently, D(r‖q) − D(p̂‖q) = D(r‖p̂) ≥ 0.

(b) p* exists and p* = p̂.

(c) D(p*‖q) = D(p̂‖q) = ( λ_0 + ∑_{k=1}^m λ_k µ_k ) log e.

Example 3.18. Suppose pmf q = P(b) and C is the class of pmfs with mean λ. Then, the MCEPD is P(λ) [13, p 176–177].
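The claim in 3.15 that D(p ‖ P(λ)) is minimized at λ = EX can be verified over a grid, e.g. with p a Binomial(10, 0.3) pmf (so EX = 3). A minimal sketch (names and example are ours):

```python
import math

def D_vs_poisson(p, lam):
    # D(p || P(lam)) in nats, for p supported on {0, 1, ..., len(p)-1}
    # log Poisson pmf at k: -lam + k*ln(lam) - ln(k!)
    return sum(pi * (math.log(pi) - (-lam + k * math.log(lam)
                                     - math.lgamma(k + 1)))
               for k, pi in enumerate(p) if pi > 0)

n, succ = 10, 0.3
p = [math.comb(n, k) * succ**k * (1 - succ)**(n - k) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(p))            # EX = 3.0
grid = [l / 10 for l in range(20, 41)]                  # 2.0, 2.1, ..., 4.0
best = min(grid, key=lambda l: D_vs_poisson(p, l))
print(best)  # 3.0
```

Since D(p ‖ P(λ)) = λ − (EX) ln λ + const, the minimizer over λ > 0 is exactly EX, so the grid search lands on 3.0.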
3.20. Let p(x)Q(y|x) be a given joint distribution with corresponding distributions q(y), P(x, y), and P(x|y).

(a) argmin_{r(y)} D( p(x)Q(y|x) ‖ p(x)r(y) ) = q(y).

(b) argmax_{r(x|y)} ∑_{x,y} p(x)Q(y|x) log( r(x|y)/p(x) ) = P(x|y).

[6, Lemma 10.8.1 p 333]

3.21. Related quantities.

4 Mutual Information: I

4.1. The mutual information I(X; Y) between two random variables X and Y is defined as

  I(X; Y) = E[ log( p(X, Y) / (p(X) q(Y)) ) ]                       (7)
          = E[ log( P(X|Y) / p(X) ) ]                               (8)
          = E[ log( Q(Y|X) / q(Y) ) ]                               (9)
          = ∑_{x∈X} ∑_{y∈Y} p(x, y) log( p(x, y) / (p(x) q(y)) )    (10)
          = D( p(x, y) ‖ p(x) q(y) )                                (11)
          = H(X) + H(Y) − H(X, Y)                                   (12)
          = H(Y) − H(Y|X)                                           (13)
          = H(X) − H(X|Y),                                          (14)

where p(x, y) = P[X = x, Y = y], p(x) = P[X = x], q(y) = P[Y = y], P(x|y) = P[X = x|Y = y], and Q(y|x) = P[Y = y|X = x].

• I(X; Y) = I(Y; X).
• The mutual information quantifies the reduction in the uncertainty of one random variable due to the knowledge of the other. It can be regarded as the information contained in one random variable about the other.
• The name mutual information and the notation I(X; Y) were introduced by [Fano 1961 Ch 2].
• Mutual information is a measure of the amount of information one random variable contains about another [6, p 13]. See (13) and (14).
• By (11), mutual information is the (Kullback–Leibler) divergence between the joint and product-of-marginal distributions. Hence, it is natural to think of I(X; Y) as a measure of how far X and Y are from being independent.

• X_1, X_2, . . . , X_n are independent iff H(X_1^n) = ∑_{i=1}^n H(X_i) iff I(X_i; X_1^{i−1}) = 0 ∀i.
Proof. 0 = H(X_1^n) − ∑_{i=1}^n H(X_i) = ∑_{i=1}^n ( H(X_i | X_1^{i−1}) − H(X_i) ) = −∑_{i=1}^n I(X_i; X_1^{i−1}). This happens iff I(X_i; X_1^{i−1}) = 0 ∀i.
Alternative proof. This is obvious from the fact that X_1, X_2, . . . , X_n are independent iff ∀i, X_i and X_1^{i−1} are independent.

• For nonempty disjoint index sets A and B, I(X_A; X_B) = 0 iff I(X_Ã; X_B̃) = 0 for all nonempty Ã ⊂ A and B̃ ⊂ B.
Proof. "⇐" is obvious because A ⊂ A and B ⊂ B. "⇒" We can write I(X_A; X_B) = I(X_A; X_B̃, X_{B\B̃}); the result (∗) that removing random variables cannot increase mutual information then gives I(X_A; X_B̃) = 0. Now write I(X_A; X_B̃) = I(X_Ã, X_{A\Ã}; X_B̃); then (∗) again gives I(X_Ã; X_B̃) = 0.

4.4. I(X; Y) ≤ min{ H(X), H(Y) }.

Example 4.5. Consider again the thinned Poisson example in (2.8). The mutual information is

  I(X; Y) = H(P(λ)) − H(P((1 − s)λ)).

Example 4.6. Binary Channel. Let X = Y = {0, 1};

  p(1) = P[X = 1] = p = 1 − P[X = 0] = 1 − p(0);

  T = [ P[Y = j | X = i] ] = [ 1−a  a ; b  1−b ] = [ ā  a ; b  b̄ ].

Write p̄ = 1 − p and q̄ = 1 − q. The distribution (row) vectors of X and Y are p = (p̄, p) and q = (q̄, q).

[Figure 9: Binary Channel: 0 → 0 w.p. 1 − a, 0 → 1 w.p. a, 1 → 0 w.p. b, 1 → 1 w.p. 1 − b.]

Then,
  P = [ P[X = i, Y = j] ] = [ p̄ā  p̄a ; pb  pb̄ ],

  q = pT = ( p̄ā + pb,  p̄a + pb̄ ),

  T̃ = [ P[X = j | Y = i] ] = [ p̄ā/(p̄ā+pb)  pb/(p̄ā+pb) ; p̄a/(p̄a+pb̄)  pb̄/(p̄a+pb̄) ].

• H(X) = h(p).
• H(Y) = h(p̄ā + pb).
• H(Y|X) = p̄ h(a) + p h(b) = p̄ h(ā) + p h(b̄).
• I(X; Y) = h(p̄a + pb̄) − ( p̄ h(a) + p h(b̄) ).

Recall that h is concave ∩. For the binary symmetric channel (BSC), we set a = b = α.

4.7. The conditional mutual information is defined as

  I(X; Y|Z) = H(X|Z) − H(X|Y, Z)
            = E[ log( p(X, Y|Z) / (P(X|Z) p(Y|Z)) ) ]
            = ∑_z p(z) I(X; Y|Z = z),

where I(X; Y|Z = z) = E[ log( p(X, Y|z) / (P(X|z) p(Y|z)) ) | z ].

• I(X; Y|z) ≥ 0, with equality if and only if X and Y are independent given Z = z.
• I(X; Y|Z) ≥ 0, with equality if and only if X and Y are conditionally independent given Z; that is, X − Z − Y form a Markov chain.

4.8. Chain rule for information:

  I(X_1, X_2, . . . , X_n; Y) = ∑_{i=1}^n I(X_i; Y | X_{i−1}, X_{i−2}, . . . , X_1),

or simply I(X_1^n; Y) = ∑_{i=1}^n I(X_i; Y | X_1^{i−1}). In particular, I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y|X_1). Similarly,

  I(X_1^n; Y|Z) = ∑_{i=1}^n I(X_i; Y | X_1^{i−1}, Z).

4.9. Mutual information (conditioned or not) between sets of random variables cannot be increased by removing random variable(s) from either set:

  I(X_1, X_2; Y | Z) ≥ I(X_1; Y | Z).

See also (5.6). In particular,

  I(X_1, X_2; Y_1, Y_2) ≥ I(X_1; Y_1).

4.10. Conditional vs. unconditional mutual information:

• If X, Y, and Z form a Markov chain (any order is OK), then I(X; Y; Z) ≥ 0 and conditioning only reduces mutual information: I(X; Y|Z) ≤ I(X; Y), I(X; Z|Y) ≤ I(X; Z), and I(Y; Z|X) ≤ I(Y; Z).
  Furthermore, if, for example, X and Z are not independent, then I(X; Z) > 0 and I(X; Y) > I(X; Y|Z). In particular, let X have nonzero entropy, and let X = Y = Z; then I(X; Y) = H(X) > 0 = I(X; Y|Z).

• If any two of the r.v.'s among X, Y, and Z are independent, then I(X; Y; Z) ≤ 0 and conditioning only increases mutual information: I(X; Y) ≤ I(X; Y|Z), I(X; Z) ≤ I(X; Z|Y), and I(Z; Y) ≤ I(Z; Y|X).

Each case above has one inequality which is easy to see. If X − Y − Z forms a Markov chain, then I(X; Z|Y) = 0. We know that I(X; Z) ≥ 0. So, I(X; Z|Y) ≤ I(X; Z). On the other hand, if X and Z are independent, then I(X; Z) = 0. We know that I(X; Z|Y) ≥ 0. So, I(X; Z|Y) ≥ I(X; Z).

4.11 (Additive Triangle Inequality). Let X, Y, Z be three real- or discrete-valued mutually independent random variables, and let the "+" sign denote real or modulo addition. Then

  I(X; X + Z) ≤ I(X; X + Y) + I(Y; Y + Z)   (15)

[27]. This is similar to a triangle inequality: if we define d(X, Y) = I(X; X + Y), then (15) says

  d(X, Z) ≤ d(X, Y) + d(Y, Z).

4.12. Given processes X = (X_1, X_2, . . .) and Y = (Y_1, Y_2, . . .), the information rate between the processes X and Y is given by

  I(X; Y) = lim_{n→∞} (1/n) I(X_1^n; Y_1^n).

4.13 (Generalization of mutual information). There isn't really a notion of mutual information common to three random variables [6, Q2.25 p 49].

(a) Venn diagrams [1]:

  I(X_1; X_2; · · · ; X_n) = µ*(X_1 ∩ X_2 ∩ · · · ∩ X_n) = ∑_{S⊂[n]} (−1)^{|S|+1} H(X_S).

See also section 12 on I-measure.

(b) D( P_{X_1^n} ‖ ∏_{i=1}^n P_{X_i} ).

4.14. The lautum information [20] is the divergence between the product-of-marginal and joint distributions, i.e., swapping the arguments in the definition of mutual information.
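For the binary channel in Example 4.6, the closed form I(X; Y) = h(p̄a + pb̄) − (p̄h(a) + ph(b)) can be checked against the defining sum (10). A minimal sketch (parameter values are illustrative):

```python
import math

def hb(x):
    # binary entropy function (bits)
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

p, a, b = 0.3, 0.1, 0.2          # P[X=1] = p; P[Y=1|X=0] = a; P[Y=0|X=1] = b
q1 = (1 - p) * a + p * (1 - b)   # P[Y = 1]
I_formula = hb(q1) - ((1 - p) * hb(a) + p * hb(b))

# direct computation from the joint pmf P[X = x, Y = y], per equation (10)
joint = {(0, 0): (1 - p) * (1 - a), (0, 1): (1 - p) * a,
         (1, 0): p * b,             (1, 1): p * (1 - b)}
px = [1 - p, p]
py = [1 - q1, q1]
I_direct = sum(v * math.log2(v / (px[x] * py[y]))
               for (x, y), v in joint.items())
print(math.isclose(I_formula, I_direct))  # True
```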
(a) L(X; Y) = D( p_X p_Y ‖ p_{X,Y} ).

(b) Lautum ("elegant" in Latin) is the reverse spelling of mutual.

5 Functions of random variables

There are several occasions where we have to deal with functions of random variables. In fact, for those who know the I-measure, the diagram in Figure 10 already summarizes almost all identities of our interest.

[Figure 10: Information diagram for X and g(X): the region for g(X) lies entirely inside the region for X; in particular H(g(X)|X, Y) = 0 and I(g(X); Y|X) = 0.]

5.1. I(X; g(X)) = H(g(X)).

5.2. When X is given, we can simply disregard g(X):

• H(g(X)|X) = 0. In fact, ∀x ∈ X, H(g(X)|X = x) = 0. That is, given X, g(X) is completely determined and hence has no uncertainty.
• H(Y|X, g(X)) = H(Y|X).
• I(g(X); Y|X) = I(g(X); Y|X, Z) = 0.

Proof. 1) H(g(X)|X) = 0, H(g(X)|X) ≥ H(g(X)|X, Y), and H(·) ≥ 0 together imply H(g(X)|X, Y) = 0. 2) I(g(X); Y|X) = H(g(X)|X) − H(g(X)|X, Y) = 0 − 0 = 0. Alternatively, one can use I(Z; Y) ≤ H(Z); hence 0 ≤ I(Y; g(X)|X) ≤ H(g(X)|X) = 0. Note that one can also prove H(g(X)|X, Y) = 0 and I(g(X); Y|X) = 0 together by arguing that H(g(X)|X) = H(g(X)|X, Y) + I(g(X); Y|X) = 0; because both summands are nonnegative, they both have to be 0.

5.3. g(X) has less uncertainty than X: H(g(X)) ≤ H(X), with equality if and only if g is one-to-one (injective). That is, a deterministic function can only reduce entropy. Similarly, H(X|Y) ≥ H(g(X)|Y).

Proof. H(X, g(X)) = H(X) + H(g(X)|X) = H(X). Directly:

  H(f(X)) = −∑_y p_{f(X)}(y) log p_{f(X)}(y)
          = −∑_y ( ∑_{x: f(x)=y} p_X(x) ) log p_{f(X)}(y)
          ≤ −∑_y ∑_{x: f(x)=y} p_X(x) log p_X(x)
          = −∑_x p_X(x) log p_X(x) = H(X),

where the inequality holds because p_{f(X)}(y) ≥ p_X(x) for every x with f(x) = y.

5.4. We can "attach" g(X) to X. Suppose f, g, v are deterministic functions on appropriate domains.

• H(X, g(X)) = H(X).
• H(Y|X, g(X)) = H(Y|X) and H(X, g(X)|Y) = H(X|Y). An expanded version is H(X, g(X)|Y, f(Y)) = H(X|Y, f(Y)) = H(X, g(X)|Y) = H(X|Y).
• H(X, g(X), Y) = H(X, Y). An expanded version is H(X, Y, v(X, Y)) = H(X, g(X), Y, f(Y)) = H(X, g(X), Y) = H(X, Y, f(Y)) = H(X, Y).
• I(X, g(X); Y) = I(X; Y). An expanded version is I(X, g(X); Y, f(Y)) = I(X, g(X); Y) = I(X; Y, f(Y)) = I(X; Y).

5.5. I(X, f(X, Z); Y, g(Y, Z) | Z) = I(X; Y|Z).

5.6. Compared to X, g(X) gives us less information about Y:

• H(Y|X, g(X)) = H(Y|X) ≤ H(Y|g(X)).
• I(X; Y) ≥ I(g(X); Y) ≥ I(g(X); f(Y)). Note that this agrees with the data-processing theorem (6.1 and 6.4) using the chain g(X) − X − Y − f(Y).

5.7. I(X; g(X); Y) = I(g(X); Y) ≥ 0.

5.8. If X = g_1(W) and Ŵ = g_2(Y), then H(X|Y) ≤ H(W|Y) ≤ H(W|Ŵ). The statement is also true when W − X − Y − Ŵ forms a Markov chain and X = g_1(W).

Remark: If W − X − Y − Ŵ forms a Markov chain, we have H(W|Y) ≤ H(W|Ŵ), but it is not true in general that H(X|Y) ≤ H(W|Y).

• Consider X → g(·) → Q(·|·) → Z, where g is deterministic and Q(·|·) is a probabilistic channel whose input is Y = g(X). Then,

  ◦ H(Z|X) = H(Z|g(X)) = H(Z|X, g(X)).
  ◦ I(Z; X) = I(Z; g(X)) = I(X; g(X); Z) ≥ 0.

Proof sketch. For any map Φ applied to g(X), P[Φ(g(X)) = z | X = x] = P[Φ(g(X)) = z | g(X) = g(x)]: the conditional distribution of Z given X = x depends on x only through g(x). Hence H(Φ(g(X))|X) = H(Φ(g(X))|g(X)).

[Figure 11: Information diagram for the Markov chain whose first transition is a deterministic function.]

5.10. If ∀y, g(x, y) is invertible as a function of x, then H(X|Y) = H(g(X, Y)|Y). In fact, ∀y, H(X|y) = H(g(X, Y)|y). For example, g(x, y) = x − y or x + y. So, H(X|Y) = H(X + Y|Y) = H(X − Y|Y).

5.11. If T(Y), a deterministic function of Y, is a sufficient statistic for X, then I(X; Y) = I(X; T(Y)).
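Items 5.1 and 5.3 can be confirmed numerically: for a non-injective g, H(g(X)) < H(X), and I(X; g(X)) equals H(g(X)) (the conditional distribution of g(X) given X = x is a point mass). A minimal sketch (pmf and g are illustrative):

```python
import math
from collections import defaultdict

def H(probs):
    return -sum(v * math.log2(v) for v in probs if v > 0)

pX = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}
g = lambda x: x % 2                      # a non-injective function

pg = defaultdict(float)                  # pmf of g(X)
for x, px in pX.items():
    pg[g(x)] += px

print(H(pg.values()) <= H(pX.values()))  # True (5.3)

# I(X; g(X)) computed from the joint: p(x, y) = pX(x) when y = g(x)
I = sum(px * math.log2(1 / pg[g(x)]) for x, px in pX.items())
print(math.isclose(I, H(pg.values())))   # True (5.1)
```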
5.12. Data Processing Inequality for relative entropy: Let X_i be a random variable on X, and Y_i be a random variable on Y. Q(y|x) is a channel whose input and output are X_i and Y_i, respectively. Then,

  D(Y_1 ‖ Y_2) ≤ D(X_1 ‖ X_2).

[Figure 12: Data processing inequality.]

6 Markov Chain and Markov Strings

If X − Y − Z forms a Markov chain, then I(X; Y) ≥ I(X; Y|Z); that is, the dependence of X and Y is decreased (or remains unchanged) by the observation of a "downstream" random variable Z. In fact, we also have I(X; Z|Y) ≤ I(X; Z) and I(Y; Z|X) ≤ I(Y; Z) (see 4.10). That is, in this case conditioning only reduces mutual information.

6.2. Markov-type relations

(a) Disappearing term: The following statements are equivalent:
  • I(X; Y, Z) = I(X; Y);
  • I(X; Z|Y) = 0;
  • X − Y − Z forms a Markov chain.

(b) Disappearing term: The following statements are equivalent:
  • I(X; Y, Z|W) = I(X; Y|W);
  • I(X; Z|Y, W) = 0;
  • X − (Y, W) − Z forms a Markov chain.

(c) Term moved to condition: The following statements are equivalent:

6.6. Consider two (possibly non-homogeneous) Markov chains with the same transition probabilities. Let p_1(x_n) and p_2(x_n) be two p.m.f.s on the state space X_n of the Markov chain at time n. (They come, possibly, from different initial distributions.) Then, the relative entropy D( p_1(x_n) ‖ p_2(x_n) ) decreases with n:

  D( p_1(x_{n+1}) ‖ p_2(x_{n+1}) ) ≤ D( p_1(x_n) ‖ p_2(x_n) ).

[Figure: the two chains {p_1(x_0)} → · · · → {p_1(x_{n−1})} → {p_1(x_n)} and {p_2(x_0)} → · · · → {p_2(x_{n−1})} → {p_2(x_n)}, driven by the same transition probabilities P_n(i, j).]
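The monotone decrease of divergence in 6.6 is easy to observe by pushing two initial distributions through the same transition matrix. A minimal sketch (the matrix and initial distributions are illustrative):

```python
import math

def D(p, q):
    # relative entropy in bits
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def step(p, T):
    # one Markov transition: (pT)_j = sum_i p_i T[i][j]
    return [sum(p[i] * T[i][j] for i in range(len(p)))
            for j in range(len(T[0]))]

T = [[0.9, 0.1],
     [0.3, 0.7]]                       # common transition matrix
p1, p2 = [1.0, 0.0], [0.2, 0.8]        # two initial distributions

divs = []
for _ in range(10):
    divs.append(D(p1, p2))
    p1, p2 = step(p1, T), step(p2, T)
print(all(divs[i + 1] <= divs[i] + 1e-12 for i in range(9)))  # True
```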
D(p_1(x_I) ‖ p_2(x_I)) = D(p_1(x_{min I}) ‖ p_2(x_{min I})),

where I is some index set and p_i(x_I) is the distribution for the random vector X_I = (X_k : k ∈ I) of chain i.

For a homogeneous Markov chain, if the initial distribution of one of the two chains is a stationary distribution π, then D({p(x_n)} ‖ π) is monotone decreasing in n.

Suppose the stationary distribution is uniform. Then, H(X_n) = H({p(x_n)}) = log|X| − D({p(x_n)} ‖ u) is monotone increasing.

6.1 Homogeneous Markov Chain

For this subsection, we consider homogeneous Markov chains.

6.7. H(X_0|X_n) is non-decreasing with n; that is, H(X_0|X_{n+1}) ≥ H(X_0|X_n) ∀n.

6.8. For a stationary Markov process (X_n),

• H(X_n) = H(X_1), i.e., H(X_n) is a constant ∀n.
• H(X_n|X_1^{n−1}) = H(X_n|X_{n−1}) = H(X_1|X_0).
• H(X_n|X_1) increases with n. That is, H(X_n|X_1) ≥ H(X_{n−1}|X_1).

7 Independence

7.1. I(Z; X, Y) = 0 if and only if Z and (X, Y) are independent. In which case,

• I(Z; X, Y) = I(Z; X) = I(Z; Y) = I(Z; Y|X) = I(Z; X|Y) = 0,
• I(X; Y|Z) = I(X; Y),
• I(X; Y; Z) = 0.

7.2. Suppose we have nonempty disjoint index sets A, B. Then, I(X_A; X_B) = 0 if and only if ∀ nonempty Ã ⊂ A and ∀ nonempty B̃ ⊂ B we have I(X_Ã; X_B̃) = 0.

7.3. D(P_{X_1^n} ‖ ∏_{i=1}^n P_{X_i}) = ∑_{i=1}^n H(X_i) − H(X_1^n) = ∑_{i=1}^{n−1} I(X_1^i; X_{i+1}) = ∑_{i=1}^{n−1} I(X_i; X_{i+1}^n). Notice that this function is symmetric w.r.t. its n arguments. It admits a natural interpretation as a measure of how far the X_i are from being independent [15]. Moreover,

D(P_{X_1^n} ‖ ∏_{i=1}^n P_{X_i}) ≥ ∑_{i=1}^n I(X_i; Y) − I(X_1^n; Y)

and

D(P_{X_1^n} ‖ ∏_{i=1}^n P_{X_i}) ≤ ∑_{i=1}^n H(X_i) − max_i H(X_i).

7.5. The following statements are equivalent:

(a) X_1, X_2, ..., X_n are mutually independent conditioning on Y (a.s.).
(b) H(X_1, X_2, ..., X_n|Y) = ∑_{i=1}^n H(X_i|Y).
(c) p(x_1^n|y) = ∏_{i=1}^n p(x_i|y).
(d) ∀i ∈ [n] \ {1}, p(x_i|x_1^{i−1}, y) = p(x_i|y).
(e) ∀i ∈ [n], I(X_i; X_{[n]\{i}}|Y) = 0.
(f) ∀i ∈ [n], X_i and the vector (X_j)_{j ∈ [n]\{i}} are independent conditioning on Y.
(g) H(X_1, X_2, ..., X_n|Y) = ∑_{i=1}^n H(X_i|X_{[n]\{i}}, Y).
(h) D(P_{X_1^n} ‖ ∏_{i=1}^n P_{X_i}) = ∑_{i=1}^n I(X_i; Y) − I(X_1^n; Y). (See also (7.3).)

8 Convexity

8.1. H({p(x)}) is concave ∩ in {p(x)}.

8.2. H(Y|X) is a linear function of {p(x)} for fixed {Q(y|x)}.

8.3. H(Y) is a concave function of {p(x)} for fixed {Q(y|x)}.

8.4. I(X; Y) = H(Y) − H(Y|X) is a concave function of {p(x)} for fixed {Q(y|x)}.

8.5. I(X; Y) = D(p(x, y) ‖ p(x)q(y)) is a convex function of {Q(y|x)} for fixed {p(x)}.

8.6. D(p ‖ q) is convex ∪ in the pair (p, q). For fixed p, D(p ‖ q) is a convex ∪ functional of q. For fixed q, D(p ‖ q) is a convex ∪ functional of p.
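The identity in 7.3 can be verified numerically. A small sketch on an arbitrary 2×3 joint pmf (the pmf is my example, not from the text):

```python
import numpy as np

def H(p):
    """Entropy in bits; zero-probability entries are skipped."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Arbitrary joint pmf for (X1, X2)
P = np.array([[0.10, 0.20, 0.05],
              [0.25, 0.15, 0.25]])
p1 = P.sum(axis=1)               # marginal of X1
p2 = P.sum(axis=0)               # marginal of X2

# D(P_{X1,X2} || P_{X1} x P_{X2}) computed directly from the definition ...
prod = np.outer(p1, p2)
D = float(np.sum(P * np.log2(P / prod)))

# ... equals sum_i H(Xi) - H(X1, X2) (which, for n = 2, is I(X1; X2))
lhs = H(p1) + H(p2) - H(P.ravel())
assert abs(D - lhs) < 1e-12
```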
15
9 Continuous Random Variables (c) N m, σ 2 : h(X) = 12 log(2πeσ 2 ) bits. Note that h(X) <
1
0 if and only1 if σ < √2πe
• Gaussian: h ( X ) = log ( 2π eσ )
. Let Z = X1 + X2 where
2
N
The differential entropy h(X) or h (f ) of an absolutely 2
log ( 2π eσ 2 )
S 1
2
where S is the support set of the random variable. 2
Var [Y X
X1 and ] = 2a − are a . So, h (Y ) ≤ h ( N ( 0,normal
( EY ) ≤independent Var [Y ]) ) ≤ h (random
N ( 0, a ) ) . variables
2
9.1. Unlike discrete entropy, the differential entropy h(X) Gaussian, larger variance means
larger differential entropy.
⎛X⎞
• Suppose ⎜ ⎟ is a jointly Gaussian random vector with covariance matrix
Γ (q)
can be negative. For example, consider the uniform dis- h⎝ Y(X)
⎠ = (log e) q + (1 − q) ψ (q) + ln
⎛ det ( Λ ) det ( Λ ) ⎞
λ
tribution on [0, a]. h(X) can even be −∞ [12, lemma ⎛Λ Λ ⎞ 1
Λ=⎜ ⎟ . Then, I ( X ;Y ) = log ⎜⎜
X XY
⎟.
X Y
Proof. I ( X ;Y ) = h ( X ) + h (Y ) − h ( X , Y ) .
is not invariant under invertible coordinate transformation where 1
• Exponential family: Let f ( x ) =
θ ⋅T ( x )
e where c (θ ) = ∫ eθ ⋅T ( x ) dx is a
[3]. See also 9.5. c (θ )
X
Γ0 (z)
ψ (z) = d
ln Γ (z) =
normalizing constant. Then, is the digamma function;
dz Γ(z)
9.2. For any one-to-one differentiable g, we have
h(g(X)) = h(X) + E log |g 0 (X)|. h̃ (q) = (log e) q + (1 − q) ψ (q) + ln Γ(q)
√
q is a
strictly increasing function. h̃(1) = log e which
Interesting special cases are as followed:
agrees the exponential case. By CLT, lim h̃ (q) =
q→∞
Differential entropy is translation invariant: 1
2 log 2πe ≈ 2.0471 which agrees with the Gaussian
h(X + c) = h(X). case.
m
In fact,
P
λk g(x)
h(aX + b) = h(X) + log |a| . (e) Note that the entropy of any pdf of the from ce k=1
16
Differential entropy is not the limiting case of the discrete In particular,
entropy. Consider a random variable X with density f (x).
Suppose we divide the range of X into bins of length ∆. Let us h(AX + B) = h(X) + log |det A| .
assume that the density is continuous within the bins. Then
by the mean value theorem, ∃ a value xi within each bin such Note also that, for general g,
R (i+1)∆
that f (xi )∆ = i∆ f (x)dx. Define a quantized (discrete) h (Y) ≤ h (X) + E [log |det (dg (X))|] .
random variable X ∆ , which by X ∆ = xi , if i∆ ≤ X ≤ (i+1)∆.
9.6. Examples [9]:
(a) If the density f (x) of the random variable
X is Riemann
∆
integrable, then as ∆ → 0, H X + log ∆ → h (X). Let X1n have a multivariate normal distribution with
That is H(X ∆ ) ≈ h(X) − log ∆. mean µ and covariance matrix Λ. Then, h(X1n ) =
1 n
(b) When ∆ = n1 , we call X ∆ the n-bit quantization of X. 2 log ((2πe) |Λ|) bits, where |Λ| denotes the determinant
The entropy of an n-bit quantization is approximately of Λ. In particular, if Xi ’s are independent normal r.v.
the same variance σ 2 , then h (X1n ) = n2 log 2πeσ 2
h(X) + n. with
(c) h(X) + n is the number of bits on the average required ◦ For any random vector X = X1n , we have
to describe X to n bit accuracy h
T
i
E (X − µ) Λ−1X (X − µ)
(d) H(X ∆ |Y ∆ ) ≈ h(X|Y ) − log ∆. Z
T
= fX (x) (x − µ) Λ−1 X (x − µ) dx = n
[6, Section 8.3].
Another interesting relationship is that of Figure 4 where we
T
approximate the entropy of a Poisson r.v. by the differential Exponential family: Suppose fX (x) = c(θ) 1
eθ T(x)
entropy of a Gaussian r.v. with the same mean and variance. where the real-valued normalization constant c (θ) =
Note that in this case ∆ = 1 and hence log ∆ = 0. R θT T(x)
e dx, then
The differential entropy of a set X1 , . . . , Xn of random
1 T
variables with density f (X1n ) is defined as h (X) = ln c (θ) − θ (∇θ c (θ))
c (θ)
Z
h(Xn1 ) = − f (xn1 ) log f (xn1 )dxn1 . = ln c (θ) − θT (∇θ (ln c (θ)))
1 θ·T (x)
In 1-D case, we have fX (x) = c(θ) e and
If X, Y have a joint density function f (x, y), we can define
the conditional differential entropy h(X|Y ) as θ 0 d
h (X) = ln c (θ) − c (θ) = ln c (θ) − θ (ln c (θ))
Z c (θ) dθ
h(X|Y ) = − f (x, y) log f (x|y)dxdy
eθ·T (x) dx See also (9.22).
R
where c (θ) =
= −E log fX|Y (X |Y )
Let Y = (Y1 , . . . , Yk ) = eX1 , . . . , eXk , then h (Y) =
= h(X, Y ) − h(Y ) k
P
Z h (X) + (log e) EXi . Note that if X is jointly gaussian,
= f (x)H (Y |x ) dx, i=1
then Y is lognormal.
x
9.5. Let X and Y be two random vectors both in Rk such 9.8. Chain rule for differential entropy:
that Y = g(X) where g is a one-to-one differentiable transfor- n
X
mation. Then, h(X1n ) = h(Xi |X1i−1 ).
i=1
1
fY (y) = fX (x) Pn
|det (dg (x))| 9.9. h(X1n ) ≤ i=1 h(Xi ), with equality if and only if
X1 , X2 , . . . Xn are independent.
and hence Pn
9.10 (Hadamard’s inequality). |Λ| ≤ i=1 Λii with equality
h (Y) = h (X) + E [log |det (dg (X))|] . iff Λij = 0, i 6= j, i.e., with equality iff Λ is a diagonal matrix.
17
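The quantization approximation H(X^Δ) ≈ h(X) − log Δ from earlier in this section can be checked numerically; a sketch for a standard Gaussian, where the bin width Δ = 0.01 and the grid range are my arbitrary choices:

```python
import numpy as np

dx = 0.01
x = np.arange(-8, 8, dx)                      # bin left edges
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)    # standard normal density
p = f * dx                                    # P[X in bin] ~ f(x_i) * dx
p = p[p > 0]

H_quant = -np.sum(p * np.log2(p))             # discrete entropy of X^Delta
h_true = 0.5 * np.log2(2 * np.pi * np.e)      # h(N(0,1)) ~ 2.047 bits

# H(X^Delta) + log2(Delta) should be close to h(X)
assert abs(H_quant + np.log2(dx) - h_true) < 1e-3
```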
The relative entropy (Kullback–Leibler distance) D(f ‖ g) between two densities f and g (with respect to the Lebesgue measure m) is defined by

D(f ‖ g) = ∫ f log(f/g) dm.    (16)

• D(f ‖ g) is finite only if the support set of f is contained in the support set of g. (I.e., it is infinite if ∃x_0 with f(x_0) > 0 and g(x_0) = 0.)
• D(f ‖ g) is infinite if for some region R, ∫_R g(x) dx = 0 and ∫_R f(x) dx ≠ 0.
• For continuity, assume 0 log(0/0) = 0. Also, for a > 0, a log(a/0) = ∞ and 0 log(0/a) = 0.
• D(f_X ‖ f_Y) = E[log( f_X(X)/f_Y(X) )] = −h(X) − E[log f_Y(X)].
• Relative entropy usually does not satisfy the symmetry property. In some special cases, it can be symmetric, e.g. (23).

9.11. D(f ‖ g) ≥ 0 with equality if and only if f = g almost everywhere (a.e.).

9.12. Relative entropy is invariant under invertible coordinate transformations such as scale changes and rotations of the coordinate axes.

9.13. Relative entropy and the uniform random variable: Let U be uniformly distributed on a set S. For any X with the same support, −E log f_U(X) = h(U), and hence D(f_X ‖ f_U) = h(U) − h(X).

Hence, knowing how to find the differential entropy, we can find the mutual information from (19). Also, from (21), the mutual information between two random variables is the limit of the mutual information between their quantized versions. See also 9.4.

9.15. Mutual information is invariant under invertible coordinate transformations; that is, for random vectors X_1 and X_2 and invertible functions g_1 and g_2, we have

I(g_1(X_1); g_2(X_2)) = I(X_1; X_2).

See also 9.12.

9.16. I(X; Y) ≥ 0 with equality if and only if X and Y are independent.

9.17. Gaussian Random Variables and Vectors

(a) Gaussian Upper Bound for Differential Entropy: For any random vector X,

h(X) ≤ ½ log((2πe)^n det(Λ_X))

with equality iff X ∼ N(m, Λ_X) for some m. See also 9.22. Thus, among distributions with the same variance, the normal distribution maximizes the entropy. In particular,

h(X) ≤ ½ log(2πeσ_X²).
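The relative entropy between two Gaussians has a standard closed form; as a sketch, it can be cross-checked against direct numerical integration of (16). The means, variances, and integration grid below are my arbitrary choices:

```python
import numpy as np

def gauss(x, m, s):
    return np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

mX, sX = 1.0, 1.5
mY, sY = -0.5, 2.0
x = np.linspace(-30, 30, 400001)
fX, fY = gauss(x, mX, sX), gauss(x, mY, sY)

# D(fX || fY) in bits by direct Riemann-sum integration of (16)
D_num = float(np.sum(fX * np.log2(fX / fY)) * (x[1] - x[0]))

# Standard closed form for 1-D Gaussians (in bits):
# log(sY/sX) + (log e)/2 * (sX^2/sY^2 + ((mX - mY)/sY)^2 - 1)
dm = mX - mY
D_formula = (np.log2(sY / sX)
             + 0.5 * np.log2(np.e) * (sX**2 / sY**2 + (dm / sY)**2 - 1))
assert abs(D_num - D_formula) < 1e-6
```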
(b) Relative entropy between two Gaussians: Let ∆m = m_X − m_Y. In 1-D, we have

D(f_X ‖ f_Y) = ½ log(σ_Y²/σ_X²) + (log e)/2 · (σ_X²/σ_Y² + (∆m/σ_Y)² − 1),

or, equivalently,

D(f_X ‖ f_Y) = log(σ_Y/σ_X) + (log e)/2 · ((σ_X/σ_Y)² + (∆m/σ_Y)² − 1)

[24, p. 1025]. In addition, when σ_X = σ_Y = σ, the relative entropy is simply

D(f_X ‖ f_Y) = (log e)/2 · ((m_X − m_Y)/σ)².    (23)

(e) Monotonic decrease of the non-Gaussianness of the sum of independent random variables [23]: Consider i.i.d. random variables X_1, X_2, .... Let S^(n) = ∑_{k=1}^n X_k and let S_N^(n) be a Gaussian random variable with the same mean and variance as S^(n). Then,

D(S^(n) ‖ S_N^(n)) ≤ D(S^(n−1) ‖ S_N^(n−1)).

(f) Suppose (X, Y) is a jointly Gaussian random vector with covariance matrix Λ = [Λ_X, Λ_XY; Λ_YX, Λ_Y]. The mutual information between the two jointly Gaussian vectors X and Y is

I(X; Y) = ½ log( (det Λ_X)(det Λ_Y) / det Λ ).

In particular, for jointly Gaussian random variables X, Y, we have

I(X; Y) = −½ log( 1 − (Cov(X, Y)/(σ_X σ_Y))² ).

9.18. Additive Channel: Suppose Y = X + N. Then, h(Y|X) = h(N|X) and thus I(X; Y) = h(Y) − h(N|X) because h is translation invariant. In fact, h(Y|x) is always h(N|x). Furthermore, if X and N are independent, then h(Y|X) = h(N) and I(X; Y) = h(Y) − h(N). In fact, h(Y|x) is always h(N).

For nonnegative Y,

I(X; Y)/EY ≤ (log_K(e·EY) − h(N))/EY ≤ (log_K e)/K^{h(N)},

where the first inequality is because exponential Y maximizes h(Y) for fixed EY and the second inequality is because EY = K^{h(N)} maximizes the middle term.

For Y ∈ [s_0, ∞), by the same reasoning, we have

I(X; Y)/EY ≤ (log_K(e(μ − s_0)) − h(N))/(μ − s_0) ≤ (log_K e)/K^{h(N)},

but in this case the shifted exponential on [s_0, ∞) maximizes h(Y) for fixed EY = μ. The second inequality uses μ = s_0 + K^{h(N)}, which maximizes the middle term.

Suppose, for Y ∈ [s_0, ∞), we now want to maximize I(X; Y)/(EY + r) for some r ≥ 0. Then, we can use the same technique but will have to solve for the optimal value μ* of μ numerically. The upper bound in this case is (log_K e)/(μ* − s_0).

9.19. Additive Gaussian Noise [11, 18, 19]: Suppose N is a (proper complex-valued multidimensional) Gaussian noise which is independent of (a complex-valued random vector) X. Here the distribution of X is not required to be Gaussian. Then, for

Y = √SNR · X + N,

we have

d/dSNR I(X; Y) = E[|X − E[X|Y]|²],

or equivalently, in expanded form,

d/dSNR I(X; √SNR · X + N) = E[|X − E[X|√SNR · X + N]|²],

where the RHS is the MMSE corresponding to the best estimation of X upon the observation Y for a given signal-to-noise ratio (SNR). Here, the mutual information is in nats.

Furthermore, for a deterministic matrix A, suppose Y = AX + N. Then, the gradient of the mutual information with respect to the matrix A can be expressed as

∇_A I(X; Y) = A E[(X − E[X|Y])(X − E[X|Y])^H],

where E[(X − E[X|Y])(X − E[X|Y])^H] = Cov[X − E[X|Y]] is the covariance matrix of the estimation error vector, also known as the MMSE matrix. Here, the complex derivative of a real-valued scalar function f is defined as

df/dx* = ½ (∂f/∂Re{x} + j ∂f/∂Im{x}),

and the complex gradient matrix is defined as ∇_A f = ∂f/∂A*, where [∂f/∂A*]_ij = ∂f/∂[A*]_ij.

9.20. Gaussian Additive Channel: Suppose X and N are independent Gaussian random vectors and Y = X + N. Then,

I(X; Y) = ½ log( det(Λ_X + Λ_N) / det Λ_N ).
In particular, for the one-dimensional case,

I(X; Y) = ½ log(1 + σ_X²/σ_N²).

9.21. Let Y = X + Z where X and Z are independent and X is Gaussian. Then,

• among Z with fixed mean and variance, Gaussian Z minimizes I(X; Y);
• among Z with fixed EZ², zero-mean Gaussian Z minimizes I(X; Y).

9.1 MEPD

• The normal distribution is the law with maximum entropy among all distributions with finite variances: if the constraint is on (1) EX² or (2) σ_X², then f* has the same form as a normal distribution. So, we just have to find a normal random variable satisfying condition (1) or (2).

• The exponential distribution has maximum entropy among all distributions concentrated on the positive halfline and possessing finite expectations:

  If S = [0, ∞) and EX = μ > 0, then f*(x) = (1/μ) e^{−x/μ} 1_{[0,∞)}(x) (exponential), with corresponding h(X*) = log(eμ).

  If S = [s_0, ∞) and EX = μ > s_0, then f*(x) = (1/(μ − s_0)) e^{−(x − s_0)/(μ − s_0)} 1_{[s_0,∞)}(x) (shifted exponential), with corresponding h(X*) = log(e(μ − s_0)).
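The maximum-entropy claims above are easy to probe numerically: among densities on [0, ∞) with mean μ, nothing should beat h(X*) = log(eμ). A sketch comparing the exponential against one competitor; the competitor (a uniform on [0, 2μ], which has the same mean) and μ = 2 are my arbitrary choices:

```python
import numpy as np

mu = 2.0
h_exp = np.log2(np.e * mu)          # h of Exp(mean mu), in bits

# Competitor with the same support and mean: uniform on [0, 2*mu],
# whose differential entropy is log2(2*mu)
h_unif = np.log2(2 * mu)
assert h_unif < h_exp               # the exponential wins

# Sanity check: numeric h of the exponential density matches log2(e*mu)
x = np.linspace(1e-9, 60 * mu, 2_000_001)
f = np.exp(-x / mu) / mu
h_num = float(np.sum(-f * np.log2(f)) * (x[1] - x[0]))
assert abs(h_num - h_exp) < 1e-3
```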
where α = (d/dβ) log M(β), with equality if and only if

f(x) = f*(x) = c_0 f̃(x) e^{βg(x)},

where c_0 = (M(β))^{−1}. Note that f* is said to generate an exponential family of distributions.

X_1, ..., X_{n−1} are Bernoulli taking on +1 or −1 with equal probability and X_n is uniformly distributed [17]. We list below the properties of this maximizing distribution. Let S_n* be the corresponding maximum differential entropy distribution.

(a) The sum ∑_{k=1}^{n−1} X_k is binomial on {2k − (n − 1) : k = 0, 1, ..., n − 1}, where the probability at the point 2j − (n − 1) is (n−1 choose j) 2^{−(n−1)}. There is no closed-form expression for the differential entropy of this binomial part.

(b) The maximum differential entropy is the sum of the binomial part and the uniform part. The differential entropy of the uniform part is log_K 2.

(a) f_{X,Y} = f_{X,Y}^{(λ)} maximizes h(X, Y) and h(Y|X).

9.2 Stochastic Processes

9.27. Gaussian Processes: The entropy rate of the Gaussian process (X_k) with power spectrum S(ω) is

H((X_k)) = ln √(2πe) + (1/4π) ∫_{−π}^{π} ln S(ω) dω.

10 General Probability Space
10.2. D(P ‖ Q) = sup ∑_{k=1}^n P(A_k) log( P(A_k) / Q(A_k) ), where the supremum is taken over all finite partitions {A_k} of the space.

…and the supremum in (25) is achieved by the set B = [δP > δQ].

d_TV is a true metric. In particular, 1) d_TV(μ_1, μ_2) ≥ 0 with equality if and only if μ_1 = μ_2; 2) d_TV(μ_1, μ_2) = d_TV(μ_2, μ_1); and 3) d_TV(μ_1, μ_2) ≤ d_TV(μ_1, ν) + d_TV(ν, μ_2). Furthermore, because μ_i(A) ∈ [0, 1], we have |μ_1(A) − μ_2(A)| ≤ 1 and thus d_TV(μ_1, μ_2) ≤ 1.

10.3. For random variables X and Y on a common probability space, we define

I(X; Y) = D(P^{X,Y} ‖ P^X × P^Y).
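The supremum characterization in 10.2 says every finite partition gives a lower bound on D(P ‖ Q), with the finest partition achieving it in the discrete case. A small sketch; the pmfs and the coarse two-cell partition are my arbitrary choices:

```python
import numpy as np

P = np.array([0.40, 0.30, 0.20, 0.10])
Q = np.array([0.25, 0.25, 0.25, 0.25])

def D(p, q):
    """Discrete relative entropy in bits (all masses positive)."""
    return float(np.sum(p * np.log2(p / q)))

full = D(P, Q)                       # finest partition: one cell per point

# Coarse partition {0,1} vs {2,3}: lump probabilities, then compare
P_coarse = np.array([P[0] + P[1], P[2] + P[3]])
Q_coarse = np.array([Q[0] + Q[1], Q[2] + Q[3]])
coarse = D(P_coarse, Q_coarse)

# Coarsening a partition can only lower the divergence
assert coarse <= full + 1e-12
```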
equivalently,

(2) ∀ε′ > 0 ∃N > 0 such that ∀n > N,

P[(A_ε^(n)(X))^c] = P[X_1^n ∉ A_ε^(n)(X)] < ε,

(1 − ε′) 2^{n(H(X,Y)−ε)} < |A_ε^(n)| ≤ 2^{n(H(X,Y)+ε)}.
Suppose (X̃_{J_1}^i, X̃_{J_2}^i, X̃_{J_3}^i) is drawn i.i.d. according to p(x_{J_1}|x_{J_3}) p(x_{J_2}|x_{J_3}) p(x_{J_3}); that is, X̃_{J_1}^i and X̃_{J_2}^i are conditionally independent given X̃_{J_3}^i but otherwise share the same pairwise marginals of (X̃_{J_1}, X̃_{J_2}, X̃_{J_3}). Let S̃_r = (x̃_{J_r}^{(i)})_{i=1}^n. Then,

(6) P[(S̃_1, S̃_2, S̃_3) ∈ A_ε^(n)(X_{J_1}, X_{J_2}, X_{J_3})] = 2^{−n(I(X_{J_1}; X_{J_2}|X_{J_3}) ∓ 6ε)}.

11.4. Strong typicality: Suppose X is finite. For a sequence x_1^n ∈ X^n and a ∈ X, define

N(a|x_1^n) = |{k : 1 ≤ k ≤ n, x_k = a}| = ∑_{k=1}^n 1[x_k = a].

Then, N(a|x_1^n) is the number of occurrences of the symbol a in the sequence x_1^n. Note that ∀x_1^n ∈ X^n, ∑_{a∈X} N(a|x_1^n) = n.

For i.i.d. X_i ∼ {p(x)}, ∀x_1^n ∈ X^n,

p_{X_1^n}(x_1^n) = ∏_{x∈X} p(x)^{N(x|x_1^n)}.

A sequence x_1^n ∈ X^n is said to be δ-strongly typical w.r.t. {p(x)} if (1) ∀a ∈ X with p(a) > 0, |N(a|x_1^n)/n − p(a)| < δ/|X|, and (2) ∀a ∈ X with p(a) = 0, N(a|x_1^n) = 0.

11.5. Jointly Strong Typicality: Let {P(x, y) = p(x)Q(y|x), x ∈ X, y ∈ Y} be the joint p.m.f. over X × Y. Denote the number of occurrences of the point (a, b) in the pair of sequences (x_1^n, y_1^n) by

N(a, b|x_1^n, y_1^n) = |{k : 1 ≤ k ≤ n, (x_k, y_k) = (a, b)}| = ∑_{k=1}^n 1[x_k = a] 1[y_k = b].

Then, N(x|x_1^n) = ∑_{y∈Y} N(x, y|x_1^n, y_1^n) and N(y|y_1^n) = ∑_{x∈X} N(x, y|x_1^n, y_1^n).

A pair of sequences (x_1^n, y_1^n) ∈ X^n × Y^n is said to be strongly δ-typical w.r.t. {P(x, y)} if

∀a ∈ X, ∀b ∈ Y: |N(a, b|x_1^n, y_1^n)/n − P(a, b)| < δ/(|X||Y|).

The set of all strongly typical sequences is called the strongly typical set and is denoted by

T_δ^n = T_δ^n(pQ) = T_δ^n(X, Y) = {(x_1^n, y_1^n) : (x_1^n, y_1^n) is δ-typical of {P(x, y)}}.

Suppose (X_i, Y_i) is drawn i.i.d. ∼ {P(x, y)}.

(1) lim_{n→∞} P[(X_1^n, Y_1^n) ∉ T_δ^n(X, Y)] = 0. That is, ∀α > 0, for n sufficiently large, P[(X_1^n, Y_1^n) ∈ T_δ^n(X, Y)] > 1 − α.
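Strong typicality (11.4) is a statement about empirical frequencies N(a|x_1^n)/n. A quick simulation sketch; the alphabet, pmf, n, δ, and random seed are my arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])        # pmf on the alphabet {0, 1, 2}
n, delta = 100_000, 0.03
x = rng.choice(3, size=n, p=p)

N = np.bincount(x, minlength=3)      # N(a | x_1^n) for each symbol a
assert N.sum() == n                  # the counts add up to n

# With high probability the sequence is delta-strongly typical:
# |N(a|x)/n - p(a)| < delta / |alphabet| for every a
assert np.all(np.abs(N / n - p) < delta / 3)
```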
For any x_1^n ∈ T_δ^n(X), define

T_δ^n(Y|X)(x_1^n) = {y_1^n : (x_1^n, y_1^n) ∈ T_δ^n(X, Y)}.

(Here a_n ≐ b_n implies that a_n and b_n are equal to the first order in the exponent.)

(5) For any x_1^n such that ∃y_1^n with (x_1^n, y_1^n) ∈ T_δ^n(X, Y),

|T_δ^n(Y|X)(x_1^n)| ≐ 2^{n(H(Y|X) ± ε′_δ)},

where ε′_δ → 0 as δ → 0 and n → ∞. The condition x_1^n ∈ T_δ^n(X), combined with the condition of the statement above, is equivalent to |T_δ^n(Y|X)(x_1^n)| ≥ 1.

(6) Let Y_i be drawn i.i.d. ∼ q_Y(y). Then,

P[(x_1^n, Y_1^n) ∈ T_δ^n(X, Y)] = P[Y_1^n ∈ T_δ^n(Y|X)(x_1^n)] ≐ 2^{−n(I(X;Y) ∓ ε″_δ)},

where ε″_δ → 0 as δ → 0 and n → ∞.

Now, we consider continuous random variables.

11.6. The AEP for continuous random variables: Let (X_i)_{i=1}^n be a sequence of random variables drawn i.i.d. according to the density f(x). Then,

−(1/n) log f(X_1^n) → E[−log f(X)] = h(X) in probability.

For ε > 0 and any n, we define the typical set A_ε^(n) with respect to f(x) as follows:

A_ε^(n) = {x_1^n ∈ S^n : |−(1/n) log f(x_1^n) − h(X)| ≤ ε},

where S is the support set of the random variable X and f(x_1^n) = ∏_{i=1}^n f(x_i). Note that the condition is equivalent to

2^{−n(h(X)+ε)} ≤ f(x_1^n) ≤ 2^{−n(h(X)−ε)}.

We also define the volume Vol(A) of a set A to be Vol(A) = ∫_A dx_1 dx_2 ⋯ dx_n.

(1) P[A_ε^(n)] > 1 − ε for n sufficiently large.

(2) ∀n, Vol(A_ε^(n)) ≤ 2^{n(h(X)+ε)}, and Vol(A_ε^(n)) ≥ (1 − ε)2^{n(h(X)−ε)} for n sufficiently large. That is, for n sufficiently large, we have

(1 − ε)2^{n(h(X)−ε)} ≤ Vol(A_ε^(n)) ≤ 2^{n(h(X)+ε)}.

• The set A_ε^(n) is the smallest-volume set with probability ≥ 1 − ε, to first order in the exponent. More specifically, for each n = 1, 2, ..., let B_δ^(n) ⊂ S^n be any set with P[B_δ^(n)] ≥ 1 − δ. Let X_1, ..., X_n be i.i.d. ∼ f(x). For δ < ½ and any δ′ > 0, (1/n) log Vol(B_δ^(n)) > h(X) − δ′ for n sufficiently large.

• The volume of the smallest set that contains most of the probability is approximately 2^{nh(X)}. This is an n-dimensional volume, so the corresponding side length is (2^{nh(X)})^{1/n} = 2^{h(X)}. Differential entropy is then the logarithm of the equivalent side length of the smallest set that contains most of the probability. Hence, low entropy implies that the random variable is confined to a small effective volume, and high entropy indicates that the random variable is widely dispersed.

• Remark: Just as the entropy is related to the volume of the typical set, there is a quantity called Fisher information which is related to the surface area of the typical set.

11.7. Jointly Typical Sequences

The set A_ε^(n) of jointly typical sequences (x^(n), y^(n)) with respect to the distribution f_{X,Y}(x, y) is the set of n-sequences with empirical entropies ε-close to the true entropies; i.e., A_ε^(n) is the set of (x^(n), y^(n)) ∈ X^n × Y^n such that

(a) |−(1/n) log f_{X_1^n}(x_1^n) − h(X)| < ε,
(b) |−(1/n) log f_{Y_1^n}(y_1^n) − h(Y)| < ε, and
(c) |−(1/n) log f_{X_1^n,Y_1^n}(x_1^n, y_1^n) − h(X, Y)| < ε.

Note that the following give an equivalent definition:

(a) 2^{−n(h(X)+ε)} < f_{X_1^n}(x_1^n) < 2^{−n(h(X)−ε)},
(b) 2^{−n(h(Y)+ε)} < f_{Y_1^n}(y_1^n) < 2^{−n(h(Y)−ε)}, and
(c) 2^{−n(h(X,Y)+ε)} < f_{X_1^n,Y_1^n}(x_1^n, y_1^n) < 2^{−n(h(X,Y)−ε)}.

Let (X_1^n, Y_1^n) be sequences of length n drawn i.i.d. according to f_{X_i,Y_i}(x_i, y_i) = f_{X,Y}(x_i, y_i). Then,

(1) P[A_ε^(n)] = P[(X_1^n, Y_1^n) ∈ A_ε^(n)] → 1 as n → ∞.

(2) Vol(A_ε^(n)) ≤ 2^{n(h(X,Y)+ε)}, and, for sufficiently large n, Vol(A_ε^(n)) ≥ (1 − ε)2^{n(h(X,Y)−ε)}.

(3) If (U_1^n, V_1^n) ∼ f_{X_1^n}(u_1^n) f_{Y_1^n}(v_1^n), i.e., U_1^n and V_1^n are independent with the same marginals as f_{X_1^n,Y_1^n}(x_1^n, y_1^n), then

P[(U_1^n, V_1^n) ∈ A_ε^(n)] ≤ 2^{−n(I(X;Y)−3ε)}.
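The continuous AEP in 11.6 can be simulated: for i.i.d. N(0, 1) samples, −(1/n) log f(X_1^n) concentrates around h(X) = ½ log(2πe). A sketch; n, the seed, and the tolerance are my arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.standard_normal(n)

# log2 f(x_i) for the standard normal density
log2_f = -x**2 / (2 * np.log(2)) - 0.5 * np.log2(2 * np.pi)
emp = -np.mean(log2_f)                  # -(1/n) log2 f(X_1^n)
h = 0.5 * np.log2(2 * np.pi * np.e)     # ~ 2.047 bits

assert abs(emp - h) < 0.02              # concentration around h(X)
```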
12 I-measure

In this section, we present theories which establish a one-to-one correspondence between Shannon's information measures and set theory. The resulting theorems provide an alternative approach to information-theoretic equalities and inequalities.

Consider n random variables X_1, X_2, ..., X_n. For any random variable X, let X̃ be a set corresponding to X. Define the universal set Ω to be ∪_{i∈[n]} X̃_i. The field F_n generated by the sets X̃_1, X̃_2, ..., X̃_n is the collection of sets which can be obtained by any sequence of usual set operations (union, intersection, complement, and difference) on X̃_1, X̃_2, ..., X̃_n. The atoms of F_n are sets of the form ∩_{i=1}^n Y_i, where Y_i is either X̃_i or X̃_i^c. Note that all atoms in F_n are disjoint. The set A_0 = ∩_{i∈[n]} X̃_i^c = ∅ is called the empty atom of F_n. All the atoms of F_n other than A_0 are called nonempty atoms. Let A be the set of all nonempty atoms of F_n. Then, |A|, the cardinality of A, is equal to 2^n − 1.

• Each set in F_n can be expressed uniquely as the union of a subset of the atoms of F_n.
• Any signed measure μ on F_n is completely specified by the values of μ on the nonempty atoms of F_n.

12.1. We define the I-measure μ* on F_n by

μ*(X̃_G) = H(X_G) for all nonempty G ⊂ [n].

12.2. For all (not necessarily disjoint) subsets G, G′, G″ of [n]:

(a) μ*(X̃_G ∪ X̃_{G″}) = μ*(X̃_{G∪G″}) = H(X_{G∪G″})
(b) μ*(X̃_G ∩ X̃_{G′} − X̃_{G″}) = I(X_G; X_{G′}|X_{G″})
(c) μ*(A_0) = 0

Note that (b) is the necessary and sufficient condition for μ* to be consistent with all Shannon's information measures because

• when G and G′ are nonempty, μ*(X̃_G ∩ X̃_{G′} − X̃_{G″}) = I(X_G; X_{G′}|X_{G″});
• when G″ = ∅, we have μ*(X̃_G ∩ X̃_{G′}) = I(X_G; X_{G′}).

In fact, μ* is the unique signed measure on F_n which is consistent with all Shannon's information measures. We then have the substitution of symbols shown in Table 5.

H, I ↔ μ*
 ,   ↔ ∪
 ;   ↔ ∩
 |   ↔ −

Table 5: Substitution of symbols

Motivated by the substitution of symbols, we will write μ*(X̃_{G_1} ∩ X̃_{G_2} ∩ ⋯ ∩ X̃_{G_m} − X̃_F) as I(X̃_{G_1}; X̃_{G_2}; ⋯; X̃_{G_m}|X̃_F).

12.3. If there is no constraint on X_1, X_2, ..., X_n, then μ* can take any set of nonnegative values on the nonempty atoms of F_n.

12.4. Because of the one-to-one correspondence between Shannon's information measures and set theory, it is valid to use an information diagram, which is a variation of a Venn diagram, to represent relationships between Shannon's information measures. However, one must be careful. An I-measure μ* can take negative values. Therefore, when we see in an information diagram that A is a subset of B, we cannot conclude from this fact alone that μ*(A) ≤ μ*(B) unless we know from the setup of the problem that μ* is nonnegative. For example, μ* is nonnegative if the random variables involved form a Markov chain.

• For a given n, there are ∑_{k=3}^n (n choose k) nonempty atoms that do not correspond to Shannon's information measures and hence can be negative.
• For n ≥ 4, it is not possible to display an information diagram perfectly in two dimensions. In general, an information diagram for n random variables needs n − 1 dimensions to be displayed perfectly.
• In an information diagram, the universal set Ω is not shown explicitly.
• When μ* takes the value zero on an atom A of F_n, we do not need to display A in an information diagram, because A does not contribute to μ*(B) for any set B ∈ F_n containing the atom A.
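The I-measure can be computed directly for three random variables: the central atom μ*(X̃_1 ∩ X̃_2 ∩ X̃_3) equals I(X_1; X_2) − I(X_1; X_2|X_3), and it can be negative. A sketch using the classic XOR construction (the example is mine, not from the text):

```python
import numpy as np
from itertools import product

# X1, X2 fair independent bits; X3 = X1 XOR X2
states = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]
P = {s: 0.25 for s in states}

def H(varsel):
    """Entropy in bits of the selected coordinates of (X1, X2, X3)."""
    marg = {}
    for s, pr in P.items():
        key = tuple(s[i] for i in varsel)
        marg[key] = marg.get(key, 0.0) + pr
    return -sum(p * np.log2(p) for p in marg.values() if p > 0)

I12   = H([0]) + H([1]) - H([0, 1])                      # I(X1;X2)
I12g3 = H([0, 2]) + H([1, 2]) - H([0, 1, 2]) - H([2])    # I(X1;X2|X3)
mu_atom = I12 - I12g3            # mu* of the central atom, can be < 0

assert abs(I12) < 1e-12          # X1, X2 independent
assert abs(I12g3 - 1.0) < 1e-12  # but fully dependent given X3
assert abs(mu_atom + 1.0) < 1e-12  # the central atom is -1 bit
```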
H(X_i|X_j, X_k) = 0. So, ∀(i, j), i ≠ j, H(X_1, X_2, X_3) = H(X_i, X_j). Furthermore, let X_1, X_2, X_3 be pairwise independent. Then, ∀(i, j), i ≠ j, I(X_i; X_j) = 0.

12.8. Information diagram for the Markov chain X_1 → X_2 → X_3.

12.9. For four random variables (or random vectors), the atoms colored in sky-blue in Figure 16 can be negative.

13 MATLAB

function H = entropy2(p)
p = p(p > 0); % convention: 0*log(0) = 0
H = -sum(p.*log(p))/log(2);

13.2. Function information calculates the mutual information I(X; Y) = I(p, Q), where p is the row vector describing p_X and Q is a matrix defined by Q_ij = P[Y = y_j|X = x_i].
function I = information(p,Q)
X = length(p);
q = p*Q;
HY = entropy2(q);
temp = [];
for i = 1:X
    temp = [temp entropy2(Q(i,:))];
end
HYgX = sum(p.*temp);
I = HY-HYgX;

13.3. Function capacity calculates the pmf p* which achieves the capacity C = max_p I(p, Q) using the Blahut–Arimoto algorithm [6, Section 10.8]. Given a DMC with transition probabilities Q(y|x) and any input distribution p_0(x), define a sequence p_r(x), r = 0, 1, ..., according to the iterative prescription

p_{r+1}(x) = p_r(x) c_r(x) / ∑_x p_r(x) c_r(x),

where

log c_r(x) = ∑_y Q(y|x) log( Q(y|x) / q_r(y) )    (26)

and

q_r(y) = ∑_x p_r(x) Q(y|x).

Then,

log( ∑_x p_r(x) c_r(x) ) ≤ C ≤ log( max_x c_r(x) ).

Note that (26) is D(P_{Y|X=x} ‖ P_Y) when P_X = p_r.

function ps = capacity(pT,Q,n)
% n = number of iterations
for k = 1:n
    X = size(Q,1);
    qT = pT*Q;
    CT = [];
    for i = 1:X
        sQlq = Q(i,:).*log2(qT);
        temp = -entropy2(Q(i,:))-sum(sQlq);
        CT = [CT 2^(temp)];
    end
    temp = sum(pT.*CT);
    pT = 1/temp*(pT.*CT);
end
ps = pT;

Alternatively, the following code uses the MATLAB function fmincon to find p*.

function [ps C] = capacity_fmincon(p,Q)
% CAPACITY_FMINCON accepts an initial input
% probability mass function p = pX and a
% transition matrix Q = fY|X, and calculates
% the corresponding capacity.
% Note that p is a column vector.
mi = @(p) -information(p.',Q);
sp = size(p);
onep = ones(sp); zerop = zeros(sp);
[ps Cm] = fmincon(mi,p,[],[],onep.',1,zerop);
% The 5th and 6th arguments force the sum of
% elements in p to be 1. The 7th argument forces
% the elements to be nonnegative.
C = -Cm;

13.4. The following script demonstrates how to use the Symbolic Toolbox to calculate the mutual information between two continuous random variables.

syms x y
% Define the densities fX and fY|X
fX = 1/sqrt(2*pi*4)*exp(-1/2*x^2/4);
fYcX = 1/sqrt(2*pi)*exp(-1/2*(y-x)^2);
% Support for X and Y
rX = [-inf, inf];
rY = [-inf, inf];
% Calculate mutual information
fY = int(fX*fYcX,x,rX(1),rX(2));
hY = -int(fY*log2(fY),y,rY(1),rY(2));
hYcx = -int(fYcX*log2(fYcX),y,rY(1),rY(2));
hYcX = int(fX*hYcx,x,rX(1),rX(2));
IXY = hY-hYcX;
eval(IXY)

References

[1] N. Abramson. Information Theory and Coding. McGraw-Hill, New York, 1963.

[2] T. Berger. Multiterminal source coding. In Lecture notes presented at the 1977 CISM Summer School, Udine, Italy, July 18–20, 1977.

[3] Richard E. Blahut. Principles and Practice of Information Theory. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1987.

[4] A. S. Cohen and R. Zamir. Entropy amplification property and the loss for writing on dirty paper. IEEE Transactions on Information Theory, 54(4):1477–1487, April 2008.

[5] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications. John Wiley & Sons, New York, 1991.

[6] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 2006.
[7] I. Csiszár and J. Körner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, 1981.

[8] I. Csiszár and G. Tusnády. Information geometry and alternating minimization procedures. Recent Results in Estimation Theory and Related Topics, 1984.

[9] G. A. Darbellay and I. Vajda. Entropy expressions for multivariate continuous distributions. IEEE Transactions on Information Theory, 46:709–712, 2000.

[10] R. M. Gray. Entropy and Information Theory. Springer-Verlag, New York, 1990.

[11] D. Guo, S. Shamai (Shitz), and S. Verdú. Mutual information and minimum mean-square error in Gaussian channels. IEEE Transactions on Information Theory, 51:1261–1282, 2005.

[12] Oliver Johnson. Information Theory and the Central Limit Theorem. Imperial College Press, 2004.

[13] J. N. Kapur and H. K. Kesavan. Entropy Optimization Principles with Applications. Academic Press, 1992.

[22] Claude E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27(3):379–423, July 1948. Continued 27(4):623–656, October 1948.

[23] A. M. Tulino and S. Verdú. Monotonic decrease of the non-Gaussianness of the sum of independent random variables: A simple proof. IEEE Transactions on Information Theory, 52:4295–4297, 2006.

[24] S. Verdú. On channel capacity per unit cost. IEEE Transactions on Information Theory, 36(5):1019–1030, September 1990.

[25] A. C. G. Verdugo Lazo and P. N. Rathie. On the entropy of continuous probability distributions. IEEE Transactions on Information Theory, 24:120–122, 1978.

[26] Raymond W. Yeung. A First Course in Information Theory. Kluwer Academic Publishers, 2002.

[27] R. Zamir and U. Erez. A Gaussian input is not too bad. IEEE Transactions on Information Theory, 50(6):1362–1367, June 2004.
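The Blahut–Arimoto iteration (26) from 13.3 ports directly to other languages; as a cross-check, a Python sketch tested on a BSC with crossover 0.1, whose capacity 1 − H_2(0.1) is standard (the test channel and iteration count are my arbitrary choices):

```python
import numpy as np

def blahut_arimoto(Q, iters=200):
    """Capacity of a DMC with rows Q[x, :] = P(Y = y | X = x), in bits."""
    nx = Q.shape[0]
    p = np.full(nx, 1.0 / nx)                  # p_0: uniform input
    bound = 0.0
    for _ in range(iters):
        q = p @ Q                              # q_r(y)
        # log2 c_r(x) = sum_y Q(y|x) log2(Q(y|x) / q_r(y)), with 0 log 0 = 0
        ratio = np.where(Q > 0, Q * np.log2(np.where(Q > 0, Q / q, 1.0)), 0.0)
        c = 2.0 ** ratio.sum(axis=1)
        bound = np.log2(np.sum(p * c))         # lower bound on C
        p = p * c / np.sum(p * c)              # iterative prescription
    return float(bound), p

Q_bsc = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
C, p_star = blahut_arimoto(Q_bsc)

H2 = lambda a: -a * np.log2(a) - (1 - a) * np.log2(1 - a)
assert abs(C - (1 - H2(0.1))) < 1e-6        # C = 1 - H2(0.1) for the BSC
assert np.allclose(p_star, [0.5, 0.5], atol=1e-6)
```

For the symmetric channel the uniform input is a fixed point, so the lower bound is already exact after the first iteration.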