1 Entropy and Conditional Entropy
Definitions
Properties
2 Mutual Information
Definitions
Properties
3 Information Divergence
Definitions
Properties
Entropy and Conditional Entropy
Definition 1 (Entropy)
The entropy of a (discrete) random variable $X \in \mathcal{X}$ with probability mass function $P_X(\cdot)$ is defined as
\[
H(X) \triangleq \mathbb{E}_X\!\left[\log \frac{1}{P_X(X)}\right] = \sum_{x \in \mathcal{X}} P_X(x) \log \frac{1}{P_X(x)}.
\]
(By convention we set $0 \log(1/0) = 0$, since $\lim_{t \to 0^+} t \log t = 0$.)
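As a quick numerical companion to Definition 1, here is a minimal Python sketch of the entropy formula in bits (the function name `entropy` and the sample pmfs are illustrative choices, not from the slides); it applies the $0 \log(1/0) = 0$ convention by skipping zero-probability outcomes.

```python
import math

def entropy(pmf, base=2.0):
    """Entropy H(X) = sum_x p(x) * log(1/p(x)), with the 0*log(1/0) := 0 convention."""
    return sum(p * math.log(1.0 / p, base) for p in pmf if p > 0)

# A fair coin has one bit of entropy; a deterministic r.v. has zero.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([1.0, 0.0]))   # 0.0  (the zero-probability term is dropped)
print(entropy([0.25] * 4))   # 2.0 = log2 |X| for a uniform pmf
```

Note that the uniform pmf attains $\log_2|\mathcal{X}|$, consistent with the upper bound discussed later.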
Example 2
Entropy and Conditional Entropy
Entropy: Definition
Initially we define entropy for a single random variable; it is straightforward to extend this definition to a sequence of random variables, i.e., a random vector.
The entropy of a random vector is also called the joint entropy of the component random
variables.
Definition 2 (Entropy)
The entropy of a $d$-dimensional random vector $\mathbf{X} \triangleq (X_1, \ldots, X_d)$ is defined by the expectation of the self-information $\log \frac{1}{P_{\mathbf{X}}(\mathbf{X})}$:
\[
H(\mathbf{X}) \triangleq \mathbb{E}_{\mathbf{X}}\!\left[\log \frac{1}{P_{\mathbf{X}}(\mathbf{X})}\right]
= \sum_{\mathbf{x} \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_d} P_{\mathbf{X}}(\mathbf{x}) \log \frac{1}{P_{\mathbf{X}}(\mathbf{x})}
= H(X_1, \ldots, X_d).
\]
Remark: The entropy of a r.v. is a function of the distribution of the r.v. Hence, we often write $H(P)$ and $H(X)$ interchangeably for a r.v. $X \sim P$.
Definitions
Example 3
Compute H (X 1 ), H (X 2 ), and H (X 1 , X 2 ).
sol:
\[
H(X_1, X_2) = 2 \cdot \tfrac{1}{6} \log 6 + 2 \cdot \tfrac{1}{3} \log 3 = \tfrac{1}{3} + \log 3,
\qquad
H(X_1) = H(X_2) = 2 \left( \tfrac{1}{6} + \tfrac{1}{3} \right) \log 2 = 1.
\]
Compared to Example 2, this shows that the value of the entropy depends only on the distribution of the random variable/vector, not on the actual values it may take.
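The joint pmf table for Example 3 did not survive extraction, so the sketch below assumes a hypothetical joint pmf with entries $1/6, 1/6, 1/3, 1/3$ and uniform marginals, chosen only because it reproduces the values computed above; it recomputes $H(X_1, X_2)$, $H(X_1)$, and $H(X_2)$ numerically.

```python
import math

def entropy(pmf, base=2.0):
    return sum(p * math.log(1.0 / p, base) for p in pmf if p > 0)

# Hypothetical joint pmf consistent with the values computed above
# (the original table did not survive): rows indexed by x1, columns by x2.
joint = [[1/6, 1/3],
         [1/3, 1/6]]

H_joint = entropy([p for row in joint for p in row])
H_x1 = entropy([sum(row) for row in joint])         # marginal of X1
H_x2 = entropy([sum(col) for col in zip(*joint)])   # marginal of X2

print(H_joint, 1/3 + math.log2(3))   # both ~1.918
print(H_x1, H_x2)                    # both 1.0
```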
Entropy and Conditional Entropy: Definitions
Conditional Entropy
For two r.v.'s with conditional p.m.f. $P_{X|Y}(x|y)$, we can define "the entropy of $X$ given $Y = y$" according to $P_{X|Y}(\cdot\,|y)$:
\[
H(X \mid Y = y) \triangleq \sum_{x \in \mathcal{X}} P_{X|Y}(x|y) \log \frac{1}{P_{X|Y}(x|y)}.
\]
$H(X \mid Y = y)$ is the amount of uncertainty of $X$ when we know that $Y$ takes the value $y$. Averaging over $Y$, we obtain the amount of uncertainty of $X$ given $Y$:
Definition 3 (Conditional Entropy)
\[
H(X \mid Y) \triangleq \sum_{y \in \mathcal{Y}} P_Y(y) \, H(X \mid Y = y) = \mathbb{E}_{X,Y}\!\left[\log \frac{1}{P_{X|Y}(X \mid Y)}\right].
\]
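Below is a small sketch of Definition 3, reusing the hypothetical joint pmf assumed earlier (indexed as joint[x][y]); it forms $P_{X|Y}(\cdot\,|y)$ for each $y$ and averages the resulting entropies with weights $P_Y(y)$.

```python
import math

def entropy(pmf, base=2.0):
    return sum(p * math.log(1.0 / p, base) for p in pmf if p > 0)

def conditional_entropy(joint):
    """H(X|Y) = sum_y P_Y(y) * H(X | Y = y) for a joint pmf joint[x][y]."""
    p_y = [sum(joint[x][y] for x in range(len(joint))) for y in range(len(joint[0]))]
    h = 0.0
    for y, py in enumerate(p_y):
        if py == 0:
            continue
        cond_pmf = [joint[x][y] / py for x in range(len(joint))]  # P_{X|Y}(.|y)
        h += py * entropy(cond_pmf)                               # average over Y
    return h

# Same hypothetical joint pmf as before.
joint = [[1/6, 1/3],
         [1/3, 1/6]]
print(conditional_entropy(joint))   # ~0.918 = log2(3) - 2/3
```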
Example 4
Entropy and Conditional Entropy: Properties
Properties of Entropy
$H(X) \le \log|\mathrm{supp}\, X| \le \log|\mathcal{X}|$, with equality in the first inequality iff $X$ is uniform over its support.
pf: Let the support of $X$, $\mathrm{supp}\, X$, denote the subset of $\mathcal{X}$ where $X$ takes non-zero probability. Define a new r.v. $U \triangleq \frac{1}{P_X(X)}$. Note that $\mathbb{E}[U] = |\mathrm{supp}\, X|$. Hence,
\[
H(X) = \mathbb{E}[\log U] \overset{\text{(Jensen)}}{\le} \log\left(\mathbb{E}[U]\right) = \log|\mathrm{supp}\, X| \le \log|\mathcal{X}|.
\]
The first inequality holds with equality iff $U$ is deterministic, i.e., iff for all $x \in \mathrm{supp}\, X$ the values $P_X(x)$ are equal, that is, iff $X$ is uniform over its support.
Chain Rule
$H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$.
Conditioning Reduces Entropy
$H(X \mid Y) \le H(X)$, with equality iff $X$ and $Y$ are independent.
Interpretation: the more one learns, the less the uncertainty. The amount of uncertainty about your target remains the same if and only if what you have learned is independent of your target.
Exercise 3
While it is always true that $H(X \mid Y) \le H(X)$, for a particular $y \in \mathcal{Y}$ the following are both possible:
$H(X \mid Y = y) < H(X)$, or
$H(X \mid Y = y) > H(X)$.
Please construct examples for the above two cases respectively.
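The snippet below is not a solution key, just one numerical illustration with a joint pmf invented for this purpose: observing the likely outcome $Y = 0$ removes all uncertainty about $X$, while observing the rare outcome $Y = 1$ makes $X$ uniform and so raises the conditional uncertainty above $H(X)$.

```python
import math

def entropy(pmf, base=2.0):
    return sum(p * math.log(1.0 / p, base) for p in pmf if p > 0)

# Hypothetical joint pmf joint[x][y]: X is heavily skewed a priori, but becomes
# uniform once the rare outcome Y = 1 is observed.
joint = [[0.8, 0.1],
         [0.0, 0.1]]

p_x = [sum(row) for row in joint]         # [0.9, 0.1]
p_y = [sum(col) for col in zip(*joint)]   # [0.8, 0.2]

print(entropy(p_x))                                     # H(X)      ~0.469
print(entropy([joint[x][0] / p_y[0] for x in (0, 1)]))  # H(X|Y=0) = 0   < H(X)
print(entropy([joint[x][1] / p_y[1] for x in (0, 1)]))  # H(X|Y=1) = 1   > H(X)
```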
Entropy and Conditional Entropy: Properties
Example 5
$H(X_1, X_2) = \log 3 + \tfrac{1}{3}$, $\quad H(X_1) = H(X_2) = 1$, $\quad H(X_1 \mid X_2) = H(X_2 \mid X_1) = \log 3 - \tfrac{2}{3}$.
It is straightforward to check that the chain rule holds: $H(X_1) + H(X_2 \mid X_1) = 1 + \log 3 - \tfrac{2}{3} = \log 3 + \tfrac{1}{3} = H(X_1, X_2)$. Besides, it can easily be seen that conditioning reduces entropy: $H(X_1 \mid X_2) = \log 3 - \tfrac{2}{3} \approx 0.918 < 1 = H(X_1)$.
Entropy and Conditional Entropy: Properties
Generalization
Proofs of the more general "Chain Rule" and "Conditioning Reduces Entropy" are left as exercises.
[Figure: learning $Y$ shrinks the uncertainty about $X$ from $H(X)$ down to $H(X \mid Y)$; the reduction is the mutual information $I(X; Y)$.]
Mutual Information
Definition 4 (Mutual Information)
For a pair of jointly distributed r.v.'s $(X, Y)$, the mutual information between them is defined as
\[
I(X; Y) \triangleq H(X) - H(X \mid Y).
\]
$0 \le I(X; Y) \le H(X)$.
pf: The proof of the first inequality is due to the fact that conditioning reduces entropy. The proof of the second one is due to $H(X \mid Y) \ge 0$.
Conditional mutual information:
\[
I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) = H(Y \mid Z) - H(Y \mid X, Z) = H(X \mid Z) + H(Y \mid Z) - H(X, Y \mid Z).
\]
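A short sketch of Definition 4, reusing the hypothetical joint pmf from the earlier examples: it computes $I(X; Y) = H(X) - H(X \mid Y)$ and compares it with the value $1 - (\log_2 3 - \tfrac{2}{3})$ implied by Example 5's figures, under the assumption that the same distribution is meant.

```python
import math

def entropy(pmf, base=2.0):
    return sum(p * math.log(1.0 / p, base) for p in pmf if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y) for a joint pmf joint[x][y]."""
    h_x = entropy([sum(row) for row in joint])
    p_y = [sum(col) for col in zip(*joint)]
    h_x_given_y = sum(
        py * entropy([joint[x][y] / py for x in range(len(joint))])
        for y, py in enumerate(p_y) if py > 0
    )
    return h_x - h_x_given_y

# Same hypothetical joint pmf as in the earlier sketches.
joint = [[1/6, 1/3],
         [1/3, 1/6]]
print(mutual_information(joint))   # ~0.0817
print(1 - (math.log2(3) - 2/3))    # same value, from the Example 5 figures
```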
Mutual Information: Properties
\[
I(X; Y_1, \ldots, Y_n) = \sum_{i=1}^{n} I(X; Y_i \mid Y_1, \ldots, Y_{i-1}).
\]
pf: Proved by definition and the chain rule for entropy.
Mutual Information: Properties
[Figure: point-to-point communication system. $W \to$ Encoder $\to X[1:N] \to$ Noisy Channel $P_{Y|X} \to Y[1:N] \to$ Decoder $\to \hat{W}$.]
Exercise 7
Show that $X_1 - X_2 - X_3 - X_4 \implies I(X_1; X_4) \le I(X_2; X_3)$.
Mutual Information: Properties
Theorem 12
Let $(X, Y) \sim P_{X,Y} = P_X P_{Y|X}$.
With $P_{Y|X}$ fixed, $I(X; Y)$ is a concave function of $P_X$.
With $P_X$ fixed, $I(X; Y)$ is a convex function of $P_{Y|X}$.
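As a numerical sanity check of the first part of Theorem 12, the sketch below uses a binary symmetric channel with fixed crossover probability (the BSC is my choice of example, not from the slides) and verifies one instance of concavity: $I(X;Y)$ evaluated at the midpoint of two input distributions is at least the average of the two endpoint values.

```python
import math

def h2(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mi(q, eps):
    """I(X;Y) for input X ~ Bernoulli(q) through a BSC with crossover eps:
    I(X;Y) = H(Y) - H(Y|X) = h2(q*(1-eps) + (1-q)*eps) - h2(eps)."""
    return h2(q * (1 - eps) + (1 - q) * eps) - h2(eps)

eps = 0.1
q1, q2 = 0.2, 0.8
mid = bsc_mi(0.5 * (q1 + q2), eps)
avg = 0.5 * (bsc_mi(q1, eps) + bsc_mi(q2, eps))
print(mid >= avg, mid, avg)   # True: the midpoint value dominates the average,
                              # as concavity of I(.;Y) in P_X requires
```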
Information Divergence