
Entropy and Conditional Entropy

Information Theory and Coding


(Lecture 2)

Dr. Farman Ullah

farmankttk@gmail.com
1  Entropy and Conditional Entropy
   Definitions
   Properties

2  Mutual Information
   Definitions
   Properties

3  Information Divergence
   Definitions
   Properties


Entropy: Measure of Uncertainty of a Random Variable

$\log \frac{1}{P\{X = x\}}$: a measure of the information/uncertainty of an outcome x.

If the outcome has small probability, it carries more uncertainty; however, on average, it happens rarely. Hence, to measure the uncertainty of a random variable, we should take the expectation of the self-information over all possible realizations:

Definition 1 (Entropy)

The entropy of a (discrete) random variable $X \in \mathcal{X}$ with probability mass function $P_X(\cdot)$ is defined as
$$H(X) \triangleq \mathbb{E}_X\!\left[\log \frac{1}{P_X(X)}\right] = \sum_{x \in \mathcal{X}} P_X(x) \log \frac{1}{P_X(x)}.$$
(By convention we set $0 \log(1/0) = 0$, since $\lim_{t \to 0^+} t \log \frac{1}{t} = 0$.)

Note: Entropy can be understood as the (average) amount of information one gains when learning the actual outcome/realization of the r.v. X.
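As a quick numerical companion to Definition 1, here is a minimal Python sketch (the function name entropy, the plain list input, and the base-2 logarithm are my own choices, not from the slides) that computes H(X) from a p.m.f.:

    import math

    def entropy(pmf, base=2):
        """H(X) = sum_x p(x) * log(1/p(x)); terms with p(x) = 0 are skipped (0 log(1/0) = 0)."""
        return sum(p * math.log(1.0 / p, base) for p in pmf if p > 0)

    # A fair coin carries exactly 1 bit of uncertainty.
    print(entropy([0.5, 0.5]))   # 1.0
    # A deterministic outcome carries none.
    print(entropy([1.0, 0.0]))   # 0.0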

Example 1 (Binary entropy function)

Let $X \sim \mathrm{Ber}(p)$ be a Bernoulli random variable, that is, $X \in \{0, 1\}$, $P_X(1) = 1 - P_X(0) = p$, $p \in [0, 1]$. Then the entropy of X is called the binary entropy function $H_b(p)$, where
$$H_b(p) \triangleq H(X) = -p \log p - (1 - p) \log(1 - p).$$

(Figure: plot of $H_b(p)$ over $p \in [0, 1]$, rising from 0 at $p = 0$ to its peak value 1 at $p = 1/2$ and back to 0 at $p = 1$.)

Exercise 1

1  Analytically check that $\max_{p \in [0, 1]} H_b(p) = 1$ and $\arg\max_{p \in [0, 1]} H_b(p) = 1/2$.

2  Analytically prove that $H_b(p)$ is concave in $p$.
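A numerical (not analytical) companion to Exercise 1, sketched in Python under the same base-2 convention; it only evaluates H_b(p) on a grid, so it does not replace the requested analytical argument:

    import math

    def binary_entropy(p):
        """H_b(p) = -p log2(p) - (1-p) log2(1-p), with H_b(0) = H_b(1) = 0."""
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    # Locate the maximum of H_b over a fine grid of p values in [0, 1].
    grid = [i / 10000 for i in range(10001)]
    p_star = max(grid, key=binary_entropy)
    print(p_star, binary_entropy(p_star))   # approximately 0.5 and 1.0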

Example 2

Consider a random variable $X \in \{0, 1, 2, 3\}$ with p.m.f. defined as follows:

    x       0     1     2     3
    P(x)    1/6   1/3   1/3   1/6

Compute H(X) and H(Y), where $Y \triangleq X \bmod 2$.

sol: $H(X) = 2 \cdot \tfrac{1}{6} \log 6 + 2 \cdot \tfrac{1}{3} \log 3 = \tfrac{1}{3} + \log 3$.
$H(Y) = 2 \cdot \tfrac{1}{2} \log 2 = 1$.

(When the context is clear, we drop the subscripts in $P_X$, $P_Y$, $P_{Y|X}$, etc.)
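A possible numerical check of Example 2 (reusing the entropy sketch above; base-2 logs assumed, so answers are in bits):

    import math
    from collections import defaultdict

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    p_x = {0: 1/6, 1: 1/3, 2: 1/3, 3: 1/6}

    # Push the p.m.f. of X through the map Y = X mod 2.
    p_y = defaultdict(float)
    for x, p in p_x.items():
        p_y[x % 2] += p

    print(entropy(p_x.values()))   # 1/3 + log2(3), about 1.918
    print(entropy(p_y.values()))   # 1.0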

Entropy: Definition

Initially we define entropy for a random variable; it is straightforward to extend this definition to a sequence of random variables, or, a random vector.

The entropy of a random vector is also called the joint entropy of the component random variables.

Definition 2 (Entropy)

The entropy of a d-dimensional random vector $\mathbf{X} \triangleq [X_1 \ \cdots \ X_d]$ is defined by the expectation of the self-information:
$$H(\mathbf{X}) \triangleq \mathbb{E}_{\mathbf{X}}\!\left[\log \frac{1}{P_{\mathbf{X}}(\mathbf{X})}\right] = H(X_1, \ldots, X_d).$$

Remark: The entropy of a r.v. is a function of the distribution of the r.v. Hence, we often write H(P) and H(X) interchangeably for a r.v. $X \sim P$.

Example 3

Consider two random variables $X_1, X_2 \in \{0, 1\}$ with joint p.m.f.

    (x_1, x_2)      (0, 0)  (0, 1)  (1, 0)  (1, 1)
    P(x_1, x_2)      1/6     1/3     1/3     1/6

Compute $H(X_1)$, $H(X_2)$, and $H(X_1, X_2)$.

sol: $H(X_1, X_2) = 2 \cdot \tfrac{1}{6} \log 6 + 2 \cdot \tfrac{1}{3} \log 3 = \tfrac{1}{3} + \log 3$.
$H(X_1) = 2 \left(\tfrac{1}{6} + \tfrac{1}{3}\right) \log 2 = 1 = H(X_2)$.

Compared to Example 2, it can be understood that the value of entropy only depends on the distribution of the random variable/vector, not on the actual values it may take.
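A numerical check of Example 3 along the same lines (the dictionary layout of the joint p.m.f. is my own choice):

    import math
    from collections import defaultdict

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    p_joint = {(0, 0): 1/6, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/6}

    # Marginalize the joint p.m.f. to get P_{X1} and P_{X2}.
    p_x1, p_x2 = defaultdict(float), defaultdict(float)
    for (x1, x2), p in p_joint.items():
        p_x1[x1] += p
        p_x2[x2] += p

    print(entropy(p_joint.values()))                       # H(X1, X2) = 1/3 + log2(3)
    print(entropy(p_x1.values()), entropy(p_x2.values()))  # H(X1) = H(X2) = 1.0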

Conditional Entropy

For two r.v.'s with conditional p.m.f. $P_{X|Y}(x|y)$, we are able to define "the entropy of X given Y = y" according to $P_{X|Y}(\cdot|y)$:
$$H(X \mid Y = y) \triangleq \sum_{x \in \mathcal{X}} P_{X|Y}(x|y) \log \frac{1}{P_{X|Y}(x|y)}.$$

H(X | Y = y): the amount of uncertainty of X when we know that Y takes the value y.
Averaging over Y, we obtain the amount of uncertainty of X given Y:

Definition 3 (Conditional Entropy)

The conditional entropy of X given Y is defined by
$$H(X \mid Y) \triangleq \sum_{y \in \mathcal{Y}} P_Y(y)\, H(X \mid Y = y) = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} P_{X,Y}(x, y) \log \frac{1}{P_{X|Y}(x|y)} = \mathbb{E}_{X,Y}\!\left[\log \frac{1}{P_{X|Y}(X \mid Y)}\right].$$
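A minimal sketch of Definition 3 (helper name and dictionary layout are mine; base-2 logs), using the expectation form together with $P_{X|Y}(x|y) = P_{X,Y}(x,y)/P_Y(y)$:

    import math
    from collections import defaultdict

    def conditional_entropy(p_joint):
        """H(X|Y) = sum_{x,y} P(x,y) * log2( P_Y(y) / P_{X,Y}(x,y) ) for {(x, y): prob}."""
        p_y = defaultdict(float)
        for (x, y), p in p_joint.items():
            p_y[y] += p
        return sum(p * math.log2(p_y[y] / p) for (x, y), p in p_joint.items() if p > 0)

    # With the joint p.m.f. of Example 4 below, H(X1|X2) = log2(3) - 2/3, about 0.918.
    p = {(0, 0): 1/6, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/6}
    print(conditional_entropy(p))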

Example 4

Consider two random variables $X_1, X_2 \in \{0, 1\}$ with joint p.m.f.

    (x_1, x_2)      (0, 0)  (0, 1)  (1, 0)  (1, 1)
    P(x_1, x_2)      1/6     1/3     1/3     1/6

Compute $H(X_1 \mid X_2 = 0)$, $H(X_1 \mid X_2 = 1)$, $H(X_1 \mid X_2)$, and $H(X_2 \mid X_1)$.

sol: The conditional p.m.f.'s are

    (x_1, x_2)       (0, 0)  (0, 1)  (1, 0)  (1, 1)
    P(x_1 | x_2)      1/3     2/3     2/3     1/3
    P(x_2 | x_1)      1/3     2/3     2/3     1/3

$H(X_1 \mid X_2 = 0) = \tfrac{1}{3} \log 3 + \tfrac{2}{3} \log \tfrac{3}{2} = H_b\!\left(\tfrac{1}{3}\right)$, $\quad H(X_1 \mid X_2 = 1) = \tfrac{2}{3} \log \tfrac{3}{2} + \tfrac{1}{3} \log 3 = H_b\!\left(\tfrac{1}{3}\right)$.
$H(X_1 \mid X_2) = 2 \cdot \tfrac{1}{2}\, H_b\!\left(\tfrac{1}{3}\right) = \log 3 - \tfrac{2}{3} = H(X_2 \mid X_1)$.

Properties of Entropy

Theorem 2 (Properties of (Joint) Entropy)

1  $H(X) \ge 0$, with equality iff X is deterministic.

2  $H(X) \le \log |\mathcal{X}|$, with equality iff X is uniformly distributed over $\mathcal{X}$.

3  $H(\mathbf{X}) \le \sum_{i=1}^{d} \log |\mathcal{X}_i|$, with equality iff $\mathbf{X}$ is uniformly distributed over $\mathcal{X}_1 \times \cdots \times \mathcal{X}_d$.

Interpretation: Quite natural:

Amount of uncertainty in X = 0 ⟺ X is deterministic.
Amount of uncertainty in X is maximized ⟺ X is equally likely to take every value in $\mathcal{X}$.

Lemma 1 (Jensen's Inequality)

Let $f : \mathbb{R} \to \mathbb{R}$ be a strictly concave function, and X be a real-valued r.v. Then $\mathbb{E}[f(X)] \le f(\mathbb{E}[X])$, with equality iff X is deterministic.

We shall use the above lemma to prove that $H(X) \le \log |\mathcal{X}|$, with equality iff $X \sim \mathrm{Unif}[\mathcal{X}]$.

pf: Let the support of X, $\mathrm{supp}\, X$, denote the subset of $\mathcal{X}$ where X takes non-zero probability.
Define a new r.v. $U \triangleq \frac{1}{P_X(X)}$. Note that $\mathbb{E}[U] = |\mathrm{supp}\, X|$. Hence,
$$H(X) = \mathbb{E}[\log U] \overset{\text{(Jensen)}}{\le} \log\!\left(\mathbb{E}[U]\right) = \log |\mathrm{supp}\, X| \le \log |\mathcal{X}|.$$

The first inequality holds with equality iff U is deterministic, i.e., iff $P_X(x)$ is the same for all $x \in \mathrm{supp}\, X$. The second inequality holds with equality iff $\mathrm{supp}\, X = \mathcal{X}$.
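A small numeric illustration of the two inequalities appearing in the proof, H(X) ≤ log|supp X| ≤ log|X| (the specific p.m.f. is my own toy example; base-2 logs):

    import math

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    # X has |X| = 4 possible values but only 3 of them carry non-zero probability.
    pmf = [0.5, 0.25, 0.25, 0.0]
    print(entropy(pmf))                                  # H(X) = 1.5
    print(math.log2(sum(1 for p in pmf if p > 0)))       # log|supp X|, about 1.585
    print(math.log2(len(pmf)))                           # log|X| = 2.0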



Chain Rule

Theorem 3 (Chain Rule)


H (X, Y ) = H (Y ) + H ( X |Y ) = H ( X ) + H (Y |X ).

Interpretation: Amount of uncertainty of (X, Y) = Amount of uncertainty of Y + Amount of uncertainty of X after knowing Y.

Conditioning Reduces Entropy

Theorem 4 (Conditioning Reduces Entropy)

$H(X \mid Y) \le H(X)$, with equality iff X is independent of Y.

Interpretation: The more one learns, the less the uncertainty is.
The amount of uncertainty of your target remains the same if and only if what you have learned
is independent of your target.

Exercise 3
While it is always true that $H(X \mid Y) \le H(X)$, for a particular $y \in \mathcal{Y}$ the following two are both possible:
H ( X |Y = y ) < H ( X ), or

H ( X |Y = y ) > H ( X ) .
Please construct examples for the above two cases respectively.

Example 5

Consider two random variables $X_1, X_2 \in \{0, 1\}$ with joint p.m.f.

    (x_1, x_2)      (0, 0)  (0, 1)  (1, 0)  (1, 1)
    P(x_1, x_2)      1/6     1/3     1/3     1/6

In the previous examples, we have
$$H(X_1, X_2) = \log 3 + \tfrac{1}{3}, \quad H(X_1) = H(X_2) = 1, \quad H(X_1 \mid X_2) = H(X_2 \mid X_1) = \log 3 - \tfrac{2}{3}.$$

It is straightforward to check that the chain rule holds. Besides, it can easily be seen that conditioning reduces entropy.
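A numerical confirmation of these two facts for the p.m.f. of Example 5 (reusing the earlier entropy and conditional_entropy sketches; base-2 logs):

    import math
    from collections import defaultdict

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    def conditional_entropy(p_joint):
        p_y = defaultdict(float)
        for (x, y), p in p_joint.items():
            p_y[y] += p
        return sum(p * math.log2(p_y[y] / p) for (x, y), p in p_joint.items() if p > 0)

    p_joint = {(0, 0): 1/6, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/6}
    p_x1, p_x2 = defaultdict(float), defaultdict(float)
    for (x1, x2), p in p_joint.items():
        p_x1[x1] += p
        p_x2[x2] += p

    h_joint = entropy(p_joint.values())
    h1, h2 = entropy(p_x1.values()), entropy(p_x2.values())
    h1_given_2 = conditional_entropy(p_joint)          # conditions on the second coordinate

    print(abs(h_joint - (h2 + h1_given_2)) < 1e-12)    # chain rule holds: True
    print(h1_given_2 <= h1)                            # conditioning reduces entropy: True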


Generalization

Proofs of the more general "Chain Rule" and "Conditioning Reduces Entropy" are left as exercises.

Theorem 5 (Chain Rule)

The chain rule can be generalized to more than two r.v.'s:
$$H(X_1, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}).$$

Theorem 6 (Conditioning Reduces Entropy)

"Conditioning reduces entropy" can be generalized to more than two r.v.'s:
$$H(X \mid Y, Z) \le H(X \mid Y).$$
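A sketch of Theorem 5 in Python, using the identity H(X_i | X_1, ..., X_{i-1}) = H(X_1, ..., X_i) − H(X_1, ..., X_{i-1}); the three-variable toy p.m.f. is my own example, not from the slides:

    import math
    from collections import defaultdict

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    def entropy_via_chain_rule(p_joint, n):
        """Accumulate H(X_i | X_1, ..., X_{i-1}) as differences of prefix entropies."""
        total, h_prev = 0.0, 0.0
        for i in range(1, n + 1):
            prefix = defaultdict(float)
            for xs, p in p_joint.items():
                prefix[xs[:i]] += p
            h_prefix = entropy(prefix.values())
            total += h_prefix - h_prev      # this difference is H(X_i | X_1, ..., X_{i-1})
            h_prev = h_prefix
        return total

    p = {(0, 0, 0): 0.3, (0, 1, 1): 0.2, (1, 0, 1): 0.25, (1, 1, 0): 0.25}
    print(entropy(p.values()), entropy_via_chain_rule(p, 3))   # the two values coincide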

Upper Bound on Joint Entropy

Corollary 1 (Joint Entropy ≤ Sum of Marginal Entropies)

$$H(X_1, \ldots, X_n) \le \sum_{i=1}^{n} H(X_i).$$

The proof is left as an exercise (chain rule of entropy + conditioning reduces entropy).
Mutual Information


Conditioning Reduces Entropy Revisited


Entropy quantifies the amount of uncertainty of a r.v., say, X.
Conditional entropy quantifies the amount of uncertainty of a r.v. X given another r.v., say, Y.

(Figure: two bars comparing H(X) with the smaller H(X|Y) that remains after learning Y; the gap between them is labeled I(X;Y).)

Question: How much information does Y tell about X?

Ans: The amount of information about X that one obtains by learning Y is H(X) − H(X|Y).

Mutual Information

Definition 4 (Mutual Information)

For a pair of jointly distributed r.v.'s (X, Y), the mutual information between them is defined as
$$I(X; Y) \triangleq H(X) - H(X \mid Y).$$

(Figure: the same bar diagram as above, with the gap between H(X) and H(X|Y) labeled I(X;Y).)

Relate: what channel coding does is to infer some information about the channel input X from the channel output Y.

(Diagram: X → noisy channel $P_{Y|X}(y|x)$ → Y.)
An Identity about Mutual Information

Theorem 8

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y).$$
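A minimal sketch of Definition 4 together with the identity above (helper names mine; base-2 logs), computing I(X;Y) as H(X) + H(Y) − H(X,Y):

    import math
    from collections import defaultdict

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    def mutual_information(p_joint):
        """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint p.m.f. {(x, y): prob}."""
        p_x, p_y = defaultdict(float), defaultdict(float)
        for (x, y), p in p_joint.items():
            p_x[x] += p
            p_y[y] += p
        return entropy(p_x.values()) + entropy(p_y.values()) - entropy(p_joint.values())

    # For the joint p.m.f. of Examples 3-5: I(X1;X2) = 1 - (log2(3) - 2/3), about 0.082 bits.
    p = {(0, 0): 1/6, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/6}
    print(mutual_information(p))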

Mutual Information Measures the Level of Dependency

Theorem 9 (Extremal Values of Mutual Information)

1  $I(X; Y) \ge 0$, with equality iff X, Y are independent.

2  $I(X; Y) \le H(X)$, with equality iff X is a deterministic function of Y.

pf: The first follows from the fact that conditioning reduces entropy; the second follows from $H(X \mid Y) \ge 0$.

Interpretation: the mutual information I(X; Y) can also be viewed as a measure of the dependency between X and Y:

If X is determined by Y (highly dependent), I(X; Y) is maximized.
If X is independent of Y (no dependency), I(X; Y) = 0.

Conditional Mutual Information

Definition 5 (Conditional Mutual Information)

For a tuple of jointly distributed r.v.'s (X, Y, Z), the mutual information between X and Y given Z is
$$I(X; Y \mid Z) \triangleq H(X \mid Z) - H(X \mid Y, Z).$$

Similar to the previous identity (Theorem 8), we have
$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) = H(Y \mid Z) - H(Y \mid X, Z) = H(X \mid Z) + H(Y \mid Z) - H(X, Y \mid Z).$$

Similar to Theorem 9, we have

1  $I(X; Y \mid Z) \ge 0$, with equality iff X, Y are independent given Z (i.e., X − Z − Y forms a Markov chain).

2  $I(X; Y \mid Z) \le H(X \mid Z)$, with equality iff X is a deterministic function of Y and Z.
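A possible sketch of Definition 5 (names mine; base-2 logs), using the equivalent form I(X;Y|Z) = H(X,Z) + H(Y,Z) − H(X,Y,Z) − H(Z), which follows from the identities above together with the chain rule:

    import math
    from collections import defaultdict

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    def conditional_mutual_information(p_xyz):
        """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) for {(x, y, z): prob}."""
        p_xz, p_yz, p_z = defaultdict(float), defaultdict(float), defaultdict(float)
        for (x, y, z), p in p_xyz.items():
            p_xz[(x, z)] += p
            p_yz[(y, z)] += p
            p_z[z] += p
        return (entropy(p_xz.values()) + entropy(p_yz.values())
                - entropy(p_xyz.values()) - entropy(p_z.values()))

    # Three i.i.d. fair bits: X and Y are independent given Z, so I(X;Y|Z) should be 0.
    p = {(x, y, z): 1 / 8 for x in (0, 1) for y in (0, 1) for z in (0, 1)}
    print(conditional_mutual_information(p))   # 0.0 (up to floating-point error)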



Chain Rule for Mutual Information

Theorem 10 (Chain Rule for Mutual Information)

$$I(X; Y_1, \ldots, Y_n) = \sum_{i=1}^{n} I(X; Y_i \mid Y_1, \ldots, Y_{i-1}).$$

pf: Proved by definition and the chain rule for entropy.
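A quick numerical illustration of Theorem 10 for n = 2, expressing every term through joint entropies (helper names and the toy p.m.f. are my own; base-2 logs):

    import math
    from collections import defaultdict

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    def H(p_full, coords):
        """Entropy of the marginal on the selected coordinates of {(x, y1, y2): prob}."""
        marg = defaultdict(float)
        for xs, p in p_full.items():
            marg[tuple(xs[i] for i in coords)] += p
        return entropy(marg.values())

    p = {(0, 0, 0): 0.25, (0, 0, 1): 0.10, (0, 1, 1): 0.15,
         (1, 0, 1): 0.20, (1, 1, 0): 0.20, (1, 1, 1): 0.10}

    i_x_y1y2 = H(p, [0]) + H(p, [1, 2]) - H(p, [0, 1, 2])                    # I(X; Y1, Y2)
    i_x_y1 = H(p, [0]) + H(p, [1]) - H(p, [0, 1])                            # I(X; Y1)
    i_x_y2_g_y1 = H(p, [0, 1]) + H(p, [1, 2]) - H(p, [0, 1, 2]) - H(p, [1])  # I(X; Y2 | Y1)

    print(abs(i_x_y1y2 - (i_x_y1 + i_x_y2_g_y1)) < 1e-12)                    # True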

Data Processing Inequality: Applications


Markov chains are common in communication systems. For example, in channel coding (without feedback), the message W, the channel input $X^N \triangleq X[1:N]$, the channel output $Y^N \triangleq Y[1:N]$, and the decoded message $\hat{W}$ form a Markov chain $W - X^N - Y^N - \hat{W}$.

(Diagram: W → Encoder → X[1:N] → Noisy Channel $p_{Y|X}$ → Y[1:N] → Decoder → $\hat{W}$.)

The data processing inequality is crucial in obtaining impossibility results in information theory.
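For a Markov chain X − Y − Z, the data processing inequality gives I(X;Z) ≤ I(X;Y); below is a small numerical illustration with two cascaded binary symmetric channels (the crossover probabilities 0.1 and 0.2 and the helper names are my own choices, not from the slides):

    import math
    from collections import defaultdict

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    def mutual_information(p_joint):
        p_a, p_b = defaultdict(float), defaultdict(float)
        for (a, b), p in p_joint.items():
            p_a[a] += p
            p_b[b] += p
        return entropy(p_a.values()) + entropy(p_b.values()) - entropy(p_joint.values())

    def bsc(bit, eps):
        """Output distribution of a binary symmetric channel with crossover probability eps."""
        return {bit: 1 - eps, 1 - bit: eps}

    # Markov chain X - Y - Z: X ~ Ber(1/2), Y = BSC_0.1(X), Z = BSC_0.2(Y).
    p_xy, p_xz = defaultdict(float), defaultdict(float)
    for x in (0, 1):
        for y, pyx in bsc(x, 0.1).items():
            p_xy[(x, y)] += 0.5 * pyx
            for z, pzy in bsc(y, 0.2).items():
                p_xz[(x, z)] += 0.5 * pyx * pzy

    print(mutual_information(p_xy), mutual_information(p_xz))   # I(X;Y) >= I(X;Z)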
Exercise 6 (Functions of R.V.)

For $Z \triangleq g(Y)$ being a deterministic function of Y, show that $H(Y) \ge H(Z)$ and $I(X; Y) \ge I(X; Z)$.

Exercise 7

Show that $X_1 - X_2 - X_3 - X_4 \implies I(X_1; X_4) \le I(X_2; X_3)$.

Convexity and Concavity of Mutual Information


The convexity/concavity properties of mutual information turn out to be very useful in computing channel capacity and rate-distortion functions, as we will see in later lectures.

Theorem 12

Let $(X, Y) \sim P_{X,Y} = P_X P_{Y|X}$.

With $P_{Y|X}$ fixed, I(X; Y) is a concave function of $P_X$.
With $P_X$ fixed, I(X; Y) is a convex function of $P_{Y|X}$.
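A hedged spot-check of the first claim in Theorem 12 for a binary symmetric channel with fixed crossover probability (the specific numbers are arbitrary); this only verifies midpoint concavity at one pair of input distributions and is not a proof:

    import math

    def hb(p):
        """Binary entropy function, base 2."""
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def mi_bsc(p, eps):
        """I(X;Y) = H(Y) - H(Y|X) = hb(p(1-eps) + (1-p)eps) - hb(eps) for X ~ Ber(p) over BSC(eps)."""
        return hb(p * (1 - eps) + (1 - p) * eps) - hb(eps)

    eps, p1, p2 = 0.1, 0.2, 0.7
    lhs = mi_bsc((p1 + p2) / 2, eps)
    rhs = 0.5 * (mi_bsc(p1, eps) + mi_bsc(p2, eps))
    print(lhs >= rhs)   # True: consistent with concavity of I(X;Y) in P_X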
Information Divergence

Measuring the Distance between Probability Distributions

Information Divergence

Definition 6 (Information Divergence (Kullback-Leibler Divergence, Relative Entropy))

The information divergence (relative entropy) between two p.m.f.'s P and Q on the same alphabet $\mathcal{X}$ is defined as
$$D(P \,\|\, Q) \triangleq \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}.$$
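A minimal Python sketch of Definition 6 (the list-based p.m.f.'s and base-2 logs are my choices; it assumes Q(x) > 0 wherever P(x) > 0):

    import math

    def kl_divergence(p, q):
        """D(P||Q) = sum_x P(x) * log2( P(x) / Q(x) ), skipping terms with P(x) = 0."""
        return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

    # D(P||Q) is non-negative and, in general, not symmetric in P and Q.
    p = [1/6, 1/3, 1/3, 1/6]
    q = [1/4, 1/4, 1/4, 1/4]
    print(kl_divergence(p, q), kl_divergence(q, p))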
Example 9 (Binary divergence function)
Conditional Information Divergence
Log-Sum Inequality
