
Entropy and Conditional Entropy

Information Theory and Coding


(Lecture 2)

Dr. Farman Ullah

farmankttk@gmail.com
1  Entropy and Conditional Entropy
   Definitions
   Properties

2  Mutual Information
   Definitions
   Properties

3  Information Divergence
   Definitions
   Properties


Entropy: Measure of Uncertainty of a Random Variable

$\log \frac{1}{P\{X = x\}}$: a measure of the information/uncertainty of an outcome x.

If the outcome has small probability, it carries more uncertainty; however, on average, it happens rarely. Hence, to measure the uncertainty of a random variable, we should take the expectation of the self-information over all possible realizations:

Definition 1 (Entropy)

The entropy of a (discrete) random variable $X \in \mathcal{X}$ with probability mass function $P_X(\cdot)$ is defined as
$$H(X) \triangleq \mathbb{E}_X\!\left[\log \frac{1}{P_X(X)}\right] = \sum_{x \in \mathcal{X}} P_X(x) \log \frac{1}{P_X(x)}.$$
(By convention we set $0 \log(1/0) = 0$, since $\lim_{t \to 0^+} t \log \frac{1}{t} = 0$.)

Note: Entropy can be understood as the (average) amount of information one gains when learning the actual outcome/realization of the r.v. X.
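As a quick numerical companion to Definition 1, here is a minimal Python sketch (the function name entropy, the plain list input, and the base-2 logarithm are my own choices, not from the slides) that computes H(X) from a p.m.f.:

    import math

    def entropy(pmf, base=2):
        """H(X) = sum_x p(x) * log(1/p(x)); terms with p(x) = 0 are skipped (0 log(1/0) = 0)."""
        return sum(p * math.log(1.0 / p, base) for p in pmf if p > 0)

    # A fair coin carries exactly 1 bit of uncertainty.
    print(entropy([0.5, 0.5]))   # 1.0
    # A deterministic outcome carries none.
    print(entropy([1.0, 0.0]))   # 0.0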

Example 1 (Binary entropy function)

Let $X \sim \mathrm{Ber}(p)$ be a Bernoulli random variable, that is, $X \in \{0, 1\}$, $P_X(1) = 1 - P_X(0) = p$, $p \in [0, 1]$. Then the entropy of X is called the binary entropy function $H_b(p)$, where
$$H_b(p) \triangleq H(X) = -p \log p - (1 - p) \log(1 - p).$$

(Figure: plot of $H_b(p)$ over $p \in [0, 1]$, rising from 0 at $p = 0$ to its peak value 1 at $p = 1/2$ and back to 0 at $p = 1$.)

Exercise 1

1  Analytically check that $\max_{p \in [0, 1]} H_b(p) = 1$ and $\arg\max_{p \in [0, 1]} H_b(p) = 1/2$.

2  Analytically prove that $H_b(p)$ is concave in $p$.
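A numerical (not analytical) companion to Exercise 1, sketched in Python under the same base-2 convention; it only evaluates H_b(p) on a grid, so it does not replace the requested analytical argument:

    import math

    def binary_entropy(p):
        """H_b(p) = -p log2(p) - (1-p) log2(1-p), with H_b(0) = H_b(1) = 0."""
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    # Locate the maximum of H_b over a fine grid of p values in [0, 1].
    grid = [i / 10000 for i in range(10001)]
    p_star = max(grid, key=binary_entropy)
    print(p_star, binary_entropy(p_star))   # approximately 0.5 and 1.0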

Example 2

Consider a random variable $X \in \{0, 1, 2, 3\}$ with p.m.f. defined as follows:

    x       0     1     2     3
    P(x)    1/6   1/3   1/3   1/6

Compute H(X) and H(Y), where $Y \triangleq X \bmod 2$.

sol: $H(X) = 2 \cdot \tfrac{1}{6} \log 6 + 2 \cdot \tfrac{1}{3} \log 3 = \tfrac{1}{3} + \log 3$.
$H(Y) = 2 \cdot \tfrac{1}{2} \log 2 = 1$.

(When the context is clear, we drop the subscripts in $P_X$, $P_Y$, $P_{Y|X}$, etc.)
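A possible numerical check of Example 2 (reusing the entropy sketch above; base-2 logs assumed, so answers are in bits):

    import math
    from collections import defaultdict

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    p_x = {0: 1/6, 1: 1/3, 2: 1/3, 3: 1/6}

    # Push the p.m.f. of X through the map Y = X mod 2.
    p_y = defaultdict(float)
    for x, p in p_x.items():
        p_y[x % 2] += p

    print(entropy(p_x.values()))   # 1/3 + log2(3), about 1.918
    print(entropy(p_y.values()))   # 1.0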

Entropy: Definition

Initially we define entropy for a random variable; it is straightforward to extend this definition to a sequence of random variables, or, a random vector.

The entropy of a random vector is also called the joint entropy of the component random variables.

Definition 2 (Entropy)

The entropy of a d-dimensional random vector $\mathbf{X} \triangleq [X_1 \ \cdots \ X_d]$ is defined by the expectation of the self-information:
$$H(\mathbf{X}) \triangleq \mathbb{E}_{\mathbf{X}}\!\left[\log \frac{1}{P_{\mathbf{X}}(\mathbf{X})}\right] = H(X_1, \ldots, X_d).$$

Remark: The entropy of a r.v. is a function of the distribution of the r.v. Hence, we often write H(P) and H(X) interchangeably for a r.v. $X \sim P$.

Example 3

Consider two random variables $X_1, X_2 \in \{0, 1\}$ with joint p.m.f.

    (x_1, x_2)      (0, 0)  (0, 1)  (1, 0)  (1, 1)
    P(x_1, x_2)      1/6     1/3     1/3     1/6

Compute $H(X_1)$, $H(X_2)$, and $H(X_1, X_2)$.

sol: $H(X_1, X_2) = 2 \cdot \tfrac{1}{6} \log 6 + 2 \cdot \tfrac{1}{3} \log 3 = \tfrac{1}{3} + \log 3$.
$H(X_1) = 2 \left(\tfrac{1}{6} + \tfrac{1}{3}\right) \log 2 = 1 = H(X_2)$.

Compared to Example 2, it can be understood that the value of entropy only depends on the distribution of the random variable/vector, not on the actual values it may take.
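A numerical check of Example 3 along the same lines (the dictionary layout of the joint p.m.f. is my own choice):

    import math
    from collections import defaultdict

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    p_joint = {(0, 0): 1/6, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/6}

    # Marginalize the joint p.m.f. to get P_{X1} and P_{X2}.
    p_x1, p_x2 = defaultdict(float), defaultdict(float)
    for (x1, x2), p in p_joint.items():
        p_x1[x1] += p
        p_x2[x2] += p

    print(entropy(p_joint.values()))                       # H(X1, X2) = 1/3 + log2(3)
    print(entropy(p_x1.values()), entropy(p_x2.values()))  # H(X1) = H(X2) = 1.0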

Conditional Entropy

For two r.v.'s with conditional p.m.f. $P_{X|Y}(x|y)$, we are able to define "the entropy of X given Y = y" according to $P_{X|Y}(\cdot|y)$:
$$H(X \mid Y = y) \triangleq \sum_{x \in \mathcal{X}} P_{X|Y}(x|y) \log \frac{1}{P_{X|Y}(x|y)}.$$

H(X | Y = y): the amount of uncertainty of X when we know that Y takes the value y.
Averaging over Y, we obtain the amount of uncertainty of X given Y:

Definition 3 (Conditional Entropy)

The conditional entropy of X given Y is defined by
$$H(X \mid Y) \triangleq \sum_{y \in \mathcal{Y}} P_Y(y)\, H(X \mid Y = y) = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} P_{X,Y}(x, y) \log \frac{1}{P_{X|Y}(x|y)} = \mathbb{E}_{X,Y}\!\left[\log \frac{1}{P_{X|Y}(X \mid Y)}\right].$$
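A minimal sketch of Definition 3 (helper name and dictionary layout are mine; base-2 logs), using the expectation form together with $P_{X|Y}(x|y) = P_{X,Y}(x,y)/P_Y(y)$:

    import math
    from collections import defaultdict

    def conditional_entropy(p_joint):
        """H(X|Y) = sum_{x,y} P(x,y) * log2( P_Y(y) / P_{X,Y}(x,y) ) for {(x, y): prob}."""
        p_y = defaultdict(float)
        for (x, y), p in p_joint.items():
            p_y[y] += p
        return sum(p * math.log2(p_y[y] / p) for (x, y), p in p_joint.items() if p > 0)

    # With the joint p.m.f. of Example 4 below, H(X1|X2) = log2(3) - 2/3, about 0.918.
    p = {(0, 0): 1/6, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/6}
    print(conditional_entropy(p))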

Example 4

Consider two random variables $X_1, X_2 \in \{0, 1\}$ with joint p.m.f.

    (x_1, x_2)      (0, 0)  (0, 1)  (1, 0)  (1, 1)
    P(x_1, x_2)      1/6     1/3     1/3     1/6

Compute $H(X_1 \mid X_2 = 0)$, $H(X_1 \mid X_2 = 1)$, $H(X_1 \mid X_2)$, and $H(X_2 \mid X_1)$.

sol: The conditional p.m.f.'s are

    (x_1, x_2)       (0, 0)  (0, 1)  (1, 0)  (1, 1)
    P(x_1 | x_2)      1/3     2/3     2/3     1/3
    P(x_2 | x_1)      1/3     2/3     2/3     1/3

$H(X_1 \mid X_2 = 0) = \tfrac{1}{3} \log 3 + \tfrac{2}{3} \log \tfrac{3}{2} = H_b\!\left(\tfrac{1}{3}\right)$, $\quad H(X_1 \mid X_2 = 1) = \tfrac{2}{3} \log \tfrac{3}{2} + \tfrac{1}{3} \log 3 = H_b\!\left(\tfrac{1}{3}\right)$.
$H(X_1 \mid X_2) = 2 \cdot \tfrac{1}{2}\, H_b\!\left(\tfrac{1}{3}\right) = \log 3 - \tfrac{2}{3} = H(X_2 \mid X_1)$.

Properties of Entropy

Theorem 2 (Properties of (Joint) Entropy)

1  $H(X) \ge 0$, with equality iff X is deterministic.

2  $H(X) \le \log |\mathcal{X}|$, with equality iff X is uniformly distributed over $\mathcal{X}$.

3  $H(\mathbf{X}) \le \sum_{i=1}^{d} \log |\mathcal{X}_i|$, with equality iff $\mathbf{X}$ is uniformly distributed over $\mathcal{X}_1 \times \cdots \times \mathcal{X}_d$.

Interpretation: Quite natural:

Amount of uncertainty in X = 0 ⟺ X is deterministic.
Amount of uncertainty in X is maximized ⟺ X is equally likely to take every value in $\mathcal{X}$.

Lemma 1 (Jensen's Inequality)

Let $f : \mathbb{R} \to \mathbb{R}$ be a strictly concave function, and X be a real-valued r.v. Then $\mathbb{E}[f(X)] \le f(\mathbb{E}[X])$, with equality iff X is deterministic.

We shall use the above lemma to prove that $H(X) \le \log |\mathcal{X}|$, with equality iff $X \sim \mathrm{Unif}[\mathcal{X}]$.

pf: Let the support of X, $\mathrm{supp}\, X$, denote the subset of $\mathcal{X}$ where X takes non-zero probability.
Define a new r.v. $U \triangleq \frac{1}{P_X(X)}$. Note that $\mathbb{E}[U] = |\mathrm{supp}\, X|$. Hence,
$$H(X) = \mathbb{E}[\log U] \overset{\text{(Jensen)}}{\le} \log\!\left(\mathbb{E}[U]\right) = \log |\mathrm{supp}\, X| \le \log |\mathcal{X}|.$$

The first inequality holds with equality iff U is deterministic, i.e., iff $P_X(x)$ is the same for all $x \in \mathrm{supp}\, X$. The second inequality holds with equality iff $\mathrm{supp}\, X = \mathcal{X}$.
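A small numeric illustration of the two inequalities appearing in the proof, H(X) ≤ log|supp X| ≤ log|X| (the specific p.m.f. is my own toy example; base-2 logs):

    import math

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    # X has |X| = 4 possible values but only 3 of them carry non-zero probability.
    pmf = [0.5, 0.25, 0.25, 0.0]
    print(entropy(pmf))                                  # H(X) = 1.5
    print(math.log2(sum(1 for p in pmf if p > 0)))       # log|supp X|, about 1.585
    print(math.log2(len(pmf)))                           # log|X| = 2.0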



Chain Rule

Theorem 3 (Chain Rule)


H (X, Y ) = H (Y ) + H ( X |Y ) = H ( X ) + H (Y |X ).

Interpretation: Amount of uncertainty of (X, Y) = Amount of uncertainty of Y + Amount of uncertainty of X after knowing Y.

Conditioning Reduces Entropy

Theorem 4 (Conditioning Reduces Entropy)

$H(X \mid Y) \le H(X)$, with equality iff X is independent of Y.

Interpretation: The more one learns, the less the uncertainty is.
The amount of uncertainty of your target remains the same if and only if what you have learned
is independent of your target.

Exercise 3
While it is always true that $H(X \mid Y) \le H(X)$, for a particular $y \in \mathcal{Y}$ the following two are both possible:
H ( X |Y = y ) < H ( X ), or

H ( X |Y = y ) > H ( X ) .
Please construct examples for the above two cases respectively.

Example 5

Consider two random variables $X_1, X_2 \in \{0, 1\}$ with joint p.m.f.

    (x_1, x_2)      (0, 0)  (0, 1)  (1, 0)  (1, 1)
    P(x_1, x_2)      1/6     1/3     1/3     1/6

In the previous examples, we have
$$H(X_1, X_2) = \log 3 + \tfrac{1}{3}, \quad H(X_1) = H(X_2) = 1, \quad H(X_1 \mid X_2) = H(X_2 \mid X_1) = \log 3 - \tfrac{2}{3}.$$

It is straightforward to check that the chain rule holds. Besides, it can easily be seen that conditioning reduces entropy.
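A numerical confirmation of these two facts for the p.m.f. of Example 5 (reusing the earlier entropy and conditional_entropy sketches; base-2 logs):

    import math
    from collections import defaultdict

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    def conditional_entropy(p_joint):
        p_y = defaultdict(float)
        for (x, y), p in p_joint.items():
            p_y[y] += p
        return sum(p * math.log2(p_y[y] / p) for (x, y), p in p_joint.items() if p > 0)

    p_joint = {(0, 0): 1/6, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/6}
    p_x1, p_x2 = defaultdict(float), defaultdict(float)
    for (x1, x2), p in p_joint.items():
        p_x1[x1] += p
        p_x2[x2] += p

    h_joint = entropy(p_joint.values())
    h1, h2 = entropy(p_x1.values()), entropy(p_x2.values())
    h1_given_2 = conditional_entropy(p_joint)          # conditions on the second coordinate

    print(abs(h_joint - (h2 + h1_given_2)) < 1e-12)    # chain rule holds: True
    print(h1_given_2 <= h1)                            # conditioning reduces entropy: True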


Generalization

Proofs of the more general "Chain Rule" and "Conditioning Reduces Entropy" are left as exercises.

Theorem 5 (Chain Rule)

The chain rule can be generalized to more than two r.v.'s:
$$H(X_1, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}).$$

Theorem 6 (Conditioning Reduces Entropy)

"Conditioning reduces entropy" can be generalized to more than two r.v.'s:
$$H(X \mid Y, Z) \le H(X \mid Y).$$
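A sketch of Theorem 5 in Python, using the identity H(X_i | X_1, ..., X_{i-1}) = H(X_1, ..., X_i) − H(X_1, ..., X_{i-1}); the three-variable toy p.m.f. is my own example, not from the slides:

    import math
    from collections import defaultdict

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    def entropy_via_chain_rule(p_joint, n):
        """Accumulate H(X_i | X_1, ..., X_{i-1}) as differences of prefix entropies."""
        total, h_prev = 0.0, 0.0
        for i in range(1, n + 1):
            prefix = defaultdict(float)
            for xs, p in p_joint.items():
                prefix[xs[:i]] += p
            h_prefix = entropy(prefix.values())
            total += h_prefix - h_prev      # this difference is H(X_i | X_1, ..., X_{i-1})
            h_prev = h_prefix
        return total

    p = {(0, 0, 0): 0.3, (0, 1, 1): 0.2, (1, 0, 1): 0.25, (1, 1, 0): 0.25}
    print(entropy(p.values()), entropy_via_chain_rule(p, 3))   # the two values coincide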

Upper Bound on Joint Entropy

Corollary 1 (Joint Entropy ≤ Sum of Marginal Entropies)

$$H(X_1, \ldots, X_n) \le \sum_{i=1}^{n} H(X_i).$$

The proof is left as an exercise (chain rule of entropy + conditioning reduces entropy).
Mutual Information


Conditioning Reduces Entropy Revisited


Entropy quantifies the amount of uncertainty of a r.v., say, X.
Conditional entropy quantifies the amount of uncertainty of a r.v. X given another r.v., say, Y.

(Figure: two bars comparing H(X) with the smaller H(X|Y) that remains after learning Y; the gap between them is labeled I(X;Y).)

Question: How much information does Y tell about X?

Ans: The amount of information about X that one obtains by learning Y is H(X) − H(X|Y).

Mutual Information

Definition 4 (Mutual Information)

For a pair of jointly distributed r.v.'s (X, Y), the mutual information between them is defined as
$$I(X; Y) \triangleq H(X) - H(X \mid Y).$$

(Figure: the same bar diagram as above, with the gap between H(X) and H(X|Y) labeled I(X;Y).)

Relate: what channel coding does is to infer some information about the channel input X from the channel output Y.

(Diagram: X → noisy channel $P_{Y|X}(y|x)$ → Y.)
An Identity about Mutual Information

Theorem 8

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y).$$
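A minimal sketch of Definition 4 together with the identity above (helper names mine; base-2 logs), computing I(X;Y) as H(X) + H(Y) − H(X,Y):

    import math
    from collections import defaultdict

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    def mutual_information(p_joint):
        """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint p.m.f. {(x, y): prob}."""
        p_x, p_y = defaultdict(float), defaultdict(float)
        for (x, y), p in p_joint.items():
            p_x[x] += p
            p_y[y] += p
        return entropy(p_x.values()) + entropy(p_y.values()) - entropy(p_joint.values())

    # For the joint p.m.f. of Examples 3-5: I(X1;X2) = 1 - (log2(3) - 2/3), about 0.082 bits.
    p = {(0, 0): 1/6, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/6}
    print(mutual_information(p))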

Mutual Information Measures the Level of Dependency

Theorem 9 (Extremal Values of Mutual Information)

1  $I(X; Y) \ge 0$, with equality iff X, Y are independent.

2  $I(X; Y) \le H(X)$, with equality iff X is a deterministic function of Y.

pf: The first follows from the fact that conditioning reduces entropy; the second follows from $H(X \mid Y) \ge 0$.

Interpretation: the mutual information I(X; Y) can also be viewed as a measure of the dependency between X and Y:

If X is determined by Y (highly dependent), I(X; Y) is maximized.
If X is independent of Y (no dependency), I(X; Y) = 0.

Conditional Mutual Information

Definition 5 (Conditional Mutual Information)

For a tuple of jointly distributed r.v.'s (X, Y, Z), the mutual information between X and Y given Z is
$$I(X; Y \mid Z) \triangleq H(X \mid Z) - H(X \mid Y, Z).$$

Similar to the previous identity (Theorem 8), we have
$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) = H(Y \mid Z) - H(Y \mid X, Z) = H(X \mid Z) + H(Y \mid Z) - H(X, Y \mid Z).$$

Similar to Theorem 9, we have

1  $I(X; Y \mid Z) \ge 0$, with equality iff X, Y are independent given Z (i.e., X − Z − Y forms a Markov chain).

2  $I(X; Y \mid Z) \le H(X \mid Z)$, with equality iff X is a deterministic function of Y and Z.
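A possible sketch of Definition 5 (names mine; base-2 logs), using the equivalent form I(X;Y|Z) = H(X,Z) + H(Y,Z) − H(X,Y,Z) − H(Z), which follows from the identities above together with the chain rule:

    import math
    from collections import defaultdict

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    def conditional_mutual_information(p_xyz):
        """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) for {(x, y, z): prob}."""
        p_xz, p_yz, p_z = defaultdict(float), defaultdict(float), defaultdict(float)
        for (x, y, z), p in p_xyz.items():
            p_xz[(x, z)] += p
            p_yz[(y, z)] += p
            p_z[z] += p
        return (entropy(p_xz.values()) + entropy(p_yz.values())
                - entropy(p_xyz.values()) - entropy(p_z.values()))

    # Three i.i.d. fair bits: X and Y are independent given Z, so I(X;Y|Z) should be 0.
    p = {(x, y, z): 1 / 8 for x in (0, 1) for y in (0, 1) for z in (0, 1)}
    print(conditional_mutual_information(p))   # 0.0 (up to floating-point error)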



Chain Rule for Mutual Information

Theorem 10 (Chain Rule for Mutual Information)

$$I(X; Y_1, \ldots, Y_n) = \sum_{i=1}^{n} I(X; Y_i \mid Y_1, \ldots, Y_{i-1}).$$

pf: Proved by definition and the chain rule for entropy.
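A quick numerical illustration of Theorem 10 for n = 2, expressing every term through joint entropies (helper names and the toy p.m.f. are my own; base-2 logs):

    import math
    from collections import defaultdict

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    def H(p_full, coords):
        """Entropy of the marginal on the selected coordinates of {(x, y1, y2): prob}."""
        marg = defaultdict(float)
        for xs, p in p_full.items():
            marg[tuple(xs[i] for i in coords)] += p
        return entropy(marg.values())

    p = {(0, 0, 0): 0.25, (0, 0, 1): 0.10, (0, 1, 1): 0.15,
         (1, 0, 1): 0.20, (1, 1, 0): 0.20, (1, 1, 1): 0.10}

    i_x_y1y2 = H(p, [0]) + H(p, [1, 2]) - H(p, [0, 1, 2])                    # I(X; Y1, Y2)
    i_x_y1 = H(p, [0]) + H(p, [1]) - H(p, [0, 1])                            # I(X; Y1)
    i_x_y2_g_y1 = H(p, [0, 1]) + H(p, [1, 2]) - H(p, [0, 1, 2]) - H(p, [1])  # I(X; Y2 | Y1)

    print(abs(i_x_y1y2 - (i_x_y1 + i_x_y2_g_y1)) < 1e-12)                    # True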

Data Processing Inequality: Applications


Markov chains are common in communication systems. For example, in channel coding (without feedback), the message W, the channel input $X^N \triangleq X[1:N]$, the channel output $Y^N \triangleq Y[1:N]$, and the decoded message $\hat{W}$ form a Markov chain $W - X^N - Y^N - \hat{W}$.

(Diagram: W → Encoder → X[1:N] → Noisy Channel $p_{Y|X}$ → Y[1:N] → Decoder → $\hat{W}$.)

The data processing inequality is crucial in obtaining impossibility results in information theory.
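For a Markov chain X − Y − Z, the data processing inequality gives I(X;Z) ≤ I(X;Y); below is a small numerical illustration with two cascaded binary symmetric channels (the crossover probabilities 0.1 and 0.2 and the helper names are my own choices, not from the slides):

    import math
    from collections import defaultdict

    def entropy(pmf):
        return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

    def mutual_information(p_joint):
        p_a, p_b = defaultdict(float), defaultdict(float)
        for (a, b), p in p_joint.items():
            p_a[a] += p
            p_b[b] += p
        return entropy(p_a.values()) + entropy(p_b.values()) - entropy(p_joint.values())

    def bsc(bit, eps):
        """Output distribution of a binary symmetric channel with crossover probability eps."""
        return {bit: 1 - eps, 1 - bit: eps}

    # Markov chain X - Y - Z: X ~ Ber(1/2), Y = BSC_0.1(X), Z = BSC_0.2(Y).
    p_xy, p_xz = defaultdict(float), defaultdict(float)
    for x in (0, 1):
        for y, pyx in bsc(x, 0.1).items():
            p_xy[(x, y)] += 0.5 * pyx
            for z, pzy in bsc(y, 0.2).items():
                p_xz[(x, z)] += 0.5 * pyx * pzy

    print(mutual_information(p_xy), mutual_information(p_xz))   # I(X;Y) >= I(X;Z)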
Exercise 6 (Functions of R.V.)

For $Z \triangleq g(Y)$ being a deterministic function of Y, show that $H(Y) \ge H(Z)$ and $I(X; Y) \ge I(X; Z)$.

Exercise 7

Show that $X_1 - X_2 - X_3 - X_4 \implies I(X_1; X_4) \le I(X_2; X_3)$.

Convexity and Concavity of Mutual Information


The convexity/concavity properties of mutual information turn out to be very useful in computing channel capacity and rate-distortion functions, as we will see in later lectures.

Theorem 12

Let $(X, Y) \sim P_{X,Y} = P_X P_{Y|X}$.

With $P_{Y|X}$ fixed, I(X; Y) is a concave function of $P_X$.
With $P_X$ fixed, I(X; Y) is a convex function of $P_{Y|X}$.
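A hedged spot-check of the first claim in Theorem 12 for a binary symmetric channel with fixed crossover probability (the specific numbers are arbitrary); this only verifies midpoint concavity at one pair of input distributions and is not a proof:

    import math

    def hb(p):
        """Binary entropy function, base 2."""
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def mi_bsc(p, eps):
        """I(X;Y) = H(Y) - H(Y|X) = hb(p(1-eps) + (1-p)eps) - hb(eps) for X ~ Ber(p) over BSC(eps)."""
        return hb(p * (1 - eps) + (1 - p) * eps) - hb(eps)

    eps, p1, p2 = 0.1, 0.2, 0.7
    lhs = mi_bsc((p1 + p2) / 2, eps)
    rhs = 0.5 * (mi_bsc(p1, eps) + mi_bsc(p2, eps))
    print(lhs >= rhs)   # True: consistent with concavity of I(X;Y) in P_X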
Information Divergence

Measuring the Distance between Probability Distributions

Information Divergence

Definition 6 (Information Divergence (Kullback-Leibler Divergence, Relative Entropy))

The information divergence (relative entropy) between two p.m.f.'s P and Q on the same alphabet $\mathcal{X}$ is defined as
$$D(P \,\|\, Q) \triangleq \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}.$$
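A minimal Python sketch of Definition 6 (the list-based p.m.f.'s and base-2 logs are my choices; it assumes Q(x) > 0 wherever P(x) > 0):

    import math

    def kl_divergence(p, q):
        """D(P||Q) = sum_x P(x) * log2( P(x) / Q(x) ), skipping terms with P(x) = 0."""
        return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

    # D(P||Q) is non-negative and, in general, not symmetric in P and Q.
    p = [1/6, 1/3, 1/3, 1/6]
    q = [1/4, 1/4, 1/4, 1/4]
    print(kl_divergence(p, q), kl_divergence(q, p))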
Example 9 (Binary divergence function)
Conditional Information Divergence
Log-Sum Inequality
