
Fundamental Concepts and

Limits in Information Theory


Dr. Shyam Lal
Assistant Professor
Department of E & C Engineering
National Institute of Technology
Karnataka, Surathkal

Fundamental Limits on Performance




Given an information source and a noisy channel, information theory provides:

1) A limit on the minimum number of bits per symbol required to fully represent the source.
2) A limit on the maximum rate at which reliable communication can take place over the channel.

Information Theory


Information Theory deals with:

1) The measure of source information
2) The information capacity of the channel
3) Coding

If the rate of information from a source does not exceed the capacity of the channel, then there exists a coding scheme such that the information can be transmitted over the communication channel with an arbitrarily small probability of error, despite the presence of noise.

Information Theory


Let the source alphabet be

S = {s0, s1, ..., s(K-1)}

with probabilities of occurrence

P(S = sk) = pk,   k = 0, 1, ..., K-1,   and   Σ_{k=0}^{K-1} pk = 1

Assume a discrete memoryless source (DMS).

What is the measure of information?

Uncertainty, Information, and


Entropy


Interrelation between information and uncertainty (surprise):

No surprise → no information.

Information ∝ 1 / Probability

If A is one surprise and B is another, statistically independent surprise, then the total information of A and B occurring simultaneously is

Info(A ∩ B) = Info(A) + Info(B)

The amount of information may therefore be related to the inverse of the probability of occurrence:

I(sk) = log( 1 / pk )

Properties of Information
1) I(sk) = 0 for pk = 1
2) I(sk) ≥ 0 for 0 ≤ pk ≤ 1
3) I(sk) > I(si) for pk < pi
4) I(sk si) = I(sk) + I(si), if sk and si are statistically independent
Case 1: Obviously, if we are absolutely certain of the outcome of
an event, even before it occurs, there is no information gained.
Case 2: That is to say, the occurrence of an event S = sk either
provides some or no information, but never brings about a loss of
information.
Case 3: That is, the less probable an event is, the more information
we gain when it occurs.
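As a quick check of these properties, the short Python sketch below (not part of the original slides) evaluates I(sk) = log2(1/pk) for a few illustrative probabilities.

```python
import math

def self_information(p: float) -> float:
    """Self-information I(s) = log2(1/p) in bits for an outcome of probability p."""
    return math.log2(1.0 / p)

# Property 1: a certain event carries no information.
print(self_information(1.0))    # 0.0
# Property 3: less probable events carry more information.
print(self_information(0.5))    # 1.0 bit
print(self_information(0.125))  # 3.0 bits
# Property 4: for statistically independent symbols, information adds.
print(self_information(0.5 * 0.125))                     # 4.0 bits
print(self_information(0.5) + self_information(0.125))   # 4.0 bits
```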

Entropy of DMS



Definition: Entropy is a measure of the average information content per source symbol. It is the mean value of I(sk) over the source alphabet S:

H(S) = E[I(sk)] = Σ_{k=0}^{K-1} pk I(sk) = Σ_{k=0}^{K-1} pk log2(1/pk)

Properties of H(S):

0 ≤ H(S) ≤ log2 K

where K is the radix (number of symbols) of the alphabet S of the source.

1) H(S) = 0 if pk = 1 for some k and all other pi's = 0.
   No uncertainty (lower bound on entropy).
2) H(S) = log2 K iff pk = 1/K for all k.
   Maximum uncertainty (upper bound on entropy).
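A minimal Python sketch of the entropy formula above; the example distributions are illustrative and assume K = 4 symbols.

```python
import math

def entropy(probs):
    """H(S) = sum_k pk * log2(1/pk), in bits; terms with pk = 0 contribute 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

K = 4
print(entropy([1.0, 0.0, 0.0, 0.0]))        # 0.0  -> no uncertainty (lower bound)
print(entropy([1.0 / K] * K))               # 2.0  = log2(K), maximum uncertainty
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75, somewhere in between
```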

More Intuition on Entropy




Assume a binary memoryless source, e.g., a flip of a coin. How much information do we receive when we are told that the outcome is heads?

- If it is a fair coin, i.e., P(heads) = P(tails) = 0.5, we say that the amount of information is 1 bit.
- If we already know that it will be (or was) heads, i.e., P(heads) = 1, the amount of information is zero!
- If the coin is not fair, e.g., P(heads) = 0.9, the amount of information is more than zero but less than one bit!
- Intuitively, the amount of information received is the same whether P(heads) = 0.9 or P(heads) = 0.1.

Example of Shannon's Entropy

Consider the following string consisting of symbols a and b:

abaabaababbbaabbabab ...

- On average, there are equal numbers of a and b.
- The string can be considered as the output of a source that emits symbol a or symbol b, each with probability 0.5.

We want to characterize the average information generated by the source.

Example of Shannon's Entropy

Example: Entropy of a Binary Memoryless Source

To illustrate the properties of H(S), we consider a binary source for which symbol 0 occurs with probability p0 and symbol 1 with probability p1 = 1 - p0.

- We assume that the source is memoryless, so that successive symbols emitted by the source are statistically independent.
- The entropy of such a source equals

  H(S) = -p0 log2 p0 - p1 log2 p1
       = -p0 log2 p0 - (1 - p0) log2 (1 - p0)  bits

from which we observe the following:

1. When p0 = 0, the entropy H(S) = 0; this follows from the fact that x log x → 0 as x → 0.
2. When p0 = 1, the entropy H(S) = 0.
3. The entropy H(S) attains its maximum value, Hmax = 1 bit, when p0 = p1 = 0.5, that is, when symbols 1 and 0 are equally probable.
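The sketch below evaluates the binary entropy expression above for a few values of p0, reproducing the three observations (0 bits at p0 = 0 or 1, a maximum of 1 bit at p0 = 0.5).

```python
import math

def binary_entropy(p0: float) -> float:
    """H(S) = -p0*log2(p0) - (1-p0)*log2(1-p0), with 0*log2(0) taken as 0."""
    h = 0.0
    for p in (p0, 1.0 - p0):
        if p > 0:
            h -= p * math.log2(p)
    return h

for p0 in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p0 = {p0:.1f}  ->  H(S) = {binary_entropy(p0):.4f} bits")
```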


Example of Shannon's Entropy

Extension of DMS (Entropy)









- Consider blocks of symbols rather than individual symbols.
- Coding efficiency can increase if higher-order extensions of the DMS are used.
- H(S^n) refers to an extended source having K^n distinct symbols, where K is the number of distinct symbols in the original alphabet.
- For a DMS, H(S^n) = n H(S).
- The second-order extension has entropy H(S^2).
- Consider a source alphabet S having 3 symbols, i.e., {s0, s1, s2}.
- Then S^2 has 9 symbols, i.e., {s0s0, s0s1, s0s2, s1s0, ..., s2s2}.

Extension of DMS
(Entropy)


Consider next the second-order extension of the source. With the source alphabet S consisting of three symbols, the alphabet S^2 of the extended source has nine symbols.

- The first row of Table 9.1 presents the nine symbols of S^2, denoted σ0, σ1, ..., σ8.
- The second row of the table presents the composition of these nine symbols in terms of the corresponding sequences of source symbols s0, s1, and s2, taken two at a time.
- The probabilities of the nine symbols of the extended source are presented in the last row of the table.

Extension of DMS
(Entropy)

p(s0) = 1/4,   p(s1) = 1/4,   p(s2) = 1/2

Extension of DMS
(Entropy)


Accordingly, using the entropy formula, the entropy of the original source is

H(S) = (1/4) log2 4 + (1/4) log2 4 + (1/2) log2 2 = 3/2 bits

and the entropy of the extended source is

H(S^2) = 3 bits = 2 H(S)

which confirms the extension property stated above.
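A quick numerical check of the extension property for the probabilities given above (1/4, 1/4, 1/2); the pair probabilities are formed by assuming successive symbols are independent, as for a DMS.

```python
import math
from itertools import product

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

p = {"s0": 0.25, "s1": 0.25, "s2": 0.5}

# Probabilities of the 9 symbols of the second-order extension S^2
# (ordered pairs of independent source symbols).
p2 = [p[a] * p[b] for a, b in product(p, repeat=2)]

print(entropy(p.values()))  # H(S)   = 1.5 bits
print(entropy(p2))          # H(S^2) = 3.0 bits = 2 * H(S)
```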

Source Coding Theorem








- Source encoding is the process in which the data generated by a discrete source is represented as a binary sequence.
- The device that performs this representation is called a source encoder.
- In particular, if some source symbols are known to be more probable than others, we may exploit this feature in the generation of a source code by assigning short codewords to frequent source symbols and long codewords to rare source symbols.
- We refer to such a source code as a variable-length code.
- The Morse code is an example of a variable-length code.
- In the Morse code, the letters of the alphabet and the numerals are encoded into streams of marks and spaces, denoted as dots "." and dashes "-", respectively.

Source Coding Theorem






An efficient source encoder satisfies two functional requirements:

1. The codewords produced by the encoder are in binary form.
2. The source code is uniquely decodable, so that the original source sequence can be reconstructed perfectly from the encoded binary sequence.

Source Coding Theorem

Shannon's coding theorem for noiseless channels:

It expresses the lower limit of the average codeword length of a source in terms of its entropy.

Statement:

- The theorem states that in any coding scheme, the average codeword length for a source of symbols must be equal to or greater than the source entropy.
- The theorem assumes the coding to be lossless and the channel to be noiseless.
- If L denotes the average codeword length of a coding scheme, then as per Shannon's theorem we can state that

  L ≥ H(S)

Source Coding Theorem




Average codeword length:

L = Σ_{k=0}^{K-1} pk lk

Coding efficiency:

η = H(S) / L

The average codeword length is bounded by:

H(S) ≤ L < H(S) + 1
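The sketch below computes the average codeword length, the coding efficiency, and checks the bound above for an illustrative source and code-length assignment (not taken from the slides).

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def average_length(probs, lengths):
    """L = sum_k pk * lk (bits per symbol)."""
    return sum(p * l for p, l in zip(probs, lengths))

# Illustrative source and codeword lengths.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]          # e.g. codewords 0, 10, 110, 111

H = entropy(probs)
L = average_length(probs, lengths)
print(H, L, H / L)              # 1.75 1.75 1.0 -> 100% efficiency
assert H <= L < H + 1           # source-coding bound
```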

Prefix Coding


Prefix Code: A prefix code is defined as a code in


which no code word is the prefix of any other code
word.

Prefix Coding


Decoding:

Uniquely Decodable Codes




A variable length code assigns a bit string (codeword)


of variable length to every message value

e.g. a = 1, b = 01, c = 101, d = 011




A uniquely decodable code is a variable-length code in which every bit string can be uniquely decomposed into its constituent codewords.

Uniquely Decodable Codes





Fixed-Length versus Variable-Length Codes:

Example: Suppose we want to store messages made up of the 4 characters a, b, c, d with frequencies 60, 5, 30, 5 (percent), respectively. What are the fixed-length codes and prefix-free codes that use the least space?
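One way to answer the question numerically is sketched below; the fixed-length code needs 2 bits per character, while the prefix-free code shown (a=0, c=10, b=110, d=111) is one possible assignment, not taken from the slides.

```python
# ceil(log2(4)) = 2 bits per character for the fixed-length code.
freq = {"a": 0.60, "b": 0.05, "c": 0.30, "d": 0.05}

fixed_bits = 2                                   # 4 symbols -> 2 bits each
prefix_len = {"a": 1, "b": 3, "c": 2, "d": 3}    # lengths of 0, 110, 10, 111

avg_fixed = sum(p * fixed_bits for p in freq.values())
avg_prefix = sum(freq[s] * prefix_len[s] for s in freq)

print(avg_fixed)    # 2.0 bits/character -> 200 bits per 100 characters
print(avg_prefix)   # 1.5 bits/character -> 150 bits per 100 characters
```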

Uniquely Decodable Codes

Kraft McMillan Inequality

Theorem (Kraft-McMillan): For any uniquely decodable code,

Σ_{k=0}^{K-1} 2^(-lk) ≤ 1

NOTE: The Kraft-McMillan inequality does not tell us whether the code is prefix-free or not.
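A one-line check of the inequality; the two length sets below are illustrative.

```python
def kraft_sum(lengths, radix=2):
    """Kraft-McMillan sum: sum_k radix**(-lk). Uniquely decodable => sum <= 1."""
    return sum(radix ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))   # 1.0    -> satisfies the inequality
print(kraft_sum([1, 2, 2, 3]))   # 1.125  -> no uniquely decodable code exists
```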

Kraft McMillan Inequality




Instantaneous Codes:

- A uniquely decodable code is said to be instantaneous if the end of any codeword is recognizable without inspecting succeeding code symbols. That is, there is no time lag in the process of decoding.
- Prefix property: A necessary and sufficient condition for a code to be instantaneous is that no complete codeword be a prefix of some other codeword.
- To understand the concept, consider the following codes:

Kraft McMillan Inequality




Example:

Kraft McMillan Inequality




Example:

A six-symbol source is encoded into the binary codes shown below. Which of these codes are instantaneous?

Kraft McMillan Inequality




Example:

Given S = {s1, s2, s3, s4, s5, s6, s7, s8, s9} and X = {0, 1}. Further, let l1 = l2 = 2 and l3 = l4 = l5 = l6 = l7 = l8 = l9 = k.

Then from the Kraft inequality we have

2·2^(-2) + 7·2^(-k) ≤ 1,   i.e.,   7·2^(-k) ≤ 1/2,   which requires k ≥ 4.
Kraft McMillan Inequality




- Clearly, if k < 4 it is not possible to construct an instantaneous binary code with these lengths.
- If k ≥ 4, the Kraft inequality tells us that an instantaneous code does exist, but it does not tell us how to construct such a code.
- The codes for the symbols when k = 4 are shown below:

Shannon Fano Coding Technique

Algorithm:

Step 1: Arrange all messages in descending order of probability.

Step 2: Divide the sequence into two groups in such a way that the sums of the probabilities in the two groups are as nearly equal as possible.

Step 3: Assign 0 to the upper group and 1 to the lower group.

Step 4: Repeat Steps 2 and 3 within each of the resulting groups, and so on. A short implementation sketch of this procedure is given below.
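A minimal sketch of this splitting procedure, assuming ties in Step 2 are broken by whichever split makes the two group probabilities closest; applied to the message set of the example that follows, it reproduces the codes in the table.

```python
def shannon_fano(symbols):
    """symbols: list of (name, probability). Returns {name: codeword}."""
    codes = {}

    def split(group, prefix):
        if len(group) == 1:
            codes[group[0][0]] = prefix or "0"
            return
        total = sum(p for _, p in group)
        best_i, best_diff, running = 1, float("inf"), 0.0
        for i in range(1, len(group)):          # find the most balanced split
            running += group[i - 1][1]
            diff = abs(2 * running - total)     # |upper sum - lower sum|
            if diff < best_diff:
                best_i, best_diff = i, diff
        split(group[:best_i], prefix + "0")     # upper group gets 0
        split(group[best_i:], prefix + "1")     # lower group gets 1

    split(sorted(symbols, key=lambda s: -s[1]), "")
    return codes

msgs = [("m1", 1/2), ("m2", 1/8), ("m3", 1/8), ("m4", 1/16),
        ("m5", 1/16), ("m6", 1/16), ("m7", 1/32), ("m8", 1/32)]
print(shannon_fano(msgs))
# {'m1': '0', 'm2': '100', 'm3': '101', 'm4': '1100', 'm5': '1101',
#  'm6': '1110', 'm7': '11110', 'm8': '11111'}
```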

Example

Messages mk |  pk  | No. of Bits (lk) | Code
    m1      | 1/2  |        1         | 0
    m2      | 1/8  |        3         | 100
    m3      | 1/8  |        3         | 101
    m4      | 1/16 |        4         | 1100
    m5      | 1/16 |        4         | 1101
    m6      | 1/16 |        4         | 1110
    m7      | 1/32 |        5         | 11110
    m8      | 1/32 |        5         | 11111

Shannon Fano Coding Technique

Source entropy:
H(S) = -(1/2) log2(1/2) - 2·(1/8) log2(1/8) - 3·(1/16) log2(1/16) - 2·(1/32) log2(1/32)
     = (1/2)·1 + 2·(1/8)·3 + 3·(1/16)·4 + 2·(1/32)·5
     = 0.5 + 0.75 + 0.75 + 0.3125 = 2.3125 bits

Average codeword length:
L = (1/2)·1 + 2·(1/8)·3 + 3·(1/16)·4 + 2·(1/32)·5
  = 0.5 + 0.75 + 0.75 + 0.3125 = 2.3125 bits

Coding efficiency: η = H(S)/L = 100%




Basic principles of Huffman Coding




- Invented by Huffman as a class assignment in 1950.
- Huffman coding is a popular lossless Variable Length Coding (VLC) scheme, based on the following principles:
  - Shorter codewords are assigned to more probable symbols and longer codewords are assigned to less probable symbols.
  - No codeword of a symbol is a prefix of another codeword. This makes Huffman coding uniquely decodable.
  - Every source symbol must have a unique codeword assigned to it.

Basic principles of Huffman Coding




The Huffman encoding algorithm proceeds as follows:

1. The source symbols are listed in order of decreasing probability. The two source symbols of lowest probability are assigned a 0 and a 1.

2. These two source symbols are regarded as being combined into a new source symbol with probability equal to the sum of the two original probabilities. (The list of source symbols, and therefore the source statistics, is thereby reduced in size by one.) The probability of the new symbol is placed in the list in accordance with its value.

3. The procedure is repeated until we are left with a final list of source statistics (symbols) of only two, for which a 0 and a 1 are assigned.

A minimal sketch of this merging procedure is given below.
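A minimal sketch of the merging procedure using a min-heap; the probabilities in the demonstration (0.4, 0.2, 0.2, 0.1, 0.1) are those implied by the Figure 4.1 example discussed below, and the tie-breaking counter is only there to keep the heap ordering well defined.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """probs: {symbol: probability}. Returns {symbol: codeword}."""
    tie = count()
    # Each heap entry: (probability, tiebreak, {symbol: partial codeword})
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # lowest probability    -> bit '0'
        p1, _, c1 = heapq.heappop(heap)   # second lowest         -> bit '1'
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

probs = {"s0": 0.4, "s1": 0.2, "s2": 0.2, "s3": 0.1, "s4": 0.1}
codes = huffman_code(probs)
avg = sum(probs[s] * len(w) for s, w in codes.items())
print(codes, avg)   # average length 2.2 bits/symbol for these probabilities
```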

Basic principles of Huffman Coding




Input information rate of the source encoder:

Ri = Rs · H(S)   (bits per second)

where Rs is the signalling rate and H(S) is the source entropy.

Output information rate of the source encoder:

Ro = Rs · L   (bits per second)

where Rs is the signalling rate and L is the average codeword length.

Redundancy = average codeword length - source entropy = L - H(S)

Basic principles of Huffman Coding




The variance of the codeword lengths produced by the source encoder is:

σ² = Σ_{k=0}^{K-1} pk (lk - L)²

A smaller value of the variance is preferred because it requires less memory space.

Basic principles of Huffman Coding




EXAMPLE: Huffman Tree

The five symbols of the alphabet of a discrete memoryless


source and their probabilities are shown in the two leftmost
columns of Figure 4.1(a). Following through the Huffman
algorithm, we reach the end of the computation in four
steps, resulting in the Huffman tree shown in Figure 4.1(a).
The code words of the Huffman code for the source are
tabulated in Figure 4.1(b).

Method 1: Placing the probability of the new symbol as high as possible.

Basic principles of Huffman Coding

Figure 4.1(a): Huffman tree for the five-symbol source (Method 1).

Basic principles of Huffman Coding

Figure 4.1(b): Codewords of the Huffman code.

Memory storage requirement for Huffman code 1:
In order to store 100 characters in a computer, the number of bits required = (40·2) + (20·2) + (20·2) + (10·3) + (10·3) = 220 bits.

Basic principles of Huffman Coding

The variance of the codeword lengths of Huffman code 1:

σ² = Σ_{k=0}^{K-1} pk (lk - L)² = 0.16

Basic principles of Huffman Coding


Method 2: Placing the probability of the new symbol as low as possible.

Memory storage requirement for Huffman code 2:
In order to store 100 characters in a PC, the number of bits required = (40·1) + (20·2) + (20·3) + (10·4) + (10·4) = 220 bits.

Basic principles of Huffman Coding


The average codeword length is still 2.2 bits/symbol, but the variances are different! The variance of the codeword lengths of Huffman code 2:

σ² = Σ_{k=0}^{K-1} pk (lk - L)² = 1.36
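The two variance values quoted above can be reproduced directly from the codeword lengths implied by the two storage calculations (Method 1: lengths 2, 2, 2, 3, 3; Method 2: lengths 1, 2, 3, 4, 4):

```python
probs = [0.4, 0.2, 0.2, 0.1, 0.1]
lengths_method1 = [2, 2, 2, 3, 3]   # new symbol placed as high as possible
lengths_method2 = [1, 2, 3, 4, 4]   # new symbol placed as low as possible

def stats(probs, lengths):
    avg = sum(p * l for p, l in zip(probs, lengths))
    var = sum(p * (l - avg) ** 2 for p, l in zip(probs, lengths))
    return avg, var

print(stats(probs, lengths_method1))   # (2.2, 0.16)
print(stats(probs, lengths_method2))   # (2.2, 1.36)
```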

Basic principles of Huffman Coding


- Let us assume that the signalling rate Rs = 10,000 symbols per second.
- Output information rate of the source encoder, i.e., the required transmission capacity: 2.2 × 10,000 = 22,000 bits per second.
- This means that the transmission channel expects to receive 22,000 bits per second from the source encoder.
- Since we are using variable-length coding, the bit generation rate will not be constant; it will vary around 22,000 bits per second.
- So the output of such a source coder is usually fed into a buffer, whose purpose is to smooth out the variation in the bit generation rate.
- However, the buffer has to be of finite size. Therefore, the greater the variance of the codeword lengths, the more difficult the buffer design problem becomes.

Basic principles of Huffman Coding


If the encoder simply writes the compressed data on a file in
the computer, the variance of the code makes no di
erence.
A small-variance Hu
man code is preferable only in cases
where the encoder transmits the compressed data, as it is
being generated, over a transmission network.
In such a case, a code with large variance causes the encoder
to generate bits at a rate that varies all the time. Since the bits
have to be transmitted at a constant rate, the encoder has to
use a bu
er. Bits of the compressed data are entered into the
bu
er as they are being generated and are moved out of it at a
constant rate, to be transmitted.
It is easy to see intuitively that a Hu
man code with zero
variance will enter bits into the bu
er at a constant rate, so only
a short bu
er size will be needed.
The larger the code variance, the more variable is the rate at
which bits enter into the bu
er, requiring the encoder to use a
larger bu
er size.
46

Encoding a string of symbols using


Huffman codes
- After obtaining the Huffman codes for each symbol, it is easy to construct the encoded bit stream for a string of symbols.

Example:
- Suppose we have to encode the string of symbols s3 s2 s4 s3 s0 s1 s2. We start from the left, taking one symbol at a time.
- The code corresponding to the first symbol s3 is 010; the second symbol s2 has code 11; the third symbol s4 has code 011; the fourth symbol s3 again has code 010; the fifth symbol s0 has code 00; and the sixth symbol s1 has code 10.
- Proceeding in this way, we obtain the encoded bit stream using Huffman encoding as: 01011011010001011
- In this example, 17 bits were used to encode the string of 7 symbols. A straight binary encoding of 7 symbols chosen from an alphabet of 5 symbols would have required 21 bits (3 bits/symbol), so this encoding scheme demonstrates substantial compression.
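A sketch of this encoding step, using the codewords quoted in the example (s0 = 00, s1 = 10, s2 = 11, s3 = 010, s4 = 011):

```python
codes = {"s0": "00", "s1": "10", "s2": "11", "s3": "010", "s4": "011"}

message = ["s3", "s2", "s4", "s3", "s0", "s1", "s2"]
bitstream = "".join(codes[s] for s in message)

print(bitstream)         # 01011011010001011
print(len(bitstream))    # 17 bits, versus 21 bits for a 3-bit fixed-length code
```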


Decoding a Huffman coded bit stream


Since no codeword is a prefix of another codeword, Huffman
codes are uniquely decodable.
The decoding process is straightforward and can be
summarized below:
Step-1:Examine the leftmost bit in the bit stream. If this
corresponds to the codeword of an elementary symbol, add
that symbol to the list of decoded symbols, remove the
examined bit from the bit stream and go back to step-1 until all
the bits in the bit stream are considered. Else, follow step-2.
Step-2:Append the next bit from the left to the already
examined bit(s) and examine if the group of bits correspond to
the codeword of an elementary symbol. If yes, add that
symbol to the list of decoded symbols, remove the examined
bits from the bit stream and go back to step-1 until all the bits
in the bit stream are considered. Else, repeat step-2 by
appending more bits.

Decoding a Huffman coded bit stream


Consider the encoded bit stream of the previous example: suppose we receive the bit stream 01011011010001011. Following the steps described above, we first decode s3 (010), then s2 (11), followed by s4 (011), s3 (010), s0 (00), s1 (10), and s2 (11). This is exactly what we had encoded.
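A sketch of the prefix-decoding procedure of the previous slide, applied to the same bit stream and codewords:

```python
codes = {"s0": "00", "s1": "10", "s2": "11", "s3": "010", "s4": "011"}
inverse = {w: s for s, w in codes.items()}

def decode(bitstream):
    symbols, buffer = [], ""
    for bit in bitstream:
        buffer += bit                 # step 2: append the next bit
        if buffer in inverse:         # step 1: a complete codeword is found
            symbols.append(inverse[buffer])
            buffer = ""
    return symbols

print(decode("01011011010001011"))
# ['s3', 's2', 's4', 's3', 's0', 's1', 's2']
```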


Basic principles of Huffman Coding


Properties:
Generates optimal prefix codes
Cheap to generate codes
Cheap to encode and decode
Average code word length =Source entropy if probabilities
are powers of 2.

Applications:
Used in many, if not most compression algorithms such as
gzip, bzip, jpeg (as option), fax compression,

51

Basic principles of Huffman Coding


Advantages of Huffman algorithm:
The Huffman algorithm generates an optimal prefix

code.
Cheap to generate codes.
Cheap to encode and decode.

Disadvantages of Huffman algorithm:


If the ensemble changes
 the frequencies and
probabilities change  the optimal coding changes
e.g. in text compression symbol frequencies vary with
context.
Re-computing the Huffman code by running through
the entire file in advance?
52

Conditional probabilities
Suppose we have a single event A with possible
outcomes {ai}.
Everything we know is specified by the
probabilities for the possible outcomes: P(ai).
For the coin toss the possible outcomes are
heads and tails:
P(heads) = 1/2 & P(tails) = 1/2.

More generally:

0 ≤ P(ai) ≤ 1   and   Σ_i P(ai) = 1

Two events:
Add a second event B with outcomes {bj} and probabilities
P(bj).
Complete description provided by the joint probabilities:
P(ai,bj)
If A and B are independent and uncorrelated then

P(ai,bj) = P(ai) P(bj)


Single-event probabilities and joint probabilities are related by:

P(ai) = Σ_j P(ai, bj)        P(bj) = Σ_i P(ai, bj)

Two events:
What does learning the value of A tell us about the
probabilities for the value of B?
If we learn that A = a0, then the quantities of interest are the conditional probabilities P(bj | a0), i.e., the probabilities of the bj given that A = a0.

This conditional probability is proportional to the joint probability:

P(bj | a0) ∝ P(a0, bj)

Finding the constant of proportionality leads to Bayes' rule:

P(ai, bj) = P(bj | ai) P(ai)   or   P(ai, bj) = P(ai | bj) P(bj)
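A small numerical illustration of these relations; the 2x2 joint distribution below is made up for the example.

```python
P = {("a0", "b0"): 0.30, ("a0", "b1"): 0.20,
     ("a1", "b0"): 0.10, ("a1", "b1"): 0.40}

# Marginals from the joint distribution.
P_a = {a: sum(p for (ai, _), p in P.items() if ai == a) for a in ("a0", "a1")}
P_b = {b: sum(p for (_, bj), p in P.items() if bj == b) for b in ("b0", "b1")}

# Conditional probability via Bayes' rule: P(b | a0) = P(a0, b) / P(a0)
P_b_given_a0 = {b: P[("a0", b)] / P_a["a0"] for b in ("b0", "b1")}

print(P_a)            # {'a0': 0.5, 'a1': 0.5}
print(P_b)            # {'b0': 0.4, 'b1': 0.6}
print(P_b_given_a0)   # {'b0': 0.6, 'b1': 0.4}  -- sums to 1
```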

Entropy for two events

H(A, B) = - Σ_{ij} P(ai, bj) log P(ai, bj)

H(A) = - Σ_{ij} P(ai, bj) log P(ai)

H(B) = - Σ_{ij} P(ai, bj) log P(bj)

H(A) + H(B) - H(A, B) = Σ_{ij} P(ai, bj) log [ P(ai, bj) / (P(ai) P(bj)) ] ≥ 0
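Evaluating these quantities for the same illustrative joint distribution shows that the difference H(A) + H(B) - H(A, B) is indeed nonnegative.

```python
import math

P = {("a0", "b0"): 0.30, ("a0", "b1"): 0.20,
     ("a1", "b0"): 0.10, ("a1", "b1"): 0.40}

def H(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

P_a = [0.5, 0.5]                    # marginals of A (from the block above)
P_b = [0.4, 0.6]                    # marginals of B

H_AB = H(P.values())
print(H(P_a) + H(P_b) - H_AB)       # about 0.12 bits, i.e. >= 0
```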

Entropy of Random Variable


We now extend the notions related to events to random variables.

Let X (respectively Y) be a discrete random variable taking on values in {x0, x1, ..., x(J-1)} (respectively {y0, y1, ..., y(K-1)}), with

p(xj, yk) = P(X = xj, Y = yk)

As the entropy depends only on the probability distribution, it is natural to define the joint entropy of the pair, H(X, Y), as

H(X, Y) = - Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 p(xj, yk)

Entropy of Random Variable


Conditional entropy:
Let X and Y be two random variables. Then expanding H(X, Y) gives the chain rule

H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)

where the conditional entropy is

H(X | Y) = - Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 p(xj | yk)

Discrete Memoryless Channel


A discrete memoryless channel is a statistical model with
an input X and an output Y that is a noisy version of X;
both X and Y are random variables.
 Every unit of time, the channel accepts an input symbol
X selected from an alphabet X and, in response, it emits an
output symbol Y from an alphabet Y.
The channel is said to be "discrete" when both of the
alphabets X and Y have finite sizes.
- It is said to be "memoryless" when the current output symbol depends only on the current input symbol and not on any of the previous ones.

Discrete Memoryless Channel


The diagram of a discrete memoryless channel shows the input alphabet X = {x0, x1, ..., x(J-1)}, the output alphabet Y = {y0, y1, ..., y(K-1)}, and the transition probabilities p(yk | xj) connecting them.

Discrete Memoryless Channel


A set of transition probabilities p(yk | xj) specifies the channel.

- The input alphabet X and the output alphabet Y need not have the same size.
- For example, in channel coding, the size K of the output alphabet Y may be larger than the size J of the input alphabet X; thus K ≥ J.
- On the other hand, we may have a situation in which the channel emits the same symbol when either one of two input symbols is sent, in which case K ≤ J.

Discrete Memoryless Channel


- The transition probability p(yk | xj) is the conditional probability that the channel output Y = yk, given that the channel input X = xj.
- There is a possibility of errors arising in the process of information transmission over a DMC.
- When k = j, the transition probability p(yk | xj) represents a conditional probability of correct reception; otherwise it represents a conditional probability of error.

Channel matrix, or transition matrix (J by K):

P = [ p(yk | xj) ],   j = 0, 1, ..., J-1,   k = 0, 1, ..., K-1

Discrete Memoryless Channel


- Note that each row of the channel matrix P corresponds to a fixed channel input, whereas each column of the matrix corresponds to a fixed channel output.
- Note also that a fundamental property of the channel matrix P, as defined here, is that the sum of the elements along any row of the matrix is always equal to one; that is,

  Σ_{k=0}^{K-1} p(yk | xj) = 1   for all j

- Suppose now that the inputs to a discrete memoryless channel are selected according to the probability distribution {p(xj), j = 0, 1, ..., J-1}.

Discrete Memoryless Channel


- In other words, the event that the channel input X = xj occurs with probability

  p(xj) = P(X = xj)   for j = 0, 1, ..., J-1

  where X denotes the random variable of the channel input.
- We specify a second random variable Y denoting the channel output.
- The joint probability distribution of the random variables X and Y is then given by

  p(xj, yk) = P(X = xj, Y = yk) = P(Y = yk | X = xj) P(X = xj) = p(yk | xj) p(xj)

Discrete Memoryless Channel


The marginal probability distribution of the output random variable Y is obtained by averaging out the dependence of p(xj, yk) on xj:

p(yk) = P(Y = yk) = Σ_{j=0}^{J-1} P(Y = yk | X = xj) P(X = xj)
      = Σ_{j=0}^{J-1} p(yk | xj) p(xj)       for k = 0, 1, ..., K-1

The probabilities p(xj), j = 0, 1, ..., J-1, are known as the a priori probabilities of the various input symbols.

Discrete Memoryless Channel


- The above equation states that if we are given the input a priori probabilities p(xj) and the channel matrix [i.e., the matrix of transition probabilities p(yk | xj)], then we may calculate the probabilities of the various output symbols p(yk).
- For J = K, the average probability of symbol error, Pe, is defined as the probability that the output random variable yk is different from the input random variable xj, averaged over all k ≠ j:

  Pe = Σ_{j=0}^{J-1} Σ_{k=0, k≠j}^{K-1} p(yk | xj) p(xj)

- The difference (1 - Pe) gives the average probability of correct reception.
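The sketch below works through these formulas for a binary symmetric channel, an illustrative DMC with J = K = 2 that is not taken from the slides.

```python
eps = 0.1                              # crossover probability
# Channel (transition) matrix: rows = inputs xj, columns = outputs yk.
P_y_given_x = [[1 - eps, eps],
               [eps, 1 - eps]]
p_x = [0.6, 0.4]                       # a priori input probabilities

# Each row of the channel matrix sums to one.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P_y_given_x)

# Output distribution: p(yk) = sum_j p(yk | xj) p(xj)
p_y = [sum(P_y_given_x[j][k] * p_x[j] for j in range(2)) for k in range(2)]

# Average probability of symbol error: sum over k != j of p(yk | xj) p(xj)
Pe = sum(P_y_given_x[j][k] * p_x[j]
         for j in range(2) for k in range(2) if k != j)

print(p_y)   # [0.58, 0.42]
print(Pe)    # 0.1, i.e. eps for a binary symmetric channel
```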

Mutual Information

- Consider the channel input X (selected from the input alphabet X) and the channel output Y (selected from the output alphabet Y).
- How can we measure the uncertainty about X after observing Y?
- Define the conditional entropy of X, given that Y = yk, as

  H(X | Y = yk) = Σ_{j=0}^{J-1} p(xj | yk) log2 [ 1 / p(xj | yk) ]

- This quantity is itself a random variable that takes on the values H(X | Y = y0), ..., H(X | Y = y(K-1)) with probabilities p(y0), ..., p(y(K-1)), respectively.

Mutual Information

The mean of the entropy H(X | Y = yk) over the output alphabet Y is therefore

H(X | Y) = Σ_{k=0}^{K-1} p(yk) H(X | Y = yk)
         = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ 1 / p(xj | yk) ]

The conditional entropy H(X | Y) represents the amount of uncertainty remaining about the channel input after the channel output has been observed.

Mutual Information

The difference H(X) - H(X | Y) is called the mutual information, which measures the uncertainty about the channel input that is resolved by observing the channel output:

I(X; Y) = H(X) - H(X | Y)
        = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ p(xj, yk) / (p(xj) p(yk)) ]
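Continuing the binary-symmetric-channel sketch, the mutual information can be evaluated directly from the double sum above.

```python
import math

eps, p_x = 0.1, [0.6, 0.4]
P_y_given_x = [[1 - eps, eps], [eps, 1 - eps]]

# Joint distribution p(xj, yk) = p(xj) p(yk | xj) and output marginals p(yk).
p_xy = [[p_x[j] * P_y_given_x[j][k] for k in range(2)] for j in range(2)]
p_y = [sum(p_xy[j][k] for j in range(2)) for k in range(2)]

I = sum(p_xy[j][k] * math.log2(p_xy[j][k] / (p_x[j] * p_y[k]))
        for j in range(2) for k in range(2))
print(I)   # about 0.51 bits per channel use
```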

Properties of Mutual Information


Property 1: The mutual information of a channel is symmetric; that is,

I(X; Y) = I(Y; X)

where the mutual information I(X; Y) is a measure of the uncertainty about the channel input that is resolved by observing the channel output, and the mutual information I(Y; X) is a measure of the uncertainty about the channel output that is resolved by sending the channel input.

Properties of Mutual Information


Proof: We want to show that I(X; Y) = I(Y; X). Start from

I(X; Y) = H(X) - H(X | Y)

Properties of Mutual Information


Proof (continued):

I(X; Y) = H(X) - H(X | Y)
        = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ 1 / p(xj) ]
          - Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ 1 / p(xj | yk) ]

Properties of Mutual Information


Combining the two sums term by term:

I(X; Y) = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) [ log2 (1 / p(xj)) + log2 p(xj | yk) ]

        = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ p(xj | yk) / p(xj) ]

Properties of Mutual Information


From Bayes' rule for conditional probabilities we have p(xj | yk) / p(xj) = p(yk | xj) / p(yk), and therefore

I(X; Y) = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ p(yk | xj) / p(yk) ] = I(Y; X)

Properties of Mutual Information


Property 2: The mutual information is always nonnegative; that is,

I(X; Y) ≥ 0

with equality if and only if p(xj, yk) = p(xj) p(yk) for all j and k, i.e., if and only if the channel input and channel output are statistically independent.

Properties of Mutual Information


Property 3: The mutual information of a channel is related to the joint entropy of the channel input and channel output by

I(X; Y) = H(X) + H(Y) - H(X, Y)

where the joint entropy is

H(X, Y) = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ 1 / p(xj, yk) ]

Proof: To prove this property, we rewrite the joint entropy as

H(X, Y) = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ p(xj) p(yk) / p(xj, yk) ]
        + Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ 1 / (p(xj) p(yk)) ]        (1)

Properties of Mutual Information


The second term of equation (1) can be written as

Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ 1 / (p(xj) p(yk)) ]
  = Σ_{j=0}^{J-1} log2 [ 1 / p(xj) ] Σ_{k=0}^{K-1} p(xj, yk)
  + Σ_{k=0}^{K-1} log2 [ 1 / p(yk) ] Σ_{j=0}^{J-1} p(xj, yk)

Expanding the inner sums using the marginal distributions:

Σ_{k=0}^{K-1} p(xj, yk) = Σ_{k=0}^{K-1} p(xj) p(yk | xj) = p(xj) Σ_{k=0}^{K-1} p(yk | xj) = p(xj)

Σ_{j=0}^{J-1} p(xj, yk) = Σ_{j=0}^{J-1} p(yk) p(xj | yk) = p(yk) Σ_{j=0}^{J-1} p(xj | yk) = p(yk)

Properties of Mutual Information


The simplified second term is therefore

Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ 1 / (p(xj) p(yk)) ]
  = Σ_{j=0}^{J-1} p(xj) log2 [ 1 / p(xj) ] + Σ_{k=0}^{K-1} p(yk) log2 [ 1 / p(yk) ]
  = H(X) + H(Y)

For the first term of equation (1), recall that

I(X; Y) = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ p(xj | yk) / p(xj) ]
        = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ p(xj, yk) / (p(xj) p(yk)) ]

so the first term equals -I(X; Y).

Properties of Mutual Information


So equation (1) can be written as

H(X, Y) = -I(X; Y) + H(X) + H(Y)

that is,

I(X; Y) = H(X) + H(Y) - H(X, Y)

Channel Coding Theorem


- For many real-time applications, a low level of reliability (e.g., a probability of error as high as Pe = 10^-1) is unacceptable.
- Many applications require an average probability of error equal to 10^-6 or even lower.
- To achieve such a high level of performance, we have to use channel coding.

Fig. 1.1. Block diagram of a digital communication system.

Channel Coding Theorem


- In block codes, the message sequence is subdivided into sequential blocks, each k bits long, and each k-bit block is mapped into an n-bit block, where n > k.
- The number of redundant bits added by the encoder to each transmitted block is n - k bits.
- The ratio k/n is called the code rate. Denoting the code rate by r, we may write

  r = k / n

- Accurate reconstruction of the original source sequence at the destination requires that the average probability of symbol error be low.

Channel Coding Theorem


- Suppose that the discrete memoryless source in Figure 1.1 has the source alphabet S and entropy H(S) bits per source symbol.
- We assume that the source emits symbols once every Ts seconds. Hence, the average information rate of the source is H(S)/Ts bits per second.
- The decoder delivers decoded symbols to the destination from the source alphabet S at the same source rate of one symbol every Ts seconds.
- The discrete memoryless channel has a channel capacity equal to C bits per use of the channel.
- We assume that the channel is capable of being used once every Tc seconds. Hence, the channel capacity per unit time is C/Tc bits per second, which represents the maximum rate of information transfer over the channel.

Channel Coding Theorem


Shannon's second theorem:
The channel coding theorem for a discrete memoryless channel is stated in two parts as follows:

(a) Let a discrete memoryless source with an alphabet S have entropy H(S) and produce symbols once every Ts seconds. Let a discrete memoryless channel have capacity C and be used once every Tc seconds. Then, if

H(S) / Ts ≤ C / Tc

there exists a coding scheme for which the source output can be transmitted over the channel and be reconstructed with an arbitrarily small average probability of error.

Channel Coding Theorem


(b) Conversely, if

H(S) / Ts > C / Tc

it is not possible to transmit information over the channel and reconstruct it with an arbitrarily small probability of error.

This theorem specifies the channel capacity C as a fundamental limit on the rate at which the transmission of reliable, error-free messages can take place over a discrete memoryless channel.

Channel Coding Theorem


Channel Capacity:
The mutual information can be expressed as

I(X; Y) = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ p(xj, yk) / (p(xj) p(yk)) ]

- The input probability distribution p(xj) is independent of the channel.
- So we can maximize the mutual information I(X; Y) of the channel with respect to p(xj).

Channel Coding Theorem


Therefore, the channel capacity of a discrete memoryless channel is defined as the maximum mutual information I(X; Y) in any single use of the channel (i.e., signalling interval), where the maximization is over all possible input probability distributions {p(xj)} on the channel input alphabet X.

So the channel capacity C is defined as

C = max_{p(xj)} I(X; Y)

The channel capacity C is measured in bits per channel use.

The channel coding theorem is also known as the noisy coding theorem.
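For a binary symmetric channel (again an illustrative example, not from the slides), the maximization over input distributions can be carried out numerically and agrees with the closed form C = 1 - H(eps).

```python
import math

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(eps, steps=10001):
    """Capacity of a binary symmetric channel obtained by maximizing
    I(X;Y) = H(Y) - H(Y|X) over the input distribution p = P(X = 0)."""
    best = 0.0
    for i in range(steps):
        p = i / (steps - 1)
        q = p * (1 - eps) + (1 - p) * eps             # P(Y = 0)
        I = binary_entropy(q) - binary_entropy(eps)   # H(Y) - H(Y|X)
        best = max(best, I)
    return best

print(bsc_capacity(0.1))            # about 0.531 bits/use, achieved at p = 0.5
print(1 - binary_entropy(0.1))      # closed form C = 1 - H(eps) agrees
```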

Differential Entropy for a Continuous Random Variable

Let X be a continuous random variable with PDF fX(x).

By analogy with the entropy of a discrete random variable, we can introduce the differential entropy of the continuous random variable X:

h(X) = ∫ fX(x) log2 [ 1 / fX(x) ] dx

Justification of differential entropy:
- Basically, the continuous random variable X is the limiting form of a discrete random variable that assumes the value xk = k Δx, where k = 0, ±1, ±2, ..., and Δx approaches zero.
- By definition, the continuous random variable X assumes a value in the interval [xk, xk + Δx] with probability fX(xk) Δx.

Differential Entropy for a Continuous Random Variable

Hence, letting Δx approach zero, the ordinary entropy of the continuous random variable X may be written in the limiting form:

H(X) = lim_{Δx→0} Σ_{k=-∞}^{∞} fX(xk) Δx log2 [ 1 / (fX(xk) Δx) ]

     = lim_{Δx→0} Σ_{k=-∞}^{∞} fX(xk) Δx [ log2 (1 / fX(xk)) + log2 (1 / Δx) ]

     = lim_{Δx→0} [ Σ_{k=-∞}^{∞} fX(xk) log2 (1 / fX(xk)) Δx - log2(Δx) Σ_{k=-∞}^{∞} fX(xk) Δx ]

Differential Entropy for a Continuous Random Variable

H(X) = ∫ fX(x) log2 [ 1 / fX(x) ] dx - lim_{Δx→0} log2(Δx) ∫ fX(x) dx

Using the property of the PDF

∫ fX(x) dx = 1

the ordinary entropy of the continuous random variable X becomes

H(X) = ∫ fX(x) log2 [ 1 / fX(x) ] dx - lim_{Δx→0} log2(Δx)

H(X) = h(X) - lim_{Δx→0} log2(Δx)

Differential Entropy for a Continuous Random Variable

- In the limit as Δx approaches zero, -log2(Δx) approaches infinity.
- This means that the entropy of a continuous random variable is infinitely large.
- This is intuitively reasonable, because a continuous random variable may assume a value anywhere in the interval (-∞, ∞), and the uncertainty associated with the variable is on the order of infinity.
- We avoid the problem associated with the term log2(Δx) by adopting h(X) as the differential entropy, with the term -log2(Δx) serving as a reference.

Differential Entropy for a Continuous Random Variable

- Moreover, since the information transmitted over a channel is actually the difference between two entropy terms that have a common reference, the information will be the same as the difference between the corresponding differential entropy terms.
- When we have a continuous random vector X consisting of n random variables X1, X2, ..., Xn, we define the differential entropy of X as the n-fold integral

  h(X) = ∫ fX(x) log2 [ 1 / fX(x) ] dx

Differential Entropy for a Continuous Random Variable

Differential entropy of a uniformly distributed random variable:

fX(x) = 1 / (b - a)   for x ∈ (a, b),   and 0 otherwise

h(X) = ∫_a^b [ 1 / (b - a) ] log2 (b - a) dx = log2 (b - a)

Note that h(X) < 0 if (b - a) < 1.

Differential Entropy for a Continuous Random Variable

Maximum differential entropy for a Gaussian random variable with zero mean and variance σ²:

Two constraints:
(1)  ∫ fX(x) dx = 1
(2)  ∫ x² fX(x) dx = σ² = constant

The PDF of the Gaussian random variable is:

fX(x) = [ 1 / √(2πσ²) ] exp( -x² / (2σ²) )

Differential Entropy for a Continuous Random Variable

The differential entropy of this random variable is:

h(X) = - ∫ fX(x) log2 ( fX(x) ) dx

     = - ∫ fX(x) log2 [ (1 / √(2πσ²)) exp( -x² / (2σ²) ) ] dx

     = ∫ fX(x) log2 √(2πσ²) dx + ∫ fX(x) [ x² / (2σ²) ] log2 e dx

Differential Entropy for a Continuous Random Variable

h(X) = log2 √(2πσ²) ∫ fX(x) dx + [ log2 e / (2σ²) ] ∫ x² fX(x) dx

     = log2 √(2πσ²) · 1 + [ log2 e / (2σ²) ] · σ²

     = (1/2) log2 (2πσ²) + (1/2) log2 e

h(X) = (1/2) log2 ( 2πeσ² )
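A rough numerical check of this closed form, using a plain Riemann sum over the Gaussian PDF (a sketch, not a proof); sigma = 2 is an arbitrary choice.

```python
import math

sigma = 2.0

def pdf(x):
    return math.exp(-x * x / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

dx = 1e-3
h_numeric = 0.0
for i in range(-20000, 20001):      # integrate over roughly +/- 10 sigma
    p = pdf(i * dx)
    h_numeric -= p * math.log2(p) * dx

h_formula = 0.5 * math.log2(2 * math.pi * math.e * sigma**2)
print(h_numeric, h_formula)         # both about 3.05 bits for sigma = 2
```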

Mutual Information of Continuous Random Variables

Consider next a pair of continuous random variables X and Y. By analogy with the discrete case, we define the mutual information between the random variables X and Y as

I(X; Y) = ∫∫ fX,Y(x, y) log2 [ fX(x | y) / fX(x) ] dx dy

where fX,Y(x, y) is the joint probability density function of X and Y, and fX(x | y) is the conditional probability density function of X, given that Y = y.

Also, by analogy with the discrete case, the mutual information I(X; Y) has the following properties:

Mutual Information of Continuous Random Variables

1. I(X; Y) = I(Y; X)
2. I(X; Y) ≥ 0
3. I(X; Y) = h(X) - h(X | Y)
4. I(X; Y) = h(Y) - h(Y | X)

where h(X) and h(Y) are the differential entropies of the continuous random variables X and Y, respectively.

Mutual Information of Continuous Random Variables

The quantity h(X | Y) is the conditional differential entropy of X, given Y, defined as

h(X | Y) = ∫∫ fX,Y(x, y) log2 [ 1 / fX(x | y) ] dx dy

The quantity h(Y | X) is the conditional differential entropy of Y, given X, defined as

h(Y | X) = ∫∫ fX,Y(x, y) log2 [ 1 / fY(y | x) ] dx dy

Information Capacity Theorem

Shannon-Hartley law (Shannon's third theorem):
- We now use the concept of mutual information to formulate the information capacity theorem for band-limited, power-limited Gaussian channels.
- To be specific, consider a zero-mean stationary process X(t) that is band-limited to B hertz.
- Let Xk, k = 1, 2, ..., K, denote the continuous random variables obtained by uniform sampling of the process X(t) at the Nyquist rate of 2B samples per second.
- These samples are transmitted in T seconds over a noisy channel, also band-limited to B hertz.
- Hence, the number of samples, K, is given by

  K = 2BT        (x)

Information Capacity Theorem

- We refer to Xk as a sample of the transmitted signal.
- The model of the discrete-time, memoryless Gaussian channel adds a noise sample Nk to each transmitted sample Xk.
- The additive white Gaussian noise (AWGN) has zero mean and power spectral density N0/2; the noise is band-limited to B hertz.
- Let Yk, k = 1, 2, ..., K, denote the continuous random variables obtained by uniform sampling of the received process Y(t) at the Nyquist rate of 2B samples per second.


Information Capacity Theorem

- We refer to Yk as a sample of the received signal, given by

  Yk = Xk + Nk,   k = 1, 2, ..., K

  where Nk is a sample of the Gaussian noise N(t) with zero mean and variance

  σ² = N0 B

- We assume that the samples Yk, k = 1, 2, ..., K, are statistically independent.
- Typically, the transmitter is power limited; it is therefore reasonable to define the cost as

  E[Xk²] = P,   k = 1, 2, 3, ..., K

Information Capacity Theorem

- Here P is the average transmitted power. The power-limited Gaussian channel described here is of not only theoretical but also practical importance, in that it models many communication channels, including line-of-sight radio and satellite links.
- The information capacity of the channel is defined as the maximum of the mutual information between the channel input Xk and the channel output Yk over all distributions on the input Xk that satisfy the power constraint.
- Let I(Xk; Yk) denote the mutual information between Xk and Yk.
- We may then define the information capacity of the channel as

  C = max_{fXk(x)} { I(Xk; Yk) : E[Xk²] = P }

Information Capacity Theorem

The mutual information I(Xk; Yk) can be expressed as

I(Xk; Yk) = h(Yk) - h(Yk | Xk)

Now we can show that the conditional differential entropy of Yk, given Xk, is equal to the differential entropy of Nk:

h(Yk | Xk) = h(Nk)

Hence, we may rewrite the mutual information as

I(Xk; Yk) = h(Yk) - h(Nk)

In order to maximize the mutual information I(Xk; Yk), we have to choose the samples of the transmitted signal from a noise-like (Gaussian) process of average power P:

C = max { I(Xk; Yk) }   with Xk Gaussian and E[Xk²] = P

Information Capacity Theorem

For the evaluation of the information capacity C, we proceed in three stages:

1. The variance of the sample Yk of the received signal equals P + σ². Hence, the maximum differential entropy of the Gaussian random variable Yk is

   h(Yk) = (1/2) log2 [ 2πe (P + σ²) ]

2. The variance of the noise sample Nk equals σ². Hence, the differential entropy of Nk is

   h(Nk) = (1/2) log2 ( 2πe σ² )

Information Capacity Theorem

3. Substituting the values of h(Yk) and h(Nk) into

   I(Xk; Yk) = h(Yk) - h(Nk)

   gives

   I(Xk; Yk) = (1/2) log2 [ 2πe (P + σ²) ] - (1/2) log2 ( 2πe σ² )
             = (1/2) log2 [ (P + σ²) / σ² ]
             = (1/2) log2 ( 1 + P / σ² )

So, by the definition of information capacity, it is given as

C = (1/2) log2 ( 1 + P / σ² )   bits per transmission        (p)

Information Capacity Theorem

- With the channel used K times for the transmission of the K samples of the process X(t) in T seconds, the information capacity per unit time is (K/T) times the result given in equation (p).
- The number K equals 2BT, as given in equation (x).
- Hence, we may express the information capacity in the equivalent form:

  C = B log2 ( 1 + P / (N0 B) )   bits per second        (q)

We may now state Shannon's third (and most famous) theorem, the information capacity theorem.
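As a closing illustration (numbers chosen for the example, not from the slides), equation (q) can be evaluated directly:

```python
import math

def shannon_capacity(bandwidth_hz, signal_power, noise_psd):
    """C = B * log2(1 + P / (N0 * B)) bits per second, equation (q) above,
    with sigma^2 = N0 * B."""
    return bandwidth_hz * math.log2(1 + signal_power / (noise_psd * bandwidth_hz))

# A 3.4 kHz channel with a signal-to-noise ratio of 30 dB.
B = 3400.0
snr = 10 ** (30 / 10)                 # 30 dB -> 1000
print(B * math.log2(1 + snr))         # about 33.9 kb/s

# A 1 MHz channel where P / (N0 * B) = 100.
print(shannon_capacity(1e6, 1e-6, 1e-14))   # about 6.66 Mb/s
```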
