
Fundamental Concepts and

Limits in Information Theory


Dr. Shyam Lal
Assistant Professor
Department of E & C Engineering
National Institute of Technology
Karnataka, Surathkal

Fundamental Limits on Performance




Given an information source and a noisy channel, information theory provides:

1) A limit on the minimum number of bits per symbol required to fully represent the source.
2) A limit on the maximum rate at which reliable communication can take place over the channel.

Information Theory


Information Theory deals with:

1) The measure of source information
2) The information capacity of the channel
3) Coding

If the rate of information from a source does not exceed the capacity of the channel, then there exists a coding scheme such that the information can be transmitted over the communication channel with an arbitrarily small probability of error, despite the presence of noise.

Information Theory


Let the source alphabet be

S = {s0, s1, ..., s(K-1)}

with probabilities of occurrence

P(S = sk) = pk,   k = 0, 1, ..., K-1,   and   Σ_{k=0}^{K-1} pk = 1

Assume a discrete memoryless source (DMS).

What is the measure of information?

Uncertainty, Information, and


Entropy


Interrelation between information and uncertainty (surprise):

No surprise → no information.

Information ∝ 1 / Probability

If A is one surprise and B is another, statistically independent surprise, then the total information of A and B occurring simultaneously is

Info(A ∩ B) = Info(A) + Info(B)

The amount of information may therefore be related to the inverse of the probability of occurrence:

I(sk) = log( 1 / pk )

Properties of Information
1) I(sk) = 0 for pk = 1
2) I(sk) ≥ 0 for 0 ≤ pk ≤ 1
3) I(sk) > I(si) for pk < pi
4) I(sk si) = I(sk) + I(si), if sk and si are statistically independent
Case 1: Obviously, if we are absolutely certain of the outcome of
an event, even before it occurs, there is no information gained.
Case 2: That is to say, the occurrence of an event S = sk either
provides some or no information, but never brings about a loss of
information.
Case 3: That is, the less probable an event is, the more information
we gain when it occurs.
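As a quick check of these properties, the short Python sketch below (not part of the original slides) evaluates I(sk) = log2(1/pk) for a few illustrative probabilities.

```python
import math

def self_information(p: float) -> float:
    """Self-information I(s) = log2(1/p) in bits for an outcome of probability p."""
    return math.log2(1.0 / p)

# Property 1: a certain event carries no information.
print(self_information(1.0))    # 0.0
# Property 3: less probable events carry more information.
print(self_information(0.5))    # 1.0 bit
print(self_information(0.125))  # 3.0 bits
# Property 4: for statistically independent symbols, information adds.
print(self_information(0.5 * 0.125))                     # 4.0 bits
print(self_information(0.5) + self_information(0.125))   # 4.0 bits
```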

Entropy of DMS



Definition: Entropy is a measure of the average information content per source symbol. It is the mean value of I(sk) over the source alphabet S:

H(S) = E[I(sk)] = Σ_{k=0}^{K-1} pk I(sk) = Σ_{k=0}^{K-1} pk log2(1/pk)

Properties of H(S):

0 ≤ H(S) ≤ log2 K

where K is the radix (number of symbols) of the alphabet S of the source.

1) H(S) = 0 if pk = 1 for some k and all other pi's = 0.
   No uncertainty (lower bound on entropy).
2) H(S) = log2 K iff pk = 1/K for all k.
   Maximum uncertainty (upper bound on entropy).
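A minimal Python sketch of the entropy formula above; the example distributions are illustrative and assume K = 4 symbols.

```python
import math

def entropy(probs):
    """H(S) = sum_k pk * log2(1/pk), in bits; terms with pk = 0 contribute 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

K = 4
print(entropy([1.0, 0.0, 0.0, 0.0]))        # 0.0  -> no uncertainty (lower bound)
print(entropy([1.0 / K] * K))               # 2.0  = log2(K), maximum uncertainty
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75, somewhere in between
```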

More Intuition on Entropy




Assume a binary memoryless source, e.g., a flip of a coin. How much information do we receive when we are told that the outcome is heads?

- If it is a fair coin, i.e., P(heads) = P(tails) = 0.5, we say that the amount of information is 1 bit.
- If we already know that it will be (or was) heads, i.e., P(heads) = 1, the amount of information is zero!
- If the coin is not fair, e.g., P(heads) = 0.9, the amount of information is more than zero but less than one bit!
- Intuitively, the amount of information received is the same whether P(heads) = 0.9 or P(heads) = 0.1.

Example of Shannon's Entropy

Consider the following string consisting of symbols a and b:

abaabaababbbaabbabab ...

- On average, there are equal numbers of a and b.
- The string can be considered as the output of a source that emits symbol a or symbol b, each with probability 0.5.

We want to characterize the average information generated by the source.

Example of Shannon's Entropy

Example: Entropy of a Binary Memoryless Source

To illustrate the properties of H(S), we consider a binary source for which symbol 0 occurs with probability p0 and symbol 1 with probability p1 = 1 - p0.

- We assume that the source is memoryless, so that successive symbols emitted by the source are statistically independent.
- The entropy of such a source equals

  H(S) = -p0 log2 p0 - p1 log2 p1
       = -p0 log2 p0 - (1 - p0) log2 (1 - p0)  bits

from which we observe the following:

1. When p0 = 0, the entropy H(S) = 0; this follows from the fact that x log x → 0 as x → 0.
2. When p0 = 1, the entropy H(S) = 0.
3. The entropy H(S) attains its maximum value, Hmax = 1 bit, when p0 = p1 = 0.5, that is, when symbols 1 and 0 are equally probable.
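The sketch below evaluates the binary entropy expression above for a few values of p0, reproducing the three observations (0 bits at p0 = 0 or 1, a maximum of 1 bit at p0 = 0.5).

```python
import math

def binary_entropy(p0: float) -> float:
    """H(S) = -p0*log2(p0) - (1-p0)*log2(1-p0), with 0*log2(0) taken as 0."""
    h = 0.0
    for p in (p0, 1.0 - p0):
        if p > 0:
            h -= p * math.log2(p)
    return h

for p0 in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p0 = {p0:.1f}  ->  H(S) = {binary_entropy(p0):.4f} bits")
```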


Example of Shannon's Entropy

Extension of DMS (Entropy)









- Consider blocks of symbols rather than individual symbols.
- Coding efficiency can increase if higher-order extensions of the DMS are used.
- H(S^n) refers to an extended source having K^n distinct symbols, where K is the number of distinct symbols in the original alphabet.
- For a DMS, H(S^n) = n H(S).
- The second-order extension has entropy H(S^2).
- Consider a source alphabet S having 3 symbols, i.e., {s0, s1, s2}.
- Then S^2 has 9 symbols, i.e., {s0s0, s0s1, s0s2, s1s0, ..., s2s2}.

Extension of DMS
(Entropy)


Consider next the second-order extension of the source. With the source alphabet S consisting of three symbols, the alphabet S^2 of the extended source has nine symbols.

- The first row of Table 9.1 presents the nine symbols of S^2, denoted σ0, σ1, ..., σ8.
- The second row of the table presents the composition of these nine symbols in terms of the corresponding sequences of source symbols s0, s1, and s2, taken two at a time.
- The probabilities of the nine symbols of the extended source are presented in the last row of the table.

Extension of DMS
(Entropy)

p(s0) = 1/4,   p(s1) = 1/4,   p(s2) = 1/2

Extension of DMS
(Entropy)


Accordingly, using the entropy formula, the entropy of the original source is

H(S) = (1/4) log2 4 + (1/4) log2 4 + (1/2) log2 2 = 3/2 bits

and the entropy of the extended source is

H(S^2) = 3 bits = 2 H(S)

which confirms the extension property stated above.
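A quick numerical check of the extension property for the probabilities given above (1/4, 1/4, 1/2); the pair probabilities are formed by assuming successive symbols are independent, as for a DMS.

```python
import math
from itertools import product

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

p = {"s0": 0.25, "s1": 0.25, "s2": 0.5}

# Probabilities of the 9 symbols of the second-order extension S^2
# (ordered pairs of independent source symbols).
p2 = [p[a] * p[b] for a, b in product(p, repeat=2)]

print(entropy(p.values()))  # H(S)   = 1.5 bits
print(entropy(p2))          # H(S^2) = 3.0 bits = 2 * H(S)
```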

Source Coding Theorem








- Source encoding is the process in which the data generated by a discrete source is represented as a binary sequence.
- The device that performs this representation is called a source encoder.
- In particular, if some source symbols are known to be more probable than others, we may exploit this feature in the generation of a source code by assigning short codewords to frequent source symbols and long codewords to rare source symbols.
- We refer to such a source code as a variable-length code.
- The Morse code is an example of a variable-length code.
- In the Morse code, the letters of the alphabet and the numerals are encoded into streams of marks and spaces, denoted as dots "." and dashes "-", respectively.

Source Coding Theorem






An efficient source encoder satisfies two functional requirements:

1. The codewords produced by the encoder are in binary form.
2. The source code is uniquely decodable, so that the original source sequence can be reconstructed perfectly from the encoded binary sequence.

Source Coding Theorem

Shannon's coding theorem for noiseless channels:

It expresses the lower limit of the average codeword length of a source in terms of its entropy.

Statement:

- The theorem states that in any coding scheme, the average codeword length for a source of symbols must be equal to or greater than the source entropy.
- The theorem assumes the coding to be lossless and the channel to be noiseless.
- If L denotes the average codeword length of a coding scheme, then as per Shannon's theorem we can state that

  L ≥ H(S)

Source Coding Theorem




Average codeword length:

L = Σ_{k=0}^{K-1} pk lk

Coding efficiency:

η = H(S) / L

The average codeword length is bounded by:

H(S) ≤ L < H(S) + 1
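The sketch below computes the average codeword length, the coding efficiency, and checks the bound above for an illustrative source and code-length assignment (not taken from the slides).

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def average_length(probs, lengths):
    """L = sum_k pk * lk (bits per symbol)."""
    return sum(p * l for p, l in zip(probs, lengths))

# Illustrative source and codeword lengths.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]          # e.g. codewords 0, 10, 110, 111

H = entropy(probs)
L = average_length(probs, lengths)
print(H, L, H / L)              # 1.75 1.75 1.0 -> 100% efficiency
assert H <= L < H + 1           # source-coding bound
```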

Prefix Coding


Prefix Code: A prefix code is defined as a code in


which no code word is the prefix of any other code
word.

Prefix Coding


Decoding:

Uniquely Decodable Codes




A variable length code assigns a bit string (codeword)


of variable length to every message value

e.g. a = 1, b = 01, c = 101, d = 011




A uniquely decodable code is a variable-length code in which every bit string can be uniquely decomposed into its constituent codewords.

Uniquely Decodable Codes





Fixed-Length versus Variable-Length Codes:

Example: Suppose we want to store messages made up of the 4 characters a, b, c, d with frequencies 60, 5, 30, 5 (percent), respectively. What are the fixed-length codes and prefix-free codes that use the least space?
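One way to answer the question numerically is sketched below; the fixed-length code needs 2 bits per character, while the prefix-free code shown (a=0, c=10, b=110, d=111) is one possible assignment, not taken from the slides.

```python
# ceil(log2(4)) = 2 bits per character for the fixed-length code.
freq = {"a": 0.60, "b": 0.05, "c": 0.30, "d": 0.05}

fixed_bits = 2                                   # 4 symbols -> 2 bits each
prefix_len = {"a": 1, "b": 3, "c": 2, "d": 3}    # lengths of 0, 110, 10, 111

avg_fixed = sum(p * fixed_bits for p in freq.values())
avg_prefix = sum(freq[s] * prefix_len[s] for s in freq)

print(avg_fixed)    # 2.0 bits/character -> 200 bits per 100 characters
print(avg_prefix)   # 1.5 bits/character -> 150 bits per 100 characters
```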

Uniquely Decodable Codes

Kraft McMillan Inequality

Theorem (Kraft-McMillan): For any uniquely decodable code,

Σ_{k=0}^{K-1} 2^(-lk) ≤ 1

NOTE: The Kraft-McMillan inequality does not tell us whether the code is prefix-free or not.
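A one-line check of the inequality; the two length sets below are illustrative.

```python
def kraft_sum(lengths, radix=2):
    """Kraft-McMillan sum: sum_k radix**(-lk). Uniquely decodable => sum <= 1."""
    return sum(radix ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))   # 1.0    -> satisfies the inequality
print(kraft_sum([1, 2, 2, 3]))   # 1.125  -> no uniquely decodable code exists
```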

Kraft McMillan Inequality




Instantaneous Codes:

- A uniquely decodable code is said to be instantaneous if the end of any codeword is recognizable without inspecting succeeding code symbols. That is, there is no time lag in the process of decoding.
- Prefix property: A necessary and sufficient condition for a code to be instantaneous is that no complete codeword be a prefix of some other codeword.
- To understand the concept, consider the following codes:

Kraft McMillan Inequality




Example:

Kraft McMillan Inequality




Example:

A six-symbol source is encoded into the binary codes shown below. Which of these codes are instantaneous?

Kraft McMillan Inequality




Example:

Given S = {s1, s2, s3, s4, s5, s6, s7, s8, s9} and X = {0, 1}. Further, let l1 = l2 = 2 and l3 = l4 = l5 = l6 = l7 = l8 = l9 = k.

Then from the Kraft inequality we have

2·2^(-2) + 7·2^(-k) ≤ 1,   i.e.,   7·2^(-k) ≤ 1/2,   which requires k ≥ 4.
Kraft McMillan Inequality




- Clearly, if k < 4 it is not possible to construct an instantaneous binary code with these lengths.
- If k ≥ 4, the Kraft inequality tells us that an instantaneous code does exist, but it does not tell us how to construct such a code.
- The codes for the symbols when k = 4 are shown below:

Shannon Fano Coding Technique

Algorithm:

Step 1: Arrange all messages in descending order of probability.

Step 2: Divide the sequence into two groups in such a way that the sums of the probabilities in the two groups are as nearly equal as possible.

Step 3: Assign 0 to the upper group and 1 to the lower group.

Step 4: Repeat Steps 2 and 3 within each of the resulting groups, and so on. A short implementation sketch of this procedure is given below.
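A minimal sketch of this splitting procedure, assuming ties in Step 2 are broken by whichever split makes the two group probabilities closest; applied to the message set of the example that follows, it reproduces the codes in the table.

```python
def shannon_fano(symbols):
    """symbols: list of (name, probability). Returns {name: codeword}."""
    codes = {}

    def split(group, prefix):
        if len(group) == 1:
            codes[group[0][0]] = prefix or "0"
            return
        total = sum(p for _, p in group)
        best_i, best_diff, running = 1, float("inf"), 0.0
        for i in range(1, len(group)):          # find the most balanced split
            running += group[i - 1][1]
            diff = abs(2 * running - total)     # |upper sum - lower sum|
            if diff < best_diff:
                best_i, best_diff = i, diff
        split(group[:best_i], prefix + "0")     # upper group gets 0
        split(group[best_i:], prefix + "1")     # lower group gets 1

    split(sorted(symbols, key=lambda s: -s[1]), "")
    return codes

msgs = [("m1", 1/2), ("m2", 1/8), ("m3", 1/8), ("m4", 1/16),
        ("m5", 1/16), ("m6", 1/16), ("m7", 1/32), ("m8", 1/32)]
print(shannon_fano(msgs))
# {'m1': '0', 'm2': '100', 'm3': '101', 'm4': '1100', 'm5': '1101',
#  'm6': '1110', 'm7': '11110', 'm8': '11111'}
```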

Example

Messages mk |  pk  | No. of Bits (lk) | Code
    m1      | 1/2  |        1         | 0
    m2      | 1/8  |        3         | 100
    m3      | 1/8  |        3         | 101
    m4      | 1/16 |        4         | 1100
    m5      | 1/16 |        4         | 1101
    m6      | 1/16 |        4         | 1110
    m7      | 1/32 |        5         | 11110
    m8      | 1/32 |        5         | 11111

Shannon Fano Coding Technique

Source entropy:
H(S) = -(1/2) log2(1/2) - 2·(1/8) log2(1/8) - 3·(1/16) log2(1/16) - 2·(1/32) log2(1/32)
     = (1/2)·1 + 2·(1/8)·3 + 3·(1/16)·4 + 2·(1/32)·5
     = 0.5 + 0.75 + 0.75 + 0.3125 = 2.3125 bits

Average codeword length:
L = (1/2)·1 + 2·(1/8)·3 + 3·(1/16)·4 + 2·(1/32)·5
  = 0.5 + 0.75 + 0.75 + 0.3125 = 2.3125 bits

Coding efficiency: η = H(S)/L = 100%




Basic principles of Huffman Coding




- Invented by Huffman as a class assignment in 1950.
- Huffman coding is a popular lossless Variable Length Coding (VLC) scheme, based on the following principles:
  - Shorter codewords are assigned to more probable symbols and longer codewords are assigned to less probable symbols.
  - No codeword of a symbol is a prefix of another codeword. This makes Huffman coding uniquely decodable.
  - Every source symbol must have a unique codeword assigned to it.

Basic principles of Huffman Coding




The Huffman encoding algorithm proceeds as follows:

1. The source symbols are listed in order of decreasing probability. The two source symbols of lowest probability are assigned a 0 and a 1.

2. These two source symbols are regarded as being combined into a new source symbol with probability equal to the sum of the two original probabilities. (The list of source symbols, and therefore the source statistics, is thereby reduced in size by one.) The probability of the new symbol is placed in the list in accordance with its value.

3. The procedure is repeated until we are left with a final list of source statistics (symbols) of only two, for which a 0 and a 1 are assigned.

A minimal sketch of this merging procedure is given below.
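A minimal sketch of the merging procedure using a min-heap; the probabilities in the demonstration (0.4, 0.2, 0.2, 0.1, 0.1) are those implied by the Figure 4.1 example discussed below, and the tie-breaking counter is only there to keep the heap ordering well defined.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """probs: {symbol: probability}. Returns {symbol: codeword}."""
    tie = count()
    # Each heap entry: (probability, tiebreak, {symbol: partial codeword})
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # lowest probability    -> bit '0'
        p1, _, c1 = heapq.heappop(heap)   # second lowest         -> bit '1'
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

probs = {"s0": 0.4, "s1": 0.2, "s2": 0.2, "s3": 0.1, "s4": 0.1}
codes = huffman_code(probs)
avg = sum(probs[s] * len(w) for s, w in codes.items())
print(codes, avg)   # average length 2.2 bits/symbol for these probabilities
```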

Basic principles of Huffman Coding




Input information rate of the source encoder:

Ri = Rs · H(S)   (bits per second)

where Rs is the signalling rate and H(S) is the source entropy.

Output information rate of the source encoder:

Ro = Rs · L   (bits per second)

where Rs is the signalling rate and L is the average codeword length.

Redundancy = average codeword length - source entropy = L - H(S)

Basic principles of Huffman Coding




The variance of the codeword lengths produced by the source encoder is:

σ² = Σ_{k=0}^{K-1} pk (lk - L)²

A smaller value of the variance is preferred because it requires less memory space.

Basic principles of Huffman Coding




EXAMPLE: Huffman Tree

The five symbols of the alphabet of a discrete memoryless


source and their probabilities are shown in the two leftmost
columns of Figure 4.1(a). Following through the Huffman
algorithm, we reach the end of the computation in four
steps, resulting in the Huffman tree shown in Figure 4.1(a).
The code words of the Huffman code for the source are
tabulated in Figure 4.1(b).

Method 1: Placing the probability of the new symbol as high as possible.

Basic principles of Huffman Coding

Figure 4.1(a): Huffman tree for the five-symbol source (Method 1).

Basic principles of Huffman Coding

Figure 4.1(b): Codewords of the Huffman code.

Memory storage requirement for Huffman code 1:
In order to store 100 characters in a computer, the number of bits required = (40·2) + (20·2) + (20·2) + (10·3) + (10·3) = 220 bits.

Basic principles of Huffman Coding

The variance of the codeword lengths of Huffman code 1:

σ² = Σ_{k=0}^{K-1} pk (lk - L)² = 0.16

Basic principles of Huffman Coding


Method 2: Placing the probability of the new symbol as low as possible.

Memory storage requirement for Huffman code 2:
In order to store 100 characters in a PC, the number of bits required = (40·1) + (20·2) + (20·3) + (10·4) + (10·4) = 220 bits.

Basic principles of Huffman Coding


The average codeword length is still 2.2 bits/symbol, but the variances are different! The variance of the codeword lengths of Huffman code 2:

σ² = Σ_{k=0}^{K-1} pk (lk - L)² = 1.36
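The two variance values quoted above can be reproduced directly from the codeword lengths implied by the two storage calculations (Method 1: lengths 2, 2, 2, 3, 3; Method 2: lengths 1, 2, 3, 4, 4):

```python
probs = [0.4, 0.2, 0.2, 0.1, 0.1]
lengths_method1 = [2, 2, 2, 3, 3]   # new symbol placed as high as possible
lengths_method2 = [1, 2, 3, 4, 4]   # new symbol placed as low as possible

def stats(probs, lengths):
    avg = sum(p * l for p, l in zip(probs, lengths))
    var = sum(p * (l - avg) ** 2 for p, l in zip(probs, lengths))
    return avg, var

print(stats(probs, lengths_method1))   # (2.2, 0.16)
print(stats(probs, lengths_method2))   # (2.2, 1.36)
```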

Basic principles of Huffman Coding


- Let us assume that the signalling rate Rs = 10,000 symbols per second.
- Output information rate of the source encoder, i.e., the required transmission capacity: 2.2 × 10,000 = 22,000 bits per second.
- This means that the transmission channel expects to receive 22,000 bits per second from the source encoder.
- Since we are using variable-length coding, the bit generation rate will not be constant; it will vary around 22,000 bits per second.
- So the output of such a source coder is usually fed into a buffer, whose purpose is to smooth out the variation in the bit generation rate.
- However, the buffer has to be of finite size. Therefore, the greater the variance of the codeword lengths, the more difficult the buffer design problem becomes.

Basic principles of Huffman Coding


If the encoder simply writes the compressed data on a file in
the computer, the variance of the code makes no di
erence.
A small-variance Hu
man code is preferable only in cases
where the encoder transmits the compressed data, as it is
being generated, over a transmission network.
In such a case, a code with large variance causes the encoder
to generate bits at a rate that varies all the time. Since the bits
have to be transmitted at a constant rate, the encoder has to
use a bu
er. Bits of the compressed data are entered into the
bu
er as they are being generated and are moved out of it at a
constant rate, to be transmitted.
It is easy to see intuitively that a Hu
man code with zero
variance will enter bits into the bu
er at a constant rate, so only
a short bu
er size will be needed.
The larger the code variance, the more variable is the rate at
which bits enter into the bu
er, requiring the encoder to use a
larger bu
er size.
46

Encoding a string of symbols using


Huffman codes
- After obtaining the Huffman codes for each symbol, it is easy to construct the encoded bit stream for a string of symbols.

Example:
- Suppose we have to encode the string of symbols s3 s2 s4 s3 s0 s1 s2. We start from the left, taking one symbol at a time.
- The code corresponding to the first symbol s3 is 010; the second symbol s2 has code 11; the third symbol s4 has code 011; the fourth symbol s3 again has code 010; the fifth symbol s0 has code 00; and the sixth symbol s1 has code 10.
- Proceeding in this way, we obtain the encoded bit stream using Huffman encoding as: 01011011010001011
- In this example, 17 bits were used to encode the string of 7 symbols. A straight binary encoding of 7 symbols chosen from an alphabet of 5 symbols would have required 21 bits (3 bits/symbol), so this encoding scheme demonstrates substantial compression.
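A sketch of this encoding step, using the codewords quoted in the example (s0 = 00, s1 = 10, s2 = 11, s3 = 010, s4 = 011):

```python
codes = {"s0": "00", "s1": "10", "s2": "11", "s3": "010", "s4": "011"}

message = ["s3", "s2", "s4", "s3", "s0", "s1", "s2"]
bitstream = "".join(codes[s] for s in message)

print(bitstream)         # 01011011010001011
print(len(bitstream))    # 17 bits, versus 21 bits for a 3-bit fixed-length code
```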


Decoding a Huffman coded bit stream


Since no codeword is a prefix of another codeword, Huffman
codes are uniquely decodable.
The decoding process is straightforward and can be
summarized below:
Step-1:Examine the leftmost bit in the bit stream. If this
corresponds to the codeword of an elementary symbol, add
that symbol to the list of decoded symbols, remove the
examined bit from the bit stream and go back to step-1 until all
the bits in the bit stream are considered. Else, follow step-2.
Step-2:Append the next bit from the left to the already
examined bit(s) and examine if the group of bits correspond to
the codeword of an elementary symbol. If yes, add that
symbol to the list of decoded symbols, remove the examined
bits from the bit stream and go back to step-1 until all the bits
in the bit stream are considered. Else, repeat step-2 by
appending more bits.

Decoding a Huffman coded bit stream


Consider the encoded bit stream of the previous example: suppose we receive the bit stream 01011011010001011. Following the steps described above, we first decode s3 (010), then s2 (11), followed by s4 (011), s3 (010), s0 (00), s1 (10), and s2 (11). This is exactly what we had encoded.
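A sketch of the prefix-decoding procedure of the previous slide, applied to the same bit stream and codewords:

```python
codes = {"s0": "00", "s1": "10", "s2": "11", "s3": "010", "s4": "011"}
inverse = {w: s for s, w in codes.items()}

def decode(bitstream):
    symbols, buffer = [], ""
    for bit in bitstream:
        buffer += bit                 # step 2: append the next bit
        if buffer in inverse:         # step 1: a complete codeword is found
            symbols.append(inverse[buffer])
            buffer = ""
    return symbols

print(decode("01011011010001011"))
# ['s3', 's2', 's4', 's3', 's0', 's1', 's2']
```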


Basic principles of Huffman Coding


Properties:
Generates optimal prefix codes
Cheap to generate codes
Cheap to encode and decode
Average code word length =Source entropy if probabilities
are powers of 2.

Applications:
Used in many, if not most compression algorithms such as
gzip, bzip, jpeg (as option), fax compression,

51

Basic principles of Huffman Coding


Advantages of Huffman algorithm:
The Huffman algorithm generates an optimal prefix

code.
Cheap to generate codes.
Cheap to encode and decode.

Disadvantages of Huffman algorithm:


If the ensemble changes
 the frequencies and
probabilities change  the optimal coding changes
e.g. in text compression symbol frequencies vary with
context.
Re-computing the Huffman code by running through
the entire file in advance?
52

Conditional probabilities
Suppose we have a single event A with possible
outcomes {ai}.
Everything we know is specified by the
probabilities for the possible outcomes: P(ai).
For the coin toss the possible outcomes are
heads and tails:
P(heads) = 1/2 & P(tails) = 1/2.

More generally:

0 ≤ P(ai) ≤ 1   and   Σ_i P(ai) = 1

Two events:
Add a second event B with outcomes {bj} and probabilities
P(bj).
Complete description provided by the joint probabilities:
P(ai,bj)
If A and B are independent and uncorrelated then

P(ai,bj) = P(ai) P(bj)


Single-event probabilities and joint probabilities are related by:

P(ai) = Σ_j P(ai, bj)        P(bj) = Σ_i P(ai, bj)

Two events:
What does learning the value of A tell us about the
probabilities for the value of B?
If we learn that A = a0, then the quantities of interest are the conditional probabilities P(bj | a0), i.e., the probabilities of the bj given that A = a0.

This conditional probability is proportional to the joint probability:

P(bj | a0) ∝ P(a0, bj)

Finding the constant of proportionality leads to Bayes' rule:

P(ai, bj) = P(bj | ai) P(ai)   or   P(ai, bj) = P(ai | bj) P(bj)
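A small numerical illustration of these relations; the 2x2 joint distribution below is made up for the example.

```python
P = {("a0", "b0"): 0.30, ("a0", "b1"): 0.20,
     ("a1", "b0"): 0.10, ("a1", "b1"): 0.40}

# Marginals from the joint distribution.
P_a = {a: sum(p for (ai, _), p in P.items() if ai == a) for a in ("a0", "a1")}
P_b = {b: sum(p for (_, bj), p in P.items() if bj == b) for b in ("b0", "b1")}

# Conditional probability via Bayes' rule: P(b | a0) = P(a0, b) / P(a0)
P_b_given_a0 = {b: P[("a0", b)] / P_a["a0"] for b in ("b0", "b1")}

print(P_a)            # {'a0': 0.5, 'a1': 0.5}
print(P_b)            # {'b0': 0.4, 'b1': 0.6}
print(P_b_given_a0)   # {'b0': 0.6, 'b1': 0.4}  -- sums to 1
```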

Entropy for two events

H(A, B) = - Σ_{ij} P(ai, bj) log P(ai, bj)

H(A) = - Σ_{ij} P(ai, bj) log P(ai)

H(B) = - Σ_{ij} P(ai, bj) log P(bj)

H(A) + H(B) - H(A, B) = Σ_{ij} P(ai, bj) log [ P(ai, bj) / (P(ai) P(bj)) ] ≥ 0
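Evaluating these quantities for the same illustrative joint distribution shows that the difference H(A) + H(B) - H(A, B) is indeed nonnegative.

```python
import math

P = {("a0", "b0"): 0.30, ("a0", "b1"): 0.20,
     ("a1", "b0"): 0.10, ("a1", "b1"): 0.40}

def H(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

P_a = [0.5, 0.5]                    # marginals of A (from the block above)
P_b = [0.4, 0.6]                    # marginals of B

H_AB = H(P.values())
print(H(P_a) + H(P_b) - H_AB)       # about 0.12 bits, i.e. >= 0
```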

Entropy of Random Variable


We now extend the notions related to events to random variables.

Let X (respectively Y) be a discrete random variable taking on values in {x0, x1, ..., x(J-1)} (respectively {y0, y1, ..., y(K-1)}), with

p(xj, yk) = P(X = xj, Y = yk)

As the entropy depends only on the probability distribution, it is natural to define the joint entropy of the pair, H(X, Y), as

H(X, Y) = - Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 p(xj, yk)

Entropy of Random Variable


Conditional entropy:
Let X and Y be two random variables. Then expanding H(X, Y) gives the chain rule

H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)

where the conditional entropy is

H(X | Y) = - Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 p(xj | yk)

Discrete Memoryless Channel


A discrete memoryless channel is a statistical model with
an input X and an output Y that is a noisy version of X;
both X and Y are random variables.
 Every unit of time, the channel accepts an input symbol
X selected from an alphabet X and, in response, it emits an
output symbol Y from an alphabet Y.
The channel is said to be "discrete" when both of the
alphabets X and Y have finite sizes.
- It is said to be "memoryless" when the current output symbol depends only on the current input symbol and not on any of the previous ones.

Discrete Memoryless Channel


The diagram of a discrete memoryless channel shows the input alphabet X = {x0, x1, ..., x(J-1)}, the output alphabet Y = {y0, y1, ..., y(K-1)}, and the transition probabilities p(yk | xj) connecting them.

Discrete Memoryless Channel


A set of transition probabilities p(yk | xj) specifies the channel.

- The input alphabet X and the output alphabet Y need not have the same size.
- For example, in channel coding, the size K of the output alphabet Y may be larger than the size J of the input alphabet X; thus K ≥ J.
- On the other hand, we may have a situation in which the channel emits the same symbol when either one of two input symbols is sent, in which case K ≤ J.

Discrete Memoryless Channel


- The transition probability p(yk | xj) is the conditional probability that the channel output Y = yk, given that the channel input X = xj.
- There is a possibility of errors arising in the process of information transmission over a DMC.
- When k = j, the transition probability p(yk | xj) represents a conditional probability of correct reception; otherwise it represents a conditional probability of error.

Channel matrix, or transition matrix (J by K):

P = [ p(yk | xj) ],   j = 0, 1, ..., J-1,   k = 0, 1, ..., K-1

Discrete Memoryless Channel


- Note that each row of the channel matrix P corresponds to a fixed channel input, whereas each column of the matrix corresponds to a fixed channel output.
- Note also that a fundamental property of the channel matrix P, as defined here, is that the sum of the elements along any row of the matrix is always equal to one; that is,

  Σ_{k=0}^{K-1} p(yk | xj) = 1   for all j

- Suppose now that the inputs to a discrete memoryless channel are selected according to the probability distribution {p(xj), j = 0, 1, ..., J-1}.

Discrete Memoryless Channel


- In other words, the event that the channel input X = xj occurs with probability

  p(xj) = P(X = xj)   for j = 0, 1, ..., J-1

  where X denotes the random variable of the channel input.
- We specify a second random variable Y denoting the channel output.
- The joint probability distribution of the random variables X and Y is then given by

  p(xj, yk) = P(X = xj, Y = yk) = P(Y = yk | X = xj) P(X = xj) = p(yk | xj) p(xj)

Discrete Memoryless Channel


The marginal probability distribution of the output random variable Y is obtained by averaging out the dependence of p(xj, yk) on xj:

p(yk) = P(Y = yk) = Σ_{j=0}^{J-1} P(Y = yk | X = xj) P(X = xj)
      = Σ_{j=0}^{J-1} p(yk | xj) p(xj)       for k = 0, 1, ..., K-1

The probabilities p(xj), j = 0, 1, ..., J-1, are known as the a priori probabilities of the various input symbols.

Discrete Memoryless Channel


- The above equation states that if we are given the input a priori probabilities p(xj) and the channel matrix [i.e., the matrix of transition probabilities p(yk | xj)], then we may calculate the probabilities of the various output symbols p(yk).
- For J = K, the average probability of symbol error, Pe, is defined as the probability that the output random variable yk is different from the input random variable xj, averaged over all k ≠ j:

  Pe = Σ_{j=0}^{J-1} Σ_{k=0, k≠j}^{K-1} p(yk | xj) p(xj)

- The difference (1 - Pe) gives the average probability of correct reception.
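The sketch below works through these formulas for a binary symmetric channel, an illustrative DMC with J = K = 2 that is not taken from the slides.

```python
eps = 0.1                              # crossover probability
# Channel (transition) matrix: rows = inputs xj, columns = outputs yk.
P_y_given_x = [[1 - eps, eps],
               [eps, 1 - eps]]
p_x = [0.6, 0.4]                       # a priori input probabilities

# Each row of the channel matrix sums to one.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P_y_given_x)

# Output distribution: p(yk) = sum_j p(yk | xj) p(xj)
p_y = [sum(P_y_given_x[j][k] * p_x[j] for j in range(2)) for k in range(2)]

# Average probability of symbol error: sum over k != j of p(yk | xj) p(xj)
Pe = sum(P_y_given_x[j][k] * p_x[j]
         for j in range(2) for k in range(2) if k != j)

print(p_y)   # [0.58, 0.42]
print(Pe)    # 0.1, i.e. eps for a binary symmetric channel
```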

Mutual Information

- Consider the channel input X (selected from the input alphabet X) and the channel output Y (selected from the output alphabet Y).
- How can we measure the uncertainty about X after observing Y?
- Define the conditional entropy of X, given that Y = yk, as

  H(X | Y = yk) = Σ_{j=0}^{J-1} p(xj | yk) log2 [ 1 / p(xj | yk) ]

- This quantity is itself a random variable that takes on the values H(X | Y = y0), ..., H(X | Y = y(K-1)) with probabilities p(y0), ..., p(y(K-1)), respectively.

Mutual Information

The mean of the entropy H(X | Y = yk) over the output alphabet Y is therefore

H(X | Y) = Σ_{k=0}^{K-1} p(yk) H(X | Y = yk)
         = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ 1 / p(xj | yk) ]

The conditional entropy H(X | Y) represents the amount of uncertainty remaining about the channel input after the channel output has been observed.

Mutual Information

The difference H(X) - H(X | Y) is called the mutual information, which measures the uncertainty about the channel input that is resolved by observing the channel output:

I(X; Y) = H(X) - H(X | Y)
        = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ p(xj, yk) / (p(xj) p(yk)) ]
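Continuing the binary-symmetric-channel sketch, the mutual information can be evaluated directly from the double sum above.

```python
import math

eps, p_x = 0.1, [0.6, 0.4]
P_y_given_x = [[1 - eps, eps], [eps, 1 - eps]]

# Joint distribution p(xj, yk) = p(xj) p(yk | xj) and output marginals p(yk).
p_xy = [[p_x[j] * P_y_given_x[j][k] for k in range(2)] for j in range(2)]
p_y = [sum(p_xy[j][k] for j in range(2)) for k in range(2)]

I = sum(p_xy[j][k] * math.log2(p_xy[j][k] / (p_x[j] * p_y[k]))
        for j in range(2) for k in range(2))
print(I)   # about 0.51 bits per channel use
```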

Properties of Mutual Information


Property 1: The mutual information of a channel is symmetric; that is,

I(X; Y) = I(Y; X)

where the mutual information I(X; Y) is a measure of the uncertainty about the channel input that is resolved by observing the channel output, and the mutual information I(Y; X) is a measure of the uncertainty about the channel output that is resolved by sending the channel input.

Properties of Mutual Information


Proof: We want to show that I(X; Y) = I(Y; X). Start from

I(X; Y) = H(X) - H(X | Y)

Properties of Mutual Information


Proof (continued):

I(X; Y) = H(X) - H(X | Y)
        = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ 1 / p(xj) ]
          - Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ 1 / p(xj | yk) ]

Properties of Mutual Information


Combining the two sums term by term:

I(X; Y) = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) [ log2 (1 / p(xj)) + log2 p(xj | yk) ]

        = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ p(xj | yk) / p(xj) ]

Properties of Mutual Information


From Bayes' rule for conditional probabilities we have p(xj | yk) / p(xj) = p(yk | xj) / p(yk), and therefore

I(X; Y) = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ p(yk | xj) / p(yk) ] = I(Y; X)

Properties of Mutual Information


Property 2: The mutual information is always nonnegative; that is,

I(X; Y) ≥ 0

with equality if and only if p(xj, yk) = p(xj) p(yk) for all j and k, i.e., if and only if the channel input and channel output are statistically independent.

Properties of Mutual Information


Property 3: The mutual information of a channel is related to the joint entropy of the channel input and channel output by

I(X; Y) = H(X) + H(Y) - H(X, Y)

where the joint entropy is

H(X, Y) = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ 1 / p(xj, yk) ]

Proof: To prove this property, we rewrite the joint entropy as

H(X, Y) = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ p(xj) p(yk) / p(xj, yk) ]
        + Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ 1 / (p(xj) p(yk)) ]        (1)

Properties of Mutual Information


The second term of equation (1) can be written as

Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ 1 / (p(xj) p(yk)) ]
  = Σ_{j=0}^{J-1} log2 [ 1 / p(xj) ] Σ_{k=0}^{K-1} p(xj, yk)
  + Σ_{k=0}^{K-1} log2 [ 1 / p(yk) ] Σ_{j=0}^{J-1} p(xj, yk)

Expanding the inner sums using the marginal distributions:

Σ_{k=0}^{K-1} p(xj, yk) = Σ_{k=0}^{K-1} p(xj) p(yk | xj) = p(xj) Σ_{k=0}^{K-1} p(yk | xj) = p(xj)

Σ_{j=0}^{J-1} p(xj, yk) = Σ_{j=0}^{J-1} p(yk) p(xj | yk) = p(yk) Σ_{j=0}^{J-1} p(xj | yk) = p(yk)

Properties of Mutual Information


The simplified second term is therefore

Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ 1 / (p(xj) p(yk)) ]
  = Σ_{j=0}^{J-1} p(xj) log2 [ 1 / p(xj) ] + Σ_{k=0}^{K-1} p(yk) log2 [ 1 / p(yk) ]
  = H(X) + H(Y)

For the first term of equation (1), recall that

I(X; Y) = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ p(xj | yk) / p(xj) ]
        = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ p(xj, yk) / (p(xj) p(yk)) ]

so the first term equals -I(X; Y).

Properties of Mutual Information


So equation (1) can be written as

H(X, Y) = -I(X; Y) + H(X) + H(Y)

that is,

I(X; Y) = H(X) + H(Y) - H(X, Y)

Channel Coding Theorem


- For many real-time applications, a low level of reliability (e.g., a probability of error as high as Pe = 10^-1) is unacceptable.
- Many applications require an average probability of error equal to 10^-6 or even lower.
- To achieve such a high level of performance, we have to use channel coding.

Fig. 1.1. Block diagram of a digital communication system.

Channel Coding Theorem


- In block codes, the message sequence is subdivided into sequential blocks, each k bits long, and each k-bit block is mapped into an n-bit block, where n > k.
- The number of redundant bits added by the encoder to each transmitted block is n - k bits.
- The ratio k/n is called the code rate. Denoting the code rate by r, we may write

  r = k / n

- Accurate reconstruction of the original source sequence at the destination requires that the average probability of symbol error be low.

Channel Coding Theorem


- Suppose that the discrete memoryless source in Figure 1.1 has the source alphabet S and entropy H(S) bits per source symbol.
- We assume that the source emits symbols once every Ts seconds. Hence, the average information rate of the source is H(S)/Ts bits per second.
- The decoder delivers decoded symbols to the destination from the source alphabet S at the same source rate of one symbol every Ts seconds.
- The discrete memoryless channel has a channel capacity equal to C bits per use of the channel.
- We assume that the channel is capable of being used once every Tc seconds. Hence, the channel capacity per unit time is C/Tc bits per second, which represents the maximum rate of information transfer over the channel.

Channel Coding Theorem


Shannon's second theorem:
The channel coding theorem for a discrete memoryless channel is stated in two parts as follows:

(a) Let a discrete memoryless source with an alphabet S have entropy H(S) and produce symbols once every Ts seconds. Let a discrete memoryless channel have capacity C and be used once every Tc seconds. Then, if

H(S) / Ts ≤ C / Tc

there exists a coding scheme for which the source output can be transmitted over the channel and be reconstructed with an arbitrarily small average probability of error.

Channel Coding Theorem


(b) Conversely, if

H(S) / Ts > C / Tc

it is not possible to transmit information over the channel and reconstruct it with an arbitrarily small probability of error.

This theorem specifies the channel capacity C as a fundamental limit on the rate at which the transmission of reliable, error-free messages can take place over a discrete memoryless channel.

Channel Coding Theorem


Channel Capacity:
The mutual information can be expressed as

I(X; Y) = Σ_{j=0}^{J-1} Σ_{k=0}^{K-1} p(xj, yk) log2 [ p(xj, yk) / (p(xj) p(yk)) ]

- The input probability distribution p(xj) is independent of the channel.
- So we can maximize the mutual information I(X; Y) of the channel with respect to p(xj).

Channel Coding Theorem


Therefore, the channel capacity of a discrete memoryless channel is defined as the maximum mutual information I(X; Y) in any single use of the channel (i.e., signalling interval), where the maximization is over all possible input probability distributions {p(xj)} on the channel input alphabet X.

So the channel capacity C is defined as

C = max_{p(xj)} I(X; Y)

The channel capacity C is measured in bits per channel use.

The channel coding theorem is also known as the noisy coding theorem.
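For a binary symmetric channel (again an illustrative example, not from the slides), the maximization over input distributions can be carried out numerically and agrees with the closed form C = 1 - H(eps).

```python
import math

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(eps, steps=10001):
    """Capacity of a binary symmetric channel obtained by maximizing
    I(X;Y) = H(Y) - H(Y|X) over the input distribution p = P(X = 0)."""
    best = 0.0
    for i in range(steps):
        p = i / (steps - 1)
        q = p * (1 - eps) + (1 - p) * eps             # P(Y = 0)
        I = binary_entropy(q) - binary_entropy(eps)   # H(Y) - H(Y|X)
        best = max(best, I)
    return best

print(bsc_capacity(0.1))            # about 0.531 bits/use, achieved at p = 0.5
print(1 - binary_entropy(0.1))      # closed form C = 1 - H(eps) agrees
```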

Differential Entropy for a Continuous Random Variable

Let X be a continuous random variable with PDF fX(x).

By analogy with the entropy of a discrete random variable, we can introduce the differential entropy of the continuous random variable X:

h(X) = ∫ fX(x) log2 [ 1 / fX(x) ] dx

Justification of differential entropy:
- Basically, the continuous random variable X is the limiting form of a discrete random variable that assumes the value xk = k Δx, where k = 0, ±1, ±2, ..., and Δx approaches zero.
- By definition, the continuous random variable X assumes a value in the interval [xk, xk + Δx] with probability fX(xk) Δx.

Differential Entropy for a Continuous Random Variable

Hence, letting Δx approach zero, the ordinary entropy of the continuous random variable X may be written in the limiting form:

H(X) = lim_{Δx→0} Σ_{k=-∞}^{∞} fX(xk) Δx log2 [ 1 / (fX(xk) Δx) ]

     = lim_{Δx→0} Σ_{k=-∞}^{∞} fX(xk) Δx [ log2 (1 / fX(xk)) + log2 (1 / Δx) ]

     = lim_{Δx→0} [ Σ_{k=-∞}^{∞} fX(xk) log2 (1 / fX(xk)) Δx - log2(Δx) Σ_{k=-∞}^{∞} fX(xk) Δx ]

Differential Entropy for a Continuous Random Variable

H(X) = ∫ fX(x) log2 [ 1 / fX(x) ] dx - lim_{Δx→0} log2(Δx) ∫ fX(x) dx

Using the property of the PDF

∫ fX(x) dx = 1

the ordinary entropy of the continuous random variable X becomes

H(X) = ∫ fX(x) log2 [ 1 / fX(x) ] dx - lim_{Δx→0} log2(Δx)

H(X) = h(X) - lim_{Δx→0} log2(Δx)

Differential Entropy for a Continuous Random Variable

- In the limit as Δx approaches zero, -log2(Δx) approaches infinity.
- This means that the entropy of a continuous random variable is infinitely large.
- This is intuitively reasonable, because a continuous random variable may assume a value anywhere in the interval (-∞, ∞), and the uncertainty associated with the variable is on the order of infinity.
- We avoid the problem associated with the term log2(Δx) by adopting h(X) as the differential entropy, with the term -log2(Δx) serving as a reference.

Differential Entropy for a Continuous Random Variable

- Moreover, since the information transmitted over a channel is actually the difference between two entropy terms that have a common reference, the information will be the same as the difference between the corresponding differential entropy terms.
- When we have a continuous random vector X consisting of n random variables X1, X2, ..., Xn, we define the differential entropy of X as the n-fold integral

  h(X) = ∫ fX(x) log2 [ 1 / fX(x) ] dx

Differential Entropy for a Continuous Random Variable

Differential entropy of a uniformly distributed random variable:

fX(x) = 1 / (b - a)   for x ∈ (a, b),   and 0 otherwise

h(X) = ∫_a^b [ 1 / (b - a) ] log2 (b - a) dx = log2 (b - a)

Note that h(X) < 0 if (b - a) < 1.

Differential Entropy for a Continuous Random Variable

Maximum differential entropy for a Gaussian random variable with zero mean and variance σ²:

Two constraints:
(1)  ∫ fX(x) dx = 1
(2)  ∫ x² fX(x) dx = σ² = constant

The PDF of the Gaussian random variable is:

fX(x) = [ 1 / √(2πσ²) ] exp( -x² / (2σ²) )

Differential Entropy for a Continuous Random Variable

The differential entropy of this random variable is:

h(X) = - ∫ fX(x) log2 ( fX(x) ) dx

     = - ∫ fX(x) log2 [ (1 / √(2πσ²)) exp( -x² / (2σ²) ) ] dx

     = ∫ fX(x) log2 √(2πσ²) dx + ∫ fX(x) [ x² / (2σ²) ] log2 e dx

Differential Entropy for a Continuous Random Variable

h(X) = log2 √(2πσ²) ∫ fX(x) dx + [ log2 e / (2σ²) ] ∫ x² fX(x) dx

     = log2 √(2πσ²) · 1 + [ log2 e / (2σ²) ] · σ²

     = (1/2) log2 (2πσ²) + (1/2) log2 e

h(X) = (1/2) log2 ( 2πeσ² )
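A rough numerical check of this closed form, using a plain Riemann sum over the Gaussian PDF (a sketch, not a proof); sigma = 2 is an arbitrary choice.

```python
import math

sigma = 2.0

def pdf(x):
    return math.exp(-x * x / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

dx = 1e-3
h_numeric = 0.0
for i in range(-20000, 20001):      # integrate over roughly +/- 10 sigma
    p = pdf(i * dx)
    h_numeric -= p * math.log2(p) * dx

h_formula = 0.5 * math.log2(2 * math.pi * math.e * sigma**2)
print(h_numeric, h_formula)         # both about 3.05 bits for sigma = 2
```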

Mutual Information of Continuous Random Variables

Consider next a pair of continuous random variables X and Y. By analogy with the discrete case, we define the mutual information between the random variables X and Y as

I(X; Y) = ∫∫ fX,Y(x, y) log2 [ fX(x | y) / fX(x) ] dx dy

where fX,Y(x, y) is the joint probability density function of X and Y, and fX(x | y) is the conditional probability density function of X, given that Y = y.

Also, by analogy with the discrete case, the mutual information I(X; Y) has the following properties:

Mutual Information of Continuous Random Variables

1. I(X; Y) = I(Y; X)
2. I(X; Y) ≥ 0
3. I(X; Y) = h(X) - h(X | Y)
4. I(X; Y) = h(Y) - h(Y | X)

where h(X) and h(Y) are the differential entropies of the continuous random variables X and Y, respectively.

Mutual Information of Continuous Random Variables

The quantity h(X | Y) is the conditional differential entropy of X, given Y, defined as

h(X | Y) = ∫∫ fX,Y(x, y) log2 [ 1 / fX(x | y) ] dx dy

The quantity h(Y | X) is the conditional differential entropy of Y, given X, defined as

h(Y | X) = ∫∫ fX,Y(x, y) log2 [ 1 / fY(y | x) ] dx dy

Information Capacity Theorem

Shannon-Hartley law (Shannon's third theorem):
- We now use the concept of mutual information to formulate the information capacity theorem for band-limited, power-limited Gaussian channels.
- To be specific, consider a zero-mean stationary process X(t) that is band-limited to B hertz.
- Let Xk, k = 1, 2, ..., K, denote the continuous random variables obtained by uniform sampling of the process X(t) at the Nyquist rate of 2B samples per second.
- These samples are transmitted in T seconds over a noisy channel, also band-limited to B hertz.
- Hence, the number of samples, K, is given by

  K = 2BT        (x)

Information Capacity Theorem

- We refer to Xk as a sample of the transmitted signal.
- The model of the discrete-time, memoryless Gaussian channel adds a noise sample Nk to each transmitted sample Xk.
- The additive white Gaussian noise (AWGN) has zero mean and power spectral density N0/2; the noise is band-limited to B hertz.
- Let Yk, k = 1, 2, ..., K, denote the continuous random variables obtained by uniform sampling of the received process Y(t) at the Nyquist rate of 2B samples per second.


Information Capacity Theorem

- We refer to Yk as a sample of the received signal, given by

  Yk = Xk + Nk,   k = 1, 2, ..., K

  where Nk is a sample of the Gaussian noise N(t) with zero mean and variance

  σ² = N0 B

- We assume that the samples Yk, k = 1, 2, ..., K, are statistically independent.
- Typically, the transmitter is power limited; it is therefore reasonable to define the cost as

  E[Xk²] = P,   k = 1, 2, 3, ..., K

Information Capacity Theorem

- Here P is the average transmitted power. The power-limited Gaussian channel described here is of not only theoretical but also practical importance, in that it models many communication channels, including line-of-sight radio and satellite links.
- The information capacity of the channel is defined as the maximum of the mutual information between the channel input Xk and the channel output Yk over all distributions on the input Xk that satisfy the power constraint.
- Let I(Xk; Yk) denote the mutual information between Xk and Yk.
- We may then define the information capacity of the channel as

  C = max_{fXk(x)} { I(Xk; Yk) : E[Xk²] = P }

Information Capacity Theorem

The mutual information I(Xk; Yk) can be expressed as

I(Xk; Yk) = h(Yk) - h(Yk | Xk)

Now we can show that the conditional differential entropy of Yk, given Xk, is equal to the differential entropy of Nk:

h(Yk | Xk) = h(Nk)

Hence, we may rewrite the mutual information as

I(Xk; Yk) = h(Yk) - h(Nk)

In order to maximize the mutual information I(Xk; Yk), we have to choose the samples of the transmitted signal from a noise-like (Gaussian) process of average power P:

C = max { I(Xk; Yk) }   with Xk Gaussian and E[Xk²] = P

Information Capacity Theorem

For the evaluation of the information capacity C, we proceed in three stages:

1. The variance of the sample Yk of the received signal equals P + σ². Hence, the maximum differential entropy of the Gaussian random variable Yk is

   h(Yk) = (1/2) log2 [ 2πe (P + σ²) ]

2. The variance of the noise sample Nk equals σ². Hence, the differential entropy of Nk is

   h(Nk) = (1/2) log2 ( 2πe σ² )

Information Capacity Theorem

3. Substituting the values of h(Yk) and h(Nk) into

   I(Xk; Yk) = h(Yk) - h(Nk)

   gives

   I(Xk; Yk) = (1/2) log2 [ 2πe (P + σ²) ] - (1/2) log2 ( 2πe σ² )
             = (1/2) log2 [ (P + σ²) / σ² ]
             = (1/2) log2 ( 1 + P / σ² )

So, by the definition of information capacity, it is given as

C = (1/2) log2 ( 1 + P / σ² )   bits per transmission        (p)

Information Capacity Theorem

- With the channel used K times for the transmission of the K samples of the process X(t) in T seconds, the information capacity per unit time is (K/T) times the result given in equation (p).
- The number K equals 2BT, as given in equation (x).
- Hence, we may express the information capacity in the equivalent form:

  C = B log2 ( 1 + P / (N0 B) )   bits per second        (q)

We may now state Shannon's third (and most famous) theorem, the information capacity theorem.
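As a closing illustration (numbers chosen for the example, not from the slides), equation (q) can be evaluated directly:

```python
import math

def shannon_capacity(bandwidth_hz, signal_power, noise_psd):
    """C = B * log2(1 + P / (N0 * B)) bits per second, equation (q) above,
    with sigma^2 = N0 * B."""
    return bandwidth_hz * math.log2(1 + signal_power / (noise_psd * bandwidth_hz))

# A 3.4 kHz channel with a signal-to-noise ratio of 30 dB.
B = 3400.0
snr = 10 ** (30 / 10)                 # 30 dB -> 1000
print(B * math.log2(1 + snr))         # about 33.9 kb/s

# A 1 MHz channel where P / (N0 * B) = 100.
print(shannon_capacity(1e6, 1e-6, 1e-14))   # about 6.66 Mb/s
```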
