
Université Catholique de Louvain - Ecole polytechnique de Louvain

INGI 2348 - Information Theory and Coding

Information Theory and Coding

Part I :
Source Coding Engineering

Benoît MACQ
benoit.macq@uclouvain.be

With the help of: Guillaume Janssens, Benjamin Mathon, Marie-Gabrielle Wybou

English translation: Laurie Haustenne

Contents

Preface

References

I Syllabus

1 The numerical communication channel
  1.1 Sources and numerical channels
  1.2 Introduction to Coding
  1.3 Information theory
    1.3.1 The entropy, or the mean information
    1.3.2 Entropy of a binary variable and normalization of the entropy
    1.3.3 Bounds on the entropy of a random variable
    1.3.4 Channel capacity
    1.3.5 Other properties of the entropy
    1.3.6 Mutual information

2 Entropy coding
  2.1 Theoretical limits
  2.2 Coding memoryless sources
    2.2.1 Shannon-Fano's Code
    2.2.2 Huffman's Code
    2.2.3 Remarks about Shannon-Fano's and Huffman's Codes
    2.2.4 Adaptive Huffman's Code
  2.3 Dictionary Coding
  2.4 Universal Codings
    2.4.1 Arithmetic Coding
    2.4.2 UVLC: Universal Variable Length Coding

3 Quantization
  3.1 Scalar quantization
    3.1.1 Reminder on random sequences
    3.1.2 Noise features

4 Optimal rate-distortion compromise
  4.1 Coding gain
  4.2 Coding by linear transform
  4.3 Prediction Coding
  4.4 Hybrid coding


Preface

The course on Information Theory and Coding is divided into three parts. The first one deals with the efficient and secure representation of sources (messages). Some concepts of information theory are briefly discussed as well.

In the second part, we study how to encode the information in order to protect it against transmission errors, covering some theoretical elements together with their applications.

The third part presents the general principles and cryptographic tools used to protect the information (authenticity, confidentiality, integrity, . . . ). These last two parts are not included in these notes, and will be presented by J. Louveaux and O. Pereira, respectively.


References

1. HISTORICAL REFERENCES
In the early 1940s, Claude E. Shannon developed a mathematical theory, called information theory, dealing with the fundamental aspects of numerical communication systems. Two important and distinguishing characteristics of this theory are:

• the link between information and probability,

• the definition and performance of the encoder and decoder.

The first article about Information Theory and Coding appeared more than 60 years ago. Indeed, its reference, "A Mathematical Theory of Communication", was published in 1948 in the Bell System Technical Journal, vol. 27, pp. 379-423 and pp. 623-656.

2. BIBLIOGRAPHIC REFERENCES
The theoretical material of this course is notably based on:

• Information Theory and Reliable Communication, by Robert Gallager, published by John Wiley;

• Digital Coding of Waveforms, by Jayant and Noll, published by Prentice Hall;

• Théorie des codes, by Jean-Guillaume Dumas, Jean-Louis Roch, Eric Tannier and Sébastien Varrette, published by Dunod.


Part I

Syllabus


Chapter 1

The numerical communication channel

1.1 Sources and numerical channels


The aim of this course is to focus on the optimal representation of the messages sent by an emitter (sender) to a receiver through a transmission channel.

The sender produces messages, and is regarded as a source.

We will consider two different kinds of sources.

• discrete numerical sources, emitting a sequence of events X(0), X(1), . . . , X(n), . . . taking values in an alphabet of symbols {x0, x1, . . . , xk, . . . , xD−1} with given probabilities pk(n) = P(X(n) = xk).
These sources are, for example, the ASCII characters of a text file, or the output of a counter. An important challenge is to encode these sources without introducing any distortion: in other words, we wish the receiver to obtain exactly the same message that was sent by the emitter.

• the "waveforms", which are analog signals x(t) (representing for instance a voice, music or image signal). In that case, we accept that the received signal x′(t) may differ from x(t). The relation between the bit rate needed to represent the signal x(t) and the distortion in x′(t) is the topic of Chapter 4.

Note that waveform sources are most often digitized before being encoded. The digitization of a waveform takes two steps:

• the discretization consists in sampling the waveform at a suitable frequency. This frequency has to be at least equal to twice the highest frequency of the signal x(t). Conversely, if the sampling frequency is given, we remove from the signal x(t) every frequency component above half the sampling frequency. The signal x(t) is thus replaced by a sequence of real numbers xe(0), xe(1), . . . , xe(n), . . .


• the quantization consists in restricting the possible values of the samples xe(n) to an alphabet {x0, x1, . . . , xD−1}. By doing so, the signal is represented as a discrete numerical source Xe(0), Xe(1), . . . , Xe(n), . . .

The digitization of a waveform is illustrated by the following figure.

We can see that, because of the quantization, the reconstruction of the signal from the samples Xe(n) can't be perfect. However, as described in the rate-distortion problem of Chapter 4, the distortion introduced by the digitization of the signal is most often negligible compared to the distortions introduced by the rate-reducing coding (i.e. the compression).

The channel through which messages are sent to the receiver is in general an analog medium (a carrier current in a telephone pair, an electromagnetic wave in the air or a light wave in an optical fiber), and thus a necessary step consists in building an electric signal from the coder output: this operation is called modulation. Its aim is to maximize the occupation of the channel while minimizing the emitted power and the transmission errors.
The modulation and demodulation processes won't be discussed here; see the Telecommunications course for more details. From a practical point of view, the modulator, the physical channel and the demodulator together form a numerical channel. In this course, we will assume that the channels are stationary and memoryless, and simply characterized by an input-output relation as defined in the following figure.


Because of the perturbations on the physical channel and the limits of the modulation-demodulation operations, some errors can occur during the transmission of the discrete numerical events. These transmission errors can be characterized by a matrix containing the conditional probabilities

pi|j = P(X(n) = xi | Y(n) = xj)

Each element of this matrix gives the probability that the event X(n) = xi was sent on the channel, given that the event Y(n) = xj was actually received at the other end of the channel.
In the particular case where the transmission channel never introduces errors, this matrix is diagonal (pi|i = 1 and pi|j = 0 if i ≠ j).
We say that the channel is stationary if the pi|j are time-independent (with respect to the index n above) and independent of the discrete values X(n − k) sent previously. This is quite a restrictive assumption with regard to real systems (a cell phone, for example, may have time-varying transmission features when we drive a car, which you shouldn't do).
The memoryless binary symmetric channel has been widely studied in information theory:

This kind of channel has an error rate ε (typically 10−9 for optical fibers, 10−4 to 10−6 for satellite transmissions).

1.2 Introduction to Coding


The numerical channel (composed of the physical channel, the modulator and the demodulator) has three limitations:

a. its capacity R (for "Rate"), generally expressed in bits/s, is limited by a maximal value;

b. its security is limited too, because an enemy can listen to the messages (confidentiality problem between the emitter and the receiver), send messages himself under the identity of the real emitter (authentication problem), or modify the transmitted messages (integrity problem);

c. its reliability is limited, and is described by the error rate (ε for the binary symmetric memoryless channel).

Three kinds of coding were proposed to overcome these limitations:

a. source coding gives us a very compact representation of the message, requiring a rate that is in any case lower than the channel capacity;


b. cryptographic coding, which ciphers the message (confidentiality problem) and signs it (authentication and integrity problems), solves the security problem;

c. channel coding solves the transmission-error problem by error-correcting coding (or simply by detecting the errors).

In the first part of this course, we will assume that the channel coding turns the channel into a new channel whose rate is slightly lowered by the error-correcting codes, but which now provides an error-free transmission.
It's important to notice that cryptographic coding (or, in any case, ciphering) produces an output of pseudo-random signals from which it is very hard to recover the original message if we don't have access to the decoding process. The aim of source coding is essentially to reduce the rate, and it relies on the redundancy of the message. It is thus mandatory to execute the ciphering after the source coding: indeed, the coded source looks totally random, so there is no redundancy left.
The three kinds of coding are applied in the order illustrated in the following figure:

We will consider two kinds of source coding:

- lossless coding, where the number of bits representing the source depends on the redundancy of that source. In general, this coding is a variable-rate coding.

- lossy coding, used only for waveforms, is based not only on redundancy suppression but also on information suppression; this is why it is called lossy. In that case, it is possible to construct a fixed-rate coding (we suppress information to reach the desired rate). The loss of information introduces distortions in the reconstructed signal. The rate-distortion compromise for waveforms is studied in Chapter 4.


1.3 Information theory


The amount of information related to an event is inversely related to the probability of the event. So, the event "the sun shines in the desert" contains very little information. On the contrary, the event "the sun didn't rise this morning", being highly improbable, contains much more information.
Moreover, we would like the amount of information of two independent events to be measured by an additive function:

- let [X1(n) = xk] and [X2(m) = xℓ] be two events of two discrete numerical sources;

- let us assume that these two events are independent (P denotes the probability):

P([X1(n) = xk and X2(m) = xℓ]) = P([X1(n) = xk]) · P([X2(m) = xℓ])

The amount of information related to the event [X1(n) = xk and X2(m) = xℓ] is written I([X1(n) = xk and X2(m) = xℓ]) and must be such that

I([X1(n) = xk and X2(m) = xℓ]) = I([X1(n) = xk]) + I([X2(m) = xℓ])

The decreasing dependence of the amount of information on the probability, together with the additivity for independent events, leads to the choice of the function log(1/x):

I([X1(n) = xk]) = log(1 / P(X1(n) = xk)) = − log(P(X1(n) = xk))

1.3.1 The entropy, or the mean information


Let X(n) be a random variable issued from a discrete numerical source, taking values in the alphabet {x0, x1, . . . , xD−1} with the probabilities P(X(n) = xk) = pk.
The information related to the event [X(n) = xk] is

I(X(n) = xk) = − log pk

The mean information associated with the realizations of X(n),

H(X(n)) = Σ_{k=0}^{D−1} pk I(X(n) = xk) = − Σ_{k=0}^{D−1} pk log pk

is defined as the entropy of X(n). We can give several interpretations of the entropy:

• the mean amount of information that an event brings (a rare event brings more information
than a frequent one);

• the uncertainty about the outcome of an event (systems with one very frequent event have less entropy than systems with many equiprobable events);

• the dispersion of the probability distribution;

• the minimal number of binary digits needed on average to represent a message in a unique way (k binary digits can represent 2^k messages, and M messages require ⌈log2 M⌉ bits).
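
As a small illustration, here is a minimal Python sketch computing the entropy of a discrete distribution from its probabilities (the example distribution is the one used later for the Shannon-Fano code of Chapter 2):

import math

def entropy(p, base=2):
    """H = -sum(pk log pk) of a discrete distribution, with the convention 0 log 0 = 0."""
    return -sum(pk * math.log(pk, base) for pk in p if pk > 0)

# Example: the eight-symbol distribution used for the Shannon-Fano code of Chapter 2
p = [0.25, 0.25, 0.125, 0.125, 0.0625, 0.0625, 0.0625, 0.0625]
print(entropy(p))   # 2.75 bits per symbol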


Let us consider the random vector

(X(−N/2), X(−N/2 + 1), . . . , X(0), X(1), . . . , X(N/2))

issued from the same discrete numerical source as X(n), where every variable takes its values in the alphabet {x0, x1, . . . , xD−1} with the probabilities p0, p1, . . . , pD−1.
The probability of the event

[(X(−N/2) = x_{k−N/2}) and . . . and (X(0) = x_k) and (X(1) = x_{k′}) and . . . and (X(N/2) = x_{kN/2})]

where the indices k_i take values in [0, D − 1], is written

p_{k−N/2, . . . , k, k′, k″, . . . , kN/2}

Then we can measure the amount of information related to this realization of the random vector by

− log p_{k−N/2, . . . , k, . . . , kN/2}

and its entropy by

H([X(−N/2) and . . . and X(N/2)]) = − Σ_{k−N/2=0}^{D−1} . . . Σ_{k=0}^{D−1} . . . Σ_{kN/2=0}^{D−1} p_{k−N/2, . . . , k, . . . , kN/2} log p_{k−N/2, . . . , k, . . . , kN/2}

The entropy of the discrete numerical source . . . , X(0), X(1), . . . , X(m), . . . will be measured by

HX = lim_{N→∞} (1/(N + 1)) H([X(−N/2) . . . X(0) . . . X(N/2)])

If the source is composed of independent, identically distributed random variables, then

p_{k−N/2, . . . , k, . . . , kN/2} = p_{k−N/2} . . . p_k . . . p_{kN/2}

H([X(−N/2) . . . X(0) . . . X(N/2)]) = H(X(−N/2)) + . . . + H(X(0)) + . . . + H(X(N/2))

HX = H(X(m))

In most cases, numerical sources are not composed of independent variables (the variables are, however, most often identically distributed).
In practice, to measure the entropy, we apply a transform to the source in order to obtain independent variables, and then we measure the entropy of that representation.

1.3.2 Entropy of a binary variable and normalization of the entropy


Let us assume that the random variable X(n) takes two possible values {0, 1} with the probabilities p0 and p1 (clearly, p0 + p1 = 1). The entropy of this variable can then be written as

H(X(n)) = −p0 log p0 − p1 log p1 = −p0 log p0 − (1 − p0) log(1 − p0)


The function H(X(n)) (represented as a function of p0 in the figure above) takes its maximal value when p0 = p1 = 1/2 and is then equal to H(X(n)) = log(2). Indeed, at this point we have

−(1/2) log(1/2) − (1/2) log(1/2) = − log(1/2) = log(2)

In general, we normalize the logarithm used to measure the entropy in such a way that the maximal entropy of a binary variable takes the value 1 (expressed in bits). So, the logarithms are always taken in base 2:

log(·) ≡ log_x(·) / log_x(2), so that log(2) = 1
It's important to notice that, in the case of a random binary variable, the entropy is maximal for p0 = p1 = 1/2, while the amount of information related to the event [X(n) = 0] is, on the contrary, larger when p0 decreases. However, when p0 is small, [X(n) = 1] contains very little information, and the event [X(n) = 0] is very rare and so doesn't contribute very much to the mean information. Moreover, a certain event (P[X(n) = 0] = p0 = 0 or 1) contains absolutely no information.
Conclusion: the binary variable taking the value 1 if it is raining and 0 if it is not has a weak entropy in the desert but a maximal entropy in Belgium.
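
A minimal Python sketch of the binary entropy function, to check this behaviour numerically:

import math

def binary_entropy(p0):
    """H(p0) = -p0 log2 p0 - (1 - p0) log2(1 - p0), with 0 log 0 = 0."""
    if p0 in (0.0, 1.0):
        return 0.0
    return -p0 * math.log2(p0) - (1 - p0) * math.log2(1 - p0)

for p0 in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(p0, round(binary_entropy(p0), 3))
# The entropy peaks at 1 bit for p0 = 0.5 and vanishes for certain events.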

1.3.3 Bounds on the entropy of a random variable


Theorem:
Let X(n) be a random variable taking values in the alphabet {x0, x1, . . . , xD−1}.
We can show that

H(X(n)) ≤ log D

and that H(X(n)) = log D if and only if p0 = p1 = . . . = pD−1 = 1/D.

Demonstration:
Using the definition of the entropy, the fact that the probabilities sum to 1, and the identity log(x) + log(y) = log(xy), we have

H(X(n)) − log D = − Σ_{k=0}^{D−1} pk log pk − Σ_{k=0}^{D−1} pk log D = Σ_{k=0}^{D−1} pk log(1/(pk D))


If we look at the following figure, it's easy to observe that ln z ≤ z − 1 for all z > 0, with equality when z = 1.

The following inequality is derived (using the change of base log a = log(e) · ln a):

H(X(n)) − log D = log(e) Σ_{k=0}^{D−1} pk ln(1/(D pk))
                ≤ log(e) Σ_{k′} pk′ (1/(D pk′) − 1)

where k′ runs over the indices for which pk′ ≠ 0. So we find the desired relation, because

H(X(n)) − log D ≤ log(e) [ Σ_{k′} 1/D − Σ_{k′} pk′ ] = log(e) [ Σ_{k′} 1/D − 1 ] ≤ 0

since, if pk ≠ 0 for all k, Σ_{k′} 1/D = 1, and otherwise Σ_{k′} 1/D < 1.

Besides, it's clear that if p0 = p1 = . . . = pD−1 = 1/D, then H(X(n)) = log D, because

H(X(n)) − log D = log(e) Σ_{k=0}^{D−1} pk ln(1/(D pk)) = log(e) Σ_{k=0}^{D−1} (1/D) ln(1) = 0

and the reciprocal implication is easy to prove. This concludes the demonstration. □

This bound tells us that the amount of information contained in X(n) is at most equal to the number of bits necessary for its natural representation. Indeed, this bound can intuitively be explained by the fact that the worst we can do is to assign log2 D bits to each value. So, a sequence of ASCII characters coded with 8 bits (256 possible values) has an entropy of at most 8 bits. We will see in Chapter 2 how to modify the coding of the random variables in order to get a representation whose rate is closer to the entropy.


1.3.4 Channel capacity


Paragraph 1.3.1 focused on the amount of information related to a random variable X(n) issued from a discrete numerical source. The extension to the amount of information of the source is quite immediate for sources of independent identically distributed (i.i.d.) variables.
The concept of amount of information can be very useful to describe the capacity of a numerical channel. In this case, we must consider two variables: X(n) is the random variable that enters the channel, and Y(n) is the output of the channel (a random variable too!).
X(n) and Y(n) take their values in the alphabet {x0, x1, . . . , xD−1} and the probabilities P(X(n) = xk) = pX(k) are known.
The channel's behaviour can be described by a matrix of conditional probabilities: P(X(n) = xk | Y(n) = xℓ) = pk|ℓ gives the probability that the event [X(n) = xk] was actually sent through the channel when we observe the event [Y(n) = xℓ] at the receiver. In the case of a perfect channel (without any error), this matrix is diagonal: pk|ℓ = 1 if k = ℓ, 0 otherwise.
The probabilities P(Y(n) = xℓ) = pY(ℓ) can directly be computed by

pY(ℓ) = Σ_{k=0}^{D−1} pX(k) P(Y(n) = xℓ | X(n) = xk)

using Bayes' rule, which states that P(B) P(A | B) = P(A and B) = P(A) P(B | A).


With these conditional probabilities, we can measure the amount of uncertainty (or information) remaining about the random variable X(n) when we observe the event [Y(n) = xℓ] at the receiver, by the conditional entropy

H(X(n) | [Y(n) = xℓ]) = − Σ_{k=0}^{D−1} pk|ℓ log2 pk|ℓ

The mean of the conditional entropy gives a measure of the ambiguity between the random variable at the input and the (known) random variable at the output. The formula is:

H(X(n) | Y(n)) = − Σ_{ℓ=0}^{D−1} Σ_{k=0}^{D−1} pℓ pk|ℓ log2 pk|ℓ = − Σ_{ℓ=0}^{D−1} Σ_{k=0}^{D−1} pk,ℓ log2 pk|ℓ

Shannon has shown that the capacity of a channel is equal to the maximal value of the quantity H(X(n)) − H(X(n) | Y(n)) over all probability distributions of x0, x1, . . . , xD−1 at the input of the channel. This quantity is called the mutual information and will be presented in detail in a later paragraph.
We see immediately that, in the case of a perfect channel, H(X(n) | Y(n)) = 0. Moreover, if the input distribution is pX(k) = 1/D for all k, then the capacity is log D.
In the case of a binary symmetric channel with an error rate ε, we have p0|0 = p1|1 = 1 − ε and p0|1 = p1|0 = ε.


Theorem:
We can show that the capacity of the binary symmetric channel, obtained with the input distribution pX(0) = pX(1) = 1/2, is equal to

1 + ε log2 ε + (1 − ε) log2(1 − ε)

Moreover, if there isn't any perturbation, the capacity is 1.
Demonstration:
Indeed, on the one hand,

H(X(n)) = − Σ_{k=0}^{1} pk log2 pk = 2 · (−(1/2) log2(1/2)) = − log2(1/2) = log2 2 = 1

and on the other hand

H(X(n) | Y(n)) = − Σ_{ℓ=0}^{1} Σ_{k=0}^{1} pℓ pk|ℓ log2 pk|ℓ
               = − (p0 p0|0 log2 p0|0 + p0 p1|0 log2 p1|0 + p1 p0|1 log2 p0|1 + p1 p1|1 log2 p1|1)
               = −2 · (1/2) · ((1 − ε) log2(1 − ε) + ε log2 ε)
               = −(1 − ε) log2(1 − ε) − ε log2 ε

so

H(X(n)) − H(X(n) | Y(n)) = 1 + ε log2 ε + (1 − ε) log2(1 − ε)

and it's clear that if ε = 0 (no perturbation), we get 1, because 0 · log2 0 = 0. □
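
A minimal Python sketch of this capacity formula:

import math

def bsc_capacity(eps):
    """Capacity 1 + eps*log2(eps) + (1-eps)*log2(1-eps) of the binary symmetric channel."""
    if eps in (0.0, 1.0):
        return 1.0
    return 1 + eps * math.log2(eps) + (1 - eps) * math.log2(1 - eps)

for eps in (0.0, 1e-4, 0.1, 0.5):
    print(eps, round(bsc_capacity(eps), 4))
# A noiseless channel carries 1 bit per use; at eps = 0.5 the capacity drops to 0.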


The coding techniques that allow us to approach the capacity of the channel will be discussed in the second part of this course (channel coding); they won't be studied in this part.
It's important to understand the interest of using entropy concepts to describe sources on the one hand, and to describe the numerical channel on the other hand.¹

¹ Shannon has continued his work by showing the link between the physical features of a channel and its capacity. He has demonstrated that the capacity of a channel of bandwidth B, perturbed by a Gaussian white noise of power N, and with an emission power S, is given by C = B log2(1 + S/N).

1.3.5 Other properties of the entropy


Theorem:
Let X be a source related to the source Y. We have H(X) ≥ H(X|Y). Moreover, if X and Y are independent, H(X) = H(X|Y).
Demonstration:
This inequality means that Y can only give additional information about X: learning additional information cannot increase our uncertainty about X.
More formally,

H(X) − H(X|Y) = − log2(e) Σ_{x,y} px,y ln(px / px|y)
             ≥ − log2(e) Σ_{x,y} px,y (px / px|y − 1)

using the inequality seen in 1.3.3: ln z ≤ z − 1.

 
H(X) − H(X|Y) ≥ − log2(e) Σ_{x,y} px,y (px / px|y − 1)
             = log2(e) Σ_{x,y} px,y (1 − px py / px,y)     (by Bayes' rule, px / px|y = px py / px,y)
             = log2(e) ( Σ_{x,y} px,y − Σ_{x,y} px py )
             = log2(e) ( 1 − Σ_x px Σ_y py )
             = 0

Besides, if X and Y are independent, we have px|y = px, so

H(X) − H(X|Y) = − log2(e) Σ_{x,y} px,y ln(px / px|y) = − log2(e) Σ_{x,y} px,y ln 1 = 0

Theorem:
We now focus on the joint entropy: H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y). Intuitively, in order to describe X and Y, we first describe X and then Y given X.
Demonstration:

H(X, Y) = − Σ_{x,y} px,y log2 px,y
        = − Σ_{x,y} px,y log2 (px py|x)
        = − Σ_x px log2 px Σ_y py|x − Σ_{x,y} px,y log2 py|x
        = H(X) + H(Y|X)

where we replaced px,y by px py|x and used Σ_y py|x = 1. □


Corollary:
A direct consequence of this theorem is the following equality:

H(X) − H(X|Y) = H(Y) − H(Y|X)

Corollary:
In the same way, we deduce that

H(X, Y, Z) = H(X, Y) + H(Z|X, Y)
           = H(X) + H(Y|X) + H(Z|X, Y)
           ≤ H(X) + H(Y) + H(Z)


1.3.6 Mutual information


Definition. The mutual information (MI) gives a measure of the (possibly non-linear) dependence between two variables. We define it as:

I(X; Y) = Σ_x Σ_y px,y log( px,y / (px py) )

where px,y is the joint probability, and px and py are the marginal probabilities. When we use the logarithm in base 2, the unit of the MI is the bit.
We can also express the mutual information by using the entropy:

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)

The last equation tells us that the MI is equal to zero when the two variables X and Y are totally independent (H(X) = H(X|Y) and H(Y) = H(Y|X)). On the contrary, if X and Y are equal, the MI is maximal (H(X|X) = 0). So, the MI is a non-negative quantity that is always lower than or equal to the entropy of the random variable. Intuitively, no variable can give us more information about X than X itself (I(X; Y) ≤ I(X; X)).
The MI is most often expressed with the joint entropy H(X, Y) rather than with the conditional entropies:

I(X; Y) = H(X) + H(Y) − H(X, Y)

Interpretation. As said before, the mutual information gives a measure of a channel's capacity. Moreover, it represents the shortening of the description of X when we know Y, or the amount of information that Y gives about X.
In practice, the probability distribution px is given by the histogram of the realizations of X, and px,y is computed by using the joint histogram, which gives the number of occurrences of each pair of values (x, y). If X can take m different values and Y can take n different values (the lengths of px and py are respectively m and n), the size of px,y will be m × n.
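
A minimal Python sketch of this computation, assuming the observations are given as two paired sequences of discrete values:

import numpy as np

def mutual_information(x, y):
    """Estimate I(X;Y) in bits from paired samples, using the joint histogram."""
    x = np.asarray(x)
    y = np.asarray(y)
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (xi, yi), 1)                  # joint histogram of the pairs (x, y)
    pxy = joint / joint.sum()                      # joint distribution
    px = pxy.sum(axis=1, keepdims=True)            # marginal of X
    py = pxy.sum(axis=0, keepdims=True)            # marginal of Y
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

# Sanity checks: I(X;X) = H(X), and I(X;Y) is close to 0 for independent variables.
x = np.random.randint(0, 4, 10000)
print(mutual_information(x, x))                               # about 2 bits
print(mutual_information(x, np.random.randint(0, 4, 10000)))  # close to 0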

EXAMPLE 1: Let X be a simple source emitting through a noisy channel.

The MI between Y and X is a measure of the original information remaining in Y. We can also express the channel's capacity, which gives the maximal quantity of information that can be sent through the channel, as follows:

C = max_{px} I(X; Y)

EXAMPLE 2: Let X be a simple source emitting through 2 different noisy channels.

At the receiver, Y and Z have different entropies. The MI between Y and Z represents the information of X that is still present in both Y and Z.


EXAMPLE 3: Let X and Y be two different observations of the same phenomenon, for example two photographs of the same object. In that case, the two variables X and Y should contain similar information (H(X, Y) = H(X) = H(Y) in the best case). However, the capture channels have different features (time, exposure, object position, ...), so the MI decreases after the capture.

If we have two images of the same object in different positions, the entropies H(X) and H(Y) are equal (same object), but the change in position reduces the MI, because the joint entropy increases. Indeed, as shown below, the joint histogram becomes wider when the images are not taken in the same way (i.e. when the ball is seen from another point of view).


MI as a distance. In the following equation, D has the properties of a distance (positive definiteness, symmetry, triangular inequality) and is used in many applications:

D(X, Y) = 1 − I(X; Y) / max(H(X), H(Y))

Applications. Let us see two classical applications.

First, Independent Component Analysis (ICA): we define the MI between the components of a decomposition of the variable X into a set of n variables as

I(X1; . . . ; Xn) = Σ_{i=1}^{n} H(Xi) − H(X)

We can find the best decomposition of X into independent components by minimizing this expression, which is a measure of the information shared by the different components.

Another application is image registration: the aim of this method is to find the best alignment (matching) between two images. When we take multiple captures of the same object or of the same scene, it is convenient to place all the images in the same spatial reference frame, in order to analyse the pixels of the different images jointly.
In the medical sector, for example, different modalities are used. We may want to see a CT image (X-rays) and a PET image (Positron Emission Tomography) at the same time, in order to compare them. However, the two images are not acquired with the same equipment, they aren't in the same spatial reference frame, and we don't know the matching between the pixels. The aim of registration is to align one of the images (called the moving image) on the other (called the fixed image).


The distance D(X, Y) between two images can be used to find the best transformation between these images. Indeed, the parameters for which the distance is the lowest will maximize the information shared by both images, giving the transformation with the best matching.

Then, knowing the transformation, we can apply it to the moving image in order to get the fusion of the different modalities.

Properties of the MI. Let's recall some properties we have already seen:

• I(X; Y) = I(Y; X): the MI is symmetric
• I(X; X) = H(X)
• I(X; Y) ≤ H(X) and I(X; Y) ≤ H(Y): the information given by a variable about another one can't be greater than the information contained in this variable itself
• I(X; Y) ≥ 0
• if X and Y are independent, I(X; Y) = 0

Then, we define the conditional MI from the corresponding conditional entropies:

I(X; Y|Z) = H(X|Z) − H(X|Y, Z)
Finally, let's conclude with this theorem:
Theorem:
Assume that the variables X, Y and Z form a Markov chain X → Y → Z, i.e. Z is independent of X given Y, meaning that px,y,z = px py|x pz|y (X → py|x → Y → pz|y → Z). Then I(X; Y) ≥ I(X; Z).
Demonstration:

I(X; Y, Z) = H(Y, Z) − H(Y, Z|X) = H(X) − H(X|Y, Z)
           = H(Y) + H(Z|Y) − H(Y|X) − H(Z|Y, X) = H(X) − H(X|Z) + H(X|Z) − H(X|Y, Z)
           = I(X; Y) + I(X; Z|Y) = I(X; Z) + I(X; Y|Z)

But I(X; Z|Y) = 0 because of the conditional independence, and I(X; Y|Z) ≥ 0. So, I(X; Y) ≥ I(X; Z). □


Chapter 2

Entropy coding

The goal of this chapter is to propose coding methods for discrete numerical sources that minimize the number of bits required to encode them. The chosen strategy is to employ variable-length codes, associating short words with the most probable events. Let's illustrate this with an example. Let's consider the realizations of several variables of a discrete numerical source:

0 0 0 −1 0 0 2 0 0 1 0 0 0

We could encode these variables with a fixed-length code, by allocating 3 bits to each variable; this would take 39 bits to encode the 13 variables we observed:

000 000 000 101 000 000 010 000 000 001 000 000 000
 0   0   0  -1   0   0   2   0   0   1   0   0   0

We can improve this with a variable-length coding, such that the code length increases as the probability of the event decreases:

[X(i) = 0]   is encoded by   0
[X(i) = 1]   is encoded by   1 0
[X(i) = −1]  is encoded by   1 1 0
[X(i) = 2]   is encoded by   1 1 1 0
[X(i) = −2]  is encoded by   1 1 1 1

[X(i) = xk] is encoded by the code word Ck, of length Lk.

We then get the representation


0 0 0 110 0 0 1110 0 0 10 0 0 0
0 0 0 -1 0 0 2 0 0 1 0 0 0
of 19 bits instead of 39 bits.
The variable-length code must be uniquely decodable: the code C is uniquely decodable over an alphabet V = {0, 1} if and only if, for every sequence x = x1, x2, . . . , xM, xi ∈ V, there exists at most one sequence of code words c = c1, c2, . . . , cm, ci ∈ C, whose concatenation equals x.
A code C has the prefix property if none of the code words Ck is the beginning of another code word. Let's notice that if all the code words have the same length, then they automatically present this property. A code having the prefix property is uniquely decodable, but the opposite is false in general. For example, C = {0, 01} is always decoded in a unique way but doesn't satisfy the prefix condition: if we receive the sequence 0001, we decompose it as 0|0|01, because we know that there is always a 0 before a 1. This code is thus decoded in a unique way.


On the contrary, with C = {0, 01, 001}, if we receive the sequence 001001, we can parse it in two different ways: 0|01|001 or 001|001. This last code is not uniquely decodable.
With a prefix code, the decoder can read the compressed stream bit by bit, and simply output an event [X(i) = xk] each time a code word is completed.
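
A minimal Python sketch of such a bit-by-bit prefix decoder, using the variable-length code of the introductory example:

# Prefix code of the introductory example: value -> code word
code = {0: "0", 1: "10", -1: "110", 2: "1110", -2: "1111"}
decode_table = {cw: v for v, cw in code.items()}

def encode(values):
    return "".join(code[v] for v in values)

def decode(bits):
    values, word = [], ""
    for b in bits:                  # read the compressed stream bit by bit
        word += b
        if word in decode_table:    # a complete code word has been recognized
            values.append(decode_table[word])
            word = ""
    return values

bits = encode([0, 0, 0, -1, 0, 0, 2, 0, 0, 1, 0, 0, 0])
print(bits, len(bits))              # 19 bits, as in the example
print(decode(bits))                 # recovers the original sequence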
The features of variable-length codes can be easily understood with this example.

1. Variable-length coding gives a variable rate: the rate increases when improbable events occur, i.e. when there is much information. Fax transmission is a good example: the page rate is slowed down when there is information, in order to adapt the rate increase to the channel capacity.

2. Variable-length coding is very weak in the presence of transmission errors. An error in the compressed stream desynchronizes the decoding. This is why a variable-length coding has to be followed by an efficient channel coding, with an error-correcting code, and above all by an efficient resynchronization strategy (regular insertion of synchronization words).

3. The use of variable-length coding and decoding requires memories for the variable-length code words, and efficient mechanisms for the synchronization and buffering of the coder/decoder.

2.1 Theoretical limits


We know that the information given by a single bit is maximal when the probability that the bit equals 0 is the same as the probability that it equals 1 (see Chapter 1).
Let X(n) be a random variable taking its values in the alphabet {x0, x1, . . . , xD−1}. We associate the code word Ck of length Lk with the event [X(n) = xk]. For example, in the introduction of this chapter, the event [X(n) = 2] was represented by the code word Ck = 1110, of length Lk = 4. This code is optimal if every bit in it gives a maximum of information, i.e. indicates a binary event such that p0 = p1 = 1/2.
This condition is satisfied only if

pk = P[X(n) = xk] = (1/2) · (1/2) · . . . · (1/2)   (Lk factors)   = (1/2)^Lk

or, in other words, if Lk = − log2 pk.


The mean rate of a code is given by

R = Σ_{k=0}^{D−1} pk Lk

If the code is optimal, then

R = − Σ_{k=0}^{D−1} pk log2 pk = H(X(n))

We see that the lower limit for the mean rate of a variable-length code for a random variable X(n) is given by the entropy H(X(n)).


The efficiency (in the sense of return) of a code is given by η = H/R and is always at most equal to 1.
If we wish to study the theoretical limits of the encoding of a discrete numerical source . . . , X(0), X(1), . . . , X(n), . . ., we must encode vectors (X(−N/2), . . . , X(0), . . . , X(N/2)) of probabilities p_{k−N/2, . . . , k, . . . , kN/2}. Each of the events

[X(−N/2) = x_{k−N/2} and . . . and X(0) = x_k and . . . and X(N/2) = x_{kN/2}]

is encoded by a code word C_{k−N/2, . . . , k, . . . , kN/2} that is optimal if its length is such that

L_{k−N/2, . . . , k, . . . , kN/2} = − log2(p_{k−N/2, . . . , k, . . . , kN/2})

and we can show that the mean rate tends to HX when N tends to infinity.
In general, we apply an entropy coding to random variables that have been made more or less independent, in such a way that we can encode each of the variables independently (HX = H(X(n))). An optimal individual code for X(n) will then yield an optimal code for the vectors of the source.
The coding strategy for sources constituted of independent events X(n) is to build binary tree structures, where each bifurcation of the tree corresponds to a binary event that is as equiprobable as possible.
Let's notice that the condition Lk = − log2(pk) is in general impossible to satisfy, because − log2(pk) is very rarely an integer. In particular, if one of the events [X(n) = x0] is very probable, then − log2(p0) will be much lower than 1.
In that case, we must construct vectors of a certain length in order to obtain well-conditioned probabilities. We can show that even with sources of independent events, it is always more advantageous to encode vectors than individual variables.

2.2 Coding memoryless sources


Let . . . , X(n − 1), X(n), X(n + 1), . . . be a discrete numerical source whose events take their values in an alphabet of symbols {x0, x1, . . . , xD−1} with the probabilities pk = P(X(n) = xk).
We assume that these sources are memoryless, i.e. the events X(n) and X(n + m) are independent for every m ≠ 0.
In that case, an efficient entropy coding consists in encoding each of the events [X(n) = xk] by a code word Ck of length Lk. Two ways of constructing codes are proposed:

• Shannon-Fano's method

• Huffman's method

Both methods construct codes Ck where each bit corresponds to an (approximately) equiprobable binary event. So, we associate a cut of the alphabet with each bit, such that:

alphabet = alphabet 1 ∪ alphabet 2

and

P(alphabet 1) := P(X(n) = xi such that xi ∈ alphabet 1) ≃ P(alphabet 2) := P(X(n) = xj such that xj ∈ alphabet 2)


2.2.1 Shannon-Fano’s Code


The Shannon-Fano coding is a direct application of this principle. The initial alphabet {x0, x1, . . . , xD−1} is divided into two sets (a partition) S0 = {xi, xj, . . .} and S1 = {xk, xℓ, . . .} with approximately equal probabilities p(S0) ≃ p(S1) ≃ 1/2. The two sets S0 and S1 can in turn be divided into new sets (a new partition) S00 and S01, and S10 and S11, with the same probabilities

p(S00) = p(S01) = p(S10) = p(S11) = 1/4

The process continues in this way and stops when every subset has only one element. See the example below, where each bit of the code word Ck records the successive binary cuts of the alphabet.

Values of X(n)   Probability P(X(n) = xk)   Code word Ck   Length Lk
x0               0.25                        00             2
x1               0.25                        01             2
x2               0.125                       100            3
x3               0.125                       101            3
x4               0.0625                      1100           4
x5               0.0625                      1101           4
x6               0.0625                      1110           4
x7               0.0625                      1111           4
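
A minimal recursive Python sketch of this construction (the cut is chosen greedily so that the two sub-alphabets have probabilities as close as possible to each other):

def shannon_fano(symbols):
    """symbols: list of (symbol, probability) pairs sorted by decreasing probability.
    Returns a dict {symbol: code word}."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(p for _, p in symbols)
    # choose the cut that makes the two sub-alphabets as equiprobable as possible
    acc, best, split = 0.0, float("inf"), 1
    for i in range(1, len(symbols)):
        acc += symbols[i - 1][1]
        if abs(total - 2 * acc) < best:
            best, split = abs(total - 2 * acc), i
    left, right = symbols[:split], symbols[split:]
    code = {s: "0" + c for s, c in shannon_fano(left).items()}
    code.update({s: "1" + c for s, c in shannon_fano(right).items()})
    return code

probs = [("x0", 0.25), ("x1", 0.25), ("x2", 0.125), ("x3", 0.125),
         ("x4", 0.0625), ("x5", 0.0625), ("x6", 0.0625), ("x7", 0.0625)]
print(shannon_fano(probs))   # reproduces the table above: x0 -> 00, x1 -> 01, ..., x7 -> 1111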

2.2.2 Huffman’s Code


The central idea of Huffman's code is the following: if we sort the alphabet {x0, x1, . . . , xD−2, xD−1} by decreasing order of probability, then we can assume that pD−2 ≃ pD−1. Because of the sorting, the two least probable events have a good chance of having close probabilities. We use one bit to distinguish xD−2 from xD−1. If the probabilities are equal, this bit is a 0 as often as a 1, and so the code uses this bit with maximum entropy. We then create a new alphabet of length D − 1 where the last element is the event

(X(n) = xD−2 or X(n) = xD−1) = x′D−2

We obtain an alphabet {x′0, x′1, . . . , x′D−2} where the last element, x′D−2, has the probability p′D−2 = pD−2 + pD−1, which may no longer be the lowest probability. In order to reiterate the process, we must sort again by decreasing order of probability. Then, given the new set {x″0, x″1, . . . , x″D−3, x″D−2}, we again group the two last elements, hoping their probabilities are quite close. We reiterate this process until there are only two events left.
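
A minimal Python sketch of this construction using a priority queue; it only computes the code word lengths. The probabilities are those of the example below, and the tie-breaking may differ from the tree shown there without changing the mean rate:

import heapq, itertools

def huffman_lengths(probs):
    """Code word length of each symbol in a Huffman code for the given probabilities."""
    counter = itertools.count()                 # tie-breaker so the heap never compares lists
    heap = [(p, next(counter), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)         # the two least probable groups
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:                       # one more bit for every symbol of the merged group
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next(counter), s1 + s2))
    return lengths

probs = [0.3, 0.25, 0.15, 0.15, 0.10, 0.05]
L = huffman_lengths(probs)
print(L, sum(p * l for p, l in zip(probs, L)))  # mean rate 2.45 bits/symbol, as computed below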


The code obtained in this way for the example (probabilities 0.3, 0.25, 0.15, 0.15, 0.10, 0.05) can be represented as a tree in which we find the code related to each si: s1 = 00, s2 = 10, s3 = 11, s4 = 010, s5 = 0110, s6 = 0111.

We can observe that this coding satisfies the prefix condition and is an entropy coding; that will always be the case with Huffman's code.
In this example, the probabilities of the two least probable events are not equal, and so the bits are not used optimally. The rate and the entropy are therefore different. However, they are quite close, because

R = Σ_k pk Lk = 0.3 · 2 + 0.25 · 2 + 0.15 · 2 + 0.15 · 3 + 0.10 · 4 + 0.05 · 4 = 2.45

H = − Σ_k pk log2 pk = −0.3 log2 0.3 − 0.25 log2 0.25 − 2 · 0.15 log2 0.15 − 0.1 log2 0.1 − 0.05 log2 0.05 = 2.39

which gives a good efficiency, close to 1: η = H/R = 0.975 ≤ 1.

2.2.3 Remarks about Shannon-Fano’s and Huffman’s Codes


Efficiency. The efficiency of this kind of method depends on the probability distribution of the source to encode. If this distribution allows cuts between sub-alphabets that are strictly equiprobable (this was the case in the example of the Shannon-Fano code), then the efficiency of the coding will be equal to 1. This condition (not satisfied in the example of Huffman's code) corresponds to probability distributions such that pk = (1/2)^j with j an integer.
Let's notice that Huffman's method is optimal for a given probability distribution (there isn't any other variable-length coding yielding a better efficiency).
A very bad situation for the efficiency occurs when one of the probabilities, p0, is much larger than 0.5. In that case, we can create new events that are the result of the vectorization of several variables.
For instance, by concatenating two variables X(n) and X(n + 1), we get a new alphabet

{x0x0, x0x1, . . . , x0xD−1, x1x0, x1x1, . . . , x1xD−1, . . . , xD−1xD−1}

with probabilities that are simply (memoryless source)

p0p0, p0p1, . . . , p0pD−1, p1p0, p1p1, . . . , p1pD−1, . . . , pD−1pD−1

So, if p0 = 0.7, a bit will be used at the highest level to separate two events with probabilities 0.7 and 0.3. On the contrary, by creating a vector of two variables, the most probable event has probability p0p0 = 0.49 and the situation is better, as illustrated below.
We can easily show that even in the case of memoryless sources, we always increase the efficiency of a variable-length code by vectorization.
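
A small numerical illustration of this effect for a binary source with p0 = 0.7 (the pair code lengths below are one valid Huffman solution for the pair probabilities):

import math

p0, p1 = 0.7, 0.3
H = -p0 * math.log2(p0) - p1 * math.log2(p1)          # entropy: about 0.881 bit/symbol

# Coding single symbols: the only possible binary code uses 1 bit per symbol.
R1 = 1.0

# Coding pairs of symbols (memoryless source, so product probabilities):
pairs = [p0 * p0, p0 * p1, p1 * p0, p1 * p1]           # 0.49, 0.21, 0.21, 0.09
lengths = [1, 2, 3, 3]                                  # a Huffman code for these pair probabilities
R2 = sum(p * l for p, l in zip(pairs, lengths)) / 2     # bits per original symbol

print(H, R1, R2)   # 0.881 <= 0.905 < 1.0: vectorization brings the rate closer to the entropy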


When an event x0 has a very high probability, we can do a simple vectorization by coding a run x0 x0 . . . x0 as a whole and then coding the occurrences of the other values word by word. This kind of coding is called run-length coding, and is essential to encode the most probable value of binary events, like the series of pixels (0 or 1) of a (black and white) page to be transmitted by fax.
We will come back to this point in the description of the entropy coding of the coefficients of a linear transform.

Implementation. When we have to carry out the entropy coding of a memoryless source, we must first estimate the probabilities pk. This estimation can be done by computing the frequencies observed over a large number of variables X(n):

pk ≃ nk / Ntot

where nk is the number of observations of X(n) = xk and Ntot is the total number of observations.
While coding, it's important to keep in mind that if the probability distribution changes, we must update the variable-length code and send it to the decoder before sending the compressed words. There exist many adaptive Huffman codes; we will see one of them.

2.2.4 Adaptive Huffman's Code


Contrary to the static Huffman code, the dynamic (adaptive) coding doesn't choose the symbol distribution in advance. This distribution is built gradually from the symbols already processed. So, we read the message little by little and encode it gradually, while constructing the corresponding distribution.
To do so, we introduce a character (@ for instance) that doesn't belong to the alphabet and that is emitted each time a new character appears. It is used to say "a new ASCII code (8 bits) follows for a character seen for the first time!"
Here is the algorithm for the coding of a message:

→ Initialize the Huffman tree (HT) with @
We note nb(c) the frequency of the character c
We initialize nb(@) to 1
WHILE (c != EOF (end of file))
{ get_char(c)
  IF first occurrence of c
  { nb(c) ← 0
    nb(@) ← nb(@) + 1
    Output the code of @ in the HT, followed by the ASCII code of c }
  ELSE
    Output the code of the character c in the HT
  END IF
  nb(c) ← nb(c) + 1
  Update the HT }
END WHILE


The weight of the node (in the tree) that represents the last encoded symbol is increased, and the corresponding part of the tree is adapted. Consequently, the tree gets closer and closer to the current symbol distribution. Note, however, that the tree depends on the past and doesn't reflect the exact current distribution.
Let's see an example. Assume we want to encode the sequence: aaa aaa aaa bcd bcd bcd. The following image illustrates the gradual construction of the code.


We notice that the Huffman tree must be updated once more after the coding of the last character. Indeed, as shown in the image, the character @ is higher in the tree than the characters c and d, though @ has a lower frequency (5) than c and d (6). So, at this point, the tree no longer satisfies the rule stating that the more frequent the characters are, the shorter their codes must be. This is why we update the tree.
The advantage of the adaptive Huffman code is that we don't need to transmit the Huffman tree, because it can be deduced from the message. The decoding proceeds by gradually reconstructing the Huffman tree. Let's look again at the sequence aaa aaa aaa bcd bcd bcd. With the "classical" Huffman code, it would take 33 bits to encode the sequence plus 41 bits to encode the Huffman tree (ASCII characters + code of each character), i.e. 74 bits, while with the dynamic Huffman code, 67 bits are enough (the code of the first character @ is assumed to be known by the decoder).

2.3 Dictionary Coding


The dictionary coding uses a pointer that scans the message to encode. When the pointer encounters a character it has never seen before, it adds this character to the dictionary. On the contrary, if it sees again a character or a sequence of characters it has already seen, it outputs the position of the previous occurrence of the sequence instead of doing all the work again. So, we have a scanning window that slides from left to right, composed of two parts: the dictionary and a look-ahead buffer.
At each iteration, the algorithm looks for the longest repeated factor. This factor is then coded by a triplet (i, j, c) where:

• i represents the distance between the start of the buffer and the position of the repetition;

• j is the length of the repetition;

• c gives the first character of the buffer that differs from the corresponding sequence in the dictionary.

The following example illustrates the dictionary coding of the sequence

1 2 3 4 5 6 7 8 9
A A B C B B A B C

Step   Position   Match   Character   Output      Coded part | remaining buffer
1      1          -       A           (0, 0, A)   A | ABCBBABC
2      2          A       B           (1, 1, B)   AAB | CBBABC
3      4          -       C           (0, 0, C)   AABC | BBABC
4      5          B       B           (2, 1, B)   AABCBB | ABC
5      7          AB      C           (5, 2, C)   AABCBBABC |

Let's notice that we can't use a repetition to encode the last character of the message (here, we could otherwise have used the full repetition of ABC); it must instead be covered by the third field of the output.
Another remark is that if we limit the number of bits allocated to the position coding, we restrict the search range for repetitions.
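
A minimal Python sketch of this triplet encoder (an LZ77-style scheme; window-size limits are ignored for simplicity):

def lz77_encode(msg):
    """Encode msg as a list of (distance, length, next character) triplets."""
    out, pos = [], 0
    while pos < len(msg):
        best_d, best_l = 0, 0
        # longest factor starting at pos that already starts in the dictionary msg[:pos]
        for start in range(pos):
            l = 0
            # never let a match swallow the last character: it is sent as the literal field
            while pos + l < len(msg) - 1 and msg[start + l] == msg[pos + l]:
                l += 1
            if l > best_l:
                best_d, best_l = pos - start, l
        out.append((best_d, best_l, msg[pos + best_l]))
        pos += best_l + 1
    return out

print(lz77_encode("AABCBBABC"))
# [(0, 0, 'A'), (1, 1, 'B'), (0, 0, 'C'), (2, 1, 'B'), (5, 2, 'C')], as in the table above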

2.4 Universal Codings


A universal coding is a coding applied to vectors of a source [X(n − N/2), . . . , X(n), . . . , X(n + N/2)] whose efficiency tends to 1 when the vector's length increases, whatever the statistical properties of the source, provided that they are stationary. The universal codes used in practice are based on computed codes (as opposed to the classical Huffman code, where the code words Ck are kept in memory in a table whose addresses are the elements xk). The parameters of the computation are adapted to the features of the source.
This concept of universal coding has been introduced for the coding of binary memoryless sources. For this kind of source, the alphabet consists of only two elements {x0 (0) and x1 (1)}, and the statistical features are limited to p1, since p0 = 1 − p1. In practice, we cut the source into vectors of N binary events, then we count the number of "1"s, estimate p1 ≃ n1/N, and finally output a code adapted to p1.

2.4.1 Arithmetic Coding


Arithmetic coding is a computed code that is perfectly adapted to a memoryless source . . . , X(n − 1), X(n), X(n + 1), . . . if we know the probabilities pk = P(X(n) = xk). We can make it universal by using a prefix part sending the values of pk, and a suffix carrying the actual arithmetic code.
Arithmetic coding has been developed by IBM on the basis of Shannon's and Elias' work.
The concept underlying arithmetic coding is to encode a vector of the source by a value that belongs to the interval [0, 1[. This real value points to a sub-interval whose length is proportional to the probability of the vector. The number of bits necessary to encode this pointer is essentially − log2 of the length of the sub-interval.
For example,

character     a          b           c           d           e           f
probability   0.1        0.1         0.1         0.2         0.4         0.1
interval      [0, 0.1[   [0.1, 0.2[  [0.2, 0.3[  [0.3, 0.5[  [0.5, 0.9[  [0.9, 1[

This table represents the correspondence that must be sent to the receiver.
Let's see the algorithms for coding and decoding. We define Inf as the lower bound of the current sub-interval, Sup as its upper bound, and Size as its width. The goal of the arithmetic coding is to find the interval [Inf, Sup[ corresponding to the sequence of N events that forms the vector to encode. To get the values of Inf and Sup, we follow the procedure:

→ Coding: initialization
Inf = 0
Sup = 1

WHILE (c != EOF (end of file))
  c = get_char()
  x, y = bounds of the interval of c in the table
  Size = Sup - Inf
  Sup = Inf + Size · y
  Inf = Inf + Size · x
END WHILE
Return a number α such that Inf ≤ α < Sup

Besides, the arithmetic coding provides a procedure to encode the interval [Inf, Sup[ by giving, in binary representation, a number α that points to [Inf, Sup[ without any ambiguity.
To illustrate this algorithm, let's take the sequence "bebecafdead" with the probabilities of the letters a to f given previously.

Symbol c   Binf(c) = x   Bsup(c) = y   Inf             Sup             Size
b          0.1           0.2           0.1             0.2             0.1
e          0.5           0.9           0.15            0.19            0.04
b          0.1           0.2           0.154           0.158           0.004
e          0.5           0.9           0.156           0.1576          0.0016
c          0.2           0.3           0.15632         0.15648         0.00016
a          0             0.1           0.15632         0.156336        1.6 10−5
f          0.9           1             0.1563344       0.156336        1.6 10−6
d          0.3           0.5           0.15633488      0.1563352       3.2 10−7
e          0.5           0.9           0.15633504      0.156335168     1.28 10−7
a          0             0.1           0.15633504      0.1563350528    1.28 10−8
d          0.3           0.5           0.15633504384   0.1563350464    2.6 10−9

Hence, the sequence "bebecafdead" is encoded as a real number between 0.15633504384 and 0.1563350464, for example α = 0.156335045. We get a coding more efficient than Huffman's code, because we can spend a non-integer number of bits per symbol. This difference in efficiency is even more important if one of the probabilities is greater than 0.5.
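
A minimal floating-point Python sketch of the encoding procedure above (a practical implementation would use integer arithmetic and renormalization to avoid the loss of precision visible here):

# character -> [x, y[ interval, as in the table above
intervals = {"a": (0.0, 0.1), "b": (0.1, 0.2), "c": (0.2, 0.3),
             "d": (0.3, 0.5), "e": (0.5, 0.9), "f": (0.9, 1.0)}

def arith_encode(msg):
    """Return the interval [Inf, Sup[ representing msg."""
    inf, sup = 0.0, 1.0
    for c in msg:
        x, y = intervals[c]
        size = sup - inf
        sup = inf + size * y
        inf = inf + size * x
    return inf, sup

print(arith_encode("bebecafdead"))
# approximately (0.15633504384, 0.1563350464): any α in this interval, e.g. 0.156335045, encodes the message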
In the previous example, we notice that each real number in the interval [0.15633504384, 0.1563350464[ represents an infinite sequence of symbols that starts with "bebecafdead". In order to tell the decoding procedure when to stop, we must:

• either give the number of symbols to decode (typically at the beginning of the compressed file, or as an integer part),
• or use a special character (like EOF) added at the end of the message to encode, to which we give the weakest probability.

The decoding procedure is based on the fact that the interval of the first character to decode contains the number α. The decoding uses the bounds Binf(c) and Bsup(c) of the table, updating α according to the following procedure:

→ Decoding: initialization
Input: α ∈ [0, 1[   % number to decode

WHILE (c != EOF (end of file))
  c = symbol whose interval in the table contains α
  Output c
  Size = Bsup(c) - Binf(c) = y - x
  α = α − Binf(c)
  α = α / Size
END WHILE


Let’s apply this algorithm to the example.

α              interval      symbol   size
0.156335045    [0.1, 0.2[    b        0.1
0.56335045     [0.5, 0.9[    e        0.4
0.158376125    [0.1, 0.2[    b        0.1
0.58376125     [0.5, 0.9[    e        0.4
0.209403125    [0.2, 0.3[    c        0.1
0.09403125     [0, 0.1[      a        0.1
0.9403125      [0.9, 1[      f        0.1
0.403125       [0.3, 0.5[    d        0.2
0.515625       [0.5, 0.9[    e        0.4
0.0390625      [0, 0.1[      a        0.1
0.390625       [0.3, 0.5[    d        0.2
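
A matching sketch of the decoder, again in plain floating point; the number of symbols to decode is passed explicitly, which corresponds to the first of the two stopping strategies mentioned above:

intervals = {"a": (0.0, 0.1), "b": (0.1, 0.2), "c": (0.2, 0.3),
             "d": (0.3, 0.5), "e": (0.5, 0.9), "f": (0.9, 1.0)}

def arith_decode(alpha, n_symbols):
    """Recover n_symbols characters from the real number alpha."""
    out = []
    for _ in range(n_symbols):
        for c, (x, y) in intervals.items():
            if x <= alpha < y:          # the interval containing alpha gives the symbol
                out.append(c)
                alpha = (alpha - x) / (y - x)
                break
    return "".join(out)

print(arith_decode(0.156335045, 11))    # 'bebecafdead'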

A dynamic version of this algorithm exists, but isn’t studied here.

2.4.2 UVLC: Universal Variable Length Coding


UVLC is a coding technique particularly adapted to memoryless sources whose values are strongly concentrated around 0. In that case, the alphabet can be represented by a binary coding: sign + magnitude. We can encode the magnitudes by coding the positions of the Most Non-Zero Significant Bits (MNZSBs) in a table containing the elements X(n) of the vector to encode.
If we put each input vector in a column, we get for instance a table like:

bit   X(0) X(1) X(2) X(3) X(4)  . . .                        X(N−1)
 4     0    0    0    0    0    0    0    1    0    0    0    0    1    0    0
 3     0    0    0    0    0    0    0    X    0    0    0    0    X    0    0
 2     0    0    0    0    1    0    0    X    0    1    0    0    X    0    0
 1     0    0    0    0    X    0    0    X    0    X    0    0    X    0    0
 0     0    0    0    0    X    0    0    X    0    X    0    0    X    0    0

Let's consider the vector X(4) = [0 0 1 X X]. The X's represent the bits appearing after the first "1" encountered (reading from the most significant bit down). The probability of each X is 50%, because each such bit is equiprobable in a binary word. However, the zeros before the first 1 represent very redundant information.
The UVLC principle is that we won't try to encode the less significant bits (the X's), called LSBs (Least Significant Bits): their probability is 50%, so there is no coding gain. On the opposite, the position of the first "1", called the MNZSB, is worth coding.
The coding is carried out line after line. We count the number of "0"s appearing before an MNZSB, and we call this number the Run Length (RL). We will explain below how to code an RL, using a parameter mi that depends on the index of the line being encoded. Once the position of the MNZSB is known, we send the LSBs of the corresponding column without coding. The column containing these LSBs won't be used anymore, so we can delete it from the table. We repeat the same procedure until we reach the next MNZSB, and then we go to the next line.
The coding algorithm is presented here in pseudo-code (a sketch of the run-length part follows it):

we encode the largest index of the line containing the first MNZSB (4 in the example)
WHILE there is still a line to encode
  is the line i encoded? YES/NO (presence of MNZSBs)
  RETURN parameter mi of the line
  WHILE presence of an MNZSB
    START coding RL
      Mi = 2^mi
      WHILE RL > Mi
        RETURN 0
        RL = RL - Mi
      END WHILE
      RETURN 1
      RETURN RL (encoded on mi bits)
    END coding RL
    RETURN LSBs
  END WHILE
END WHILE
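
A minimal Python sketch of the run-length part of this procedure, read here as a Golomb-Rice-like code: it is assumed that the escape loop runs while RL ≥ 2^mi, so that the residual always fits on mi bits:

def encode_rl(rl, mi):
    """Run length -> bits: one '0' per full block of 2^mi zeros, then '1' and the residual on mi bits.
    (Assumption: the loop runs while RL >= 2^mi, so the residual fits on mi bits.)"""
    Mi = 2 ** mi
    bits = ""
    while rl >= Mi:
        bits += "0"
        rl -= Mi
    return bits + "1" + format(rl, "0%db" % mi)

def decode_rl(bits, mi):
    """Inverse operation; returns (run length, number of bits consumed)."""
    Mi, rl, i = 2 ** mi, 0, 0
    while bits[i] == "0":
        rl += Mi
        i += 1
    i += 1                              # skip the '1'
    return rl + int(bits[i:i + mi], 2), i + mi

b = encode_rl(11, mi=2)
print(b, decode_rl(b, mi=2))            # '00111', (11, 5)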


Chapter 3

Quantization

The goal of Chapter 2 was to provide tools for the entropy coding of discrete memoryless numerical sources. The sequence of events X(n), assumed to be independent (memoryless source), is encoded with a variable length code. To each event X(n) = x_k we allocate a code word C_k whose length L_k has to be as close as possible to -log_2 p_k, in order to obtain an efficiency as close as possible to 1.
This entropy coding can be applied exactly as described when we want a reversible (lossless) coding of such a source (see Paragraph 1.1). The rate associated with the coding is variable and depends on the probability of the events to encode; only the mean rate can be predicted:
$$R = \sum_{k=0}^{D-1} p_k L_k \;\ge\; -\sum_{k=0}^{D-1} p_k \log_2 p_k = H$$

In the case of waveforms, the situation is different: most often, the goal is to obtain a coded representation of the signal that allows a reconstruction of the source as accurate as possible, for a given rate R.
We will assume that the waveform sources have been sampled at a frequency allowing an accurate reconstruction of the initial signal, and that they have been quantized finely enough for us to consider a discrete source of samples X(n) taking real values.
Moreover, the samples X(n) can no longer be considered as a memoryless source. Indeed, points that are very close to each other in an image or in an audio signal tend to look alike: an image is made of objects composed of similar points (the pixels, for "picture elements"), and a sound is made of a series of samples with similar features.
A waveform coder is composed of 3 elements:

• a decorrelator: transforms the sequence of samples X(n) into a sequence of decorrelated samples Y(n). It is a decorrelating transform. Later on, we will assume that decorrelated samples are independent (this assumption holds if we restrict ourselves to first order approximations);

• a quantizer: limits the information contained in Y(n). It is an irreversible operation that introduces errors. However, quantization makes variable length coding possible: the quantized values Y^q(n) take their values in an alphabet of real values y_{-N/2}, ..., y_0, ..., y_{N/2-1}, and that alphabet has a bounded (finite) size;


• an entropy coder: encodes the values Y^q(n). The entropy of Y^q(n) is changed by the quantization, so the target mean rate is reached through a correct choice of the quantization parameters: the relation between entropy and quantization is one of the subjects of this Chapter. To keep the chosen rate stable, the output of the entropy coder is most often regulated by a buffer memory, which is filled with variable length code words and emptied at a constant rate. The filling level of the buffer acts as a feedback on the quantization step in order to avoid overflow: if the entropy coder were allowed to send too much information, the channel would overflow, and the feedback of the buffer on the quantization step prevents this (a rate-control sketch is given below).

In the case of a 2-D image, the relation between X and Y is described by the transform coefficients km,n (i, j),
as follows :
$$y_{m,n} = \sum_i \sum_j k_{m,n}(i,j)\, x(i,j)$$

Note that in practice we never apply a transform to a whole image, because a point at the top of the image is usually totally different from a point at the bottom. Instead, we divide the image into 8 × 8 blocks, so that the correlation between the points within a block remains significant, and we can then choose a quantization step q for each block.
Moreover, the decorrelation corresponds to a frequency analysis of the signal: the goal is to describe the signal by frequency coefficients. The low-frequency coefficients take large values while the high-frequency ones take small values, so the transform concentrates the information in the low-order coefficients. The quantization is therefore rougher for the high frequencies, because the human eye is less disturbed by errors occurring on high frequencies than by errors on low frequencies.
The waveforms decoder is composed of elements realizing the inverse operations :

• an entropy decoding, possibly synchronized by a buffer memory, which recovers the Y^q(n);

• the Y(n) are approximated by the Y^q(n): a quantization error ε is introduced here;

• a signal X′ is reconstructed from the Y^q(n). This signal differs from X(n) because of the quantization errors. These errors can be described mathematically with the coefficients of the inverse transform h_{m,n}(i,j) (let's consider the case of a 2-D image):
$$x'(i,j) = x(i,j) + \varepsilon(i,j) = \sum_m \sum_n h_{m,n}(i,j)\, y_{m,n}^q = \sum_m \sum_n h_{m,n}(i,j)\,\bigl(y_{m,n} + \varepsilon_{m,n}\bigr)$$
$$= \underbrace{\sum_m \sum_n h_{m,n}(i,j)\, y_{m,n}}_{x(i,j)\ \text{(inverse transform)}} \;+\; \underbrace{\sum_m \sum_n h_{m,n}(i,j)\, \varepsilon_{m,n}}_{\varepsilon(i,j)\ \text{due to the quantization}}$$



The decoding thus produces a signal X′ that is more or less noisy depending on the quantization level. Indeed, the equation shows that if we do not quantize at all, we recover x(i,j) without any error.

The difference between X′ and X(n) is most often measured as a signal-to-noise ratio. For waveforms, other distortion measures are also used, closer to the perception of the observer (image or sound quality).
The effect of coding is to compress the signal down to a certain rate R, reached by quantizing more or less finely in order to limit the information. If we want fewer errors, we must quantize more finely, so the entropy H, and hence the rate R, increases. If we quantize more roughly, we get more errors but R decreases. A compromise is therefore necessary.
Waveform coding thus leads to rate-distortion curves, which we will study mathematically in Chapter 4.

3.1 Scalar quantization


Linear scalar quantization associates with the real variable Y(n) an integer k, using the quantization step q:
$$\left(k - \tfrac{1}{2}\right) q \le Y(n) < \left(k + \tfrac{1}{2}\right) q \quad\Longleftrightarrow\quad k = \operatorname{int}\!\left(\frac{Y(n)}{q} + \frac{1}{2}\right)$$
where we add 1/2 in order to center the quantization intervals around 0, and where int(·) rounds down to the lower integer. For instance,
$$k = -1 \Rightarrow -\tfrac{3}{2}q \le y < -\tfrac{1}{2}q, \qquad k = 0 \Rightarrow -\tfrac{1}{2}q \le y < \tfrac{1}{2}q, \qquad k = 1 \Rightarrow \tfrac{1}{2}q \le y < \tfrac{3}{2}q$$

The quantized value Y(n) is reconstructed by the approximation Y^q(n) = kq. The quantization error ε(n) = Y(n) − kq is a random variable taking its values in [−q/2, q/2[.
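A minimal Python sketch of this quantizer (assuming floor for the rounding to the lower integer, which also handles negative values correctly):

import math

def quantize(y, q):
    # k = int(Y(n)/q + 1/2): index of the interval [(k-1/2)q, (k+1/2)q[
    return math.floor(y / q + 0.5)

def dequantize(k, q):
    # reconstruction Y^q(n) = k*q
    return k * q

q = 0.5
for y in (-0.6, -0.2, 0.1, 0.8):
    k = quantize(y, q)
    print(y, k, round(y - dequantize(k, q), 3))   # error always in [-q/2, q/2[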


3.1.1 Reminder on random sequences


In Chapter 2, we focused on discrete memoryless numerical sequences, which were entirely characterized by their probability distribution.
In order to study the quantization and coding of waveforms X(n) in more detail, we introduce new concepts describing the power of the signal, the power of the noise, and the correlation contained in the signal. From now on we will suppose that waveforms are stationary (i.e. their statistics do not vary with time, or, in other words, do not depend on the index n of X(n)) and ergodic (meaning that the mean and the probabilities can be approximated by occurrence frequencies over a large number of observations).

Probability Density When the samples X(n) are very finely quantized, we can assume that their
distribution of possible values is continuous.
We define the probability density p(x) as :
$$P(X(n) \le X) = \int_{-\infty}^{X} p(x)\, dx$$

The probability density can be evaluated by observing a large number of values of the sequence X(n):
$$P(X(n) \le X) \simeq \frac{\text{Number of observed values } X(n) \le X}{\text{Total number of observed values}}$$
We will specially focus on two kinds of probability density :

• the uniform probability density of a variable taking its values in the interval [−D/2, D/2[:
$$p(x) = \frac{1}{D}$$

• the Laplace probability density, representing a variable that takes values around 0 with high probability (typically the "variation" of a waveform, image or sound):
$$p(x) = \frac{1}{\sqrt{2}\,\sigma_x} \exp\!\left(-\sqrt{2}\,|x| / \sigma_x\right)$$


Mean The mean is the expectation of the variable X(n). Since the process is stationary, the mean does not depend on n:
$$\mu_x = E\{X(n)\} = \int_{-\infty}^{+\infty} x\, p(x)\, dx \simeq \frac{\sum \text{observed values of } X(n)}{\text{Total number of observations}}$$

Variance The variance measures the power of X(n):
$$\sigma_x^2 = E\{(X(n) - \mu_x)^2\} = \int_{-\infty}^{+\infty} (x - \mu_x)^2\, p(x)\, dx \simeq \frac{\sum (\text{observed values of } X(n) - \mu_x)^2}{\text{Total number of observations}}$$

Autocovariance The correlation between the samples of a process is generally evaluated with the autocovariance function, which measures the expectation of the product (sample − mean)(neighbouring sample − mean):
$$\Gamma_x(m) = E\{(X(n) - \mu_x)(X(n+m) - \mu_x)\} \simeq \frac{\sum (\text{obs. } X(n) - \mu_x)\,(\text{obs. } X(n+m) - \mu_x)}{\text{Total number of observed pairs}}$$
Because of the stationarity, Γ_x(m) depends only on the distance m between the two samples, not on n. We can also see that Γ_x(0) = σ_x².

Noise A noise is a sequence that does not contain any information, so there is no correlation between its samples (an image of noise is a totally random alternation of points). A noise has an autocovariance function of the form
$$\Gamma_x(m) = \sigma^2 \ \text{if } m = 0, \qquad = 0 \ \text{if } m \ne 0$$

Markov process of order 1 Waveforms (image and sound) are often modelled by a Markov process. A Markov process of order 1 has the autocovariance function
$$\Gamma_x(m) = \sigma_x^2\, \rho^{|m|} \qquad \text{with } \rho < 1$$
The deviation of a sample from the mean and the deviation from the mean of another sample at distance m most often have the same sign when m is small, so Γ_x(m) then takes significant values; when m is very large, Γ_x(m) tends to 0. The closer ρ is to 1, the more correlated the process.
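As a sanity check (a sketch added here, not part of the original text), an order-1 Markov sequence can be simulated as an AR(1) recursion X(n) = ρX(n−1) + W(n), with white noise W of variance σ_x²(1−ρ²); its empirical autocovariance then follows σ_x² ρ^|m|:

import random

random.seed(1)
rho, sigma_x, N = 0.9, 1.0, 200_000
sigma_w = sigma_x * (1 - rho ** 2) ** 0.5      # innovation std giving variance sigma_x^2
x, xs = 0.0, []
for _ in range(N):
    x = rho * x + random.gauss(0.0, sigma_w)
    xs.append(x)

def autocov(xs, m):
    mu = sum(xs) / len(xs)
    return sum((a - mu) * (b - mu) for a, b in zip(xs, xs[m:])) / (len(xs) - m)

for m in range(4):
    print(m, round(autocov(xs, m), 3), round(sigma_x ** 2 * rho ** m, 3))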


Remark on the expectation operator The expectation operator is linear: if a and b are constants and X(n) and Y(n) are random variables, then
$$E\{aX(n) + bY(n)\} = a\,E\{X(n)\} + b\,E\{Y(n)\}$$
Moreover, if μ_x = 0:
$$E\{(aX(n) + bX(n+m))(aX(n) + bX(n+m))\} = a^2 E\{X(n)^2\} + 2ab\,E\{X(n)X(n+m)\} + b^2 E\{X(n+m)^2\} = (a^2 + b^2)\,\sigma_x^2 + 2ab\,\Gamma_x(m)$$

3.1.2 Noise features


The quantization noise ε(n) = Y(n) − kq is a random sequence that we can assume to be decorrelated if the quantization step q is small enough compared to σ_y:
$$\Gamma_\varepsilon(m) = \sigma_\varepsilon^2 \ \text{if } m = 0, \qquad = 0 \ \text{if } m \ne 0$$
Its mean is most often considered to be null:
$$\mu_\varepsilon = \int_{-\infty}^{+\infty} \varepsilon\, p_\varepsilon(\varepsilon)\, d\varepsilon = \sum_k \int_{(k-\frac12)q}^{(k+\frac12)q} (y - kq)\, p_y(y)\, dy$$
If the quantization step is small, we can assume that p_y(y) is constant over the interval [(k − 1/2)q, (k + 1/2)q[ and that p_y(y) ≃ p_y(kq) = p_y^k on that interval. We then write
$$\mu_\varepsilon = \sum_k p_y^k \left[\frac{y^2}{2} - kqy\right]_{(k-\frac12)q}^{(k+\frac12)q} = \sum_k p_y^k \left( \frac{\bigl(k+\frac12\bigr)^2 - \bigl(k-\frac12\bigr)^2}{2}\, q^2 - kq\,\Bigl[\bigl(k+\tfrac12\bigr) - \bigl(k-\tfrac12\bigr)\Bigr] q \right) = \sum_k p_y^k \bigl(kq^2 - kq^2\bigr) = 0$$
We can compute σ_ε² in the same way:
$$\sigma_\varepsilon^2 = \int_{-\infty}^{+\infty} \varepsilon^2\, p_\varepsilon(\varepsilon)\, d\varepsilon = \sum_{k=-\infty}^{+\infty} \int_{(k-\frac12)q}^{(k+\frac12)q} (y - kq)^2\, p_y(y)\, dy$$

If we assume that p_y(y) is constant over the quantization step, we have
$$\sigma_\varepsilon^2 = \sum_{k=-\infty}^{+\infty} p_y^k \int_{(k-\frac12)q}^{(k+\frac12)q} (y - kq)^2\, dy = \sum_{k=-\infty}^{+\infty} p_y^k \left[\frac{y^3}{3} - kqy^2 + k^2 q^2 y\right]_{(k-\frac12)q}^{(k+\frac12)q}$$
$$= \sum_{k=-\infty}^{+\infty} p_y^k \left\{ \frac{q^3}{3}\Bigl[\bigl(k+\tfrac12\bigr)^3 - \bigl(k-\tfrac12\bigr)^3\Bigr] + kq^3\Bigl[\bigl(k-\tfrac12\bigr)^2 - \bigl(k+\tfrac12\bigr)^2\Bigr] + k^2 q^3\Bigl[\bigl(k+\tfrac12\bigr) - \bigl(k-\tfrac12\bigr)\Bigr] \right\}$$
$$= \sum_{k=-\infty}^{+\infty} p_y^k \left\{ \frac{q^3}{3}\Bigl(3k^2 + \tfrac14\Bigr) - 2k^2 q^3 + k^2 q^3 \right\} = \frac{q^3}{12} \sum_{k=-\infty}^{+\infty} p_y^k$$
but
$$\int_{-\infty}^{+\infty} p_y(y)\, dy = 1 \simeq \sum_{k=-\infty}^{+\infty} \int_{(k-\frac12)q}^{(k+\frac12)q} p_y^k\, dy = \sum_{k=-\infty}^{+\infty} p_y^k \cdot q$$


$$\Rightarrow\quad \sum_{k=-\infty}^{+\infty} p_y^k = \frac{1}{q}$$
and so
$$\sigma_\varepsilon^2 = \frac{q^2}{12}$$
This value is a good approximation for other distributions when the quantization step q is small, and it is the exact value for a uniform distribution with a linear quantizer. Indeed, let us consider such a distribution over an interval of width D. Then
$$\sigma_\varepsilon^2 = \sum_{k=-\infty}^{+\infty} \int_{(k-\frac12)q}^{(k+\frac12)q} (y - kq)^2\, \frac{1}{D}\, dy$$
but, for a uniform distribution, the error is null at the middle of each interval and maximal at its extremities, and the error variance is the same over the N intervals, so
$$\sigma_\varepsilon^2 = \frac{N}{D} \int_{-q/2}^{q/2} y^2\, dy = \frac{N}{D}\, \frac{q^3}{12} = \frac{q^2}{12} \qquad \text{because } N = \frac{D}{q} = \text{number of quantization intervals}$$
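A quick numerical check of σ_ε² = q²/12 (a sketch added for illustration; the values of q and D are arbitrary):

import random, statistics

random.seed(0)
q, D = 0.1, 8.0
errors = []
for _ in range(200_000):
    y = random.uniform(-D / 2, D / 2)          # uniform source over [-D/2, D/2[
    k = round(y / q)                           # linear quantizer index
    errors.append(y - k * q)                   # quantization error in [-q/2, q/2[
print(round(statistics.pvariance(errors), 6), round(q * q / 12, 6))   # both close to 0.000833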

Once the quantization step is fixed, we can compute the probabilities
$$p_k = P\!\left(\bigl(k - \tfrac12\bigr) q \le Y(n) < \bigl(k + \tfrac12\bigr) q\right) = \int_{(k-\frac12)q}^{(k+\frac12)q} p_y(y)\, dy \simeq p_y^k\, q$$
Then we can compute the mean rate, at the output of the entropy coder, of the quantized values Y^q(n), represented by the integers k:
$$R \simeq H = -\sum_{k=-\infty}^{+\infty} p_k \log_2 p_k$$

Two kinds of distribution deserve particular attention:

• the uniform distribution over a dynamic D, for which p_Y(y) = 1/D. It corresponds to the distribution of an isolated sample of a waveform;

• the Laplace distribution, for which
$$p_Y(y) = \frac{1}{\sqrt{2}\,\sigma_y} \exp\!\left(-\sqrt{2}\,|y|/\sigma_y\right),$$
corresponding to the distribution of a variable that is strongly centered around 0, especially if σ_y is small. It is a good model for the distribution of the details of a waveform signal.

For these distributions, we can compute the exact values of σ_ε² and H for different values of q. Let us take the case of the uniform distribution and calculate the probabilities:
$$p_0 = \int_{-q/2}^{q/2} \frac{1}{D}\, dy = \frac{q}{D} = \frac{1}{N} = p_k \quad \forall k$$
so
$$H = -\sum_k p_k \log_2 p_k = -\sum_k \frac{1}{N} \log_2 \frac{1}{N} = \log_2 N$$

We can thus establish the curve of the signal-to-noise ratio σ_y²/σ_ε² as a function of the rate H. Again, for a uniform distribution quantized with a linear quantizer,
$$\sigma_y^2 = \int_{-D/2}^{D/2} x^2\, \frac{1}{D}\, dx = \frac{D^2}{12} = \frac{q^2 N^2}{12} \qquad \left(N = \frac{D}{q}\right)$$


so
$$\frac{\sigma_y^2}{\sigma_\varepsilon^2} = \frac{D^2/12}{q^2/12} = \frac{D^2}{q^2} = N^2$$
thus
$$H\bigl(Y^q(n)\bigr) = \log_2 N = \log_2 \sqrt{\frac{\sigma_y^2}{\sigma_\varepsilon^2}} = \frac{1}{2}\log_2\!\left(\frac{\sigma_y^2}{\sigma_\varepsilon^2}\right)$$
There is no simple analytic expression for the Laplace distribution. However, the curve relating H(Y^q(n)) to (1/2) log_2(σ_y²/σ_ε²) is close to a straight line at 45° that is slightly offset, so we can write H = (1/2) log_2(φ σ_y²/σ_ε²) with φ = 1.25.
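As a quick numerical illustration (not in the original text, but a direct consequence of the uniform-case relation above, φ = 1):
$$\frac{\sigma_y^2}{\sigma_\varepsilon^2} = 2^{2H} \;\Rightarrow\; 10\log_{10}\frac{\sigma_y^2}{\sigma_\varepsilon^2} \approx 6.02\,H\ \text{dB},$$
so each additional bit of rate improves the signal-to-noise ratio by about 6 dB; an 8-bit uniform quantizer (N = 256) reaches roughly 48 dB.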
ε

Generally, we will have
$$H\bigl(Y^q(n)\bigr) = H(k) = -\sum_k p_k \log_2 p_k = \frac{1}{2}\log_2\!\left(\phi\, \frac{\sigma_y^2}{\sigma_\varepsilon^2}\right)$$
with φ = 1 for a uniform distribution and φ = 1.25 for a Laplacian distribution.
Consequently, if the variance is small we can quantize finely, but if the variance is large we have to quantize more roughly to keep the same H.
Moreover, the parameter φ gets closer to 1 as the quantization step becomes smaller, because we then get closer to a uniform distribution within each interval.
By inverting the relation, we find
$$\frac{\sigma_y^2}{\sigma_\varepsilon^2} = \frac{2^{2H}}{\phi} \quad\Longleftrightarrow\quad \frac{\sigma_\varepsilon^2}{\sigma_y^2} = \phi\, 2^{-2H}$$
which gives the rate-distortion curve: the distortion decreases exponentially as the rate increases.


Chapter 4

Optimal rate-distortion compromise

4.1 Coding gain


If the samples X(n) of a waveform source were compressed directly, by applying a scalar quantizer followed by an entropy coder, we would have (for a uniform probability density, φ = 1) the relation
$$R \simeq H\bigl(X^q(n)\bigr) = \frac{1}{2}\log_2\!\left(\frac{\sigma_x^2}{\sigma_\varepsilon^2}\right)$$

In order to compress the samples X(n) more efficiently, it is worthwhile to apply a decorrelating transform that turns them into Y(n). We will show that in this case
$$R = H\bigl(Y^q(n)\bigr) = \frac{1}{2}\log_2\!\left(\frac{\sigma_x^2}{\sigma_\varepsilon^2} \Big/ G_T\right)$$

where GT is the coding gain related to the transform.


In this formula, σ_x² is the variance of the waveform signal and σ_ε² is the error variance of the reconstructed waveform (after decoding). H(Y^q(n)) represents the mean rate needed to encode the quantized, decorrelated coefficients.

4.2 Coding by linear transform


In the case of decorrelation by linear transform, the source is decomposed into vectors (blocks) of N samples:
..., [X(kN − N), ..., X(kN − N/2), ..., X(kN − 1)], [X(kN), ..., X(kN + N/2), ..., X(kN + N − 1)], ...
Each vector is decomposed into N decorrelated components, the elements k_{m,n} forming the kernel of the transform:
$$Y_i(k) = \sum_{j=0}^{N-1} k_{i,j}\, X(kN + j), \qquad i = 0, \ldots, N-1$$

Conversely, the k-th vector can be reconstructed from the transformed coefficients:
$$X(kN + j) = \sum_{i=0}^{N-1} h_{j,i}\, Y_i(k), \qquad j = 0, \ldots, N-1$$
the elements h_{m,n} forming the basis functions of the transform.

We thus see that the sequence X(n) is decomposed into N sequences Y_i(k) of transformed coefficients, each containing N times fewer samples than X(n). The sequence X(n) is reconstructed by vectors of N samples, each vector being a weighted sum of N predefined vectors [h_{0,i}, h_{1,i}, ..., h_{N−1,i}], weighted by the transformed coefficients Y_i(k).


The rate per sample X(n) needed to encode the waveform is given by
$$H = \frac{1}{N}\sum_{n=0}^{N-1} H\bigl(Y_n^q(k)\bigr) = \frac{1}{N}\sum_{n=0}^{N-1} \frac{1}{2}\log_2\!\left(\phi_n\, \frac{\sigma_{Y_n}^2}{\sigma_{\varepsilon_n}^2}\right) \quad [\text{bits/sample}]$$
with φ_n = 1 if Y_n is uniform and φ_n = 1.25 if Y_n is Laplacian, and where σ_{ε_n}² is the variance of the quantization error introduced in the sequence Y_n(k).
We will assume from now on that all the transformed coefficients are quantized with the same step q, so that σ_{ε_i}² = q²/12 for all i. We then have

$$H = \frac{1}{N}\sum_{n=0}^{N-1} \frac{1}{2}\log_2\!\left(\phi_n\, \frac{\sigma_{Y_n}^2}{\sigma_{\varepsilon_n}^2}\right) = \frac{1}{2}\log_2\!\left[\prod_{n=0}^{N-1} \phi_n\, \frac{\sigma_{Y_n}^2}{q^2/12}\right]^{1/N} = \frac{1}{2}\log_2 \frac{\overbrace{\sqrt[N]{\displaystyle\prod_{n=0}^{N-1} \phi_n\, \sigma_{Y_n}^2}}^{\text{geometric mean}}}{q^2/12}$$
with
$$\sqrt[N]{\prod_{n=0}^{N-1} \phi_n\, \sigma_{Y_n}^2} = G_T^{-1}\, \phi_m\, \sigma_x^2$$
so that
$$H = \frac{1}{2}\log_2\!\left(G_T^{-1}\, \phi_m\, \frac{\sigma_x^2}{q^2/12}\right)$$
where GT is the coding gain due to the transform.
To understand the origin of this coding gain, we must examine the terms appearing in the expression of H. We assume that the variables are centered, so

$$\sigma_{Y_n}^2 = E\{Y_n(k)\,Y_n(k)\} = E\left\{\sum_{i=0}^{N-1} k_{n,i}\,X(kN+i) \sum_{i'=0}^{N-1} k_{n,i'}\,X(kN+i')\right\}$$
$$= \sum_{i=0}^{N-1}\sum_{i'=0}^{N-1} k_{n,i}\,k_{n,i'}\,E\{X(kN+i)\,X(kN+i')\} \quad \text{(by linearity)} = \sum_{i=0}^{N-1}\sum_{i'=0}^{N-1} k_{n,i}\,k_{n,i'}\,\Gamma_x(i - i')$$

So we can write
$$H = \frac{1}{2}\log_2\!\left[\prod_{n=0}^{N-1} \phi_n\, \frac{\sigma_{Y_n}^2}{q^2/12}\right]^{1/N} = \frac{1}{2}\log_2\!\left[\left(\prod_{n=0}^{N-1}\phi_n\right)^{\!1/N} \left(\prod_{n=0}^{N-1}\sum_{i=0}^{N-1}\sum_{i'=0}^{N-1} k_{n,i}\,k_{n,i'}\,\Gamma_x(i-i')\right)^{\!1/N} \frac{1}{q^2/12}\right]$$

The reconstructed signal contains noise because of the quantization of the coefficients Y_i(k):
$$\tilde{X}(kN+j) = \sum_{i=0}^{N-1} h_{j,i}\,Y_i^q(k) = \sum_{i=0}^{N-1} h_{j,i}\,\bigl[Y_i(k) + \varepsilon_i(k)\bigr]$$


so
$$\varepsilon(kN+j) = X(kN+j) - \tilde{X}(kN+j) = \sum_{i=0}^{N-1} h_{j,i}\,\varepsilon_i(k)$$
and so
$$\sigma_{\varepsilon_j}^2 = E\{\varepsilon(kN+j)\,\varepsilon(kN+j)\} = E\left\{\sum_{i=0}^{N-1} h_{j,i}\,\varepsilon_i(k)\sum_{i'=0}^{N-1} h_{j,i'}\,\varepsilon_{i'}(k)\right\} = \sum_{i=0}^{N-1}\sum_{i'=0}^{N-1} h_{j,i}\,h_{j,i'}\,E\{\varepsilon_i(k)\,\varepsilon_{i'}(k)\}$$
but ε_i(k) is not correlated with ε_{i'}(k) (it is a noise) except for i = i', so
$$\sigma_{\varepsilon_j}^2 = \sum_{i=0}^{N-1} h_{j,i}^2\,\sigma_{\varepsilon_i}^2 = \sum_{i=0}^{N-1} h_{j,i}^2\,\frac{q^2}{12}$$
and as a consequence
$$\sigma_\varepsilon^2 = \sum_{j=0}^{N-1} \frac{1}{N}\,\sigma_{\varepsilon_j}^2 = \sum_{j=0}^{N-1}\sum_{i=0}^{N-1} \frac{h_{j,i}^2}{N}\,\frac{q^2}{12}$$
so
$$\frac{q^2}{12} = \sigma_\varepsilon^2 \Big/ \left(\sum_{j=0}^{N-1}\sum_{i=0}^{N-1} \frac{h_{j,i}^2}{N}\right)$$

By expressing Γ_x(m) in its Markovian form Γ_x(m) = σ_x² r_x(m), we finally get
$$H = \frac{1}{2}\log_2\!\left[\underbrace{\left(\prod_{n=0}^{N-1}\phi_n\right)^{\!1/N}}_{\rightarrow\ \phi_m}\ \underbrace{\left(\prod_{n=0}^{N-1}\sum_{i=0}^{N-1}\sum_{i'=0}^{N-1} k_{n,i}\,k_{n,i'}\,r_x(i-i')\right)^{\!1/N}}_{\rightarrow\ G_\sigma^{-1}}\ \underbrace{\left(\sum_{j=0}^{N-1}\sum_{i=0}^{N-1}\frac{h_{j,i}^2}{N}\right)}_{\rightarrow\ G_\varepsilon^{-1}}\ \frac{\sigma_x^2}{\sigma_\varepsilon^2}\right]$$
$$H = \frac{1}{2}\log_2\!\left[\phi_m\,\frac{\sigma_x^2}{\sigma_\varepsilon^2} \Big/ (G_\sigma\, G_\varepsilon)\right]$$
The coding gain is thus the product of two terms: G_σ is related to the transform kernel (it is the inverse of the geometric mean of the variances of the transformed coefficients, normalized by σ_x²), and G_ε is related to the norm of the transform's basis functions.
We can show that if the transform is normalized (G_ε = 1), then maximizing G_σ yields the decorrelating transform, i.e. the one for which
$$E\{Y_i(k)\,Y_{i'}(k)\} = \sigma_{y_i}^2 \ \text{if } i = i', \qquad = 0 \ \text{if } i \ne i'$$


Example Let X(n) be a Markov process of order 1, and consider the normalized transform of order 2 defined by
$$[k_{m,n}] = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$$
and so
$$[h_{i,j}] = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} = [k_{m,n}]^{-1}$$
Graphically, the transform can be visualized as in the corresponding figure: the points represent pairs [X(2k), X(2k + 1)] that we can observe. The values X(n) are uniformly distributed over the interval [0, D[, but the pairs of points all lie close to the line X(2k) = X(2k + 1) because of the correlation of the process.
$$\sigma_{Y_0}^2 = E\left\{\left(\frac{X(2k) + X(2k+1)}{\sqrt{2}}\right)^{\!2}\right\} = \frac{1}{2}E\{X(2k)^2\} + E\{X(2k)\,X(2k+1)\} + \frac{1}{2}E\{X(2k+1)^2\}$$
$$= \frac{1}{2}\sigma_x^2 + \Gamma_x(1) + \frac{1}{2}\sigma_x^2 = \sigma_x^2 + \rho\,\sigma_x^2 = (1+\rho)\,\sigma_x^2$$
$$\sigma_{Y_1}^2 = E\left\{\left(\frac{-X(2k) + X(2k+1)}{\sqrt{2}}\right)^{\!2}\right\} = (1-\rho)\,\sigma_x^2$$
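A short numerical check of these two variances (a sketch; it reuses the AR(1) simulation idea from Section 3.1.1 with ρ = 0.9 and σ_x = 1):

import random

random.seed(3)
rho, N = 0.9, 100_000
sigma_w = (1 - rho ** 2) ** 0.5
x, xs = 0.0, []
for _ in range(2 * N):
    x = rho * x + random.gauss(0.0, sigma_w)
    xs.append(x)

# Apply the normalized 2x2 transform to each pair [X(2k), X(2k+1)]
y0 = [(xs[2 * k] + xs[2 * k + 1]) / 2 ** 0.5 for k in range(N)]
y1 = [(-xs[2 * k] + xs[2 * k + 1]) / 2 ** 0.5 for k in range(N)]
var = lambda v: sum(t * t for t in v) / len(v) - (sum(v) / len(v)) ** 2
print(round(var(y0), 2), 1 + rho)   # close to (1 + rho) * sigma_x^2 = 1.9
print(round(var(y1), 2), 1 - rho)   # close to (1 - rho) * sigma_x^2 = 0.1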

from which we get
$$H = \frac{1}{2}\sum_{n=0}^{1} H_n = \frac{1}{2}\left[\frac{1}{2}\log_2\!\left(\phi_0\,\frac{\sigma_{Y_0}^2}{\sigma_\varepsilon^2}\right) + \frac{1}{2}\log_2\!\left(\phi_1\,\frac{\sigma_{Y_1}^2}{\sigma_\varepsilon^2}\right)\right]$$
$$= \frac{1}{2}\log_2\sqrt{\phi_0\,\phi_1\,(1+\rho)(1-\rho)\,\frac{\sigma_x^2\,\sigma_x^2}{(q^2/12)^2}} = \frac{1}{2}\log_2\!\left(\sqrt{\phi_0\,\phi_1}\,\sqrt{1-\rho^2}\,\frac{\sigma_x^2}{q^2/12}\right) = \frac{1}{2}\log_2\!\left(\sqrt{\phi_0\,\phi_1}\,\sqrt{1-\rho^2}\,\frac{\sigma_x^2}{\sigma_\varepsilon^2}\right)$$


and so the coding gain is equal to
$$G_T = \frac{\phi}{\sqrt{\phi_0\,\phi_1}\,\sqrt{1-\rho^2}}$$
The stronger the correlation (ρ close to 1), the larger the coding gain.
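For instance (a small numerical illustration added here, assuming φ_0 = φ_1 = φ): with ρ = 0.9,
$$G_T = \frac{1}{\sqrt{1 - 0.9^2}} \approx 2.3,$$
which by the rate formula corresponds to a saving of about (1/2) log_2 G_T ≈ 0.6 bit per sample at equal distortion.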

4.3 Prediction Coding

Prediction coding (predictive coding) is an alternative to transform coding. The principle is to encode (quantization and entropy coding) the difference between the sample X(n) to be transmitted and a predicted value X̂(n) computed from the samples ..., X̃(n − 3), X̃(n − 2), X̃(n − 1) that are available at the decoder.
The goal of the predictor is to keep in memory a certain number of samples preceding X(n), as they were received at the decoder (this is why the coder contains a prediction loop). Because of this loop on the quantization error, the coding gain is more complicated to compute.
In general, the coefficients of the predictor are computed by minimizing the variance of the sequence Y(n); a minimal sketch of the loop is given below.
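A minimal Python sketch of this principle (an order-1 predictor with an assumed coefficient a = 0.9; the quantization step q is arbitrary). The encoder predicts from the previously reconstructed sample, exactly as the decoder will, so both stay synchronized:

def dpcm_encode(samples, q, a=0.9):
    x_rec = 0.0                      # last reconstructed sample, as known by the decoder
    indices = []
    for x in samples:
        x_hat = a * x_rec            # prediction of X(n) from the decoded past
        k = round((x - x_hat) / q)   # quantize the prediction error Y(n)
        indices.append(k)
        x_rec = x_hat + k * q        # reconstruction used for the next prediction
    return indices

def dpcm_decode(indices, q, a=0.9):
    x_rec, out = 0.0, []
    for k in indices:
        x_rec = a * x_rec + k * q
        out.append(x_rec)
    return out

sig = [0.0, 0.4, 0.9, 1.1, 1.0, 0.7]
idx = dpcm_encode(sig, q=0.1)
print(idx, [round(v, 2) for v in dpcm_decode(idx, q=0.1)])   # reconstruction tracks the input within about q/2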

4.4 Hybrid coding


In the case of video signals, we generally procede to a predictive coding in the time (we encode the differences
between the successive images, using a first order predictor) and a transform coding in the space (the
differences between images are encoded by linear transform, quantization and entropy coding).
