Part I :
Source Coding Engineering
Benoît MACQ
benoit.macq@uclouvain.be
Contents

Preface
References

I Syllabus

2 Entropy coding
2.1 Theoretical limits
2.2 Coding memoryless sources
2.2.1 Shannon-Fano's Code
2.2.2 Huffman's Code
2.2.3 Remarks about Shannon-Fano's and Huffman's Codes
2.2.4 Adaptive Huffman's Code
2.3 Dictionary Coding
2.4 Universal Codings
2.4.1 Arithmetic Coding
2.4.2 UVLC : Universal Variable Length Coding

3 Quantization
3.1 Scalar quantization
3.1.1 Reminder on random sequences
3.1.2 Noise features
Preface
The course on Information Theory and Coding is in three parts. The first one is about the efficient and secure representation of the sources (messages). Some concepts of Information Theory will be briefly discussed as well.
In the second part, we will study how to encode the information in order to overcome transmission errors, and see some theoretical elements together with their applications.
The third part presents the general principles and cryptographic tools used to protect the information (authenticity, confidentiality, integrity, . . . ). These last two parts are not included in these notes, and will be presented respectively by J. Louveaux and O. Pereira.
References
1. HISTORICAL REFERENCES
In the early 1940s, Claude E. Shannon developed a mathematical theory, called information theory, dealing with the most fundamental aspects of digital communication systems.
The first article about Information Theory and Coding appeared more than 60 years ago: "A Mathematical Theory of Communication", published in 1948 in the Bell System Technical Journal, vol. 27, pp. 379-423 and pp. 623-656.
Part I
Syllabus
Chapter 1
• the "waveforms", which are analog signals x(t) (representing for instance a voice, a music or an image signal). In that case, we accept that the received signal x0(t) can differ from x(t). The relation between the binary rate needed to represent the signal x(t) and the distortion in x0(t) is the topic of Chapter 4.
Let's notice that waveform sources are most often digitized before being encoded. The digitizing of a waveform takes two steps:
• the discretization consists in sampling the waveform at a suitable frequency. This frequency has to be at least twice the highest frequency of the signal x(t). Conversely, if the sampling frequency is given, we eliminate from the signal x(t) every frequency higher than half the sampling frequency. The signal x(t) is thus replaced by a sequence of real numbers: xe(0), xe(1), . . . , xe(n), . . .
• the quantization consists in restricting the possible values of the samples xe(n) to an alphabet {x0, x1, . . . , xD−1}. By doing so, the signal is represented as a discrete numerical source Xe(0), Xe(1), . . . , Xe(n), . . .
The digitizing of a waveform is illustrated by the following figure.
We can see that, because of the quantization, the signal reconstruction from the samples Xe(n) cannot be perfect. However, as described in the distortion-rate problem of Chapter 4, the distortion introduced by the digitizing of the signal is most often negligible compared to the distortions introduced by the rate-reducing coding (i.e. the compression).
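The two digitizing steps can be sketched numerically. The sampling rate, tone frequency and number of levels below are arbitrary illustration values of mine, not taken from the course:

```python
import numpy as np

# Sampling: x(t) is a 1 kHz tone sampled at fs > 2 * 1000 Hz (Nyquist condition).
fs = 8000
t = np.arange(0, 0.01, 1 / fs)
x = np.sin(2 * np.pi * 1000 * t)          # the sample sequence x_e(n)

# Quantization: restrict the samples to an alphabet of D levels on [-1, 1].
D = 8
levels = np.linspace(-1, 1, D)            # alphabet {x_0, ..., x_{D-1}}
indices = np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)
x_q = levels[indices]                     # the discrete numerical source X_e(n)

# Nearest-level quantization keeps the error within half a step.
step = levels[1] - levels[0]
assert np.max(np.abs(x - x_q)) <= step / 2 + 1e-12
```

The reconstruction error above is exactly the quantization distortion discussed in the text: bounded, and controlled by the number of levels D.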
The channel through which messages are sent to the receiver is in general an analog medium (a carrier current in a telephone pair, an electromagnetic wave in the air, or a light wave carried in an optical fiber), and thus a necessary step consists in constituting an electric signal from the coding output: we call this operation the modulation. Its aim is to maximize the occupation of the channel while simultaneously minimizing the emitted power and the transmission errors.
The modulation and demodulation processes won't be discussed here; see the Telecommunication Course for more details. From a practical point of view, the modulator, the physical channel and the demodulator together form a digital channel. In this course, we will assume that the channels are stationary and memoryless, and simply characterized by an input-output relation as defined in the following image.
Because of the perturbations on the physical channel and the limits related to the modulation-
demodulation operations, some errors can occur during the transmission of the discrete numerical
events. These transmission errors can be characterized by a matrix containing the conditional
probabilities.
pi|j = P (X(n) = xi |Y (n) = xj )
Each element of this matrix contains the probability that the event X(n) = xi was sent on the
channel, when we know that the event Y (n) = xj was actually received at the other end of the
channel.
In the particular case where the transmission channel never introduces errors, the matrix becomes diagonal (pi|i = 1 and pi|j = 0 if i ≠ j).
We say that the channel is stationary if the pi|j are time-independent (the index n above) and independent from the discrete values X(n − k) sent previously. It's quite a restrictive assumption with regard to real systems (a cell phone, for example, might have time-varying transmission features when we drive a car).
The memoryless binary symmetric channel has been widely studied in Information Theory. This kind of channel has an error rate ε (typically 10−9 for optical fibers, 10−4 to 10−6 for satellite transmissions).
a. its capacity R (for "Rate"), generally expressed in bits/s, is limited by a maximal value;
b. its security is limited too, because an enemy can listen to messages (confidentiality problem between the emitter and the receiver), send messages himself using the identity of the real emitter (authentication problem), or modify the transmitted messages (integrity problem);
c. its reliability is limited, and described by the error rate (ε for the binary symmetric memoryless channel).
a. the source coding gives us a very compact representation of the message, requiring a rate lower than the channel capacity;
b. the cryptographic coding, which ciphers the message (confidentiality problem) and signs it (authentication and integrity problems), solves the security problem;
c. the channel coding solves the transmission errors problem by performing an error-correcting coding (or simply by detecting the errors).
In the first part of this course, we will assume that the channel together with its error-correcting codes constitutes a new channel, with a rate slightly lowered by those codes, but that now provides an errorless transmission.
It's important to notice that the cryptographic coding (or, in any case, the ciphering) produces an output of pseudo-random signals from which it is very hard to recover the original message without access to the decoding process. The aim of the source coding is essentially to reduce the rate, and it is based on the redundancy of the message. It is thus mandatory to execute the ciphering after the source coding: indeed, the ciphered output looks totally random, so no redundancy would be left for a subsequent source coder to exploit.
The three kinds of coding will be performed in the order illustrated in the following image:
- the lossless coding, where the number of bits representing the source depends on the redundancy of that source. In general, this coding will be a variable rate coding.
- the lossy coding, only for waveforms, is based not only on redundancy suppression, but on information suppression too; this is why it is called lossy. In that case, it's possible to construct a fixed-rate coding (we suppress information to reach the desired rate). The loss of information introduces distortions in the reconstructed signal. The rate-distortion compromise for waveforms is studied in Chapter 4.
- let [X1(n) = xk] and [X2(m) = xl] be two events of two discrete numerical sources;
- let's assume that these two events are independent (P being the probability):
the amount of information related to the joint event is written I([X1(n) = xk and X2(m) = xl]) and must be the sum of the two individual amounts of information.
The inverse proportionality between the amount of information and the probability, together with the additivity for two independent events, leads to the choice of the function log(1/x).
I([X(n) = xk]) = log (1 / P(X(n) = xk)) = −log (P(X(n) = xk))

Writing pk = P(X(n) = xk), the amount of information of the event [X(n) = xk] is

I(X(n) = xk) = −log pk
The mean information associated to the instances of X(n) is given by

H(X(n)) = Σ_{k=0..D−1} pk I(X(n) = xk) = −Σ_{k=0..D−1} pk log pk

and is defined as the entropy of X(n). We can give several interpretations of the entropy:
• the mean amount of information that an event brings (a rare event brings more information than a frequent one);
• the uncertainty of the outcome of an event (systems with a very frequent event have less entropy than systems with plenty of equiprobable events);
• the minimal number of binary digits needed on average to represent a message in a unique way (k binary digits can represent 2^k messages, and M messages require ⌈log2 M⌉ bits).
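The entropy formula above translates directly into a few lines of code (the probability vectors are arbitrary examples of mine):

```python
import math

# Direct transcription of H = -sum p_k log2 p_k; terms with p_k = 0 contribute 0.
def entropy(p):
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

print(entropy([0.5, 0.5]))    # fair binary source: the maximum, 1.0 bit
print(entropy([0.9, 0.1]))    # biased source: about 0.469 bit
print(entropy([0.25] * 4))    # D equiprobable values: the bound log2(D) = 2.0 bits
```

The three calls illustrate the interpretations above: a rare/frequent split carries less mean information than an equiprobable one.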
issued from the same discrete numerical source as X(n), where every variable takes its values in the alphabet {x0, x1, . . . , xD−1} with the probabilities p0, p1, . . . , pD−1.
The probability of this joint event can be written p_{k−N/2,...,k,...,kN/2}, and we can measure the amount of information related to this instance of the random vector by

H([X(−N/2) . . . X(N/2)]) = −Σ_{k−N/2=0..D−1} . . . Σ_{kN/2=0..D−1} p_{k−N/2,...,kN/2} log p_{k−N/2,...,kN/2}
The entropy of the discrete numerical source . . . , X(0), X(1), . . . , X(m), . . . will be measured by

HX = lim_{N→∞} 1/(N+1) · H([X(−N/2) . . . X(0) . . . X(N/2)])

When the variables are independent and identically distributed, this reduces to HX = H(X(m)).
In most cases, numerical sources aren't composed of independent variables (however, the variables are most often identically distributed).
In practice, to measure the entropy, we apply a transform to the source in order to get independent variables, and then we measure the entropy of that representation.
The function H(X(n)) (represented as a function of p0 in the above image) takes its maximal value when p0 = p1 = 1/2 and is then equal to H(X(n)) = log(2). Indeed, at this point we have

−(1/2) log(1/2) − (1/2) log(1/2) = −log(1/2) = log(2)
In general, we normalize the log function measuring the entropy in such a way that the maximal entropy of a binary variable takes the value 1 (expressed in bits). So, the logarithms are always in base 2:

log ≡ logx / logx(2), so that log(2) = 1
It's important to notice that, in the case of a random binary variable, the entropy is maximal for p0 = p1 = 1/2, while the amount of information related to the event [X(n) = 0] on the contrary grows when p0 decreases. However, when p0 is small, [X(n) = 1] contains very little information, and the event [X(n) = 0] is very rare and so doesn't contribute much to the mean information. Moreover, a certain event (p0 = 0 or p0 = 1) contains absolutely no information.
Conclusion: the binary variable taking the value 1 if it is raining and 0 if it is not has a weak entropy in the desert but a maximal entropy in Belgium.
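A quick numerical check of this discussion of the binary entropy function (the sketch is mine, not part of the original notes):

```python
import math

# The binary entropy H(p0) = -p0 log2 p0 - (1 - p0) log2 (1 - p0).
def h2(p0):
    return -sum(p * math.log2(p) for p in (p0, 1 - p0) if p > 0)

assert h2(0.5) == 1.0                 # maximum at p0 = p1 = 1/2
assert h2(0.0) == 0.0                 # a certain event carries no information
assert h2(0.1) < h2(0.3) < h2(0.5)    # entropy grows toward p0 = 1/2
```

Note how h2(0.1) is small even though the rare event itself carries a lot of information: the rare event almost never contributes.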
Proof:
Using the definition of the entropy, the fact that the sum of the probabilities is equal to 1, and the fact that log(x) + log(y) = log(xy), we have

H(X(n)) − log D = −Σ_{k=0..D−1} pk log pk − Σ_{k=0..D−1} pk log D = Σ_{k=0..D−1} pk log (1/(pk D))
If we look at the following image, it's easy to observe that ln z ≤ z − 1 for all z > 0, with equality when z = 1.
Applying this inequality with z = 1/(pk D), where k′ runs over the indexes for which pk′ ≠ 0 (and converting ln into log via the factor log(e)), we find the desired relation, because

H(X(n)) − log D ≤ log(e) Σ_{k′=0..D−1} pk′ (1/(pk′ D) − 1) = log(e) [Σ_{k′=0..D−1} 1/D − Σ_{k′=0..D−1} pk′] = log(e) [Σ_{k′=0..D−1} 1/D − 1] ≤ 0

since, if pk ≠ 0 for all k, Σ_{k′=0..D−1} 1/D = 1, and otherwise Σ_{k′} 1/D < 1.
Equality holds when pk = 1/D for all k, and the reciprocal implication is easy to prove. This concludes the proof.
This bound tells us that the amount of information contained in X(n) is at most equal to the number of bits necessary for its natural representation. Indeed, this bound can intuitively be explained by the fact that the worst we could do is to assign log2 D bits to each value. So, a sequence of ASCII symbols coded with 8 bits (256 possible values) has an entropy at most equal to 8 bits. We will see in Chapter 2 how to modify the coding of the random variables in order to get a representation whose rate is closer to the entropy.
The mean of the conditional entropy gives a measure of the ambiguity between the random variable at the input and the (known) random variable at the output. The formula is:

H(X(n) | Y(n)) = −Σ_{ℓ=0..D−1} pℓ Σ_{k=0..D−1} pk|ℓ log2 pk|ℓ = −Σ_{ℓ=0..D−1} Σ_{k=0..D−1} pk,ℓ log2 pk|ℓ
Shannon has shown that the capacity of a channel is equal to the maximal value of the quantity H(X(n)) − H(X(n) | Y(n)) over all the probability distributions of x0, x1, . . . , xD−1 at the input of the channel. This quantity is called the mutual information and will be presented in detail in a further paragraph. We see immediately that, in the case of a perfect channel, H(X(n) | Y(n)) = 0.
Theorem:
The capacity of a channel with a binary input variable such that px(0) = px(1) = 1/2 is equal to

1 + ε log2 ε + (1 − ε) log2(1 − ε)

Moreover, if there isn't any perturbation, the capacity is 1.
Proof:
Indeed, on the one hand,

H(X(n)) = −Σ_{k=0..1} pk log2 pk = −2 · (1/2) log2(1/2) = log2 2 = 1

so

H(X(n)) − H(X(n) | Y(n)) = 1 + ε log2 ε + (1 − ε) log2(1 − ε)
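The capacity formula just derived is easy to evaluate numerically; the error rates below echo the orders of magnitude quoted earlier for fibers and satellite links:

```python
import math

# C = 1 + eps*log2(eps) + (1 - eps)*log2(1 - eps) for the binary symmetric channel.
def bsc_capacity(eps):
    c = 1.0
    for p in (eps, 1 - eps):
        if p > 0:
            c += p * math.log2(p)
    return c

assert bsc_capacity(0.0) == 1.0          # no perturbation: capacity 1 bit per use
assert abs(bsc_capacity(0.5)) < 1e-12    # pure noise: capacity 0
assert bsc_capacity(1e-4) > 0.99         # satellite-like error rate: near 1
assert bsc_capacity(1e-9) > bsc_capacity(1e-4)   # fiber-like: even closer to 1
```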
Applying again ln z ≤ z − 1, this time with z = px / px|y :

H(X) − H(X|Y) ≥ −log2(e) Σ_{x,y} px,y (px / px|y − 1)
= log2(e) Σ_{x,y} px,y (1 − px py / px,y)      (by Bayes' rule, px|y = px,y / py)
= log2(e) [Σ_{x,y} px,y − Σ_{x,y} px py]
= log2(e) [1 − (Σ_x px)(Σ_y py)]
= 0

since both of the last sums equal 1.
Theorem:
We now focus on the joint entropy: H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y). Intuitively, in order to describe X and Y, we first describe X and then Y given X.
Proof:

H(X, Y) = −Σ_{x,y} px,y log2 px,y
= −Σ_{x,y} px,y log2 (px py|x)
= −Σ_x px log2 px (Σ_y py|x) − Σ_{x,y} px,y log2 py|x      (where Σ_y py|x = 1)
= H(X) + H(Y|X)
Corollary:
In the same way, we deduce the symmetric relation H(X, Y) = H(Y) + H(X|Y).

The mutual information (MI) between X and Y is defined by

I(X; Y) = Σ_{x,y} px,y log2 (px,y / (px py))

where px,y is the joint probability, and px and py are the marginal probabilities. When we use the logarithm in base 2, the unit of the MI is the bit.
We can also express the mutual information by using the entropies:

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)

The last equation tells us that the MI is equal to zero when the two variables X and Y are totally independent (H(X) = H(X|Y) and H(Y) = H(Y|X)). On the contrary, if X and Y are equal, the MI is maximal (H(X|X) = 0). So, the MI is a positive quantity that is never greater than the entropy of the random variable. Intuitively, no variable can give us more information about X than X itself (I(X; Y) ≤ I(X; X)).
The MI is most often expressed with the joint entropy H(X, Y) rather than the conditional entropies:
I(X; Y ) = H(X) + H(Y ) − H(X, Y )
Interpretation. As said before, the mutual information gives a measure of a channel's capacity. Moreover, it represents the shortening of the description of X when we know Y, or the amount of information that Y gives about X.
In practice, the probability distribution px is given by the histogram of the realizations of X, and px,y is computed by using the joint histogram, which gives the number of instances of each pair of values (x, y). If X can take m different values and Y can take n different values (the lengths of px and py are respectively m and n), the size of px,y will be m × n.
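The histogram-based estimation just described can be sketched as follows (the helper name and the toy samples are mine):

```python
import math
from collections import Counter

# I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ), with the joint and
# marginal distributions estimated from histograms of paired samples.
def mutual_information(xs, ys):
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

xs = [0, 1, 0, 1, 0, 1, 0, 1]
assert abs(mutual_information(xs, xs) - 1.0) < 1e-12   # Y = X: MI = H(X) = 1 bit
assert abs(mutual_information([0, 0, 1, 1], [0, 1, 0, 1])) < 1e-12   # independent: MI = 0
```

The two assertions check the extreme cases discussed above: identical variables (MI equals the entropy) and independent ones (MI is zero).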
At the receiver, Y and Z have different entropies. The MI between Y and Z represents the information of X that is still present in both Y and Z.
EXAMPLE 3: Let X and Y be two different observations of the same phenomenon, for example two photographs of the same object. In that case, the two variables X and Y should contain similar information (H(X, Y) = H(X) = H(Y) in the best case). However, the transmission channels have different features (time, exposure, object position...), so the MI decreases after the capture.
If we have two images of the same object in different positions, the entropies H(X) and H(Y) are equal (same object), but the change in position reduces the MI, because the joint entropy increases. Indeed, as shown underneath, the joint histogram becomes larger when the images are not taken in the same way (i.e. when the ball is seen from another point of view).
D(X, Y) = 1 − I(X; Y) / max(H(X), H(Y))

We can find the best decomposition of X into independent components by minimizing this expression, which is a measure of the information shared by the different components.
Another application is image registration: the aim of this method is to find the best alignment (matching) between two images. When we take multiple captures of the same object or scene, it is useful to place all the images in the same spatial reference frame, in order to analyze the pixels of the different images at the same time.
In the medical sector, for example, different modalities are used. We may want to see a CT image (X-rays) and a PET image (Positron Emission Tomography) at the same time, in order to compare them. However, the two images don't come from the same equipment, they aren't in the same spatial reference frame, and we don't know the matching between the pixels. The aim of registration is to align one of the images (called the moving image) on the other (called the fixed image).
The distance D(X, Y) between two images can be used to find the best transformation between these images. Indeed, the parameters for which the distance is the lowest will maximize the information shared by both images, giving the transformation with the best matching.
Then, knowing the transformation, we can apply it to the moving image in order to get the fusion between the different modalities.
Chapter 2
Entropy coding
The goal of this chapter is to propose coding methods for discrete numerical sources that minimize the number of bits required to encode them. The chosen strategy is to employ variable length codes, associating short words to the most probable events. Let's illustrate this with an example. Consider the instances of several variables of a discrete numerical source:

0 0 0 −1 0 0 2 0 0 1 0 0 0

We could encode these variables with a fixed length code, by allocating 3 bits to each variable; this would take 39 bits to encode the 13 variables we observed:

000 000 000 101 000 000 010 000 000 001 000 000 000
 0   0   0  -1   0   0   2   0   0   1   0   0   0
We could improve our coding with a variable length coding, such that the code's length is inversely proportional to the probability of the event:

[X(i) = 0]  is encoded by  0
[X(i) = 1]  is encoded by  10
[X(i) = −1] is encoded by  110
[X(i) = 2]  is encoded by  1110
[X(i) = −2] is encoded by  1111

In general, [X(i) = xk] is encoded by the code word Ck, with length Lk.
contrary, with C = {0, 01, 001}, if we receive the code 001001, we can parse it in two different ways: 0|01|001 or 001|001. This code cannot be decoded in a unique way.
In this way, the decoder can read the compressed code bit by bit, and then simply output an
event [X(i) = xk ] each time a code word is completed.
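This bit-by-bit decoding can be sketched with the code of the table above (the dictionary-based implementation is an illustration of mine, not the course's):

```python
# The code of the table above; no code word is a prefix of another, so the
# decoder emits an event exactly when the accumulated bits match a word.
CODE = {"0": 0, "10": 1, "110": -1, "1110": 2, "1111": -2}

def decode(bits):
    events, word = [], ""
    for b in bits:
        word += b
        if word in CODE:          # a code word is completed
            events.append(CODE[word])
            word = ""
    assert word == "", "truncated stream"
    return events

# The 13-sample example sequence costs 19 bits here instead of 39.
bits = "0001100011100010000"
assert decode(bits) == [0, 0, 0, -1, 0, 0, 2, 0, 0, 1, 0, 0, 0]
assert len(bits) == 19
```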
The features of variable length codes can be easily understood with this example.
1. The variable length coding gives a variable rate: the rate increases when improbable events occur, i.e. when there is much information. Fax transmission is a good example: the page rate is slowed down when there is information, in order to adapt the rate increase to the channel capacity.
2. The variable length coding is very weak in presence of transmission errors. An error in the compressed code desynchronizes the decoding. This is why a variable length coding has to be followed by an efficient channel coding, with an error-correcting code, and above all an efficient strategy for resynchronization (regular insertion of synchronization words).
3. The use of variable length coding and decoding requires storing tables of variable length codes, and efficient mechanisms for the synchronization and timing of the coder/decoder.
We see that the lower limit for the mean rate of a variable length code of a random variable X(n) is given by the entropy H(X(n)).
is encoded by a code word Ck−N/2,...,k,...,kN/2 that is optimal if its length is close to −log2 of the vector's probability, and we can show that the mean rate then tends to HX when N tends to infinity.
In general, we apply entropy coding to random variables that have been made more or less independent, so that we can encode each variable independently (HX = H(X(n))). An optimal individual code for X(n) will then yield an optimal code for the vectors of the source.
The coding strategy for sources constituted of independent events X(n) is to build binary tree structures, where each bifurcation of the tree corresponds to an event as equiprobable as possible.
Let's notice that the condition Lk = −log2(pk) is in general impossible to satisfy, because −log2(pk) is very rarely an integer. In particular, if one of the events [X(n) = x0] is very probable, then −log2(p0) will be much lower than 1.
In that case, we must construct vectors of a certain length in order to have well-conditioned probabilities. We can demonstrate that even with independent-events sources, it is always more advantageous to encode vectors than individual variables.
• Shannon-Fano’s method
• Huffman’s method
The two methods construct codes Ck where each bit corresponds to an equiprobable binary event. So, we associate with each bit a cut of the alphabet into two sub-alphabets such that:

P(X(n) = xi such that xi ∈ alphabet 1) ≜ P(alphabet 1) ≈ P(X(n) = xj such that xj ∈ alphabet 2) ≜ P(alphabet 2)
We obtain an alphabet {x′0, x′1, . . . , x′D−2} where the last element, x′D−2, has the probability p′D−2 = pD−2 + pD−1, which may not be the lowest probability. In order to reiterate the process, we must sort again by decreasing order of probability. Then, with the new set {x′′0, x′′1, . . . , x′′D−3, x′′D−2}, we can regroup the two last elements, hoping their probabilities are quite close. We reiterate this process until there are only two events left.
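The merging procedure just described can be sketched as follows; the heap-based implementation and its tie-breaking are mine, applied to the six-symbol example with probabilities 0.3, 0.25, 0.15, 0.15, 0.10, 0.05 discussed in this section:

```python
import heapq

# Repeatedly merge the two least probable entries; the code words of the two
# merged subtrees are prefixed with 0 and 1 respectively.
def huffman(probs):
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)                  # unique tie-breaker for equal probabilities
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {"s1": 0.30, "s2": 0.25, "s3": 0.15, "s4": 0.15, "s5": 0.10, "s6": 0.05}
code = huffman(probs)
rate = sum(probs[s] * len(code[s]) for s in probs)
assert abs(rate - 2.45) < 1e-9     # the optimal mean rate, close to H = 2.39
```

Any valid Huffman tree for these probabilities yields the same mean rate, 2.45 bits, even though the individual code lengths may differ between trees.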
This code can be represented as a tree in which we find the code related to each si : s1 = 00, s2 =
10, s3 = 11, s4 = 010, s5 = 0110, s6 = 0111.
We can observe that this coding satisfies the prefix condition and is an entropy coding. That will
always be the case with Huffman’s code.
In this example, the probabilities of the two least probable events are not equal, so the code is slightly suboptimal: the rate and the entropy differ. However, they are quite close, because

R = Σ_k pk Lk = 0.3 · 2 + 0.25 · 2 + 0.15 · 2 + 0.15 · 3 + 0.10 · 4 + 0.05 · 4 = 2.45

H = −Σ_k pk log2 pk = −0.3 log2 0.3 − 0.25 log2 0.25 − 2 · 0.15 log2 0.15 − 0.1 log2 0.1 − 0.05 log2 0.05 = 2.39
So, if p0 = 0.7, a bit will be used at the highest level for two events with probabilities 0.7 and 0.3. On the contrary, by creating a vector of two variables (p0 · p0 = 0.49), the situation improves. We can easily show that even in the case of memoryless sources, we always increase the efficiency of a variable length code by vectorization.
When an event x0 has a very high probability, we can do a simple vectorization by coding a run of x0 x0 . . . x0 and then coding the instances of the other values word by word. This kind of coding is called run length coding, and is essential to encode the most probable value of binary events like the series of pixels (0 or 1) of a (black and white) page to transmit by fax.
We will get back to this point in the description of the entropy coding of the coefficients of a
linear transform.
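A minimal run-length sketch for a fax-like binary row (the pair format is an illustration of mine; real fax coders then entropy-code the run lengths):

```python
# Each run of identical bits is replaced by the pair (value, length); for a
# typical fax row, long runs of the dominant value collapse to short pairs.
def run_lengths(bits):
    runs, current, count = [], bits[0], 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append((current, count))
            current, count = b, 1
    runs.append((current, count))
    return runs

row = [0] * 9 + [1] * 2 + [0] * 5
assert run_lengths(row) == [(0, 9), (1, 2), (0, 5)]
```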
Implementation. When we have to carry out the entropy coding of a memoryless source, we must first estimate the probabilities pk. This estimation can be done by computing the frequencies observed over a large number of variables X(n):

pk ≈ nk / Ntot

where nk is the number of observations of X(n) = xk and Ntot is the total number of observations.
While coding, it's important to keep in mind that if the probability distribution changes, we must update the variable length code and send it to the decoder before sending the compressed words. There exist many adaptive Huffman codes. We will see one of them.
The weight of the node (in the tree) representing the last encoded symbol is increased, and the corresponding part of the tree is adapted. Consequently, the tree gets closer and closer to the current symbol distribution. However, the tree depends on the past, and doesn't reflect the real distribution.
Let's see an example. Assume we want to encode the sequence: aaa aaa aaa bcd bcd bcd. The following image illustrates the gradual construction of the code.
We notice that the Huffman tree must be updated again after the coding of the last symbol. Indeed, as shown in the image, the symbol @ is higher in the tree than the symbols c and d, though @ has a lower frequency (5) than c and d (6). So, at this point, the tree no longer satisfies the rule stating that the more frequent symbols must have the shorter codes. This is why we update the tree.
The advantage of the adaptive Huffman code is that we don't need to transmit the Huffman tree, because it can be deduced from the message. The decoding proceeds by gradually reconstructing the Huffman tree. Let's look again at the sequence aaa aaa aaa bcd bcd bcd. With the "classical" Huffman code, it would take 33 bits to encode the sequence plus 41 bits to encode the Huffman tree (ASCII symbols + coding of each symbol), i.e. 74 bits, while with the dynamic Huffman code, 67 bits are enough (the code of the first symbol @ is assumed to be known by the decoder).
• i represents the distance between the start of the buffer and the position of the repetition;
• c gives the first symbol of the buffer that differs from the corresponding sequence in the dictionary.
The sequence being AABCBBABC (positions 1 to 9; "|" marks the current coding position):

step  position  repetition  next  output   encoded so far
2     2         A           B     (1,1,B)  A AB | CBBABC
3     4         —           C     (0,0,C)  AABC | BBABC
4     5         B           B     (2,1,B)  AABC BB | ABC
5     7         AB          C     (5,2,C)  AABC BB ABC |
Let's notice that we can't use a repetition to encode the last symbol (here, we could otherwise have taken the repetition of ABC). It must instead be encoded with the aid of the third field of the output.
Another remark is that if we limit the number of bits allocated to the position coding, we restrict the search range for repetitions.
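The dictionary coding steps above can be sketched as follows; the greedy longest-match search is mine and deliberately naive:

```python
# Emit (distance, length, next symbol) triples; the last symbol is always
# sent in the third field, as noted above, hence the "len(s) - 1" bound.
def lz77(s):
    out, i = [], 0
    while i < len(s):
        best_d, best_l = 0, 0
        for d in range(1, i + 1):          # try every back-reference distance
            l = 0
            while i + l < len(s) - 1 and s[i + l - d] == s[i + l]:
                l += 1
            if l > best_l:
                best_d, best_l = d, l
        out.append((best_d, best_l, s[i + best_l]))
        i += best_l + 1
    return out

assert lz77("AABCBBABC") == [(0, 0, "A"), (1, 1, "B"), (0, 0, "C"),
                             (2, 1, "B"), (5, 2, "C")]
```

The first triple (0,0,A), which encodes the very first symbol literally, precedes the steps shown in the trace above; the remaining triples match it exactly.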
symbol       a         b           c           d           e           f
probability  0.1       0.1         0.1         0.2         0.4         0.1
interval     [0, 0.1[  [0.1, 0.2[  [0.2, 0.3[  [0.3, 0.5[  [0.5, 0.9[  [0.9, 1[

This table represents the correspondence that must be sent to the receiver.
Let's see the algorithms for coding and decoding. We define Inf as the origin of the sub-interval, Sup as its end, and Size as its width. The goal of the arithmetic coding is to find the interval [Inf, Sup[ corresponding to the sequence of N events that forms the vector to encode. To get the values of Inf and Sup, follow the procedure, where Inf(s) and Sup(s) denote the bounds of the interval of symbol s in the table:

→ Coding : initialization
Inf = 0
Sup = 1
WHILE there is a symbol s to encode
Size = Sup − Inf
Sup = Inf + Size · Sup(s)
Inf = Inf + Size · Inf(s)
END WHILE
Return α such that Inf ≤ α < Sup
Besides, the arithmetic coding gives a procedure to encode the interval [Inf, Sup[ by giving a number α in binary representation that points to [Inf, Sup[ without any ambiguity.
To illustrate this algorithm, let's take the sequence "bebecafdead" with the probabilities of the letters a to f previously given.
From then on, the sequence "bebecafdead" is encoded as a real number between 0.15633504384 and 0.1563350464, for example α = 0.156335045. We get a coding more efficient than Huffman's code, because we can spend a non-integer number of bits per symbol. This difference in efficiency is even more important if one of the probabilities is greater than 0.5.
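The interval-narrowing procedure can be checked on the "bebecafdead" example (floating-point arithmetic here, for illustration; practical coders use integer registers):

```python
# The symbol table sent to the receiver, then the interval-narrowing loop.
INTERVALS = {"a": (0.0, 0.1), "b": (0.1, 0.2), "c": (0.2, 0.3),
             "d": (0.3, 0.5), "e": (0.5, 0.9), "f": (0.9, 1.0)}

def encode(message):
    inf, sup = 0.0, 1.0
    for sym in message:
        size = sup - inf                  # Size = Sup - Inf
        lo, hi = INTERVALS[sym]
        inf, sup = inf + size * lo, inf + size * hi
    return inf, sup                       # any alpha in [Inf, Sup[ encodes it

inf, sup = encode("bebecafdead")
assert inf <= 0.156335045 < sup           # the alpha proposed in the text fits
assert abs(inf - 0.15633504384) < 1e-12   # the bounds quoted in the text
assert abs(sup - 0.1563350464) < 1e-12
```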
In the previous example, we notice that each real in the interval [0.15633504384, 0.1563350464[ represents an infinite sequence that starts with "bebecafdead". In order to tell the decoding procedure when to stop, we must:
• either give the number of symbols to decode (typically at the beginning of the compressed file, or as an integer part),
• or use a special symbol (like EOF) added at the end of the message to encode, and given the lowest probability.
The decoding procedure is based on the fact that the bounds of the first symbol to decode contain the number α. The decoding also uses two registers Inf and Sup, updated according to the following procedure:

→ Decoding : initialization
Input : α ∈ [0, 1[ % number to decode
Let's consider the vector X(4) = [0 0 1 X X]. The X's represent the bits appearing after the first encountered "1". The probability of each X is 50% because each bit is equiprobable in a binary word. However, the zeros before the first 1 represent very redundant information.
The UVLC principle is that we won't try to encode the less significant bits (the X's), called LSB (Least Significant Bits): their probability is 50%, so there is no coding gain. On the opposite, the position of the first "1", called the MNZSB, is worth coding.
The coding is carried out line after line. We count the number of "0"s appearing before a MNZSB, and call this number the Run Length (RL). We will explain later how to code a RL, using a parameter mi that depends on the index of the line being encoded. Once the position of the MNZSB is known, we send the LSBs of the column without coding. The column containing these LSBs won't be used anymore, so we can delete it from the table. We repeat the same procedure until the next MNZSB, and then we go to the next line.
The coding algorithm is presented here in pseudo-code:

encode the largest index of a line containing a MNZSB (n°4 in the example)
WHILE there is still a line to encode
    is the line i encoded ? YES/NO (presence of MNZSBs)
    RETURN parameter mi of the line
    WHILE presence of MNZSB
        START coding RL
            Mi = 2^mi
            WHILE RL > Mi
                RETURN 0
                RL = RL − Mi
            END WHILE
            RETURN 1
            RETURN RL (encoded on mi bits)
        END coding RL
        RETURN LSBs
    END WHILE
END WHILE
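The inner "coding RL" loop of the pseudo-code can be sketched as follows; the ">=" boundary convention is my choice where the pseudo-code leaves it open:

```python
# Runs of at least M_i = 2^m_i emit an escape "0" and subtract M_i; the
# final remainder follows a "1" marker, on m_i fixed bits.
def encode_rl(rl, mi):
    big_m, bits = 1 << mi, ""
    while rl >= big_m:
        bits += "0"
        rl -= big_m
    return bits + "1" + format(rl, f"0{mi}b")

assert encode_rl(3, 2) == "111"      # 3 < 4: marker "1", then 3 on 2 bits
assert encode_rl(9, 2) == "00101"    # 9 = 4 + 4 + 1: two escapes, then 1
```

A larger mi thus trades shorter escapes for longer fixed parts, which is why the parameter is tuned per line.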
Chapter 3
Quantization
The goal of Chapter 2 was to give tools that enable us to perform the entropy coding of discrete numerical memoryless sources. The sequence of events X(n), assumed to be independent (memoryless source), is encoded by a variable length code.
To each event X(n) = xk, we allocate a code Ck of length Lk that has to be as close as possible to −log2 pk in order to get an efficiency as close as possible to 1.
The entropy coding of discrete numerical memoryless sources can be applied just as described if we want to perform a reversible (lossless) coding of such a source (see Paragraph 1.1). The rate associated to the coding of such a source is variable, and depends on the probability of the events to encode. Only the mean rate can be predicted:

R = Σ_{k=0..D−1} pk Lk ≥ −Σ_{k=0..D−1} pk log2 pk = H
In the case of waveforms, the situation is different: most often, the aim is to provide a coded representation
of the signal that permits a reconstruction of the source as accurate as possible, for a given rate R.
We will assume that the waveform sources have been sampled at a frequency that makes an accurate
reconstruction of the initial signal possible, and that they have been quantized so finely that we can
consider them as a discrete source of samples X(n) taking real values.
Besides, the samples X(n) can no longer be considered as a memoryless source. Indeed, if we look at
very close points in an image or in an audio signal, these points resemble each other. An image is characterized
by objects composed of similar points (the pixels, for "picture elements"). A sound is likewise composed of a
series of samples with similar features.
A waveform coder is composed of 3 elements:
• A decorrelator: transforms the sequence of samples X(n) into a sequence of decorrelated samples Y (n).
It is a decorrelative transform. Later on, we will assume that decorrelated samples are independent
(this assumption is true if we consider only first-order approximations);
• a quantizer: limits the information contained in Y (n). It is an irreversible operation that introduces
errors. However, quantization permits the use of a variable-length code. Indeed, the quantized values
Y^q(n) take their values in an alphabet of real values y_{−N/2}, ..., y_0, ..., y_{N/2−1}, and that alphabet has
a bounded (finite) size;
• an entropy coder: encodes the values Y^q(n). The entropy of Y^q(n) can be changed by the quantization.
So, the target mean rate will be obtained by a correct choice of the quantization parameters: the
relation between entropy and quantization is one of the subjects of this chapter. In order to keep the chosen
rate stable, the output of the entropy coder is most often regulated by a buffer memory, which is
filled with variable-length code-words and emptied at a constant rate. The filling level of the
buffer acts as a feedback on the quantization in order to avoid an overflow: if we let the entropy
coder send too much information, the channel will overflow. The buffer memory avoids this problem
by feeding back on the quantization step.
In the case of a 2-D image, the relation between X and Y is described by the transform coefficients k_{m,n}(i,j),
as follows:

y_{m,n} = \sum_i \sum_j k_{m,n}(i,j)\, x(i,j)
Let us notice that in practice, we never apply a transform to a whole image, because a point from the top of
the image is most often totally different from a point from the bottom. Instead, we divide the image into
8 × 8 blocks, in such a way that the correlation between the points remains important. We can thus choose a
quantization step q for each block.
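A minimal sketch of such a block transform in Python. Two assumptions for brevity: a separable, normalized 2 × 2 kernel K is used instead of the 8 × 8 kernels of practice, and each block is transformed as K x Kᵀ (the separable case of the general k_{m,n}(i,j) above).

```python
import numpy as np

K = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2)   # normalized 2x2 kernel
B = K.shape[0]                                          # block size

def block_transform(image):
    """Cut the image into B x B blocks and transform each block by
    y = K @ x @ K.T (separable case of y_mn = sum k_mn(i,j) x(i,j))."""
    h, w = image.shape
    out = np.empty((h, w))
    for r in range(0, h, B):
        for c in range(0, w, B):
            out[r:r+B, c:c+B] = K @ image[r:r+B, c:c+B] @ K.T
    return out

x = np.arange(16.0).reshape(4, 4)
y = block_transform(x)
print(np.allclose((y ** 2).sum(), (x ** 2).sum()))   # True: K is orthonormal
```

Because the kernel is orthonormal, the transform redistributes the energy of each block without changing it; the compression comes from the quantization that follows.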
Besides, the decorrelation corresponds to a frequency analysis of the signal: the goal is to
describe the signal by frequency values. The low frequencies have large values while high frequencies have
small values. Consequently, the employed transform concentrates the information in the coefficients of low
order. So, the quantization is rougher for the higher frequencies, because the human eye is less disturbed by
errors occurring on high frequencies than by errors on low frequencies.
The waveform decoder is composed of elements realizing the inverse operations:
• the Y (n) are approximated from the Y^q(n): a quantization error ε is introduced here;
• a signal X̃ is reconstructed from Y^q(n). This signal is different from X(n) because of the quantization
errors. We can describe these errors mathematically with the coefficients of the inverse transform
h_{m,n}(i,j) (let us consider the case of a 2-D image):
x'(i,j) = x(i,j) + \varepsilon(i,j) = \sum_m \sum_n h_{m,n}(i,j)\, y^q_{m,n}
        = \sum_m \sum_n h_{m,n}(i,j)\, (y_{m,n} + \varepsilon_{m,n})
        = \underbrace{\sum_m \sum_n h_{m,n}(i,j)\, y_{m,n}}_{x(i,j)\ \text{(inverse transform)}}
          + \underbrace{\sum_m \sum_n h_{m,n}(i,j)\, \varepsilon_{m,n}}_{\varepsilon(i,j)\ \text{due to the quantization}}
The decoding produces a signal X̃, more or less noisy according to the quantization level. Indeed, we
can observe in the equation that if we do not quantize anything, we recover x(i,j) without any error.
The difference between X̃ and X(n) is most often measured in terms of signal-to-noise ratio. For
waveforms, we also use other scales to measure the distortion: scales that are closer to the perception
of the observer (image or sound quality).
The effect of coding is to compress the signal to a certain rate R, reached by quantizing more or less
in order to limit the information. If we want fewer errors, we must quantize more finely, and the entropy
H, and with it the rate R, increases. If we quantize in a rougher way, we get more errors, but R decreases.
So, we must make a compromise.
So, the coding of waveforms produces rate-distortion curves. We will study them mathematically in
Chapter 4. (Figure: typical rate-distortion curves, the distortion decreasing as the rate R grows.)
With a linear quantizer of step q, the quantized value Y (n) is reconstructed with the approximation
Y^q(n) = kq. The quantization error Y (n) − kq = ε(n) is a random variable included in [−q/2, q/2[.
Probability Density When the samples X(n) are very finely quantized, we can assume that their
distribution of possible values is continuous.
We define the probability density p(x) by:

P(X(n) \leq X) = \int_{-\infty}^{X} p(x)\, dx
The probability density can be evaluated by observing a large number of values of the sequence X(n). Two usual models are:
• the uniform probability density of a variable taking its values in the interval [−D/2, D/2[:

p(x) = \frac{1}{D}
• the Laplace probability density, representing a variable that takes values around 0 with high proba-
bility (typically the observation of the "variation" of a waveform, image or sound):

p(x) = \frac{1}{\sqrt{2}\,\sigma_x} \exp\left(-\sqrt{2}\,|x| / \sigma_x\right)
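The density estimation mentioned above can be sketched in Python: draw many samples from a uniform source on [−D/2, D/2[ and check that a normalized histogram approaches p(x) = 1/D. The bin count and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1.0
x = rng.uniform(-D / 2, D / 2, size=100_000)    # observations of X(n)

# density=True normalizes the histogram so that it integrates to 1
hist, edges = np.histogram(x, bins=20, range=(-D / 2, D / 2), density=True)
print(hist.min(), hist.max())   # every bin close to p(x) = 1/D = 1
```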
Mean The mean is the expectation of the variable X(n). Since the process is stationary, the mean does
not depend on n:

\mu_x = E\{X(n)\} = \int_{-\infty}^{+\infty} x\, p(x)\, dx
      \simeq \frac{\text{sum of the observed values of } X(n)}{\text{total number of observations}}
Autocovariance We generally evaluate the correlation between the samples of a process with the auto-
covariance function, which measures the expectation of the product (sample − mean)(neighbouring sample −
mean):

\Gamma_x(m) = E\{(X(n) - \mu_x)(X(n+m) - \mu_x)\}

Because of the stationarity, \Gamma_x(m) depends only on the distance m between the two neighbours, not
on their position n. We can also see that \Gamma_x(0) = \sigma_x^2.
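This estimator can be sketched in Python (the function name is ours); on white noise it recovers Γ_x(0) = σ_x² and Γ_x(m) ≈ 0 for m ≠ 0:

```python
import numpy as np

def autocovariance(x, m):
    """Estimate Gamma_x(m): mean of (sample - mean)(sample at distance m - mean)."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    return np.mean((x[:len(x) - m] - mu) * (x[m:] - mu)) if m > 0 else x.var()

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)                  # white noise, sigma_x^2 = 1
print(autocovariance(x, 0), autocovariance(x, 1))   # close to 1 and close to 0
```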
Noise A noise is a sequence that does not contain any information, so there is no correlation (an image of
noise is a totally random alternation of points). A noise has an autocovariance function of the form

\Gamma_x(m) = \sigma_x^2 \quad \text{if } m = 0
            = 0 \qquad\;\; \text{if } m \neq 0
Markov Process of order 1 The waveforms (image and sound) are often modeled by a Markov process.
A Markov process of order 1 has an autocovariance function of the form

\Gamma_x(m) = \sigma_x^2\, \rho^{|m|}

The difference between a sample and the mean, and the difference between another sample, at a distance m
from the first one, and the mean, most often have the same sign. So, if m is small, \Gamma_x(m) takes significant
values. On the opposite, when m is very large, \Gamma_x(m) tends to 0.
The closer \rho is to 1, the more correlated the process is.
Remark on the expectation operator The expectation operator is a linear operator. If a and b are
constants, and X(n) and Y (n) are random variables, then

E\{aX(n) + bY(n)\} = a\,E\{X(n)\} + b\,E\{Y(n)\}

Besides, if \mu_x = 0:

E\{(aX(n) + bX(n+m))^2\} = a^2 E\{(X(n))^2\} + 2ab\, E\{X(n)\, X(n+m)\} + b^2 E\{(X(n+m))^2\}
                         = (a^2 + b^2)\,\sigma_x^2 + 2ab\,\Gamma_x(m)
The quantization error ε(n) is usually modeled as such a noise:

\Gamma_\varepsilon(m) = \sigma_\varepsilon^2 \quad \text{if } m = 0
                      = 0 \qquad\;\; \text{if } m \neq 0
Its mean is most often considered to be null:

\mu_\varepsilon = \int_{-\infty}^{+\infty} \varepsilon\, p_\varepsilon(\varepsilon)\, d\varepsilon
               = \sum_k \int_{(k-\frac{1}{2})q}^{(k+\frac{1}{2})q} (y - kq)\, p_y(y)\, dy

If the quantization step is small, we can assume that p_y(y) is constant in the interval
[(k - \frac{1}{2})q, (k + \frac{1}{2})q], and that p_y(y) \simeq p_y(kq) = p_y^k in that interval. So, we write:

\mu_\varepsilon = \sum_k p_y^k \left[ \frac{y^2}{2} - kqy \right]_{(k-\frac{1}{2})q}^{(k+\frac{1}{2})q}
               = \sum_k p_y^k \left( \frac{(k+\frac{1}{2})^2 - (k-\frac{1}{2})^2}{2}\, q^2
                 - kq \left( (k+\tfrac{1}{2}) - (k-\tfrac{1}{2}) \right) q \right)
               = \sum_k p_y^k \left( kq^2 - kq^2 \right)
               = 0
We can compute \sigma_\varepsilon^2 in the same way:

\sigma_\varepsilon^2 = \int_{-\infty}^{+\infty} \varepsilon^2\, p_\varepsilon(\varepsilon)\, d\varepsilon
                    = \sum_{k=-\infty}^{+\infty} \int_{(k-\frac{1}{2})q}^{(k+\frac{1}{2})q} (y - kq)^2\, p_y(y)\, dy

but

\int_{-\infty}^{+\infty} p_y(y)\, dy = 1 \simeq \sum_{k=-\infty}^{+\infty} \int_{(k-\frac{1}{2})q}^{(k+\frac{1}{2})q} p_y^k\, dy
                                     = \sum_{k=-\infty}^{+\infty} p_y^k \cdot q
\;\Rightarrow\; \sum_{k=-\infty}^{+\infty} p_y^k = \frac{1}{q}

and so, since each cell contributes \int_{-q/2}^{q/2} u^2\, du = q^3/12,

\sigma_\varepsilon^2 = \sum_{k=-\infty}^{+\infty} p_y^k\, \frac{q^3}{12} = \frac{q^2}{12}
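A quick numerical check of μ_ε ≈ 0 and σ_ε² ≈ q²/12. The source distribution and the step are arbitrary choices, with q small compared to σ_y so that the fine-quantization assumption holds.

```python
import numpy as np

rng = np.random.default_rng(2)
q = 0.25
y = rng.normal(scale=1.0, size=1_000_000)   # finely quantized source (q << sigma_y)
eps = y - q * np.round(y / q)               # quantization error, in [-q/2, q/2[

print(eps.mean())                # close to 0
print(eps.var(), q ** 2 / 12)    # both close to 0.0052
```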
This value is also a good approximation for the other distributions if the quantization step q is small, but it is
the exact value for a uniform distribution with a linear quantizer. Indeed, let us consider such a distribution
over an interval D. Then we have:

\sigma_\varepsilon^2 = \sum_{k=-\infty}^{+\infty} \int_{(k-\frac{1}{2})q}^{(k+\frac{1}{2})q} (y - kq)^2\, \frac{1}{D}\, dy

but, for a uniform distribution, the error is null at the middle of each interval
and maximal at the extremities; we have the same error variance over the N intervals:

\sigma_\varepsilon^2 = \frac{N}{D} \int_{-q/2}^{q/2} y^2\, dy = \frac{N}{D}\, \frac{q^3}{12}
                    = \frac{q^2}{12} \quad \text{because } N = \frac{D}{q} = \text{number of quantization intervals}
Once we have fixed the quantization step, we can compute the probabilities

p_k = P\left( (k - \tfrac{1}{2})q \leq Y(n) \leq (k + \tfrac{1}{2})q \right)
    = \int_{(k-\frac{1}{2})q}^{(k+\frac{1}{2})q} p_y(y)\, dy \simeq p_y^k\, q

Then we can compute the mean rate, at the output of the entropy coder, of the quantized values Y^q(n),
represented by the integers k:

R \simeq H = -\sum_{k=-\infty}^{+\infty} p_k \log_2 p_k
For these distributions, we can compute the exact value of \sigma_\varepsilon^2 and H for different values of q. Let us take
the case of the uniform distribution, and calculate the probabilities:

p_0 = \int_{-q/2}^{q/2} \frac{1}{D}\, dy = \frac{q}{D} = \frac{1}{N} = p_k \quad \forall k

so

H = -\sum_k p_k \log_2 p_k = -\sum_k \frac{1}{N} \log_2 \frac{1}{N} = \log_2 N
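A direct check of H = log2 N for the uniform case; the values of D and q are arbitrary.

```python
import math

D, q = 8.0, 0.5
N = int(D / q)                  # number of quantization intervals
p = [q / D] * N                 # p_k = 1/N for every cell

H = -sum(pk * math.log2(pk) for pk in p)
print(H, math.log2(N))   # 4.0 4.0
```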
So, we can establish the curve of the signal-to-noise ratio \sigma_y^2 / \sigma_\varepsilon^2 as a function of the rate H. Again, for a uniform
distribution quantized with a linear quantizer,

\sigma_y^2 = \int_{-D/2}^{D/2} x^2\, \frac{1}{D}\, dx = \frac{D^2}{12} = \frac{q^2 N^2}{12}
           \quad \text{with } N = \frac{D}{q}

so

\frac{\sigma_y^2}{\sigma_\varepsilon^2} = \frac{D^2/12}{q^2/12} = \frac{D^2}{q^2} = N^2

thus

H(Y^q(n)) = \log_2 N = \log_2 \sqrt{\frac{\sigma_y^2}{\sigma_\varepsilon^2}}
          = \frac{1}{2} \log_2 \frac{\sigma_y^2}{\sigma_\varepsilon^2}
There is no simple analytic expression for the Laplace distribution. However, the curve that relates
H(Y^q(n)) to \frac{1}{2} \log_2 (\sigma_y^2 / \sigma_\varepsilon^2) can be seen as a straight line at 45° and slightly offset, so we can write

H = \frac{1}{2} \log_2 \left( \phi\, \frac{\sigma_y^2}{\sigma_\varepsilon^2} \right) \quad \text{with } \phi = 1.25

\frac{\sigma_y^2}{\sigma_\varepsilon^2} = \frac{2^{2H}}{\phi}
\;\Leftrightarrow\;
\frac{\sigma_\varepsilon^2}{\sigma_y^2} = \phi\, 2^{-2H}
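The relation above says that each extra bit of rate divides σ_ε²/σ_y² by 4, i.e. gains about 6 dB of SNR. A small sketch:

```python
import math

def noise_to_signal(H, phi):
    """sigma_eps^2 / sigma_y^2 = phi * 2^(-2H)."""
    return phi * 2 ** (-2 * H)

for H in (1, 2, 3):
    snr_db = -10 * math.log10(noise_to_signal(H, phi=1.0))   # uniform source
    print(H, round(snr_db, 2))   # about 6.02 dB per bit: 6.02, 12.04, 18.06
```

With φ = 1.25 (Laplacian source) the whole curve is shifted down by 10 log10(1.25) ≈ 0.97 dB at every rate.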
Chapter 4
In order to compress the samples X(n) more efficiently, it is interesting to proceed to a decorrelative
transform, transforming them into Y (n). We will show that in this case

R = H(Y^q(n)) = \frac{1}{2} \log_2 \left( \frac{\sigma_x^2}{\sigma_\varepsilon^2} / G_T \right)
So we see that the sequence X(n) is decomposed into N sequences Y_i(k) of transformed coefficients, each
containing N times fewer samples than X(n). The sequence X(n) is reconstructed by vectors of N samples,
and each vector is a weighted sum of N predefined vectors [h_{0,i}, h_{1,i}, ..., h_{N−1,i}], weighted by the transformed
coefficients Y_i(k).
H = \frac{1}{N} \sum_{n=0}^{N-1} H(Y_n^q(k))
  = \frac{1}{N} \sum_{n=0}^{N-1} \frac{1}{2} \log_2 \left( \phi_n\, \frac{\sigma_{Y_n}^2}{\sigma_{\varepsilon_n}^2} \right)
  \quad [\text{bits/sample}]

with \phi_n = 1 if Y_n has a uniform distribution, and \phi_n = 1.25 if Y_n has a Laplace distribution,
and where \sigma_{\varepsilon_n}^2 is the variance of the quantization error introduced in the sequence Y_n(k).
We will later on assume that all the transformed coefficients are quantized with the same step q, so
that \sigma_{\varepsilon_n}^2 = q^2/12 for all n. So we have
H = \frac{1}{N} \sum_{n=0}^{N-1} \frac{1}{2} \log_2 \left( \phi_n\, \frac{\sigma_{Y_n}^2}{\sigma_{\varepsilon_n}^2} \right)
  = \frac{1}{2} \log_2 \left( \prod_{n=0}^{N-1} \phi_n\, \frac{\sigma_{Y_n}^2}{q^2/12} \right)^{1/N}
  = \frac{1}{2} \log_2 \frac{ \overbrace{\left( \prod_{n=0}^{N-1} \phi_n\, \sigma_{Y_n}^2 \right)^{1/N}}^{\text{geometric mean}} }{ q^2/12 }

with

\left( \prod_{n=0}^{N-1} \phi_n\, \sigma_{Y_n}^2 \right)^{1/N} = G_T^{-1}\, \phi_m\, \sigma_x^2

so that

H = \frac{1}{2} \log_2 \left( G_T^{-1}\, \phi_m\, \frac{\sigma_x^2}{q^2/12} \right)
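The step from the arithmetic mean of the per-coefficient logs to the log of their geometric mean can be checked numerically; the coefficient variances below are hypothetical.

```python
import math

var_y = [4.0, 1.0, 0.5, 0.25]      # hypothetical variances of the Y_n
phi = [1.25] * len(var_y)          # Laplacian coefficients assumed
eps2 = 0.01                        # common q^2 / 12 for all coefficients
N = len(var_y)

# mean of the per-coefficient rates ...
H_mean = sum(0.5 * math.log2(p * v / eps2) for p, v in zip(phi, var_y)) / N
# ... equals the rate computed from the geometric mean of the ratios
H_geo = 0.5 * math.log2(math.prod(p * v / eps2 for p, v in zip(phi, var_y)) ** (1 / N))

print(abs(H_mean - H_geo) < 1e-9)   # True
```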
where GT is the coding gain due to the transform.
To understand why we have this coding gain, we must examine the terms appearing in the expression
of H. We assume that the variables are centered, so

\sigma_{Y_n}^2 = E\{(Y_n(k))^2\} = \sum_{i=0}^{N-1} \sum_{i'=0}^{N-1} k_{n,i}\, k_{n,i'}\, \Gamma_x(i - i')

So we can write

H = \frac{1}{2} \log_2 \left( \prod_{n=0}^{N-1} \phi_n\, \frac{\sigma_{Y_n}^2}{q^2/12} \right)^{1/N}
  = \frac{1}{2} \log_2 \left[ \prod_{n=0}^{N-1} \phi_n \right]^{1/N}
    \left[ \prod_{n=0}^{N-1} \sum_{i=0}^{N-1} \sum_{i'=0}^{N-1} k_{n,i}\, k_{n,i'}\, \Gamma_x(i - i') \right]^{1/N}
    \frac{1}{q^2/12}
The reconstructed signal contains a noise because of the quantization of the coefficients Y_i(k):

\tilde{X}(kN + j) = \sum_{i=0}^{N-1} h_{j,i}\, Y_i^q(k)
                  = \sum_{i=0}^{N-1} h_{j,i}\, [Y_i(k) + \varepsilon_i(k)]

so the error on the sample of index j is \varepsilon_j(k) = \sum_{i=0}^{N-1} h_{j,i}\, \varepsilon_i(k); the
quantization errors \varepsilon_i(k) being uncorrelated, each with variance q^2/12, its variance is

\sigma_{\varepsilon_j}^2 = \sum_{i=0}^{N-1} h_{j,i}^2\, \frac{q^2}{12}

and as a consequence

\sigma_\varepsilon^2 = \frac{1}{N} \sum_{j=0}^{N-1} \sigma_{\varepsilon_j}^2
                    = \sum_{j=0}^{N-1} \sum_{i=0}^{N-1} \frac{h_{j,i}^2}{N}\, \frac{q^2}{12}

so

\frac{q^2}{12} = \sigma_\varepsilon^2 / \sum_{j=0}^{N-1} \sum_{i=0}^{N-1} \frac{h_{j,i}^2}{N}
By expressing \Gamma_x(m) in its Markovian form \Gamma_x(m) = \sigma_x^2\, r_x(m), we finally get:

H = \frac{1}{2} \log_2
    \underbrace{\left[ \prod_{n=0}^{N-1} \phi_n \right]^{1/N}}_{\rightarrow\, \phi_m}
    \underbrace{\left[ \prod_{n=0}^{N-1} \sum_{i=0}^{N-1} \sum_{i'=0}^{N-1} k_{n,i}\, k_{n,i'}\, r_x(i - i') \right]^{1/N}}_{\rightarrow\, G_\sigma^{-1}}
    \underbrace{\left[ \sum_{j=0}^{N-1} \sum_{i=0}^{N-1} \frac{h_{j,i}^2}{N} \right]}_{\rightarrow\, G_\varepsilon^{-1}}
    \frac{\sigma_x^2}{\sigma_\varepsilon^2}

H = \frac{1}{2} \log_2 \left( \phi_m\, \frac{\sigma_x^2}{\sigma_\varepsilon^2} / (G_\sigma G_\varepsilon) \right)
So, the coding gain is the product of two terms: G_\sigma is related to the transform kernel and is the
ratio of \sigma_x^2 to the geometric mean of the variances of the transformed coefficients, and G_\varepsilon is related to the
norm of the transform's basis functions.
We can show that if the transform is normalized (G_\varepsilon = 1), then the maximization of G_\sigma yields the
decorrelative transform, the one for which the transformed coefficients are uncorrelated:

E\{Y_m(k)\, Y_n(k)\} = 0 \quad \text{for } m \neq n
Example Let X(n) be a Markovian process of order 1. Let us consider the normalized transform of order
2 defined by

[k_{m,n}] = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}

and so

[h_{i,j}] = [k_{m,n}]^{-1} = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}
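A one-line check that [h_{i,j}] is indeed the inverse of [k_{m,n}]:

```python
import numpy as np

k = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2)   # analysis matrix [k_mn]
h = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2)   # synthesis matrix [h_ij]
print(np.allclose(h @ k, np.eye(2)))   # True: h = k^{-1} (k orthonormal, so h = k.T)
```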
Graphically, the transform can be visualized as follows (figure: scatter plot of the observed pairs).
The points represent pairs [X(2k), X(2k + 1)] that we can observe. The values X(n) are
uniformly distributed on the interval [0, D[; however, the pairs of points all lie around the line X(2k) =
X(2k + 1) because of the correlation of the process.
\sigma_{Y_0}^2 = E\left\{ \left( \frac{X(2k) + X(2k+1)}{\sqrt{2}} \right)^2 \right\}
             = \frac{1}{2} E\{(X(2k))^2\} + E\{X(2k)\, X(2k+1)\} + \frac{1}{2} E\{(X(2k+1))^2\}
             = \frac{1}{2}\sigma_x^2 + \Gamma_x(1) + \frac{1}{2}\sigma_x^2
             = \sigma_x^2 + \rho\sigma_x^2 = (1 + \rho)\,\sigma_x^2

\sigma_{Y_1}^2 = E\left\{ \left( \frac{-X(2k) + X(2k+1)}{\sqrt{2}} \right)^2 \right\}
             = (1 - \rho)\,\sigma_x^2
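A numerical check of these two variances on a simulated first-order Markov (AR(1)) process; ρ and the sample size are arbitrary choices, with the innovation variance chosen so that σ_x² = 1.

```python
import numpy as np

rng = np.random.default_rng(3)
rho, n = 0.9, 200_000
e = rng.normal(scale=np.sqrt(1 - rho ** 2), size=n)   # innovations keep sigma_x^2 = 1
x = np.empty(n)
x[0] = rng.normal()
for i in range(1, n):
    x[i] = rho * x[i - 1] + e[i]                      # Gamma_x(1) = rho * sigma_x^2

pairs = x.reshape(-1, 2)                              # pairs [X(2k), X(2k+1)]
y0 = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
y1 = (-pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
print(y0.var(), y1.var())   # close to (1 + rho) = 1.9 and (1 - rho) = 0.1
```

Most of the energy is concentrated in Y0, which is what the quantizer and entropy coder exploit.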
Predictive coding is an alternative to transform coding. The principle is to encode (quantization
and entropy coding) the difference between the sample X(n) to transmit and a predicted value X̂(n)
computed from the samples ..., X̃(n − 3), X̃(n − 2), X̃(n − 1) that are available at the decoder.
The goal of the predictor in the above figure is to keep in memory a certain number of samples preceding
X(n), as they were received at the decoder (this is why there is a loop at the coder).
Because of the loop on the quantization error, the coding gain is more complicated to compute.
In general, we compute the coefficients of the predictor by minimizing the variance of the sequence Y (n).