
Coding and Information Theory

Slides 5 & 6: Data Compression (2)

CS5058701


Kraft Inequality for Uniquely Decodable Codes
Theorem 5.5.1. The codeword lengths l_1, l_2, . . . , l_m of any uniquely decodable q-ary code must satisfy the Kraft inequality
$$\sum_{i=1}^{m} q^{-l_i} \le 1.$$

Conversely, given a set of codeword lengths that satisfy this inequality, it is possible to construct a uniquely decodable code with these codeword lengths. Uniquely decodable codes do not offer any further choices for the codeword lengths compared with prefix codes.
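As a quick illustration (not from the slides), here is a minimal Python sketch that checks the Kraft inequality for a given set of candidate codeword lengths; the function name and the example lengths are my own.

```python
def satisfies_kraft(lengths, q=2):
    """Check whether the given codeword lengths satisfy the q-ary Kraft inequality."""
    return sum(q ** (-l) for l in lengths) <= 1

# Lengths of the binary prefix code {0, 10, 110, 111}: sum of 2^(-l) equals 1.
print(satisfies_kraft([1, 2, 3, 3]))   # True
# No uniquely decodable binary code can have these lengths: sum is 1.125 > 1.
print(satisfies_kraft([1, 2, 2, 3]))   # False
```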

Huffman Codes

An optimal (shortest expected length) prefix code for a given distribution can be constructed by a simple algorithm discovered by Huffman. These codes are called Huffman codes. It turns out that any other code for the same alphabet cannot have a shorter expected length than the code constructed by the algorithm. Huffman codes are introduced with a few examples.


Example: Huffman Codes (1)

Example 1. Consider a random variable X taking values in the set X = {1, 2, 3, 4, 5} with probabilities 0.25, 0.25, 0.2, 0.15, 0.15, respectively. The optimal code for X is expected to have the longest codewords assigned to the symbols 4 and 5. Moreover, these lengths must be equal, since otherwise we can delete a bit from the longer codeword and still have a prefix code with shorter length. In general, we can construct a code in which the two longest codewords differ only in the last bit.

Example: Huffman Codes (2)


Example 1. (cont.) For this code we can combine the symbols 4 and 5 into a single source symbol, with a probability assignment 0.3. We proceed in this way, combining the two least likely symbols into one symbol in each step, until we are left with only one symbol. This procedure and the codewords obtained are shown in the first table on [Cov, p. 93]. The code has expected length 2.3 bits.
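The merging procedure just described translates directly into code. The following is a minimal sketch of my own (using Python's heapq, not the textbook's tables); for the example distribution it yields codeword lengths with expected length 2.3 bits.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Binary Huffman code: repeatedly merge the two least likely symbols."""
    tiebreak = count()  # breaks ties so that dicts are never compared
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)
        p2, _, code2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in code1.items()}
        merged.update({s: "1" + w for s, w in code2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

probs = {1: 0.25, 2: 0.25, 3: 0.2, 4: 0.15, 5: 0.15}
code = huffman_code(probs)
print(code)
print(sum(probs[s] * len(w) for s, w in code.items()))  # 2.3 bits
```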



Example: Huffman Codes (3)


Example 2. Consider a ternary code for the same random variable as in the previous example (X = {1, 2, 3, 4, 5} with probabilities 0.25, 0.25, 0.2, 0.15, 0.15, respectively). The codewords obtained are shown in the second table on [Cov, p. 93]. The code has an expected length of 1.5 ternary digits.



Example: Huffman Codes (4)

Example 3. If q ≥ 3, we may not have a sufficient number of symbols so that we can combine them q at a time. In such a case, we add dummy symbols to the end of the set of symbols. These dummy symbols have probability 0 and are inserted to fill the tree. Note: Since the number of symbols is reduced by q − 1 in each step, we should add dummy symbols to get the total number to be of the form 1 + k(q − 1) for some integer k. The use of dummy symbols is illustrated in the table on [Cov, p. 94]. The code has an expected length of 1.7 ternary digits.
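A small helper (my own naming, just an illustration of the count described above) for how many dummy symbols are needed:

```python
import math

def num_dummy_symbols(m, q):
    """Dummy symbols needed so that m symbols fit a q-ary Huffman tree (total = 1 + k(q-1))."""
    if m <= 1:
        return 0
    k = math.ceil((m - 1) / (q - 1))   # smallest k with 1 + k(q - 1) >= m
    return 1 + k * (q - 1) - m

print(num_dummy_symbols(5, 3))  # 0: five symbols already have the form 1 + 2k
print(num_dummy_symbols(6, 3))  # 1: pad to 7 = 1 + 3*(3 - 1)
```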


Huffman Codes vs. Shannon Codes (1)

If Shannon codes (which are suboptimal) are used, the codeword length for some particular symbol may be much worse than with Huffman codes. Example. Consider X = {1, 2} with probabilities 0.9999 and 0.0001, respectively. Shannon coding then gives codewords of length 1 and 14 bits, respectively, whereas an optimal code has two words of length 1.
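A quick check of these lengths, assuming the usual Shannon length l(x) = ⌈log₂(1/p(x))⌉ (the code below is a sketch of my own):

```python
import math

def shannon_lengths(probs):
    """Shannon codeword lengths l(x) = ceil(log2(1/p(x)))."""
    return [math.ceil(math.log2(1 / p)) for p in probs]

print(shannon_lengths([0.9999, 0.0001]))  # [1, 14]
```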


Huffman Codes vs. Shannon Codes (2)

Occasionally, it is also possible that the Huffman codeword for a particular symbol is longer than the corresponding codeword of a Shannon code.
Example. For a random variable with distribution (1/3, 1/3, 1/4, 1/12), the Huffman coding procedure results in codeword lengths (2, 2, 2, 2) or (1, 2, 3, 3) (there are sometimes several optimal codes!), whereas the Shannon coding procedure leads to lengths (2, 2, 2, 4).

Note: The Huffman code is shorter on the average.
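A quick check, worked out here (not in the slides), using the Huffman lengths (1, 2, 3, 3) and the Shannon lengths (2, 2, 2, 4) stated above:
$$\bar{L}_{\text{Huffman}} = \tfrac{1}{3}\cdot 1 + \tfrac{1}{3}\cdot 2 + \tfrac{1}{4}\cdot 3 + \tfrac{1}{12}\cdot 3 = 2 \text{ bits}, \qquad \bar{L}_{\text{Shannon}} = \tfrac{1}{3}\cdot 2 + \tfrac{1}{3}\cdot 2 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{12}\cdot 4 \approx 2.17 \text{ bits}.$$
(The other optimal length assignment (2, 2, 2, 2) also gives an average of 2 bits.)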



Fano Codes
Fano proposed a suboptimal procedure for constructing a source code. In his method we first order the probabilities in decreasing order. Then we choose k such that
$$\left| \sum_{i=1}^{k} p_i - \sum_{i=k+1}^{m} p_i \right|$$
is minimized. This point divides the source symbols into two sets of almost equal probability. Assign 0 for the first bit of the upper set and 1 for the lower set. Repeat this process for each subset. This scheme, although not optimal, achieves L(C) ≤ H(X) + 2.
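A minimal recursive sketch of Fano's procedure (my own function names; it assumes the probabilities are passed already sorted in decreasing order):

```python
def fano_code(probs, prefix="", code=None):
    """Fano coding: split the (sorted) symbols where the two halves are most nearly equal."""
    if code is None:
        code = {}
    symbols = list(probs)
    if len(symbols) == 1:
        code[symbols[0]] = prefix or "0"
        return code
    # choose k minimizing |p_1 + ... + p_k - (p_{k+1} + ... + p_m)|
    total = sum(probs.values())
    best_k, best_diff, running = 1, float("inf"), 0.0
    for k in range(1, len(symbols)):
        running += probs[symbols[k - 1]]
        diff = abs(running - (total - running))
        if diff < best_diff:
            best_k, best_diff = k, diff
    fano_code({s: probs[s] for s in symbols[:best_k]}, prefix + "0", code)
    fano_code({s: probs[s] for s in symbols[best_k:]}, prefix + "1", code)
    return code

# Example distribution from the Huffman slides, sorted in decreasing order:
print(fano_code({1: 0.25, 2: 0.25, 3: 0.2, 4: 0.15, 5: 0.15}))
```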


Shannon-Fano-Elias Codes (1)

Shannon-Fano-Elias coding is a simple constructive procedure to allot codewords. Let X = {1, 2, . . . , m} and assume that p(x) > 0 for all x. We define
$$F(x) = \sum_{a \le x} p(a), \qquad \bar{F}(x) = \sum_{a < x} p(a) + \frac{p(x)}{2},$$
where F(x) is known as the cumulative distribution function.


Shannon-Fano-Elias Codes (2)

The codeword for x consists of the first l(x) binary decimals of F̄(x), where l(x) = ⌈log(1/p(x))⌉ + 1. The expected length of this code is less than H(X) + 2. Two examples of the construction of such codes are shown in the tables on [Cov, p. 103].
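A compact sketch of this construction (my own code; the example distribution is just an illustration, with the symbols taken in the fixed order 1, . . . , m):

```python
import math

def sfe_code(probs):
    """Shannon-Fano-Elias: codeword = first l(x) binary digits of Fbar(x)."""
    code, cumulative = {}, 0.0
    for x, p in probs.items():               # probs given in the fixed order 1, ..., m
        fbar = cumulative + p / 2            # Fbar(x) = sum_{a<x} p(a) + p(x)/2
        length = math.ceil(math.log2(1 / p)) + 1
        bits, frac = "", fbar
        for _ in range(length):              # binary expansion of Fbar(x), truncated to l(x) digits
            frac *= 2
            bits += str(int(frac))
            frac -= int(frac)
        code[x] = bits
        cumulative += p
    return code

print(sfe_code({1: 0.25, 2: 0.5, 3: 0.125, 4: 0.125}))
# {1: '001', 2: '10', 3: '1101', 4: '1111'}
```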



Choice of Compression Method

For small source alphabets, we must use long blocks of source symbols to get efficient coding. (For example, with a binary alphabet, if each symbol is coded separately, we must always use 1 bit per symbol and no compression is achieved.) Huffman codes are optimal, but require the calculation of the probabilities of all source symbols and the construction of the corresponding complete code tree. A (good) suboptimal code with computationally efficient algorithms for encoding and decoding is often desired. Arithmetic coding fulfills these criteria.

Universal Codes

If we do not know the behavior of the source in advance or the behavior changes, a more sophisticated adaptive arithmetic coding algorithm can be used. Such a code is an example of a universal code. Universal codes are designed to work with an arbitrary source distribution. A particularly interesting universal code is the Lempel-Ziv code, which will be considered at a later stage of this course.

Data Compression and Coin Flips

When a random source is compressed into a sequence of bits so that the average length is minimized, the encoded sequence is essentially incompressible, and therefore has an entropy rate close to 1 bit per symbol. The bits of the encoded sequence are essentially fair coin flips. Let us now go in the opposite direction: How many fair coin flips does it take to generate a random variable X drawn according to some specified probability mass function p?

Example: Generating a Random Variable

Suppose we wish to generate a random variable X taking values in {a, b, c} with distribution (1/2, 1/4, 1/4). The answer is obvious. If the first bit (coin toss) is 0, let X = a. If the first two bits are 10, let X = b. If the first two bits are 11, let X = c. The average number of fair bits required for generating this random variable is (1/2) · 1 + (1/4) · 2 + (1/4) · 2 = 1.5 bits. This is also the entropy of the distribution.
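This rule translates directly into code (a sketch of my own; fair_bit stands in for any source of fair coin flips):

```python
import random

def fair_bit():
    """One fair coin flip: 0 or 1, each with probability 1/2."""
    return random.randint(0, 1)

def generate_symbol():
    """Generate X with distribution (1/2, 1/4, 1/4) on {a, b, c} from fair coin flips."""
    if fair_bit() == 0:
        return "a"                              # bit sequence 0  -> a
    return "b" if fair_bit() == 0 else "c"      # 10 -> b, 11 -> c

samples = [generate_symbol() for _ in range(10000)]
print({s: samples.count(s) / len(samples) for s in "abc"})  # roughly 0.5, 0.25, 0.25
```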


Algorithm for Generating Random Variables (1)

We map (possibly infinite!) strings of bits Z1, Z2, . . . to possible outcomes X by a binary tree, where the leaves are marked by output symbols X and the path to a leaf is given by the sequence of bits produced by the fair coin. For example, the tree for the distribution in the previous example, (1/2, 1/4, 1/4), is shown in [Cov, Fig. 5.8].
Theorem 5.12.1. For any algorithm generating X, the expected number of fair bits used is at least the entropy, that is, E[T] ≥ H(X).

Algorithm for Generating Random Variables (2)

If the distribution is not dyadic (that is, 2-adic), we first write the probabilities as the (possibly infinite) sum of dyadic probabilities, called atoms. (In fact, this means finding the binary expansions of the probabilities.) In constructing the tree, the same approach as in proving the Kraft inequality can be used. An atom of the form 2^{-j} is associated to a leaf at depth j. All the leaves corresponding to atoms of the probability of an output symbol are marked with that symbol.


Example: Generating a Random Variable (cont.)

Let X = {a, b} with the distribution (2/3, 1/3). The binary expansions of the two probabilities are 0.101010 . . . and 0.010101 . . ., respectively. Hence the atoms are
$$\frac{2}{3} = \frac{1}{2} + \frac{1}{8} + \frac{1}{32} + \cdots, \qquad \frac{1}{3} = \frac{1}{4} + \frac{1}{16} + \frac{1}{64} + \cdots.$$
The corresponding binary tree is shown in [Cov, Fig. 5.9].
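A small sketch (my own) that reads off the first few dyadic atoms of a probability from its binary expansion:

```python
def dyadic_atoms(p, max_terms=4):
    """First dyadic atoms 2^(-j) appearing in the binary expansion of p, 0 < p < 1."""
    atoms, frac = [], p
    for j in range(1, 50):
        frac *= 2
        if frac >= 1:                 # the j-th binary digit of p is 1
            atoms.append(f"1/{2 ** j}")
            frac -= 1
        if len(atoms) == max_terms or frac == 0:
            break
    return atoms

print(dyadic_atoms(2 / 3))  # ['1/2', '1/8', '1/32', '1/128']
print(dyadic_atoms(1 / 3))  # ['1/4', '1/16', '1/64', '1/256']
```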



Bounding the Expected Depth of the Tree


Theorem 5.11.3. The expected number of fair bits E[T] required by the optimal algorithm to generate a random variable X lies between H(X) and H(X) + 2:
$$H(X) \le E[T] < H(X) + 2.$$


Background for Universal Source Coding

Huffman coding compresses an i.i.d. source with a known distribution p(x) to its entropy limit H(X). However, if the code is designed for another distribution q(x), a penalty of D(p‖q) is incurred. Huffman coding is sensitive to the assumed distribution. What can be achieved if the true distribution p(x) is unknown? Is there a universal code with rate R that suffices to describe every i.i.d. source with entropy H(X) < R? Yes!


Fixed Rate Block Codes

A fixed rate block code of rate R for a source X1, X2, . . . , Xn which has an unknown distribution Q consists of two mappings: the encoder
$$f_n : \mathcal{X}^n \to \{1, 2, \ldots, 2^{nR}\},$$
and the decoder
$$\phi_n : \{1, 2, \ldots, 2^{nR}\} \to \mathcal{X}^n.$$


Universal Source Codes

The probability of error for the code with respect to the distribution Q is
$$P_e^{(n)} = Q^n\bigl( (X_1, \ldots, X_n) : \phi_n(f_n(X_1, \ldots, X_n)) \ne (X_1, \ldots, X_n) \bigr).$$
A rate R block code for a source is called universal if the functions f_n and φ_n do not depend on the distribution Q and if P_e^{(n)} → 0 as n → ∞ when H(Q) < R.
Theorem 12.3.1. There exists a sequence of (n, 2^{nR}) universal source codes such that P_e^{(n)} → 0 as n → ∞ for every source Q such that H(Q) < R.

Universal Coding Schemes

One universal coding scheme is given in the proof of [Cov, Theorem 12.3.1]. That scheme, which is due to Csiszár and Körner, is universal over the set of i.i.d. distributions. We shall look in detail at another algorithm, the Lempel-Ziv algorithm, which is a variable rate universal code.
Q: If universal codes also reach the limit given by the entropy, why do we need Huffman and similar codes (which are specific to a probability distribution)?
A: Universal codes need longer block lengths for the same performance, and their encoders and decoders are more complex.

Lempel-Ziv Coding (1)

Most universal compression algorithms used in the real world are based on algorithms developed by Lempel and Ziv, and we therefore talk about Lempel-Ziv (LZ) coding. LZ algorithms are good at compressing data that cannot be modeled simply, such as English text and computer source code (note that LZ also compresses sources other than i.i.d. ones). Computer compression programs, such as compress, gzip, and WinZip, and the GIF format are based on LZ coding.

Lempel-Ziv Coding (2)

The following are the main variants of Lempel-Ziv coding.
LZ77: Also called sliding window Lempel-Ziv.
LZ78: Also called dictionary Lempel-Ziv. Described in the textbook, in [Cov, Sect. 12.10].
LZW: Another variant, not described here.
With these algorithms, text with any alphabet size can be compressed. Common sizes are, for example, 2 (binary sequences) and 256 (computer files consisting of a sequence of bytes).

LZ78

In the description of the algorithm, we act on the string 1011010100010. The algorithm is as follows:
1. Parse the source into strings that have not appeared so far: 1, 0, 11, 01, 010, 00, 10.
2. Code a substring as (i, c), where i is the index of the substring it extends (indices start from 1, and 0 denotes the empty string) and c is the value of the additional character: (0,1), (0,0), (1,1), (2,1), (4,0), (2,0), (1,0).
To express the location (in the example, an integer between 0 and 4 = c(n)) we need ⌈log(c(n) + 1)⌉ bits.
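A compact LZ78 parser (a sketch of my own) that reproduces the phrases and pairs listed above:

```python
def lz78_encode(s):
    """LZ78: parse s into new phrases; code each as (index of longest known prefix, new char)."""
    dictionary = {"": 0}              # phrase -> index; index 0 is the empty phrase
    output, phrase = [], ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch              # keep extending until the phrase is new
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                        # a leftover (already seen) phrase at the end of the string
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

print(lz78_encode("1011010100010"))
# [(0, '1'), (0, '0'), (1, '1'), (2, '1'), (4, '0'), (2, '0'), (1, '0')]
```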

LZ77

We again use the example string 1011010100010. In each step, we proceed as follows:
1. Find p, the relative position of the longest match (the length of which is denoted by l).
2. Output (p, l, c), where c is the first character that does not match.
3. Advance l + 1 positions.
For the example string, the successive outputs are (0, 0, 1), (0, 0, 0), (2, 1, 1), (3, 2, 0), (2, 2, 0), . . .
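A simple unbounded-window LZ77 sketch of my own that reproduces these triples; practical implementations restrict the window, as noted on the next slide.

```python
def lz77_encode(s):
    """Greedy LZ77: emit (relative position of longest match, match length, next character)."""
    output, i = [], 0
    while i < len(s):
        best_p, best_l = 0, 0
        for p in range(1, i + 1):                 # every starting offset in the window
            l = 0
            while i + l < len(s) - 1 and s[i - p + l] == s[i + l]:
                l += 1                            # matches may overlap the current position
            if l > best_l:
                best_p, best_l = p, l
        output.append((best_p, best_l, s[i + best_l]))
        i += best_l + 1
    return output

print(lz77_encode("1011010100010"))
# [(0,0,'1'), (0,0,'0'), (2,1,'1'), (3,2,'0'), (2,2,'0'), ...]
# (the final triple depends on how the end of the string is handled)
```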

Some Details of the Algorithms

Several details have to be taken into account; we give one such example for each of the algorithms:
LZ77: To avoid large values for the relative positions of strings, do not go too far back (sliding window!).
LZ78: To avoid too large dictionaries, reduce the size in one of several possible ways; for example, throw the dictionary away when it reaches a certain size (GIF does this).
Traditionally, LZ77 was better but slower, but the gzip version is almost as fast as any LZ78.

Run-Length Coding

A compression method that does not reach the entropy bound but is used in many applications, including fax machines, is run-length coding. In this method, the input sequence is compressed by identifying adjacent symbols of equal value and replacing them with a single symbol and a count. Example. 111111110100000 is encoded as (1, 8), (0, 1), (1, 1), (0, 5). Clearly, in the binary case, it suffices to give the first bit and then only the lengths of the runs.
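A minimal run-length encoder (sketch of my own):

```python
from itertools import groupby

def run_length_encode(s):
    """Replace each run of equal symbols with the pair (symbol, run length)."""
    return [(sym, len(list(run))) for sym, run in groupby(s)]

print(run_length_encode("111111110100000"))
# [('1', 8), ('0', 1), ('1', 1), ('0', 5)]
```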

Optimality of Lempel-Ziv Coding

Theorem 12.10.2. Let {Xi} be a stationary ergodic stochastic process. Let l(X1, X2, . . . , Xn) be the length of the Lempel-Ziv codeword associated with X1, X2, . . . , Xn. Then
$$\limsup_{n \to \infty} \frac{1}{n}\, l(X_1, X_2, \ldots, X_n) \le H(\mathcal{X}) \quad \text{with probability } 1,$$
where $H(\mathcal{X})$ is the entropy rate of the process. From the examples given, it is obvious that these compression methods are efficient only for long input sequences.

Lossy Compression

So far, only lossless compression has been considered. In lossy compression, loss of information is allowed:
Scalar quantization: Take the set of possible messages S and reduce it to a smaller set S′ with a mapping f : S → S′. For example, least significant bits are dropped.
Vector quantization: Map a multidimensional space S into a smaller set S′ of messages.
Transform coding: Transform the input into a different form that can be more easily compressed (in a lossy or lossless way).
Lossy methods include JPEG (still images) and MPEG (video).


Rate Distortion Theory

The effects of quantization are studied in rate distortion theory, the basic problem of which can be stated as follows:
Q: Given a source distribution and a distortion measure, what is the minimum expected distortion achievable at a particular rate?
Q: (Equivalent) What is the minimum rate description required to achieve a particular distortion?
It turns out that, perhaps surprisingly, it is more efficient to describe two (even independent!) variables jointly than individually. Rate distortion theory can be applied to both discrete and continuous random variables.

Example: Quantization (1)
We look at the basic problem of representing a single continuous random variable by a finite number of bits. Denote the random variable by X and the representation of X by X̂(X). With R bits, the function X̂ can take on 2^R values. What is the optimum set of values for X̂, and what are the regions associated with each value of X̂?


Example: Quantization (2)

With X ∼ N(0, σ²) and a squared error distortion measure, we wish to find a function X̂ that takes on at most 2^R values (these are called reproduction points) and minimizes E(X − X̂(X))². With one bit, obviously the bit should distinguish whether X < 0 or not. To minimize squared error, each reproduced symbol should be at the conditional mean of its region (see [Cov, Fig. 13.1]), and we have
$$\hat{X}(x) = \begin{cases} +\sqrt{2/\pi}\,\sigma, & \text{if } x \ge 0,\\ -\sqrt{2/\pi}\,\sigma, & \text{if } x < 0. \end{cases}$$


Example: Quantization (3)

With two or more bits to represent the sample, the situation gets far more complicated. The following facts state simple properties of optimal regions and reconstruction points. Given a set of reconstruction points, the distortion is minimized by mapping a source random variable X to the representation X̂ that is closest to it. The set of regions defined by this mapping is called a Voronoi or Dirichlet partition defined by the reconstruction points. The reconstruction points should minimize the conditional expected distortion over their respective assignment regions.

Example: Quantization (4)

The aforementioned properties enable algorithms that find good quantizers: the Lloyd algorithm (for real-valued random variables) and the generalized Lloyd algorithm (for vector-valued random variables). Starting from any set of reconstruction points, repeat the following steps.
1. Find the optimal set of reconstruction regions.
2. Find the optimal reconstruction points for these regions.
The expected distortion is decreased at each stage of this algorithm, so it converges to a local minimum of the distortion.
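A sketch (my own) of the Lloyd iteration for a scalar source, estimated from samples; here the source is the N(0, 1) one-bit example from the earlier slides, so the points should approach ±√(2/π) ≈ ±0.80.

```python
import random

def lloyd(samples, points, iterations=50):
    """Alternate between optimal regions (nearest point) and optimal points (region means)."""
    points = list(points)
    for _ in range(iterations):
        regions = [[] for _ in points]
        for x in samples:                        # step 1: assign each sample to the nearest point
            nearest = min(range(len(points)), key=lambda i: (x - points[i]) ** 2)
            regions[nearest].append(x)
        points = [sum(r) / len(r) if r else p    # step 2: move each point to its region's mean
                  for r, p in zip(regions, points)]
    return points

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(20000)]
print(lloyd(samples, [-1.0, 1.0]))               # roughly [-0.80, 0.80]
```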