Kraft Inequality for Uniquely Decodable Codes

Theorem 5.5.1. The codeword lengths l_1, l_2, . . . , l_m of any uniquely decodable q-ary code must satisfy the Kraft inequality

\sum_{i=1}^{m} q^{-l_i} \le 1.
Conversely, given a set of codeword lengths that satisfy this inequality, it is possible to construct a uniquely decodable code with these codeword lengths. Uniquely decodable codes do not offer any further choices for the codeword lengths compared with prefix codes.
Huffman Codes
An optimal (shortest expected length) prefix code for a given distribution can be constructed by a simple algorithm discovered by Huffman. These codes are called Huffman codes. It turns out that any other code for the same alphabet cannot have a shorter expected length than the code constructed by the algorithm. Huffman codes are introduced with a few examples.
Example 1. Consider a random variable X taking values in the set X = {1, 2, 3, 4, 5} with probabilities 0.25, 0.25, 0.2, 0.15, 0.15, respectively. The optimal code for X is expected to have the longest codewords assigned to the symbols 4 and 5. Moreover, these lengths must be equal, since otherwise we can delete a bit from the longer codeword and still have a prefix code with shorter expected length. In general, we can construct a code in which the two longest codewords differ only in the last bit.
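As an illustration, the merging procedure for this distribution can be sketched in Python. This is our own sketch, not code from the course; ties may be broken differently than in the lecture and yield a different but equally optimal code. Symbols must not themselves be tuples, since tuples mark internal tree nodes here.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Binary Huffman code for {symbol: probability}; returns {symbol: codeword}."""
    tie = count()  # unique tie-breaker so heapq never compares subtrees
    heap = [(p, next(tie), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)   # merge the two least likely nodes
        p1, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p0 + p1, next(tie), (left, right)))
    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            code[node] = prefix
    walk(heap[0][2], "")
    return code

probs = {1: 0.25, 2: 0.25, 3: 0.2, 4: 0.15, 5: 0.15}
code = huffman_code(probs)
avg = sum(probs[s] * len(code[s]) for s in probs)
```

For this distribution the optimal codeword lengths are (2, 2, 2, 3, 3), and the expected length is 2.3 bits, just above the entropy H(X) ≈ 2.285 bits.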
© Patric Östergård
Example 3. If q ≥ 3, we may not have a sufficient number of symbols so that we can combine them q at a time. In such a case, we add dummy symbols to the end of the set of symbols. These dummy symbols have probability 0 and are inserted to fill the tree. Note: Since the number of symbols is reduced by q − 1 in each step, we should add dummy symbols to get the total number to be of the form 1 + k(q − 1) for some integer k. The use of dummy symbols is illustrated in the table on [Cov, p. 94]. The code has an expected length of 1.7 ternary digits.
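The required number of dummy symbols follows directly from the condition above; a small sketch (the helper name is ours, not from the source):

```python
def num_dummy_symbols(m, q):
    """Dummy symbols needed so that m + d = 1 + k(q - 1) for some integer k."""
    r = (m - 1) % (q - 1)
    return 0 if r == 0 else (q - 1) - r
```

For example, six source symbols and a ternary code (q = 3) require one dummy symbol, giving 7 = 1 + 3 · 2; for a binary code the formula always returns 0.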
If Shannon codes (which are suboptimal) are used, the codeword length for some particular symbol may be much worse than with Huffman codes. Example. Consider X = {1, 2} with probabilities 0.9999 and 0.0001, respectively. Shannon coding then gives codewords of length 1 and 14 bits, respectively, whereas an optimal code has two words of length 1.
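The Shannon codeword lengths ⌈log(1/p(x))⌉ are easy to verify numerically; a quick sketch (the function name is ours):

```python
import math

def shannon_lengths(probs):
    """Shannon codeword lengths l(x) = ceil(log2(1/p(x)))."""
    return [math.ceil(-math.log2(p)) for p in probs]
```

Here shannon_lengths([0.9999, 0.0001]) gives [1, 14], whereas the optimal code uses two words of length 1. The same function also checks the distribution (1/3, 1/3, 1/4, 1/12) discussed below, giving lengths (2, 2, 2, 4).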
Huffman Codes vs. Shannon Codes (2)
Occasionally, it is also possible that the Huffman codeword for a particular symbol is longer than the corresponding codeword of a Shannon code.
Example. For a random variable with distribution (1/3, 1/3, 1/4, 1/12), the Huffman coding procedure results in codeword lengths (2, 2, 2, 2) or (1, 2, 3, 3) (there are sometimes several optimal codes!), whereas the Shannon coding procedure leads to lengths (2, 2, 2, 4).
Fano Codes
Fano proposed a suboptimal procedure for constructing a source code. In his method we first order the probabilities in decreasing order. Then we choose k such that

| \sum_{i=1}^{k} p_i - \sum_{i=k+1}^{m} p_i |

is minimized. This point divides the source symbols into two sets of almost equal probability. Assign 0 for the first bit of the upper set and 1 for the lower set. Repeat this process for each subset. This scheme, although not optimal, achieves L(C) ≤ H(X) + 2.
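A minimal sketch of Fano's recursive splitting (our own implementation; the input probabilities are assumed already sorted in decreasing order):

```python
def fano_code(probs):
    """Fano code for probabilities sorted in decreasing order.
    Returns one codeword string per symbol."""
    codes = [""] * len(probs)

    def split(lo, hi):  # half-open range of symbol indices
        if hi - lo <= 1:
            return
        total = sum(probs[lo:hi])
        # choose the split point k minimizing |sum(upper) - sum(lower)|
        best_k, best_diff, acc = lo + 1, float("inf"), 0.0
        for k in range(lo + 1, hi):
            acc += probs[k - 1]
            diff = abs(acc - (total - acc))
            if diff < best_diff:
                best_k, best_diff = k, diff
        for i in range(lo, best_k):   # upper set gets 0
            codes[i] += "0"
        for i in range(best_k, hi):   # lower set gets 1
            codes[i] += "1"
        split(lo, best_k)
        split(best_k, hi)

    split(0, len(probs))
    return codes
```

For the distribution (0.25, 0.25, 0.2, 0.15, 0.15) of Example 1, this returns the codewords 00, 01, 10, 110, 111.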
The codeword for x consists of the l(x) first binary decimals of F̄(x), where l(x) = ⌈log(1/p(x))⌉ + 1. The expected length of this code is less than H(X) + 2. Two examples of the construction of such codes are shown in the tables on [Cov, p. 103].
Shannon-Fano-Elias coding is a simple constructive procedure to allot codewords. Let X = {1, 2, . . . , m} and assume that p(x) > 0 for all x. We define

F(x) = \sum_{a \le x} p(a),

F̄(x) = \sum_{a < x} p(a) + p(x)/2.
For small source alphabets, we must use long blocks of source symbols to get efficient coding. (For example, with a binary alphabet, if each symbol is coded separately, we must always use 1 bit per symbol and no compression is achieved.) Huffman codes are optimal, but require the calculation of the probabilities of all source symbols and the construction of the corresponding complete code tree. A (good) suboptimal code with computationally efficient algorithms for encoding and decoding is often desired. Arithmetic coding fulfills these criteria.
Universal Codes
If we do not know the behavior of the source in advance or the behavior changes, a more sophisticated adaptive arithmetic coding algorithm can be used. Such a code is an example of a universal code. Universal codes are designed to work with an arbitrary source distribution. A particularly interesting universal code is the Lempel-Ziv code, which will be considered at a later stage of this course.
Data Compression and Coin Flips
When a random source is compressed into a sequence of bits so that the average length is minimized, the encoded sequence is essentially incompressible, and therefore has an entropy rate close to 1 bit per symbol. The bits of the encoded sequence are essentially fair coin flips. Let us now go in the opposite direction: How many fair coin flips does it take to generate a random variable X drawn according to some specified probability mass function p?
Suppose we wish to generate a random variable X taking values in {a, b, c} with distribution (1/2, 1/4, 1/4). The answer is obvious. If the first bit (coin toss) is 0, let X = a. If the first two bits are 10, let X = b. If the first two bits are 11, let X = c. The average number of fair bits required for generating this random variable is (1/2) · 1 + (1/4) · 2 + (1/4) · 2 = 1.5 bits. This is also the entropy of the distribution.
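The mapping from coin flips to outcomes can be sketched as follows (names are ours; the bit stream is any iterator of 0/1 values):

```python
import random

def fair_bits():
    """A stream of fair coin flips."""
    while True:
        yield random.randint(0, 1)

def generate(bits):
    """Generate X in {a, b, c} with distribution (1/2, 1/4, 1/4) from fair bits.
    Returns (symbol, number of bits consumed)."""
    if next(bits) == 0:
        return "a", 1
    return ("b", 2) if next(bits) == 0 else ("c", 2)
```

Feeding the bit 0 gives a after one bit; the pairs 10 and 11 give b and c after two bits, so the expected number of bits is 1.5.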
Algorithm for Generating Random Variables (1)
We map (possibly infinite!) strings of bits Z_1, Z_2, . . . to possible outcomes X by a binary tree, where the leaves are marked by output symbols X and the path to a leaf is given by the sequence of bits produced by the fair coin. For example, the tree for the distribution in the previous example, (1/2, 1/4, 1/4), is shown in [Cov, Fig. 5.8]. Theorem 5.12.1. For any algorithm generating X, the expected number of fair bits used is at least the entropy, that is, ET ≥ H(X).
Algorithm for Generating Random Variables (2)
If the distribution is not dyadic (that is, 2-adic), we first write the probabilities as the (possibly infinite) sum of dyadic probabilities, called atoms. (In fact, this means finding the binary expansions of the probabilities.) In constructing the tree, the same approach as in proving the Kraft inequality can be used. An atom of the form 2^{-j} is associated to a leaf at depth j. All the leaves of the atoms of the probability of an output symbol are marked with that symbol.
Let X = {a, b} with the distribution (2/3, 1/3). The binary expansions of the two probabilities are 0.101010 . . . and 0.010101 . . ., respectively. Hence the atoms are

2/3 = 1/2 + 1/8 + 1/32 + · · · ,

1/3 = 1/4 + 1/16 + 1/64 + · · · .
Huffman coding compresses an i.i.d. source with a known distribution p(x) to its entropy limit H(X). However, if the code is designed for another distribution q(x), a penalty of D(p‖q) is incurred. Huffman coding is sensitive to the assumed distribution. What can be achieved if the true distribution p(x) is unknown? Is there a universal code with rate R that suffices to describe every i.i.d. source with entropy H(X) < R? Yes!
A fixed-rate block code of rate R for a source X_1, X_2, . . . , X_n which has an unknown distribution Q consists of two mappings: the encoder,

f_n : X^n → {1, 2, . . . , 2^{nR}},

and the decoder,

φ_n : {1, 2, . . . , 2^{nR}} → X^n.
The probability of error for the code with respect to the distribution Q is

P_e^{(n)} = Q^n({(X_1, . . . , X_n) : φ_n(f_n(X_1, . . . , X_n)) ≠ (X_1, . . . , X_n)}).

A rate R block code for a source is called universal if the functions f_n and φ_n do not depend on the distribution Q and if P_e^{(n)} → 0 as n → ∞ when H(Q) < R. Theorem 12.3.1. There exists a sequence of (n, 2^{nR}) universal source codes such that P_e^{(n)} → 0 as n → ∞ for every source Q such that H(Q) < R.
One universal coding scheme is given in the proof of [Cov, Theorem 12.3.1]. That scheme, which is due to Csiszár and Körner, is universal over the set of i.i.d. distributions. We shall look in detail at another algorithm, the Lempel-Ziv algorithm, which is a variable-rate universal code. Q: If universal codes also reach the limit given by the entropy, why do we need Huffman and similar codes (which are specific to a probability distribution)? A: Universal codes need longer block lengths for the same performance, and their encoders and decoders are more complex.
Most universal compression algorithms used in the real world are based on algorithms developed by Lempel and Ziv, and we therefore talk about Lempel-Ziv (LZ) coding. LZ algorithms are good at compressing data that cannot be modeled simply, such as English text and computer source code (note that LZ also compresses sources other than i.i.d. ones). Computer compression programs, such as compress, gzip, and WinZip, and the GIF format are based on LZ coding.
The following are the main variants of Lempel-Ziv coding.
LZ77: Also called sliding window Lempel-Ziv.
LZ78: Also called dictionary Lempel-Ziv. Described in the textbook, in [Cov, Sect. 12.10].
LZW: Another variant, not described here.
With these algorithms, text with any alphabet size can be compressed. Common sizes are, for example, 2 (binary sequences) and 256 (computer files consisting of a sequence of bytes).
LZ78
In the description of the algorithm, we act on the string 1011010100010. The algorithm is as follows:
1. Parse the source into strings that have not appeared so far: 1, 0, 11, 01, 010, 00, 10.
2. Code a substring as (i, c), where i is the index of the substring (starting from 1; 0 = the empty string) and c is the value of the additional character: (0,1), (0,0), (1,1), (2,1), (4,0), (2,0), (1,0).
To express the location (in the example, an integer between 0 and 4 = c(n)), we need log(c(n) + 1) bits.
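The parsing step can be sketched directly (our own implementation, not course code):

```python
def lz78_parse(s):
    """LZ78: parse s into new phrases and code each as (index, next_char)."""
    dictionary = {"": 0}   # index 0 is the empty string
    pairs = []
    phrase = ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch           # extend the current match
        else:
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                     # trailing phrase already in the dictionary
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs
```

Running it on 1011010100010 reproduces the pairs above: (0,1), (0,0), (1,1), (2,1), (4,0), (2,0), (1,0).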
LZ77
We again use the example string 1011010100010. In each step, we proceed as follows:
1. Find p, the relative position of the longest match (the length of which is denoted by l).
2. Output (p, l, c), where c is the first character that does not match.
3. Advance l + 1 positions.
For the example string, this gives (0, 0, 1), (0, 0, 0), (2, 1, 1), (3, 2, 0), (2, 2, 0), . . .
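A brute-force sketch of this scheme (our own code; real implementations restrict the search to a sliding window, as noted below):

```python
def lz77_encode(s):
    """LZ77: emit (p, l, c) triples; p is the distance back to the longest
    match, l its length, and c the first non-matching character."""
    i, out = 0, []
    while i < len(s):
        best_p, best_l = 0, 0
        for p in range(1, i + 1):          # candidate relative positions
            l = 0
            # matches may overlap the current position, so compare in s itself;
            # stop one short of the end so a next character c always exists
            while i + l < len(s) - 1 and s[i - p + l] == s[i + l]:
                l += 1
            if l > best_l:
                best_p, best_l = p, l
        out.append((best_p, best_l, s[i + best_l]))
        i += best_l + 1
    return out
```

On 1011010100010 the first five triples agree with the trace above; one further triple, (4, 2, 0), finishes the string.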
Several details have to be taken into account; we give one such example for each of the algorithms:
LZ77: To avoid large values for the relative positions of strings, do not go too far back (sliding window!).
LZ78: To avoid too large dictionaries, reduce the size in one of several possible ways; for example, throw the dictionary away when it reaches a certain size (GIF does this).
Traditionally, LZ77 was better but slower, but the gzip version is almost as fast as any LZ78.
Run-Length Coding
A compression method that does not reach the entropy bound but is used in many applications, including fax machines, is run-length coding. In this method, the input sequence is compressed by identifying adjacent symbols of equal value and replacing them with a single symbol and a count. Example. 111111110100000 → (1, 8), (0, 1), (1, 1), (0, 5). Clearly, in the binary case, it suffices to give the first bit and then only the lengths of the runs.
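Run-length encoding is essentially a one-liner with itertools.groupby (a sketch; the function name is ours):

```python
from itertools import groupby

def run_length_encode(s):
    """Replace each run of equal adjacent symbols with a (symbol, count) pair."""
    return [(sym, len(list(run))) for sym, run in groupby(s)]
```

For the example string, run_length_encode("111111110100000") gives [('1', 8), ('0', 1), ('1', 1), ('0', 5)].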
Theorem 12.10.2. Let {X_i} be a stationary ergodic stochastic process. Let l(X_1, X_2, . . . , X_n) be the length of the Lempel-Ziv codeword associated with X_1, X_2, . . . , X_n. Then

lim sup_{n→∞} (1/n) l(X_1, X_2, . . . , X_n) ≤ H(X)  with probability 1,

where H(X) is the entropy rate of the process. From the examples given, it is obvious that these compression methods are efficient only for long input sequences.
Lossy Compression
So far, only lossless compression has been considered. In lossy compression, loss of information is allowed:
scalar quantization: Take the set of possible messages S and reduce it to a smaller set S′ with a mapping f : S → S′. For example, least significant bits are dropped.
vector quantization: Map a multidimensional space S into a smaller set S′ of messages.
transform coding: Transform the input into a different form that can be more easily compressed (in a lossy or lossless way).
Lossy methods include JPEG (still images) and MPEG (video).
The effects of quantization are studied in rate distortion theory, the basic problem of which can be stated as follows: Q: Given a source distribution and a distortion measure, what is the minimum expected distortion achievable at a particular rate? Q: (Equivalent) What is the minimum rate description required to achieve a particular distortion? It turns out that, perhaps surprisingly, it is more efficient to describe two (even independent!) variables jointly than individually. Rate distortion theory can be applied to both discrete and continuous random variables.
Example: Quantization (1)
We look at the basic problem of representing a single continuous random variable by a finite number of bits. Denote the random variable by X and the representation of X by X̂(X). With R bits, the function X̂ can take on 2^R values. What is the optimum set of values for X̂ and the regions associated with each value of X̂?
With X ∼ N(0, σ²) and a squared-error distortion measure, we wish to find a function X̂ that takes on at most 2^R values (these are called reproduction points) and minimizes E(X − X̂(X))². With one bit, obviously the bit should distinguish whether X < 0 or not. To minimize squared error, each reproduced symbol should be at the conditional mean of its region (see [Cov, Fig. 13.1]), and we have

X̂(x) = +√(2/π) σ  if x ≥ 0,
X̂(x) = −√(2/π) σ  if x < 0.
With two or more bits to represent the sample, the situation gets far more complicated. The following facts state simple properties of optimal regions and reconstruction points. Given a set of reconstruction points, the distortion is minimized by mapping a source random variable X to the representation X̂ that is closest to it. The set of regions defined by this mapping is called a Voronoi or Dirichlet partition defined by the reconstruction points. The reconstruction points should minimize the conditional expected distortion over their respective assignment regions.
The aforementioned properties enable algorithms for finding good quantizers: the Lloyd algorithm (for real-valued random variables) and the generalized Lloyd algorithm (for vector-valued random variables). Starting from any set of reconstruction points, repeat the following steps: 1. Find the optimal set of reconstruction regions. 2. Find the optimal reconstruction points for these regions. The expected distortion is decreased at each stage of this algorithm, so it converges to a local minimum of the distortion.
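A sample-based sketch of the Lloyd iteration for the one-bit Gaussian quantizer (our own code, not from the course; this empirical variant is essentially 1-D k-means):

```python
import random
import statistics

def lloyd(samples, points, iters=20):
    """Lloyd's algorithm for a scalar quantizer: alternate between
    (1) nearest-point assignment regions and (2) region centroids."""
    points = sorted(points)
    for _ in range(iters):
        regions = [[] for _ in points]
        for x in samples:
            j = min(range(len(points)), key=lambda k: (x - points[k]) ** 2)
            regions[j].append(x)
        # an empty region keeps its old point
        points = [statistics.fmean(r) if r else p for r, p in zip(regions, points)]
    return sorted(points)

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(5000)]
points = lloyd(samples, [-1.0, 1.0])
```

With enough samples, the two reconstruction points approach the conditional means ±√(2/π) σ ≈ ±0.80 of the two half-lines for σ = 1.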