Data Compression
fixed-length codes
variable-length codes
an application
adaptive codes
References:
Algorithms 2nd edition, Chapter 22
http://www.cs.princeton.edu/algs4/65compression
Applications
Archivers: PKZIP.
Multimedia.
  Sound: MP3.
  Video: MPEG, DivX, HDTV.
Communication.
  V.42bis modem.
Databases. Google.
[Diagram: message M -> Encoder -> C(M) -> Decoder -> M']
Compression ratio. Bits in C(M) / bits in M.
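The compression ratio defined on this slide can be computed directly. The sketch below is only an illustration: the function name is hypothetical, and the sample sizes assume the 12-symbol message from the fixed-length coding slides (36 bits with a 3-bit code, 12 × 7 = 84 bits in 7-bit ASCII).

```python
def compression_ratio(compressed_bits: int, original_bits: int) -> float:
    """Return bits in C(M) divided by bits in M."""
    if original_bits <= 0:
        raise ValueError("original message must contain at least one bit")
    return compressed_bits / original_bits


# Assumed sizes: 12 symbols at 3 bits each versus 12 symbols at 7 bits each.
ratio = compression_ratio(12 * 3, 12 * 7)
print(f"{ratio:.3f}")
```

A ratio below 1 means the encoding is shorter than the original message.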
Number systems.
Natural languages.
Mathematical notation.
Braille.
Morse code.
Telephone system.
MP3.
MPEG.
Pf 2 (by contradiction).
A. Quite a bit.
Fixed-length coding

7-bit ASCII code.
char   decimal   code
NUL    0         0000000
...    ...       ...
a      97        1100001
b      98        1100010
c      99        1100011
d      100       1100100
...    ...       ...
DEL    127       1111111

1100001 1100010 1110010 1100001 1100011 1100001 1100100 1100001 1100010 1110010 1100001 1111111

3-bit code.
char   code
a      000
b      001
c      010
d      011
r      100
!      111

000 001 100 000 010 000 011 000 001 100 000 111
12 symbols × 3 bits per symbol = 36 bits in code
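A minimal sketch of fixed-length encoding and decoding, assuming the 3-bit code shown on the slide (a 000, b 001, c 010, d 011, r 100, ! 111) and the message abracadabra!. The names `CODE`, `encode`, and `decode` are illustrative, not taken from the lecture.

```python
# 3-bit code assumed from the slide; every codeword has the same width.
CODE = {"a": "000", "b": "001", "c": "010", "d": "011", "r": "100", "!": "111"}


def encode(message, code):
    """Replace each symbol with its codeword, separated by spaces for readability."""
    return " ".join(code[ch] for ch in message)


def decode(bits, code):
    """Cut the bit string into equal-width chunks and look each one up."""
    width = len(next(iter(code.values())))
    inverse = {word: ch for ch, word in code.items()}
    packed = bits.replace(" ", "")
    return "".join(inverse[packed[i:i + width]] for i in range(0, len(packed), width))


encoded = encode("abracadabra!", CODE)
print(encoded)
print(len(encoded.replace(" ", "")), "bits")
```

Because every codeword has the same width, the decoder needs no delimiters: it simply reads three bits at a time.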
4 different nucleotides.
2 bits suffice.
Amazing but true: initial databases in the 1990s did not use such a code!
~ 2N bits to encode genome with N nucleotides
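The 2-bits-per-nucleotide idea can be sketched as follows. The assignment of codewords to letters (a 00, c 01, t 10, g 11) and the sample sequence are assumptions made for illustration, since the char column of the slide's table did not survive extraction.

```python
# Assumed assignment of 2-bit codes to the four nucleotides.
NUCLEOTIDE_CODE = {"a": 0b00, "c": 0b01, "t": 0b10, "g": 0b11}
LETTERS = "actg"  # index i holds the letter whose code is i


def pack_genome(seq):
    """Pack a nucleotide string into an integer, 2 bits per nucleotide."""
    value = 0
    for ch in seq:
        value = (value << 2) | NUCLEOTIDE_CODE[ch]
    return value, 2 * len(seq)


def unpack_genome(value, nbits):
    """Recover the nucleotide string from the packed integer."""
    return "".join(LETTERS[(value >> shift) & 0b11]
                   for shift in range(nbits - 2, -1, -2))


value, nbits = pack_genome("actacagatga")
print(nbits, "bits for 11 nucleotides")
```

The bit count is exactly 2N for N nucleotides, compared with 7N or 8N for a character encoding.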
[Table and example residue; the char column was lost in extraction. Codewords in extraction order: 00 01 10 00 01 00 11 00 10 11 00 00 00 01 10 11]
Variable-length coding
[Figure residue: Morse code example. Candidate decodings of one signal: SOS ? IAMIE ? EEWNI ? V7O ? Labels: prefix of V; prefix of I, S; prefix of S.]
[Figure residue: char/code tables and trie edge labels for variable-length codes. Codewords listed: 111, 1010, 100, 110, 1101.]
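The ambiguity above arises when one codeword is a prefix of another. A small check for that property might look like the sketch below; `is_prefix_free` is a hypothetical helper, the Morse codewords for E, I, S, V are the standard ones, and the second code is the variable-length code used for abracadabra! later in these slides.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of another codeword."""
    words = sorted(codewords)
    # After sorting, a prefix relation always appears between neighbors.
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))


morse = {"E": ".", "I": "..", "S": "...", "V": "...-"}
prefix_free = {"a": "0", "b": "111", "c": "1010", "d": "100", "r": "110", "!": "1011"}

print(is_prefix_free(morse.values()))        # E is a prefix of I, S, and V
print(is_prefix_free(prefix_free.values()))
```

Morse code works in practice only because operators insert gaps between letters; a prefix-free code needs no such separator.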
A. Binary trie.
Encoding.
char   code
a      0
b      111
c      1010
d      100
r      110
!      1011

*a**d*c!*rb                     preorder traversal
12                              # chars to decode
0111110010100100011111001011    the message bits (pack 8 to the byte)

Decoding.
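Decoding with the trie can be sketched as follows, using the preorder string, character count, and message bits shown on the slide. The tuple-based trie and the function names are assumptions for illustration; the sketch also assumes '*' never occurs as a message character, since it marks internal nodes.

```python
def read_trie(preorder):
    """Rebuild the binary trie from its preorder string ('*' = internal node)."""
    pos = 0

    def build():
        nonlocal pos
        ch = preorder[pos]
        pos += 1
        if ch == "*":
            left = build()          # subtrie reached by bit 0
            right = build()         # subtrie reached by bit 1
            return (left, right)
        return ch                   # leaf holds a character

    return build()


def decode(trie, bits, n):
    """Decode n characters by walking from the root once per character."""
    out = []
    i = 0
    for _ in range(n):
        node = trie
        while isinstance(node, tuple):
            node = node[0] if bits[i] == "0" else node[1]
            i += 1
        out.append(node)
    return "".join(out)


trie = read_trie("*a**d*c!*rb")
message = decode(trie, "0111110010100100011111001011", 12)
print(message)
```

Since no codeword is a prefix of another, each root-to-leaf walk consumes exactly one codeword and decoding never needs to back up.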
Huffman coding
A. Huffman code. [David Huffman]
[Figure residue: Huffman trie construction; node labels include frequencies 1, 2, 3, 7, 12 and chars a, b, d, r, !.]
Implementation.
tabulate frequencies
initialize PQ
merge tries
[Code annotations: internal node marker; distinct symbols; total frequency.]
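The three implementation steps named above (tabulate frequencies, initialize PQ, merge tries) can be sketched with Python's `heapq`. This is an illustrative reconstruction, not the lecture's code: tuples stand in for internal nodes, and a counter breaks ties so that the heap never compares tries directly.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Build a Huffman code for text and return a char -> codeword dict."""
    if not text:
        raise ValueError("cannot build a code for an empty message")
    freq = Counter(text)                                   # tabulate frequencies
    # Initialize PQ: one single-node trie per distinct symbol.
    pq = [(f, i, ch) for i, (ch, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(pq)
    counter = len(pq)
    while len(pq) > 1:                                     # merge tries
        f1, _, left = heapq.heappop(pq)
        f2, _, right = heapq.heappop(pq)
        heapq.heappush(pq, (f1 + f2, counter, (left, right)))
        counter += 1
    trie = pq[0][2]

    code = {}

    def walk(node, prefix):
        if isinstance(node, tuple):                        # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                              # leaf
            code[node] = prefix or "0"

    walk(trie, "")
    return code


code = huffman_code("abracadabra!")
total_bits = sum(len(code[ch]) for ch in "abracadabra!")
print(code, total_bits)
```

The individual codewords depend on how ties are broken, but the total encoded length is the same for every Huffman code of a given message.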
300 pixels/inch.
8.5-by-11 inches.
300*8.5*300*11 = 8.415 million bits.
000000000000000000000000000011111111111111000000000
000000000000000000000000001111111111111111110000000
000000000000000000000001111111111111111111111110000
000000000000000000000011111111111111111111111111000
000000000000000000001111111111111111111111111111110
000000000000000000011111110000000000000000001111111
000000000000000000011111000000000000000000000011111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000001111000000000000000000000001110
000000000000000000000011100000000000000000000111000
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011000000000000000000000000000000000000000000000011
run-length encoding (one line of run lengths per 51-bit raster row; runs alternate, starting with 0s)
28 14 9
26 18 7
23 24 4
22 26 3
20 30 1
19 7 18 7
19 5 22 5
19 3 26 3
19 3 26 3
19 3 26 3
19 3 26 3
20 4 23 3 1
22 3 20 3 3
1 50
1 50
1 50
1 50
1 50
1 2 46 2
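The run lengths listed for each raster row can be reproduced with a short sketch. The function names are hypothetical, and the convention that the first run always counts 0s (so a row starting with 1 gets a leading run of length 0) is an assumption.

```python
from itertools import groupby


def run_lengths(row):
    """Return alternating run lengths of a bit string, starting with a run of 0s."""
    runs = [len(list(group)) for _, group in groupby(row)]
    if row and row[0] == "1":
        runs.insert(0, 0)           # assumed convention: first run counts 0s
    return runs


def expand(runs):
    """Inverse of run_lengths: rebuild the bit string from its run lengths."""
    return "".join(("0" if i % 2 == 0 else "1") * n for i, n in enumerate(runs))


first_row = "0" * 28 + "1" * 14 + "0" * 9      # first row of the 51-bit raster
print(run_lengths(first_row))
```

Each row of 51 bits shrinks to a handful of small integers, which is why long runs make bitmaps so compressible.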
Run-length encoding

Example codewords.
3W      1000
1B      010
2W      0111
2B      11
130W    10010 0111    (128 + 2)

run      white        black
0        00110101     0000110111
1        000111       010
2        0111         11
3        1000         10
...      ...          ...
63       00110100     000001100111
64+      11011        0000001111
128+     10010        000011001000
...      ...          ...
1728+    010011011    0000001100101
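The 130W example (128 + 2) suggests how a run is split into a make-up codeword for a multiple of 64 followed by a terminating codeword for the remainder. The sketch below assumes that rule and contains only the table rows visible on the slide, so it raises `KeyError` for any run whose pieces are not in this partial table.

```python
# Partial tables: only the rows shown on the slide.
TERMINATING = {
    "white": {0: "00110101", 1: "000111", 2: "0111", 3: "1000", 63: "00110100"},
    "black": {0: "0000110111", 1: "010", 2: "11", 3: "10", 63: "000001100111"},
}
MAKEUP = {
    "white": {64: "11011", 128: "10010", 1728: "010011011"},
    "black": {64: "0000001111", 128: "000011001000", 1728: "0000001100101"},
}


def encode_run(length, color):
    """Encode one run as an optional make-up codeword plus a terminating codeword."""
    parts = []
    makeup = length - length % 64           # largest multiple of 64 in the run
    if makeup:
        parts.append(MAKEUP[color][makeup])
    parts.append(TERMINATING[color][length % 64])
    return " ".join(parts)


print(encode_run(130, "white"))
```

Short runs get short codewords because they are the most common in scanned text.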
Idea.
Bottom line. Any extra information about the file can yield dramatic gains.
Statistical methods
Fast.
Not optimal: different texts have different statistical properties.
Ex: ASCII, Morse code.
Adaptive model. Progressively learn and update model as you read text.
Ex: LZW.

Lempel-Ziv-Welch encoding
LZW encoding.
previous substring.
Length of strings in ST grows, hence compression.
input   code   add to ST
a       97     ab
b       98     br
r       114    ra
a       97     ac
c       99     ca
a       97     ad
d       100    da
ab      128    abr
ra      130    rac
ca      132    cad
da      134    dab
br      129    bra
a       97

ASCII                ST
key    value         key    value
NUL    0             ab     128
...    ...           br     129
a      97            ra     130
b      98            ac     131
c      99            ca     132
d      100           ad     133
e      101           da     134
f      102           abr    135
g      103           rac    136
...    ...           cad    137
r      114           dab    138
...    ...           bra    139
DEL    127           ...    ...
To send (encode) M.
Find longest string s in ST that is a prefix of unsent part of M.
Message. 7-bit ASCII. 19 chars. 133 bits.
Encoding. 8-bit codewords. 14 codewords. 112 bits.
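A minimal sketch of the encoding loop: find the longest string s in ST that is a prefix of the unsent input, emit its codeword, then (the standard LZW step) add s plus the next character to ST. The function name, the use of a Python dict for ST, and the STOP codeword 255 are assumptions for illustration; the input is the abracadabracadabra example traced in the tables.

```python
def lzw_encode(message, stop=255):
    """Encode a 7-bit ASCII string as a list of LZW codewords ending with stop."""
    st = {chr(i): i for i in range(128)}    # ST starts with single ASCII chars
    next_code = 128
    out = []
    i = 0
    while i < len(message):
        # Grow j while message[i:j+1] is still a key in ST.
        j = i + 1
        while j < len(message) and message[i:j + 1] in st:
            j += 1
        s = message[i:j]                    # longest prefix found in ST
        out.append(st[s])
        if j < len(message) and next_code < stop:
            st[message[i:j + 1]] = next_code    # add s + next char to ST
            next_code += 1
        i = j
    out.append(stop)
    return out


codes = lzw_encode("abracadabracadabra")
print(codes)
```

A dict lookup on substrings keeps the sketch short; a trie would avoid rebuilding each candidate substring.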
[Figure residue: ASCII and ST tables (NUL 0, ..., DEL 127; ab 128, br 129, ra 130, ac 131, ca 132, ad 133, da 134, abr 135, rac 136, cad 137, dab 138, bra 139) and a trie representation of the ST used to encode.]

Lempel-Ziv-Welch decoding
LZW decoding.
codeword   output   add to ST
97         a
98         b        ab
114        r        br
97         a        ra
99         c        ac
97         a        ca
100        d        ad
128        ab       da
130        ra       abr
132        ca       rac
134        da       cad
129        br       dab
97         a        bra
255        STOP

ST
key    value
128    ab
129    br
130    ra
131    ac
132    ca
133    ad
134    da
135    abr
136    rac
137    cad
138    dab
139    bra
...    ...

Use an array to implement ST.
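The decoding trace can be sketched as follows: the decoder rebuilds the same ST one step behind the encoder, adding the previous output plus the first character of the current output. The function name and the STOP codeword 255 are assumptions; the ST is a Python list, in the spirit of the slide's note about using an array.

```python
def lzw_decode(codes, stop=255):
    """Decode a list of LZW codewords back into the original string."""
    st = [chr(i) for i in range(128)]       # array ST: index = codeword
    out = []
    prev = None
    for code in codes:
        if code == stop:
            break
        if code < len(st):
            s = st[code]
        elif code == len(st) and prev is not None:
            s = prev + prev[0]              # codeword not yet in ST (tricky case)
        else:
            raise ValueError(f"invalid codeword {code}")
        if prev is not None and len(st) < stop:
            st.append(prev + s[0])          # entry the encoder added one step earlier
        out.append(s)
        prev = s
    return "".join(out)


codes = [97, 98, 114, 97, 99, 97, 100, 128, 130, 132, 134, 129, 97, 255]
text = lzw_decode(codes)
print(text)
```

The `elif` branch handles a codeword that the encoder emitted immediately after creating it, which the decoder has not yet added to its table.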
LZ77.
LZ78.
LZW.
PNG: LZ77.
WinZip, gzip, jar: deflate.
Unix compress: LZW.
PKZIP: LZW + Shannon-Fano.
GIF, TIFF, V.42bis modem: LZW.
Google: zlib, which is based on deflate.
never expands a file
year    scheme             bits / char
1967    ASCII              7.00
1950    Huffman            4.70
1977    LZ77               3.94
1984    LZMW               3.32
1987    LZH                3.30
1987    move-to-front      3.24
1987    LZB                3.18
1987    gzip               2.71
1988    PPMC               2.48
1994    SAKDC              2.47
1994    PPM                2.34
1995    Burrows-Wheeler    2.29
1997    BOA                1.99
1999    RK                 1.89
next assignment