
DATA COMPRESSION

The process of reducing the volume of data by applying a compression technique is called compression. The resulting data is called compressed data.
The reverse process of reproducing the original data from compressed data is called decompression. The resulting data is called decompressed data.
Reasons to Compress
Reduce file size
Save disk space
Increase transfer speed at a given data rate
Allow real-time transfer at a given data rate
Basics of Compression

A good metric for compression is the compression factor (or compression ratio), given by:

compression factor = original size / compressed size

If we have a 100 KB file that we compress to 40 KB, we have a compression factor of:

100 KB / 40 KB = 2.5
Basics of Compression

Compression is achieved by removing data redundancy while preserving information content.
The information content of a group of bytes (a message) is its entropy.
Data with low entropy permit a larger compression ratio than data with high entropy.
How to determine the entropy
Find the probability p(x) of each symbol x in the message.
The entropy H(x) of the symbol x is:
H(x) = -p(x) log2 p(x)
The average entropy of the entire message is the sum of the entropies of all n symbols:
H = -Σ p(x_i) log2 p(x_i), summed over i = 1 … n
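The entropy formula above can be sketched in Python (a minimal illustration; the name `average_entropy` is mine, not from the slides):

```python
import math
from collections import Counter

def average_entropy(message: str) -> float:
    """Average entropy in bits/symbol, computed from symbol frequencies."""
    counts = Counter(message)
    total = len(message)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# "aaab": p(a) = 3/4, p(b) = 1/4
# H = -(0.75*log2(0.75) + 0.25*log2(0.25)) ≈ 0.811 bits/symbol
print(round(average_entropy("aaab"), 3))  # 0.811
```

A message of identical symbols has entropy 0, so it compresses maximally; a message of uniformly random symbols has maximal entropy and resists compression.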
Types of compression
techniques
Compression techniques can be categorized based on the following considerations:

Lossless or lossy
Symmetrical or asymmetrical
Software or hardware
1. Lossless or lossy
If the decompressed data is the same as the original data, it is referred to as lossless compression; otherwise, the compression is lossy.
2. Symmetrical or asymmetrical
In symmetrical compression, the
time required to compress and to
decompress are nearly the same.
In asymmetrical compression, the
time taken for compression is
usually much longer than
decompression.
3. Software or hardware
A compression technique may
be implemented either in
hardware or software. As
compared to software codecs
(coder and decoder), hardware
codecs offer better quality and
performance.
Techniques for lossless compression
Based on length of word: Run-length encoding
Based on codewords: Huffman codes
Based on dictionaries: Lempel-Ziv
Run-length encoding
Simplest method of compression.
How: replace sequential repeating occurrences of a symbol by one occurrence of the symbol, followed by the number of occurrences.
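This replacement rule can be sketched in a few lines of Python (an illustrative helper, not part of the slides):

```python
def rle_encode(data: str) -> str:
    """Run-length encode: each run of a symbol becomes the symbol
    followed by its repetition count."""
    if not data:
        return ""
    out = []
    prev, count = data[0], 1
    for ch in data[1:]:
        if ch == prev:
            count += 1
        else:
            out.append(f"{prev}{count}")
            prev, count = ch, 1
    out.append(f"{prev}{count}")  # flush the final run
    return "".join(out)

print(rle_encode("AAAABBBCCD"))  # A4B3C2D1
```

Note that RLE only pays off when runs are long; "ABAB" would expand to "A1B1A1B1".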
Run-length encoding for two symbols
When one symbol is more frequent than the other, we can encode only the runs of the frequent symbol.
This example encodes only the runs of 0s between 1s; a count of zero means there is no 0 between two consecutive 1s.
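The two-symbol variant above can be sketched as follows (a minimal illustration for bit strings; `rle_binary` is a name I've introduced):

```python
def rle_binary(bits: str) -> list[int]:
    """Encode a bit string as the lengths of the 0-runs before each 1
    (plus the trailing run). A count of 0 means two 1s are adjacent."""
    counts = []
    run = 0
    for b in bits:
        if b == "0":
            run += 1
        else:  # b == "1" ends the current run of 0s
            counts.append(run)
            run = 0
    counts.append(run)  # zeros after the last 1
    return counts

print(rle_binary("0001100000010"))  # [3, 0, 6, 1]
```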
Huffman Coding
Assign fewer bits to symbols that occur more frequently and more bits to symbols that appear less often.
There is no unique Huffman code, but every Huffman code for the same symbol probabilities has the same average code length.
Algorithm:
1. Make a leaf node for each code symbol.
2. Add the generation probability of each symbol to its leaf node.
3. Take the two nodes with the smallest probabilities and connect them into a new node.
4. Add 1 or 0 to each of the two branches.
5. The probability of the new node is the sum of the probabilities of the two connected nodes.
6. If there is only one node left, the code construction is complete. If not, go back to step 3.
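The merge loop above can be sketched with a min-heap (a compact illustration; the slides give no code, and this tracks codewords directly instead of building an explicit tree):

```python
import heapq

def huffman_code(probs: dict[str, float]) -> dict[str, str]:
    """Build a Huffman code by repeatedly merging the two
    least-probable nodes until one node remains."""
    # Heap entries: (probability, tiebreak, {symbol: partial codeword})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)  # smallest probability
        p2, _, codes2 = heapq.heappop(heap)  # second smallest
        # Prefix 0 on one branch and 1 on the other
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

code = huffman_code({"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1})
print(code)  # A gets 1 bit, B gets 2, C and D get 3
```

Different tie-breaking choices produce different (equally valid) codes, which is why the Huffman code is not unique.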
Huffman coding
Take the two nodes with the smallest probabilities and connect them into a new node.
Assign shorter codes to symbols that occur more frequently and longer codes to those that occur less frequently.
Add a 0 on the smaller branch and a 1 on the larger branch.
[Figure: final Huffman tree and the resulting code]
Huffman encoding and decoding
[Figure: worked examples of encoding and decoding with the tree above]
Huffman coding
The beauty of Huffman coding is that no code is the prefix of another code.
There is no ambiguity in encoding, and the receiver can decode the received data without ambiguity.
A Huffman code is called an instantaneous code because the decoder can decode each symbol unambiguously as soon as its last bit arrives, using the minimum number of bits.
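The prefix property makes the decoder trivial, as this sketch shows (the example code table is illustrative, not from the slides):

```python
def huffman_decode(bits: str, code: dict[str, str]) -> str:
    """Decode a bit string with a prefix-free code: emit a symbol as
    soon as the accumulated bits match a codeword (instantaneous)."""
    inverse = {c: s for s, c in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:  # prefix-free ⇒ first match is the symbol
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

code = {"A": "0", "B": "10", "C": "110", "D": "111"}
print(huffman_decode("010110111", code))  # ABCD
```

Because no codeword is a prefix of another, the decoder never needs to look ahead or backtrack.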
Lempel Ziv Encoding
It is a dictionary-based encoding.
Basic idea:
Create a dictionary (a table) of strings used during communication.
If both sender and receiver have a copy of the dictionary, then previously encountered strings can be substituted by their index in the dictionary.
Dictionary-based encoding example
Original text:
ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY
Dictionary:
1. ASK
2. NOT
3. WHAT
4. YOUR
5. COUNTRY
6. CAN
7. DO
8. FOR
9. YOU
Encoded based on dictionary:
1 2 3 4 5 6 7 8 9 1 3 9 6 7 8 4 5
Lempel Ziv Compression
Has 2 phases:
Building an indexed dictionary
Compressing a string of symbols
Algorithm:
Extract the smallest substring that cannot be found in the dictionary.
Store that substring in the dictionary as a new entry and assign it an index value.
Insert the index of the substring's already-known prefix and the last character of the substring into the compressed string.
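The two phases can be sketched in the LZ78 style (a minimal illustration assuming `(index, char)` output pairs; the slides show no code):

```python
def lz78_compress(text: str) -> list[tuple[int, str]]:
    """LZ78: emit (dictionary index of the longest known prefix, next
    char). Index 0 is the empty string; each emitted phrase is then
    added to the dictionary as a new entry."""
    dictionary = {"": 0}
    out = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending a known phrase
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:  # flush a trailing phrase that was already known
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

print(lz78_compress("ABAABABAABAB"))
# [(0, 'A'), (0, 'B'), (1, 'A'), (2, 'A'), (4, 'A'), (4, 'B')]
```

Both sender and receiver rebuild the same dictionary from the stream itself, so it never has to be transmitted.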
LZ-77 Principle Example
Input text:
IN_SPAIN_IT_RAINS_ON_THE_PLAIN
Coding: repeated substrings are replaced by {length, distance} pointers back into the already-coded text.
Coded output:
IN_SPA{3,6}IT_R{3,8}S_ON_THE_PL{3,14}
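A toy encoder reproducing this {length, distance} notation can be sketched as follows (an unbounded-window illustration preferring the nearest match; real LZ77 uses a sliding window):

```python
def lz77_compress(text: str, min_match: int = 3) -> str:
    """Toy LZ77: replace a repeat of at least min_match characters
    with a {length,distance} pointer back into already-coded text."""
    out = []
    i = 0
    while i < len(text):
        best_len, best_dist = 0, 0
        # scan earlier positions, nearest first, for the longest match
        for j in range(i - 1, -1, -1):
            length = 0
            while (i + length < len(text)
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            out.append("{%d,%d}" % (best_len, best_dist))
            i += best_len
        else:
            out.append(text[i])  # no worthwhile match: emit a literal
            i += 1
    return "".join(out)

print(lz77_compress("IN_SPAIN_IT_RAINS_ON_THE_PLAIN"))
# IN_SPA{3,6}IT_R{3,8}S_ON_THE_PL{3,14}
```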
Lossy Compression Methods
Used for compressing image and video files (our eyes cannot distinguish subtle changes, so lossy data is acceptable).
These methods are cheaper and take less time and space than lossless methods.
Several methods:
JPEG: compress pictures and graphics
MPEG: compress video
MP3: compress audio
JPEG Encoding
Used to compress pictures and
graphics.
In JPEG, a grayscale picture is divided into 8x8 pixel blocks to decrease the number of calculations.
Basic idea:
Change the picture into a linear set (vector) of numbers that reveals the redundancies.
The redundancies are then removed by one of the lossless compression methods.
JPEG Encoding - DCT
DCT: Discrete Cosine Transform
DCT transforms the 64 values in an 8x8 pixel block in a way that the relative relationships between pixels are kept but the redundancies are revealed.
Example: a gradient grayscale.
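The gradient example can be reproduced with a naive 2-D DCT-II (an O(N^4) sketch for clarity; real encoders use fast transforms):

```python
import math

def dct2(block):
    """Naive 2-D DCT-II on an N x N block (JPEG uses N = 8)."""
    n = len(block)
    def alpha(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out

# A horizontal gradient block: every row is 0, 32, 64, ..., 224
gradient = [[32 * y for y in range(8)] for _ in range(8)]
coeffs = dct2(gradient)
# The energy concentrates in the first row (u = 0); all other
# coefficients are ~0 -- the redundancy the slide mentions.
print(round(coeffs[0][0]))  # 896 (the DC coefficient, 8 * mean value)
```

The smooth gradient produces a table that is almost entirely zeros, which is exactly what the quantization and zigzag steps that follow exploit.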
Quantization & Compression
Quantization:
After the T (transform) table is created, the values are quantized to reduce the number of bits needed for encoding.
Quantization divides each value by a constant and then drops the fraction. This optimizes the number of bits and the number of 0s for each particular application.
Compression:
Quantized values are read from the table and redundant 0s are removed.
To cluster the 0s together, the table is read diagonally in a zigzag fashion. The reason is that if the table doesn't have fine changes, the bottom-right corner of the table is all 0s.
JPEG usually uses lossless run-length encoding at the compression phase.
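The zigzag read can be sketched as follows (illustrative 4x4 example; JPEG uses the same ordering on 8x8 blocks):

```python
def zigzag(table):
    """Read an N x N table diagonally in zigzag order so that the
    (typically zero) bottom-right values cluster at the end."""
    n = len(table)
    order = sorted(
        ((x, y) for x in range(n) for y in range(n)),
        key=lambda p: (p[0] + p[1],                       # anti-diagonal
                       p[0] if (p[0] + p[1]) % 2 else p[1]),  # direction
    )
    return [table[x][y] for x, y in order]

quantized = [
    [12, 8, 4, 0],
    [ 6, 3, 0, 0],
    [ 2, 0, 0, 0],
    [ 0, 0, 0, 0],
]
print(zigzag(quantized))
# [12, 8, 6, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

With all the zeros clustered at the tail, a single run-length token replaces them, which is where the actual size reduction happens.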
JPEG Encoding
[Figure: the same picture JPEG-compressed at 100%, 50%, 10%, and 5% quality]
MPEG Encoding
Used to compress video.
Basic idea:
Each video is a rapid sequence of a set of frames. Each frame is a spatial combination of pixels, or a picture.
Compressing video = spatially compressing each frame + temporally compressing a set of frames.
MPEG Encoding
Spatial Compression
Each frame is spatially compressed by JPEG.
Temporal Compression
Redundant frames are removed.
For example, in a static scene in which someone is talking, most frames are the same except for the segment around the speaker's lips, which changes from one frame to the next.
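The idea of keeping only what changed between frames can be sketched as follows (a toy per-pixel diff; real MPEG works on blocks with motion compensation):

```python
def temporal_diff(prev, curr, threshold=0):
    """Sketch of temporal redundancy removal: keep only the pixels
    that changed between two frames as (position, new value) pairs;
    identical regions need not be re-sent."""
    return [(i, c) for i, (p, c) in enumerate(zip(prev, curr))
            if abs(p - c) > threshold]

frame1 = [10, 10, 10, 10, 10, 10]
frame2 = [10, 10, 200, 210, 10, 10]  # only the "lips" region changed
print(temporal_diff(frame1, frame2))  # [(2, 200), (3, 210)]
```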
Audio Compression
Used for speech or music.
Speech: compress a 64 kbps digitized signal (8 kHz sampling, 8 bits per sample).
Music: compress a 1.411 Mbps signal (CD quality).
Two categories of techniques:
Predictive encoding
Perceptual encoding
Audio Encoding
Predictive Encoding
Only the differences between samples are encoded, not the whole sample values.
Several standards: GSM (13 kbps), G.729 (8 kbps), and G.723.1 (6.3 or 5.3 kbps).
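Encoding only the differences can be sketched with simple delta coding (an illustration of the principle; the listed standards use far more sophisticated predictors):

```python
def delta_encode(samples):
    """Predictive (delta) encoding: send the first sample, then only
    the difference between each sample and the previous one."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def delta_decode(deltas):
    """Reverse: accumulate the differences back into sample values."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

samples = [100, 102, 104, 103, 101, 101]
deltas = delta_encode(samples)
print(deltas)  # [100, 2, 2, -1, -2, 0]
assert delta_decode(deltas) == samples
```

Because adjacent audio samples are highly correlated, the differences are small numbers that need fewer bits than the raw sample values.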

Perceptual Encoding: MP3
CD-quality audio needs at least 1.411 Mbps and cannot be sent over the Internet without compression.
MP3 (MPEG audio layer 3) uses a perceptual encoding technique to compress audio.
THANK YOU.
20 March 2017
