
Name: Hayden Hodge

Login: hod0035

Report on Data Compression


Title: Data Compression
Task: Describe lossless compression methods
Introduction to the problem:
Data compression falls into two branches: lossless and lossy. Lossless compression removes
only statistically redundant bits, and therefore loses no information. Lossy compression, on the other hand, reduces
bits by removing 'unnecessary' information. For instance, when compressing images or music, some
loss of detail is acceptable, possibly even unnoticeable.
Elaboration:
The most important factors of compression are:
The compression ratio, denoted CR, where higher is better:
CR = uncompressed data size / compressed data size
The speed of compression and decompression, where faster is better.
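To make the ratio concrete, here is a minimal Python sketch (the sizes are made-up example values, not measurements from any real file):

    # Hypothetical sizes in bytes, for illustration only.
    uncompressed_size = 10_000
    compressed_size = 4_000

    cr = uncompressed_size / compressed_size
    print(f"CR = {cr:.2f}")  # CR = 2.50: the data shrank to 40% of its original size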
There are also two major groups of methods: Statistical methods and Dictionary methods.
Statistical Methods:
Run Length Encoding:
This is an extremely simple method, but the data to be compressed must contain many runs
of repeating characters, or the CR will not be much higher than 1. This is how it works:
AABBBCCDDDD → 2A3B2C4D
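A minimal Python sketch of the idea, emitting count-then-character pairs as in the example above (the function name is mine; a real implementation would also have to handle runs longer than 9 and digits in the input):

    def rle_encode(data: str) -> str:
        # Emit "<count><char>" for each run of identical characters.
        out, i = [], 0
        while i < len(data):
            j = i
            while j < len(data) and data[j] == data[i]:
                j += 1
            out.append(f"{j - i}{data[i]}")
            i = j
        return "".join(out)

    print(rle_encode("AABBBCCDDDD"))  # -> 2A3B2C4D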
Huffman Encoding:
This method starts by looking at a given data set and the probability of occurrence of each character,
and from there assigns a binary codeword to each character, where the most frequently occurring characters
get the shortest codewords.
Ex. We have 5 characters, A, B, C, D, and E, with respective probabilities of occurrence 0.2,
0.25, 0.15, 0.3, and 0.10. We list them from greatest to least:
0.3 D
0.25 B
0.2 A
0.15 C
0.10 E
And we merge the two nodes with the smallest probabilities into a new node, re-insert it into the
sorted list, and repeat the process until we have formed a tree. For our example the merges are:

0.10 E + 0.15 C → 0.25 (E, C)
0.20 A + 0.25 (E, C) → 0.45 (A, E, C)
0.25 B + 0.30 D → 0.55 (D, B)
0.55 (D, B) + 0.45 (A, E, C) → 1.00 (root)
Next, we assign a 1 or a 0 to each branch, putting a 0 on the left child and a 1 on the right. It looks like
this:

               (1.00)
             0/      \1
          (0.55)    (0.45)
          0/   \1   0/   \1
          D     B   A   (0.25)
                        0/  \1
                        C    E
Finally, to assign our codewords, we concatenate the binary digits along the path from the root of the tree
down to each character. This gives us the following values:
D 00
B 01
A 10
C 110
E 111
The average code length is 0.3×2 + 0.25×2 + 0.2×2 + 0.15×3 + 0.1×3 = 2.25 bits per character,
compared to the 3 bits a fixed-length code would need for 5 characters.
Now the data can be encoded this way, but for decompression the code table must be known to the decoder;
for this reason the method is used mostly for offline applications.
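A minimal Python sketch of the construction, using a priority queue to repeatedly merge the two least probable nodes (the function name and structure are illustrative). Tie-breaking between equal probabilities can yield a different but equally good set of codewords, with the same lengths as the tree above:

    import heapq

    def huffman_codes(probs):
        # Min-heap of (probability, tie_breaker, tree); a tree is either a
        # single character (a leaf) or a (left, right) pair of subtrees.
        heap = [(p, i, ch) for i, (ch, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            p1, _, left = heapq.heappop(heap)   # smallest probability
            p2, _, right = heapq.heappop(heap)  # second smallest
            heapq.heappush(heap, (p1 + p2, counter, (left, right)))
            counter += 1
        codes = {}
        def walk(node, prefix):
            if isinstance(node, str):           # leaf: record the codeword
                codes[node] = prefix or "0"
            else:                               # internal node: 0 left, 1 right
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
        walk(heap[0][2], "")
        return codes

    print(huffman_codes({"A": 0.20, "B": 0.25, "C": 0.15, "D": 0.30, "E": 0.10}))
    # {'A': '00', 'B': '01', 'E': '100', 'C': '101', 'D': '11'}

Note that the codeword lengths (2, 2, 3, 3, 2) match the worked example even though the exact bit patterns differ.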
Adaptive Huffman coding:
This is a variation of Huffman coding where the probabilities are calculated dynamically and the tree
of nodes is adjusted on the fly. It is rarely used in practice, however, because constantly updating the
tree is not very efficient.

Dictionary Methods:
LZ77:
This compression method was invented by Lempel and Ziv in 1977, hence the name LZ77. It is a
dictionary-based lossless method that uses a sliding window over previously encoded data to find
repeated strings, which are replaced by short back-references. This gives us a high CR, but at the cost
of a fairly high compression time.
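A toy Python sketch of the sliding-window idea, emitting (offset, length, next_char) triples (the function names and the brute-force match search are mine, for illustration; real implementations use efficient search structures and bit-level output):

    def lz77_encode(data, window=255):
        # Each triple (offset, length, next_char) says: copy `length` characters
        # starting `offset` back in the already-decoded text, then append
        # `next_char`. (0, 0, c) is a plain literal character.
        out, i = [], 0
        while i < len(data):
            best_off, best_len = 0, 0
            for off in range(1, min(i, window) + 1):
                length = 0
                # Matches may overlap into the lookahead, hence data[i+length-off].
                while i + length < len(data) - 1 and data[i + length - off] == data[i + length]:
                    length += 1
                if length > best_len:
                    best_off, best_len = off, length
            out.append((best_off, best_len, data[i + best_len]))
            i += best_len + 1
        return out

    def lz77_decode(triples):
        out = []
        for off, length, nxt in triples:
            for _ in range(length):
                out.append(out[-off])   # copy from the sliding window
            out.append(nxt)
        return "".join(out)

    data = "AABBBCCDDDD"
    assert lz77_decode(lz77_encode(data)) == data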
LZ78:
This is a variation of the above method, where repeated data is referenced through a dictionary that
is constructed as the data streams in.
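A minimal Python sketch of the dictionary construction (a toy version of mine: output pairs are (dictionary index, next character), where index 0 means the empty phrase):

    def lz78_encode(data):
        dictionary, out, phrase = {}, [], ""
        for ch in data:
            if phrase + ch in dictionary:
                phrase += ch                 # keep extending a known phrase
            else:
                # Emit (index of longest known prefix, the new character)
                # and add the extended phrase to the dictionary.
                out.append((dictionary.get(phrase, 0), ch))
                dictionary[phrase + ch] = len(dictionary) + 1
                phrase = ""
        if phrase:                           # leftover phrase at end of input
            out.append((dictionary[phrase], ""))
        return out

    print(lz78_encode("AABABBA"))
    # [(0, 'A'), (1, 'B'), (2, 'B'), (1, '')] decodes as A, AB, ABB, A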
LZW:
This method is used in data transmission. It is based on the LZ78 method, but uses a pre-initialized
dictionary containing all possible single characters. It creates symbols for the most commonly occurring
strings and uses those symbols in the encoding.
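A minimal Python sketch of the encoder (illustrative only; the dictionary here is pre-initialized with the 256 one-byte characters, so the output is a list of integer codes):

    def lzw_encode(data):
        # Dictionary starts pre-initialized with every single character.
        dictionary = {chr(i): i for i in range(256)}
        out, phrase = [], ""
        for ch in data:
            if phrase + ch in dictionary:
                phrase += ch                     # extend the current match
            else:
                out.append(dictionary[phrase])   # emit code for longest match
                dictionary[phrase + ch] = len(dictionary)  # learn the new string
                phrase = ch
        if phrase:
            out.append(dictionary[phrase])
        return out

    print(lzw_encode("AABBBCCDDDD"))
    # [65, 65, 66, 258, 67, 67, 68, 262, 68]; codes 258 ("BB") and 262 ("DD")
    # reuse strings the dictionary learned while encoding.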

References:
Ing. Nevlud Lectures
https://en.wikipedia.org/wiki/Data_compression
https://en.wikipedia.org/wiki/LZ77_and_LZ78
