
Shortest Way Huffman Text Compression

Andysah Putera Utama Siahaan


Universitas Sumatera Utara
Jl. Dr. Mansur No. 9, Medan, Sumatera Utara, Indonesia
andysahputrautamasiahaan@yahoo.com

Abstract— Huffman is one of the compression algorithms. The Huffman algorithm is the most famous algorithm for compressing text. There are three phases in the Huffman algorithm for compressing text: the first phase is the formation of the Huffman tree, and the second and third phases are encoding and decoding. The principle used by the Huffman algorithm is that characters that appear often are encoded with a short series of bits, while characters that rarely appear are encoded with a longer series of bits. The Huffman compression technique can provide savings of up to 30% of memory usage. The Huffman algorithm has complexity O(n log n) for a set of n characters.

Keywords— Huffman, compression, algorithm, text, computer

I. INTRODUCTION

Text is a collection of characters or strings combined into a single unit. Text contains many characters, which always causes problems in limited storage devices and in the speed of data transmission at a particular time. Although storage can be replaced by another, larger one, this is not a good solution if we can still find another one, and this makes everyone try to find a way that can be used to compress text.

Compression is the process of changing the original data into code form in order to save storage and time requirements for data transmission.[1] By using the Huffman algorithm, the text compression process is done by using the principle of encoding: each character is encoded with a series of several bits to produce a more optimal result. The purpose of this paper is to investigate the effectiveness and the shortest way of the Huffman algorithm in the compression of text, and to explain the ways of compressing text using the Huffman algorithm in programming.

II. HUFFMAN THEORY

The Huffman algorithm was created by an MIT student named David Huffman in 1952. It is one of the oldest and most famous methods in text compression.[2] The Huffman code uses principles similar to Morse code: each character is encoded with only a few bits, where characters that appear often are coded with a short series of bits and characters that rarely appear are encoded with a longer series of bits.

Based on the type of code map used to change the initial message (the contents of the input data) into a set of codewords, the Huffman algorithm uses a static method. A static method is a method that always uses the same code map, although we can still change the sequence of character appearance. This method requires two phases: the first phase calculates the probability of occurrence of each symbol and determines the map code, and the second phase converts the message into the collection of codes that will be transmitted. Meanwhile, based on the symbol coding technique, Huffman uses the symbolwise method. Symbolwise is a method that calculates the probability of occurrence of each symbol at a time, where a symbol that occurs more often is given a shorter code than the symbols that rarely appear.

A. Greedy Algorithm.

Greedy algorithms are algorithms which follow the problem-solving meta-heuristic of making the locally optimum choice at each stage with the hope of finding the global optimum. For instance, applying the greedy strategy to the traveling salesman problem yields the following algorithm: "At each stage visit the nearest unvisited city to the current city".[3]

Greedy algorithms rarely find the globally optimal solution consistently, since they usually don't operate exhaustively on all the data. Nevertheless, they are useful because they are quick to think up and often give good approximations to the optimum. If a greedy algorithm can be proven to yield the global optimum for a given problem class, it typically becomes the method of choice. Examples of such greedy algorithms are Kruskal's algorithm and Prim's algorithm. The theory of matroids provides a whole class of such algorithms.

B. The Relationship to the Huffman Algorithm.

At first, David Huffman encoded characters by simply using an ordinary binary tree, but after that he found that using a greedy algorithm can establish the optimal prefix code. The greedy step in the Huffman algorithm is the election of the two trees with the lowest frequency in the Huffman tree. The greedy algorithm is used to minimize the total cost required. The cost of merging two trees is the frequency at the new root, which equals the sum of the frequencies of the two trees being combined; therefore, the total cost of building the Huffman tree is the sum of the costs of all mergers. The Huffman algorithm is thus one example of a compression algorithm that uses a greedy algorithm. For example, given a text of 120 characters where each character has a cost, our goal is to calculate the total cost incurred to encode the text.
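
To make the cost computation concrete, the following is a minimal sketch (not taken from the original paper) that sums the merge costs greedily over a sorted frequency array instead of the linked list used in Section III. The frequencies are the ones the example sentence of Section III produces; the program prints 96, which matches the total code length in Tab. 1.

program HuffmanCost;

const
  N = 8;

var
  Freq : array[1..N] of integer;
  Count, i, j, Merged, Total : integer;

begin
  { sorted frequencies of "LIKA-LIKU LAKI-LAKI TAK LAKU-LAKU" }
  Freq[1] := 1; Freq[2] := 3; Freq[3] := 3; Freq[4] := 3;
  Freq[5] := 4; Freq[6] := 6; Freq[7] := 6; Freq[8] := 7;
  Count := N;
  Total := 0;
  while Count > 1 do
  begin
    { the array stays sorted, so the two cheapest trees are in front }
    Merged := Freq[1] + Freq[2];
    Total := Total + Merged;     { cost of this merger }
    { drop the two merged nodes and insert the parent in order }
    i := 3;
    while (i <= Count) and (Freq[i] < Merged) do
      inc(i);
    for j := 3 to i - 1 do
      Freq[j - 2] := Freq[j];
    Freq[i - 2] := Merged;
    for j := i to Count do
      Freq[j - 1] := Freq[j];
    dec(Count);
  end;
  writeln('Total cost: ', Total); { prints 96 }
end.
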
III. IMPLEMENTATION

A. Phase One.

Imagine that we have a sentence “LIKA-LIKU LAKI-LAKI TAK LAKU-LAKU”. In this phase, the original text is converted into a single character set.

function Character_Set(text : string) : string;
var
  i, j : integer;
  temp : string;
  result : string;
begin
  temp := text;
  result := '';
  { mark every duplicate character with '#' }
  for i := 1 to length(temp) do
    for j := i + 1 to length(temp) do
      if temp[i] = temp[j] then temp[j] := '#';
  { keep only the unmarked characters }
  j := 1;
  for i := 1 to length(temp) do
    if temp[i] <> '#' then
    begin
      result := concat(result, temp[i]);
      inc(j);
    end;
  Character_Set := result;
end;

The first loop, which is done by “for”, replaces every duplicate character with ‘#’. The second loop removes every ‘#’ from the text, so only the first occurrence of each character survives. Now we can see the illustration below.

Original Text : LIKA-LIKU LAKI-LAKI TAK LAKU-LAKU
Replaced Text : LIKA-###U ##########T############
Character Set : LIKA-U T

procedure Character_Freq(text : string);
var
  i, j : integer;
  temp : string;
  freq : byte;
begin
  temp := Character_Set(text);
  for i := 1 to length(temp) do
  begin
    freq := 0;
    for j := 1 to length(text) do
      if temp[i] = text[j] then inc(freq);
    AddNode(Head, Tail, temp[i], freq);
  end;
end;

The above procedure calculates the frequency of each character's occurrence. First, we have to run the character set function to obtain the series of single characters used; then we compare each of those characters against the original text, from the first character to the last, to obtain its frequency. Since we use pointers to represent the values, the result is sent to a node after the counting is finished.

Fig. 1 – Unsorted Character & Freq Table

The table must be sorted in ascending order, and the primary key is the frequency.

procedure Tree_Sorting;
var
  i, j : integer;
  tASCII : char;
  tFreq : byte;
begin
  { cs, Head and Current are globals; cs holds the character set }
  for i := length(cs) - 1 downto 1 do
  begin
    Current := Head;
    for j := 1 to i do
    begin
      if Current^.Freq > Current^.Next^.Freq then
      begin
        { swap the payload of the two neighbouring nodes }
        tASCII := Current^.ASCII;
        tFreq := Current^.Freq;
        Current^.ASCII := Current^.Next^.ASCII;
        Current^.Freq := Current^.Next^.Freq;
        Current^.Next^.ASCII := tASCII;
        Current^.Next^.Freq := tFreq;
      end;
      Current := Current^.Next;
    end;
  end;
end;

The procedure above sorts the unsorted list by using the bubble sort algorithm. The first node is compared to the next node, and the larger value is swapped and moved to the right. Finally, we have the smallest value on the left.

Fig. 2 – Sorted Character & Freq Table
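
The AddNode helper called by Character_Freq is not listed in the paper. A minimal sketch, assuming the Node and NodeP declarations given in Phase Two below and the global Head and Tail pointers, could look like this:

procedure AddNode(var Head, Tail : NodeP; ch : char; fr : byte);
var
  NewNode : NodeP;
begin
  new(NewNode);
  NewNode^.ASCII := ch;
  NewNode^.Freq := fr;
  NewNode^.Parent := NIL;
  NewNode^.Left := NIL;
  NewNode^.Right := NIL;
  NewNode^.Next := NIL;
  NewNode^.Prev := Tail;
  if Head = NIL then
    Head := NewNode          { the list was empty }
  else
    Tail^.Next := NewNode;   { append after the current tail }
  Tail := NewNode;
end;
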
B. Phase Two.

After the table is fully sorted, it is time to face the most difficult step: making the Huffman tree. The greedy algorithm takes its part in this section. We have to combine the two lowest-frequency nodes, make a new node, and make it the parent of those earlier nodes. Let's see the illustration below (␣ denotes the space character):

T  -  U  ␣  I  L  A  K
1  3  3  3  4  6  6  7

Draw the first two nodes and release them from the table, then make a new node (*) which will be their parent:

    *4
   /  \
  T1   -3

The parent node is inserted back into the table in ascending order using an insertion algorithm; the head of the table is now the former third node. We do the same again until the table consists of one node. The successive states of the working table are:

T1  -3  U3  ␣3  I4  L6  A6  K7
U3  ␣3  *4  I4  L6  A6  K7          (T1 + -3 = *4)
*4  I4  *6  L6  A6  K7              (U3 + ␣3 = *6)
*6  L6  A6  K7  *8                  (*4 + I4 = *8)
A6  K7  *8  *12                     (*6 + L6 = *12)
*8  *12  *13                        (A6 + K7 = *13)
*13  *20                            (*8 + *12 = *20)
*33                                 (*13 + *20 = *33)

Meanwhile, the double linked list keeps every node ever created, so it grows with each merger:

T  -  U  ␣  *  I  L  A  K
1  3  3  3  4  4  6  6  7

T  -  U  ␣  *  I  *  L  A  K
1  3  3  3  4  4  6  6  6  7

and so on, appending *8, *12, *13, *20 and *33, until:

T  -  U  ␣  *  I  *  L  A  K  *  *   *   *   *
1  3  3  3  4  4  6  6  6  7  8  12  13  20  33

Finally, the linked list consists of 15 nodes.

In this step, we have two models of trees:
- Double Linked List.
- Binary Tree Linked List.

Fig. 3 – Double & Binary Tree Linked List

Fig. 4 – Huffman Tree
The aim of using two models of linked list is to avoid searching procedures. We could use breadth-first or depth-first search to form the bit code, but that takes time and an advanced programming technique. Here, instead, we can use the list to replace the backtracking procedure: we only have to know who the parent is.

The Huffman_Tree procedure is used to form the Huffman tree by processing the earlier linear tree. We have to mark the nodes “who is on the left” and “who is on the right” by adding a sign 0 or 1 to the node field.

type
  NodeP = ^Node;
  Node = record
    ASCII : char;
    Bit : byte;
    Code : string;
    Dec : byte;
    Freq : byte;
    Prev,
    Next : NodeP;
    Parent,
    Left,
    Right : NodeP;
  end;

ASCII : where the character is memorized.
Bit : node sign. (left or right)
Code : Huffman code.
Dec : decimal code.
Freq : occurrence of the character.
Next, Prev, Parent, Left, Right : represent the connected nodes.

procedure Huffman_Tree;
var
  tFreq : byte;
  tASCII : char;
  InsertN : NodeP;
  NewNode : NodeP;
begin
  tASCII := '*';
  Current := Head;
  while Current^.Next <> NIL do
  begin
    InsertN := Head;
    tFreq := Current^.Freq + Current^.Next^.Freq;
    Current^.Bit := 0;        { left child }
    Current^.Next^.Bit := 1;  { right child }
    while InsertN <> NIL do
    begin
      if tFreq <= InsertN^.Freq then
      begin
        { insert the new parent before the first node of equal or larger frequency }
        new(NewNode);
        NewNode^.Freq := tFreq;
        NewNode^.ASCII := tASCII;
        NewNode^.Next := InsertN;
        NewNode^.Prev := InsertN^.Prev;
        InsertN^.Prev^.Next := NewNode;
        InsertN^.Prev := NewNode;
        NewNode^.Left := Current;
        NewNode^.Right := Current^.Next;
        Current^.Parent := NewNode;
        Current^.Next^.Parent := NewNode;
        break;
      end
      else if tFreq > Tail^.Freq then
      begin
        { the new parent is larger than every node: append it after the tail }
        new(NewNode);
        NewNode^.Freq := tFreq;
        NewNode^.ASCII := tASCII;
        Tail^.Next := NewNode;
        NewNode^.Prev := Tail;
        Tail := NewNode;
        Tail^.Next := NIL;
        NewNode^.Left := Current;
        NewNode^.Right := Current^.Next;
        Current^.Parent := NewNode;
        Current^.Next^.Parent := NewNode;
        break;
      end;
      InsertN := InsertN^.Next;
    end;
    { the two lowest nodes have been consumed: advance by two }
    Current := Current^.Next;
    Current := Current^.Next;
  end;
end;

The first two nodes must be combined. The first node will be marked as ‘0’ since it is on the left, and the second node will be marked as ‘1’ since it is on the right. Both nodes have the same parent, and the parent has two children: those nodes. After the parent node is created, it must be inserted into the linked list by comparing its value to the values already in the list. The node must be inserted to the left of the first larger or equal value; but if the parent node is larger than every node, it has to be inserted after the last node.

C. Phase Three.

In this step, the tree is already structured. It is time to retrieve the node signs by doing a loop until the parent of the current node is empty (NIL).

procedure Write_Huffman;
var
  result : string;
  bit : string[1];
begin
  { Cursor and Biner are globals }
  Current := Head;
  repeat
    result := '';
    Cursor := Current;
    Current^.Dec := 0;
    Biner := 1;
    if Cursor^.ASCII <> '*' then  { skip the parent nodes }
    begin
      repeat
        if (Cursor^.Bit = 0) or (Cursor^.Bit = 1) then
        begin
          { accumulate the decimal value, least significant bit first }
          Current^.Dec := Current^.Dec + (Cursor^.Bit * Biner);
          Biner := Biner * 2;
          str(Cursor^.Bit, bit);
          insert(bit, result, 1); { prepend: codes read from root to leaf }
        end;
        Cursor := Cursor^.Parent;
      until Cursor^.Parent = NIL;
    end;
    Current^.Code := result;
    Current := Current^.Next;
  until Current^.Next = NIL;
end;
Fig. 5 – Running Program (Part. 1).

We just retrieve the nodes which are not parents, and the node signs from each node are inserted into a single string.
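
Putting the phases together: the paper does not show the main program of Figs. 5–7, but a hypothetical driver that calls the routines in the documented order would be:

{ hypothetical main block; Head, Tail, cs, Current, Cursor and Biner
  are the globals assumed by the procedures above }
begin
  Head := NIL;
  Tail := NIL;
  cs := Character_Set('LIKA-LIKU LAKI-LAKI TAK LAKU-LAKU');
  Character_Freq('LIKA-LIKU LAKI-LAKI TAK LAKU-LAKU'); { phase one }
  Tree_Sorting;   { sort the list ascending by frequency }
  Huffman_Tree;   { phase two: build the tree greedily }
  Write_Huffman;  { phase three: retrieve the bit codes }
end.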

D. Phase Four.

Now each node contains a code. It is time to form the Huffman table.

Char     Freq.   Code    Bit Len.   Code Len.
T        1       1000    4          4
-        3       1001    4          12
U        3       1100    4          12
(space)  3       1101    4          12
I        4       101     3          12
L        6       111     3          18
A        6       00      2          12
K        7       01      2          14
Total                               96

Tab. 1 – Huffman Table.

Each character has been represented by a binary code of a few digits. It is time to combine all the codes by reading the original text from its first character and replacing every character with its code.

Original Text : LIKA-LIKU LAKI-LAKI TAK LAKU-LAKU
Original Code :

111 101 01 00 1001 111 101 01 1100 1101 111 00 01
101 1001 111 00 01 101 1101 1000 00 01 1101 111
00 01 1100 1001 111 00 01 1100

Bit Code :

11110101 00100111 11010111 00110111 10001101
10011110 00110111 01100000 01110111 10001110
01001111 00011100

Decimal Code :

245, 39, 215, 55, 141, 158,
55, 96, 119, 142, 79, 28

Fig. 6 – Running Program (Part. 2).

Fig. 7 – Running Program (Part. 3).

The original text before compression takes 33 characters. Soon after being compressed, the string takes only 12 characters, so we save 21 characters. The illustration:

Original Text Length : 33
Coded Text Length : 12
Saving Rate : (33 – 12) / 33 * 100 % = 63.63636 %
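
The paper does not list the routine that performs this final substitution. A minimal sketch, assuming the Code fields filled in by Write_Huffman and the global node list built earlier, might be:

{ Hypothetical encoder: concatenates the Huffman code of every
  character, then prints each 8-bit group with its decimal value. }
function Encode_Text(text : string) : string;
var
  i : integer;
  bits : string;
  p : NodeP;
begin
  bits := '';
  for i := 1 to length(text) do
  begin
    p := Head;  { linear search for the leaf of this character;   }
    while (p <> NIL) and (p^.ASCII <> text[i]) do
      p := p^.Next;       { parent nodes hold '*', never matched  }
    if p <> NIL then
      bits := bits + p^.Code;
  end;
  Encode_Text := bits;
end;

procedure Show_Bytes(const bits : string);
var
  i, j, val : integer;
begin
  i := 1;
  while i + 7 <= length(bits) do
  begin
    val := 0;
    for j := 0 to 7 do  { most significant bit first }
      val := val * 2 + ord(bits[i + j]) - ord('0');
    writeln(copy(bits, i, 8), ' = ', val);
    inc(i, 8);
  end;
end;

Calling Encode_Text on the example sentence would reproduce the 96-bit string above, and Show_Bytes would print the twelve decimal values.
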
IV. CONCLUSION

This article should serve as a basic primer on how to implement the compression algorithm. The Huffman algorithm builds on the greedy algorithm, which always takes the cheapest choice from the nearest nodes. There are four phases in the Huffman algorithm for compressing a text: the first phase is to manage the characters and get their frequencies; the second is the formation of the Huffman tree; the third phase is to form the codes from the node signs; and the last phase is the encoding process. In this paper, we only cover the encoding method; the decoding is left for a future project.
REFERENCES

[1] Howe, D., “Free On-line Dictionary of Computing”, http://www.foldoc.org/, 1993, accessed: Saturday, January 29th, 2011.
[2] Rinaldi Munir, Diktat Kuliah IF2251 Strategi Algoritmik, ITB, 2005.
[3] Huffman Coding, http://www.en.wikipedia.org/wiki/Huffman_coding, accessed: Sunday, 18:00, January 30th, 2011.
[4] Practical Huffman Coding, http://www.compressconsult.com/huffman/, accessed: Sunday, 18:30, January 30th, 2011.
