
TEXT COMPRESSION ALGORITHM: A NEW APPROACH

S. E. Adewumi
Department of Mathematics, University of Jos, Nigeria
adewumis@yahoo.com, adewumisa@unijos.edu.ng

E. J. D. Garba
Department of Mathematics, University of Jos, Nigeria

Abstract

This paper presents a new version of our earlier algorithm [1]. It overcomes the difficulty of factoring large numbers encountered in the earlier work. In the new algorithm, the positions at which each letter (character) occurs are written in binary, all with the same length as the binary form of the position of that letter's last occurrence. This is achieved by adding zeros in front of any binary number that is shorter than that length. The binary forms of the positions of each letter are then written as one continuous chain, which is converted to a decimal number and stored. This process constitutes compression. Decompression converts each decimal number back to its binary form, uses the stored length for that letter to partition the binary string into the original concatenated numbers, and converts each of these back to the decimal number giving a position of occurrence of the letter. This lossless compression algorithm has the capacity to reduce a document of several pages to a maximum of two pages.

Keywords

Data Compression Algorithm, Decompression Algorithm, Lossless, Lossy, Redundant, Compactness.

Introduction

Data compression converts data to a compact form by removing redundant elements from the input stream. The main purpose of compressing data is to improve the efficiency with which data is stored or transmitted [2]. It allows large files or texts to be temporarily squeezed so that they take less storage space and less network transmission time. For them to have meaning again, they must be decompressed. A physical analogy is concentrated juice: concentrating the juice is compression, and adding water back to it is decompression.

Data compression can be classified into two types: lossless and lossy. With a lossless technique the restored data is identical to the original data, while a lossy method only generates an approximation of the original data. Lossless compression is necessary for the many types of data that must revert exactly to the original after decompression. The new algorithm described below is a lossless data compression scheme in which the compressed data reverts to the original document [3].

The Compression Algorithm

The new compression algorithm is described below:

1. Take each letter l_i of the alphabet (characters) that makes up the text document.
2. Find the positions where each of these letters occurs.
3. Convert each position to a binary number.
4. Use the length k of the binary string representing the last position of occurrence as the standard length for each binary string of that letter. If the binary form of any other position is shorter than k digits, pad it with zeros on the left to make it k digits long.
5. Concatenate the binary strings for each letter (character); convert the concatenated string to a decimal number to complete the compression.
6. Store the length k of the binary string of the last occurrence of each letter (character) for use during decompression.

| l_i | Positions of occurrence | Binary values for each position | Concatenated binary string of each l_i | Decimal equivalent of the concatenated binary string | k = length of the binary string of the last occurrence of the letter |
| --- | --- | --- | --- | --- | --- |

Table 1: The new compression model.

The Decompression Algorithm

1. Take each letter l_i of the alphabet (characters).
2. Convert each decimal number to its binary equivalent.
3. Use the stored length k of the last occurrence for each l_i to partition the binary string into its positional binary values, first restoring any leading zeros so that the length of the string is a multiple of k.
4. Convert each positional binary value to its decimal equivalent.
5. Write each letter in its respective positions.

In summary, if the positions of occurrence of a letter are n_1, n_2, n_3, ..., n_{j-1}, n_j and n_j has k binary digits, then the binary string representing n_1, n_2, n_3, ..., n_j has a length of k × j. This string is in turn converted to its decimal equivalent.

Application of this scheme

The example below demonstrates the use of this scheme to compress and decompress a text document. Suppose we wish to compress the sentence:

When the bush is burning grasshoppers dont wait to bid farewell

The six stages of the compression algorithm are illustrated in the tables below, with the actual compressed output given in Table 4.
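The six compression steps listed earlier can be sketched in Python. This is a minimal illustration and not the authors' implementation: the function name `compress` is hypothetical, and treating every character (including spaces and capital letters) as its own symbol is an assumption made here, since the paper's tables list lowercase letters only.

```python
from collections import defaultdict


def compress(text):
    """Return {character: (decimal_value, k)} for the given text.

    Assumption: every character, including spaces, is treated as a symbol.
    """
    # Steps 1-2: record the 1-based positions of each character.
    positions = defaultdict(list)
    for pos, ch in enumerate(text, start=1):
        positions[ch].append(pos)

    table = {}
    for ch, plist in positions.items():
        # Step 4: k is the bit length of the last (largest) position.
        k = plist[-1].bit_length()
        # Steps 3-5: binary form of each position, left-padded to k digits,
        # then concatenated into one chain.
        chain = "".join(format(p, "b").zfill(k) for p in plist)
        # Steps 5-6: store the decimal value of the chain together with k.
        table[ch] = (int(chain, 2), k)
    return table


sentence = "When the bush is burning grasshoppers dont wait to bid farewell"
table = compress(sentence)
print(table["a"])  # (117625, 6)
```

For the letter a, which occurs at positions 28, 45 and 57, the sketch yields the decimal value 117625 with k = 6.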
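| Position | Letter | Position | Letter | Position | Letter | Position | Letter |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | W | 17 | (space) | 33 | p | 49 | t |
| 2 | h | 18 | b | 34 | p | 50 | o |
| 3 | e | 19 | u | 35 | e | 51 | (space) |
| 4 | n | 20 | r | 36 | r | 52 | b |
| 5 | (space) | 21 | n | 37 | s | 53 | i |
| 6 | t | 22 | i | 38 | (space) | 54 | d |
| 7 | h | 23 | n | 39 | d | 55 | (space) |
| 8 | e | 24 | g | 40 | o | 56 | f |
| 9 | (space) | 25 | (space) | 41 | n | 57 | a |
| 10 | b | 26 | g | 42 | t | 58 | r |
| 11 | u | 27 | r | 43 | (space) | 59 | e |
| 12 | s | 28 | a | 44 | w | 60 | w |
| 13 | h | 29 | s | 45 | a | 61 | e |
| 14 | (space) | 30 | s | 46 | i | 62 | l |
| 15 | i | 31 | h | 47 | t | 63 | l |
| 16 | s | 32 | o | 48 | (space) | | |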

Table 2: Position of each letter

| l_i | Positions of occurrence | Binary values for positions of occurrence | Concatenated binary string for each l_i | Decimal equivalent of concatenated binary string | k = length of the binary string of the last occurrence |
| --- | --- | --- | --- | --- | --- |
| a | 28, 45, 57 | 11100, 101101, 111001 | 011100101101111001 | 117625 | 6 |
| b | 10, 18, 52 | 1010, 10010, 110100 | 001010010010110100 | 42164 | 6 |
| d | 39, 54 | 100111, 110110 | 100111110110 | 2550 | 6 |
| e | 3, 8, 35, 59, 61 | 11, 1000, 100011, 111011, 111101 | 000011001000100011111011111101 | 52575997 | 6 |
| f | 56 | 111000 | 111000 | 56 | 6 |
| g | 24, 26 | 11000, 11010 | 1100011010 | 794 | 5 |
| h | 2, 7, 13, 31 | 10, 111, 1101, 11111 | 00010001110110111111 | 73151 | 5 |
| i | 15, 22, 46, 53 | 1111, 10110, 101110, 110101 | 001111010110101110110101 | 4025269 | 6 |
| l | 62, 63 | 111110, 111111 | 111110111111 | 4031 | 6 |
| n | 4, 21, 23, 41 | 100, 10101, 10111, 101001 | 000100010101010111101001 | 1136105 | 6 |
| o | 32, 40, 50 | 100000, 101000, 110010 | 100000101000110010 | 133682 | 6 |
| p | 33, 34 | 100001, 100010 | 100001100010 | 2146 | 6 |
| r | 20, 27, 36, 58 | 10100, 11011, 100100, 111010 | 010100011011100100111010 | 5355834 | 6 |
| s | 12, 16, 29, 30, 37 | 1100, 10000, 11101, 11110, 100101 | 001100010000011101011110100101 | 205641637 | 6 |
| t | 6, 42, 47, 49 | 110, 101010, 101111, 110001 | 000110101010101111110001 | 1747953 | 6 |
| u | 11, 19 | 1011, 10011 | 0101110011 | 371 | 5 |

Table 3: Analysis of each alphabet
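As a check on the arithmetic, the row for a in Table 3 can be worked by hand. Because every position occupies exactly k binary digits in the chain, the stored decimal value is a base-2^k expansion of the positions n_1, ..., n_j. The symbol N for the stored decimal value is notation introduced here for illustration, not the paper's.

```latex
N = \sum_{i=1}^{j} n_i \, 2^{k(j-i)}, \qquad
N_a = 28 \cdot 2^{12} + 45 \cdot 2^{6} + 57
    = 114688 + 2880 + 57
    = 117625 \quad (k = 6,\ j = 3).
```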

| l_i | Decimal equivalent of concatenated binary string | k = length of the binary string of the last occurrence |
| --- | --- | --- |
| a | 117625 | 6 |
| b | 42164 | 6 |
| d | 2550 | 6 |
| e | 52575997 | 6 |
| f | 56 | 6 |
| g | 794 | 5 |
| h | 73151 | 5 |
| i | 4025269 | 6 |
| l | 4031 | 6 |
| n | 1136105 | 6 |
| o | 133682 | 6 |
| p | 2146 | 6 |
| r | 5355834 | 6 |
| s | 205641637 | 6 |
| t | 1747953 | 6 |
| u | 371 | 5 |

Table 4: The actual compressed table

To decompress, we take Table 4, find the binary equivalent of the decimal value attached to each letter, and break each binary string into pieces of length k, one piece per position. Converting the pieces back to decimal gives the positions of each letter, from which the original text is recovered. This is shown in Table 5.

| l_i | Decimal equivalent of binary string | Concatenated binary string for each l_i | Binary values for positions of occurrence | k = length of the binary string of the last occurrence | Positions of occurrence |
| --- | --- | --- | --- | --- | --- |
| a | 117625 | 011100101101111001 | 11100, 101101, 111001 | 6 | 28, 45, 57 |
| b | 42164 | 001010010010110100 | 1010, 10010, 110100 | 6 | 10, 18, 52 |
| d | 2550 | 100111110110 | 100111, 110110 | 6 | 39, 54 |
| e | 52575997 | 000011001000100011111011111101 | 11, 1000, 100011, 111011, 111101 | 6 | 3, 8, 35, 59, 61 |
| f | 56 | 111000 | 111000 | 6 | 56 |
| g | 794 | 1100011010 | 11000, 11010 | 5 | 24, 26 |
| h | 73151 | 00010001110110111111 | 10, 111, 1101, 11111 | 5 | 2, 7, 13, 31 |
| i | 4025269 | 001111010110101110110101 | 1111, 10110, 101110, 110101 | 6 | 15, 22, 46, 53 |
| l | 4031 | 111110111111 | 111110, 111111 | 6 | 62, 63 |
| n | 1136105 | 000100010101010111101001 | 100, 10101, 10111, 101001 | 6 | 4, 21, 23, 41 |
| o | 133682 | 100000101000110010 | 100000, 101000, 110010 | 6 | 32, 40, 50 |
| p | 2146 | 100001100010 | 100001, 100010 | 6 | 33, 34 |
| r | 5355834 | 010100011011100100111010 | 10100, 11011, 100100, 111010 | 6 | 20, 27, 36, 58 |
| s | 205641637 | 001100010000011101011110100101 | 1100, 10000, 11101, 11110, 100101 | 6 | 12, 16, 29, 30, 37 |
| t | 1747953 | 000110101010101111110001 | 110, 101010, 101111, 110001 | 6 | 6, 42, 47, 49 |
| u | 371 | 0101110011 | 1011, 10011 | 5 | 11, 19 |

Table 5: Decompression table.
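The decompression steps can be sketched in the same way. The names `decode_positions` and `decompress` are hypothetical, and filling positions that no stored letter claims with a filler character is an assumption made here, since the compressed table holds no entries for spaces.

```python
def decode_positions(value, k):
    """Recover the list of positions packed into one stored decimal value."""
    bits = format(value, "b")
    # Leading zeros are lost in the decimal form, so pad the string back
    # to the next multiple of k before cutting it into k-digit pieces.
    width = -(-len(bits) // k) * k  # ceiling of len/k, times k
    bits = bits.zfill(width)
    return [int(bits[i:i + k], 2) for i in range(0, width, k)]


def decompress(table, filler=" "):
    """Rebuild the text from {character: (decimal_value, k)}.

    Assumption: positions not claimed by any character receive `filler`.
    """
    placed = {}
    for ch, (value, k) in table.items():
        for pos in decode_positions(value, k):
            placed[pos] = ch
    length = max(placed)
    return "".join(placed.get(p, filler) for p in range(1, length + 1))


# Three rows taken from the compressed table (letters a, g and u).
rows = {"a": (117625, 6), "g": (794, 5), "u": (371, 5)}
print(decode_positions(*rows["a"]))  # [28, 45, 57]
```

Padding to a multiple of k is enough to restore the chain because positions start at 1: the first k-digit piece is never all zeros, so at most k - 1 leading zeros are lost when the chain is stored as a decimal number.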

SUMMARY/CONCLUSION

We have demonstrated compression and decompression using this scheme. Its advantage over our earlier scheme is that factorization is no longer required. We believe that this scheme will provide better compression than any known text compression scheme and may, in a way, break new ground for multimedia compression techniques.

REFERENCE

[1] Adewumi, S. E. and Garba, E. J. D. (2006) New Text Compression Algorithm. Journal of Information and Communication Technology (ICT), EBSU Abakaliki. Vol. 2, No. 1, May 2006. ISSN 0794-6910.

[2] Beekman, G. (1999) Computer Confluence. Addison-Wesley Longman Inc., California.

[3] Lelewer, D. and Hirschberg, D. (2001) Data Compression. http://www1.ics.uci.edu/~dan/pubs/DataCompression.html
