
DATA COMPRESSION REPORT

For: Sandy Clark (Communications Lecturer) From: Graham Lyttle Date: 24/11/2003

TABLE OF CONTENTS
Objectives .......................................... 3
Procedure ........................................... 3
Findings ............................................ 4
1. Background ....................................... 4
   1.1 Definitions .................................. 4
   1.2 Methods ...................................... 4
   1.3 Forms ........................................ 4
2. Formats .......................................... 5
3. Run Length Encoding .............................. 6
   3.1 Overview ..................................... 6
   3.2 Method Example ............................... 6
   3.3 Further Reduction ............................ 7
   3.4 Problems Arising ............................. 7
4. Shift Level Encoding ............................. 8
   4.1 Overview ..................................... 8
   4.2 Method Example ............................... 9
   4.3 Results ...................................... 10
5. Huffman Encoding ................................. 11
   5.1 Overview ..................................... 11
   5.2 Method Example ............................... 11
   5.3 Anomalies .................................... 14
   5.4 Formalising the Sort Method .................. 14
   5.5 Alternative Sort Method ...................... 15
   5.6 Code-Set Table ............................... 16
   5.7 Other Uses ................................... 16
6. Hybrid Methods ................................... 17
   6.1 Overview ..................................... 17
   6.2 Method Example ............................... 17
Conclusions ......................................... 18

OBJECTIVES
This report investigates data compression. It highlights the general techniques used and their applications. The report is not concerned with the subject of information theory as a whole and therefore does not use terms from that broader subject. The report attempts to de-mystify the world of data compression. Three methods of compression are investigated and their application clarified by example. The three methods, Run Length, Shift Level and Huffman Encoding, were chosen for their simplicity and generality. Much detail has been included in the explanation of Huffman Encoding. This was required to relieve the confusion and ambiguity often found in other material on this subject.

PROCEDURE
All content was acquired from personal knowledge of the subject and the methods used. Appropriate test data was generated and applied to the method examples for the purpose of demonstration.

FINDINGS
1. BACKGROUND

1.1 Definitions

Data compression is the process of reducing a volume of source data into a coded form of reduced volume. A decompression process for retrieval of the source data complements the process.

1.2 Methods

The methods used are varied and often depend on the nature of the source data. Regardless of exact method, data compression is achieved through removal of redundancy within the source data.

1.3 Forms

There are two main forms of data compression: lossy and non-lossy. Lossy compression allows for degradation of the source data while non-lossy compression does not.

2. FORMATS

Compressed data of the lossy type reaches the public in the form of MPEG video, as used on DVD, and MP3, as near-CD-quality audio. Both these formats come in varying degrees of quality, with the degree of compression increasing as quality is allowed to decrease. Non-lossy compression, on the other hand, tends to go on behind the scenes and is used extensively in telecommunications and data archiving.

3. RUN LENGTH ENCODING

3.1 Overview

Run Length Encoding (RLE) is a method used mainly for pictorial compression. An uncompressed picture is stored as a bitmap image. The bitmap image is a series of scan-lines each representing successive horizontal strips of the image. The first scan-line is taken to be the top of the image and all scan lines run from left to right of the image.

3.2 Method Example

Figure 1 below represents a bitmap image of an arrow. It consists of 16 scan lines each containing 16 pixels. The picture is monochrome (contains only black or white pixels) and therefore each pixel requires one bit of data where a bit value of 0 equals black and a bit value of 1 equals white. In total the bitmap image requires 256 bits of data which equals 32 bytes.

Figure 1 A Monochrome Bitmap Image of an Arrow

Examination of individual scan lines will show that pixels tend to remain in a particular state (on or off) for a considerable length along a scan line. This is a pixel's run-length and is utilised by the RLE coding process.

Taking the second scan line from the bottom as an example, the coding process will now be examined. The scan line in question is stored in its non-compressed form as follows:

1111110000111111

RLE begins by indicating the state of the first pixel on the scan line, in this case 1. The next stage is to count the running length of this state. For this bitmap, there can be between 1 and 16 bits in a single run. The first has already been accounted for, so there remain between 0 and 15 to complete the run. In this particular scan line, 5 pixels complete the first run. After this point the pixel toggles state to 0, but this does not need to be indicated since the previous state was already indicated. Only the duration of the next state need be indicated: in this case 4, minus the first pixel in the sequence, which equals 3. The state toggles once again to 1 for the remaining 6 pixels. Once again the value stored is 5, since the first bit is already assumed. The resulting RLE code is as follows:

1,0101,0011,0101

The commas have been added for clarity.
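The scheme just described can be sketched in Python. The function name and the fixed 4-bit length field are illustrative assumptions rather than a prescribed format:

```python
from itertools import groupby

def rle_encode_line(line, field_bits=4):
    """Encode one monochrome scan line: the state of the first pixel,
    then each run length minus one, stored in a fixed-width field."""
    runs = [len(list(group)) for _, group in groupby(line)]
    code = line[0]  # state of the first pixel
    for run in runs:
        # Store run length minus the assumed first pixel of the run.
        code += format(run - 1, f"0{field_bits}b")
    return code

# The example scan line: 16 bits in, 13 bits out (1,0101,0011,0101).
print(rle_encode_line("1111110000111111"))
```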

3.3 Further Reduction

The astute reader will notice that by the last run of bits there are fewer than half as many pixels left on the scan line. This means that the run length requires only 3 bits to be represented. The code can therefore be further reduced to the following:

1,0101,0011,101

Note that by the second run length there are only 10 pixels left, but the value is contained in a 4-bit field. This means that 6 of the field's 16 possible values are redundant. Such redundancy is undesirable in a compressed code and shows that, although the process has provided some compression, it has not been very efficient.
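The shrinking-field refinement can be sketched as follows. The helper is hypothetical; the rule used here, giving each run just enough bits to count the pixels that could still remain, reproduces the reduced code above:

```python
def rle_encode_line_reduced(line):
    """RLE with shrinking fields: each run length (minus one) is stored
    in just enough bits to count the pixels still left on the line."""
    code = line[0]  # state of the first pixel
    pos = 0
    while pos < len(line):
        run = 1
        while pos + run < len(line) and line[pos + run] == line[pos]:
            run += 1
        remaining = len(line) - pos  # pixels left, including this run
        # Enough bits to represent any value from 0 to remaining - 1.
        field_bits = max(1, (remaining - 1).bit_length())
        code += format(run - 1, f"0{field_bits}b")
        pos += run
    return code

# The same scan line now encodes in 12 bits (1,0101,0011,101).
print(rle_encode_line_reduced("1111110000111111"))
```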

3.4 Problems Arising

RLE is not a very efficient compression method and indeed occasionally does not provide any compression. Figure 2 shows just such a bitmap that RLE fails to compress.

Figure 2 A Problem Bitmap

Each scan line of this bitmap is stored in its non-compressed format as:

1010101010101010

The RLE process codes the scan lines as follows:

1,0000,0000,0000,0000,0000,0000,0000,000,000,000,000,00,00,0,0

The RLE coding process could overcome this by first rotating the image 90° prior to encoding. An indication of this pre-process would be required so that the decoder could reverse it.
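The rotation pre-process can be sketched as follows. Both helpers are hypothetical, and the fixed 4-bit run field is a simplification of the shrinking-field scheme above; the point is that the vertical stripes of Figure 2 become single horizontal runs after rotation:

```python
from itertools import groupby

def rle_bits(line, field_bits=4):
    """Bits needed to RLE-code one scan line: one state bit for the
    first pixel plus one fixed-width field per run."""
    runs = sum(1 for _ in groupby(line))
    return 1 + runs * field_bits

def rotate90(bitmap):
    """Rotate a list of scan-line strings 90 degrees: the columns of
    the original become the scan lines of the rotated image."""
    return ["".join(row[c] for row in bitmap) for c in range(len(bitmap[0]))]

bitmap = ["1010101010101010"] * 16  # the problem bitmap of Figure 2
before = sum(rle_bits(line) for line in bitmap)           # 16 runs per line
after = sum(rle_bits(line) for line in rotate90(bitmap))  # 1 run per line
```

Against the 256-bit raw image, direct encoding expands the data, while encoding the rotated image compresses it.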

4. SHIFT LEVEL ENCODING

4.1 Overview

4.1.1 More commonly, pictorial information contains sequential elements of only slightly varying values. This is because pictures tend to be made of areas of similar colour or shade. Shift Level Encoding stores only the difference (delta value) between two pixel values rather than the pixel values (absolute values) themselves. Compression arises from the practice of encoding delta values using fewer bits than are required to store absolute values.

4.1.2 The process is reliable except when a delta value exceeds the minimum or maximum value that can be represented by the reduced number of bits. In such a situation the compression process sets the delta value to the closest value that can be represented in the available number of bits. For example, if the calculated delta value is 54 but the maximum storable delta value is 32, then the delta value is set to 32. The process will attempt to catch up with the lost difference in subsequent samples.

4.1.3 This loss of data integrity categorises Shift Level Encoding as lossy compression, as opposed to non-lossy compression, which maintains data integrity. This form of compression is often used with end-user, portable, pictorial or audio data.

4.2 Method Example

Figure 3 is a simple bitmap containing shapes composed of varying shades of grey. The pixel values of the fifth scan line from the top of this image are depicted in the graph of Figure 4. Pixels are numbered from 0 to 15, where pixel 0 is the leftmost and pixel 15 the rightmost. An absolute pixel value within the original data is stored in 8 bits, which gives 256 possible values ranging from 0 to 255 inclusively.

Figure 3 A Simple Greyscale Bitmap

Figure 4 Comparison of Source and Encoded Pixel Values

Figure 5 tables the process of calculating delta values for the source values given in Figure 4. In this example, delta values are given 6 bits each, which allows representation of all whole numbers between 0 and 63 inclusively. To facilitate negative delta values, the calculated delta values are weighted by the value 31. For example, a calculated delta value of -12 (negative 12) is weighted by 31 (increased by the value 31) to produce the value 19. With 6 bits available, a value weighted by 31 allows representation of all values between -31 (minus 31) and +32 (positive 32) inclusively. An asterisk (*) marks a value limited to the permitted delta range.

Pixel  Source Value  Actual Difference  Delta Value  Stored Value           Resulting Value
0      61            -                  -            61 (stored in 8 bits)  61
1      88            +27                +27          58                     88
2      148           +60                +32*         63                     120*
3      135           +15                +15          46                     135
4      155           +20                +20          51                     155
5      103           -52                -31*         0                      124*
6      80            -44                -31*         0                      93*
7      76            -17                -17          14                     76
8      88            +12                +12          43                     88
9      120           +32                +32          63                     120
10     143           +23                +23          54                     143
11     175           +32                +32          63                     175
12     188           +13                +13          44                     188
13     143           -45                -31*         0                      157*
14     130           -27                -27          4                      130
15     82            -48                -31*         0                      99*

Figure 5 Summary Table of Processed Delta Values

The compression process begins by reading the first pixel value in a scan line and storing it in its full 8-bit format. This value is immediately copied to a resulting value, which is used and modified throughout the process. The second pixel value is then read and the difference between it and the resulting value (currently the same as the first pixel value) is calculated. For the test data given this difference is calculated to be +27 (positive 27), which is within the permitted range. The delta value is weighted and stored. The compression process then adds the delta value to the resulting value to give, in this example, the value 88, which is the correct value of pixel 1. The resulting value is updated with this new value.

The process continues with the next pixel, which has a value of 148. Subtracting the resulting value gives a delta value of +60. Because this is outside the permitted delta range, the process estimates a delta value of +32, which is the closest achievable value. It is always this final delta value, in this case +32, and not the calculated delta value that is added to the resulting value. So in this instance the resulting value is updated to 120 (88 + 32).

The next pixel has a value of 135. Subtraction of the current resulting value gives a difference of +15, which is within the permitted range. At this point the process has regained the difference lost, as can be seen by studying the graph of Figure 4. The process continues until the end of the scan line.
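The walkthrough above can be sketched in Python. The function name is a hypothetical choice; the 6-bit field and weighting of 31 follow Figure 5:

```python
def sle_encode(pixels, delta_bits=6, weight=31):
    """Shift Level Encoding sketch: the first pixel is stored absolutely,
    then each later pixel as a clamped, weighted delta from the running
    resulting value. Returns the stored values and the values a decoder
    would reconstruct."""
    lo, hi = -weight, (1 << delta_bits) - 1 - weight  # -31 .. +32 for 6 bits
    stored = [pixels[0]]  # first value kept in its full 8-bit form
    result = [pixels[0]]  # the running "resulting value"
    for value in pixels[1:]:
        delta = value - result[-1]
        delta = max(lo, min(hi, delta))  # clamp to the representable range
        stored.append(delta + weight)    # weight by 31 so the field is unsigned
        result.append(result[-1] + delta)
    return stored, result

# The fifth scan line of Figure 3, as tabled in Figure 5.
source = [61, 88, 148, 135, 155, 103, 80, 76,
          88, 120, 143, 175, 188, 143, 130, 82]
stored, result = sle_encode(source)
```

Running this reproduces the Stored Value and Resulting Value columns of Figure 5, including the clamped entries.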

4.3 Results

The original data requires 16 bytes of storage space, which equals 128 bits. The encoded form requires only 98 bits made up from one 8-bit initial absolute value followed by fifteen 6-bit delta values. When used to compress an image the process may cause blurring of sharp edges while audio data will tend to lose definition at high audio frequencies of significant amplitude.


5. HUFFMAN ENCODING

5.1 Overview

Huffman Encoding is a non-lossy compression process which, unlike Run Length Encoding, guarantees that the volume of coded data, excluding decoding tables, will not be greater than that of the original data. When examining a sample of text, it will be noted that some letters appear more often than others do. For example, the letter E occurs more frequently than does the letter Z. Huffman Encoding, also known as Variable Length Encoding, encodes frequently occurring data into fewer bits than less frequently occurring data. For example, the letter E may be coded in 3-bits while the letter Z may be coded in 5-bits. A lookup table (code-set table) is included in the compressed data to allow decoding of the resulting stream of bits.

5.2 Method Example

As a demonstration, the following text will be compressed using the Huffman Encoding method: THE BEST THINGS COME IN SMALL PACKAGES Uppercase has been chosen in this example for clarity. The source data representing the text is assumed to be 8-bits per character. There are 38 characters, including separating spaces, in the example text. A total of 304 bits (38 characters times 8 bits) are required to store this textual data. The coding process begins by counting the occurrences of data values and creating a frequency table as in Figure 6.

Figure 6 Frequency Table

The table of Figure 6 is sorted in order of frequency, with lower frequencies on the right of the list. The top line lists the data while below each data item is its frequency of occurrence. The process begins by pairing the lowest frequencies, summing them into a parent frequency. Each branch is labelled 0 and 1 respectively, as shown in Figure 7.

Figure 7 Pairing Frequencies

The new parent frequency, together with its children, is sorted into the table with respect to the parent frequency, as shown in Figure 8.

Figure 8 Resorting Parent Frequency


The process repeats, with the lowest frequencies being paired then sorted into the list. Figures 9 to 12 demonstrate the next two rounds of pairing then sorting frequencies.

Figure 9 Pairing Frequencies

Figure 10 Resorting Parent Frequency

Figure 11 Pairing Frequencies

Figure 12 Resorting Parent Frequency


Eventually, the pairing then sorting process results in a single frequency, equal to the number of data items in the source data as shown in Figure 13.

Figure 13 Full Binary Tree

From this full binary tree, coded data can be extracted to replace data in the source. This is achieved by travelling the route from the parent frequency to the extremities of the tree, where the source data to be coded will be found. En route, the process collects, sequentially, a 0 or a 1 depending on which branch is followed. For example, Space is found by following branches 0, 0 then 1. The digits collected are streamed together to form a code word which will replace every occurrence of the Space character in the source data.

Figure 14 tables all coded data derived from the full binary tree. To demonstrate the reduced bit count achieved by the process, the ASCII (American Standard Code for Information Interchange) code has been shown for each source data item. The table has been sorted in order of ascending frequency.

Data   Freq.  ASCII Code  Huffman Code  Bits Each  Bits Subtotal
P      1      01010000    11101         5          5
O      1      01001111    11100         5          5
K      1      01001011    11111         5          5
B      1      01000010    11110         5          5
N      2      01001110    1001          4          8
M      2      01001101    1000          4          8
L      2      01001100    1011          4          8
I      2      01001001    1010          4          8
H      2      01001000    00001         5          10
G      2      01000111    00000         5          10
C      2      01000011    0101          4          8
T      3      01010100    0100          4          12
A      3      01000001    0001          4          12
S      4      01010011    110           3          12
E      4      01000101    011           3          12
Space  6      00100000    001           3          18
Total Bits                                         146

Figure 14 Huffman Code Summary
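The pairing-and-sorting procedure of Figures 7 to 13 can be sketched with a minimum-heap. The individual code words may come out different from Figure 14, since frequency ties can be paired either way, but the total bit count is the same:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code set: repeatedly pair the two lowest
    frequencies into a parent, then extract codes by walking the tree."""
    freq = Counter(text)
    # A tree is either a symbol or a (left, right) pair; the counter i
    # breaks frequency ties so trees themselves are never compared.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f0, _, t0 = heapq.heappop(heap)
        f1, _, t1 = heapq.heappop(heap)
        heapq.heappush(heap, (f0 + f1, next_id, (t0, t1)))
        next_id += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")  # branch 0
            walk(tree[1], prefix + "1")  # branch 1
        else:
            codes[tree] = prefix or "0"  # single-symbol edge case
    walk(heap[0][2], "")
    return codes

text = "THE BEST THINGS COME IN SMALL PACKAGES"
codes = huffman_codes(text)
total = sum(len(codes[ch]) for ch in text)  # 146 bits, matching Figure 14
```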


5.3 Anomalies

From Figure 14 it can be seen that as frequency increases, the number of bits per Huffman code decreases. Closer study will reveal that an exception has occurred with the characters H and G. These characters have been coded with 5 bits while data of the same frequency, and some of greater frequency, have been coded in only 4 bits. These anomalies are artefacts of the coding process and are perfectly valid. Alternative code sets can be achieved by altering the method of sorting newly formed parent frequencies. In the example shown, newly formed parent frequencies were sorted into the lower end of their same-frequency neighbours (Figure 8).

5.4 Formalising the Sort Rule

The rule of Huffman Encoding is that a newly formed parent frequency can be placed anywhere in the table so long as the table remains sorted. With this in mind, it is perfectly valid to sort the parent frequency of Figure 8 into the end of its same frequency neighbours as shown in Figure 15.

Figure 15 Alternative Sorting Method


5.5 Alternative Sort Method

Following this alternative method yields a different full binary tree and hence a different code set as shown in Figure 16 and Figure 17.

Figure 16 Alternative Full Binary Tree

Data   Freq.  ASCII Code  Huffman Code  Bits Each  Bits Subtotal
P      1      01010000    01101         5          5
O      1      01001111    01100         5          5
K      1      01001011    01011         5          5
B      1      01000010    01010         5          5
N      2      01001110    1101          4          8
M      2      01001101    1100          4          8
L      2      01001100    1011          4          8
I      2      01001001    1010          4          8
H      2      01001000    1001          4          8
G      2      01000111    1000          4          8
C      2      01000011    0111          4          8
T      3      01010100    0100          4          12
A      3      01000001    0001          4          12
S      4      01010011    0000          4          16
E      4      01000101    111           3          12
Space  6      00100000    001           3          18
Total Bits                                         146

Figure 17 Huffman Code Summary for Alternative Sorting Method

Note that both methods yield the same total number of bits in the coded form of the data. As long as the sorting rule is followed, the total number of bits will remain the same for a given block of source data.


5.6 Code-Set Table

When creating a Huffman Encoded file, a data block reflecting the structure of the full binary tree must precede the encoded data. This data block (a code-set table) is required for decompression of the encoded data. Unlike source data items, which occupy single 8-bit bytes and are randomly accessible, encoded data items vary in length and therefore must be treated as a bit-stream. Such an arrangement renders the data sequentially accessible only.
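A decoder working from the code-set table can be sketched as follows. The helper name is hypothetical, and the example uses a small subset of the Figure 14 code set; because Huffman codes are prefix-free, collecting bits until they match a table entry always yields the intended symbol:

```python
def huffman_decode(bits, code_table):
    """Decode a Huffman bit-stream sequentially: accumulate bits until
    they match a code word in the code-set table, then emit that symbol.
    Because code words vary in length, the stream cannot be accessed
    randomly, only front to back."""
    decode = {code: sym for sym, code in code_table.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in decode:  # a complete code word has been collected
            out.append(decode[buffer])
            buffer = ""
    return "".join(out)

# A subset of the Figure 14 code set, enough to code the word BEST.
codes = {"B": "11110", "E": "011", "S": "110", "T": "0100"}
stream = "".join(codes[ch] for ch in "BEST")
word = huffman_decode(stream, codes)
```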

5.7 Other Uses

The textual data example was chosen to demonstrate the frequent tendency of redundant data items appearing in blocks of data. Huffman Encoding is a general-purpose compression method for dealing with just such an attribute of data. It is frequently applied in conjunction with other compression methods (See Hybrid Methods Section 6 page 17).


6. HYBRID METHODS

6.1 Overview

A Hybrid method generally refers to a method derived from a combination of two or more simpler methods, with or without modification of one or more of these simpler methods. Hybrid methods are often used to gain better compression or compression quality than would be possible using a simple method alone, especially for specialised data such as pictorial or audio data.

6.2 Method Example

An example of a hybrid method is an improved Shift Level Encoding (SLE) method. The SLE method is generally applied when the difference between data items is small for most, if not all, of the time. Statistical analysis of audio waveforms, for example, will show that this is mostly not the case. Analysis will show that while data values congregate around certain values, they do not lie only in the low value range but rather within multiple pockets throughout the whole range. This clumping of values over the full range is better served by Huffman Encoding of the delta values than by the fixed-width delta storage of SLE. The result is a non-lossy compression of the source data.
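The hybrid can be sketched as follows. This is a hypothetical illustration rather than the report's exact algorithm: exact, unclamped deltas are taken first (so nothing is lost), then Huffman-coded, and a decoder recovers the samples as a running sum:

```python
import itertools
import heapq
from collections import Counter

def delta_then_huffman(samples):
    """Hybrid sketch: take the exact differences between samples (no
    clamping, so the process is lossless), then Huffman-code the delta
    values, which cluster into a few common pockets."""
    deltas = [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]
    freq = Counter(deltas)
    # Each heap entry carries a partial code table; the index i breaks
    # frequency ties so the dictionaries are never compared.
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, i0, c0 = heapq.heappop(heap)
        f1, i1, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}       # branch 0
        merged.update({s: "1" + c for s, c in c1.items()}) # branch 1
        heapq.heappush(heap, (f0 + f1, min(i0, i1), merged))
    codes = {s: c or "0" for s, c in heap[0][2].items()}  # single-symbol case
    bits = sum(len(codes[d]) for d in deltas)
    return deltas, codes, bits

samples = [61, 88, 148, 135, 155, 103, 80, 76]
deltas, codes, bits = delta_then_huffman(samples)
restored = list(itertools.accumulate(deltas))  # the decoder: a running sum
```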


CONCLUSION
A. The methods described in this report highlight the diversity of methods available for compression. The choice of method depends much on the type of data to be compressed.

B. Run Length Encoding by itself is, in most cases, ineffective without pre-processing of the source data. Shift Level Encoding provides exact control over the size of the encoded data but is prone to data degradation. This makes it suitable mainly for pictorial and audio data. Huffman Encoding is a much more general method which, when applied to any data type, is predictable in the sense that the encoded data will never be greater than the source data.

C. Understanding the nature of the source data is the key to effective compression. Specialised data such as pictorial or audio data may be prepared by one method and finally compressed by the Huffman method. The incorporation of the Huffman method into the output stage of Shift Level Encoding effectively renders Shift Level Encoding by itself redundant.

D. When clarified, data compression is simple enough to be undertaken by the modest programmer or the student in search of a vocational project or an introduction to the field of data coding.

27/11/2003

