
Efficient Architectures for High Speed Lossless Data Compression and FIR Filter Implementation

Rizwana Mehboob 03-UET/PhD-CASE-CP-04

Supervisor Dr. Shoab. A. Khan


Department of Electrical and Computer Engineering, Center for Advanced Studies in Engineering, University of Engineering and Technology, Taxila, Pakistan. January 2010

Efficient Architectures for High Speed Lossless Data Compression and FIR Filter Implementation
Rizwana Mehboob 03-UET/PhD-CASE-CP-04

A thesis submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Thesis Supervisor Dr. Shoab A. Khan

Department of Electrical and Computer Engineering, Center for Advanced Studies in Engineering, University of Engineering and Technology, Taxila, Pakistan. January 2010

Efficient Architectures for High Speed Lossless Data Compression and FIR Filter Implementation

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering
by:

Rizwana Mehboob
03-UET/PhD-CASE-CP-04
Approved by: External Examiners

Dr. Mahmood Ashraf Khan


Ex-Principal, Institute of Communication Technologies (PTCL), Islamabad

Dr. Farooque Azam


College of E&ME, NUST Rawalpindi

Internal Examiner / Thesis Supervisor

Dr. Shoab A. Khan


CASE, Islamabad

Department of Electrical and Computer Engineering, Center for Advanced Studies in Engineering, University of Engineering and Technology, Taxila, Pakistan.

DECLARATION
The substance of this thesis is the original work of the author, and due reference and acknowledgement has been made, where necessary, to the work of others. No part of this thesis has previously been accepted for any degree, nor is it currently being submitted in candidature for any degree.

Rizwana Mehboob 03-UET/PhD-CASE-CP-04

Dedicated to My Deceased Parents


Acknowledgements
I thank the all-praiseworthy, the most powerful and Gracious Creator of the universe for giving me the strength needed to complete this dissertation. Dr. Shoab A. Khan's guidance as a supervisor has been instrumental in finalizing my research work; I thank him for all his technical support and innovative ideas during the period of my research. His encouraging and positive attitude enables a scholar to move forward during difficult times. I thank Dr. Syed Ismail Shah, Dr. Saeed-ur-Rehman and Dr. Habibullah Jamal for serving as my research committee members; their suggestions and support were important to me. I would also like to thank my friends and relatives who wished me success and always encouraged me. I acknowledge the love, care, help and support of my elder sisters Sajida Naeem and Dr. Attiya, who helped me manage my affairs during my PhD. I thank my sons Ibrahim and Abdullah, who wished me success whenever they were happy, and who bore with their always preoccupied and overburdened mom. In the end, I thank my better half Zaheer for being the best of friends, one who always extends a helping hand when needed. Zaheer's support, despite his hectic schedule, kept me pushing forward in my studies and research.


Summary
This thesis presents novel architectural optimization techniques for designing efficient applications. Although many standard transformations have been proposed in the literature, for many applications these techniques are either not directly applicable or, once applied, do not generate optimal designs. The thesis argues that in many applications the digital design techniques need to be application specific. These techniques range from reworking the algorithm to finding optimal methods of implementing it from the perspective of effective hardware design. In many instances, no modification of the algorithm's specification can be performed; the architect must remain within the ambit of the defined algorithm, but can still explore the design space by folding or unfolding the defined algorithm for area and time efficiency, subject to the desired design objectives. There is another interesting paradigm in which the algorithm can be reworked and modified; here the designer must first explore modifications of the algorithm from a hardware (HW) perspective.

The thesis takes high speed, high throughput and small area as the design requirements and selects two distinct representative applications: in one, the architecture is optimized for a defined algorithm; in the other, the algorithm is first reworked and then the architectures are designed.

The first application implements a lossless data compression algorithm for multi-gigabit-rate network interfaces. Because lossless data compression requires strict compliance with the algorithm, the technique remains within the constraints of the defined algorithm, so the lossless nature of the data is ensured, and purely architectural optimization methods are applied to derive efficient architectures. To design a parallel architecture, the algorithm is unrolled over multiple iterations, and the technique then exploits data and partial-computation reuse to build massively parallel architectures that are scalable and modular.

The second technique takes another important application: all-parallel dedicated finite impulse response (FIR) filters. In this method, the algorithm is modified and data manipulation techniques are applied, resulting in a more efficient architecture; producing the optimal architecture involves modifying the filter design constraints. The thesis presents a novel algorithm optimization technique for effective FIR digital filter implementation.

One objective of this research is therefore to propose and implement techniques that address the challenge of devising hardware architectures which support high throughput with low latency while dissipating less power and consuming minimum area. Most of these are competing design objectives: improving one degrades the others. The research presents scalable architectures for high-data-rate applications to compress thick pipes of data; the architecture is designed for high speed wide area network (WAN) routers. The research also involves the optimization of FIR digital filters, which are widely applied in digital signal processing and communication applications. Several methodologies have already been proposed for optimizing the performance of FIR filters, and numerous methods exist for realizing FIR filters in hardware with high throughput, smaller silicon area and improved power savings. The thesis proposes a novel methodology for efficient hardware realization of digital filters in terms of area and throughput.

Table of Contents
1 Introduction ............................................................... 6
   1.1 Data Compression ...................................................... 7
   1.2 WAN Optimization for Enterprise Network Extension ..................... 8
   1.3 Digital FIR Filters ................................................... 9
   1.4 Overview of the Dissertation ......................................... 10
   1.5 References ........................................................... 10

2 Lossless Data Compression: An Overview .................................... 13
   2.1 Requirement of Data Compression ...................................... 13
   2.2 Lossless vs. Lossy Compression ....................................... 14
   2.3 Lossless Data Compression Methods .................................... 14
      2.3.1 Statistical Methods ............................................. 15
         2.3.1.1 Static Modeling ............................................ 16
         2.3.1.2 Semi-Adaptive Modeling ..................................... 16
         2.3.1.3 Adaptive Modeling .......................................... 16
         2.3.1.4 Statistical Coding Methods ................................. 17
      2.3.2 Dictionary Based Methods ........................................ 17
         2.3.2.1 Static Methods ............................................. 18
         2.3.2.2 Adaptive Methods ........................................... 18
      2.3.3 Types of Implementation ......................................... 19
         2.3.3.1 Software Solutions ......................................... 19
         2.3.3.2 Hardware Solutions ......................................... 20
   2.4 Conclusions .......................................................... 21
   2.5 References ........................................................... 22

3 High Speed Architectures for Lossless Data Compression Algorithms ......... 25
   3.1 Proposed LZ77 Hardware Realization ................................... 26
      3.1.1 LZ77 Compression Algorithm Processing ........................... 27
      3.1.2 Compression Example ............................................. 29
      3.1.3 Architectural Blocks ............................................ 30
   3.2 Unfolded Parallel Architecture ....................................... 34
   3.3 Super-unfolded Architecture .......................................... 35
      3.3.1 Design Methodology .............................................. 35
      3.3.2 Comparison Matrix ............................................... 37
      3.3.3 LZ77 High Speed Super-unfolded Architecture Details ............. 39
   3.4 Pipelined Architecture ............................................... 40
      3.4.1 Parallel Pipeline Interconnect .................................. 40
      3.4.2 Pipelined Architecture Details .................................. 42
   3.5 Results .............................................................. 44
   3.6 Conclusions .......................................................... 45
   3.7 References ........................................................... 45

4 Multi-gig Lossless Data Compression Device for Enterprise Network ......... 47
   4.1 Data Compression in Enterprise Network Architecture .................. 48
   4.2 Compression Device ................................................... 53
   4.3 High Throughput Compression Architecture ............................. 55
   4.4 Conclusions .......................................................... 57
   4.5 References ........................................................... 58

5 Constant Coefficient Digital FIR Filters: Design and Implementation ....... 59
   5.1 Overview of Digital FIR Filter ....................................... 60
      5.1.1 Types of Digital FIR Filters .................................... 62
      5.1.2 Structures of Digital FIR Filters ............................... 62
      5.1.3 Design Methods of Digital FIR Filters ........................... 63
         5.1.3.1 Window Design Method ....................................... 63
         5.1.3.2 Frequency Sampling Techniques .............................. 64
         5.1.3.3 Optimal FIR Filter Design .................................. 64
   5.2 Hardware Implementation Issues ....................................... 64
      5.2.1 Finite Word-length Effects ...................................... 66
   5.3 Digital FIR Filter Design Parameters ................................. 68
   5.4 Conclusions .......................................................... 69
   5.5 References ........................................................... 69

6 Hardware Efficient Filter Implementation .................................. 72
   6.1 Effects of Coefficient Quantization .................................. 73
      6.1.1 Effect on Frequency Response of a FIR Filter .................... 74
   6.2 FIR Filters with Varying Quantization Levels ......................... 75
   6.3 Proposed Design Methodology .......................................... 77
   6.4 Hardware Implementation and Results .................................. 82
   6.5 Conclusions .......................................................... 85
   6.6 References ........................................................... 86

7 Conclusions ............................................................... 88

List of Figures
Figure 3-1: Buffers of LZ77 ................................................. 28
Figure 3-2: LZ77 Encoding ................................................... 29
Figure 3-3: Basic Compression Cell .......................................... 31
Figure 3-4: Best Match and Length Calculator (BMLC) ......................... 33
Figure 3-5: Compression Block ............................................... 34
Figure 3-6: Comparison Matrix ............................................... 38
Figure 3-7: High Speed Super-unfolded Architecture .......................... 39
Figure 3-8: Pipelined Architecture .......................................... 42
Figure 4-1: Mobile Branch Office Architecture ............................... 51
Figure 4-2: Optimized Enterprise Network Extension for Branch Offices ....... 52
Figure 4-3: Compression Device Top Level Architecture ....................... 53
Figure 4-4: High Speed Multiple Stream Pipelined Architecture ............... 56
Figure 4-5: High Speed Multiple Stream Unfolded Architecture ................ 57
Figure 5-1: A Direct Form FIR Filter ........................................ 60
Figure 5-2: Sampling and Quantization ....................................... 66
Figure 5-3: Design Specifications of Digital Filter ......................... 68
Figure 6-1: Quantization (a) by Rounding (b) by Truncation .................. 73
Figure 6-2: Responses of Filters with Different Quantization Levels ......... 76
Figure 6-3: Filter Optimization for LP FIR Filter 1 ......................... 80
Figure 6-4: Filter Optimization for HP FIR Filter 2 ......................... 80
Figure 6-5: Filter Optimization for BP FIR Filter 3 ......................... 81
Figure 6-6: Filter Optimization for BS FIR Filter 4 ......................... 81

List of Tables
Table 3-1: Parallel Comparisons for N=12, P=8, Q=4 .......................... 32
Table 3-2: Parallel Comparisons for 6 Iterations for N=12, P=8, Q=4 ......... 36
Table 3-3: Parallel Comparisons for Pipelined Architecture .................. 41
Table 3-4: Unfolded and Super-unfolded Throughput, Clock = 100 MHz .......... 44
Table 3-5: Pipelined Throughput, Clock = 500 MHz ............................ 45
Table 6-1: LP FIR Filter 1 Order 19 vs. Order 41 Resources .................. 83
Table 6-2: HP FIR Filter 2 Order 20 vs. Order 66 Resources .................. 83
Table 6-3: BP FIR Filter 3 Order 23 vs. Order 63 Resources .................. 84
Table 6-4: Band Stop FIR Filter 4 Order 26 vs. Order 66 Resources ........... 85

CHAPTER 1

1 Introduction
The thesis focuses on techniques for the efficient implementation of digital signal processing and data communication algorithms. It presents novel architectural optimization techniques for designing efficient applications; the work is broadly divided into three parts.

1) Architectures for implementing a lossless data compression algorithm to compress multi-gigabit data pipes. We present three types of compression architectures by restricting the design methodology to the constraints of the defined algorithm and purely applying architectural optimization methods. In our research work we define the basic building blocks for the compression architectures. Fully parallel unfolded and pipelined architectures are demonstrated for high-data-rate applications. Depending on the data rates, the architecture can be folded or unfolded to any degree, to save area or gain throughput respectively.

2) A multi-gig compression device for extending the enterprise network to branch offices. An optimized network architecture to extend the enterprise network to fixed and mobile branch offices is presented, along with a configurable compression device with LAN and WAN interfaces to compress thick pipes of data. The device incorporates a layered architecture to demonstrate the modularity and scalability of the presented compression architectures, and multiple broadband technologies to provide a fully mobile platform for mobile branch offices.

3) A novel method to design FIR filters. The proposed optimization technique modifies the design constraints and quantizes the coefficients from a different perspective. The technique results in an efficient filter implementation, producing optimized results in terms of silicon area and power-efficient hardware realization.

1.1 Data Compression


With the advancement of both wired and wireless communication technologies, more and more information in the form of text, audio or video data is being transported across different communication media. Similarly, in this age of information explosion, enormous storage capacity is required for preserving information in a form that can be readily retrieved. Data compression finds its applications in communications, as compressed data offloads otherwise bloated communication channels by injecting data of shorter lengths [1]-[6]. Similarly, for storage devices, be they hard disks or dynamic memories, less physical space is consumed by compressed data [7][8]. Data compression is therefore an important technique for preserving communication channel bandwidth and increasing the capacity of data storage devices. It is a method of removing redundancies in the data by encoding repeated longer strings of data into shorter codewords.

Data compression techniques have been in practice for over half a century. Shannon pioneered the theory of data compression [9]. Since then, a wide variety of algorithms have been proposed and implemented; the broad classification is on the basis of statistical or dictionary based adaptive methods [10]. Statistical methods like arithmetic coding [11], run-length coding [12], Huffman [13] and adaptive Huffman [14] coding provide a good compression ratio, but they are limited by their speed of execution because they encode one symbol at a time, and are therefore not well suited to hardware implementation. Dictionary based methods, on the other hand, are good for a wide variety of applications where the input source is not known beforehand. The LZ77 [15] and LZW [16] lossless data compression methods are therefore good candidates for high-data-rate compression solutions.
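To make the symbol-at-a-time nature of the statistical coders concrete, the run-length coding cited above can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the thesis hardware design, and the function names are the author's own choices:

```python
def rle_encode(data: bytes):
    """Encode a byte string as (symbol, run_length) pairs."""
    runs, i = [], 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1                      # extend the current run
        runs.append((data[i], j - i))   # one output pair per run
        i = j
    return runs

def rle_decode(runs):
    """Invert rle_encode exactly -- the scheme is lossless."""
    return b"".join(bytes([sym]) * n for sym, n in runs)

data = b"aaaabbbcccccd"
assert rle_decode(rle_encode(data)) == data   # lossless round trip
```

Note that the encoder walks the input one symbol at a time; it is precisely this serial dependence that limits the throughput of statistical coders in hardware.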
The most popular dictionary based algorithms, LZ77 and LZW, are standard compression algorithms used in a number of compression utilities such as compress, gzip and bzip2. These techniques can be employed for high throughputs, especially on high speed routers for WAN applications. The dictionary based methods LZ77 and LZW are suitable for hardware implementation at high data rates since they are adaptive and involve no statistical computations on source symbols [10].

For hardware implementation, the LZ77 algorithm is converted from its sequential nature of execution to parallel comparisons. The parallel comparisons involved in finding the length of the longest matched string in the adaptive dictionary based LZ77 technique play a key role in improving throughput, since the dictionaries are updated adaptively. The reported throughputs of LZ77 are in the range of 2.5 Gbit/sec [17] on Application Specific Integrated Circuits (ASICs). In the literature, a maximum of one codeword per clock cycle is possible. Maximizing the parallel comparisons at higher clock speeds is a challenge for the larger dictionary sizes that are vital for higher compression ratios; in addition, no reported technique supports more than one codeword per clock cycle. Proposing a multi-gigabit-rate LZ77 lossless data compressor whose throughput has the flexibility to increase with improvements in VLSI technology is therefore a challenge.
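The longest-match search that dominates LZ77 cost can be illustrated with a minimal software sketch. The buffer sizes, variable names and the (offset, length, next-symbol) codeword layout below are illustrative choices, not the architecture proposed in Chapter 3:

```python
def lz77_encode(data: bytes, window: int = 8, lookahead: int = 4):
    """Emit (offset, length, next_symbol) codewords, LZ77-style.

    `window` is the search-buffer (dictionary) size and `lookahead`
    the lookahead-buffer size; real designs use far larger buffers."""
    out, pos = [], 0
    while pos < len(data):
        best_len, best_off = 0, 0
        # In hardware these candidate comparisons run in parallel;
        # here they are an explicit sequential loop.
        for cand in range(max(0, pos - window), pos):
            length = 0
            while (length < lookahead and pos + length < len(data) - 1
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - cand
        out.append((best_off, best_len, data[pos + best_len]))
        pos += best_len + 1
    return out

def lz77_decode(codes):
    """Rebuild the input by copying matches back out of the history."""
    out = bytearray()
    for off, length, sym in codes:
        for _ in range(length):
            out.append(out[len(out) - off])
        out.append(sym)
    return bytes(out)
```

The inner candidate loop is the part that a hardware realization flattens into a bank of parallel comparators, so that the best match over the whole dictionary is found in one step rather than one comparison per cycle.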

1.2 WAN Optimization for Enterprise Network Extension


Internet usage has accelerated significantly over the past few decades [18]. Companies reliant on information technology are mushrooming, with remote site locations and branch offices interlinked by broadband communication links. Enterprises are striving to gain an edge over their competitors by providing 24/7 service to their customers. The provision of round-the-clock service requires remote branch offices to be established in different geographical areas and time zones, far away from the main corporate head office. Remote branch offices are categorized as fixed or mobile. Mobile branch offices are readily deployable, making use of long vehicles, and are established in response to changing business trends and requirements.

The optimization of WAN connectivity to extend the enterprise network to fixed and mobile branch offices is a new research domain. The transfer of huge amounts of data over limited WAN bandwidth results in network congestion and decreases network performance. Data compression significantly increases the effective bandwidth without the need to upgrade line connections, yielding immediate cost savings; optimizing the WAN bandwidth by data compression ensures better utilization of bandwidth and improved network performance.

In our research work we propose a configurable compression device that can be deployed in the core network of the corporate headquarters to compress multi-gig-rate thick pipes of data. The device has multiple LAN and WAN interfaces and can be configured to operate at any data rate, for both fixed and mobile networks. The device acts as a mobile hotspot to extend the enterprise network to mobile branch offices; the hotspot is equipped with multiple wireless broadband technologies and acts as a gateway between local LAN devices and the enterprise network. The compression device incorporates a layered architecture to compress multi-gig data pipes.

1.3 Digital FIR Filters


The other important area of research is digital FIR filters. Researchers have proposed a number of FIR filter methods for digital filtering applications in broadband modems and video and audio signal processing. Many efficient methods exist for improving the performance of these filters and realizing them on targeted platforms, either in software or in hardware. Some of these techniques improve the structures used to represent filters in either the time domain or the frequency domain. Another technique for area and power saving hardware realization focuses on reducing or eliminating multipliers in favor of adders and shifters, since a multiplier requires more silicon area and power [19]-[24]. Methods for reducing the number of arithmetic operations include canonic signed digit (CSD) and minimum signed digit representations; common subexpression elimination on CSD-translated coefficients provides further savings in multipliers [25][26]. Many researchers have also focused on statistical, analytical or heuristic methods to offset the effects of quantization of the filter coefficients. The finite word length of the coefficients and the errors due to rounding and truncation in the arithmetic operations have also been reported in connection with efficiently implementing a digital filter. Area and power efficient hardware implementation of filters has always been an active area of research, and a methodology for enhancing the efficiency of optimized implementations is a major research challenge addressed in this research, which focuses on an efficient implementation technique for digital FIR filter hardware.
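The canonic signed digit idea mentioned above can be sketched as follows: a constant coefficient is recoded into signed power-of-two terms, so that multiplication by it reduces to a few shifts and add/subtract operations. This is a hedged Python sketch using the standard textbook CSD recoding loop; the function names are the author's own and the sketch assumes positive integer coefficients:

```python
def to_csd(value: int):
    """Recode a positive integer into canonic signed-digit form:
    a list of (sign, shift) terms with value == sum(sign * 2**shift)
    and no two adjacent nonzero digits, minimising add/sub count."""
    terms, v, shift = [], value, 0
    while v:
        if v & 1:
            d = 2 - (v % 4)        # +1 if v % 4 == 1, -1 if v % 4 == 3
            terms.append((d, shift))
            v -= d                 # remainder is now even
        v >>= 1
        shift += 1
    return terms

def csd_mult(x: int, terms):
    """Constant multiplication using only shifts and adds/subtracts."""
    return sum(d * (x << s) for d, s in terms)

# 7 = 8 - 1: two CSD terms instead of three binary ones, so a
# coefficient of 7 costs one subtractor instead of two adders.
assert to_csd(7) == [(-1, 0), (1, 3)]
assert csd_mult(5, to_csd(7)) == 35
```

In a multiplierless FIR realization, each tap product b_k * x[n-k] is expanded this way, and common subexpression elimination then shares identical shift-add terms across taps.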

1.4 Overview of the Dissertation


Chapter 1 gives an introduction to the dissertation. Chapter 2 explains data compression methodologies and implementation methods in general, serving as the basis for the research carried out to devise the novel lossless data compression architectures. The details of the novel lossless data compression architectures for the LZ77 data compression algorithm and their building blocks are presented in Chapter 3, together with the design methodology, the scalable, reconfigurable architectures and the results. Chapter 4 discusses the lossless data compression device for extending enterprise networks to remote branch offices. Chapter 5 covers the essential basics of digital FIR filters, including their types, structures, design methods and the problems encountered in design. Chapter 6 presents the design methodology for realizing the efficient filter implementation, and Chapter 7 concludes the dissertation.

1.5 References
[1] B. Jung and W. P. Burleson, "Real-Time VLSI Compression for High-Speed Wireless Local Area Networks," Proceedings of the Data Compression Conference, 1995.
[2] B. Jung and W. P. Burleson, "Efficient VLSI for Lempel-Ziv Compression in Wireless Data Communication Networks," Proceedings of the International Symposium on Circuits and Systems, London, June 1994.
[3] B. Jung and W. P. Burleson, "A VLSI Systolic Array Architecture for Lempel-Ziv Based Data Compression," Proceedings of the IEEE Symposium on Circuits and Systems, 1994.
[4] M. Milward, J. L. Nunez-Yanez and D. Mulvaney, "Lossless Parallel Compression Systems," Electronic Systems and Control Division Research 2003, Department of Electronic and Electrical Engineering, Loughborough University, LE11 3TU, UK.
[5] K. Papadopoulos and I. Papaefstathiou, "Titan-R: A Reconfigurable Hardware Implementation of a High-Speed Compressor," 16th International Symposium on Field-Programmable Custom Computing Machines, IEEE, 2008.
[6] I. Papaefstathiou, "An IPComp Processor for 10-Gbps Networks," IEEE Design & Test of Computers, November-December 2004.
[7] S. Rigler, W. Bishop and A. Kennings, "FPGA-Based Lossless Data Compression Using Huffman and LZ77 Algorithms," IEEE, 2007.
[8] R. B. Tremaine, P. A. Franaszek, J. T. Robinson, C. O. Schulz, T. B. Smith, M. E. Wazlowski and P. M. Bland, "IBM Memory Expansion Technology," IBM Journal of Research and Development, Vol. 45, No. 2, March 2001.
[9] C. E. Shannon, "A Mathematical Theory of Communication," The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July and October 1948.
[10] D. T. Huang, "Fast and Efficient Algorithms for Text and Video Compression," PhD dissertation, Brown University, Rhode Island, 1997.
[11] P. Howard and J. Vitter, "Practical Implementations of Arithmetic Coding," in Image and Text Compression, Kluwer Academic Publishers, 1992.
[12] S. W. Golomb, "Run-Length Encodings," IEEE Transactions on Information Theory, Vol. IT-12, pp. 399-401, July 1966.
[13] H. Park and V. Prasanna, "Area Efficient Architectures for Huffman Coding," IEEE Transactions on Circuits and Systems, 1993.
[14] M. W. E. Jamro and K. Wiatr, "FPGA Implementation of the Dynamic Huffman Encoder," Proceedings of the IFAC Workshop on Programmable Devices and Embedded Systems, 2006.
[15] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, Vol. IT-23, No. 3, May 1977.
[16] T. Welch, "A Technique for High-Performance Data Compression," IEEE Computer, Vol. 17, pp. 8-19, 1984.
[17] Comtech AHA Corporation, www.aha.com.
[18] Z. Yin and V. C. M. Leung, "A Proxy Architecture to Enhance the Performance of WAP 2.0 by Data Compression," EURASIP Journal on Wireless Communications, 2005.
[19] F. F. Daitx, V. S. Rosa, E. Costa, P. Flores and S. Bampi, "VHDL Generation of Optimized FIR Filters," 2nd International Conference on Signals, Circuits and Systems, 7-9 Nov. 2008, pp. 1-5.
[20] A. Hosangadi, F. Fallah and R. Kastner, "Reducing Hardware Complexity of Linear DSP Systems by Iteratively Eliminating Two-Term Common Subexpressions," Proceedings of ASP-DAC, 2005.
[21] P. Flores, J. Monteiro and E. Costa, "An Exact Algorithm for the Maximal Sharing of Partial Terms in Multiple Constant Multiplications," Proceedings of ICCAD, pp. 13-16, 2005.
[22] M. Yamada and A. Nishihara, "High-Speed FIR Digital Filter with CSD Coefficients Implemented on FPGA," Proceedings of the IEEE Design Automation Conference (ASP-DAC 2001), 2001, pp. 7-8.
[23] M. A. Soderstrand, L. G. Johnson, H. Arichanthiran, M. Hoque and R. Elangovan, "Reducing Hardware Requirement in FIR Filter Design," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '00), 2000, Vol. 6, pp. 3275-3278.
[24] E. Costa, P. Flores and J. Monteiro, "Exploiting General Coefficient Representation for the Optimal Sharing of Partial Products in MCMs," Symposium on Integrated Circuits and System Design, Ouro Preto (Minas Gerais), Brazil, 2006.
[25] A. P. Vinod and E. M.-K. Lai, "On the Implementation of Efficient Channel Filters for Wideband Receivers by Optimizing Common Subexpression Elimination Methods," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 24, No. 2, February 2005.
[26] S. Mirzaei, A. Hosangadi and R. Kastner, "FPGA Implementation of High Speed FIR Filters Using Add and Shift Method," International Conference on Computer Design (ICCD), IEEE, 2006.

CHAPTER 2

2 Lossless Data Compression: An Overview


This chapter serves as the background study for proposing efficient architectural optimizations for dictionary based lossless data compression algorithms in general and LZ77 in particular. It briefly discusses the choice among lossless data compression algorithms, their main classifications, necessary details and application domains, the choice of a hardware or software platform for realization, and the types of hardware implementation.

2.1 Requirement of Data Compression


The continual explosion of information has driven tremendous improvement in data transmission and storage technologies. Data compression is a method of encoding data with fewer bits by removing redundancies before storage or transmission over a communication channel; the redundancies are reinserted by the associated decompression process. The available bandwidth of a wired or wireless communication channel is effectively increased by reducing the amount of data injected into the channel. Similarly, data compression increases the effective capacity of a variety of on-line and off-line storage systems. The redundancy introduced by channel coding for reliability is also offset by data compression techniques [1]. Data compression is therefore a cost-effective tool for preserving expensive resources such as communication bandwidth and storage space, and it is particularly applicable in wireless communications [2] [3]. The capacity of mass storage devices has increased from a few megabytes to tens of gigabytes, and application software requirements have grown accordingly [4]. The demand for data storage and transmission continues to exceed capacity despite advances in communication technology and high-density memories; data compression technology will therefore not become obsolete in the years to come [3][4][5]. The throughput of the data compression system should be in tune with the overall communication system throughput [2] to preserve transmission bandwidth in both wired and wireless systems. Data compression is also applied before encryption, where it becomes significantly effective in providing greater security [6][7].

2.2 Lossless vs. Lossy Compression


The volume of digital information, including data, audio, images and video, being transmitted and stored continues to grow. Compression can be either lossy or lossless. In lossy (perceptual) coding, the information after decompression is an approximation of the transmitted information; it is usually applied to image transmission and storage, where some finer details of the images can be discarded during compression, introducing some distortion into the original information [8]. JPEG image compression is a lossy compression technique. In lossless data compression, the information at the source and destination is exactly identical. Systems managing data for databases such as financial transactions, accounts and flight reservations, or for executable code, cannot tolerate the loss of even a single bit, so lossless data compression is imperative [5]. Data compression is essentially lossless in most such cases; the ZIP utility is an example of lossless data compression. This thesis targets only lossless data compression.

2.3 Lossless Data Compression Methods


Lossless data compression implies that the data at the source and destination is exactly identical. A number of encoding methods exist for lossless data compression, and sometimes a combination of these methods is used. The encoding techniques can be broadly classified as dictionary based and statistical. Dictionary-based methods produce variable-to-fixed-length codes, while statistical methods produce fixed-to-variable-length codes, as explained in the sections that follow. The division between the two classes is sometimes not very distinct: some techniques cannot be placed in one category or the other, and hybrids of the two methods also exist. Since data compression deals with files or packets in computer memories or storage devices and with transmission across LANs or WANs, the alphabet of symbols employed in these applications is the set of 1-byte characters from the 8-bit extended ASCII (American Standard Code for Information Interchange) character set. A total of 256 distinct symbols can be represented in an 8-bit byte.

2.3.1 Statistical Methods


These techniques capture redundancies using probabilistic methods [4] and convert characters into variable-length strings of bits based on their frequency of use, encoding one character at a time. Statistical methods such as Huffman, arithmetic, Shannon-Fano and run-length coding are all examples of the classical block-to-variable code classification, producing a variable number of output bits for each input symbol by assigning shorter codes to frequently occurring symbols and longer codes to rarely occurring ones. Statistical methods essentially comprise two steps: source modeling and encoding. In practice, an approximate mathematical model of the source data is used. Each symbol has some probability that defines the amount of information it carries. A code is a mapping from a source element or symbol to an output element [3] [4]. The probability model predicts the current character to be coded based on previously coded characters. The prediction is then used in a process called entropy coding [4], where the measure of entropy gives the information content of a message. There are different ways of constructing probability models, and the accuracy of these models is one of the criteria for statistical data compression methods. Shannon described different probability models [9]: in a zero-order model each symbol is statistically independent and equally likely to occur, so the same number of bits encodes each symbol; in first-, second-, third- or higher-order models the probability of the current symbol depends on the previous 1, 2, 3 or more symbols, with fewer bits encoding a symbol as the order of the model increases. Once a model of the source is chosen, next comes the choice of a coder. The entropy or information content provides the lower limit on the number of bits for a codeword [10]. In some literature the modeling approaches are termed fixed, static and adaptive, while elsewhere they are termed static, semi-adaptive or adaptive. There is likewise a variety of coding methods, so the following sections briefly discuss these statistical coding techniques.
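The entropy lower bound mentioned above can be illustrated with a short sketch. The function name `zero_order_entropy` is our own illustrative choice, and the computation assumes a zero-order (memoryless) model of the source:

```python
import math
from collections import Counter

def zero_order_entropy(data):
    """H = -sum(p * log2(p)) in bits/symbol: the lower bound on the
    average codeword length any statistical coder can achieve under a
    zero-order (memoryless) model of this source."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

print(zero_order_entropy("abcdefgh"))  # 3.0 bits/symbol: eight equiprobable symbols
print(zero_order_entropy("aaaaaaab"))  # ~0.54 bits/symbol: a skewed source needs far fewer bits
```

The skewed source is the case a statistical coder exploits: the frequent symbol can be assigned a much shorter code than the rare one.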

2.3.1.1 Static Modeling


Static modeling is based on the type of source symbols; prior and extensive knowledge of the source is required, since the model of a particular source is fixed or known in advance. A different model is therefore required for each type of source, which is a difficult practical proposition; alternatively, a single static model tuned to one source may produce maximum compression for that source but perform very poorly for some other source.

2.3.1.2 Semi-Adaptive Modeling


This category falls between static and adaptive modeling. The source symbols are analyzed in advance to gather statistics and determine the probabilities of the symbols over the input data; in this respect the method is close to static modeling, since the statistics for different types of sources differ. The resulting model is then transmitted to the decoder for synchronization, hence the method is also partly adaptive. The model requires two passes over the input symbols: one for gathering probabilities and a second for encoding the data.

2.3.1.3 Adaptive Modeling


In this model, the encoder continuously updates the statistics while encoding the data in a single pass. The encoder and decoder begin with the same initial model, and as compression proceeds, the same updating of statistics is performed at the decoder for synchronization. The performance of adaptive statistical modeling is better than that of the static approaches. Algorithms using this modeling are termed dynamic algorithms.

2.3.1.4 Statistical Coding Methods


There is a wide variety of coders producing variable-length codes for a given symbol. Some are classified as entropy encoders, in which the input symbols are coded such that the length of each codeword is proportional to the negative logarithm of its probability; examples are Huffman [11] [12] and arithmetic codes [13]. Huffman codes require an integral number of bits per codeword, while arithmetic codes may use fractional bits. Adaptive or dynamic Huffman coding [14] is also an entropy encoder but, unlike static Huffman and arithmetic coders, it adapts to the source symbols as coding proceeds, without knowing the source statistics beforehand. Run-length coding [15] counts repeated runs of a single symbol and forms codewords according to the number of occurrences; it is only suited to sources with long sequential runs of repeated data. In context mixing, the next-symbol predictions of two or more models are combined to yield a prediction more accurate than any of the individual models [16]. Prediction by partial matching (PPM) [4][16] is based on statistical models in which a set of previous symbols in the uncompressed data stream is used to predict the next symbol in the stream; PPM is an adaptive technique based on context modeling. Other coders such as Shannon-Fano, range coding and the Burrows-Wheeler transform also employ statistical modeling. Through efficient modeling and coding, the statistical methods render state-of-the-art compression ratios, but they are computationally intensive. Furthermore, the hardware realization of statistical models is quite complex when high throughputs are required in real-time communication and data storage applications such as routers.
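As a minimal sketch of the classical Huffman construction described above (not the hardware designs of [12]), the two lightest subtrees are merged repeatedly so that frequent symbols end up near the root with short codes. The function name `huffman_codes` is illustrative:

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a Huffman code table: repeatedly merge the two least
    frequent subtrees, prefixing '0' to one side and '1' to the other,
    so codeword length grows with -log2(probability)."""
    freq = Counter(data)
    # Heap entries: (weight, tie_breaker, {symbol: code_suffix})
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate one-symbol source
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)     # two least-frequent subtrees
        w2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")
print(codes)  # the most frequent symbol 'a' receives the shortest code
```

Note that 'a', occurring four times out of seven, receives a one-bit code while 'b' and 'c' receive two-bit codes, matching the shorter-codes-for-frequent-symbols rule.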

2.3.2 Dictionary Based Methods


Dictionary-based lossless data compression encoders are also termed substitution coders [16]. These methods are adaptive in nature: a priori information about the type of source data is not required, and they are classified as variable-to-block coders. In these methods, a dictionary holding a certain length of input symbols is maintained, and repeated strings of characters in the input data stream are replaced by shorter codewords. Dictionary-based techniques are high-speed encoders since they code several symbols at a time, compared to the one symbol at a time of statistical coders. These methods can also be static or adaptive (dynamic). Both the encoder and decoder must maintain the same dictionary, and the longer the dictionary, the greater the compression ratio. The most popular are the Lempel-Ziv (LZ1 or LZ77) [17] and LZ2 or LZ78 [18] classes of dictionary-based coders, with many variants such as LZSS, LZR, LZB, LZFG, LZT, LZW and LZJ. Another dictionary coding algorithm that does not belong to the LZ class of coders is X-MatchPro [19].

2.3.2.1 Static Methods


In static dictionary methods, the full set of strings is determined prior to coding and does not alter during the coding process. The method is used where the message or set of messages is fixed, e.g. software that stores the contents of religious books in the limited storage space of a PDA, building the dictionary from repeating phrases.

2.3.2.2 Adaptive Methods


In these methods, the dictionary begins in some predefined state, its contents then keep changing, and encoding is performed based on the data that has already been coded. The two famous algorithms, LZ77 and LZW, and their variants are all dynamic dictionary-based coders. They adapt dynamically to different types of data, and prior knowledge of the symbols is not required. This approach is often termed sliding-window compression or substitution coding [16]. The simplicity and effective speed of compression and decompression have made it quite popular for both storage and communication applications, and it has been widely used as an integral part of many international standards [4]. A variety of its variants are implemented in well-known compression programs such as zip, gzip and compress.

2.3.2.2.1 Lempel-Ziv 1 (LZ1) Coding


Lempel-Ziv 77 or LZ77 is one of the most straightforward methods; it replaces recurring patterns in the data with shorter codewords of fixed length. It operates as a non-probabilistic method, sequentially compressing the repeated strings of data, and is also termed a window or dictionary-based method. The same dictionary is maintained at the compressor and the de-compressor.

2.3.2.2.2 LZ78 or LZW Coding


LZ78, and its variant LZW, maintains a dictionary at both the encoder and the decoder that is continuously updated during the encoding process. Initially the dictionary contains 256 entries, one for each character of the 8-bit extended ASCII symbol set. The details are provided in [18].
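A minimal sketch of LZW encoding along these lines is shown below. The function name `lzw_encode` is illustrative, and details such as dictionary reset or maximum code width are omitted:

```python
def lzw_encode(data):
    """LZW encoding sketch: the dictionary starts with the 256
    single-byte symbols and grows by one entry each time a previously
    unseen string is encountered."""
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    w = ""                     # longest string matched so far
    output = []
    for ch in data:
        if w + ch in dictionary:
            w += ch            # keep extending the current match
        else:
            output.append(dictionary[w])
            dictionary[w + ch] = next_code   # learn the new string
            next_code += 1
            w = ch
    if w:
        output.append(dictionary[w])
    return output

print(lzw_encode("ababab"))  # [97, 98, 256, 256]: the learned string "ab" is reused
```

The decoder can rebuild exactly the same dictionary from the code stream, which is why no dictionary needs to be transmitted.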

2.3.3 Types of Implementation


Adaptive lossless data compression algorithms can be realized in software on general-purpose computers, or in hardware on FPGAs or ASICs.

2.3.3.1 Software Solution


Lossless data compression algorithms can be implemented in software on general-purpose computers. The advantage of this approach is flexibility: improvements and newer, optimized methods yielding better compression ratios can readily be incorporated. However, data compression algorithms are computationally intensive and require memory for maintaining the dictionaries. The larger the dictionary, the better the compression ratio, and similarly the more thorough (and computationally intensive) the searching, the better the compression. All this comes at the cost of CPU processing time and memory usage; the compression is transparent to the end user, but software methods cannot perform all the tasks simultaneously where high compression throughputs are required [2]. The data structures commonly used in software implementations are binary search trees and hashing, which require indirect addressing and multiple memory accesses that are bottlenecks in achieving high data rates [20]. Software solutions therefore cannot fulfill the high speed and performance requirements of present and future systems.
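To illustrate the hashing bottleneck, the sketch below models a gzip-style hash-chain search (a common software technique, not the thesis architecture): every lookup walks a chain of candidate positions, each costing extra memory accesses. The function name `longest_match` is illustrative, and for simplicity the match is confined to the history (no overlap into the look-ahead):

```python
from collections import defaultdict

def longest_match(history, lookahead, min_len=3):
    """Index every min_len-character prefix of the history in a hash
    table of position chains, then walk the chain for the look-ahead's
    prefix to find the longest match. Each chain entry visited is an
    extra memory access -- the software bottleneck noted in the text."""
    chains = defaultdict(list)
    for i in range(len(history) - min_len + 1):
        chains[history[i:i + min_len]].append(i)
    best_pos, best_len = -1, 0
    for pos in chains.get(lookahead[:min_len], []):
        length = 0
        while (length < len(lookahead)
               and pos + length < len(history)
               and history[pos + length] == lookahead[length]):
            length += 1
        if length > best_len:
            best_pos, best_len = pos, length
    return best_pos, best_len

print(longest_match("abracadabra", "abrax"))  # (0, 4): "abra" found at position 0
```

In hardware, by contrast, all candidate positions can be compared in the same clock cycle, which is precisely what the CAM and systolic approaches below exploit.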

2.3.3.2 Hardware Solutions


The exponentially expanding traffic of wired and wireless networks (LANs, WANs) and the generation and storage of digital information in ranges beyond terabytes have necessitated high-speed data compression engines [21]. The most time-consuming part of the dictionary-based techniques is the string search for the longest match. A hardware accelerator offloads the CPU by performing the compression and decompression, besides increasing compression speed; hardware designed specifically for a particular algorithm can achieve fast and reliable compression with higher throughputs [2]. On-line real-time transmission at network routers requires scalable and parallel hardware architectures for compressors and de-compressors with throughputs of the order of gigabits/sec. Hardware implementations, whether on FPGAs or ASICs, facilitate real-time, on-the-fly compression and decompression of data. A number of methodologies exist for mapping the dictionary-based LZ77 or LZW algorithms to hardware, as described below.

2.3.3.2.1 CAM Based Implementations


Most of these techniques employ CAMs (content addressable memories) to speed up the string matching. Content addressable memories provide parallel lookups of repeating strings of input data in the dictionaries, and the fast comparisons lead to high-speed compression. CAMs can be read and written like ordinary memories; a throughput of 100 Mbit/sec using CAMs is reported in [22]. A CAM-based approach combined with pipelining and partitioning of the design has also been claimed to handle data rates of 100 Mbit/sec [23]. CAMs have further been employed to design a chip supporting 2000 full-duplex channels at 80 Mbit/sec throughput. CAM-based designs, however, are not cost effective, since CAMs are very expensive.

2.3.3.2.2 Systolic Array Implementations


A systolic array is a regular pattern of processing elements interconnected in a simple way, each processing element being connected to its adjacent elements. The basic idea is to lay out an identical pattern of processing elements with simple interconnections, each capable of carrying out simple tasks [24]. This maps well to hardware implementation, as it suits parallel processing such as the parallel comparisons in LZ77 [20]. The O(n²) computation of a serial implementation can be decreased to O(n), where n is the length of the longest match, by using a parallel architecture of n systolic processors [25]. The existing architectures based on systolic arrays have their own merits and demerits.

2.3.3.2.3 Other Approaches


The reduced instruction set computer (RISC) approach is sometimes employed as a compression engine, but execution speed is the bottleneck, so it is not well suited to real-time applications [26]. Another approach, based neither on systolic arrays nor on CAMs, exploits the locality of sub-string match lengths [27]. Comparing the speed of match-length computation with the systolic array approach, the match length is obtained after n cycles rather than the 3n cycles of the systolic array [28]. Other hardware implementations of the LZ77 class of algorithms [6] [29] [30] [31] are standalone accelerators implemented in FPGAs or ASICs that produce very high throughputs compared to the other existing methods.

2.4 Conclusions
The chapter provides the necessary literature on lossless data compression methods and outlines the existing techniques for hardware implementation of the LZ77 class of algorithms. This survey forms the foundation on which the architectural optimization methodology for the LZ77 algorithm is devised, as presented in the next chapter.

2.5 References
[1] Shih-Arn Hwang and Cheng-Wen Wu, Unified VLSI Systolic Array for LZ Data Compression, IEEE Transactions on Very Large Scale Integration, Vol. 9, No. 4, August 2001.
[2] B. Jung and W. P. Burleson, Efficient VLSI for Lempel-Ziv Compression in Wireless Data Compression Networks, Proceedings of the International Symposium on Circuits and Systems, London, June 1994.
[3] Kenneth C. Barr and Krste Asanovic, Energy Aware Lossless Data Compression, ACM Transactions on Computer Systems, Vol. 24, No. 3, August 2006, pp. 250-291.
[4] Dzung Tian Huang, Fast and Efficient Algorithms for Text and Video Compression, PhD dissertation, Brown University, Rhode Island, 1997.
[5] D. J. Craft, A Fast Hardware Data Compression Algorithm and Some Algorithmic Extensions, IBM Journal of Research and Development, Vol. 42, No. 6, 1998.
[6] Konstantinos Papadopoulos and Ioannis Papaefstathiou, Titan-R: A Reconfigurable Hardware Implementation of a High-Speed Compressor, 16th IEEE International Symposium on Field-Programmable Custom Computing Machines, 2008.
[7] Edward R. Fiala and Daniel H. Greene, Data Compression with Finite Windows, Communications of the ACM, Vol. 32, No. 4, April 1989.
[8] Data Compression, Wikipedia, the free encyclopedia.
[9] C. E. Shannon, A Mathematical Theory of Communication, Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July, October 1948.
[10] Pasi 'Albert', Compression Basics, http://www.cs.tut.fi/~albert/
[11] D. Huffman, A Method for the Construction of Minimum Redundancy Codes, Proc. IRE, Vol. 40, pp. 1098-1101, Sep. 1952.
[12] T. Lee and J. Park, Design and Implementation of a Static Huffman Encoding Hardware using a Parallel Shifting Algorithm, IEEE Transactions on Nuclear Science, Vol. 51, No. 5, pp. 2073-2080, October 2004.
[13] G. G. Langdon Jr., An Introduction to Arithmetic Coding, IBM J. Res. Development, pp. 135-149, Mar. 1984.
[14] M. W. E. Jamro and K. Wiatr, FPGA Implementation of the Dynamic Huffman Encoder, Proceedings of the IFAC Workshop on Programmable Devices and Embedded Systems, 2006.
[15] S. Golomb, Run Length Encodings, IEEE Transactions on Information Theory, Vol. IT-12, pp. 399-401, July 1966.
[16] Lossless Data Compression: Theory and Algorithms, www.maximumcompression.com
[17] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory, Vol. IT-23, No. 3, May 1977.
[18] T. Welch, A Technique for High-Performance Data Compression, IEEE Computer, Vol. 17, pp. 8-19, 1984.
[19] J. L. Nunez and S. Jones, The X-MatchPRO 100 Mbytes/Second FPGA-Based Lossless Data Compressor, Proc. Design, Automation and Test in Europe (DATE) Conf., pp. 139-142, Mar. 2000.
[20] Brian Ta-Cheng Hou, A VLSI Architecture for a Data Compression Engine, Master's thesis, MIT, 1989.
[21] Mohamed A. Abd El Ghany, Aly E. Salama and Ahmed H. Khalil, Design and Implementation of FPGA-based Systolic Array for LZ Data Compression, IEEE, 2007.
[22] S. Jones, 100 Mbit/s Adaptive Data Compressor Design Using Selectively Shiftable Content-Addressable Memory, IEE Proc. Pt. G, Vol. 139, No. 8, August 1992.
[23] C. Y. Lee and R. Y. Yang, High-Throughput Data Compressor Design Using Content Addressable Memory, IEE Proc. Pt. G, Vol. 142, Feb. 1995.
[24] James A. Storer and John H. Reif, A Parallel Architecture for High Speed Data Compression, 3rd Symposium on the Frontiers of Massively Parallel Computation, 1990.
[25] D. Mark Royals, Tasso Markas, Nick Kanopoulos, John H. Reif, and James A. Storer, On the Design and Implementation of a Lossless Data Compression Chip, IEEE Journal of Solid-State Circuits, Vol. 28, No. 9, 1993.
[26] Chang, H. J. Jih, and J. W. Liu, A Lossless Data Compression Processor, in Proc. 4th VLSI Design/CAD Workshop, NanTou, Aug. 1994.
[27] Y. J. Kim, K. S. Kim and K. Y. Choi, Efficient VLSI Architecture for Lossless Data Compression, Electronics Letters, Vol. 31, No. 3, June 1995.
[28] N. Ranganathan and S. Henriques, High-Speed VLSI Designs for Lempel-Ziv Based Data Compression, IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, Vol. 40, February 1993.
[29] Suzanne Rigler, William Bishop and Andrew Kennings, FPGA-Based Lossless Data Compression Using Huffman and LZ77 Algorithms, IEEE, 2007.
[30] Rizwana Mehboob, Shoab A. Khan and Zaheer Ahmed, High Speed Lossless Data Compression Architecture, 10th IEEE Multi-Topic Conference (INMIC 2006), Islamabad.
[31] Suzanne Rigler, FPGA-Based Lossless Data Compression Using GNU Zip, Master's thesis, University of Waterloo, 2007.

CHAPTER 3

3 High Speed Architectures for Lossless Data Compression Algorithms


This chapter focuses on the design of the basic building blocks used to devise different parallel architectures implementing the LZ77 algorithm. As the lossless data compression application requires strict compliance with the algorithm, all the basic building blocks are designed so that the technique remains within the constraints of the defined algorithm; purely architectural optimization methods are applied to define efficient architectures. The chapter presents the design of three parallel architectures built from these basic blocks. The first architecture is based on unrolling the algorithm for parallel comparisons. The second not only unrolls the algorithm for parallel comparisons but also uses multiple iterations of the algorithm, reusing data and partial computations to form a massively parallel architecture; it finds application in compressing thick pipes of data where the data varies over a limited set, thus requiring only moderate-length history buffers. The third is a pipelined architecture catering for high data rates, applicable on backbones where long history buffers are required to achieve higher compression ratios on diverse data patterns. These architectures are highly scalable and modular; the modularity and scalability is demonstrated in the next chapter, where a layered architecture is presented to compress multi-gigabit data pipes. The architectural optimization for achieving throughputs of the order of a few gigabits/sec not only converts serial processing into parallel comparisons but, in one of the proposed methods, also introduces the concept of look-ahead comparisons to further maximize throughput. The research deals with efficient hardware realization of the broader class of LZ dictionary algorithms in general and LZ77 in particular. The scalability and configurability of the architectures allow realization of data compression devices for applications requiring bandwidth economy and capacity enhancement in mass storage devices. The three architectures for the LZ77 lossless data compression algorithm have been synthesized on FPGAs.

3.1 Proposed LZ77 Hardware Realization


The flexibility of realizing a data compression algorithm on a general-purpose processor can be traded for performance, in terms of high throughput, by means of hardware implementation on FPGAs or ASICs. This is driven by the unending appetite of communication networks for bandwidth, which requires high-throughput systems for information exchange. The massive growth in data traffic over wired and wireless communication links requires very high-speed data compression, with throughputs of the order of gigabits per second. Compression ratio, speed of compression (effective throughput), utilization of computing resources and memory requirements are the parameters for gauging the efficacy of a data compression algorithm and its implementation methodology. In this thesis, the dictionary-based standard LZ class of algorithms in general, and LZ77 in particular, is investigated for inherent parallelism in the string matching process that replaces repeated strings with shorter codewords, thus effectively eliminating redundancies. The LZ77 algorithm, first proposed in 1977 [1], has been under constant research and is widely employed as a standalone software or hardware solution, or in combination with other schemes such as Huffman or adaptive Huffman coding, in standard programs like gzip [2] [3]. In dictionary-based lossless data compression, the repeated character strings in the input data are encoded by shorter codewords of fixed length, by comparing input characters with the previously encoded data; the redundancies in the data are thus removed to a large extent to accomplish the required compression. The compression ratio (CR), as a measure of the removal of redundancy or of the compression achieved, can take either of two definitions [4] [5] [6]:

CR = [length of uncompressed input string] / [length of compressed output string]

or

CR = [length of compressed output string] / [length of uncompressed input string]

According to the first definition the compressor is effective if and only if CR is greater than 1, while under the second definition CR must be less than 1. A few variants of LZ77 mentioned in the literature [7] [8] have proposed modifications for greater compression ratios; the trend is to minimize the length of the codeword and save bits in forming it, effectively improving the compression ratio. These modifications can easily be incorporated in our proposed architecture, but for simplicity we stick to the simplified version of LZ77 processing as described in [6] [9]. The proposed architecture can be tailored for alterations of LZ77 due to variation in codeword length.

3.1.1 LZ77 Compression Algorithm Processing


The principle of compression is based on the premise of temporal locality of substrings: data tends to repeat in the near future, so the input data strings are searched for repetitions and shorter codewords are transmitted for the repeated strings [9]. At each step of the encoding process, the input data is pumped into a window of length N characters, subdivided into two buffers: a history buffer of length P and a look-ahead or search buffer of length Q characters, as shown in Figure 3-1. The input data of length Q characters (y0 to y3) occupies the search buffer; these symbols are yet to be encoded, while the characters (x0 to x7) of length P in the history buffer have been pushed out of the search buffer as its characters were encoded. This is termed the sliding dictionary or sliding window encoding method. At the beginning of compression, the buffer is initially filled with one type of character agreed between the encoder and decoder; this character can be zero or some null character. A sequential search for a match of the search buffer's first symbol (y0) against the already encoded history buffer is performed; on finding a match, the next symbol of the search buffer is concatenated with the first and the lookup is repeated. This continues until the longest match (of length len) between search buffer characters and history buffer characters is found. The match in the history buffer may overlap some of the characters in the search buffer, but it must begin at a location in the history buffer.

Figure 3-1: Buffers of LZ77

Similarly, the starting position in the history buffer at which the longest match of the search buffer's symbols begins is termed the pointer (pntr). The last symbol (lsymb) is the character in the search buffer immediately following the longest matched string. The concatenation of the pointer, the longest match length and the last symbol forms the codeword: < pntr len lsymb >. Once the codeword is formed, on the next encoding step the window slides by (len+1) symbols: the search buffer is filled with new input characters from one side, the (len+1) just-encoded characters are pushed into the history buffer, and (len+1) characters are pushed out of the other side of the history buffer to accommodate the newly encoded data from the search buffer. The maximum number of bits required for encoding the match length (len) is given by:

# of bits for encoding match length: len_bits = ⌈log2 Q⌉, where ⌈x⌉ = ceiling(x)

Similarly, the number of bits required for encoding the pointer (pntr) is given by:

# of bits for encoding pointer: pntr_bits = ⌈log2 P⌉
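A short sketch makes the codeword sizing concrete. The function name `codeword_bits` is illustrative, and the 8-bit last symbol follows the extended ASCII assumption used throughout:

```python
import math

def codeword_bits(P, Q, symbol_bits=8):
    """Total width of the < pntr len lsymb > codeword, using the
    ceiling-of-log2 sizing given above."""
    pntr_bits = math.ceil(math.log2(P))  # pointer into the history buffer
    len_bits = math.ceil(math.log2(Q))   # longest-match length
    return pntr_bits + len_bits + symbol_bits

# For the example buffers P = 9, Q = 9: 4 + 4 + 8 bits per codeword
print(codeword_bits(9, 9))  # 16
```

A longer history buffer thus costs only logarithmically more pointer bits, which is why longer dictionaries can still pay off in compression ratio.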

3.1.2 Compression Example


In this example the history buffer is initially filled with the character a from the 256-symbol extended ASCII character set. Consider the input string S = aaababacbacbacbcacbacbcaa, with window length N = 18, history buffer length P = 9 and look-ahead (search) buffer length Q = 9. The longest match of the search buffer string, starting at its first character, is found in the history buffer that is initially filled with the character a.

S = a a a b a b a c b a c b a c b c a c b a c b c a a    (input string), N = 18, P = 9, Q = 9

Pass 1: history = a a a a a a a a a, search = a a a b a b a c b; match "aaa" starting at history position 8; code word = < 8 3 b >; drop aaaa
Pass 2: history = a a a a a a a a b, search = a b a c b a c b a; match "aba" at position 7; code word = < 7 3 c >; drop aaaa
Pass 3: history = a a a a b a b a c, search = b a c b a c b c a; match "bacbacb" at position 6; code word = < 6 7 c >; drop aaaababa
Pass 4: history = c b a c b a c b c, search = a c b a c b c a a; match "acbacbca" at position 2; code word = < 2 8 a >; drop cbacbacbc

Figure 3-2: LZ77 Encoding

As shown in Figure 3-2, the portions of S that occupy the search buffer in each step of the encoding algorithm are marked. In the first pass, where the history buffer is filled with the same character a, the three characters a a a of the search buffer match the history buffer. The match can start at any position (0-8) of the history buffer and can extend into the search buffer as well; here the match is considered to start at location 8 of the history buffer and overlaps the search buffer. The last symbol following the longest match is b, so the code word < pntr len lsymb > is <8 3 b>. Since 3 characters are matched, after the first pass (3+1) characters are dropped out of the history buffer on the left, 4 characters of the search buffer are moved into the history buffer from the right, and the next 4 characters of the input string fill the search buffer. The other passes of the encoding algorithm are depicted in Figure 3-2. The following sections explain the details of devising the basic building blocks of the algorithm.
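The walkthrough above can be reproduced with a small reference encoder. This is a sketch of the simplified LZ77 processing, not the hardware architecture: the function name `lz77_encode` is our own, matches may overlap into the search buffer, one symbol is always reserved as lsymb, and ties are broken toward the rightmost pointer to mirror the choice of position 8 in the first pass:

```python
def lz77_encode(S, P=9, Q=9, fill="a"):
    """Simplified LZ77 encoder: history of length P pre-filled with
    'fill', search buffer of length Q, codewords (pntr, len, lsymb)."""
    window = fill * P + S
    pos, codewords = P, []            # pos = start of the search buffer
    while pos < len(window):
        lookahead = min(Q, len(window) - pos)
        best_len, best_ptr = 0, 0
        for ptr in range(P - 1, -1, -1):      # rightmost pointer wins ties
            start = pos - P + ptr
            length = 0
            # Extend the match; it may run past the history into the
            # search buffer, but must leave one symbol for lsymb.
            while (length < lookahead - 1
                   and window[start + length] == window[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_ptr = length, ptr
        codewords.append((best_ptr, best_len, window[pos + best_len]))
        pos += best_len + 1                   # slide the window
    return codewords

print(lz77_encode("aaababacbacbacbcacbacbcaa"))
# [(8, 3, 'b'), (7, 3, 'c'), (6, 7, 'c'), (2, 8, 'a')]
```

The four codewords match the four passes of Figure 3-2, including the overlapping matches of passes 3 and 4.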

3.1.3 Architectural Blocks


In this section, the important blocks that constitute the proposed architectures for hardware implementation are discussed. Considering the algorithmic processing in the preceding sections, the thesis addresses the issue of sequential comparisons in the LZ77 encoding process: the encoder's sequential comparisons are realized as parallel comparisons to attain very high speeds. In our architecture, the symbol set comprises the ASCII or extended ASCII code (American Standard Code for Information Interchange); there are 256 distinct characters, so each can be encoded as an 8-bit byte, and these 8-bit characters constitute the input data string. Each character of the search buffer is therefore matched with a character in the history buffer. To decide whether two characters are the same, the hardware comparator compares each constituent bit of the two characters. The basic compression cell (BCC) for comparing a character x0 with y0 is shown in Figure 3-3. The number of simultaneous BCCs in a particular architecture depends on the length of the history buffer (P), the length of the search buffer (Q), and the type of parallel-comparison architecture, discussed in the sections to follow. The output of the BCC is a single bit.
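As a minimal software sketch of this cell (illustrative; the thesis realizes it in logic gates), the comparison XNORs each of the 8 bit positions and AND-reduces the results to one match bit:

```python
def bcc(x: int, y: int) -> int:
    """Basic compression cell model: 1 if the two 8-bit symbols are
    identical, 0 otherwise."""
    match = 1
    for i in range(8):
        # XNOR of bit i of x and bit i of y: 1 when the bits agree
        xnor = 1 - (((x >> i) ^ (y >> i)) & 1)
        match &= xnor                    # AND-reduce across the 8 bits
    return match
```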


Figure 3-3: Basic Compression Cell

The basic compression cell compares two characters to produce a single-bit output. The parallel comparison of the characters of the search buffer with those in the history buffer, which holds the already encoded characters, implies that the entire search buffer is compared with the history buffer locations simultaneously. Consider a window of length N with a search buffer of length Q = 4 and a history buffer of length P = 8. The characters in the history buffer are x0 x1 x2 x3 x4 x5 x6 x7 and the characters in the search buffer are y0 y1 y2 y3. Therefore, a total of 32 (8x4) simultaneous distinct two-character comparisons, shown in Table 3-1, are required in the first pass of the encoding algorithm for a fully parallel approach that obtains the code word in a single clock cycle.


Table 3-1: Parallel Comparisons for N=12, P=8, Q=4

Row 0:  x0 x1 x2 x3   vs  y0 y1 y2 y3
Row 1:  x1 x2 x3 x4   vs  y0 y1 y2 y3
Row 2:  x2 x3 x4 x5   vs  y0 y1 y2 y3
Row 3:  x3 x4 x5 x6   vs  y0 y1 y2 y3
Row 4:  x4 x5 x6 x7   vs  y0 y1 y2 y3
Row 5:  x5 x6 x7 y0   vs  y0 y1 y2 y3
Row 6:  x6 x7 y0 y1   vs  y0 y1 y2 y3
Row 7:  x7 y0 y1 y2   vs  y0 y1 y2 y3

There are 32 basic compression cells employed in the first pass of the algorithm for generating the first code word: 4 BCCs take distinct input combinations in each of the 8 rows. The maximum match length of the comparisons can therefore be 4, in any of the eight rows from 0 to 7, and the pointer to the maximum match, i.e. the address of the history buffer character from which the maximum match commenced, is noted in the same row for forming the code word. Finding the maximum match length and the pointer to the history buffer character where the match commenced is realized by an entity designed for this purpose and termed the best match and length calculator (BMLC), shown in Figure 3-4. The BMLC is an architectural entity comprising the interconnections of BCC outputs, simple logic gates and priority encoders that automatically renders the maximum match length of the comparisons and the pointer to that particular match. The BMLC serves as one of the important blocks of the architectures.
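A behavioural sketch of the BMLC (illustrative only; the hardware uses AND chains and priority encoders) takes the grid of BCC outputs, reduces each row to the length of its leading run of ones, and picks the longest run, favouring the highest row index on ties:

```python
def bmlc(match_bits):
    """match_bits[i][j] is the BCC output for row i, symbol j.
    Returns (match pointer, match length)."""
    best_len, best_ptr = 0, 0
    for i, row in enumerate(match_bits):
        l = 0
        for bit in row:                  # AND chain: stop at first mismatch
            if not bit:
                break
            l += 1
        if l >= best_len:                # >= models the priority encoder
            best_len, best_ptr = l, i
    return best_ptr, best_len
```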


[Figure 3-4 shows the BMLC gate network: the BCC outputs for each row (x0y0 and x1y1 in row 0, x1y0 and x2y1 in row 1, through x7y0 and y0y1 in row 7, continuing with the y2 and y3 comparisons) feed chains of AND gates, and the chain outputs drive priority encoders that yield the match pointer (0-7) and the match length.]

Figure 3-4: Best Match and Length Calculator (BMLC)

The BMLC can be used with different lengths of the history and search buffers. Longer history buffers increase the compression ratio considerably, while search buffer lengths of 8 to 32 characters produce optimal


results [10][11][12]. Long history or search buffers, however, also lengthen the critical path of the BMLC. A basic parallel compression block (CB) is a combination of BCCs and a BMLC, as shown in Figure 3-5. A group of BCCs is termed the parallel comparators, and the outputs of the parallel comparators are fed to a BMLC, as shown in Figure 3-4, to find the match length and pointer simultaneously. The CB also caters for the proper data interconnects from the history and search buffers. In Figure 3-5, a CB is shown for N = 12, P = 8 and Q = 4.

Figure 3-5: Compression Block

The single-bit outputs of each of the 32 BCCs are fed into the BMLC. On the basis of the data interconnects of the CB, different architectures are realized, as discussed in the sections to follow. Broadly, three main categories of architectures are discussed, and each can be used in different types of applications.
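Putting the two pieces together, a compression block can be modelled in software as a P x Q grid of character comparisons feeding the best-match selection. This is an illustrative sketch with hypothetical names, not the thesis RTL:

```python
def compression_block(x, y):
    """CB model for history x (length P) and search buffer y (length Q):
    returns (pointer, match length) for one pass."""
    P, Q = len(x), len(y)
    w = x + y                            # matches may overlap into y
    # P x Q grid of BCC outputs
    cells = [[int(w[i + j] == y[j]) for j in range(Q)] for i in range(P)]
    best_len, best_ptr = 0, 0
    for i, row in enumerate(cells):      # BMLC: leading-ones run per row
        l = 0
        for bit in row:
            if not bit:
                break
            l += 1
        if l >= best_len:                # ties favour the highest pointer
            best_len, best_ptr = l, i
    return best_ptr, best_len
```

For the Table 3-1 sizes, compression_block("abcdefgh", "cdef") returns (2, 4), and compression_block("aaaaaaaa", "aaab") returns (7, 3), the highest-pointer tie of the all-a history.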

3.2 Unfolded Parallel Architecture


This architecture can be realized simply by employing the basic compression block, the two data streams of the history and search buffers, and data-shifting logic. It performs the parallel comparison of the entire search buffer with the history


buffer in one clock cycle. This can be implemented for a variety of search and history buffer lengths. Longer history buffers provide a better compression ratio, but the parallel comparisons depicted in Table 3-1 for large history buffers lengthen the critical path and therefore slow execution. The number of parallel comparisons in this type of architecture equals the product of P and Q, where P is the history buffer size and Q is the size of the search buffer. The unfolded parallel architecture thus consists simply of the parallel compression block CB of Figure 3-5; only barrel shifters are added to shift the search buffer's contents into the history buffer and to shift the input data characters into the search buffer.

3.3 Super-unfolded Architecture


Based on the architectural blocks discussed in the previous sections, a fully parallel super-unfolded architecture is proposed for applications requiring high throughputs.

3.3.1 Design Methodology


In this architecture, the concept of super-unfolding the algorithm is applied to accommodate future iterations of the algorithm. To understand this, consider a history buffer of length P = 8 and a search buffer of length Q = 4, making the window length N = 12. The different iterations or steps of the algorithm that can be implemented in parallel fashion are depicted in Table 3-2. The degree of unfolding depends upon the lengths of the history and search buffers. The first column is the simple unfolded architecture catering for the fully parallel comparisons. If the entire search buffer is matched with a string within the history buffer in the very first column, then the next pass or iteration of the algorithm performs its parallel comparisons in the column that is Q+1 columns ahead of the first, because the code word also carries the last symbol following the match. Therefore, for a search buffer of length Q = 4, the first iteration of the algorithm is performed in the first column and the


next iteration may land in the column Q+1 positions later. Therefore, for a two-degree unfolding, 6 columns of parallel matches are included in Table 3-2 for Q = 4.

Table 3-2: Parallel Comparisons for 6 Iterations, N=12, P=8, Q=4

Col 1 (each row vs y0y1y2y3): x0x1x2x3, x1x2x3x4, x2x3x4x5, x3x4x5x6, x4x5x6x7, x5x6x7y0, x6x7y0y1, x7y0y1y2
Col 2 (each row vs y1y2y3y4): x1x2x3x4, x2x3x4x5, x3x4x5x6, x4x5x6x7, x5x6x7y0, x6x7y0y1, x7y0y1y2, y0y1y2y3
Col 3 (each row vs y2y3y4y5): x2x3x4x5, x3x4x5x6, x4x5x6x7, x5x6x7y0, x6x7y0y1, x7y0y1y2, y0y1y2y3, y1y2y3y4
Col 4 (each row vs y3y4y5y6): x3x4x5x6, x4x5x6x7, x5x6x7y0, x6x7y0y1, x7y0y1y2, y0y1y2y3, y1y2y3y4, y2y3y4y5
Col 5 (each row vs y4y5y6y7): x4x5x6x7, x5x6x7y0, x6x7y0y1, x7y0y1y2, y0y1y2y3, y1y2y3y4, y2y3y4y5, y3y4y5y6
Col 6 (each row vs y5y6y7y8): x5x6x7y0, x6x7y0y1, x7y0y1y2, y0y1y2y3, y1y2y3y4, y2y3y4y5, y3y4y5y6, y4y5y6y7

(Entries repeated from earlier columns, marked blue in the original table, are redundant.)

All the parallel comparisons of the history and the search buffer are accommodated in the first column of Table 3-2. When there is no match of the search buffer with the history buffer in the first pass of the algorithm, the last symbol y0 is shifted into the history buffer, and the second pass of parallel comparisons commences as depicted in column 2 of Table 3-2. Similarly, if one symbol is matched in the first pass, two of the search buffer's characters are shifted into the history buffer from one side and two characters from the input string become part of the search buffer from the other side. Therefore, the comparisons performed either in the second pass of the algorithm, if one symbol was matched in the first column, or in the third pass, when there were no matches in the previous two passes, are shown in


column 3 of Table 3-2. Similarly, when 3 symbols of the search buffer match the history buffer in the very first iteration of the algorithm, 4 characters are encoded and the next pass of the algorithm is the one shown in column 5 of Table 3-2. The same methodology is employed for further iterations, and more than one code word can be obtained in a single clock cycle; by unfolding the algorithm, we can encode future strings according to the unfolding level. In each column after the first, 24 of the 32 comparisons are redundant, their results being already available from previous entries. These redundant comparisons, repeated from earlier columns of Table 3-2, are avoided by recording the parallel comparisons of the different passes or iterations of the algorithm in tabular form. A simpler technique for arriving at the necessary comparisons in each pass, and avoiding the repeated or redundant ones, is to draw the comparison matrix. The table above is unfolded for a degree-2 unfolding; for a degree-3 unfolding with the same history and search buffer lengths, the table would comprise 11 columns.

3.3.2 Comparison Matrix


The parallel comparisons shown in blue in Table 3-2 are redundant, as their results are already available from the comparisons of previous columns. These already-available results are fed into the BMLC and the CB for each pass of the algorithm. Many comparisons are repeated not just twice but more often. The comparison matrix is used to identify only the necessary comparisons and thereby economize on the number of BCCs: instead of writing the parallel comparisons in tabular form, we employ a simple method of finding the minimum number of necessary comparisons. Consider the matrix shown in Figure 3-6, where the 72 distinct black dots at the x and y crossings are the only necessary comparisons.


Figure 3-6: Comparison Matrix

A total of 192 BCCs are required, as shown in Table 3-2, for encoding two code words of maximum length in a single clock cycle by unfolding six iterations of the algorithm. Out of these 192 comparisons, 120 are redundant and only 72 are valid; to optimize the architecture, the redundant comparisons are omitted. The redundant comparisons for a window of length N=8, P=4 and Q=4 are listed in the associated research paper [13]. Parallel comparisons are required for high speeds, but by adopting our methodology of future iterations, the redundancy is avoided by selecting only the required comparisons. The rest of the logic of the parallel compression block, including the BMLC, remains, and the economy in BCCs is achieved. The advantage gained in this architecture is that the results of future iterations become available in one clock cycle. If there is a stringent area constraint, the degree of parallelism can be reduced at the cost of lower speed and throughput. This fully parallel super-unfolded architecture also holds the next incoming symbols beyond the sliding window. This concurrency in matching is exploited because, when


the entire search buffer is matched to a string in the history buffer, the symbols following the match can also be incorporated by parallel block shifting of the history and search buffers.
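The redundancy counts quoted above can be checked by enumerating the comparisons of Table 3-2. In the sketch below (illustrative; the indexing scheme follows the table, with window positions 0-7 holding x0-x7 and positions 8-16 holding y0-y8), each comparison is identified by its pair of window positions:

```python
def count_comparisons(cols=6, P=8, Q=4):
    """Count raw versus distinct comparisons over the unfolded columns."""
    raw, distinct = 0, set()
    for c in range(cols):                # column index, 0-based
        for i in range(P):               # row within the column
            for j in range(Q):
                a = c + i + j            # window position on the x side
                b = P + c + j            # window position on the y side
                raw += 1
                distinct.add((a, b))
    return raw, len(distinct)
```

count_comparisons() returns (192, 72): 192 raw comparisons, of which only 72 are distinct, so 120 are redundant, matching the counts quoted with Figure 3-6.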

3.3.3 LZ77 High Speed Super-unfolded Architecture Details


Based on the parallel comparisons, this architecture is depicted in Figure 3-7.

[Figure 3-7 shows the X and Y symbols feeding six comparison columns, Comp Col1 through Comp Col6. Columns 2 to 6 instantiate only the eight new comparisons each: x4y4, x5y4, x6y4, x7y4, y0y4, y1y4, y2y4, y3y4 for column 2, through y0y8, y1y8, y2y8, y3y8, y4y8, y5y8, y6y8, y7y8 for column 6. Each column feeds its own BMLC (BMLC1 to BMLC6), and the six pointer/length pairs drive the column select logic that outputs the final pointers and lengths.]

Figure 3-7: High Speed Super-unfolded Architecture

In Figure 3-7, all the parallel comparisons of each of the six iterations or passes of the algorithm are implemented. For illustration, the same lengths N = 12, P = 8 and Q = 4 are considered as in the explanation of Table 3-2. The x and y symbols of the history buffer and the search buffer for the first pass of the algorithm are fed to comparison column 1, depicted as the block Comp Col1 in Figure 3-7 and corresponding to column 1 of Table 3-2. In this block all 32


comparisons, i.e. 32 BCCs, are used. The results of these parallel comparisons of column 1 are then fed to BMLC1. For the subsequent 5 passes or iterations of the algorithm, the results of 24 out of 32 comparisons in each of columns 2 to 6 are already available, as indicated by the repeated entries in Table 3-2. Only the 8 remaining comparisons per column are incorporated in the architecture, shown explicitly in Figure 3-7 as comparison columns Col2 to Col6. The required pre-computed results from the previous passes, together with those of the present pass or iteration, are fed to the corresponding BMLC for each of columns 2-6. The pointer and maximum match length of each column are selected independently. Finally, the results of all six BMLCs are fed to the column select logic, and based on the pointer values and match lengths, the code word with the maximum length is selected.
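The per-column economy can likewise be checked by counting, for each comparison column, how many comparisons have not already appeared in an earlier column (an illustrative sketch using the Table 3-2 indexing, with window positions 0-7 for x0-x7 and 8-16 for y0-y8):

```python
def fresh_per_column(cols=6, P=8, Q=4):
    """Number of new BCCs each column needs once earlier results are reused."""
    seen, per_col = set(), []
    for c in range(cols):
        fresh = 0
        for i in range(P):
            for j in range(Q):
                pair = (c + i + j, P + c + j)   # comparison identity
                if pair not in seen:
                    seen.add(pair)
                    fresh += 1
        per_col.append(fresh)
    return per_col
```

fresh_per_column() returns [32, 8, 8, 8, 8, 8]: the full 32 BCCs in column 1 and only 8 new ones in each of columns 2 to 6, for 72 in total.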

3.4 Pipelined Architecture


The pipelined architecture is quite efficient for relatively large history buffers. The pipeline is incorporated to reduce the critical path and achieve high data rates for long history buffers. The comparisons are still parallel rather than serial, but the large number of parallel comparisons is divided into independent pipelines. The following section explains the rationale behind the architecture.

3.4.1 Parallel Pipeline Interconnect


Consider the tabular form of comparisons shown in Table 3-3, where the history buffer is of length P = 32 and the search buffer's length is Q = 8. For obtaining the code word in a single clock cycle, the parallel comparisons for a single iteration or pass of the encoding algorithm require 256 (32x8) simultaneous comparisons, following


the pattern of Table 3-1. The dimensions of a BMLC for 256 parallel comparisons likewise increase the critical path for calculating the match length and pointer, limiting the speed or throughput of the architecture. Therefore, we divide the 256 parallel comparisons into chunks: either two pipelines of 128 comparisons each, or four pipelines of 64 parallel comparisons each. The latter choice, four streams or pipelines of parallel comparisons covering the 256 comparisons of a single pass of the LZ77 algorithm, is depicted in Table 3-3.

Table 3-3: Parallel Comparisons for Pipelined Architecture

Column 1 (each row vs y0..y7): x0..x7, x1..x8, x2..x9, x3..x10, x4..x11, x5..x12, x6..x13, x7..x14
Column 2 (each row vs y0..y7): x8..x15, x9..x16, x10..x17, x11..x18, x12..x19, x13..x20, x14..x21, x15..x22
Column 3 (each row vs y0..y7): x16..x23, x17..x24, x18..x25, x19..x26, x20..x27, x21..x28, x22..x29, x23..x30
Column 4 (each row vs y0..y7): x24..x31, x25..x31 y0, x26..x31 y0y1, x27..x31 y0y1y2, x28..x31 y0..y3, x29..x31 y0..y4, x30x31 y0..y5, x31 y0..y6
The search buffer contains only 8 characters, y0 to y7, while the history buffer holds 32 characters, x0 to x31. As shown in Table 3-3, the comparisons for a fully parallel single pass of the algorithm between the history and search buffers are divided into four columns of parallel comparisons instead of one. The search buffer in each of the 256 comparisons divided among the four columns is the same, y0 to y7; this search buffer slides across the history buffer in search of the longest matched string between the two buffers. The search buffer


slides forward by one position, from the left of the history buffer towards the right. There are no redundant comparisons in this table; every entry is independent. For dividing a long history buffer of length P into multiple pipelines of parallel comparisons, the beginning index of the history buffer character for each pipeline is obtained from the pipeline number k, which ranges from 0 to 3 in this example:

PLk = k * Q

where Q is the search buffer length and k is the pipeline index.
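The indexing rule above can be sketched in one line (illustrative naming):

```python
def pipeline_starts(P, Q):
    """History-buffer start index PLk = k * Q for each of the P // Q pipelines."""
    return [k * Q for k in range(P // Q)]
```

pipeline_starts(32, 8) gives [0, 8, 16, 24], the x0, x8, x16 and x24 column origins of Table 3-3.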

3.4.2 Pipelined Architecture Details


The pipelined architecture is depicted in Figure 3-8. Its basic building blocks are essentially the same as those used in the unfolded and super-unfolded fully parallel architectures.

[Figure 3-8 shows the X and Y symbols feeding four comparison columns, Comp Col1 through Comp Col4. Each column feeds its own BMLC (BMLC1 to BMLC4), producing a match length and pointer (m_len1/pntr1 through m_len4/pntr4) that are latched into registers; the registered results then feed the column select logic, which outputs the final pointer and length.]

Figure 3-8: Pipelined Architecture

Each independent stream or pipeline of parallel comparisons of Table 3-3 appears as a comparison column, Comp Col1 to Comp Col4, in Figure 3-8. The

maximum match length and pointer to the best match are obtained from each column's BMLC, and the results are stored in the corresponding register for that column. The stored results from the four registers are then fed to the column select logic in the next clock cycle to determine the best match and associated pointer for finalizing the code word. Once a code word is formed, the column select logic determines the required shifts for the x and y buffers and the number of symbols to insert from the input data stream into the y buffer for the next pass of algorithm processing. The proposed architecture extends to larger history buffers: for a history buffer of 1K characters, i.e. P = 1024, and Q = 32 characters, we can have 32 (1024/32) pipelines, each performing 1K (32x32) parallel comparisons, with a code word available every other clock cycle. Similarly, when there are stringent area requirements, the architecture is folded to the required degree. In the same example of a 1K characters long history buffer and 32

characters long search buffer, instead of maintaining 32 pipelines simultaneously, only 16 pipelines, each of 1K comparisons, may be maintained. Half of the processing is then done in one clock cycle and the remaining comparisons are carried out in the next; the pointers and lengths of each processing interval are analyzed in the following clock cycle to form the code word. Execution time increases, but the architecture is folded and reused within a smaller area. Depending on the requirements of execution speed or silicon area, the architecture can therefore be folded or unfolded to any degree. The folded architecture occupies a smaller area with a reduced critical path and can therefore be synthesized at much higher clock rates. This architecture is synthesized along with the preceding ones, and the results of all three architectures are discussed in the next section.
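The fold trade-off described above can be summarized numerically. The helper below is illustrative; its names and the cycle count are assumptions based on the description, not figures from the thesis:

```python
def pipeline_plan(P, Q, fold=1):
    """Area/speed trade-off: with `fold` passes per code word, only
    (P // Q) // fold comparison columns are instantiated and reused."""
    pipelines = P // Q                   # independent comparison streams
    return {
        "pipelines": pipelines,
        "comparisons_per_pipeline": Q * Q,
        "active_columns": pipelines // fold,
        "cycles_per_codeword": fold + 1,  # + 1 for the column-select stage
    }
```

For P = 1024 and Q = 32, pipeline_plan(1024, 32) gives 32 pipelines of 1024 comparisons each; pipeline_plan(1024, 32, fold=2) halves the active columns to 16 at the cost of an extra cycle.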


3.5 Results
The throughput for different buffer sizes for the unfolded and super-unfolded architectures is compared in Table 3-4. The super-unfolded architectures with replication factors of two and four can produce two and four code words of maximum match length per clock cycle respectively. The throughput of the super-unfolded architecture with replication factor of two is approximately double that of the simple unfolded architecture; similarly, super-unfolded with replication factor of four is four times more efficient than the unfolded architecture.

Table 3-4: Unfolded and Super-unfolded Throughput, Clock = 100 MHz

  P      Q    Unfolded (Gbps)   Super-unfolded     Super-unfolded
                                repl = 2 (Gbps)    repl = 4 (Gbps)
  64     16      0.26995           0.53991            1.07981
  128    16      0.33784           0.67567            1.35134
  256    16      0.40324           0.80648            1.61295
  1024   16      0.52809           1.05618            2.11236
  64     32      0.28659           0.57317            1.14635
  128    32      0.36157           0.72313            1.44627
  256    32      0.43521           0.87042            1.74085
  1024   32      0.58876           1.17753            2.35506

The throughput of the pipelined architecture is shown in Table 3-5. The pipelined architecture can be synthesized at much higher clock rates and can be used with longer history buffers; as can be seen, the throughput approaches 3 Gbps for history buffers of 1K and greater. The architecture is modular and highly scalable, able to compress thick pipes of data by incorporating longer history buffers for better compression ratios.


Table 3-5: Pipelined Throughput, Clock = 500 MHz

  P      Q    Throughput (Gbps)
  64     16      1.34976
  128    16      1.68918
  256    16      2.0062
  1024   16      2.64044
  64     32      1.43294
  128    32      1.80784
  256    32      2.17606
  1024   32      2.94382

3.6 Conclusions
Different area- and time-efficient parallel architectures were explored for the defined algorithm. In the presented architectures, a bottom-up approach is used: the algorithm is unfolded and partitioned into basic building blocks, and different parallel architectures are realized by integrating these optimized blocks. The automatic calculation of the best match and pointer, used in conjunction with parallel comparators, underlies the unfolded parallel, super-unfolded and pipelined architectures. The degree of unfolding, and likewise the amount of pipelining, depends upon the application of the architecture.

3.7 References
[1] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, vol. IT-23, no. 3, May 1977.

[2] S. Rigler, W. Bishop, and A. Kennings, "FPGA-Based Lossless Data Compression using Huffman and LZ77 Algorithms," IEEE, 2007.

[3] S. Rigler, "FPGA-Based Lossless Data Compression Using GNU Zip," Master's thesis, University of Waterloo, 2007.

[4] D. A. Lelewer and D. S. Hirschberg, "Data Compression," ACM Computing Surveys, vol. 19, no. 3, 1987.

[5] N. Ranganathan and S. Henriques, "High Speed VLSI Designs for Lempel-Ziv Based Data Compression," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 40, February 1993.

[6] B. T. C. Hou, "A VLSI Architecture for a Data Compression Engine," Master's thesis, MIT, 1989.

[7] J. A. Storer and T. G. Szymanski, "Data Compression via Textual Substitution," Journal of the ACM, vol. 29, no. 4, October 1982, pp. 928-951.

[8] R. N. Williams, "An Extremely Fast Ziv-Lempel Data Compression Algorithm," in Proceedings of the IEEE Data Compression Conference, IEEE Computer Society Press, April 1991.

[9] T. C. Bell, "Better OPM/L Text Compression," IEEE Transactions on Communications, vol. COM-32, no. 12, 1986.

[10] S.-A. Hwang and C.-W. Wu, "Unified VLSI Systolic Array for LZ Data Compression," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, no. 4, August 2001.

[11] B. W. Y. Wei, R. Tarver, J.-S. Kim, and K. Ng, "A Single Chip Lempel-Ziv Data Compressor," in Proc. IEEE Int. Symposium on Circuits and Systems (ISCAS), Chicago, May 1993.

[12] R. Y. Yang and C. Y. Lee, "High-Throughput Data Compressor Designs using Content Addressable Memory," in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS), London, May 1994.

[13] R. Mehboob, S. A. Khan, and Z. Ahmed, "High Speed Lossless Data Compression Architecture," in Proc. 10th IEEE Multi-Topic Conference (INMIC), Islamabad, 2006.


CHAPTER 4

4 Multi-gig Lossless Data Compression Device for Enterprise Network


The rapid growth and advancements in communication technologies are giving birth to newer bandwidth-hungry applications and services. Consumers benefit as network providers add more and more value-added services to their networks, but this luxury requires thicker data pipes for providers to remain at par with their competitors. Service providers are extending their services to 24/7 operation across the globe by extending their enterprise networks to branch offices. Owing to the multitude of networking technologies and costly Wide Area Network (WAN) bandwidth, it has become imperative to devise cost-effective solutions for extending the enterprise network to branch offices. Moreover, the central database of the enterprise tends to grow exponentially: the storage area network (SAN) of an enterprise with several branch offices handles requests from all the static and mobile branch offices at the main data center. For the two competing objectives of minimizing WAN bandwidth requirements and enhancing the storage area network's capability, high-throughput lossless data compression is a component of prime importance alongside other techniques. In our research work, we present the design of a compressor that is scalable and layered; the architecture supports very high data rates by instantiating multiple layers. It is specifically designed to cater for the communication needs of large enterprises, can be installed at national gateways, and can compress thick pipes of data to save costly WAN bandwidth. The modularity of the design makes it an integral component of a national network infrastructure. This chapter presents three types of compression devices incorporated in the enterprise network. The enterprise can be envisioned as a central office, remote fixed


branch offices and mobile remote branch offices. The extension of the enterprise network to remote fixed and mobile offices requires WAN connectivity. We propose a multi-layer compression device at the concentration point of the central office, single-layer compression devices at fixed remote offices, and mobile compression devices with multiple broadband and LAN interfaces at mobile offices. The mobile compression device, termed a mobile hotspot, acts as a gateway between LAN and WAN devices and extends the enterprise network to mobile branch offices by integrating multiple communication technologies. We propose a unified compression device that can be configured as a multi-layer, single-layer or mobile compression device: depending on the application, any number of compression layers can be enabled to support the required data rate, and all the LAN and WAN interfaces can be enabled or disabled according to the application requirements. The presented device incorporates Gigabit Ethernet, STM1/STM4/STM16 and wireless interfaces for WAN connectivity and a 10G Ethernet interface for enterprise network connectivity. Two fiber channel interfaces are provided to interface a SAN with the compression device, compressing and decompressing the data stored in and retrieved from the SAN respectively. The device implements a novel layered architecture realizing the LZ77 lossless data compression algorithm in hardware. The high-throughput data compression architecture enables the interfacing of diverse high-speed communication technologies while preserving channel bandwidth to accommodate multiple applications. The device finds applications in optimizing WAN bandwidth for healthcare, media and broadcasting, and can likewise be applied to maximize the utilization of storage area networks.

4.1 Data Compression in Enterprise Network Architecture


The ever-increasing demand for voice, data, video and internet communication is giving rise to multiple and diverse forms of connectivity: homogeneous and heterogeneous, wired and wireless, static and mobile, preferably in handheld devices and laptops.


The wish for a blend of wide area network (WAN) and local area network (LAN) connectivity while on the move is turning into reality. In today's ultra-competitive market, advanced data services are migrating from fixed to mobile infrastructure, and WAN connectivity is becoming mandatory to extend the enterprise network to mobile infrastructure. Difficult and challenging market conditions have compelled mobile operators to make substantial investments in upgrading their networks to higher-capacity technology. In any particular geographical area, besides satellite and point-to-point microwave links, different wireless broadband communication technologies are available, including Wideband Code Division Multiple Access (W-CDMA), CDMA2000 and Worldwide Interoperability for Microwave Access (WiMAX). With the unprecedented and accelerated growth in IT applications, the demand for high-speed WAN links has increased manifold, necessitating growth in broadband technologies. In the last couple of years, not only notebooks and PDAs have come to require WAN connectivity on the go; many other consumer applications, in fields such as advertisement, marketing, estate agencies, logistics, recruitment, travel, hospitality, healthcare, media, broadcasting, insurance, finance and temporary events, demand wireless broadband connectivity. Data compression maximizes bandwidth and increases WAN link throughput by reducing frame size, thereby allowing more data to be transmitted over a link. This enables network managers to increase application performance and service availability for end users without costly infrastructure upgrades, and it enables network service providers to maximize available bandwidth and deliver more services to customers over their existing infrastructure.
With the widespread use of the branch offices that serve as the actual development centers, the need for compression devices to accelerate data across the WAN becomes more critical.


The latest trend in business is the establishment of branch offices miles away from the head office for better customer support, which in turn provides a competitive edge to a company. Mobile branch offices are readily deployable using long vehicles or trailers and are established keeping in view changing business trends and requirements. Information exchange by cell phones, PDAs, laptops or Wi-Fi enabled tools is vital for providing valuable 24/7 customer support and services across the globe. Wired or wireless networks are formed within the branch offices for voice, video, data and mobile applications. Corporate Information Technology (IT) therefore has to address an increasing number of networks and mobile devices with a scalable and integrated solution for reliable, media rich services, extending new technologies and corporate services to branch office employees at all locations and all responsibility levels. Wireless LAN (802.11), 3G standards and wired networking technologies are incorporated in the branch office architecture, enabling remote office personnel to avail the same services, security features and capabilities available at the corporate head office. The proposed mobile and fixed compression devices extend the enterprise network to fixed and mobile branch offices. The mobile compression device enables computing devices to communicate seamlessly while in motion. The IP centric devices in the mobile branch office, i.e. in a vehicle, form a local area network using Ethernet, Wi-Fi and Bluetooth networking technologies. Our proposed compression device acts as a gateway connecting the local LAN/WLAN with the corporate headquarters through the WAN link available in that geographical area. The device integrates the data of all devices connected on the LAN/WLAN and routes the compressed data over the available WAN link.
Figure 4-1 shows the communication architecture of our proposed mobile branch office connected with its corporate headquarters through the mobile compression device. The mobile platform of the compression device is based on our published work on mobile backbone nodes [1], which acts as a mobile hotspot in the proposed architecture. The device connects desktop computers and a
printer through an Ethernet interface, forming a LAN, while an IP phone, a tablet and a laptop form a WLAN on 802.11 interfaces. A PDA communicates with the device via a Bluetooth interface. The device is equipped with two broadband technologies, CDMA2000 and WiMAX.

Figure 4-1: Mobile Branch Office Architecture

The device acts as a mobile hotspot and provides a complete mobile platform integrating multiple LAN and WAN technologies for secure, ubiquitous connectivity while on the move. The architecture for enterprise network extension to branch offices is shown in Figure 4-2. The fixed and mobile branch offices are connected with the corporate headquarters through WAN communication links. Each branch office LAN is connected to the WAN through the compression device. Similarly, the corporate headquarters also incorporates a compression device to extend the enterprise network to the branch offices. An enterprise with several branch offices requires a multi-gig compression device at the main data center to handle requests from all the static and mobile branch offices.


Figure 4-2: Optimized Enterprise Network Extension for Branch Offices

This thesis focuses on two broad categories of enterprise applications for the proposed device. The first category addresses the extension of the enterprise network to fixed and mobile branch offices, where a variety of computing devices form a LAN or wireless LAN (WLAN) within the branch office; the device acts as a gateway connecting these devices with the enterprise network via a broadband communication link. We have incorporated a compression layer and multiple LAN/WAN interfaces in our already published work [2] [3] [4]. The second category provides a data compression and decompression layer for the storage area network (SAN) in the enterprise architecture. In this research, different lossless data compression architectures are proposed with varying data rates for branch offices and enterprise network concentration points. The compression ratio, i.e. the factor by which the compression architecture reduces the size of the data stored or transmitted, depends on the type of data. For example, ASCII text, which is
inefficient in its use of bits, is highly compressible, whereas Secure Sockets Layer (SSL) encrypted traffic, which is intentionally obfuscated to have less repeatable data patterns, is less compressible. Thus the performance of the same architecture for different types of data varies.
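The dependence of compression ratio on data type is easy to demonstrate with any LZ77-family codec; the following small sketch uses Python's standard zlib module (DEFLATE, an LZ77 variant) to compare redundant text against pseudo-random bytes standing in for encrypted traffic. It is illustrative only and does not model the thesis hardware.

```python
import os
import zlib

# Illustrative sketch: the ratio achieved by an LZ77-family codec
# (zlib/DEFLATE here) depends heavily on the redundancy of the input.

def compression_ratio(data: bytes) -> float:
    """Original size divided by compressed size (higher = more compressible)."""
    return len(data) / len(zlib.compress(data))

# Redundant ASCII text compresses very well...
text = b"the quick brown fox jumps over the lazy dog " * 200
# ...while pseudo-random bytes (a stand-in for SSL-encrypted traffic) do not.
noise = os.urandom(len(text))

assert compression_ratio(text) > 10.0
assert compression_ratio(noise) < 1.1
```

The same effect explains why the performance of one compression architecture varies across traffic types.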

4.2 Compression Device


The top level architecture of the multi-gig compression device is shown in Figure 4-3. The device is equipped with multiple communication interfaces for LAN, WAN and SAN connectivity.

Figure 4-3: Compression Device Top Level Architecture

The 10G Ethernet interface connects the device to the enterprise LAN. Two 100/1000 Ethernet interfaces extend the enterprise network to branch offices. Two Small Form-Factor Pluggable (SFP) transceivers are embedded in the device to connect STM1/STM4/STM16 lines, either to reach the remote offices or to interface with the SAN over the fiber channel protocol. The broadband interfaces, CDMA2000 and WiMAX, are incorporated so that the device can be used as the mobile
compression device in the mobile branch offices. The device can be configured as a mobile hotspot in a mobile branch office, depending on the available broadband link; the mobile compression device selects the optimum link to connect with the enterprise network. The Interconnect routing layer exchanges data between interfaces. The Interconnect Engine serves multiple layers for exchanging data between interfaces. The Interconnect Engine comprises of a scheduler, DMA, a task queue, decoder and a layer interface. The scheduler schedules the tasks for different layers of compression device in an Interconnect Engines task queue. The task fields include Layer ID, Task ID, type of Algorithm and destination LAN/WAN interface ID. The scheduler maintains the task in an array of structures called DMA structure or descriptor table. There are two such DMA structure tables for each layer of compression device, one for compression engine and the other for decompression engine. Due to limited size of the local memories and possibility of large sets of data packet, several structures are maintained for a task in order to process the data packet in chunks. The Interconnect Engine has two DMA channels for each layer; CH0 is dedicated for compression engine and the CH1 for decompression engine. Every layer of the compression device has two base address registers, containing the starting address of the compression engine and decompression engine memory areas. Each of the DMA channels maintain Channel Control Register 0 for the external memory address and Channel Control Register 1 for holding internal memory address, size of transfer and length of link list. The information required to fill the control registers resides in the structure table. The DMA performs read-in transaction from external to engines local memory buffer and write-back transaction from local to external memory. After processing the data, the engine interface requests DMA for write-back to the destination interface. 
The Interconnect Engine interrupts the scheduler on task completion. Specially designed packet queues are incorporated to minimize delays. The configuration and management interface is employed to configure the communication interfaces, packet queue sizes and compression parameters. Different architectures of the compressor are presented to
handle multiple channels and varying data rates. The device can be configured to act as a multi-gig compression device, compressing and decompressing 10 Gbps data for backup and recovery in the SAN. Our published work [5] [6] incorporates a security layer to encrypt and decrypt the data with the Advanced Encryption Standard (AES). The security layer can be instantiated in the device to secure the communication channel as well as to cipher the data in the SAN.

4.3 High Throughput Compression Architecture


Our published research [2] is based on a single layer of compression. The use of multiple instances of processors is a widely adopted approach to performing computationally intensive algorithms [7]. Our device architecture also uses this approach and provides a scalable and modular hardware solution. The general computational requirements of a class of algorithms are considered so that the architecture works for other compression algorithms besides LZ77. A data movement strategy is designed that works best for the multilayer architecture of compression algorithms. The configurability and modularity of the design make the compression device universal: it can be configured for any number of layers and any type of LAN and WAN interfaces. The building blocks for finalizing the high throughput architectures were discussed in the preceding chapter. The high speed multiple stream pipelined architecture is shown in Figure 4-4. It is an extension of the high speed single stream pipelined architecture discussed in the previous chapter. Depending upon the requirements of the SAN or WAN connectivity, multiple layers of the single layer architecture are employed. Each layer can cater for a different type of data, or different streams of data are assigned to the respective layers for compression and decompression.
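For reference, the per-step search that each hardware layer parallelizes can be expressed in software. The following naive pure-Python sketch of the LZ77 longest-match computation is illustrative only; it is not the thesis RTL, and overlapping (run-on) matches are not modeled.

```python
# Minimal software reference for the per-step LZ77 search the hardware
# performs with massively parallel comparisons: find the longest match of
# the lookahead in the history buffer and report a (pointer, length) pair.

def longest_match(history: bytes, lookahead: bytes):
    """Return (offset_back_from_end_of_history, length) of the best match."""
    best_ptr, best_len = 0, 0
    for start in range(len(history)):
        length = 0
        while (length < len(lookahead)
               and start + length < len(history)
               and history[start + length] == lookahead[length]):
            length += 1
        if length > best_len:
            best_ptr, best_len = len(history) - start, length
    return best_ptr, best_len

ptr, mlen = longest_match(b"abracad", b"abra!")
assert (ptr, mlen) == (7, 4)   # "abra" found 7 symbols back
```

The hardware architectures above evaluate all of these symbol comparisons concurrently, which is what the comparison columns and best-match logic implement.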


Figure 4-4: High Speed Multiple Stream Pipelined Architecture

As already mentioned, this is primarily a pipelined architecture: registers are incorporated to reduce the critical path through the massively parallel comparisons. The initial latency introduced by the pipeline registers is easily offset by the high throughput of the multiple layer structure. This architecture is well suited to applications that require high compression ratios, since longer history buffer lengths can be accommodated at the cost of initial latency. The IP/protocol de-multiplexer separates the different IP data streams and schedules them on the layers of the device depending on the workload of each layer. The architecture shown in Figure 4-5 is best suited to very high throughput applications. The history buffer length is not very long in this
architecture, and it caters for future iterations of the algorithm in the same clock cycle; therefore, more than one codeword can be produced in a single clock cycle. Different layers of the architecture handle different streams of data, which further enhances the throughput of the design.

Figure 4-5: High Speed Multiple Stream Unfolded Architecture

4.4 Conclusions
The research presented different architectures of the compression device. The presented architectures are layer based and fully scalable. Depending on the data rates, multiple layers of compression engines can be instantiated to meet the required throughput. The device is equipped with multiple interfaces to handle data rates from STM1 to 10 Gbps. The presented architecture finds applications in compressing big pipes of data to conserve bandwidth. The device can also be utilized in a storage area network to compress and decompress data in real time.

The presented architecture can also be employed to implement LZW and other variants of lossless data compression.

4.5 References

[1] Zaheer Ahmed, Rizwana Mehboob, Shoab Khan, Habibullah Jamal, "Decentralized GPS Aware MANET for Mission Critical Applications," Conference Proceedings M2USIC.
[2] Rizwana Mehboob, Shoab A. Khan, Zaheer Ahmed, "High Speed Lossless Data Compression Architecture," 10th IEEE Multi-Topic Conference INMIC 2006, Islamabad.
[3] Zaheer Ahmed, Habibullah Jamal, Rizwana Mehboob, and Shoab A. Khan, "A Navigation Device with MAC Supporting Multiple Physical Networks for Extended Coverage and Operations," IEEE Transactions on Consumer Electronics, Vol. 54, No. 3, August 2008.
[4] Zaheer Ahmed, Habibullah Jamal, Shoab A. Khan, Rizwana Mehboob, and Asrar Ashraf, "Cognitive Communication Device for Vehicular Networking," IEEE Transactions on Consumer Electronics, May 2009.
[5] Zaheer Ahmed, Sheikh M. Farhan, Rizwana Mehboob, Shoab Khan, Habibullah Jamal, "Configurable Network for Mobile System," Conference Proceedings ICEEC 2004.
[6] Zaheer Ahmed, Rizwana Mehboob, Shoab Khan, Habibullah Jamal, "Dynamic Mission Critical Secure Network for Mobile System," Conference Proceedings ICM 2004.
[7] "Intel Virtualization Technology in Embedded and Communication Infrastructure Applications," Intel Technology Journal, Volume 10, Issue 3, 2006.


CHAPTER 5

5 Constant Coefficient Digital FIR Filters: Design and Implementation


The proposed research on evolving parallel architectures that exploit algorithmic modifications in the context of data manipulation is focused on digital FIR filters. A novel technique that exploits the manipulation of filter coefficients is part of this sequel and follows in the next chapter. This chapter primarily serves as the background study necessary for devising the efficient filter design methodology. The constant coefficient digital FIR filter is one of the most widely used signal processing algorithms, employed in diverse audio, video, handheld and many other data communication applications. Filter design comprises selecting the filter coefficients such that the desired band of input signal frequencies is passed while the unwanted band of frequencies is attenuated to a predefined threshold level. Filters are rightly termed signal conditioners and are employed to separate noise or interference from a signal or to restore distorted signals [1] [2]. The coefficients of a filter represent the impulse response of that filter. Digital systems such as FIR or IIR filters are realized either in software on general purpose computers or digital signal processors, or as standalone hardware using FPGAs or ASICs. The hardware implementation on an FPGA platform provides enhanced performance in terms of speed, power and area, while offering the flexibility of being scalable and reconfigurable [2] [3]. The relevant details of digital FIR filters are discussed in this chapter for an understanding of the issues and problems encountered in designing a novel implementation methodology. The proposed methodology takes these design basics into consideration and proposes a solution to
overcome the bottleneck for a high speed and area efficient architecture, discussed in chapter 6.

5.1 Overview of Digital FIR Filter


Digital FIR filtering involves multiplying the coefficients of a filter with the input samples and adding all these products to obtain the desired output sample. The FIR filter is a linear time invariant (LTI) system and essentially a non-recursive feed-forward structure. The output of the filter is expressed as a convolution of the digital input with the filter coefficients (the impulse response of the filter). A simple FIR structure is shown in Figure 5-1 [4] [5].
Figure 5-1: A direct form FIR Filter

The output y[n] can be calculated using the following equation:

y[n] = b0 x[n] + b1 x[n-1] + ... + b(M-1) x[n-M+1] + bM x[n-M]        (5-1)


In equation 5-1, the output samples y[n] are weighted sums of the present and previous input samples, the weights being the coefficients of the FIR filter. The FIR filter is also termed a tapped delay line, the delays being represented by z^(-1). Writing the filter equation in closed form yields a convolution sum.
y[n] = Σ (k = 0 to M) b[k] x[n-k]        (5-2)
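The convolution sum maps directly to software; the following minimal pure-Python sketch of a direct-form FIR filter is illustrative only, assuming x[n] = 0 for n < 0.

```python
# Direct evaluation of the convolution sum y[n] = sum over k of b[k] x[n-k]
# (pure-Python sketch; samples before the start of the input are taken as 0).

def fir_filter(b, x):
    M = len(b) - 1                      # filter order (M+1 taps)
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(M + 1):
            if n - k >= 0:
                acc += b[k] * x[n - k]  # weighted sum of current/past inputs
        y.append(acc)
    return y

# A 2-tap moving average: each output is the mean of two adjacent inputs.
y = fir_filter([0.5, 0.5], [1.0, 1.0, 3.0, 3.0])
assert y == [0.5, 1.0, 2.0, 3.0]
```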

From equation 5-2, we see that a simple linear time invariant, causal digital FIR filter is written in the form of a non-recursive, constant
coefficient difference equation. The FIR digital filter, whose output is y[n] given the input x[n], has impulse response h[n], which is a sequence of M+1 numbers on the interval 0 to M, represented by the weights or coefficients b[k] in equations 5-1 and 5-2. The impulse response h[n] is the time-domain representation of the filter and is given by equation 5-3.

h[n] = b[n] for 0 ≤ n ≤ M, and h[n] = 0 otherwise        (5-3)

The discrete-time Fourier transform of the impulse response h[n] is termed the frequency response of the filter and is given by equation 5-4.

H(e^jω) = Σ (n = 0 to M) h[n] e^(-jωn)        (5-4)

Similarly, the z-transform of h[n] is termed the system function of the filter and is given by equation 5-5.

H(z) = Σ (n = 0 to M) h[n] z^(-n)        (5-5)

FIR filters do not suffer from stability problems; in the absence of recursions or feedback paths, they are inherently stable structures. The coefficients of the filter are usually symmetric or antisymmetric, imparting the important linear phase property, which implies that all frequencies are delayed by the same amount at the output. The length of the filter is equal to the number of taps or coefficients. The order of the filter is one less than the filter length: if N is the filter length, the order M is M = N-1. The greater the length of the filter, the sharper the transition band (the steeper the roll-off), but with the trade-off of increased computational complexity and longer delay. The delay of an FIR filter, as mentioned in [6], is proportional to the number of taps: delay = 0.5 * (number of taps) / (sampling frequency)
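The delay formula quoted from [6] is easy to apply; a small illustrative helper (the function name is ours, not from the source):

```python
# Group delay of a linear-phase FIR filter: 0.5 * taps / fs seconds,
# i.e. (N-1)/2 samples for an N-tap filter (per the rule quoted from [6]).

def fir_delay_seconds(num_taps: int, fs_hz: float) -> float:
    return 0.5 * num_taps / fs_hz

# A 101-tap filter running at 8 kHz delays the signal by about 6.3 ms.
assert abs(fir_delay_seconds(101, 8000) - 0.0063125) < 1e-12
```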


One of the major issues in implementing this type of linear phase, stable filter, especially as a standalone hardware structure, is the computational complexity. The complexity arises from the computationally intensive multiplication process, which is the bottleneck for efficient implementation of filters in hardware. Chapter 6 lists some of the methods that address this problem and describes the novel methodology devised for efficient realization of FIR filters. The other details considered necessary for devising the design methodology are provided below.

5.1.1 Types of Digital FIR Filters


The highest allowable frequency in the digital domain is dictated by the Nyquist theorem and is half the sampling frequency fs. Filters are classified on the basis of their frequency response; there are primarily four types of FIR filters. Low-pass (LP) filters pass the lower frequencies and attenuate the higher band of frequencies up to fs/2. High-pass (HP) filters, in contrast, reject frequencies from 0 up to a required band and pass the higher frequencies up to fs/2. Band-pass (BP) filters pass a selected band of the input frequencies while rejecting the remaining frequencies. Finally, band-stop or band-reject (BS or BR) filters reject a specified band of frequencies from the input signal. All types of filters can be derived from the low-pass version of the design [1].

5.1.2 Structures of Digital FIR Filters


Structures are networks of delay elements, multipliers and adders used to define a filter. Alternatively, a structure is a graphical representation, such as a signal flow graph or a block diagram as depicted in Figure 5-1, that represents an algorithm (like an FIR filter) for realizing a linear time invariant system, especially in hardware. Different arrangements of the elements comprising a filter give rise to different algorithms or structures while maintaining the same input-output
relationship. The digital FIR filters are classified as direct form, transposed direct form, cascade and lattice structures, in which the delays, multipliers and adders are arranged differently. The different structures are equivalent, but the computational complexity in terms of mathematical operations varies considerably for hardware implementation. Some structures offer economy in multiplications and some in delay or memory elements. As seen from the convolution equation of the time-domain representation, an FIR filter is an all-zero structure in the absence of the recursion that induces poles. The all-zero structure makes it a generalized linear phase structure, and the impulse response is either symmetric or antisymmetric as given by equation 5-6.

h[n] = h[M-n]        (5-6)
M is the order of the filter and M+1 is the length of the filter. Therefore, using the linear phase property, the direct form realization of equation 5-1 can be implemented with (M+1)/2 multipliers when M is odd and (M/2 + 1) multipliers when M is even. The reduction in the number of multiplications saves area in the hardware realization. Each type of structure and its mathematical formulation is described in detail in the texts [4] [5].
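The saving from coefficient symmetry can be sketched in software: pre-adding the sample pairs that share a coefficient leaves one multiplication per pair. The following pure-Python illustration is a behavioral sketch, not a hardware description.

```python
# Exploiting linear-phase symmetry h[n] = h[M-n]: pre-add the sample pairs
# that share a coefficient so only about half the multiplications remain.

def symmetric_fir(b, x):
    M = len(b) - 1
    assert all(b[k] == b[M - k] for k in range(M + 1)), "linear phase required"
    y = []
    for n in range(len(x)):
        xs = [x[n - k] if n - k >= 0 else 0.0 for k in range(M + 1)]
        acc = 0.0
        for k in range((M + 1) // 2):
            acc += b[k] * (xs[k] + xs[M - k])   # one multiply per pair
        if M % 2 == 0:
            acc += b[M // 2] * xs[M // 2]       # centre tap when M is even
        y.append(acc)
    return y

b = [1.0, 2.0, 1.0]   # symmetric, M = 2: two multipliers instead of three
assert symmetric_fir(b, [1.0, 0.0, 0.0]) == [1.0, 2.0, 1.0]
```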

5.1.3 Design Methods of Digital FIR Filters


Filters are also classified according to the method employed for their design. There are primarily three methods for designing FIR digital filters.

5.1.3.1 Window Design Method

This is classified as an analytical, sub-optimal method in which the desired frequency response of the filter is selected and expressed as Fourier series coefficients. The inverse Fourier transform applied to the required frequency response yields an infinitely long impulse response, i.e. the coefficients of the required filter. In order to obtain a finite length filter, the response is either
truncated or multiplied by a time-limited smoothing window that minimizes the Gibbs phenomenon at the discontinuities. This multiplication in the time domain corresponds to periodic convolution of the desired frequency response with the window's frequency response, resulting in a smearing of the frequency response obtained by the window method [4] [7] [8]. Among the most commonly applied windows are the Kaiser, Rectangular, Hamming, Hanning, Bartlett and Blackman windows [4] [5].
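The window method amounts to multiplying a shifted ideal (sinc) impulse response by a finite window. An illustrative pure-Python design routine (Hamming window, cutoff wc in radians per sample; the function name is ours):

```python
import math

# Window-method sketch: truncate the ideal lowpass impulse response
# (a shifted sinc) and multiply by a Hamming window. wc is the cutoff
# in radians/sample; M is the filter order (M+1 taps).

def lowpass_window_design(wc, M):
    h = []
    for n in range(M + 1):
        m = n - M / 2                             # centre for linear phase
        ideal = wc / math.pi if m == 0 else math.sin(wc * m) / (math.pi * m)
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / M)   # Hamming window
        h.append(ideal * w)
    return h

h = lowpass_window_design(math.pi / 2, 20)
assert len(h) == 21
assert all(abs(h[n] - h[20 - n]) < 1e-12 for n in range(21))  # linear phase
assert abs(sum(h) - 1.0) < 0.05     # DC gain of a lowpass is close to 1
```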

5.1.3.2 Frequency Sampling Techniques


In this case, equally spaced samples of the desired frequency response are considered. The inverse Fourier transform, when applied to these equally spaced samples, yields the impulse response h[n] of the desired filter [4] [9]. The frequency response is specified by fixing most of the frequency samples or discrete Fourier transform coefficients in the pass band, while the coefficients in the transition band are chosen by an optimization algorithm that minimizes the weighted approximation error over that frequency range [7] [9] [10].

5.1.3.3 Optimal FIR Filter Design


One of the most commonly used methods is the equiripple FIR filter design method. Also known as the optimal design, it is based on the Parks-McClellan method, which implements the Remez algorithm based on linear programming [4] [7] [11] [12]. This is a computationally intensive method in which the weighted maximum error between the responses of the actual and the desired filters is made even across both the pass and the stop bands, reducing the ripples in the respective frequency bands [12].

5.2 Hardware Implementation Issues


The digital FIR filter is realized in software on a digital signal processor or a general purpose microprocessor, or in hardware on an application specific integrated circuit or an FPGA, depending upon the type of application. The latter choice speeds up the filtering process and is especially useful in handheld and portable devices
like cell phones and PDAs. In the majority of filtering applications, real-time, high throughput, parallel and area efficient hardware implementation requires that fixed point numbers be used instead of floating point numbers. Utilizing finite precision arithmetic is also a cost effective and economically viable solution. The input signals and the coefficients of the digital filters are represented as finite-precision binary numbers, because the multiplications in FIR filtering are very expensive; they are carried out in binary arithmetic and the results are truncated to a fixed number of bits. In finite precision arithmetic using fixed point numbers, values are quantized to a fixed number of bits either by rounding or by truncation. Different types of errors are introduced at different locations in FIR filters. There are primarily three types of quantization errors [13] [14] [15] [16], as described below.

First, digital filters operate on digital input signal samples and suffer from quantization errors introduced by the analog to digital conversion (ADC) process. The analog values are real values correctly expressed as floating point numbers, but errors are introduced when they are represented as finite precision fixed point binary numbers.

Second, overflow/underflow and round-off errors occur as a result of the addition/subtraction and multiplication operations, respectively, in the intermediate calculations of the filtering operations. Overflow/underflow quantization errors occur because the addition or subtraction of two similarly signed numbers may result in a value that cannot be expressed in the allowed number of bits. Similarly, rounding or truncating the product of two numbers to a fixed number of bits leads to round-off quantization errors.

Third, errors arise from the quantization of the real-valued floating point coefficients when their values are expressed as fixed point binary numbers. The coefficients as well as the inputs are expressed as fixed point numbers, which implies a fixed number of bits to represent each coefficient's numerical value. Similarly, for input sequences, analog signals are converted to a sequence of samples by the continuous to discrete time converter in a process termed sampling, and the resulting signals are discrete-time samples of continuous amplitude. These discrete-time samples can be expressed with infinite precision, but for processing on general purpose processors in software or on specific circuits in hardware, the amplitudes must be of finite precision. Thus the quantization process converts discrete-time samples of infinite precision to discrete values, i.e. it discretizes the amplitude [17]. Therefore, quantization affects both the input signal and the coefficients of an FIR filter. Quantization is a non-linear process [4]; its application to the digital input data is depicted below:

Figure 5-2: Sampling and Quantization

The continuous time analog input xa(t) is converted to discrete-time samples x[n] of theoretically infinite precision, and the quantizer stage then converts these to practically realizable samples with values from a finite set of numbers, as shown in Figure 5-2. The finite word length implies a fixed number of binary bits for representing a range of numbers. Quantizers can be either uniform or non-uniform, but in practical cases uniform quantizers are applied, with the same step size for each binary value. Both x[n] and y[n] in equation 5-1 are processed as finite precision numbers with fixed point representations, in addition to h[n].
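Coefficient quantization as described can be sketched directly. The following illustrative Python helper (names are ours) rounds a real coefficient to FwL fractional bits with a single sign/integer bit, i.e. signed values in [-1, 1), and saturates on overflow:

```python
# Fixed-point quantization sketch: round a real value to FwL fractional
# bits (IwL = 1 sign bit, representable range [-1, 1 - step]).

def quantize(value, FwL):
    step = 2.0 ** -FwL                      # quantization step (resolution)
    q = round(value / step) * step          # round to the nearest level
    lo, hi = -1.0, 1.0 - step               # representable range for IwL = 1
    return max(lo, min(hi, q))              # saturate on overflow

c = 0.3141592653589793
q8 = quantize(c, 8)
assert abs(c - q8) <= 2.0 ** -9             # error bounded by half a step
assert quantize(1.7, 8) == 1.0 - 2.0 ** -8  # overflow saturates at the top
```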

5.2.1 Finite Word-length Effects


Word length decision is extremely important in fixing the number of bits for the input samples, coefficients, intermediate operations and final outputs: smaller word lengths deteriorate the signal quality and output, while excessive word lengths incur

costs in terms of area and slower speeds. To understand the effects of errors introduced in the coefficients, a discussion of number representation and the bit requirements defining the word lengths is essential. The number system employed in most FIR digital filter implementations is two's complement. A finite length data word has an integer part as well as a fractional part. The total number of bits for a fixed point number is the word length (wL), which comprises the integer word length (IwL) and the fractional word length (FwL) [18], written as equation 5-7.

wL = IwL + FwL        (5-7)


The range X of data for a fixed point number of word length greater than zero can be expressed as:

0 ≤ X ≤ 2^IwL - 2^(-FwL)        for unsigned numbers

-2^(IwL-1) ≤ X ≤ 2^(IwL-1) - 2^(-FwL)        for signed numbers
The unwanted overflows and underflows are prevented by appropriate assignment of the number of bits according to the range of the signals. The IwL is determined by monitoring the minimum and maximum deviations in the signal of interest [18] [19]; for a signal with maximum integer part X, it can be expressed as

IwL ≥ ⌈log2 X⌉

For FIR digital filters, the coefficient values vary from a maximum negative value of -1 to positive values less than +1; for such signed fractional numbers, where the sign bit is the only integer bit, the integer word length IwL is 1. The quantization step (Δ) is the least significant bit of the codeword that represents the quantized value [4]. The quantization step, or the resolution of a number, is expressed as

Δ = 2^(-FwL)

where FwL is the number of bits representing the fractional part of the number. The number of quantization levels depends on the number of bits of representation. For 200 levels of quantization, 8 bits are required, since 7 bits can encode a maximum of only 128 (2^7) discrete values. The fractional word length is therefore given as:


FwL ≥ ⌈log2(number of quantization levels)⌉
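The word-length relations above can be checked numerically; a small illustrative sketch (the function names are ours):

```python
import math

# Word-length bookkeeping from the relations above: IwL covers the signal
# range, FwL covers the required resolution, and wL = IwL + FwL.

def integer_bits(max_abs_value):
    """Smallest IwL whose 2**IwL unsigned codes cover 0..max_abs_value."""
    return max(1, math.ceil(math.log2(max_abs_value + 1)))

def fractional_bits(levels):
    """Smallest FwL whose 2**FwL levels provide the requested resolution."""
    return math.ceil(math.log2(levels))

assert fractional_bits(200) == 8      # 7 bits give only 128 levels
assert fractional_bits(128) == 7
step = 2.0 ** -fractional_bits(200)
assert step == 2.0 ** -8              # quantization step: 2**(-FwL)
```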


The error in the quantized output of the quantizer whose input is a discrete time signal x[n] as shown in Figure 5-2 is given by the equation 5-8.

e[n] = x[n] - xq[n]        (5-8)


The error e[n] is treated as a random signal or additive white noise by modeling it as a random variable with no correlation between samples. A standard derivation relates the word length to the signal to noise ratio: the SNR increases by a factor of 6 dB for every bit added to the word length of the quantized sample [4].
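The 6 dB-per-bit rule can be verified numerically; the following illustrative Monte Carlo sketch quantizes a uniform random signal at two word lengths and compares the measured SNRs:

```python
import math
import random

# Numerical check of the ~6 dB-per-bit rule: quantization SNR of a
# full-scale uniform signal improves by about 6.02 dB per added bit.

def quant_snr_db(bits, n=200_000, seed=1):
    rng = random.Random(seed)
    step = 2.0 ** -(bits - 1)            # signed values in [-1, 1)
    ps = pe = 0.0
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        e = x - round(x / step) * step   # quantization error
        ps += x * x
        pe += e * e
    return 10.0 * math.log10(ps / pe)

gain = quant_snr_db(12) - quant_snr_db(11)   # one extra bit
assert abs(gain - 6.02) < 0.3
```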

5.3 Digital FIR Filter Design Parameters


The FIR filter design implies finding its impulse response h[n] or its frequency response H(e^jω) in the interval (-π, π), as H(e^jω) is a periodic function with period 2π. Since the FIR filter is symmetric or antisymmetric, the response is usually specified in the interval (0, π).
Figure 5-3: Design Specifications of a Digital Filter

The design parameters are usually specified in the frequency domain [4][5]. The frequency response of a digital FIR filter of order M is written as

H(e^jω) = Σ h[n] e^(−jωn),  with the sum over n = 0 to M

The frequency response is a complex quantity having a magnitude and a phase, but in designing FIR digital filters only the magnitude |H(e^jω)| is stated in the design specifications, as shown in Figure 5-3. A filter passes a band of frequencies in the range −ωp to ωp, known as the pass band, where ideally the magnitude response should be unity. For a linear-phase system the response is symmetric about the origin, so the pass band is specified as 0 to ωp. The allowable variation of the magnitude response in the pass band is the pass band ripple δp, such that

1 − δp ≤ |H(e^jω)| ≤ 1 + δp,  for |ω| ≤ ωp

Similarly, the response of the filter is attenuated to at most δs (the stop band ripple) in the stop band, that is, over (−π, −ωs) ∪ (ωs, π), or (ωs, π) for a symmetric filter, such that

|H(e^jω)| ≤ δs,  for ωs ≤ |ω| ≤ π

The terms δp and δs are real numbers, typically much less than 1. The transition band is the band of frequencies from ωp to ωs; ideally it should be as small as possible, and the smaller the transition band, the higher the filter order.
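These ripple parameters map directly onto the dB figures usually quoted for filter designs; a quick sketch of the conversion (helper names are illustrative):

```python
import math

def passband_ripple_db(delta_p):
    """Peak-to-peak pass band ripple in dB for a tolerance of +/- delta_p."""
    return 20.0 * math.log10((1.0 + delta_p) / (1.0 - delta_p))

def stopband_attenuation_db(delta_s):
    """Minimum stop band attenuation in dB for a stop band ripple of delta_s."""
    return -20.0 * math.log10(delta_s)

a_s = stopband_attenuation_db(0.001)  # 60 dB
a_p = passband_ripple_db(0.1)         # about 1.74 dB
```

For example, a stop band ripple of δs = 0.001, as used for the designs later in the thesis, corresponds to 60 dB of stop band attenuation.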

5.4 Conclusions
The chapter outlined the pertinent details of constant-coefficient FIR digital filters, which are extensively employed in many consumer devices. This material serves as the basis for the methodology for efficient realization of FIR filters presented in Chapter 6, and the issues encountered in devising an efficient implementation have been outlined.

5.5 References
[1] S. W. Smith, The Scientist and Engineer's Guide to Digital Signal Processing, California Technical Publishing, 1997-2006, www.dspguide.com/copyrite.html
[2] K. A. Vinger and J. Torresen, Implementing Evolution of FIR-Filters Efficiently in an FPGA, Proceedings of the NASA/DoD Conference on Evolvable Hardware, July 2003.
[3] Shanthala S. and S. Y. Kulkarni, High Speed and Low Power FPGA Implementation of FIR Filter for DSP Applications, European Journal of Scientific Research, ISSN 1450-216X, Vol. 31, No. 1, pp. 19-28, 2009, http://www.eurojournals.com/ejsr.html
[4] A. V. Oppenheim, R. W. Schafer and J. R. Buck, Discrete-Time Signal Processing, Second Edition, Prentice-Hall, Inc., 1989.
[5] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms and Applications, Third Edition, Prentice Hall, Inc., 1996.
[6] FIR and IIR Filter Design Guide, Frequency Devices, Inc., http://www.freqdev.com
[7] L. R. Rabiner, J. H. McClellan and T. W. Parks, FIR Digital Filter Design Techniques Using Weighted Chebyshev Approximation, Proceedings of the IEEE, Vol. 63, No. 4, April 1975.
[8] G. Hands, Selecting FPGAs for FIR Filter Implementation, www.latticesemi.com
[9] L. Rabiner and R. Schafer, Recursive and Nonrecursive Realizations of Digital Filters Designed by Frequency Sampling Techniques, IEEE Transactions on Audio and Electroacoustics, Vol. AU-20, March 1972.
[10] C. S. Burrus, FIR Filter Design by Frequency Sampling or Interpolation, Connexions Module m16891, http://cnx.org/content/m16891/latest/
[11] L. Rabiner, Linear Program Design of Finite Impulse Response (FIR) Digital Filters, IEEE Transactions on Audio and Electroacoustics, Vol. AU-20, Oct. 1972.
[12] J. R. Treichler, Notes on the Design of Optimal FIR Filters, Applied Signal Technology, Inc., 2006.
[13] J. Petrone, Adaptive Filter Architectures for FPGA Implementation, Master's thesis, Florida State University, 2000.
[14] J. B. Knowles and E. M. Olcayto, Coefficient Accuracy and Digital Filter Response, IEEE Transactions on Circuit Theory, Vol. CT-15, No. 1, 1968.
[15] B. Gold and C. M. Rader, Effects of Quantization Noise in Digital Filters, Proceedings of the AFIPS Joint Computer Conference, Boston, Massachusetts, April 26-28, 1966, pp. 213-219.
[16] C. J. Weinstein, Quantization Effects in Digital Filters, Technical Report AD0706862, Massachusetts Institute of Technology, Lincoln Laboratory, Lexington, 21 Nov 1969.
[17] B. Hunting, Finite Word Length Effects on Digital Filter Implementations, www.embedded.com
[18] K. Han, Automating Transformation from Floating-Point to Fixed-Point for Implementing Digital Signal Processing Algorithms, PhD dissertation, University of Texas at Austin, 2006.
[19] S. Kim and W. Sung, A Floating-Point to Fixed-Point Assembly Program Translator for the TMS320C25, IEEE Transactions on Circuits and Systems, Vol. 41, No. 11, 1994.

CHAPTER 6

6 Hardware Efficient Filter Implementation


This chapter deals with a new paradigm for optimizing architectures in which algorithmic modifications involving data perturbations are tolerated and therefore incorporated to obtain the desired results. The target architecture is that of the constant-coefficient FIR digital filter, in which the coefficients are perturbed to a certain degree, in terms of both their length and their numerical values, to obtain an optimized hardware implementation. This is the second application, as claimed in the thesis summary and introduction, where the architectural optimization evolved does not follow the standard optimization techniques, mathematical number representation methods, or the many other techniques reported in the literature. FIR filters, extensively used in wired and satellite/wireless communications, video and audio processing, and hand-held devices, are preferred because of their stability and linear-phase properties. The main drawback of FIR digital filter implementation is the computational complexity due to the large number of multiplications, which are the major resource-consuming operations in terms of execution speed, area and power consumption [1]-[4].

Many common hardware-optimized implementations represent the coefficients in canonic signed digit (CSD) or minimum signed digit (MSD) form, which reduces the number of non-zero bits in the coefficients so that the multiplications can simply be converted to shifts and adds [1][2][5]-[9]. The shift operation is cheap in hardware, so the minimization of additions is termed the multiple constant multiplication (MCM) problem. Similarly, common or repeated bit patterns in the filter terms can be eliminated by another technique in vogue, termed common subexpression elimination (CSE) [1][9]-[12]. Other methods include the extrapolated impulse response [13][14], the frequency response masking approach [15][16], predictive coding of coefficient values [17], coefficient over-sampling [18] and coefficient thinning [19], to mention a few. Having studied a variety of optimization methods, this research focused on devising another methodology that caters for efficient realization of FIR filters in FPGAs or ASICs. In this chapter, we propose a novel design methodology for an area- and consequently power-efficient implementation of an optimized FIR filter in hardware. The technique is primarily devised by incorporating the quantization effects on the filter coefficients.
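As an aside, the CSD recoding mentioned above can be sketched for non-negative integer coefficients; the resulting digits then drive a shift-and-add/subtract multiplier, as in hardware (function names are illustrative):

```python
def to_csd_digits(x):
    """Canonic signed digit (non-adjacent form) of a non-negative integer:
    digits in {-1, 0, +1}, LSB first, with no two adjacent non-zero digits."""
    digits = []
    while x != 0:
        if x & 1:
            d = 2 - (x & 3)   # +1 if x = 1 (mod 4), -1 if x = 3 (mod 4)
            x -= d
        else:
            d = 0
        digits.append(d)
        x //= 2
    return digits

def csd_multiply(x, coeff):
    """Constant multiplication realized as shifts plus adds/subtracts."""
    acc = 0
    for shift, d in enumerate(to_csd_digits(coeff)):
        if d:
            acc += d * (x << shift)
    return acc

# 23 = 10111 in binary (four non-zero bits) but needs only
# three non-zero CSD digits: 32 - 8 - 1.
digits = to_csd_digits(23)
assert sum(d << i for i, d in enumerate(digits)) == 23
```

The saving is exactly what MCM-style optimizations exploit: fewer non-zero digits means fewer adder terms per constant multiplication.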

6.1 Effects of Coefficient Quantization


Coefficient quantization is one of the sources of error in the hardware realization of digital FIR filters: error is introduced when the real-valued constant coefficients are expressed as fixed-point finite-precision numbers. These fixed-point binary numbers are represented in a finite number of bits, depending upon the range of the numbers and the required precision. Two's complement representation of finite-precision numbers is usually employed for FIR filter implementation, and we have used two's complement numbers in this research as well.
Figure 6-1: Quantization (a) by rounding (b) by truncation

The quantization error for filter coefficients represented as two's complement numbers, either by rounding or by truncation, is depicted in Figure 6-1. The error in converting a real value to a finite-precision value by rounding varies from −Δ/2 to Δ/2, where Δ is the quantization step. The quantization error due to truncation, by contrast, is always negative [3] and varies from −Δ to 0. In our design, we resorted to the latter approach of quantization by truncation of bits; in our approach the quantized coefficients are further truncated to smaller word lengths.
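The two error ranges can be verified with a short sketch (our own helper names; truncating the low-order bits of a two's complement word behaves as a floor operation):

```python
import math

STEP = 2.0 ** -7   # quantization step for 7 fractional bits

def quantize_round(x):
    return round(x / STEP) * STEP

def quantize_truncate(x):
    # dropping low-order bits of a two's complement word is a floor
    return math.floor(x / STEP) * STEP

for x in [0.7071, -0.4142, 0.3333, -0.9999]:
    e_round = quantize_round(x) - x      # stays within [-STEP/2, +STEP/2]
    e_trunc = quantize_truncate(x) - x   # always within (-STEP, 0]
    assert -STEP / 2 <= e_round <= STEP / 2
    assert -STEP < e_trunc <= 0
```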

6.1.1 Effect on Frequency Response of a FIR Filter


In the previous chapter, the system response of a digital FIR filter was given. If the filter coefficients h[n] are quantized and represented by hq[n], the quantization results in a frequency response with errors. Accounting for the error Δh[n] introduced into the coefficient values by quantization, the quantized coefficients can be written as

hq[n] = h[n] + Δh[n]

The quantized response Hq(z) can then be written, as mentioned in [3], as

Hq(z) = Σ hq[n] z^(−n) = H(z) + ΔH(z),  with the sum over n = 0 to M

where

ΔH(z) = Σ Δh[n] z^(−n),  over n = 0 to M

The focus of researchers is to minimize the hardware implementation cost of FIR filters in terms of area, speed, power consumption and computational complexity. The selection of a digital FIR filter structure, besides depending on these hardware implementation factors, also depends on the sensitivity of the particular structure to quantization errors. The FIR filter is an all-zero filter, and if the structure of a filter is such that its zeros are tightly packed, the errors due to quantization are more pronounced [3]. The direct-form structure is more robust to these errors, as its zeros are uniformly spaced around the unit circle. The quantization process is inevitable for a cost-effective and efficient realization of a digital FIR filter, and it affects the frequency response of the filter. In our research we studied the effects of coefficient quantization on the filter's frequency response, as discussed in the next section.

6.2 FIR Filters with varying Quantization Levels


In order to draw empirical conclusions about the effects of quantization on the frequency responses of different types of filters, we studied the effects of quantization by varying the number of quantization bits and analyzing the corresponding frequency responses. The filter design platform employed in this study is the Matlab Filter Design and Analysis (FDA) tool. A number of low pass, high pass, band pass and band stop filters are realized with different filter parameters and different design techniques, such as equiripple or a variety of window design methods. Each design is first realized in floating point and then converted to a quantized filter, and the frequency responses of these filters are plotted. The number of quantization bits is then reduced from 32 to 31, from 31 to 30, and so on; on each reduction, the corresponding frequency responses are plotted again. The results are tabulated for the different quantization levels, and the plots and the actual coefficient values of the filters realized at different numbers of quantization bits served as the insight for a new, efficient design methodology. In this study we experimented with all types of FIR filter, i.e. low pass, high pass, band pass and band stop implementations, which also serve as a basis for the methodology for efficient filter implementation in hardware described in the next section. Each of these filters is realized using a variety of design methods; furthermore, the specifications of the pass and stop band edges, stop band attenuation, etc. are also varied across the different filter realizations. For each filter type and design, the number of quantization bits is progressively reduced from 32 down to a value where the deterioration in the frequency response can no longer be tolerated. The results for low pass filters at different quantization levels are plotted below.



Figure 6-2: Responses of filters with different quantization levels

It is observed in Figure 6-2 that varying the quantization word length of the same filter produces frequency responses whose parameters vary. A double-precision floating-point realization is designed first and then quantized. In the example shown in Figure 6-2, the design method employed is a direct-form I, Kaiser-window low pass filter of order 37 with normalized ωp = 0.25 and ωs = 0.45, and with δp and δs set at 0.1 and 0.001 respectively. The frequency responses of the same filter at different quantization levels are plotted together in Figure 6-2.

Progressing from 32 bits down to 9 bits, the exercise of finding the frequency response for the next lower number of quantization bits was stopped, because the responses of the same filter remained practically the same in the pass band even when the number of quantization bits was reduced to half of the original, i.e. 16 bits compared to the initial 32. Similarly, in the stop band the attenuation remained within −60 dB even when the bits were reduced to fewer than half of the original 32. The stop band attenuation started to degrade below 13 bits and worsened appreciably beyond 12 bits. Based on this quantization study, we proceed to devise efficient implementations for FIR digital filters in the section to follow.

6.3 Proposed Design Methodology


The basis of this technique can be attributed to the study of the effects of quantization on the filter response described in the previous section. Our ultimate target is to map the design onto FPGAs and to minimize hardware resources at high speed. Considerable research has already been published on minimizing the size of filter implementations in hardware; here we propose a novel idea for minimizing the FIR filter area and hardware resources, and other techniques such as CSD, MSD or CSE can still be superimposed on the proposed method for further savings in area and increases in speed. In our approach, we initially design the filter with the required specifications and then manipulate the design to obtain the same filter with better performance in terms of the hardware resources and area consumed, thus providing power savings too. In the proposed scheme, a digital FIR filter with the required specifications is designed using the Matlab FDA tool. As the notation for a filter in the time domain is usually the letter h, we designate the initially designed filter horig. As a first step, therefore, a filter horig of the required specifications is designed. There is a choice of four filter types (low pass, high pass, band pass and band stop), and either a window-based or an equiripple design method can be selected; the FDA tool provides a wide range of choices for filter design. The resulting filter order, the floating-point coefficient values and the frequency response plot are obtained. The filter horig is termed the originally specified filter design. The coefficients of horig are then converted to finite-precision fixed-point numbers by quantizing at 32 bits, represented as [32 31], where the number 31 denotes the bits for the fractional part of a number.


The next step is to design another filter, termed the over-designed filter (hover_des), with the same filtering requirements as horig but with more stringent design parameter specifications, such as stop band attenuation or the width of the transition band. The resulting filter hover_des is obviously a design with over-designed specifications compared to horig. The number of coefficients of hover_des increases considerably, raising the order of the filter in keeping with its better frequency response: hover_des has a smaller transition band and greater stop band attenuation than horig, at the cost of increased filter order. The over-designed filter hover_des is initially quantized using 32 bits. In the proposed approach, we claim that if we reduce the number of quantization bits of the over-designed filter, its response matches or betters that of the original filter with considerably fewer bits than in horig [20]. The hover_des coefficients are therefore quantized by successively reducing the number of bits, and the response of each quantized design is plotted. Iterations are performed in which the quantization bits of the hover_des coefficients are reduced and the stop band is checked for overshoots; the reduction stops when the stop band attenuation exceeds the limit specified for horig. The responses of the filters are also plotted for visual inspection. To formalize the results, an iterative algorithm is developed that begins by quantizing hover_des at 32 bits. In every iteration, the number of bits of hover_des is reduced by one, the resulting response is plotted, and the transition bandwidth and stop band attenuation are analyzed.

The algorithm therefore finds the number of bits at which the frequency response of hover_des, now termed the optimized filter (hopt), matches that of horig, while any further reduction in the quantization bits produces a response worse than horig. This technique provides an optimized design for the filter horig, as the hardware resources required to implement a filter of the desired frequency response are reduced. A minimum number of quantization bits for hover_des is thus obtained at which its frequency response either matches or is slightly better than that of horig. The dissertation claims that hover_des coefficients quantized at an appreciably smaller number of bits have a frequency response that matches or betters the horig response quantized at 32 bits. This can easily be verified from the plots in Figure 6-3, Figure 6-4, Figure 6-5 and Figure 6-6 for low pass (LP), high pass (HP), band pass (BP) and band stop (BS) FIR filters. The results reveal that even when the number of bits of hover_des is reduced to fewer than half of the original 32, the resulting hopt response is better than that of the originally designed filter horig. Further reduction in the quantization bits still produces no appreciable worsening in the pass band, but the stop band attenuation and transition band width of the frequency response begin to deteriorate. Each figure shows three frequency responses: the original filter horig quantized at 32 bits, the over-designed filter hover_des quantized at 32 bits, and the optimized design hopt quantized at a much smaller number of bits. The number of bits for the optimized design differs across filter types, since it corresponds to the reduced number of bits of the over-designed filter at which its response matches the original filter.
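A minimal sketch of this iterative search is given below. A Blackman-windowed sinc design stands in for the Matlab FDA tool, truncation implements the bit reduction, and all function names and the −60 dB specification are illustrative assumptions, not the thesis's actual tool flow:

```python
import math

def lowpass_fir(order, wc):
    """Blackman-windowed sinc low pass filter with cutoff wc (rad/sample)."""
    h = []
    for n in range(order + 1):
        k = n - order / 2.0
        ideal = wc / math.pi if k == 0.0 else math.sin(wc * k) / (math.pi * k)
        win = (0.42 - 0.5 * math.cos(2 * math.pi * n / order)
               + 0.08 * math.cos(4 * math.pi * n / order))
        h.append(ideal * win)
    return h

def magnitude_db(h, w):
    re = sum(c * math.cos(w * n) for n, c in enumerate(h))
    im = sum(c * math.sin(w * n) for n, c in enumerate(h))
    return 10.0 * math.log10(re * re + im * im + 1e-300)

def stopband_peak_db(h, ws, points=200):
    """Worst-case magnitude over the stop band [ws, pi]."""
    grid = [ws + i * (math.pi - ws) / points for i in range(points + 1)]
    return max(magnitude_db(h, w) for w in grid)

def truncate_bits(h, bits):
    step = 2.0 ** -(bits - 1)          # 1 sign bit, bits-1 fraction bits
    return [math.floor(c / step) * step for c in h]

def optimize_bits(h_over, ws, spec_db, start_bits=32):
    """Reduce the coefficient word length one bit at a time until the
    stop band specification would be violated by the next reduction."""
    bits = start_bits
    while bits > 2 and stopband_peak_db(truncate_bits(h_over, bits - 1), ws) <= spec_db:
        bits -= 1
    return bits

# over-designed filter: higher order and narrower transition band than needed
h_over = lowpass_fir(80, 0.3 * math.pi)
opt = optimize_bits(h_over, ws=0.45 * math.pi, spec_db=-60.0)
# opt bits still meet the -60 dB stop band spec; opt-1 bits would not
```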


Figure 6-3: Filter Optimization for LP FIR Filter 1 (orig-design 32b, over-design 32b, opt-design 11b)

Figure 6-4: Filter Optimization for HP FIR Filter 2 (orig-design 32b, over-design 32b, opt-design 10b)

Figure 6-5: Filter Optimization for BP FIR Filter 3 (orig-design 32b, over-design 32b, opt-design 14b)

Figure 6-6: Filter Optimization for BS FIR Filter 4 (orig-design 32b, over-design 32b, opt-design 14b)

6.4 Hardware Implementation and Results


In this section, we present the results of our approach for minimizing hardware resources, which in turn reduces area and increases execution speed. The filter designed to the original specifications (horig) quantized at 32 bits, the filter with tighter design constraints (hover_des) quantized at 32 bits, and the filter with the optimum number of quantization bits (hopt) are each converted to Verilog for hardware implementation, and the RTL code is synthesized on a Virtex-5 LX110T FPGA. The hardware resources of the three designs are compared by synthesizing them. Table 6-1, Table 6-2, Table 6-3 and Table 6-4 show the synthesis results, i.e. the hardware resources of the originally designed, over-designed and finally the optimized filters quantized at a smaller number of bits, for the low pass, high pass, band pass and band stop FIR filters whose frequency response plots are shown in Figure 6-3, Figure 6-4, Figure 6-5 and Figure 6-6. As seen from the entries of these tables, the hardware resources for the optimized design hopt are significantly smaller than for the originally designed filter horig, while the frequency response of the former is the same as or better than that of the latter. Obviously the resources for the hover_des filters quantized at 32 bits are much greater than for horig quantized at 32 bits. The reduction in hardware resources is due to the decrease in the number of bits of the optimized filter; its enhanced performance is due to the increased number of coefficients, an increase that is compensated by the decrease in the number of quantization bits. The tabulated results therefore validate our claim: the proposed methodology minimizes hardware resources while also achieving an increase in speed.


Table 6-1: LP FIR Filter 1, order 19 vs. order 41 resources

Parameter / Resource   Original [32 31]   Over-designed [32 31]   Optimized [11 10]
Delta pass             0.01               0.008                   0.008
Delta stop             0.01               0.008                   0.008
wpass                  0.3                0.3                     0.3
wstop                  0.5                0.4                     0.4
Order                  19                 41                      41
No. of bits            32                 32                      11
No. of slices          1975               7737                    703
No. of FFPs            704                1409                    484
No. of LUTs            3186               13553                   877
No. of IOBs            99                 99                      36
No. of MULTs           80                 96                      38
Max. freq (MHz)        14                 7                       60

Table 6-2: HP FIR Filter 2, order 20 vs. order 66 resources

Parameter / Resource   Original [32 31]   Over-designed [32 31]   Optimized [10 9]
Delta pass             0.01               0.001                   0.001
Delta stop             0.01               0.001                   0.001
wstop                  0.3                0.4                     0.4
wpass                  0.5                0.5                     0.5
Order                  20                 66                      66
No. of bits            32                 32                      10
No. of slices          2093               13126                   942
No. of FFPs            736                2209                    661
No. of LUTs            3378               22957                   1162
No. of IOBs            99                 100                     34
No. of MULTs           82                 96                      31
Max. freq (MHz)        14                 5                       55

In the above tables, for the low pass and high pass filters, the number of bits of the optimized filters is well below half, approximately one third, of that of the originally designed filters (11 and 10 bits against 32). The optimized filters also synthesize at much higher clock rates, roughly three to four times those of the originally designed filters.


Table 6-3: BP FIR Filter 3, order 23 vs. order 63 resources

Parameter / Resource   Original [32 31]   Over-designed [32 31]   Optimized [14 13]
Delta stop1            0.001              0.0001                  0.0001
Delta stop2            0.001              0.0001                  0.0001
Delta pass             0.1                0.01                    0.01
wstop1                 0.3                0.35                    0.35
wpass1                 0.45               0.45                    0.45
wpass2                 0.7                0.7                     0.7
wstop2                 0.85               0.8                     0.8
Order                  23                 63                      63
No. of bits            32                 32                      14
No. of slices          2393               13358                   1234
No. of FFPs            832                2113                    858
No. of LUTs            3871               23613                   1516
No. of IOBs            99                 99                      42
No. of MULTs           96                 96                      54
Max. freq (MHz)        12                 5                       55

Table 6-3 depicts the hardware resources of the band pass filters. The original filter order is 23, and the over-designed filter is obtained by narrowing the transition bands on either side, so its order turns out to be 63. Obviously the hardware resources of this filter increase compared to the original filter. When the bits of the over-designed filter are reduced from 32 to a value where its pass band and stop band responses are equal to or better than those of the original filter, the optimized filter is obtained; the resulting number of bits is 14. The hardware resources as well as the clock speeds of the three filters in Table 6-3 validate our claim. Similar results are obtained for the band stop filter of order 26, shown in Table 6-4. In its over-designed counterpart, not only is the transition band width reduced but the stop band attenuation is increased, and the resulting design is a filter of order 66. The resources of these two filters quantized at 32 bits and of the optimized design quantized at 14 bits again verify our proposed design methodology for efficient hardware implementation.


Table 6-4: Band Stop FIR Filter 4, order 26 vs. order 66 resources

Parameter / Resource   Original [32 31]   Over-designed [32 31]   Optimized [14 13]
Delta pass1            0.1                0.01                    0.01
Delta pass2            0.1                0.01                    0.01
Delta stop             0.001              0.0001                  0.0001
wpass1                 0.25               0.3                     0.3
wstop1                 0.4                0.4                     0.4
wstop2                 0.7                0.7                     0.7
wpass2                 0.85               0.8                     0.8
Order                  26                 66                      66
No. of bits            32                 32                      14
No. of slices          3480               14328                   1494
No. of FFPs            928                2209                    967
No. of LUTs            5873               25319                   1865
No. of IOBs            99                 100                     46
No. of MULTs           96                 96                      65
Max. freq (MHz)        11                 5                       55

6.5 Conclusions
The results for each type of digital FIR filter (low pass, high pass, band pass and band stop) reveal that the technique proposed in the dissertation is an effective methodology for minimizing hardware resources, resulting in economy in terms of both area and power consumption. This heuristic approach can also be used in conjunction with the CSD, MSD or CSE techniques for optimizing the filter implementation, which would result in a further reduction in hardware resources. The technique verifies our claim that in some applications the algorithm, or the processing, tolerates the manipulation of specifications for an optimized output; in this case, architectural optimizations are not employed, but rather processing or algorithmic optimization is applied.


6.6 References
[1] A. Hosangadi, F. Fallah, and R. Kastner, Reducing Hardware Complexity of Linear DSP Systems by Iteratively Eliminating Two-Term Common Subexpressions, Proc. of ASP-DAC, 2005.
[2] P. Flores, J. Monteiro, and E. Costa, An Exact Algorithm for the Maximal Sharing of Partial Terms in Multiple Constant Multiplications, Proc. of ICCAD, pp. 13-16, 2005.
[3] A. V. Oppenheim, R. W. Schafer and J. R. Buck, Discrete-Time Signal Processing, Second Edition, Prentice-Hall, Inc., 1989.
[4] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms and Applications, Third Edition, Prentice Hall, Inc., 1996.
[5] C. Park and H.-J. Kang, Digital Filter Synthesis Based on Minimal Signed Digit Representation, Proc. of DAC, pp. 468-473, 2001.
[6] P. Flores, J. Monteiro, and E. Costa, An Exact Algorithm for the Maximal Sharing of Partial Terms in Multiple Constant Multiplications, Proc. of ICCAD, pp. 13-16, 2005.
[7] A. Dempster and M. MacLeod, Use of Minimum-Adder Multiplier Blocks in FIR Digital Filters, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Vol. 42, No. 9, pp. 569-577, 1995.
[8] D. R. Bull and D. H. Horrocks, Primitive Operator Digital Filters, IEE Proceedings G, Vol. 138, No. 3, pp. 401-412, 1991.
[9] R. Pasko, P. Schaumont, V. Derudder, S. Vernalde, and D. Durackova, A New Algorithm for Elimination of Common Subexpressions, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 18, No. 1, pp. 58-68, 1999.
[10] R. I. Hartley, Subexpression Sharing in Filters Using Canonic Signed Digit Multipliers, IEEE Transactions on Circuits and Systems II, 43(10), 1996.
[11] M. Potkonjak, M. B. Srivastava, and A. P. Chandrakasan, Multiple Constant Multiplications: Efficient and Versatile Framework and Algorithms for Exploring Common Subexpression Elimination, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 15(2):151-165, Feb 1996.
[12] S. Mirzaei, A. Hosangadi, and R. Kastner, FPGA Implementation of High Speed FIR Filters Using Add and Shift Method, International Conference on Computer Design, IEEE, 2006.
[13] Y. C. Lim, Extrapolated Impulse Response FIR Filters, IEEE Transactions on Circuits and Systems, Vol. 37, Dec. 1990.
[14] Y. J. Yu, G. Zhao, K. L. Teo and Y. C. Lim, Optimization and Implementation of Extrapolated Impulse Response Filters, IEEE International Conference on Neural Networks and Signal Processing, Nanjing, China, December 14-17, 2003.
[15] Y. C. Lim, Frequency-Response Masking Approach for the Synthesis of Sharp Linear Phase Digital Filters, IEEE Transactions on Circuits and Systems, Vol. CAS-33, pp. 357-364, April 1986.
[16] T. Saramaki and H. Johansson, Optimization of FIR Filters Using the Frequency Response Masking Technique, Proceedings of the IEEE International Conference on Circuits and Systems, Vol. II, May 2001.
[17] Y. C. Lim, Predictive Coding for FIR Filter Wordlength Reduction, IEEE Transactions on Circuits and Systems, Vol. CAS-32, pp. 365-372, April 1985.
[18] M. R. Y. Bateman and B. Liu, An Approach to Programmable CTD Filters Using Coefficients 0, +1, and -1, IEEE Transactions on Circuits and Systems, Vol. CAS-27, pp. 451-456, June 1980.
[19] G. F. Boudreaux and T. W. Parks, Thinning Digital Filters: A Piecewise-Exponential Approximation Approach, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-31, pp. 105-113, Feb. 1983.
[20] R. Mehboob, S. Khan, and R. Qamar, FIR Filter Design Methodology for Hardware Optimized Implementation, IEEE Transactions on Consumer Electronics, Vol. 55, Issue 3, August 2009.


CHAPTER 7

7 Conclusions
The thesis presents innovative approaches for high-throughput, high-speed architecture designs. It argues that the techniques for optimal digital design are application specific. In many design problems, the constraints of the algorithm are such that no exploration in redefining the algorithm can be performed; the architect needs to remain within the ambit of the algorithm specification and apply innovation in laying out the algorithm. Even in this design paradigm, the standard techniques may not find optimality. The thesis considers data compression algorithms and designs an effective architecture to handle gigabit rates. The design is scalable and modular. It unrolls the iterations like any optimization technique, but then finds interesting patterns of data manipulation that result in an effective design: the patterns reuse partial results and minimize data fetches from memory. The design is used to develop a compression device that compresses thick pipes of multi-gigabit data rates. The modularity and scalability are shown to work effectively in the network, and the proposed design can reduce the interconnect requirement many fold, thus reducing the cost of bandwidth.

The thesis then takes another example and stresses that in many design instances, looking at the algorithm differently may result in better designs; a reworking or modification of the design methodology should be explored. The thesis presents an analysis of the effects of quantizing FIR filter coefficients on the frequency response of the filter by successively reducing the number of quantization bits. This analysis reveals that reducing the quantization bits of FIR filter coefficients to half of the original number does not adversely affect the frequency response of the filter, and it leads to a novel filter design methodology for hardware-optimized implementation. The approach designs a filter to certain specifications and then over-designs the same filter with tighter constraints, resulting in a higher-order filter. The over-designed filter is then quantized by successively reducing the bits to the limit where its frequency response matches that of the initially designed filter. This innovation reduces the area of the filter, providing an optimized hardware implementation. The filter design methodology proposed in the thesis is a new paradigm for efficient hardware implementation of FIR filters. It can be applied to different types of filters, and the canonic signed digit and common subexpression elimination methods can be superimposed on the proposed technique for still better, more hardware-efficient implementations.
