
IEEE TRANSACTIONS ON SIGNAL PROCESSING

FPGA Realization of FIR Filters by Efficient and Flexible Systolization Using Distributed Arithmetic
Pramod Kumar Meher, Senior Member, IEEE, Shrutisagar Chandrasekaran, Student Member, IEEE and Abbes Amira, Senior Member, IEEE

Abstract: In this paper, we present the design optimization of one- and two-dimensional fully-pipelined computing structures for area-delay-power-efficient implementation of finite impulse response (FIR) filters by systolic decomposition of distributed arithmetic (DA)-based inner-product computation. The systolic decomposition scheme offers a flexible choice of the address length of the look-up-tables (LUTs) for DA-based computation, which can be used to select a suitable area-time trade-off. Using smaller address-lengths for the DA-based computing units reduces the memory size, but increases the adder complexity and the latency. For efficient DA-based realization of FIR filters of different orders, the flexible linear systolic design is implemented on a Xilinx Virtex-E XCV2000E FPGA using a hybrid combination of Handel-C and parameterizable VHDL cores. Various key performance metrics such as number of slices, maximum usable frequency, dynamic power consumption, energy density and energy throughput are estimated for different filter orders and address-lengths. Analysis of the results indicates that the performance metrics of the proposed implementation are broadly in line with theoretical expectations. It is found that the choice of address-length M = 4 yields the best area-delay-power-efficient realizations of the FIR filter for various filter orders. Moreover, the proposed FPGA implementation is found to involve significantly less area-delay complexity compared with the existing DA-based implementations of FIR filters.

Index Terms: Finite impulse response (FIR) filter, linear convolution, systolic array, field programmable gate arrays (FPGA), distributed arithmetic.

I. INTRODUCTION

Finite impulse response (FIR) digital filters are extensively used due to their key role in various digital signal processing (DSP) applications [1], [2]. As DSP has become increasingly popular over the years, along with the advancement in very large scale integration (VLSI) technology, the high-speed realization of FIR filters with less power consumption has become much more demanding. Since the complexity of implementation grows with the filter order and the precision of computation, real-time realization of these filters with the desired level of accuracy is a challenging task. Several attempts have, therefore, been made to develop dedicated and reconfigurable architectures for realization of FIR filters
Manuscript submitted January 16, 2007, Revised August 21, 2007. P. K. Meher is with the School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, (email: aspkmeher@ntu.edu.sg), URL: http://www.ntu.edu.sg/home/aspkmeher/. Shrutisagar Chandrasekaran has recently completed his Ph.D. from Electronic and Computer Engineering School of Engineering and Design, Brunel University, West London, UK, (email: sc@shrutisagar.com). A. Amira is with the Electronic and Computer Engineering School of Engineering and Design, Brunel University, West London, UK, (email: abbes.amira@brunel.ac.uk).

in application specific integrated circuits (ASIC) and field programmable gate arrays (FPGA) platforms. Systolic designs represent an attractive architectural paradigm for efficient hardware implementation of computation-intensive DSP applications, being supported by features like simplicity, regularity and modularity of structure. Additionally, they possess significant potential to yield high throughput rates by exploiting a high level of concurrency using pipelining or parallel processing or both [3]. To utilize the advantages of systolic processing, several algorithms and architectures have been suggested for systolization of FIR filters [4]-[7]. However, the multipliers in these structures require a large portion of the chip-area, and consequently limit the maximum possible number of processing elements (PEs) that can be accommodated and the highest order of the filter that can be realized. The multiplierless distributed arithmetic (DA)-based technique has gained substantial popularity in recent years for its high-throughput processing capability and increased regularity, which result in cost-effective and area-time efficient computing structures. The main operations required for DA-based computation of an inner-product are a sequence of look-up-table (LUT) accesses followed by shift-accumulation operations on the LUT output. DA-based computation is well-suited for FPGA realization, because the LUT as well as the shift-add operations can be efficiently mapped to the LUT-based FPGA logic structures. In FIR filtering, one of the convolving sequences is derived from the input samples while the other sequence is derived from the fixed impulse response coefficients of the filter. This behavior of the FIR filter makes it possible to use the DA-based technique for memory-based realization. It yields faster output compared with the multiplier-accumulator-based designs because it stores the pre-computed partial results in the memory elements [8], which can be read out and accumulated to obtain the desired result. The memory requirement of DA-based implementation for FIR filters, however, increases exponentially with the filter order. DA was first introduced by Croisier et al. [9], and further developed by Peled and Liu [10] for efficient implementation of digital filters. Attempts have been made to use offset-binary coding [11] to reduce the ROM size by a factor of 2. An LUT-less adder-based DA approach has been suggested by Yoo and Anderson, where memory-space is reduced at the cost of additional adders [12]. Memory-partitioning and multiple memory-bank approaches, along with flexible multi-bit data-access mechanisms, have been suggested for FIR filtering and inner-product computation in order to reduce the memory size of DA-based implementation [13]-[17]. Allred et al. have suggested an efficient DA-based implementation of least mean


square (LMS) adaptive filter using a decomposition of DA-based FIR computation and subsequent memory decomposition [18]. All these structures, however, are not suitable for implementation of FIR filters in systolic hardware, since the partial products available from the partitioned memory modules are summed together by a network of output adders. A new tool for the automatic generation of highly parallelized FIR filters based on the PARO design methodology is presented in [19], where the authors perform hierarchical partitioning in order to balance the amount of local memory with external communication, and achieve higher throughput and smaller latencies by partial localization. A systolic decomposition technique is suggested in a recent paper for memory-efficient DA-based implementation of linear and circular convolutions [20]. In this paper, we extend the work of [20] to obtain an area-delay-power-efficient implementation of FIR filters on an FPGA platform.

The rest of the paper is organized as follows. The formulation of the algorithms for flexible DA-based realization of FIR filters is described in the next Section, and the systolic structures are derived from the dependence graphs of the algorithms in Section III. The FPGA implementation methodology is described in Section IV, and simulation results pertaining to the FPGA implementation are presented and discussed in Section V. Conclusions, along with the scope for future work, are presented in Section VI.

II. FORMULATION OF THE ALGORITHM

We briefly outline here the conventional distributed arithmetic approach for inner-product computation, and thereafter derive a decomposition scheme for flexible DA-based systolization of FIR filters.

A. Conventional DA Approach for Inner-Product Computation

Let us consider the inner-product of two $N$-point vectors $A$ and $B$ given by

$$C = \sum_{k=0}^{N-1} A_k B_k \qquad (1)$$

where $A$ is a constant vector, while $B$ may change from time to time. Assuming $L$ to be the word-length, each component of $B$ may be expressed in two's complement representation:

$$B_k = -b_{k0} + \sum_{l=1}^{L-1} b_{kl}\, 2^{-l} \qquad (2)$$

where $b_{kl}$ denotes the $l$-th bit of $B_k$. Substituting (2) in (1), the inner-product can be expressed in an expanded form:

$$C = -\sum_{k=0}^{N-1} A_k b_{k0} + \sum_{k=0}^{N-1} A_k \left[ \sum_{l=1}^{L-1} b_{kl}\, 2^{-l} \right] \qquad (3)$$

To convert the conventional sum-of-products form of the inner-product of (1) into a distributed form, the order of summations over the indices $k$ and $l$ in the second term of (3) can be interchanged to have:

$$C = -\sum_{k=0}^{N-1} A_k b_{k0} + \sum_{l=1}^{L-1} 2^{-l} \left[ \sum_{k=0}^{N-1} A_k b_{kl} \right] \qquad (4)$$

Without loss of generality, for simplicity of discussion, we may assume the signal samples to be unsigned words of size $L$, with bit $l$ carrying the weight $2^{l}$, although the DA decomposition algorithm can be used for 2's complement coding and offset binary coding also. The inner-product given by (4) can then be expressed in a simpler form:

$$C = \sum_{l=0}^{L-1} 2^{l}\, C_l \qquad (5a)$$

where

$$C_l = \sum_{k=0}^{N-1} A_k b_{kl}. \qquad (5b)$$

Since vector $A$ is assumed to be constant, and each element of the $N$-point bit-sequence $\{b_{kl}$ for $0 \le k \le N-1\}$ can either be zero or one, any of the partial sums $C_l$, for $l = 0, 1, ..., L-1$, can have $2^N$ possible values. All the $2^N$ possible values of $C_l$ can, therefore, be pre-computed and stored in a ROM, such that while computing the inner-product the partial sums $C_l$ can be read out from the ROM using the bit-sequence $\{b_{kl}$ for $0 \le k \le N-1\}$ as address-bits. The inner-product can, therefore, be calculated according to (5) by $L$ cycles of ROM-read operations, each followed by shift-accumulation, corresponding to the $L$ bit-sequences $\{b_{kl}\}$ for $0 \le l \le L-1$.

B. Decomposition Scheme for DA-based Implementation of FIR Filter

The output of an FIR filter of order $N$ can be computed as an inner-product of the impulse response vector $\{h(k)$, for $k = 0, 1, ..., N-1\}$ and an input vector $\{s_n(k)$, for $k = 0, 1, ..., N-1\}$, given by

$$y(n) = \sum_{k=0}^{N-1} h(k)\, s_n(k) \qquad (6)$$

where $s_n(k) = x(n-k)$, and $x(n)$ is the current input sample. $\{h(k)\}$ is a fixed sequence, while the input sequence $\{s_n(k)\}$ changes at every sampling instant. $\{s_n(k)\}$ is derived from serially-shifted input samples using a window of size $N$, such that it receives a fresh input sample and leaves its oldest sample. Comparing (6) with (1), the filter output can be computed according to (5) as

$$y(n) = \sum_{l=0}^{L-1} 2^{l}\, C_l \qquad (7a)$$

where

MEHER ET AL: FPGA REALIZATION OF FIR FILTERS BY EFFICIENT AND FLEXIBLE SYSTOLIZATION USING DISTRIBUTED ARITHMETIC

$$C_l = \sum_{k=0}^{N-1} h(k)\, [s_n(k)]_l \qquad (7b)$$

$[s_n(k)]_l$, for $l = 0, 1, ..., L-1$, being the $l$-th bit of $s_n(k)$.
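The ROM-based computation of (7) can be illustrated with a minimal Python sketch. It assumes unsigned integer samples in which bit l carries the weight 2^l; the function names `build_rom` and `da_fir_output`, the coefficient values and the sample values are hypothetical and chosen only for illustration.

```python
from itertools import product


def build_rom(h):
    """Pre-compute all 2**N partial sums C_l of (7b).

    The address is the tuple of bits ([s_n(0)]_l, ..., [s_n(N-1)]_l).
    """
    return {bits: sum(hk for hk, b in zip(h, bits) if b)
            for bits in product((0, 1), repeat=len(h))}


def da_fir_output(h, window, L):
    """Evaluate y(n) as in (7a); window[k] = s_n(k) = x(n-k), L-bit unsigned."""
    rom = build_rom(h)
    y = 0
    for l in range(L):
        # l-th bit of every sample in the window forms the ROM address
        address = tuple((s // 2**l) % 2 for s in window)
        y += rom[address] * 2**l  # weight the ROM word by 2^l and accumulate
    return y


h = [3, -1, 4, 2]            # hypothetical impulse response, N = 4
window = [200, 17, 255, 0]   # hypothetical samples x(n), x(n-1), ...
direct = sum(hk * s for hk, s in zip(h, window))
print(da_fir_output(h, window, L=8), direct)
```

The ROM in this sketch holds 2^N words, which is the exponential growth with filter order that motivates the decomposition discussed next.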
[Fig. 1. The DG for DA-based implementation of FIR filter. (a) The DG. (b) Function of node A: Xout ← Xin + Memory_Read(Yin). (c) Function of node B: Yout ← Xin + 2·Yin.]

Equation (7) can be directly used for straight-forward DA-based implementation of an FIR filter using a ROM containing the $2^N$ possible values of $C_l$. For large values of $N$, however, the ROM size becomes too large, and consequently the ROM access time also becomes large. The straight-forward DA implementation is, therefore, not suitable for large filter orders. When $N$ is a composite number given by $N = PM$ ($P$ and $M$ may be any two positive integers), one can map the index $k$ into $(m + pM)$ for $m = 0, 1, ..., M-1$ and $p = 0, 1, ..., P-1$, in order to express (7) in the form:

$$y(n) = \sum_{l=0}^{L-1} 2^{l} \left[ \sum_{p=0}^{P-1} (S_n)_{l,p} \right] \qquad (8a)$$

where

$$(S_n)_{l,p} = \sum_{m=0}^{M-1} h(m + pM)\, [s_n(m + pM)]_l \qquad (8b)$$

for $l = 0, 1, ..., L-1$ and $p = 0, 1, ..., P-1$.
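The decomposition of (8) can be checked numerically with the following sketch. It assumes unsigned integer samples and that N is an exact multiple of M; each partial sum of (8b) takes one of 2^M values, so it is tabulated here in one small table per block p. The names `build_luts` and `decomposed_fir_output` and all numeric values are hypothetical. The accumulation over l follows the node-B function Yout ← Xin + 2·Yin of Fig. 1.

```python
def build_luts(h, M):
    """One table of 2**M words per block p; word a holds (S_n)_{l,p} of (8b)
    for the M address bits encoded in a (bit m of a selects h[m + p*M])."""
    assert len(h) % M == 0, "sketch assumes N = P * M"
    P = len(h) // M
    return [[sum(h[m + p * M] for m in range(M) if (a // 2**m) % 2)
             for a in range(2**M)]
            for p in range(P)]


def decomposed_fir_output(h, window, L, M):
    """Evaluate (8a) row by row, most significant bit first."""
    luts = build_luts(h, M)
    y = 0
    for l in reversed(range(L)):
        row_sum = 0
        for p, lut in enumerate(luts):
            # address built from the l-th bits of M consecutive samples
            address = sum(((window[m + p * M] // 2**l) % 2) * 2**m
                          for m in range(M))
            row_sum += lut[address]      # node A: add the table word
        y = row_sum + 2 * y              # node B: shift-add
    return y


h = [3, -1, 4, 2, 5, -6, 7, 1]                  # hypothetical, N = 8
window = [200, 17, 255, 0, 9, 128, 64, 33]      # hypothetical 8-bit samples
direct = sum(hk * s for hk, s in zip(h, window))
for M in (2, 4, 8):
    words = sum(len(lut) for lut in build_luts(h, M))
    print(M, words, decomposed_fir_output(h, window, 8, M) == direct)
```

For N = 8 the total table size in this sketch is 16, 32 and 256 words for M = 2, 4 and 8, which shows the memory saved by smaller address-lengths at the cost of more additions per row.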

For any given sequence of impulse response $\{h(k)\}$, the $2^M$ possible values of $(S_n)_{l,p}$ corresponding to the $2^M$ permutations of the $M$-point bit-sequence $\{[s_n(m + pM)]_l$, for $m = 0, 1, ..., M-1\}$, for $l = 0, 1, ..., L-1$, may be stored in an LUT of $2^M$ words. These values of $(S_n)_{l,p}$ can be read out when the bit-sequence is fed to the ROM as address. Equation (8) may, thus, be written in terms of memory-read operations as

$$y(n) = \sum_{l=0}^{L-1} 2^{l} \left[ \sum_{p=0}^{P-1} F(b_n)_{l,p} \right] \qquad (9)$$

where $F(b_n)_{l,p} = (S_n)_{l,p}$ and

$$(b_n)_{l,p} = \{ [s_n(pM)]_l \;\; [s_n(1 + pM)]_l \;\; \cdots \;\; [s_n(M-1 + pM)]_l \}$$

for $0 \le l \le L-1$ and $0 \le p \le P-1$. The bit-vector $(b_n)_{l,p}$ is used as the address word for the look-up-table, and $F$ is the memory-read operation.

III. DERIVATION OF THE STRUCTURES

Following the approach suggested in [20], we derive here the DA-based 1-dimensional (1-D) and 2-dimensional (2-D) systolic arrays for FIR filters from the dependence graph (DG) representation of DA-based computation.

A. 1-D Systolic Array for FIR Filters

The DG for computation of the FIR filter output according to (9) is shown in Fig. 1. It consists of $L$ rows, where each row consists of $P$ node-A and one boundary node-B. The functions of node-A and node-B are depicted in Figs. 1(b) and 1(c), respectively. A bit-vector $(b_n)_{l,p}$ consisting of a sequence of $M$ bits [derived from the $l$-th bit of the elements of the input sequence as given in (9)] is fed to the node-A on the $(l + 1)$-th row and $(p + 1)$-th column. The node uses the sequence of $M$ input bits of the input bit-vector as address for an LUT, and reads the content stored at the location specified by the address. The value read from the LUT is then added to the input available from its left, and the sum is passed to the node on its right. Node-B performs a shift-add operation such that it makes a left-shift of the bits of the input available from the top, then adds the input available from the left to the left-shifted value, and passes the result down to its adjacent node.

The DG can be projected vertically along the projection direction $[0\ 1]^T$ with default schedule [4] to derive a linear array consisting of $P$ PEs and an output-cell, as shown in Fig. 2. The input sequence $\{x(n)\}$ is fed to a serial-in parallel-out input-register, where the content of the register is serially right-shifted by one position and transferred in parallel to the bit-serial word-parallel converter in every $L$ cycles. The bits of vector $(b_n)_{l,p}$ are derived from the bit-serial word-parallel converter and fed to the $(p + 1)$-th PE [for $p = 0, 1, ..., P-1$] in most significant bit (MSB) to least significant bit (LSB) order in each cycle period (time-step), such that the $(L-1)$-th bits of the input values are fed to the PE first, and the zeroth bits are fed at the end. Besides, the input to each PE is staggered by one cycle-period with respect to the preceding PE to meet the causality requirement.

The function of the PEs is described in Fig. 2(b). Each PE consists of a ROM of $2^M$ words. During a cycle-period each PE reads the content of its ROM at the location specified by the input bit-vector. The value read from the ROM is then added to the input available to the PE from its left. During every cycle period, the sum is then transferred as output to its right. The function of the output-cell is shown in Fig. 2(c). Each output-cell contains a shift-register and an adder. During a cycle period it shifts the content of its register

left by one position and then adds the available input to the recently shifted content in its register. After $L$ cycles it delivers a desired filter output. The structure will yield its first filter output $(L + P)$ cycles after the first input is fed to the first PE, while the successive outputs become available every $L$ cycles. For high throughput applications one may, however, have a structure with $N$ 1-D arrays, which would yield $N$ convolved outputs in every $L$ cycles.

[Fig. 2. The 1-D array for DA-based implementation of FIR filter. (a) The linear systolic array. (b) Function of PE: Xout ← Xin + ROM_Read(Yin). (c) Function of output cell: initialize S ← 0 and Count ← 0; in each of L cycles, S ← 2S + Xin and Count ← Count + 1; when Count = L, Xout ← S, S ← 0 and Count ← 0. The delay symbol in the figure stands for a unit delay.]

B. 2-D Systolic Structure for FIR Filters

For high-throughput implementation of FIR filters, each node of the DG of Fig. 1 can be assigned to a PE exclusively to obtain a 2-D systolic array of $L$ rows and $(P + 1)$ columns, as shown in Fig. 3. Each row of the structure consists of $P$ PEs and a shift-add cell (SA).

[Fig. 3. The 2-D array for FIR filter. (a) The 2-D systolic array. (b) Function of PE: Yout ← Xin + ROM_Read(Yin). (c) Function of SA cell: Yout ← Xin + Left_Shift(Yin). The delay symbol in the figure stands for a unit delay.]

The computation of all the subsequent values of the filter output may also be given by similar DGs, and the computation of corresponding nodes of all such DGs may be folded to the same structure. The input samples are fed to a bit-parallel word-serial converter, which receives a new input sample in every cycle period, generates the $L$ bits of the input sample, and feeds one bit each to $L$ bit-level serial-in parallel-out shift registers (SIPOSRs) associated with each row of PEs, as shown in Fig. 3(a). Each SIPOSR contains a bit-stream of the corresponding bits of all the input words, such that the SIPOSRs on upper rows contain the more significant bits

compared to that of the lower rows. Each of the SIPOSRs of the structure shifts its content to the right by one bit-location and receives a new bit in every cycle, upon arrival of a fresh sample at the bit-serial word-parallel converter. The bit-vector $(b_n)_{l,p}$ consisting of $M$ bits from the $(l + 1)$-th SIPOSR is loaded to the $(p + 1)$-th PE of the $(l + 1)$-th row (for $0 \le l \le L-1$ and $0 \le p \le P-1$). Each PE [shown in Fig. 3(b)] uses the bit-vector $(b_n)_{l,p}$ as address for its LUT to read a partial result. The PE then adds the input available from the left to its recently read partial result, and passes the sum out to its right. Each row of the structure is terminated with an SA cell. The function of the SA is depicted in Fig. 3(c). During a cycle period, each SA makes a left-shift of its input available from the top and adds the shifted value to its input available from the left. The sum is then passed downward to its adjacent SA. To meet the data-dependence requirement, the SA cell on every $l$-th row is staggered by one cycle period with respect to the SA cell on the $(l + 1)$-th row. In the single-array structure of Fig. 2, the processing of different bit-streams is time-multiplexed to the same PE, while in the 2-D structure of Fig. 3 each bit-stream is processed


by a separate row of PEs. We can also derive a structure with $q$ such linear arrays (for $L = qu$, where $q$ and $u$ are positive integers) by projecting the nodes of $u$ rows of the DG to a single array structure, instead of projecting the nodes of all the rows to a single linear array. One may, therefore, opt to derive a structure with multiple linear arrays, and similarly may also opt for a suitable value of $P$ ($P$ = number of PEs on one row of the array) for flexible implementation to meet the hardware and time specifications of constraint-driven systems.

IV. FPGA IMPLEMENTATION METHODOLOGY

This section describes the proposed system and the methodology for FPGA implementation of the FIR filter based on the systolic decomposition of DA-based computation discussed in the previous Section. Using a systematic design flow, a power-aware, area-time-efficient optimized FPGA realization has been obtained here for the systolized filter.

A. Design Environment

The proposed computing system is designed by a hybrid combination of Handel-C and parameterizable VHDL cores [21]. A key advantage of Handel-C over the other hardware description languages (HDLs) is its rapid prototyping capability. Several works [22]-[24] have shown that Handel-C shortens design time by a factor of 3 to 4, with approximately the same operating speed compared to traditional HDLs. The VHDL-based cores are generated using Xilinx Coregen [25], and are used for small, frequently required blocks such as shifters and accumulators. Handel-C is used at the top level for architecture description and integration of the cores. Synthesis of the electronic design interchange format (EDIF) netlist from the top-level Handel-C code is performed using the Celoxica design kit DK4 [21].
Appropriate pin assignments and input-output synchronization for the neighbouring blocks, along with suitable routing of critical blocks, make it possible to realize a highly optimized area-delay-efficient design with a reduced energy consumption metric. This is particularly important, because non-optimal place and route tends to use long nets that consume more power than short nets. Finally, XPower [25] is used to obtain the power estimates, from which various energy consumption measures can be calculated.

B. Prototyping Platform Details

In order to estimate the performance of the flexible DA-based systolized FIR filter, the design has been prototyped on the Celoxica RC1000 board containing the Xilinx XCV2000E FPGA [21], [25]. The available on-chip logic resources include 19200 slices, an 80 × 120 CLB array, 655,360 bits of Block RAM and 614,400 bits of Distributed RAM. The RC1000 also has four memory banks, which communicate with the host by a DMA data transfer mechanism. The proposed prototyping platform is pictorially presented in Fig. 4. The host application obtains the input vector of arbitrary length and transfers the data to SRAM Bank-0 by means of

[Fig. 4. RC1000 prototyping platform containing the Xilinx Virtex-E XCV2000E FPGA: host programs, memory Banks 0 to 3, and the XCV2000E FPGA hosting the systolized DA-based FIR filter.]

DMA transfer after taking control of the FPGA memory banks. The control is then released and a pre-selected bitstream file is downloaded to the FPGA for configuration. The FPGA takes control of the memory banks thereafter, performs the FIR filtering and writes the result to SRAM Bank-1. The control of the memory banks is finally passed back to the host application, which reads the filter output from Bank-1. We would like to note here that we have used distributed RAMs (DRAMs) instead of block RAMs (BRAMs) for implementation of the architectures, because BRAMs are specialized RAMs which are dependent on the architecture of the FPGA used, and consequently are not suitable for a fully parameterizable and scalable architecture for implementing DA. We have consciously avoided the use of BRAMs in order to ensure portability of the design and easy retargeting to other platforms, particularly to non-Xilinx FPGAs and ASICs, with minimum changes.

V. RESULTS AND DISCUSSIONS

The results of FPGA implementation of the flexible 1-D systolic design (Fig. 2), in terms of area and maximum usable frequency metrics with respect to the filter order $N$ and address-length $M$, are presented in Table I. A number of interesting observations can be made from the data presented in Table I. It can be seen that for a given filter order $N$, the case $M = 4$ yields the most area-time efficient architecture when compared to the cases $M = 2$ and $M = 8$. This can be explained by the fact that, for $M = 2$, the increase in control logic and number of delay elements outweighs the gains made by reduction of the LUT size, while for $M = 8$ the memory requirement of the LUTs is too high. Also, it is worth mentioning that four-input LUTs are the basic building blocks of the Virtex-E configurable logic block (CLB) structure, and this accounts for the most efficient mapping of the DA-LUT to the available hardware resources for the case of $M = 4$. For a given platform, the maximum usable frequency of a design depends on a number of factors, e.g.:

• The logic depth of the design, which depends on the complexity of the algorithm to be implemented;
• The architectural choices, which demand specific FPGA resources (embedded multipliers, BRAMs, etc.);
• The device characteristics of the platform (e.g., speed grade and circuit technology);
• For a fixed platform and a given parameterizable IP core, the parameters of the core, which in this case are the filter order N and the decomposition factor P; and
• The area of spread of the clock-tree: the larger the area, the lower the maximum usable frequency, because of the clock skew resulting from the propagation delay across the computing circuits.

The above points account for the fact that the maximum usable frequency shows an overall decreasing trend as the address-length increases. Apart from that, the depth of systolic logic is generally shallow, and the critical path is proportional to the ROM size. When the address-length increases, not only does the size of the ROM increase exponentially, but the number of address lines also increases linearly. The size of the multiplexer logic for de-referencing the ROM locations depends on the above-mentioned factors, and contributes to the critical path. It must be highlighted that for the case M = 16 and above, it was not possible to synthesize or place and route the design, due to the exponential increase in ROM size to 65536 words.

TABLE I
KEY PERFORMANCE METRICS OF THE PROPOSED FPGA IMPLEMENTATION OF THE DA-BASED FIR FILTER (FOR WORD-LENGTH L = 8)

order (N)   address size (M)   area (slices)   frequency (MHz)
8           2                  144             71.788
8           4                  133             74.025
8           8                  149             62.181
16          2                  287             65.807
16          4                  260             67.222
16          8                  286             60.114
32          2                  555             62.771
32          4                  524             63.131
32          8                  553             55.313
64          2                  1057            61.244
64          4                  1061            64.049
64          8                  1094            54.750

TABLE II
COMPARISON OF PERFORMANCE OF THE PROPOSED IMPLEMENTATION AND THE EXISTING IMPLEMENTATION OF DA-BASED FIR FILTER [12]

               proposed                                    Yoo et al. [12]
filter order   area (slices)   MUF (MHz)   gate count      area (slices)   MUF (MHz)   gate count
8              133             74.025      2512            146             70.552      3365
16             260             67.222      4998            283             62.775      6337
32             524             63.131      11128           547             61.166      12235
64             1061            64.049      23878           1076            57.192      24801

The address-length M is taken to be four for the proposed implementation.
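The area-time trade-off reported in Table I can be examined with a short calculation. The sketch below uses the Table I figures and takes, as an assumption not stated in the paper, the area-delay product to be the slice count multiplied by the minimum clock period (the reciprocal of the maximum usable frequency); the names `TABLE_I`, `area_delay_product` and `best_address_length` are hypothetical.

```python
# Table I of the paper: (slices, maximum usable frequency in MHz), L = 8.
TABLE_I = {
    8:  {2: (144, 71.788), 4: (133, 74.025), 8: (149, 62.181)},
    16: {2: (287, 65.807), 4: (260, 67.222), 8: (286, 60.114)},
    32: {2: (555, 62.771), 4: (524, 63.131), 8: (553, 55.313)},
    64: {2: (1057, 61.244), 4: (1061, 64.049), 8: (1094, 54.750)},
}


def area_delay_product(slices, muf_mhz):
    """Assumed metric: slices times minimum clock period in microseconds."""
    return slices / muf_mhz


def best_address_length(rows):
    """Address-length M with the smallest assumed area-delay product."""
    return min(rows, key=lambda m: area_delay_product(*rows[m]))


best = {n: best_address_length(rows) for n, rows in TABLE_I.items()}
print(best)
```

Under this assumed metric, M = 4 gives the smallest product for every filter order in Table I, including N = 64, where M = 2 occupies slightly fewer slices but runs at a lower frequency.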

A. Comparison of Architectures

Details of the performance of the 1-D systolic array of Section III in terms of the basic design metrics are tabulated alongside those of other comparable existing architectures in Table II. It must be highlighted that the architecture proposed in [12] has been implemented on an Altera Stratix FPGA device. Significant architectural differences in the FPGA fabric between Xilinx and Altera devices preclude an objective and direct comparison between the design metrics of our architecture and those reported in [12]. Additionally, FPGA implementation details have been provided only for the case Bc = 18 (where Bc has been defined in [12] as the word-length of the original LUT). To make a fair comparison, it is necessary to homogenize the FPGA platform and the parameters of the DA-based FIR filter. Hence, we have faithfully re-implemented the architecture presented in [12] on the same platform that has been used in this paper, for word-length L = 8. It may be noted that the same design tools and identical P&R settings have been used, and the same amount of optimization has been applied to Yoo's design as to the proposed implementation. It is clear that the proposed implementation significantly outperforms the existing implementation in terms of three key metrics, namely the area occupied, the maximum usable frequency and the gate-count, for all the values of N. The superior performance of the proposed design is due to the fact that the number of adders increases linearly with the filter order N for the most optimum implementation, as opposed to the case in [12], where the architecture uses a tree of adders to calculate the final values before the shift-accumulation operation. Additionally, it is worth mentioning that the 16-word ROMs of the proposed implementation are more efficiently mapped to the 4-input LUT structure (common to both Xilinx and Altera FPGAs) than the architecture presented in [12]. Apart from that, the complexity of the control logic is also minimal in the proposed implementation. Moreover, in our design, only a single shift-accumulate operation is needed, irrespective of the filter order N, as a result of systolization. All these factors yield the most efficient implementation in all key performance metrics in our proposed implementation.

B. Chip Level Details

Careful manual place and route of critical nets and manual pin assignment for the designs have been performed using Xilinx PACE and Floorplanner [25]. This process yields a compact and optimized design with short nets, and serves two important purposes. First, short nets have less propagation delay, and up to 25% gains in maximum frequency have been achieved. Second, short nets have less parasitic capacitance and DC load, and therefore dissipate less power than long nets. Manual pin assignment also enables us to locate the I/O pads close to the design area, further aiding the above two criteria. The chip diagram for one of the implementations is shown in Fig. 5.

[Fig. 5. FPGA chip diagram for FIR filter realization for N = 64, M = 8 and L = 8.]

C. Power Consumption

Power consumption depends on the design of the architecture, and is influenced by a number of factors such as the clock frequency of implementation, the number of interconnects, switching activity rates, the number of logic blocks and the interconnect structure of the specific FPGA, the power supply voltage level and input-output data transfers. XPower estimates of power dissipation are presented in Table III. It can be seen that I/O power remains constant for all architectures. This is explained by the fact that the architectures are fully parameterizable and pipelined, and consequently process one I/O value per operational cycle (or L clock cycles). Hence, for a fixed word-length L and a given clock frequency, the input power remains constant. Output power increases linearly with the number of output pins, which is also a function of the filter order N. However, for a fixed value of N, output power also remains constant across all values of address-length M. The total dynamic on-chip power is graphically presented for all cases in Fig. 6.

TABLE III
POWER DISSIPATION OF THE PROPOSED FPGA IMPLEMENTATION OF FIR FILTER FOR DIFFERENT FILTER ORDERS AND ADDRESS-LENGTHS

            power dissipation (mW)
N    M      clock    input   logic    output   signal
8    2      16.85    1.41    44.52    58.06    29.95
8    4      13.90    1.41    42.16    58.06    33.63
8    8      16.33    1.41    56.78    58.06    43.89
16   2      25.65    1.41    83.75    61.12    64.37
16   4      28.23    1.41    75.13    61.12    65.42
16   8      21.78    1.41    101.08   61.12    83.43
32   2      42.60    1.41    169.86   64.18    108.66
32   4      43.93    1.41    144.90   64.18    108.03
32   8      39.33    1.41    195.71   64.18    137.39
64   2      86.38    1.41    345.87   67.23    195.55
64   4      71.32    1.41    296.81   67.23    179.11
64   8      68.14    1.41    389.70   67.23    243.56

Power estimation has been carried out for 50 MHz frequency.

[Fig. 6. Plot of variation of dynamic on-chip power with filter order. Axes: dynamic on-chip power (mW) versus filter order N (log2 scale); curves for M = 2, M = 4 and M = 8.]

D. Energy Analysis

Average power consumption is an important performance metric of FPGA-based systems, and most of the studies of the designs implemented on FPGAs focus on the power dissipation of the circuit as a whole. However, in the case of high-throughput DSP circuits, energy is a more appropriate measure to quantify the efficiency of an operation. A suitable estimate of energy consumption helps to decide on the design choice that can meet the throughput requirement while minimizing power consumption as well. We have analyzed three parameters of energy estimates, namely Energy per OPeration (EOP), Energy Throughput (ET) and Energy Density (ED), of the proposed FIR filter implementation.

1) Energy per Operation: EOP is used to measure the average amount of energy required to complete one operation. It is useful for comparing the energy efficiency of two or more circuits that employ different architectural approaches to perform the same operation. Also, it is useful for comparing circuits that require different numbers of clock cycles to complete one operation. EOP is given by:

$$EOP = \frac{1}{T} \int_{T} P(t)\, dt \qquad (10)$$

where $P(t)$ is the instantaneous power dissipation and $T$ is the number of clock cycles needed to complete one operation. Assuming constant power consumption, EOP can be estimated by the expression:

$$EOP = \frac{P_{av}\, l}{f} \qquad (11)$$

It can be seen from Fig. 7 that EOP steadily increases as N increases, and is proportional to the increase in circuit size and complexity.

[Fig. 7. Energy per operation of FPGA implementations of the FIR filter for different values of N and M. Axes: energy per operation (nJ/operation) versus filter order N (log2 scale); curves for M = 2, M = 4 and M = 8.]

[Fig. 9. Plot of energy density of FPGA implementations for different values of N and M. Axes: energy density (nJ/slice) versus filter order N (log2 scale); curves for M = 2, M = 4 and M = 8.]

E. Energy Throughput

Energy throughput is defined as the amount of energy dissipated per bit of output data, given by

$$ET = \frac{EOP}{L \cdot N} \qquad (12)$$

It combines energy dissipation and the actual volume of data processed per cycle into a single metric, and enables us to make a fair comparison between different architectures that perform the same operation at different mathematical scales. The ET data is graphically presented in Fig. 8. It can be seen that the most energy-efficient architecture is obtained for the case M = 4, in line with the power metrics obtained.

F. Energy Density

ED allows the designer to estimate the tradeoff between energy consumption and area occupied for different architectural strategies. ED is calculated by normalizing EOP with respect to the number of FPGA slices occupied (A), given by

$$ED = EOP/A \qquad (13)$$

The plot of ED data for different filter orders is presented in Fig. 9. It can be observed from Fig. 9 that for N = 64 and M = 2 the EOP increases at a faster rate (about 5% faster) than the area-complexity of systolic elements in the architecture.
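As a rough illustration of how the three energy metrics of Eqs. (11)-(13) relate to one another, the following Python sketch evaluates EOP, ET and ED for a single configuration. The class name FilterRun, its field names and all numerical inputs are hypothetical and are not taken from Table III or from the plotted data. The sketch also assumes that l in (11) counts the clock cycles per operation, so that l/f is the time taken by one operation, and it sets l equal to the word-length L only because one I/O value is processed every L clock cycles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterRun:
    """One FIR filter configuration; all values supplied here are illustrative."""
    p_av_w: float      # average dynamic power P_av, in watts
    f_hz: float        # clock frequency f, in hertz
    cycles: int        # clock cycles per operation (assumed meaning of l in Eq. (11))
    word_length: int   # L, bits per sample
    order: int         # N, filter order
    slices: int        # A, number of occupied FPGA slices

    def eop_nj(self) -> float:
        """Energy per operation in nJ, Eq. (11): P_av * l / f (constant power assumed)."""
        return self.p_av_w * self.cycles / self.f_hz * 1e9

    def et_nj_per_bit(self) -> float:
        """Energy throughput in nJ/bit, Eq. (12): EOP / (L * N)."""
        return self.eop_nj() / (self.word_length * self.order)

    def ed_nj_per_slice(self) -> float:
        """Energy density in nJ/slice, Eq. (13): EOP / A."""
        return self.eop_nj() / self.slices


# Hypothetical inputs chosen only to exercise the formulas at a 50 MHz clock.
run = FilterRun(p_av_w=0.25, f_hz=50e6, cycles=8,
                word_length=8, order=16, slices=400)

print(f"EOP = {run.eop_nj():.2f} nJ/operation")
print(f"ET  = {run.et_nj_per_bit():.4f} nJ/bit")
print(f"ED  = {run.ed_nj_per_slice():.3f} nJ/slice")
```

Because ET and ED are both obtained by normalizing the same EOP value, a change in N or in the slice count alters them without any change in the measured power, which is why the three metrics can rank the same set of designs differently.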
Fig. 8. Energy throughput (nJ/bit) of FPGA implementations for different values of N and M. Horizontal axis: filter order N (log2 scale); curves for M = 2, 4 and 8.

VI. CONCLUSION

Flexible designs of 1-D and 2-D systolic computing structures are derived for area-delay-power-efficient implementation of FIR filters by address decomposition of DA-based inner-product computation. It is shown that hardware-efficient memory-based realization of FIR filters can be obtained by suitable choice of the address-length of the LUTs used for the DA-based computation of partial results of the inner-product. The 1-D systolic structure is implemented on a Xilinx Virtex-E XCV2000E FPGA by a hybrid combination of Handel-C and parameterizable VHDL cores. The key performance metrics, namely number of slices, maximum usable frequency, dynamic power consumption, energy density and energy throughput, are estimated for different filter orders and address-lengths, and it is further shown that the FPGA prototyping of the systolic design yields performance estimates broadly in line with theoretical expectations. It is found that the 1-D structure with address-length M = 4 yields the best area-delay-power-efficient implementations for various filter orders. Moreover, it is found that the proposed implementation involves significantly less area-delay complexity compared with the existing DA-based implementations of FIR filters.

For high-speed applications, a 2-D systolic array consisting of L linear systolic arrays (where L is the word-length) could be used to process the individual bit-streams of the input signal separately in different systolic arrays. The 2-D structure would provide L times more throughput at the cost of nearly L times more area-complexity over the 1-D structure. The proposed FPGA realization is fully parameterizable, modular and scalable, so that it can be readily used as an IP core in a number of environments. Further work may be carried out to develop an efficient DA-based adaptive filter, and the two-factor DA-decomposition may be extended further to three-factor and four-factor decomposition of DA-based computation for FIR filtering.

[1] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms and Applications. Upper Saddle River, NJ: PrenticeHall, 1996. [2] A. Antoniou, Digital lters : analysis, design, and applications. New York: McGraw-Hill, 1993. [3] H. T. Kung, Why systolic architectures? IEEE Computer, vol. 15, pp. 3745, Jan. 1982. [4] K. K. Parhi, VLSI Digital Signal Procesing Systems: Design and Implementation. New York: John Wiley & Sons, Inc, 1999. [5] R. Wyrzykowski and S. Ovramenko, Flexible systolic architecture for VLSI FIR lters, IEE Proceedings-Computers and Digital Techniques, vol. 139, no. 2, pp. 170172, Mar. 1992. [6] B. K. Mohanty and P. K. Meher, Cost-effective novel exible celllevel systolic architecture for high throughput implementation of 2-D FIR lters, IEE Proceedings-Computers and Digital Techniques, vol. 143, no. 5, pp. 436439, Nov. 1996.

MEHER ET AL: FPGA REALIZATION OF FIR FILTERS BY EFFICIENT AND FLEXIBLE SYSTOLIZATION USING DISTRIBUTED ARITHMETIC

[7] , Novel exible systolic mesh architecture for parallel VLSI implementation of nite digital convolution, IETE Journal of Research, vol. 44, no. 6, pp. 261266, Nov. 1988. [8] S. A. White, Applications of the distributed arithmetic to digital signal processing: A tutorial review, IEEE ASSP Magazine, vol. 6, no. 3, pp. 519, July 1989. [9] A. Croisier, D. J. Esteban, M. E. Levilion, and V. Rizo, Digital lter for pcm encoded signals, U.S. Patent 3 777 130, Apr., 1973. [10] A. Peled and B. Lie, A new hardware realization of digital lters, IEEE Transactions on Acoust. speech and signal procesing, vol. 22, p. 456462, Dec. 1974. [11] J. P. Choi, S.-C. Shin, and J.-G. Chung, Efcient ROM size reduction for distributed arithmetic, in Proc. IEEE International Symp on Circuits and Syst., 2000. ISCAS, vol. 2, May 2000, pp. 6164. [12] H. Yoo and D. V. Anderson, Hardware-efcient distributed arithmetic architecture for high-order digital lters, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, (ICASSP 05), vol. 5, Mar. 2005, pp. v/125v/128. [13] C.-F. Chen, Implementing FIR lters with distributed arithmetic, IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 33, no. 5, pp. 13181321, Oct. 1985. [14] H.-R. Lee, C.-W. Jen, and C.-M. Liu, On the design automation of the memory-based VLSI architectures for FIR lters, IEEE Trans. Consumer Electronics, vol. 39, no. 3, pp. 619629, Aug. 1993. [15] K. Nourji and N. Demassieux, Optimal VLSI architecture for distributed arithmetic-based algorithms, in 1994 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1994. ICASSP-94, vol. 2, Apr. 1994, pp. II/509II/512. [16] M. Mehendale, S. D. Sherlekar, and G. Venkatesh, Area-delay tradeoff in distributed arithmetic based implementation of FIR lters, in Proc. Tenth International Conference on VLSI Design, Jan. 1997, pp. 124129. [17] S.-S. Jeng, H.-C. Lin, and S.-M. 
Chang, FPGA implementation of FIR lter using M-bit parallel distributed arithmetic, in Proc. 2006 IEEE International Symposium on Circuits and Systems, ISCAS 2006, May 2006, p. 4. [18] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, LMS adaptive lters using distributed arithmetic for high throughput, IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 52, no. 7, pp. 13271337, July 2005. [19] H. Ruckdeschel, H. Dutta, F. Hannig, and J. Teich, Automatic FIR lter generation for FPGAs, in Proc. 5th International Workshop on Systems, Architectures, Modeling, and Simulation, SAMOS 2005, T. D. H. et al., Ed., vol. LNCS 3553, July 2005, pp. 5161. [20] P. K. Meher, Hardware-efcient systolization of DA-based calculation of nite digital convolution, IEEE Trans. Circuits Syst. II: Express Briefs, vol. 53, no. 8, pp. 707711, Aug. 2006. [21] Celoxica Ltd, 66 Milton Park, Abingdon, Oxfordshire, United Kingdom. [Online]. Available: www.celoxica.com [22] S. M. Loo, B. E. Wells, N. Freije, and J. Kulick, Handel-C for Rapid Prototyping of VLSI Coprocessors for Real Time Systems, in Proc. Thirty Fourth Southeastern Symposium on System Theory, vol. 46, no. 1, 2003, pp. 610. [23] P. Voles, L. Holasek, and M. Vasilko, ANSI C and Handel-C based rapid prototyping framework for real-time image processing algorithms, in Proc. International Conference on Engineering of Recongurable Systems and Algorithms, (CSREA Press) 2002, pp. 153159. [24] Handel-C for Hardware Design, White Paper, Celoxica Ltd, 66 Milton Park, Abingdon, Oxfordshire, United Kingdom. [Online]. Available: www.celoxica.com [25] Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124-3400. [Online]. Available: www.xilinx.com

Pramod Kumar Meher (SM'03) received the first class degrees of B.Sc. (Honours) in Physics and M.Sc. in Physics (with electronics specials), and the Ph.D. degree, all from Sambalpur University, Sambalpur, India in 1976, 1978, and 1996, respectively. He has a wide scientific and technical background covering Physics, Electronics and Computer Engineering. Currently, he is a Senior Fellow in the School of Computer Engineering, Nanyang Technological University, Singapore. He was a Professor at Utkal University, Bhubaneswar, India during 1997-2002, a Reader in Electronics at Berhampur University, Berhampur, India during 1993-1997, and a Lecturer in Physics in various Government Colleges (in India) during 1981-1993. His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal processing, image processing, secured communication, artificial neural networks and bioinformatics. He has published nearly 90 technical papers. Dr. Meher was conferred with the Samanta Chandrasekhar Award for excellence in research in Engineering & Technology for the year 1999. He is a Chartered Engineer of the Engineering Council of UK, a Senior Member of IEEE, a Fellow of The Institution of Electronics and Telecommunication Engineers of India, and a Fellow of the Institution of Engineering and Technology (formerly known as the Institution of Electrical Engineers), UK.

Shrutisagar Chandrasekaran (S'00) recently completed his PhD in the parallel and reconfigurable computing for computer vision research group at Brunel University, West London, within the division of Electronics and Computer Engineering in the School of Engineering and Design. Previously, he was pursuing his PhD in the School of Computer Science at the Institute of Electronics, Communications and Information Technologies (ECIT) at Queen's University, Belfast (QUB). He received his Bachelor's Degree in Electronics and Communications with distinction from Madurai Kamaraj University, India in 2004. Dr. Chandrasekaran is a recipient of the IEEE R10 RAB Larry K Wilson Award and he is a member of IEEE. His research interests include custom computing using FPGAs, hardware-software co-design, power and energy aware design techniques, and modeling of power and performance for FPGA-based designs. Dr. Chandrasekaran is currently working in the financial sector in the UK as a quantitative analyst.

Abbes Amira (M'99, SM'06) is a senior lecturer at Brunel University, West London, UK, within the division of Electronic and Computer Engineering in the School of Engineering and Design. Before joining Brunel University in May 2006, he held a lectureship in Computer Science at Queen's University, Belfast (QUB) from November 2001. He received his Ph.D. in Computer Science from Queen's University Belfast in 2001. He has been awarded a number of grants from government and industry, has published over 100 publications and has supervised 5 PhD students during his career to date. Dr. Amira has been invited to give talks at universities in the UK, Europe, USA and North Africa, and at international conferences, workshops and exhibitions, and has served as chair and program committee member for a number of well-known conferences. He is a senior member of IEEE, a member of ACM and a Fellow of the Higher Education Academy of UK. His research interests include reconfigurable computing, image and vision systems, System on Chip, custom computing using FPGAs, medical image analysis, multi-resolution analysis, biometrics technologies and information retrieval.
