
2006 IEEE International Symposium on Signal Processing and Information Technology

Area-Efficient FIR Filter Design on FPGAs using Distributed Arithmetic


Patrick Longa and Ali Miri
School of Information Technology and Engineering
University of Ottawa
{plong034, samiri}@site.uottawa.ca
Abstract - In this paper, a highly area-efficient multiplier-less FIR
filter is presented. Distributed Arithmetic (DA) has been used to
implement a bit-serial scheme of a general asymmetric version of an
FIR filter, taking optimal advantage of the 4-input LUT-based
structure of FPGAs. Furthermore, we have introduced a
modification in the accumulator stage to achieve further savings.
The proposed filter has been designed and synthesized with Altera
Quartus II, and implemented on a Stratix FPGA device. Our results
show reduced area requirements in comparison to previous LUT-less DA architectures.
Keywords - FIR filter, FPGA, DSP, Distributed Arithmetic, Lookup
table (LUT).

1. INTRODUCTION
In the last few years, there has been a growing trend to
implement DSP functions in Field Programmable Gate
Arrays (FPGAs), which offer a balanced solution in
comparison with traditional devices. Although ASICs and
DSP chips have been the traditional solution for high-performance
applications, technology and the market are now imposing new rules.
On the one hand, high development costs and time-to-market factors
associated with ASICs can be prohibitive for certain applications and,
on the other hand, programmable DSP processors can be unable to reach
a desired performance due to their sequential-execution
architecture. In this context, FPGAs offer a very attractive
solution that balances high flexibility, time-to-market, cost and
performance.
In that sense, the research community has put great effort
in designing efficient architectures for DSP functions such as
FIR filters, which are extensively used in multiple
applications in telecommunications, wireless/satellite
communications, video and audio processing, biomedical
signal processing and many others.
Traditionally, the design methods were mainly focused on
multiplier-based architectures to implement the
Multiply-and-Accumulate (MAC) blocks that constitute the central
piece in FIR filters and several DSP functions:

$$ y[n] = \sum_{k=0}^{K-1} a_k \, x[n-k] \tag{1} $$

But careful analysis shows that multiplier-based filter
implementations may become highly expensive. A direct
implementation of equation (1) requires K MAC blocks,
which is expensive in terms of area and is especially critical
for high-order filters. This issue has been partially solved
with the new generation of low-cost FPGAs that have
embedded DSP blocks. However, if the final product will
reside on an ASIC for instance, the problem is still present.
To resolve this issue, several multiplier-less schemes were
proposed over the years. Basically, these methods can be
classified in two categories according to how they manipulate
the filter coefficients for the multiply operation. The first type
of multiplier-less technique is the conversion-based approach,
in which the coefficients are transformed to other numeric
representations whose hardware implementation or
manipulation is more efficient than the traditional binary
representation. Examples of such techniques are the Canonic
Signed Digit (CSD) method, in which coefficients are
represented by a combination of powers of two in such a way
that multiplication can be simply implemented with
adders/subtractors and shifters [4], and the Dempster-Macleod
(DM) method, which similarly involves the representation of
filter coefficients with powers of two but in this case
arranging partial results in cascade to introduce further
savings in the usage of adders [5].
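As an illustration of the conversion-based idea, the following Python sketch recodes an integer coefficient into signed digits and multiplies using only shifts, additions and subtractions. It is a minimal software model, not taken from [4]; the function names `to_csd` and `csd_multiply` are hypothetical, and the recoding used is the standard non-adjacent form, assumed here to stand in for CSD.

```python
def to_csd(value):
    """Recode an integer into digits in {-1, 0, +1}, least significant first.

    The result is the non-adjacent form: no two neighbouring digits are
    both non-zero, which minimises the number of adders/subtractors.
    """
    digits = []
    n = value
    while n != 0:
        if n & 1:
            # n mod 4 == 1 -> digit +1, n mod 4 == 3 -> digit -1
            digit = 2 - (n % 4)
            n -= digit
        else:
            digit = 0
        digits.append(digit)
        n //= 2
    return digits


def csd_multiply(x, digits):
    """Multiply x by the recoded coefficient using shifts and add/subtract only."""
    total = 0
    for shift, digit in enumerate(digits):
        if digit == 1:
            total += x << shift
        elif digit == -1:
            total -= x << shift
    return total


# Coefficient 7 becomes 8 - 1: one subtractor instead of two adders.
print(to_csd(7))
print(csd_multiply(5, to_csd(7)))
```

In hardware, each non-zero digit corresponds to one shifted copy of the input feeding an adder or subtractor, so fewer non-zero digits mean fewer arithmetic units.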
The second type of multiplier-less method involves the use
of memories (RAMs, ROMs) or Look-Up Tables (LUTs) to
store pre-computed values of coefficient operations. These
are called memory-based methods, and examples of them are
found in the Constant Coefficient Multiplier (KCM) method
and the well-known Distributed Arithmetic method [2].
Distributed Arithmetic (DA) appeared as a very efficient
solution especially suited for LUT-based FPGA architectures.
This technique, first proposed by Croisier et al. [11], is a
multiplier-less architecture that is based on an efficient
partition of the function into partial terms using the 2's
complement binary representation of data. The partial terms
can be pre-computed and stored in LUTs. The flexibility of
this algorithm on FPGAs permits everything from bit-serial
implementations to pipelined or full-parallel versions of the
scheme, which can greatly improve the design performance.
The main problem with DA is that the requirement of
memory/LUT capacity increases exponentially with the order
of the filter, given that DA implementations need 2^K words
(K being the number of taps of the filter). That constitutes a
first obstacle for FIR filters of high order. Yoo et al. [1]
proposed a flexible architecture that gradually replaces LUT
requirements with multiplexer/adder pairs. The authors
exploit the fact that block values corresponding to K
coefficients in a LUT-DA structure are identical to block
values corresponding to K+1 coefficients if one discards the
last added coefficient. Yoo et al. implemented asymmetric
FIR filters of several orders (4-1024 taps) using the LUT-less
DA architecture on an Altera Stratix device. The results
showed drastic improvement in terms of area and memory
usage in comparison with the traditional LUT-based bit-serial
DA architecture and previous LUT-less DA schemes.
In this paper, we present an asymmetric FIR filter
architecture using the bit-serial LUT-based DA technique.
For this implementation, we use a scheme that takes
advantage of the 4-input LUTs in FPGAs, and rearranges the
input sequence to implement a modified version of the
shifter/accumulator stage. We show that our modified version
is superior in terms of area to previous LUT-less DA
architectures.
2. DISTRIBUTED ARITHMETIC (DA)
Distributed Arithmetic is one of the best-known
methods of implementing FIR filters on FPGAs,
characterized by a high flexibility that permits anything from
serial to full-parallel arrangements. The right balance among versions
is tied to specifications for a given application, and basically
depends on requirements in terms of hardware cost and
throughput. In each case, the designer has to trade bandwidth
for area.
Equation (2) describes an FIR filter of length K:

$$ y[n] = \sum_{k=0}^{K-1} a_k \, x[n-k] \tag{2} $$

Where:
x and y are two vectors of size K that represent the input and
transformed data, respectively.
a_k is the set of constant coefficients of the filter.
K is the number of taps of the FIR filter.

In a bit-serial DA scheme, assuming that the input x to the
filter is represented in L-bit 2's complement binary numbers
with the sign bit to the left of the radix point, we have:

$$ x[n-k] = -b_{k,0} + \sum_{l=1}^{L-1} b_{k,l} \, 2^{-l} \tag{3} $$

Replacing this result in equation (2), we obtain:

$$ y[n] = \sum_{k=0}^{K-1} a_k \left( -b_{k,0} + \sum_{l=1}^{L-1} b_{k,l} \, 2^{-l} \right) $$

$$ y[n] = -\sum_{k=0}^{K-1} \left( a_k b_{k,0} \right) + \sum_{l=1}^{L-1} \left( \sum_{k=0}^{K-1} a_k b_{k,l} \right) 2^{-l} \tag{4} $$

In equation (4), we observe that the terms in parenthesis
may take one of 2^K possible values, given that b_{k,l} ∈ {0,1},
and that those values correspond to all possible sum
combinations of filter coefficients. These values can be
pre-computed and stored in LUTs or memories, and addressed by
b_{k,l}. This way, the MAC algorithm of FIR filters is reduced
to LUT accesses and summations.
3. CIRCUIT DESCRIPTION
As previously stated, the main problem with LUT-based
implementations is that the LUT requirement increases
exponentially with the order of the filter. To alleviate this
problem, the main strategy is to take advantage of the 4-input
LUT-based architecture of FPGAs. Partitioning the input into
4-bit units reduces the Look-Up Table from 2^K words to
(K/4) x 2^4 words at a cost of approximately 4-bit (K-4)/4
adders.
Applying this approach in equation (4), the basic LUT-DA
scheme on an FPGA would consist of three main components
(Figure 1): the input registers, the 4-input LUT unit and the
shifter/accumulator unit. Additionally, it would require a
control unit to manipulate the filter operation, and an
adder-tree unit to perform addition on partial filter results.
1. Input registers:
To decrease LE (Logic Element) consumption, we mostly
used RAM resources to implement the shift registers. To
illustrate the savings in Logic Elements that this technique
introduces, for instance to implement four 20-bit shift
registers, one would need 80 LEs. In contrast, using RAM
resources, one needs just 8 LEs and 72 bits of memory.

Fig. 1. Proposed LUT-based bit-serial DA implementation
of a 4-tap FIR filter.

2. LUT unit:
To implement a 4-input LUT unit, we followed the LUT
table presented in Figure 1, which represents all the possible
sum combinations of filter coefficients. The implementation
was done using VHDL.
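To make the role of the LUT concrete, the following Python sketch builds the 16-word table of coefficient sums for a 4-tap filter and evaluates the output as in equation (4). It is a software model under stated assumptions, not the authors' VHDL: the inputs are treated as signed integers rather than fractions, so the weights are 2^l instead of 2^(-l), and the names `build_lut`, `bit_slice` and `da_output` as well as the example coefficients are hypothetical.

```python
def build_lut(coeffs):
    """Entry `addr` holds the sum of coeffs[k] for every bit k set in addr."""
    return [sum(a for k, a in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def bit_slice(samples, bit, width):
    """LUT address formed by bit `bit` of every sample (2's complement, `width` bits)."""
    mask = (1 << width) - 1
    addr = 0
    for k, x in enumerate(samples):
        addr |= (((x & mask) >> bit) & 1) << k
    return addr


def da_output(coeffs, samples, width):
    """Integer form of equation (4): the sign-bit term is subtracted,
    every other term is a LUT access weighted by its bit position."""
    lut = build_lut(coeffs)
    y = -(lut[bit_slice(samples, width - 1, width)] << (width - 1))
    for bit in range(width - 1):
        y += lut[bit_slice(samples, bit, width)] << bit
    return y


coeffs = [3, -5, 7, 2]            # illustrative coefficients a0..a3
samples = [100, -200, 50, -1]     # x[n], x[n-1], x[n-2], x[n-3]
direct = sum(a * x for a, x in zip(coeffs, samples))
print(da_output(coeffs, samples, width=18) == direct)
```

No multiplication appears in `da_output`: each clock cycle of the bit-serial circuit corresponds to one table access and one addition.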


3. Shifter/adder unit:
This stage consists of an accumulator and a shifter. We
introduced a modification to the traditional scheme to obtain
better results in terms of area usage. The traditional scheme
gets a partial term beginning with the LSB of the input and
then shifts this to the right to add it to the next partial result.
This involves extra logic to take the LSB of each partial
result out and to prepare the result for the summation with the
next partial term. The steps are shown in Figure 2.

      1001          (accumulated value)
  +   10011         (current partial term)
      :             (coming partial terms)
  ---------------
      111001111     (final result)

Fig. 2. Traditional shift/accumulate process
beginning with the LSB of the input.

The new scheme first begins with the MSB of the input
and shifts each partial result to the left, avoiding the logic
necessary to manipulate the LSB at each iteration. The
number of bits in the addition is increased by 1 bit to maintain
the full precision of the output. This is shown in Figure 3.
We have to note that, given equation (4), the first partial
result must be subtracted in the shifter/adder structure. For
this reason, the main component of this stage is actually an
adder/subtractor unit.
An element taken into account was the usage of a sign
extension unit. Given that we are dealing with signed digits
and that the partial terms have evidently fewer bits than the
accumulated result, one needs to fill each partial term with
the necessary 0s or 1s (depending on whether this partial term
is positive or negative, respectively) to correctly add both
numbers.

      :             (coming partial terms)
      10011         (current partial term)
  +   11010         (accumulated value)
  ---------------
      111001111     (final result)

Fig. 3. Proposed shift/accumulate process
beginning with the MSB of the input.

4. Control unit:
This unit controls the other circuit components and the
whole circuit behaviour. It is a counter whose upper limit
depends basically on the input precision and defines the
circuit throughput. In contrast to other methods, an advantage
of Distributed Arithmetic is that the throughput in DA-based
architectures is independent of the order of the filter.

5. Adder-tree structure:
It consists of a tree of adders that perform addition on
partial results of the 4-input filter units.
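The MSB-first accumulation described for the shifter/adder unit can be sketched in Python as follows. This is only a behavioural model under assumptions: the partial terms are the LUT outputs for successive bit positions, the hardware registers and sign-extension logic are not modelled, and `accumulate_lsb_first` is a plain weighted-sum reference rather than a model of the right-shifting circuit. Both function names are hypothetical.

```python
def accumulate_msb_first(partial_terms):
    """partial_terms[0] is the LUT output for the sign bits and is subtracted;
    each later term is added after shifting the accumulator left by one bit."""
    acc = -partial_terms[0]
    for term in partial_terms[1:]:
        acc = (acc << 1) + term
    return acc


def accumulate_lsb_first(partial_terms):
    """Reference result: terms arrive LSB first, each weighted by its bit
    position; the last one (sign bits) is subtracted."""
    acc = 0
    last = len(partial_terms) - 1
    for position, term in enumerate(partial_terms):
        weighted = term << position
        acc += -weighted if position == last else weighted
    return acc


# One coefficient a = 3 and a 4-bit input x = -5 (binary 1011):
# the partial terms are a * bit, listed MSB first.
terms_msb_first = [3, 0, 3, 3]
print(accumulate_msb_first(terms_msb_first))         # 3 * (-5)
print(accumulate_lsb_first(terms_msb_first[::-1]))   # same value, LSB-first order
```

In this model the left shift touches only the accumulator, so no bit of a partial result has to be extracted between iterations, which mirrors the saving in logic argued for above.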

4. IMPLEMENTATION
To evaluate the performance of the proposed scheme, 4,
16 and 64-tap asymmetric low-pass FIR filters were
implemented and synthesized using Altera software Quartus
II on a Stratix device, and the results were compared to
implementations presented by Yoo et al. [1]. The precisions
of the inputs and coefficients were 18 and 16 bits, respectively.
In addition, to compare performance, the scheme
presented in [1] was implemented for the case of 4, 16 and 64
taps, introducing modifications in the input registers and
replacing the traditional accumulator by the one presented in
this work.
First, the filters were designed using the Remez method
in Matlab 7 (R14). The stopband was defined at approximately
0.28 rad/sec, and the error minimization in the passband
was weighted 10 times more heavily than in the stopband.
The coefficients were truncated to 4 decimals of precision
(5 decimals in 64-tap case) and scaled to signed integer
numbers with 16 bits of precision. These coefficients were
used to implement the FIR filters in Quartus II. The
frequency responses of the designed filters are shown in
Figure 4.
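The coefficient preparation step can be sketched as follows. The source does not state the scale factor used, so this sketch assumes a Q15 scaling (multiplication by 2^15 with saturation to the signed 16-bit range); the function name `quantize_coefficients` and the example values are likewise hypothetical.

```python
import math


def quantize_coefficients(coeffs, decimals=4, bits=16):
    """Truncate each coefficient to `decimals` decimal places, then scale
    to a signed integer of `bits` bits (assumed Q15 scaling for 16 bits)."""
    scale10 = 10 ** decimals
    truncated = [math.trunc(c * scale10) / scale10 for c in coeffs]
    scale = 1 << (bits - 1)
    low, high = -scale, scale - 1
    # Saturate so that a coefficient of +1.0 cannot overflow the word.
    return [max(low, min(high, round(c * scale))) for c in truncated]


# Illustrative values only, not the filter coefficients used in the paper.
print(quantize_coefficients([0.5, -0.25, 0.12345]))
```

Truncating before scaling means that digits beyond the fourth decimal place never influence the integer coefficients loaded into the LUTs.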
To validate the correct functionality of the implemented
circuits, each implementation was simulated with the
simulation tool provided by Quartus II. Tests were done using
random inputs and the results validated with a Matlab code.
The logic flow of the proposed LUT-based circuit can be
summarized as follows:
For the 4-tap FIR filter, the control unit clears the buffers and
then the input collected by the input registers during the
previous 20 clock cycles is serially fed into the circuit.
These bits address a value in the 4-input LUT structure, and
this partial result is accumulated and shifted as previously
explained by the shifter/adder unit, taking into account that
the very first value must be subtracted.
Given that the precision of inputs is 18 bits in our case, the
partial result is accumulated and shifted 18 times with new
values addressed in the 4-input LUT unit. Finally, in the 19th
clock cycle, a signal from the control unit indicates to the
latch structure to output the final result, which is shown every
20 clock cycles.
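Since one output is produced every 20 clock cycles regardless of the number of taps, the sample rate follows directly from the clock frequency. The short Python sketch below works through that arithmetic; the two overhead cycles are inferred from the 18 accumulation cycles and the 20-cycle output period stated above, and the function names are hypothetical.

```python
def cycles_per_output(input_bits, overhead_cycles=2):
    """Clock cycles between outputs: one per input bit plus the assumed
    overhead for clearing the buffers and latching the result."""
    return input_bits + overhead_cycles


def throughput_msamples(clock_mhz, input_bits=18):
    """Output rate in Msamples/sec for a bit-serial filter clocked at clock_mhz."""
    return clock_mhz / cycles_per_output(input_bits)


print(cycles_per_output(18))
print(throughput_msamples(111.0))   # clock frequency reported for the 4-tap filter
```

The number of taps does not appear in either function, which is the sense in which DA throughput is independent of the filter order; the order only affects the achievable clock frequency.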
In the case of 16-tap filters, the logic flow is similar,
except that we now get four partial results, one from each of
the four basic 4-input filter cells. The adder-tree structure
adds these partial results and sends the values to the
shifter/adder unit, which again accumulates and shifts the
values. Similarly, in the 19th clock cycle, a signal from the
control unit indicates to the latch structure to output the final
result, which is shown every 20 clock cycles.
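The 16-tap flow can be modelled in Python as four 16-word LUT cells whose outputs are summed before a single shared accumulation. This is a behavioural sketch under assumptions (integer inputs, MSB-first accumulation, a plain sum standing in for the adder tree); `build_lut` and `partitioned_da` are hypothetical names and the test vectors are illustrative.

```python
def build_lut(coeffs):
    """Entry `addr` holds the sum of coeffs[k] for every bit k set in addr."""
    return [sum(a for k, a in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def partitioned_da(coeffs, samples, width, group=4):
    """Bit-serial DA with one small LUT per group of taps.

    Returns the filter output and the total number of LUT words used."""
    assert len(coeffs) == len(samples) and len(coeffs) % group == 0
    mask = (1 << width) - 1
    luts = [build_lut(coeffs[i:i + group]) for i in range(0, len(coeffs), group)]
    acc = 0
    for bit in range(width - 1, -1, -1):           # sign bit first
        partial = 0
        for cell, lut in enumerate(luts):           # adder tree over the cells
            addr = 0
            for k in range(group):
                x = samples[cell * group + k] & mask
                addr |= ((x >> bit) & 1) << k
            partial += lut[addr]
        if bit == width - 1:
            acc = -partial                          # first term is subtracted
        else:
            acc = (acc << 1) + partial              # shift left, then add
    return acc, sum(len(lut) for lut in luts)


coeffs = list(range(-8, 8))
samples = [(-1) ** i * (1000 * i + 7) for i in range(16)]
y, words = partitioned_da(coeffs, samples, width=18)
print(y == sum(a * x for a, x in zip(coeffs, samples)), words)
```

For 16 taps the partitioned structure stores 4 x 16 = 64 words, whereas a single table addressed by all 16 taps would need 2^16 words.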

Finally, for the 64-tap filter, every four 4-input filter cells
are grouped into a 16-tap filter, resulting in four of these 16-tap
blocks. The four partial results provided by these blocks are
added together by the adder-tree structure. This partial result
is sent to the shifter/adder unit, which again accumulates and
shifts the values 18 times. Similarly, in the 19th clock cycle, a
signal from the control unit indicates to the latch structure to
output the final result, which appears at the output every 20
clock cycles.

[Figure: three panels, "Frequency response: 4-tap low-pass FIR filter",
"Frequency response: 16-tap low-pass FIR filter" and "Frequency
response: 64-tap low-pass FIR filter"; axes: Magnitude versus
Frequency (rad/sec).]

Fig. 4. Frequency responses of 4, 16 and 64-tap low-pass FIR filters
designed with the Remez method.

5. RESULTS
Yoo et al. [1] implemented the LUT-less DA version to
compare area performance with typical LUT-based DA
schemes. The authors in [1] used an Altera Stratix
EP1S80F1508C6 FPGA device and tested the
implementation with different FIR filter lengths (from 4 to
1024 taps). The results for 4, 16 and 64-tap filters are shown in
Table 1.

                              Filter length K (taps)
                                4       16       64
LUT-based DA [1]   LE          272      551     1639
                   memory      344     1376     5504
LUT-less DA [1]    LE          210      367      887
                   memory       56      224      896

Table 1. Area requirements of LUT-based and LUT-less DA schemes
presented in [1] for different filter length implementations
on an Altera FPGA device.

The authors in [1] claimed that the proposed LUT-less DA
scheme achieves a reduction of 22%, 33% and 45% in terms
of logic elements (LE) in comparison with the traditional DA
method for a 4, 16 and 64-tap FIR filter, respectively. In
terms of memory requirements, the savings are about 33% in
each implementation.
In the present work, the results obtained using the Quartus
II compilation tool are shown in Table 2. The device selected
for all the implementations was the EP1S10F484C5 FPGA of the
Altera Stratix family.

                                   Filter length K (taps)
                                     4       16       64
Proposed LUT-based DA   LE          139      233      586
                        memory       72      288     1152
LUT-less DA             LE          128      255      630
                        memory       72      288     1152

Table 2. Area requirements of the proposed LUT-based DA and
the modified LUT-less DA for different filter length implementations
on an Altera FPGA device.

As we can see, these results show that the proposed LUT-based
bit-serial scheme is superior in area performance to the
LUT-less scheme in most of the cases (only in the case of a
4-tap filter does the LUT-less scheme obtain higher efficiency).
However, we have determined that our proposed approach
is in all cases more efficient. This is basically because a
2^4-word LUT with 16-bit coefficients consumes at most 18 LEs,
while the equivalent in a LUT-less implementation (four 16-bit
MUX2x1, two 16-bit adders with 17-bit output, and one
17-bit adder with 18-bit output) may consume about 21 LEs
or even more, depending on coefficient values.
The achievements of LUT-less DA in Table 2 correspond
to implementations which truncate coefficient operations to
16 bits, which means a LUT-less cell with four 16-bit
MUX2x1 and three 16-bit adders with 16-bit output.
Furthermore, the whole filter precision is affected and
reduced from 37-bit to just 33-bit output, which introduces
further savings. This is possible because for the case of 16-bit
coefficients a LUT would need at least 19 bits in the output
(worst case a0+a1+a2+a3, in Figure 1), while a LUT-less
scheme without the restriction of a LUT could be
implemented with the minimum possible precision that does
not produce an overflow in the data (16-bit output precision
with the used coefficients). However, this argument is not
valid for comparing both architectures given that if it is
possible to limit the precision in the adders without producing
an overflow in the output, then it is possible to reduce the
precision in the LUT unit, too. Say, if 16-bit coefficients in
the aforementioned worst case (a0+a1+a2+a3) do not produce
an overflow using an output of 16 bits in a LUT-less scheme,
then one can reduce the precision of the LUT to 16 bits
following the same criteria.
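The argument that the LUT word length can be chosen from the actual coefficient values can be illustrated with a small Python sketch that computes the narrowest 2's complement width able to hold every entry of a 4-input LUT. The function names and the example coefficients are hypothetical, and the sketch only considers the stored sums, not any extra bit the adder/subtractor stage may need.

```python
def min_signed_bits(value):
    """Smallest 2's complement word length that can represent `value`."""
    if value >= 0:
        return value.bit_length() + 1
    return (-value - 1).bit_length() + 1


def lut_word_length(coeffs):
    """Word length needed to store every sum combination of `coeffs`."""
    entries = [sum(a for k, a in enumerate(coeffs) if (addr >> k) & 1)
               for addr in range(1 << len(coeffs))]
    return max(min_signed_bits(e) for e in entries)


# Illustrative 16-bit coefficients whose sums stay well inside 16 bits.
small = [1000, -2000, 3000, 500]
print(lut_word_length(small))
```

With coefficients like these, the stored sums need fewer bits than the coefficients themselves, so the same reduced precision that benefits the LUT-less adders is available to the LUT unit.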
Under this analysis, we consider that a fair comparison
between both techniques would involve the use of a LUT-less
cell which permits maximum precision, i.e. four 16-bit
MUX2x1, two 16-bit adders with 17-bit output and one 17-bit
adder with 18-bit output, and final filter precision of 37, 39
and 41 bits for 4, 16 and 64-tap filters, respectively. Results
of this comparison are presented in Table 3.

                                   Filter length K (taps)
                                     4       16       64
Proposed LUT-based DA   LE          139      233      586
                        memory       72      288     1152
LUT-less DA             LE          139      291      788
                        memory       72      288     1152

Table 3. Area requirements for FIR filter implementations of
the proposed LUT-based DA and the modified LUT-less DA
(full precision for both schemes).

With regard to bandwidth performance, the results
obtained from the compilation tool and corroborated with
simulations are shown in Table 4.

                                                       Filter length K (taps)
                                                       4          16         64
Proposed LUT-based DA   Max. frequency               111 MHz    55.9 MHz   39.9 MHz
                        Throughput (Msamples/sec)    5.6        2.8        -
LUT-less DA             Max. frequency               111 MHz    46.7 MHz   34.6 MHz
                        Throughput (Msamples/sec)    5.6        2.3        1.7

Table 4. Frequency / throughput performance of the proposed LUT-based
DA and the modified LUT-less DA (full precision for both schemes)

As we can see in Tables 3 and 4, for low-order FIR filters
(4-tap) the LUT-based and the LUT-less schemes perform
equivalently in terms of both area and bandwidth, but at
higher orders the difference grows in favour of the
presented LUT-based DA scheme, mainly in area
consumption.
6. CONCLUSIONS
We have successfully implemented highly efficient 4, 16
and 64-tap bit-serial DA FIR filters, using both a bit-serial
LUT-based DA scheme and a modified LUT-less scheme.
Furthermore, both schemes have been further improved in
area performance with the efficient usage of RAM resources,
mainly in the input registers unit. A modified version of the
accumulator, which is intended for low-precision
applications, has also been introduced. We have developed 4,
16 and 64-tap LUT-based FIR filters that require only 139,
233 and 586 LEs, respectively. On the other hand, we have
shown that the LUT-based scheme is still more efficient than
the LUT-less architecture proposed in [1], for all the
implemented FIR filter orders. In this sense, it has been
determined that the argument to save resources in a LUT-less
scheme can be applied to the LUT-based structure too if one
follows the same criteria of reducing the word-length in the
LUT unit depending on the filter coefficients used.
Finally, the throughput performance of each
implementation has been determined and compared. In this
case, the performance is slightly superior in the LUT-based
approach. Further improvements can be reached using
pipelining techniques, which would improve the bandwidth
performance at the cost of some extra hardware.

REFERENCES
[1] H. Yoo, and D. Anderson, "Hardware-Efficient Distributed Arithmetic
Architecture for High-Order Digital Filters," in Proc. IEEE International
Conference on Acoustics, Speech, and Signal Processing (ICASSP '05),
2005, Vol. 5, pp. 125-128.
[2] Martinez-Peiro, J. Valls, T. Sansaloni, A.P. Pascual, and E.I. Boemo, "A
Comparison between Lattice, Cascade and Direct Form FIR Filter Structures
by using a FPGA Bit-Serial DA Implementation," in Proc. IEEE
International Conference on Electronics, Circuits and Systems, 1999, Vol. 1,
pp. 241-244.
[3] K-H. Tan, W.F. Leong, S. Kadam, M.A. Soderstrand, and L.G. Johnson,
"Public-Domain Matlab Program to Generate Highly Optimized VHDL for
FPGA Implementation," in IEEE International Symposium on Circuits and
Systems, 2001, Vol. 4, pp. 514-517.
[4] M. Yamada, and A. Nishihara, "High-Speed FIR Digital Filter with CSD
Coefficients Implemented on FPGA," in Proc. IEEE Design Automation
Conference (ASP-DAC 2001), 2001, pp. 7-8.
[5] M.A. Soderstrand, L.G. Johnson, H. Arichanthiran, M. Hoque, and R.
Elangovan, "Reducing Hardware Requirement in FIR Filter Design," in
Proc. IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP '00), 2000, Vol. 6, pp. 3275-3278.
[6] R. Grover, W. Shang, and Q. Li, "A Faster Distributed Arithmetic
Architecture for FPGAs," in Proc. ACM/SIGDA 10th International
Symposium on Field-Programmable Gate Arrays, 2002, pp. 31-39.
[7] V. Pasham, A. Miller, and K. Chapman, "Transposed Form FIR Filters,"
XILINX Application Note, 2001.
[8] "Implementing FIR Filters in FLEX Devices," ALTERA Application
Note, 1998.
[9] G. R. Goslin, "A Guide to Using Field Programmable Gate Arrays (FPGAs)
for Application-Specific Digital Signal Processing Performance,"
XILINX, 1995.
[10] D.J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. Anderson, "A
Novel High Performance Distributed Arithmetic Adaptive Filter
Implementation on an FPGA," in Proc. IEEE Int. Conference on Acoustics,
Speech, and Signal Processing (ICASSP '04), 2004, Vol. 5, pp. 161-164.
[11] A. Croisier, D. J. Esteban, M. E. Levilion, and V. Rizo, "Digital Filter
for PCM Encoded Signals," U.S. Patent No. 3,777,130, issued April, 1973.