1. INTRODUCTION
In the last few years, there has been a growing trend to
implement DSP functions in Field Programmable Gate
Arrays (FPGAs), which offer a balanced solution in
comparison with traditional devices. Although ASICs and
DSP chips have been the traditional solution for high-performance applications, now the technology and the market
are imposing new rules. On one hand, high development
costs and time-to-market factors associated with ASICs can
be prohibitive for certain applications and, on the other hand,
programmable DSP processors can be unable to reach a
desired performance due to their sequential-execution
architecture. In this context, FPGAs offer a very attractive
solution that balances high flexibility, time-to-market, cost and
performance.
In that sense, the research community has put great effort
into designing efficient architectures for DSP functions such as
FIR filters, which are extensively used in multiple
applications in telecommunications, wireless/satellite
communications, video and audio processing, biomedical
signal processing and many others.
Traditionally, design methods were mainly focused on
multiplier-based architectures to implement the Multiply-and-Accumulate (MAC) blocks that constitute the central piece in
FIR filters and several DSP functions:
$$y[n] = \sum_{k=0}^{K-1} a_k\, x[n-k] \tag{1}$$
0-7803-9754-1/06/$20.00 ©2006 IEEE
$$y[n] = \sum_{k=0}^{K-1} a_k\, x[n-k] \tag{2}$$
where:
x and y are two vectors of size K that represent the input and
transformed data, respectively.
$a_k$ is the set of constant coefficients of the filter.
K is the number of taps of the FIR filter.
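As a reference point for the multiplier-based formulation, the sum in equation (2) can be sketched directly in Python. This is only an illustrative software model, not the hardware architecture discussed in this paper; the function name `fir_direct` and the zero initial state (x[m] = 0 for m < 0) are assumptions made for the example.

```python
def fir_direct(coeffs, samples):
    """Direct-form FIR: y[n] = sum over k of a_k * x[n-k].

    Assumption for this sketch: samples before the start of the
    input are taken to be zero.
    """
    K = len(coeffs)
    out = []
    for n in range(len(samples)):
        acc = 0
        for k in range(K):
            if n - k >= 0:
                # One multiply-and-accumulate (MAC) operation per tap
                acc += coeffs[k] * samples[n - k]
        out.append(acc)
    return out


if __name__ == "__main__":
    # An impulse at n = 0 reproduces the coefficients one by one
    print(fir_direct([1, 2, 3, 4], [1, 0, 0, 0, 5]))
```

Each output sample costs K multiplications here, which is the cost that the multiplier-less schemes discussed next try to avoid.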
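$$x[n-k] = -b_{k,0} + \sum_{l=1}^{L-1} b_{k,l}\, 2^{-l} \tag{3}$$

$$y[n] = \sum_{k=0}^{K-1} a_k \left( -b_{k,0} + \sum_{l=1}^{L-1} b_{k,l}\, 2^{-l} \right)$$

$$y[n] = -\sum_{k=0}^{K-1} \left( a_k\, b_{k,0} \right) + \sum_{l=1}^{L-1} \left[ \sum_{k=0}^{K-1} a_k\, b_{k,l} \right] 2^{-l} \tag{4}$$

2. LUT unit: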
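Equation (4) replaces the K multiplications by table lookups: for each bit position l, the K bits b_{k,l} form an address into a table that holds the corresponding sum of coefficients. The following Python sketch illustrates this idea under stated assumptions. The inputs are treated as L-bit two's-complement integers, so the result is the value of equation (4) scaled by 2^(L-1). The names `build_lut` and `da_inner_product` are hypothetical and do not come from the original design.

```python
def build_lut(coeffs):
    """Entry `addr` holds the sum of the a_k whose bit k of `addr` is 1."""
    K = len(coeffs)
    return [sum(coeffs[k] for k in range(K) if (addr >> k) & 1)
            for addr in range(1 << K)]


def da_inner_product(coeffs, xs, L):
    """Evaluate sum_k a_k * x_k by distributed arithmetic.

    Assumption: each x_k is an L-bit two's-complement integer, so
    b_{k,0} is its sign bit and b_{k,l} has weight 2**(L-1-l).
    """
    lut = build_lut(coeffs)
    mask = (1 << L) - 1
    words = [x & mask for x in xs]        # two's-complement bit patterns
    total = 0
    for l in range(L):
        shift = L - 1 - l
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> shift) & 1) << k   # gather bit l of every input
        term = lut[addr] << shift
        # The sign-bit term (l = 0) enters with a minus sign, as in equation (4)
        total += -term if l == 0 else term
    return total


if __name__ == "__main__":
    a = [3, -5, 7, 2]
    x = [-8, 7, -1, 4]
    print(da_inner_product(a, x, L=4), sum(ai * xi for ai, xi in zip(a, x)))
```

For K = 4 the table has 2^4 = 16 entries, and the loop needs L lookups regardless of the coefficient values.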
3. Shifter/adder unit:
5. Adder-tree structure:
[Figure: bit-level example of the shift-and-add accumulation, with rows labeled accumulated value, current partial term, coming partial terms and final result.]
The new scheme begins with the MSB of the input
and shifts each partial result to the left, avoiding the logic
necessary to manipulate the LSB at each iteration. The
number of bits in the addition is increased by 1 bit to maintain
full precision of the output. This is shown in Figure 3.
Note that, given equation (4), the first partial
result must be subtracted in the shifter/adder structure. For
this reason, the main component of this stage is actually an
adder/subtractor unit.
Another element taken into account was the use of a sign
extension unit. Given that we are dealing with signed values
and that the partial terms have fewer bits than the
accumulated result, each partial term must be padded with
the necessary 0s or 1s (depending on whether the partial term is
positive or negative, respectively) to add both numbers correctly.
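The MSB-first accumulation and the role of the sign extension can be illustrated with a small bit-level model. This is a hedged sketch rather than the actual circuit: the register widths, the helper names (`sign_extend`, `to_signed`, `shift_accumulate`) and the example values are assumptions chosen for illustration.

```python
def sign_extend(value, from_bits, to_bits):
    """Widen a two's-complement bit pattern by replicating its sign bit."""
    sign = (value >> (from_bits - 1)) & 1
    if sign:
        # Negative term: fill the upper bits with 1s
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    # Positive term: the upper bits stay 0
    return value & ((1 << to_bits) - 1)


def to_signed(pattern, bits):
    """Interpret a bit pattern of the given width as a signed integer."""
    return pattern - (1 << bits) if pattern >> (bits - 1) else pattern


def shift_accumulate(partials, term_bits, acc_bits):
    """MSB-first accumulation: acc = 2*acc +/- term, modulo 2**acc_bits.

    partials[0] is the term addressed by the sign bits of the inputs,
    so it is subtracted; all later terms are added.
    """
    mask = (1 << acc_bits) - 1
    acc = 0
    for i, term in enumerate(partials):
        ext = sign_extend(term & ((1 << term_bits) - 1), term_bits, acc_bits)
        acc = (acc << 1) & mask                      # shift left by one bit
        acc = (acc - ext) & mask if i == 0 else (acc + ext) & mask
    return to_signed(acc, acc_bits)


if __name__ == "__main__":
    # -(-5)*4 + 3*2 + (-2)
    print(shift_accumulate([-5, 3, -2], term_bits=4, acc_bits=8))
```

Without the sign extension, a negative partial term would be added as a large positive number, which is why the padding bits must follow the sign of the term.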
[Figure: bit-level example of adding a partial term to the accumulated value.]
4. Control unit:
This unit controls the other circuit components and the
whole circuit behaviour. It is a counter whose upper limit
depends basically on the input precision and defines the
circuit throughput. In contrast to other methods, an advantage
of Distributed Arithmetic is that the throughput in DA-based
architectures is independent of the order of the filter.
4. IMPLEMENTATION
To evaluate the performance of the proposed scheme, 4-,
16- and 64-tap asymmetric low-pass FIR filters were
implemented and synthesized using Altera's Quartus II
software on a Stratix device, and the results were compared to
the implementations presented by Yoo et al. [1]. The precision
of the inputs and coefficients was 18 and 16 bits, respectively.
In addition, to compare performance, the scheme
presented in [1] was implemented for the case of 4, 16 and 64
taps, introducing modifications in the input registers and
replacing the traditional accumulator by the one presented in
this work.
First, the filters were designed using the Remez method
in Matlab 7 (R14). The stopband was defined at approximately
0.28 rad/sec, and the error-minimization weight in the passband
was set 10 times greater than in the stopband.
The coefficients were truncated to 4 decimals of precision
(5 decimals in 64-tap case) and scaled to signed integer
numbers with 16 bits of precision. These coefficients were
used to implement the FIR filters in Quartus II. The
frequency responses of the designed filters are shown in
Figure 4.
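The truncation and scaling step can be sketched as follows. The text does not state the exact scaling rule, so this example assumes that coefficients with magnitude below 1 are multiplied by 2^15, rounded and clamped to the signed 16-bit range; the function name `quantize_coefficients` and the sample values are hypothetical.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN


def quantize_coefficients(coeffs, decimals=4, bits=16):
    """Truncate each coefficient to `decimals` places, then scale to integers.

    Assumption: the scale factor is 2**(bits-1), with results clamped to
    the signed range of a `bits`-bit word.
    """
    step = Decimal(1).scaleb(-decimals)          # e.g. 0.0001 for 4 decimals
    scale = 1 << (bits - 1)
    lo, hi = -scale, scale - 1
    truncated, integers = [], []
    for c in coeffs:
        t = Decimal(str(c)).quantize(step, rounding=ROUND_DOWN)  # truncate
        q = int((t * scale).to_integral_value(rounding=ROUND_HALF_EVEN))
        truncated.append(t)
        integers.append(max(lo, min(hi, q)))     # clamp to signed range
    return truncated, integers


if __name__ == "__main__":
    trunc, ints = quantize_coefficients([0.123456, -0.25, 0.99999])
    print(trunc, ints)
```

The `decimal` module is used instead of floating-point arithmetic so that the truncation to a fixed number of decimals is exact.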
To validate the correct functionality of the implemented
circuits, each implementation was simulated with the
simulation tool provided by Quartus II. Tests were done using
random inputs, and the results were validated against Matlab code.
The logic flow of the proposed LUT-based circuit can be
summarized as follows:
For the 4-tap FIR filter, the control unit clears the buffers and
then the input collected by the input registers during the
previous 20 clock cycles is serially injected into the circuit.
These bits address a value in the 4-input LUT structure, and
this partial result is accumulated and shifted as previously
explained by the shifter/adder unit, taking into account that
the very first value must be subtracted.
Given that the precision of inputs is 18 bits in our case, the
partial result is accumulated and shifted 18 times with new
values addressed in the 4-input LUT unit. Finally, in the 19th
clock cycle, a signal from the control unit indicates to the
latch structure to output the final result, which is shown every
20 clock cycles.
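Since one output is produced every 20 clock cycles for 18-bit inputs, the sample rate follows directly from the clock frequency. The short sketch below makes this relation explicit. The two overhead cycles are an assumption inferred from the 18-bit and 20-cycle figures above, and the 111 MHz clock is used purely as an example value.

```python
def cycles_per_sample(input_bits, overhead_cycles=2):
    """Clock cycles between two consecutive outputs of the bit-serial filter.

    overhead_cycles=2 is an assumption inferred from the 18-bit input and
    the 20-cycle output period; it is not a documented design parameter.
    """
    return input_bits + overhead_cycles


def throughput_msps(f_max_mhz, input_bits, overhead_cycles=2):
    """Output rate in Msamples/sec. The number of taps does not appear."""
    return f_max_mhz / cycles_per_sample(input_bits, overhead_cycles)


if __name__ == "__main__":
    rate = throughput_msps(111.0, input_bits=18)
    print(f"{rate:.2f} Msamples/sec")
```

The filter order enters only indirectly, through the maximum clock frequency that the larger circuit can sustain.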
In the case of 16-tap filters, the logic flow is similar,
except that we now obtain four partial results, one from each of
the four basic 4-input filter cells. The adder-tree structure
adds these partial results and sends the values to the
shifter/adder unit, which again accumulates and shifts the
values. Similarly, in the 19th clock cycle, a signal from the
control unit indicates to the latch structure to output the final
result, which is shown every 20 clock cycles.
[Figure 4: frequency responses of the designed filters, plotted as Magnitude versus Frequency (rad/sec).]

         LUT-based DA [1]       LUT-less DA [1]
Taps     LE       memory        LE       memory
4        272      344           210      56

         Proposed LUT-based DA  LUT-less DA
Taps     LE       memory        LE       memory
4        139      72            128      72
Finally, for the 64-tap filter, the 4-input filter cells are
grouped four at a time into 16-tap filters, resulting in four of these 16-tap
blocks. The four partial results provided by these blocks are
added together by the adder-tree structure. This partial result
is sent to the shifter/adder unit, which again accumulates and
shifts the values 18 times. Similarly, in the 19th clock cycle, a
signal from the control unit indicates to the latch structure to
output the final result, which appears in the output every 20
clock cycles.
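The hierarchical organization described above for the 4-, 16- and 64-tap filters can be summarized in a small software model. This is a sketch under assumptions: each basic cell is modeled as a 16-entry table addressed by four input bits, and the adder tree sums four values per stage. The function names are hypothetical, and in hardware the tables would be precomputed rather than rebuilt on every call.

```python
def build_lut(coeffs):
    """Table of all partial sums of the given coefficients (2**len entries)."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def cell_outputs(coeffs, bits, cell_size=4):
    """One lookup per basic 4-input filter cell."""
    outs = []
    for i in range(0, len(coeffs), cell_size):
        lut = build_lut(coeffs[i:i + cell_size])
        addr = sum(b << k for k, b in enumerate(bits[i:i + cell_size]))
        outs.append(lut[addr])
    return outs


def adder_tree(values, fan_in=4):
    """Sum the cell outputs stage by stage, `fan_in` values at a time."""
    while len(values) > 1:
        values = [sum(values[i:i + fan_in])
                  for i in range(0, len(values), fan_in)]
    return values[0]


def partial_term(coeffs, bits):
    """Partial result for one bit position, sent on to the shifter/adder."""
    return adder_tree(cell_outputs(coeffs, bits))


if __name__ == "__main__":
    a = [(-1) ** k * (k + 1) for k in range(64)]   # arbitrary test coefficients
    b = [1 if k % 3 == 0 else 0 for k in range(64)]
    print(partial_term(a, b), sum(c for c, bit in zip(a, b) if bit))
```

With 4 taps the tree is empty, with 16 taps it has one stage, and with 64 taps it has two stages, which mirrors the three cases in the text.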
5. RESULTS
Yoo et al. [1] implemented the LUT-less DA version to
compare area performance with typical LUT-based DA
schemes. The authors in [1] used an Altera Stratix
EP1S80F1508C6 FPGA device and tested the
implementation with different FIR filter lengths (from L=4 to
1024). The results for 4, 16 and 64-tap filters are shown in
Table 1.
As we can see, these results show that the proposed LUT-based bit-serial scheme is superior in area performance to the
LUT-less scheme in most of the cases (only in the case of a
4-tap filter, the LUT-less scheme obtains higher efficiency).
However, we have determined that our proposed approach
is in all cases more efficient. This is basically because a 2^4-word LUT with 16-bit coefficients consumes at most 18 LEs,
while the equivalent in a LUT-less implementation (four 16-bit MUX2x1, two 16-bit adders with 17-bit output, and one
17-bit adder with 18-bit output) may consume about 21 LEs
or even more, depending on coefficient values.
The results for LUT-less DA in Table 2 correspond
to implementations that truncate coefficient operations to
16 bits, which means a LUT-less cell with four 16-bit
MUX2x1 and three 16-bit adders with 16-bit output.
Furthermore, the whole filter precision is affected and
reduced from 37-bit to just 33-bit output, which introduces
further savings. This is possible because for the case of 16-bit
coefficients a LUT would need at least 19 bits in the output
(worst case a0+a1+a2+a3, in Figure 1), while a LUT-less
scheme without the restriction of a LUT could be
implemented with the minimum possible precision that does
not produce an overflow in the data (16-bit output precision
with the used coefficients). However, this argument is not
valid for comparing both architectures given that if it is
         Proposed LUT-based DA  LUT-less DA
Taps     LE       memory        LE       memory
4        139      72            139      72
Max. frequency               111 MHz    55.9 MHz    39.9 MHz
Throughput (Msamples/sec)    5.6        2.8
Max. frequency               111 MHz    46.7 MHz    34.6 MHz
Throughput (Msamples/sec)    5.6        2.3         1.7
REFERENCES
[1] H. Yoo and D. Anderson, "Hardware-Efficient Distributed Arithmetic
Architecture for High-Order Digital Filters," in Proc. IEEE International
Conference on Acoustics, Speech, and Signal Processing (ICASSP '05),
2005, Vol. 5, pp. 125-128.
[4] M. Yamada and A. Nishihara, "High-Speed FIR Digital Filter with CSD
Coefficients Implemented on FPGA," in Proc. IEEE Design Automation
Conference (ASP-DAC 2001), 2001, pp. 7-8.
[5] M.A. Soderstrand, L.G. Johnson, H. Arichanthiran, M. Hoque, and R.
Elangovan, "Reducing Hardware Requirement in FIR Filter Design," in
Proc. IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP '00), 2000, Vol. 6, pp. 3275-3278.