
A Modified Radix-2^4 SDF Pipelined OFDM Module for FPGA based MB-OFDM UWB Systems


M.Santhi, S.Arun Kumar, G.S.Praveen Kalish, K.Murali, S.Siddharth, G.Lakshminarayanan
Department of ECE, National Institute of Technology, Thiruchirapalli.
santhiphd@gmail.com, laksh@nitt.edu

Abstract - The OFDM module in the MB-OFDM UWB transmitter must operate at 528 MHz. This is a challenging task because the OFDM module in the UWB system has to calculate a 128-point IFFT. Earlier papers used the radix-2^4 SDF algorithm with parallel processing architectures of block size two to achieve the required speed and implemented the module on ASIC. In this paper a novel scheme, the "modified radix-2^4 SDF algorithm", is proposed for calculating the 128-point IFFT. In the proposed scheme, the order of the twiddle factor sequence differs from that of the earlier radix-2^4 SDF algorithm. The change in twiddle factor sequence makes the CSD multiplier used for the IFFT calculation easier to implement. It is also proposed that the required speed can be achieved on an FPGA itself without using parallel processing architectures, by pipelining the OFDM module and by using LPMs. This leads to a reduction in area compared to the earlier approach of using parallel processing architectures of block size two. To improve accuracy, the internal wordlength in the proposed scheme is maintained at 13 bits, which is 7 bits more than the input, to account for the overflows at each of the 7 stages of the OFDM module. The proposed scheme, with increased complexity for better accuracy, is tested on an ALTERA Stratix III EP3SL50F484C2 device. From the implementation, it is verified that the OFDM module achieves a maximum speed of 528 MSamples/s. Since ASICs are in general three times faster than FPGAs, operating an ASIC based OFDM module at 528 MHz with the proposed modified radix-2^4 SDF pipelined algorithm is much easier.

Keywords - MB-OFDM, SDF, FFT, FPGA.

I. INTRODUCTION

Ultra wideband (UWB) communication systems, which enable the delivery of data from a rate of 110 Mb/s at a distance of 10 m to a rate of 480 Mb/s at a distance of 2 m, are ideally suited to short range wireless communications because they can share a frequency band with existing narrowband systems and offer a higher data rate than 802.11 or Bluetooth [1]. One of the communication methods for the IEEE 802.15.3a standard is Multiband Orthogonal Frequency Division Multiplexing (MB-OFDM), which offers 528 MHz bandwidth [2][3]. MB-OFDM-based UWB not only provides reliable high-data-rate transmission in time-dispersive or frequency-selective channels without complex time-domain channel equalizers but also provides high spectral efficiency.

The FFT/IFFT processor is one of the modules having high computational complexity in the physical layer of the UWB system, and the execution time of the 128-point FFT/IFFT in the UWB system is only 312.5 ns. Power consumption and hardware cost can be saved in our processor by using the higher radix FFT algorithm and less memory and fewer complex multipliers.

This paper is organized as follows. Section II describes the design issues of MB-OFDM UWB communication systems. Section III describes the proposed 128-point radix-2^4 FFT/IFFT algorithm. Section IV describes the proposed 128-point radix-2^4 FFT/IFFT architecture. In Section V, the implementation and performance of the proposed FFT/IFFT architecture are discussed. Conclusions and further work are presented in Sections VI and VII respectively.

II. DESIGN ISSUES OF THE FFT PROCESSOR

A block diagram of the proposed physical layer of the OFDM-based UWB system is shown in Fig. 1 [4]. In the UWB system, the data rate is from 53.3 Mb/s to 480 Mb/s with code rates of 1/3, 11/32, 1/2, 5/8, and 3/4. The bandwidth of the transmitted signal is 528 MHz and the OFDM symbol duration is 312.5 ns, including 60.61 ns for cyclic prefix duration and 9.47 ns for guard interval duration [2][3]. Thus, an FFT/IFFT has to compute one OFDM symbol within 312.5 ns, and the throughput rate of this specification for the 128-point FFT/IFFT is up to 409.6 MSamples/s.

Various FFT architectures, such as single-memory architecture, dual-memory architecture, pipelined architecture, array architecture, and cached-memory architecture, have been proposed in the last three decades. In our view, the pipelined architecture should be the best choice for UWB systems since it can provide a high throughput rate with acceptable hardware cost.

The pipelined FFT architecture typically falls into one of the following two categories: multipath delay commutator (MDC) and single-path delay feedback (SDF) [5]. In general, the MDC scheme can achieve a higher throughput rate, while the SDF scheme needs less memory and hardware cost. In addition, the higher radix FFT algorithm is difficult to implement in the traditional MDC architecture. Table 1 compares the hardware requirements for various architectures. The proposed architecture, based on the radix-2^4 SDF architecture, was selected for implementation owing to its low hardware cost and greater area efficiency; it also provides a throughput rate sufficient to meet the UWB specifications.
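
The timing figures quoted in Section II can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the variable names are hypothetical, and it assumes a single-path pipeline that accepts one complex sample per clock cycle, which matches the reported 528 MSamples/s at 528 MHz.

```python
# Timing figures taken from the text (MB-OFDM UWB specification values).
FFT_POINTS = 128
SYMBOL_NS = 312.5          # OFDM symbol duration
CYCLIC_PREFIX_NS = 60.61   # cyclic prefix duration
GUARD_NS = 9.47            # guard interval duration
CLOCK_MHZ = 528.0          # sample clock

# Throughput needed if all 128 samples are processed within one symbol period.
required_msps = FFT_POINTS / SYMBOL_NS * 1e3

# Time left for the 128 data samples once prefix and guard are removed.
data_window_ns = SYMBOL_NS - CYCLIC_PREFIX_NS - GUARD_NS

# Assumption: one sample per clock, so 128 samples take 128 clock periods.
fft_burst_ns = FFT_POINTS / CLOCK_MHZ * 1e3

print(f"required throughput : {required_msps:.1f} MSamples/s")
print(f"data window         : {data_window_ns:.2f} ns")
print(f"128 samples @528 MHz: {fft_burst_ns:.2f} ns")
```

Under these assumptions the 128-sample burst at 528 MHz occupies the same time as the symbol duration minus prefix and guard interval, while 409.6 MSamples/s is the average rate over the full symbol.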

Proceedings of the 2008 International Conference on Computing, Communication and Networking (ICCCN 2008)
978-1-4244-3595-1/08/$25.00 ©2008 IEEE

Authorized licensed use limited to: VELLORE INSTITUTE OF TECHNOLOGY. Downloaded on August 3, 2009 at 08:01 from IEEE Xplore. Restrictions apply.
where H(n, k_1, k_2) denotes the second butterfly unit,

$$H(n, k_1, k_2) = B(n, k_1) + (-j)^{(k_1 + 2k_2)} \, B\left(n + \frac{N}{4}, k_1\right)$$

and B(n, k_1) denotes the first butterfly unit as follows.

$$B(n, k_1) = x(n) + (-1)^{k_1} \, x\left(n + \frac{N}{2}\right) \qquad (5)$$
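
Read as a pair, the two butterfly definitions above amount to a 4-point DFT over the samples x(n), x(n + N/4), x(n + N/2) and x(n + 3N/4), evaluated at output index k_1 + 2k_2. The following sketch checks that reading numerically. The function names `bf1`, `bf2` and `dft4` and the test sequence are illustrative assumptions, not taken from the paper.

```python
import cmath

def bf1(x, n, k1, N):
    """First butterfly: B(n, k1) = x(n) + (-1)^k1 * x(n + N/2)."""
    return x[n] + (-1) ** k1 * x[n + N // 2]

def bf2(x, n, k1, k2, N):
    """Second butterfly: H = B(n, k1) + (-j)^(k1 + 2*k2) * B(n + N/4, k1)."""
    trivial = (-1j) ** (k1 + 2 * k2)  # one of 1, -j, -1, +j
    return bf1(x, n, k1, N) + trivial * bf1(x, n + N // 4, k1, N)

def dft4(x, n, k, N):
    """Direct 4-point DFT of x(n + m*N/4), m = 0..3, at output index k."""
    return sum(
        x[n + m * N // 4] * cmath.exp(-2j * cmath.pi * m * k / 4)
        for m in range(4)
    )

N = 128
# Arbitrary deterministic complex test sequence (hypothetical data).
x = [complex((3 * i) % 7 - 3, (5 * i) % 11 - 5) for i in range(N)]

max_err = max(
    abs(bf2(x, n, k1, k2, N) - dft4(x, n, k1 + 2 * k2, N))
    for n in range(N // 4)
    for k1 in (0, 1)
    for k2 in (0, 1)
)
print("largest deviation from the direct 4-point DFT:", max_err)
```

Because the factor (-j)^(k_1 + 2k_2) only takes the values 1, -j, -1 and j, this stage needs no real multiplier, which is why it is called a trivial twiddle factor.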

The algorithm can use complex constant multipliers instead of programmable complex multipliers. The Canonic Signed Digit (CSD) constant multiplier contains the fewest non-zero bits, so it can be used to reduce the area and power consumption [7]. Fig. 2 shows the signal flow graph (SFG) of the 128-point radix-2^4 SDF FFT algorithm.

Fig. 1. Block diagram of the MB-OFDM UWB receiver system

TABLE 1 COMPARISON OF HARDWARE REQUIREMENTS FOR N-LENGTH FFT WITH DIFFERENT ARCHITECTURES

Architecture   Complex multiplier #   Complex adder #   Memory size   Control circuit
R2SDF          log2(N)-2              log2(N)           N-1           simple
R2MDC          log2(N)-2              4log4(N)          3N/2-2        simple
R4SDF          log4(N)-1              log4(N)           N-1           medium
R4MDC          3(log4(N)-1)           8log4(N)          5N/2-4        simple
R2^2SDF        log4(N)-1              4log4(N)          N-1           simple
R2^3SDF        log8(N)-1              4log4(N)          N-1           simple
R2^4SDF        log16(N)-1             4log4(N)          N-1           simple
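
To make the memory column of Table 1 concrete, the sketch below evaluates its formulas for N = 128. Only the memory-size column is evaluated, exactly as written in the table, and the dictionary name is an illustrative choice.

```python
N = 128  # FFT length used in the MB-OFDM UWB system

# Memory size (in complex words) per architecture, copied from Table 1.
memory_words = {
    "R2SDF": N - 1,
    "R2MDC": 3 * N // 2 - 2,
    "R4SDF": N - 1,
    "R4MDC": 5 * N // 2 - 4,
    "R2^2SDF": N - 1,
    "R2^3SDF": N - 1,
    "R2^4SDF": N - 1,
}

for arch, words in memory_words.items():
    print(f"{arch:8s} {words:4d} words")
```

All SDF variants need the same N-1 words, so the choice among them rests on the multiplier and adder columns, while the MDC variants need more memory.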

III. PROPOSED RADIX-2^4 SDF ALGORITHM

A discrete Fourier transform (DFT) of length N (= 128) is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n) \, W_N^{nk}, \quad k = 0, 1, \ldots, N-1 \qquad (1)$$

where W_N, the so-called "twiddle factor", denotes the N-th primitive root of unity, with its exponent evaluated modulo N. Here k is the frequency index and n is the time index. In order to derive the radix-2^4 algorithm, consider the first 4 steps of decomposition [6]. Applying a 5-dimensional linear index map, wherein the 5th dimension is itself decomposed into a 2-bit and a 1-bit index, we have

$$n = \left\langle \frac{N}{2} n_1 + \frac{N}{4} n_2 + \frac{N}{8} n_3 + \frac{N}{16} n_4 + \frac{N}{64} n_5 + n_6 \right\rangle_N$$

$$k = \left\langle k_1 + 2k_2 + 4k_3 + 8k_4 + 32k_5 + 16k_6 \right\rangle_N \qquad (2)$$

The common factor algorithm (CFA) takes the form of

$$X(k_1 + 2k_2 + 4k_3 + 8k_4 + 32k_5 + 16k_6) = \sum_{n_6} \sum_{n_5} \sum_{n_4} \sum_{n_3} \sum_{n_2} \sum_{n_1} x\left(\frac{N}{2} n_1 + \frac{N}{4} n_2 + \frac{N}{8} n_3 + \frac{N}{16} n_4 + \frac{N}{64} n_5 + n_6\right) W_N^{nk}$$

$$= \sum_{n_5} \sum_{n_6} \left[ G(n_5, n_6, k_1, k_2, k_3, k_4) \, W_N^{(\ldots)(\ldots)} \right] \cdots \qquad (3)$$

Fig 2. Signal flow graph of the proposed R2^4 SDF algorithm

IV. PROPOSED FFT ARCHITECTURE FOR THE MB-OFDM UWB SYSTEM

A block diagram of the proposed single data-path 128-point R2^4SDF FFT/IFFT processor is shown in Fig. 3. The proposed architecture consists of a memory block, butterfly units (BF1, BF2), programmable complex multipliers, CSD complex constant multipliers, register files, and some multiplexers. The FFT processor can be transformed into an IFFT block by performing the operation shown in Fig 4. The outputs of the butterfly units are the complex sum and complex difference of two input data x[n] and x[N/2+n], where N = 128.

Due to the spatial regularity of the radix-2^4 algorithm, the synchronization control of the processor is very simple. A (log2 N)-bit binary counter serves two purposes: synchronization controller and address counter for twiddle factor reading in each stage. For the first N/2 cycles, the 2-to-1 multiplexers in butterfly module I (as shown in Fig. 5.i) switch to position "0", and the butterfly is idle. The input data from the left is directed to the shift registers until they are filled. In the next N/2 cycles, the multiplexers turn to position "1", and the butterfly computes a 2-point DFT with the incoming data and the data stored in the shift registers.

$$Z_1(n) = x(n) + x(n + N/2), \quad 0 \le n < N/2$$

$$Z_1(n + N/2) = x(n) - x(n + N/2) \qquad (6)$$

The butterfly output Z_1(n) is sent on to have the twiddle factor applied, and Z_1(n + N/2) is sent back to the shift registers to be "multiplied" in the following N/2 cycles, when the first half of the next frame of the time sequence is loaded in. The operation of the second butterfly is similar to that of the first one, except the

"distance" of the butterfly input sequence is just N/4, and the trivial twiddle factor multiplication is implemented by real-imaginary swapping with a commutator and controlled add/subtract operations, as in Fig. 5.ii, which requires a two-bit control signal from the synchronizing counter. The data then goes through a full complex multiplier, working at 75% utility, which accomplishes the result of the first level of the radix-4 DFT word by word. Further processing repeats this pattern, with the distance of the input data decreasing by half at each consecutive butterfly stage. After N-1 clock cycles, the complete DFT result streams out to the right, in bit-reversed order. The next frame of the transform can be computed without pausing, owing to the pipelined processing of each stage.

Fig 3. Block diagram of FFT/IFFT processor

Fig 4. Block diagram of the proposed 128-point R2^4SDF FFT/IFFT processor

Fig 5.i Structure of BF1

Fig 5.ii Structure of BF2

Single-data-path architectures based on the radix-2^4 FFT algorithm have fewer multipliers than those based on lower radix FFT algorithms. For example, the radix-2^4 algorithm has the same number of multipliers as the radix-2^2 algorithm but reduces the multiplicative complexity by replacing half of the full complex multipliers with trivial constant multipliers [8]. In the CSD complex constant multiplier, the multiplication by the twiddle factors is processed according to their scheduling in the signal flow graph. The output data generated by the BF in the sixth stage are multiplied by a trivial twiddle factor, -j, W(16) or W(48), before they are fed to the last stage.

The Simplification of the Complex Multiplication

Complex multiplication is the key design issue in the FFT algorithm. Consider a complex multiplication whose two inputs are the data x_r + j x_i and the coefficient W = exp(j2π/N) = cos a + j sin a; the result can be expressed as Y = y_r + j y_i, where

$$y_r = x_r \cos a - x_i \sin a = x_i (\cos a - \sin a) - (x_i - x_r) \cos a$$

$$y_i = x_i \cos a + x_r \sin a = x_r (\cos a + \sin a) + (x_i - x_r) \cos a \qquad (7)$$

After the transform of Eq. 7, the complex multiplication needs only 3 real multiplications, 1 addition and 2 subtractions when the sum and the difference of the real and imaginary parts of the coefficient are precomputed and stored in the ROM. This algorithm is used for the programmable complex multiplier to reduce the hardware complexity and to increase the speed.

CSD Multiplier

Since the twiddle factors in the FFT processor are known in advance, we propose the use of a multiplier-less architecture to perform the multiplication by the twiddle factors using shift-and-add operations. The canonical signed digit (CSD) algorithm has been applied to this architecture to further reduce the number of shift-and-add operations required. In this architecture trivial multiplications are implemented without any multipliers by either passing the data, swapping the real and imaginary parts of the complex data, or changing the sign. The design presented in the paper takes advantage of the symmetries of the twiddle factors in the complex plane.

When the real and imaginary values of the twiddle factors are the same, two CSD constant multipliers and two adder/subtractors are used to generate the output. When the real and imaginary values are not the same, three CSD constant multipliers

are used. If the inputs do not need to be multiplied by a twiddle factor, the output results are generated from the input directly.

Pipelining

The radix-2^4 architecture was thoroughly analyzed to find possible areas to be pipelined, based on the design and the critical path delays between the various implemented blocks. The processor was extensively pipelined to achieve the high working frequency needed to meet the UWB specification. Shimming registers are also needed for the control signals to comply with the revised timing.

V. IMPLEMENTATION AND PERFORMANCE

The external data word length of the proposed FFT/IFFT is 6 bits [9] for both the real and imaginary parts. The 2's complement representation of numbers is used in the processor. Due to overflow in each adder of the butterfly unit, 13-bit internal FFT precision has been maintained. The chosen word length not only keeps the quantization noise low but also limits the hardware complexity. After the appropriate word length of the proposed FFT/IFFT processor was chosen, the architecture of the processor was modeled in Verilog for an ALTERA Stratix III FPGA. Some of the modules were generated from the ALTERA MegaWizard Plug-in Manager and others were written at the RTL level, including the top level wrapper file. It contains all the instantiated modules and the connectivity information in RTL (Verilog HDL). The TimeQuest timing analyzer and Chip Planner (Floorplan and Chip Editor) of QUARTUS II 8.0 were applied to analyze timing, hardware expenditure and so on. Vector waveforms associated with the RTL description were created and the stimulus was provided in an external file. Using the vector waveform file, simulations were carried out for the design to validate the behavioral description. The results were obtained incrementally, first for a sub block comprising one module of the FFT. Finally the results were obtained for the whole design comprising seven such sub blocks, the global clock and dual port RAMs. The output of the Verilog coded architecture agreed with the output data of MATLAB and of the FFT/IFFT in our UWB platform, which was designed on an EXCEL worksheet that clearly depicts the outputs with the signal flow graph.

The implementation of the proposed FFT/IFFT processor was carried out on a Stratix II EP2S60F1020C4 device and simulated for ALTERA Stratix III EP3SL50F484C2. The input data is given through a dual port RAM and a PLL unit is used to give the required clock frequency. The output is checked using a dual port RAM and the in-system memory content editor. Table 2 shows the performance and resource usage of the implemented processor. This shows that the processor is area efficient, so the entire MB-OFDM receiver/transmitter with the other modules can be accommodated in a single chip. It has a significantly reduced number of complex multiplications and complex additions. The critical path delay occurs between the input RAM and the first butterfly unit, so the processor is capable of running at UWB speeds if implemented within a larger system.

All the previous implementations were on ASIC [9], so comparison with them is not meaningful. Table 3 compares the performance of the different FFT processors implemented on FPGA. The validity and efficiency of the proposed architecture have been verified by extensive simulation and implementation. Fig 6 shows the implementation results of the proposed FFT/IFFT processor.

TABLE 2 IMPLEMENTATION RESULTS OF THE PROPOSED PROCESSOR

Family                              ALTERA Stratix III     ALTERA Stratix II
Device                              EP3SL50F484C2          EP2S60F1020C4
ALUTs                               7972/38000 (3%)        7822/48352 (16%)
ALMs                                3986/19000 (3%)        4375/19000 (3%)
DSP block elements                  6/216 (<3%)            6/288 (2%)
Total memory bits                   3328/1880064 (<1%)     8192/2544192 (<1%)
Word length                         I: 6 bits, Q: 6 bits   I: 6 bits, Q: 6 bits
Number of registers                 7580/38000 (20%)       7697/38000 (20%)
Programmable complex multipliers #  1                      1
Constant complex multipliers #      2                      2
Number of complex adders            28                     28
Clock rate                          528 MHz                350 MHz
Throughput rate                     528 MSamples/s         350 MSamples/s
Critical path delay                 1.87 ns                2.87 ns

TABLE 3 COMPARISON OF THE PERFORMANCE OF DIFFERENT PROCESSORS

Family                                                   Frequency max
Altera FFT MegaCore function on Stratix III [10]         456 MHz
Proposed processor on ALTERA Stratix II EP2S60F1020C4    350 MHz
Proposed processor on ALTERA Stratix III EP3SL50F484C2   528 MHz

VI. CONCLUSION

An OFDM module implemented as a 128-point FFT/IFFT processor for an FPGA-based MB-OFDM UWB system, using the proposed modified radix-2^4 SDF pipelined algorithm, has been successfully implemented on ALTERA STRATIX III and STRATIX II FPGAs without using parallel processing architectures. The high speed is achieved by extensive pipelining and by using Altera's LPMs. The hardware cost of memory and complex multipliers is reduced by adopting delay feedback and data scheduling approaches. In addition, the number of complex multiplications is reduced effectively by using a higher radix algorithm and CSD complex multipliers. To improve accuracy, the internal wordlength in the proposed scheme is maintained at 13 bits, which is 7 bits more than the input, to account for the overflows at each of the 7 stages of the OFDM module. The implementation results show that the throughput rate is 350 MSamples/s at 350 MHz on the ALTERA STRATIX II device and 528 MSamples/s at 528 MHz on the ALTERA STRATIX III device. The high throughput rate of the OFDM module, with the internal wordlength increased from 6 bits to 13 bits to improve accuracy, meets the MB-OFDM UWB system's specifications.
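
The 13-bit internal wordlength can be motivated with a short worst-case calculation. The sketch below is a simplified bound and not the authors' analysis: it assumes each of the log2(128) = 7 butterfly stages can at most double the magnitude of a real or imaginary component, and it ignores any growth or rounding introduced by the twiddle factor multipliers. The helper name `signed_range` is hypothetical.

```python
import math

def signed_range(bits):
    """Smallest and largest value of a two's complement word of this width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

N = 128
INPUT_BITS = 6                      # external I/Q word length from the text
stages = int(math.log2(N))          # one butterfly stage per radix-2 step
internal_bits = INPUT_BITS + stages # one guard bit per stage

# Worst case: every add/subtract doubles the extreme values.
lo, hi = signed_range(INPUT_BITS)
for _ in range(stages):
    lo, hi = 2 * lo, 2 * hi

int_lo, int_hi = signed_range(internal_bits)
fits = int_lo <= lo and hi <= int_hi

print("stages:", stages, "internal bits:", internal_bits)
print("worst-case range:", (lo, hi), "internal range:", (int_lo, int_hi))
print("no overflow under this bound:", fits)
```

Under this bound one extra bit per stage is exactly enough, which agrees with the stated 7 additional bits over the 6-bit input.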

Fig 6. Results of the implemented processor
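
As a numerical check of the three-multiplication complex product described in Section IV, the sketch below compares it with an ordinary complex multiplication. The function name `three_mult` and the test values are illustrative assumptions. The identity used is y_r = x_i(cos a - sin a) - (x_i - x_r) cos a and y_i = x_r(cos a + sin a) + (x_i - x_r) cos a, with the sum and difference of cos a and sin a treated as precomputed constants.

```python
import cmath
import math

def three_mult(xr, xi, a):
    """Multiply (xr + j*xi) by (cos a + j*sin a) with 3 real multiplications."""
    c, s = math.cos(a), math.sin(a)
    c_plus_s = c + s     # precomputed per twiddle factor (ROM content)
    c_minus_s = c - s    # precomputed per twiddle factor (ROM content)
    t = (xi - xr) * c            # subtraction 1, multiplication 1
    yr = xi * c_minus_s - t      # multiplication 2, subtraction 2
    yi = xr * c_plus_s + t       # multiplication 3, addition 1
    return yr, yi

# Compare against the direct complex product for a few angles and inputs.
max_err = 0.0
for m in range(16):
    a = 2 * math.pi * m / 128
    for xr, xi in [(1.0, 0.0), (0.0, 1.0), (-13.0, 27.0), (31.0, -32.0)]:
        ref = complex(xr, xi) * cmath.exp(1j * a)
        yr, yi = three_mult(xr, xi, a)
        max_err = max(max_err, abs(complex(yr, yi) - ref))

print("largest deviation from the direct product:", max_err)
```

The run-time operation count in the function body is three multiplications, one addition and two subtractions, matching the count stated in the text.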

VII. REFERENCES
[1] Time Domain, "UWB Applications, Demonstration & Regulatory Update," Sept 2001 workshop, March 20, 2001.
[2] A. Batra et al., "Multi-band OFDM Physical Layer Proposal for IEEE 802.15 Task Group 3a," IEEE P802.15-03/268r3, March 2004.
[3] A. Batra, J. Balakrishnan, G. R. Aiello, J. R. Foerster, A. Dabak, "Design of Multiband OFDM System for Realistic UWB Channel Environment," IEEE Trans. on Microwave Theory and Techniques, vol. 52, no. 9, pp. 2123-2138, Sept. 2004.
[4] Y-W. Lin, H-Y. Liu, and C-Y. Lee, "A 1-GS/s FFT/IFFT processor for UWB applications," IEEE Journal of Solid-State Circuits, vol. 40, no. 8, pp. 1726-1735, August 2005.
[5] S. He and M. Torkelson, "Designing pipeline FFT processor for OFDM (de)modulation," in Proc. URSI Int. Symp. Signals, Systems, and Electronics, vol. 29, Oct. 1998, pp. 257-262.
[6] J. Lee, H. Lee, S-I. Cho, S-S. Choi, "A High-Speed, Low-Complexity Radix-2^4 FFT Processor for MB-OFDM UWB Systems," IEEE Inter. Symp. on Circuits and Systems, pp. 4719-4722.
[7] S-M. Kim, J-G. Chung, and K. K. Parhi, "Low Error Fixed-width CSD Multiplier with Efficient Sign Extension," IEEE Transactions on Circuits and Systems-II, vol. 50, no. 12, Dec. 2003.
[8] H. Lee, M. Shin, "A High-Speed Low-Complexity Two-Parallel Radix-2^4 FFT/IFFT Processor for UWB Applications," IEEE Asian Solid-State Circuits Conference, November 2007.
[9] R. S. Sherratt, S. Makino, "Numerical Precision Requirements on the Multiband Ultra-Wideband System for Practical Consumer Electronic Devices," IEEE Transactions on Consumer Electronics, vol. 51, no. 2, May 2005.
[10] FFT MegaCore Function User Guide, MegaCore Version 7.2, www.altera.com
