
Design of an Efficient FFT Processor for OFDM Systems

Haining Jiang, Hanwen Luo, Jifeng Tian and Wentao Song
Abstract: Orthogonal frequency division multiplexing (OFDM) is well known for its robustness against frequency-selective fading channels, and the FFT processor is the critical block in all OFDM systems. In this article, an efficient FFT processor architecture suitable for OFDM systems is proposed. To meet the requirements of high-speed data transmission and low area consumption in OFDM systems, two novel butterfly algorithms, the parallel butterfly algorithm and the dual butterfly algorithm, are developed in the design of the butterfly unit, which is the kernel of the FFT processor. The FFT processor with these butterfly algorithms has high throughput and requires a relatively small area. Performance evaluation demonstrates that the proposed FFT architecture can meet the requirements of the wireless LAN (IEEE 802.11a) standard.
Index Terms: OFDM, FFT, FPGA, butterfly.
I. INTRODUCTION
Orthogonal frequency division multiplexing (OFDM) [1], a spectrum-efficient multi-carrier modulation technique, transforms a highly selective wide-band channel into a large number of non-selective narrow-band slices which are frequency multiplexed. In recent years, OFDM techniques have received great attention in high-speed data communication systems and have been selected for wireless local area networks (WLAN) such as IEEE 802.11a and Hiperlan-2, digital audio broadcasting (DAB), digital video broadcasting (DVB) [2], very high-speed digital subscriber line (VDSL), and Beyond 3G research.
In OFDM systems, the Fast Fourier Transform (FFT) is used to realize multi-carrier modulation, which greatly reduces the complexity of OFDM systems. As the data transmission rate of OFDM systems increases, generating OFDM symbols at a high data rate requires a high-speed FFT processor. Moreover, the portable nature of OFDM systems calls for an FFT processor with low area and low power consumption.
In this article, an efficient FFT processor for OFDM systems is proposed. After conventional FFT architectures are introduced in section II, an efficient FFT architecture is described in section III. Based on the proposed FFT architecture, two novel butterfly algorithms, called the parallel butterfly algorithm and the dual butterfly algorithm, are developed in section IV and section V, respectively. Section VI presents the application of the proposed FFT architecture to OFDM systems based on the IEEE 802.11a standard. Section VII concludes the paper.
II. CONVENTIONAL FFT ARCHITECTURES

Fig. 1 shows conventional FFT architectures. The processing (Proc) element in Fig. 1 performs the butterfly operation. Fig. 1(a) shows the single-memory architecture. It has one processing element and one memory element. Butterfly outputs are stored in the same memory locations used by the butterfly inputs [3]. Fig. 1(b) shows the dual-memory architecture. It has two memories: one stores the butterfly inputs and the other stores the butterfly outputs. These two architectures require a small area. However, they have low throughput and require a high clock frequency. For high-throughput applications, two other architectures have been developed in the literature. Fig. 1(c) shows the pipeline architecture [4], which is characterized by non-stop processing at the clock frequency of the input data sampling. Fig. 1(d) shows the parallel architecture, which adds processing elements in parallel to increase the throughput of the FFT. With the pipeline or parallel architecture [5], a high-speed FFT processor can be implemented. However, it requires more hardware resources (especially more complex multipliers), which is not suitable for portable applications of OFDM systems. So, in the design of FFT processors for OFDM systems, we should not only enhance the speed by introducing more parallelization and pipelining, but also reduce the hardware resource consumption as much as possible. Based on this rule, an efficient FFT architecture is proposed in section III.
Fig. 1. Conventional FFT architectures: (a) single-memory architecture; (b) dual-memory architecture; (c) parallel architecture; (d) pipeline architecture. [Figure: block diagrams composed of Proc, Memory and Buff elements.]
This work was supported in part by the National Natural Science Foundation of China under Grant No. 60272079 and the National Hi-Tech Research & Development Program of China under Grant No. 2003AA123310.
Haining Jiang is with Shanghai Jiao Tong University, Shanghai, 200030, China (e-mail: jhn2046@hotmail.com).
Hanwen Luo is with Shanghai Jiao Tong University, Shanghai, 200030, China (e-mail: luo_hanwen@hotmail.com).
Jifeng Tian is with Shanghai Jiao Tong University, Shanghai, 200030, China (e-mail: jeffhrb@hotmail.com).
Wentao Song is with Shanghai Jiao Tong University, Shanghai, 200030, China (e-mail: radio_sjtu@hotmail.com).
Contributed Paper
Manuscript received July 17, 2005
0098-3063/05/$20.00 © 2005 IEEE
IEEE Transactions on Consumer Electronics, Vol. 51, No. 4, NOVEMBER 2005
Fig. 2. Block diagram of an efficient FFT processor. [Figure blocks: Datain_I, Datain_Q, ROM, RAM blocks, Data Address Generation, Data Switch, Block Float-point Unit, Butterfly, Twiddle Address Generation, Timing Control, Dataout_I, Dataout_Q.]
III. EFFICIENT FFT PROCESSOR

Fig. 2 shows the block diagram of an efficient FFT processor for OFDM systems, which follows the rule described in section II. For simplicity of design, we divide the FFT processor into several functional units: butterfly unit, data address generation, twiddle address generation, timing control, data switch, and block float-point unit. Some memory units are also required: RAM1, RAM2, RAM3, RAM4 and ROM. The function of each unit is described below:
(1) Butterfly unit: the kernel of the FFT processor; conducts the 4-point DFT and the multiplications with twiddles;
(2) Data address generation: generates the addresses for reading and writing data;
(3) Twiddle address generation: generates the addresses of the twiddle factor coefficients (for simplicity of description, we call them twiddles in the rest of the paper) for the radix-4 butterfly operation;
(4) Data switch: conducts the switch between RAMs;
(5) Block float-point unit: conducts the block float-point operation, i.e., collects the bit-length information of the data output by the butterfly and truncates it to the required length;
(6) Timing control: controls the timing of all the other units;
(7) RAM1 and RAM2: store the input data and internal data;
(8) RAM3 and RAM4: store the output data and internal data.
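The block float-point operation in item (5) can be pictured with a small software model. This is only an illustrative sketch: the 16-bit word length, the shared right shift as the truncation rule, and the helper name `block_float_scale` are assumptions made here, since the paper does not specify them.

```python
def block_float_scale(block, word_bits=16):
    """Scale a block of integer (re, im) samples to fit word_bits signed bits.

    Returns (scaled_block, shift). Every sample is right-shifted by the same
    amount, so `shift` acts as the shared block exponent.
    word_bits=16 is an assumed word length, not a value from the paper.
    """
    limit = 1 << (word_bits - 1)  # magnitudes must stay below this bound
    # Collect the bit-length information: largest magnitude in the block.
    peak = max(max(abs(re), abs(im)) for re, im in block)
    shift = 0
    while (peak >> shift) >= limit:  # smallest shift that makes the peak fit
        shift += 1
    # Truncate every sample by the shared shift (arithmetic right shift).
    scaled = [(re >> shift, im >> shift) for re, im in block]
    return scaled, shift


# Example: butterfly outputs that have grown beyond 16 bits.
outputs = [(40000, -3), (100, 70000)]
print(block_float_scale(outputs))
```

Using one exponent for the whole block keeps the data path fixed-point while still following the growth of the signal from stage to stage, which is the usual motivation for block floating point.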
The butterfly unit accounts for 80% of the FFT processor area, and its speed determines the speed of the FFT processor. So it is very important to increase the processing speed of the butterfly unit and reduce its area simultaneously. In the proposed architecture, a radix-4 butterfly unit for the decimation-in-frequency (DIF) FFT is considered. It is the core of the FFT processor and performs the radix-4 butterfly operation. To meet the requirements of high-speed data transmission and low area consumption in OFDM systems, two novel butterfly algorithms, the parallel butterfly algorithm and the dual butterfly algorithm, are proposed in this article; they are described in the next two sections.
IV. PARALLEL BUTTERFLY ALGORITHM
In this section, a parallel butterfly algorithm is proposed, which greatly increases the processing speed of the FFT processor with only a small increase in area.
A. Parallel Architecture
An FFT processor performs the Discrete Fourier Transform (DFT). The N-point DFT is defined in (1).
$$ X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \qquad W_N = e^{-j2\pi/N} \tag{1} $$
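As a reference point, definition (1) can be evaluated directly in a few lines. The sketch below is a plain software model with an illustrative function name (`dft`), not part of the proposed hardware; it costs N^2 complex multiplications, which is what the FFT avoids.

```python
import cmath


def dft(x):
    """Direct N-point DFT from (1): X(k) = sum over n of x(n) * W_N^(n*k)."""
    n_points = len(x)
    w = cmath.exp(-2j * cmath.pi / n_points)  # W_N = exp(-j*2*pi/N)
    return [sum(x[n] * w ** (n * k) for n in range(n_points))
            for k in range(n_points)]


# A unit impulse has a flat spectrum; a constant has only a DC term.
print(dft([1, 0, 0, 0]))
print(dft([1, 1, 1, 1]))
```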
Using the radix-4 DIF algorithm, the radix-4 butterfly operation can be expressed as in (2),
$$ \begin{aligned} A' &= (A + C) + (B + D) \\ B' &= [(A - C) - j(B - D)]\, W_N^{p} \\ C' &= [(A + C) - (B + D)]\, W_N^{2p} \\ D' &= [(A - C) + j(B - D)]\, W_N^{3p} \end{aligned} \tag{2} $$
where $A$, $B$, $C$, $D$ are the butterfly inputs and $A'$, $B'$, $C'$, $D'$ are the outputs.
From (2), a butterfly operation can be decomposed into two
steps. The first step is 4-point DFT, which is represented in (3),
and the second one is to multiply with twiddles, as shown in
(4).
$$ \begin{aligned} A_t &= (A + C) + (B + D) \\ B_t &= (A - C) - j(B - D) \\ C_t &= (A + C) - (B + D) \\ D_t &= (A - C) + j(B - D) \end{aligned} \tag{3} $$
and
$$ A' = A_t, \quad B' = B_t W_N^{p}, \quad C' = C_t W_N^{2p}, \quad D' = D_t W_N^{3p} \tag{4} $$
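The two-step decomposition in (3) and (4) can be checked with a short software model. The sketch below is an illustration under stated assumptions: the function names (`dft4`, `butterfly`, `fft_radix4`) and the recursive structure are chosen here for clarity and do not describe the memory-based hardware data path of the paper. Outputs are returned in natural order.

```python
import cmath


def dft4(a, b, c, d):
    """Step 1, eq. (3): 4-point DFT; multiplying by j is only a swap/negate."""
    at = (a + c) + (b + d)
    bt = (a - c) - 1j * (b - d)
    ct = (a + c) - (b + d)
    dt = (a - c) + 1j * (b - d)
    return at, bt, ct, dt


def butterfly(a, b, c, d, p, n):
    """Radix-4 DIF butterfly, eq. (2): 4-point DFT, then twiddles, eq. (4)."""
    at, bt, ct, dt = dft4(a, b, c, d)
    w = cmath.exp(-2j * cmath.pi / n)  # W_N
    return at, bt * w ** p, ct * w ** (2 * p), dt * w ** (3 * p)


def fft_radix4(x):
    """Recursive radix-4 DIF FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return list(x)
    q = n // 4
    branches = ([], [], [], [])
    for i in range(q):
        outs = butterfly(x[i], x[i + q], x[i + 2 * q], x[i + 3 * q], i, n)
        for branch, value in zip(branches, outs):
            branch.append(value)
    result = [0j] * n
    for r, branch in enumerate(branches):
        # Branch r feeds the N/4-point transform that yields X(4k + r).
        result[r::4] = fft_radix4(branch)
    return result


# With p = 0 all twiddles are 1 and the butterfly reduces to the 4-point DFT.
print(butterfly(1, 2, 3, 4, 0, 64))
```

Note that the first output never passes through a multiplier, which is the property the parallel butterfly algorithm exploits.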
It is easy to see that there are no multiplications in the computation of $A'$. So we can run the computation of $A'$ in parallel with that of the other 3 points $B'$, $C'$, $D'$ by adding only one 4-point DFT unit. The parallel architecture efficiently enhances the processing speed of the butterfly. Furthermore, the 4-point DFT only contains addition operations, as shown in (3). So the increase in hardware resource consumption of the butterfly unit is small enough to be ignored. The parallel butterfly architecture is shown in Fig. 3.
Fig. 3. Parallel butterfly architecture. [Figure blocks: a 4-Point DFT unit giving the result of $A'$; a 4-Point DFT unit followed by Twiddles Multiplication giving the results of $B'$, $C'$, $D'$.]
B. Pipeline Architecture
To utilize the hardware resources more effectively, a pipeline architecture is introduced in the parallel butterfly algorithm. The butterfly data path has 4 pipeline stages, as shown in Fig. 4.
Fig. 4. Pipeline diagram for parallel butterfly algorithm.
Stage 1: Read data from memory (MEM). 4 complex data are read in 3 clock cycles.
Stage 2: 4-point DFT. A state machine with 3 states (S1, S2 and S3) controls the generation of $A_t$, $B_t$, $C_t$ and $D_t$: $A_t$ and $B_t$ are obtained in state S1, $C_t$ in state S2 and $D_t$ in state S3. The 4-point DFT contains 3 addition units, denoted ADD1, ADD2 and ADD3.
Stage 3: Multiply with twiddles. A complex multiplier is divided into 3 real multipliers and 5 real adders. The 3 real multipliers (MULT) operate in parallel, with 3 parallel adders (ADD) in front and 2 parallel adders (ADD) behind. Completing this stage therefore needs 3 clock cycles.
Stage 4: Write data into memory (MEM). As in stage 1, 4 complex data are written into memory in 3 clock cycles.
The total length of the pipeline is 12 clock cycles. With the pipeline structure, the speed of the butterfly unit is enhanced.
V. DUAL BUTTERFLY ALGORITHM
Based on the parallel butterfly algorithm and the following analysis of twiddle generation, the dual butterfly algorithm is put forward in this section. The dual butterfly algorithm is more efficient than the parallel butterfly algorithm.
A. Generation of Twiddles
From (2), we can see that 4 twiddles are needed in every butterfly operation. They are $W_N^{0}$, $W_N^{p}$, $W_N^{2p}$ and $W_N^{3p}$, where $W_N = \exp(-j2\pi/N)$. The value of the exponent $p$ is related to the FFT stage $m$ of the radix-4 butterfly, as shown in (5).
$$ p = 4^{m-1}\, l, \quad m = 1, 2, \ldots, \log_4 N; \qquad l = \underbrace{l_0, l_0, \ldots, l_0}_{\text{repeat } 4^{m-1} \text{ times}}, \quad l_0 = 0, 1, \ldots, \frac{N}{4^{m}} - 1 \tag{5} $$
From the equation above, we can see that the case p=0 occurs regularly, and the larger the value of $m$, the more often p=0 occurs. When $m = \log_4 N$, all the values of $p$ are 0, and the corresponding four twiddles are equal to 1. That is, there are no multiplications in the butterfly operation when p=0. So we can run this kind of butterfly operation, which needs no multiplications, in parallel with the other butterfly operations, which need multiplications. This kind of parallel architecture only introduces some simple addition operations, which have little effect on hardware resource consumption. Considering the generation rule of twiddles described above, the dual butterfly architecture is proposed.
Fig. 5. Block diagram for dual butterfly algorithm.
B. Description of the Algorithm
For simplicity of description, we define the four butterfly inputs associated with a certain p as one butterfly data group. A butterfly data group associated with p=0 is defined as a zero data group; otherwise, it is defined as a nonzero data group. With these definitions, we can divide the inputs of all the butterflies into two kinds of data groups. The butterfly data group allocation of a 64-point FFT processor is shown as an example in Table I.
TABLE I
BUTTERFLY DATA GROUP ALLOCATION OF 64-POINT FFT
                               Stage 1   Stage 2   Stage 3
Num. of nonzero data groups    15        12        0
Num. of zero data groups       1         4         16
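The entries of Table I follow from the twiddle exponent rule in (5). A minimal sketch, assuming the reading that the sequence of l0 values (0 to N/4^m - 1) is repeated 4^(m-1) times in stage m; the function names are illustrative, and the counts do not depend on the order of repetition.

```python
def twiddle_exponents(n, m):
    """Exponents p of all butterflies in stage m of an n-point radix-4 FFT."""
    base = 4 ** (m - 1)
    # l0 runs over 0 .. n/4^m - 1, and the sequence is repeated 4^(m-1) times.
    l_values = list(range(n // 4 ** m)) * base
    return [base * l for l in l_values]


def group_counts(n, m):
    """Return (nonzero, zero) data-group counts for stage m."""
    p_values = twiddle_exponents(n, m)
    zero = sum(1 for p in p_values if p == 0)
    return len(p_values) - zero, zero


for stage in (1, 2, 3):
    print(stage, group_counts(64, stage))
# prints:
# 1 (15, 1)
# 2 (12, 4)
# 3 (0, 16)
```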
A different butterfly operation is adopted for each kind of data group. The parallel butterfly algorithm (called the parallel butterfly process in this section) is used for the butterfly operation of nonzero data groups, which needs multiplications. The butterfly operation of zero data groups is called the simple butterfly process, in which only the 4-point DFT is performed and no multiplication exists. The new architecture, in which the parallel butterfly process and the simple butterfly process run in parallel, is called the dual butterfly architecture, as shown in Fig. 5.
In Fig. 5, every butterfly of the parallel butterfly process reads 4 data in 3 clock cycles. That is, one port of the dual-port RAM1 reads 3 data in the 3 clock cycles, while the other port reads only 1 datum in the first clock cycle and stays free in the other 2 clock cycles. So the two free clock cycles can be used to transmit zero data groups. Because the number of zero data groups is less than that of nonzero data groups in every FFT stage except the last one, all the zero data groups can be transmitted during the transmission of nonzero data. In the last FFT stage, all the input data belong to zero data groups and a simple method can be adopted: all the data are transmitted into the simple butterfly process unit through both ports of RAM1 simultaneously. In this case, the total length of the pipeline is 7 clock cycles.
Furthermore, when the parallel butterfly process and the simple butterfly process run together, a certain pipeline delay should be introduced in the simple butterfly process to keep it synchronized with the parallel butterfly process. This pipeline includes reading 2 data from memory in 6 clock cycles, the simple butterfly process in 3 clock cycles and writing data into memory in 6 clock cycles.
VI. APPLICATION IN WIRELESS LAN
In this section, the design of a 64-point FFT processor for wireless LAN (IEEE 802.11a) is taken as an example. In the IEEE 802.11a standard [6], the FFT time parameter is specified as $t_{FFT} = 3.2$ μs. If the parallel butterfly algorithm is adopted, the whole 64-point FFT computation needs $n_{FFT} = n_{one\_stage} \times 3 = (64/4 \times 3 + 12) \times 3 = 60 \times 3 = 180$ clock cycles. If the system clock is 60 MHz, the time to complete the 64-point FFT computation is 180/60 MHz = 3 μs < 3.2 μs, which meets the requirement of the IEEE 802.11a standard. If the dual butterfly algorithm is used, the number of clock cycles needed in each FFT stage follows from Table I: 57, 48 and 39, respectively. So the whole 64-point FFT computation needs 144 clock cycles. With a 60 MHz system clock, the 64-point FFT operation takes 144/60 MHz = 2.4 μs, which is much less than the FFT time parameter required in IEEE 802.11a. From the evaluation above, it is apparent that both the parallel butterfly algorithm and the dual butterfly algorithm satisfy the FFT processor requirement of the IEEE 802.11a standard. Moreover, the parallelization introduced in the two novel butterfly algorithms only increases the number of addition units. So the high-speed FFT processor based on the two butterfly algorithms consumes few additional hardware resources.
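The cycle counts above can be reproduced with a few lines of arithmetic. The parallel-butterfly formula is the one given in the text. For the dual butterfly case the paper states only the per-stage results (57, 48 and 39), so the cost model below is an assumption that happens to match them: 3 cycles per nonzero group plus the 12-cycle pipeline, and, in the last stage, 2 cycles per zero group plus the 7-cycle pipeline. Function names are illustrative.

```python
def parallel_cycles(n=64, read_cycles=3, pipeline_len=12):
    """Clock cycles for an n-point FFT with the parallel butterfly algorithm."""
    stages = 0
    size = n
    while size > 1:  # number of radix-4 stages, log4(n)
        size //= 4
        stages += 1
    per_stage = (n // 4) * read_cycles + pipeline_len
    return per_stage * stages


def dual_cycles(nonzero_groups=(15, 12, 0), zero_groups=(1, 4, 16)):
    """Clock cycles with the dual butterfly algorithm (assumed cost model)."""
    total = 0
    for nonzero, zero in zip(nonzero_groups, zero_groups):
        if nonzero:
            # Zero groups are hidden in the free port cycles of nonzero groups.
            total += nonzero * 3 + 12
        else:
            # Last stage: both RAM ports feed the simple butterfly process.
            total += zero * 2 + 7
    return total


CLOCK_MHZ = 60   # system clock from the text
T_FFT_US = 3.2   # FFT time parameter of IEEE 802.11a from the text
for name, cycles in (("parallel", parallel_cycles()), ("dual", dual_cycles())):
    time_us = cycles / CLOCK_MHZ
    print(name, cycles, time_us, time_us < T_FFT_US)
```

Under these assumptions the two algorithms take 180 and 144 cycles, i.e. 3 μs and 2.4 μs at 60 MHz, both within the 3.2 μs budget.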
VII. CONCLUSION
In this article, two butterfly algorithms, the parallel and dual butterfly algorithms, are proposed. The main idea of these two algorithms is to make the operations without multiplications (which mainly contain additions) and those with multiplications run in parallel. Because the area that addition units occupy is very small, the FFT processor based on the two butterfly algorithms requires a very small area and has high processing speed. Performance evaluation and practical implementation proved that the FFT processor with these two novel algorithms is suitable for wireless LAN applications. Moreover, it can also be used in other OFDM applications such as digital video broadcasting (DVB) and wireless MAN (802.16).
REFERENCES
[1] J. A. C. Bingham, "Multicarrier modulation for data transmission: an idea whose time has come," IEEE Commun. Mag., vol. 28, no. 5, pp. 5-14, May 1990.
[2] R. Makowitz, A. Buttar, et al., "DVB-T decoder ICs," IEEE Trans. Consumer Electron., vol. 43, no. 3, pp. 438-442, Aug. 1997.
[3] B. S. Son, B. G. Jo, M. H. Sunwoo, and Y. S. Kim, "A high-speed FFT processor for OFDM systems," ISCAS2002, vol. 3, pp. 26-29, May 2002.
[4] S. He and M. Torkelson, "Designing pipeline FFT processor for OFDM (de)modulation," ISSSE98, pp. 257-262, Sept. 1998.
[5] E. H. Wold and A. M. Despain, "Pipeline and parallel pipeline FFT processor for VLSI implementation," IEEE Trans. Computers, vol. 33, no. 5, pp. 414-426, 1984.
[6] "Supplement to IEEE standard for information technology - telecommunications and information exchange between systems - local and metropolitan area networks - specific requirements. Part 11: wireless LAN medium access control and physical layer," IEEE 802.11a, 1999.
BIOGRAPHIES
Haining Jiang received her B.S. and M.S. degrees in
electronic engineering from Harbin Engineering University in 1999 and 2002, respectively. She is currently
working toward the Ph.D. degree in electronic engineering at Shanghai Jiao Tong University, Shanghai, China.
Her research interests include B3G mobile communication systems and OFDM technique.
Hanwen Luo was born in 1950. He received the B.S. degree from Shanghai Jiao Tong University in 1977 and the M.S. degree from Xidian University in 1992. He is currently a professor in the Department of Electronic Engineering, Shanghai Jiao Tong University, China. His main research interests are 3G and 4G mobile communication systems and their key techniques for wireless transmission.
Jifeng Tian received his B.S. and M.S. degrees in electronic engineering from Harbin Engineering University
in 1999 and 2001, respectively. He is currently working
toward the Ph.D. degree in electronic engineering at
Shanghai Jiao Tong University, Shanghai, China. His
research interests include FPGA design for wireless
communications, B3G mobile communication systems
and OFDM technique.
Wentao Song was born in 1936. He received the B.S. degree from Shanghai Jiao Tong University in 1957. He is currently the honorary chairman of the Institute of Wireless Communication at Shanghai Jiao Tong University, the honorary director of the Shanghai Institute of Electronics and a fellow of the China Institute of Communication. His research areas include mobile communication and satellite communication.
