
HW-EFFICIENT REDUCED-LATENCY ARCHITECTURE FOR CONFIGURABLE MIXED-RADIX FFT PROCESSORS

SANDEEP DAGER, VIKAS TYAGI AND VISHAL SINHA, MENTOR GRAPHICS

WHITE PAPER

FUNCTIONAL VERIFICATION

www.mentor.com

I. INTRODUCTION
The fast Fourier transform (FFT) is a widely used algorithm in communication and digital signal processing, for example in
3GPP-LTE, IEEE 802.11a/g/n, and terrestrial digital video broadcasting (DVB-T). A variety of hardware architectures have been
proposed in the past few decades, with memory-based and pipelined FFTs being two of the more widely used. Memory-based FFT
architectures are preferred for larger FFT sizes.
Memory-based FFT architectures are area efficient because they use a single or a few processing units to perform the FFT
computations. Saving the intermediate results in static random access memory (SRAM) banks provides large area savings compared
to a register-based implementation, but maximizing performance becomes more difficult due to potential memory bottlenecks. The
key to maximizing performance is to reorganize the memory accesses to avoid address conflicts.
This address reordering approach has been used in prior architectures. In [1], a conflict-free addressing scheme is proposed,
but its implementation requires 2N memories and takes at least N computation cycles. In [2], an in-place strategy is proposed
for mixed radix-2/4, but it does not support higher radices that would reduce the number of computation cycles. An area-efficient
approach is used in [4], but it requires many more computation cycles. Lastly, a shared-memory approach is discussed in [6], but
it is limited to radix-2.
With the advancement of new technologies and the evolution of new standards such as DOCSIS 3.1 (10 Gbps), LTE, 5G, and optical
coherent processors (400 Gbps), massively parallel processing is needed to achieve these high bit rates. For example, to achieve
a 400 Gbps rate in an optical coherent processor, the FFT may need to support a radix as high as 8 or 16. This drives the need
for a generic FFT processor architecture capable of supporting any radix while achieving minimal computation cycles by using
hardware resources efficiently.
The proposed architecture supports configurable power-of-two mixed radix using a switched butterfly (SB) architecture. A
new memory bank filling algorithm is developed to realize configurable mixed-radix support. The conflict-free addressing
scheme is simplified to use a minimal number of registers and gates. The architecture also simplifies the logic for choosing
bit-reversed or natural output order.

II. FIXED RADIX FAST FOURIER TRANSFORM


The discrete Fourier transform (DFT) is defined by (1). According to Cooley-Tukey, an efficient way to compute the DFT of
any composite size N = N1N2 is to use smaller DFTs of sizes N1 and N2 along with O(N) multiplications by twiddle factors, which
are defined in (2). The algorithm can divide the transform into r pieces of size N/r at each step, which limits the size of the
DFT to a positive integer power of r. These are called radix-r FFTs. The best-known uses are for powers of 2, namely the radix-2
and mixed-radix cases.

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad \text{where } k = 0, 1, \ldots, N-1 \qquad (1)$$

$$W_N^{nk} = e^{-j \frac{2\pi}{N} nk} \qquad (2)$$

$W_N^{nk}$ is the twiddle factor sampled from the unit-amplitude complex exponential.
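As a point of reference for the definitions in (1) and (2), the direct evaluation of the DFT can be sketched in a few lines of Python. This only illustrates the definition with O(N^2) work and is not the proposed architecture; the function names `twiddle` and `dft` are chosen here for the example.

```python
import cmath

def twiddle(N, e):
    """W_N^e = exp(-j * 2*pi * e / N), as in (2)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def dft(x):
    """Direct evaluation of (1): X(k) = sum over n of x(n) * W_N^(n*k)."""
    N = len(x)
    return [sum(x[n] * twiddle(N, n * k) for n in range(N)) for k in range(N)]

# A unit impulse at n = 0 has a flat spectrum; a constant input has energy only at k = 0.
impulse_spectrum = dft([1, 0, 0, 0])
constant_spectrum = dft([1, 1, 1, 1])
```

An FFT produces the same X(k) values while reusing partial sums, which is where the radix-r stages of the following sections come from.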
III. ARCHITECTURE
The proposed architecture computes an N-point FFT using radix r, where $N = 2^q$ and $r = 2^s$. The N-point FFT can be
broken into $2^{(s-t)} (2^s)^p$ stages, where $2^{(s-t)}$ is the mixed-radix stage and the remaining $p$ stages are radix-r. For a
continuous-flow mixed-radix (CFMR) FFT architecture, the following condition must be satisfied:

$$(\log_r N)\, T_{FFT} \le N\, T_{sample} \qquad (3)$$

where $T_{FFT}$ is the clock period of the FFT processor and $T_{sample}$ is the sample period (the inverse of the sampling
frequency). The proposed architecture takes fewer computation cycles and satisfies the condition given in (3). Two memories, each
of size N, are accessed in ping-pong fashion, as suggested in [2], to convert the architecture into a CFMR FFT processor.


Figure. 1. FFT Processor with In-place architecture

The memory of size N is split into r dual-port memory banks (MBs), each of size N/r. The banks are denoted B0, B1, B2, …, Bm, …,
B(r-1), where B0 denotes bank #1 and B(r-1) denotes bank #r. The input data stream fills a bank, say Bm, from location Bm[0] to
Bm[N/r-1] and then switches to the next bank. The proposed addressing scheme for filling the MBs supports mixed radix for maximum
hardware utilization using the switched butterfly architecture $2/2^2/2^3$, as discussed below.
A. Switched Butterfly
The switched butterfly (SB) is a radix-$2^s$ butterfly that can be switched to $2^t$ parallel radix-$2^{(s-t)}$ butterflies for
efficient use of hardware. One example, for radix-$2/2^2$, is given in [2]. The architecture uses a de-multiplexer (DMUX) to
bypass the input from the primary stages of the flow graph. The SB architectures for radix $2^2/2^3$ and $2/2^3$ are shown in
Figure 2. S is the control input that switches the butterfly to the mixed radix in the first stage. For example, to calculate a
1024-point FFT using radix-$2/2^3$, S = 0 during computation of stage 1 and S = 1 for the remaining stages.

Figure. 2. Switched Butterfly for radix $2^2/2^3$ and Switched Butterfly for radix $2/2^3$

B. Memory Bank Filling Algorithm


The input data stream to the FFT processor is stored in the MBs so that the same arrangement works for all stages, including the
first mixed-radix stage (with the SB), and supports conflict-free addressing as suggested in [3] with the same hardware resources.
In [2], CFMR FFT MB swapping for radix $2/2^2$ is suggested at the output stage. The proposed scheme swaps MBs at the input stage
and provides a generic scheme for higher radix mixing.
The input data stream for an N-point FFT arrives at the rate of one complex word per cycle. The input counter (ic) requires
$\log_2 N$ bits. In binary format, ic can be represented as

$(ic[\log_2 N - 1], \ldots, ic[m], \ldots, ic[1], ic[0])$, where $ic[m]$ is the (m+1)th bit of the counter.
For radix $r = 2^s$, if $N = 2^q$ where q is an integer, then N can be broken into

$$N = 2^{(s-t)} (2^s)^p, \quad \text{where } 0 \le t < s \qquad (4)$$


For example, for N = 1024 and radix 8 ($r = 2^3$), (4) gives $N = 2^{(3-2)} (2^3)^3$, i.e. t = 2 and p = 3. This configuration
needs an SB of $2/2^3$. For N = 512 and $r = 2^3$, $N = 2^{(3-0)} (2^3)^2$, i.e. t = 0 and p = 2.
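The factorization in (4) can be computed mechanically. The sketch below assumes that N and r are powers of two and returns the single t in the range 0 <= t < s that satisfies (4); the helper name `decompose` is hypothetical and not part of the paper.

```python
def decompose(N, r):
    """Return (s, t, p) such that N = 2**(s - t) * (2**s)**p with 0 <= t < s."""
    q = N.bit_length() - 1          # N = 2**q
    s = r.bit_length() - 1          # r = 2**s
    if N < 2 or r < 2 or N != 1 << q or r != 1 << s:
        raise ValueError("N and r must be powers of two, both at least 2")
    p = (q - 1) // s                # number of full radix-r stages
    t = s - (q - p * s)             # the first stage has radix 2**(s - t)
    return s, t, p

print(decompose(1024, 8))  # (3, 2, 3): first stage radix 2, then three radix-8 stages
print(decompose(512, 8))   # (3, 0, 2)
print(decompose(32, 4))    # (2, 1, 2)
```

The first two calls reproduce the worked examples for N = 1024 and N = 512; the third corresponds to the N = 32, radix-4 case used in the tables.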

Figure. 3. Address generator for input stream to store in memory

The address generator for the input stream is shown in Figure 3. For $r = 4 = 2^2$ and $N = 32 = 2^{(2-1)} (2^2)^2$, t = 1, s = 2
and p = 2. Elements 0 to 31 of the input data stream, indexed by the bits $(ic[4]\, ic[3]\, ic[2]\, ic[1]\, ic[0])$, are filled
into 4 MBs. The MB is decoded by circularly shifting the 2 MSBs by 1 (t = 1 and s = 2 in this case), i.e. the MB to write is
$B(ic[3]\, ic[4])$ and $addr = (ic[2]\, ic[1]\, ic[0])$, as shown in detail in Table I. The first stage is calculated using the SB
with S = 0, and the remaining stages use S = 1.

TABLE I. INPUT PLACED IN BANKS FOR N=32 AND RADIX=4


Addr B0 B1 B2 B3
0 0 16 8 24
1 1 17 9 25
2 2 18 10 26
3 3 19 11 27
4 4 20 12 28
5 5 21 13 29
6 6 22 14 30
7 7 23 15 31
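The bank filling rule that produces Table I can be sketched as follows. The rotation direction for the general case is an assumption: the paper's example (s = 2, t = 1) is a plain swap of the two MSBs, which does not distinguish a left rotation from a right rotation. The function names are hypothetical.

```python
def bank_and_addr(ic, q, s, t):
    """Split a q-bit input counter value into (bank, addr).

    The s MSBs, rotated left by t bits, select the bank (direction assumed);
    the remaining q - s LSBs are the address inside the bank.
    """
    addr_bits = q - s
    top = ic >> addr_bits
    bank = ((top << t) | (top >> (s - t))) & ((1 << s) - 1)
    addr = ic & ((1 << addr_bits) - 1)
    return bank, addr

def fill_banks(N, s, t):
    """Place input indices 0..N-1 into r = 2**s banks of depth N/r."""
    q = N.bit_length() - 1
    r = 1 << s
    banks = [[None] * (N // r) for _ in range(r)]
    for ic in range(N):
        bank, addr = bank_and_addr(ic, q, s, t)
        banks[bank][addr] = ic
    return banks

banks = fill_banks(32, s=2, t=1)
for addr in range(8):
    print(addr, [bank[addr] for bank in banks])
```

For N = 32 and radix 4 the printed rows match Table I: B1 holds inputs 16 to 23 and B2 holds inputs 8 to 15, so the two radix-2 butterflies of the first stage can read (0, 16) and (8, 24) from four different banks in the same cycle.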

C. FFT Butterfly Computation


The FFT is computed in $(p + 1)$ stages. The first stage is a mixed-radix stage using the SB; the other stages use a normal
radix-$2^s$ butterfly. Each stage computes $N/r$ butterfly operations. For one butterfly computation, r complex inputs are fetched
simultaneously from the memory banks; the butterfly and the twiddle multiplication then produce r complex outputs, which are
reshuffled in a non-conflicting manner and finally written back into the MBs for the next FFT stage. Conflict-free reshuffling is
performed at both the inputs and the outputs of the butterfly.

The butterfly counter (bc) has $\log_2(N/r)$ bits, denoted $(bc[\log_2(N/r) - 1], \ldots, bc[1], bc[0])$. Addresses for each
butterfly operation are generated by XORing an address coefficient (xaddr) with specific bits of bc, as given by (6). xaddr is
different for every stage and bank but is the same for all addresses within a bank, as shown in (5). The memory bank selector m
consists of s bits, represented in binary as $(m[s-1], \ldots, m[1], m[0])$. For memory bank Bn, m = n.

$$xaddr(stage, m) = m \sum_{i=0}^{stage-1} 2^{\,s(p-i-1)}, \quad \text{where } s = \log_2 r \qquad (5)$$

When computing the address, all bits above the $\log_2(N/r)$-th bit are ignored, as shown in Table II. The above equation is
implemented in HW using left shifts ('<<') and OR gates.

TABLE II. VALUE OF XADDR IN BINARY FOR N = 32 AND R=4

Stage B0 B1 B2 B3
1 0[000] 0[000] 0[000] 0[000]
2 0[000] 0[100] 1[000] 1[100]
3 0[000] 0[101] 1[010] 1[111]
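The address coefficient can be built with shifts and ORs, as the text describes. The sketch below numbers the stages from 1, as Table II does, and accumulates the terms for i = 0 to stage - 2, which corresponds to (5) with the stage counted from 0. That indexing is an assumption chosen because it reproduces the Table II values; the function name `xaddr` simply mirrors the paper's symbol.

```python
def xaddr(stage, m, s, p):
    """Address coefficient for bank m at a 1-based stage number.

    ORs together m << s*(p-i-1) for i = 0 .. stage-2. Because m has only
    s bits and the shifts are multiples of s, OR and addition coincide.
    """
    x = 0
    for i in range(stage - 1):
        x |= m << (s * (p - i - 1))
    return x

# N = 32, r = 4: s = 2, p = 2, and each bank address has log2(N/r) = 3 bits.
for stage in (1, 2, 3):
    row = [xaddr(stage, m, s=2, p=2) for m in range(4)]
    print(stage, [f"{x >> 3}[{x & 7:03b}]" for x in row])
```

The bracketed three bits in each printed entry are the part that is actually used as an address; the leading bit is the one that gets ignored.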

$$x_b[m'] = Bm[\,xaddr(stage, m) \oplus bc\,] \qquad (6)$$


For one butterfly operation, r data words are required from r banks ($m = 0, 1, 2, \ldots, r-1$). Data read from the banks is
reshuffled before the butterfly operation. The coefficient read from the m-th bank is fed to input m' of the butterfly, as given
by (7) and (8). The corresponding HW implementation is shown in Figure 4.

$$m' = m \oplus \big(bc[(p - stage + 2)s - 1], \ldots, bc[(p - stage + 1)s]\big) \qquad (7)$$

$$bc[n] = 0 \quad \text{if } n \ge \log_2(N/r) \qquad (8)$$

For stage 1, $m' = m$ because $ps \ge \log_2(N/r)$, which means that input reshuffling is not needed for the first stage. The
butterfly computes output $X[m']$ from input $x[m']$, as shown in Figure 2.

Figure. 4. Butterfly input reshuffling for mth Bank using butterfly Counter
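The read path of (6) to (8) can be checked against the bank contents that Table III lists after stage 2 (N = 32, r = 4, so s = 2 and p = 2). This is a sketch under stated assumptions: stages are numbered from 1, `xaddr` sums the terms i = 0 to stage - 2 so that it matches Table II, and all helper names are hypothetical.

```python
def xaddr(stage, m, s, p):
    """Address coefficient of (5) for a 1-based stage (assumed indexing)."""
    x = 0
    for i in range(stage - 1):
        x |= m << (s * (p - i - 1))
    return x

def bc_word(bc, lo, s, nbits):
    """(bc[lo+s-1], ..., bc[lo]) with bits at or above nbits read as 0, per (8)."""
    return sum(((bc >> n) & 1) << (n - lo)
               for n in range(lo, lo + s) if 0 <= n < nbits)

def read_inputs(banks, stage, bc, s, p):
    """Fetch one word per bank via (6) and route it to butterfly input m' via (7)."""
    r, depth = 1 << s, len(banks[0])
    nbits = depth.bit_length() - 1            # log2(N/r)
    shuffle = bc_word(bc, (p - stage + 1) * s, s, nbits)
    x = [None] * r
    for m in range(r):
        addr = (xaddr(stage, m, s, p) ^ bc) & (depth - 1)
        x[m ^ shuffle] = banks[m][addr]
    return x

# Bank contents after stage 2, copied from Table III (rows are B0..B3).
AFTER_STAGE2 = [
    [0, 5, 10, 15, 16, 21, 26, 31],
    [20, 17, 30, 27, 4, 1, 14, 11],
    [8, 13, 2, 7, 24, 29, 18, 23],
    [28, 25, 22, 19, 12, 9, 6, 3],
]
for bc in range(8):
    print(bc, read_inputs(AFTER_STAGE2, stage=3, bc=bc, s=2, p=2))
```

Every butterfly of stage 3 receives four neighbouring elements (4·bc to 4·bc + 3) in order, and each of the four words comes from a different bank, which is the conflict-free property the scheme is after.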

D. Twiddle Multiplication
As per the FFT decimation-in-frequency (DIF) flow graph, the twiddle factors are multiplied after the butterfly computation. Since
output X[0] is multiplied by $(1 + 0j)$, only $r - 1$ butterfly outputs need complex multipliers.

$$n = \begin{cases} bc & \text{if } stage = 1 \\ \left(bc \cdot 2^{\,(stage-1)s - t}\right) \text{ AND } \left(\frac{N}{r} - 1\right) & \text{if } stage \ne 1 \end{cases} \qquad (9)$$

$$Tw(k) = W_N^{k} \qquad (10)$$

$$k = n \cdot bitreversed(m) \qquad (11)$$

TABLE III. STATE OF MEMORY BANKS AFTER COMPUTATION OF FFT STAGES

      Before Stage1   After Stage1    After Stage2    After Stage3
Addr  B0 B1 B2 B3     B0 B1 B2 B3     B0 B1 B2 B3     B0 B1 B2 B3
0      0 16  8 24      0 16  8 24      0 20  8 28      0 21 10 31
1      1 17  9 25      1 17  9 25      5 17 13 25      4 17 14 27
2      2 18 10 26      2 18 10 26     10 30  2 22      8 29  2 23
3      3 19 11 27      3 19 11 27     15 27  7 19     12 25  6 19
4      4 20 12 28     20  4 28 12     16  4 24 12     16  5 26 15
5      5 21 13 29     21  5 29 13     21  1 29  9     20  1 30 11
6      6 22 14 30     22  6 30 14     26 14 18  6     24 13 18  7
7      7 23 15 31     23  7 31 15     31 11 23  3     28  9 22  3

where '⌊.⌋' is the greatest integer operator and k represents the index into the twiddle ROM.

TABLE IV. VALUE OF k FOR ALL STAGES (N = 32 AND RADIX = 2/4)

      Stage1          Stage2          Stage3
bc    m=0  1  2  3    m=0  1  2  3    m=0  1  2  3
0       0  0  0  8      0  0  0  0      0  0  0  0
1       0  1  0  9      0  4  2  6      0  0  0  0
2       0  2  0 10      0  8  4 12      0  0  0  0
3       0  3  0 11      0 12  6 18      0  0  0  0
4       0  4  0 12      0  0  0  0      0  0  0  0
5       0  5  0 13      0  4  2  6      0  0  0  0
6       0  6  0 14      0  8  4 12      0  0  0  0
7       0  7  0 15      0 12  6 18      0  0  0  0
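The stage 2 and stage 3 columns of the table of k values can be reproduced from (9) and (11). The sketch below covers only stages after the first, since the first stage uses a separate index path. The AND mask is taken to be N/r - 1; that is an assumption, chosen because it matches the tabulated values for N = 32 and r = 4. The function names are hypothetical.

```python
def bit_reverse(m, s):
    """Reverse the s-bit binary representation of m."""
    out = 0
    for _ in range(s):
        out = (out << 1) | (m & 1)
        m >>= 1
    return out

def twiddle_index(stage, bc, m, N, s, t):
    """Twiddle ROM index k for bank m at a 1-based stage >= 2."""
    if stage < 2:
        raise ValueError("the first stage uses a different index computation")
    r = 1 << s
    n = (bc << ((stage - 1) * s - t)) & (N // r - 1)   # (9); mask is assumed
    return n * bit_reverse(m, s)                        # (11)

# N = 32, radix 2/4: s = 2, t = 1.
for bc in range(8):
    stage2 = [twiddle_index(2, bc, m, N=32, s=2, t=1) for m in range(4)]
    stage3 = [twiddle_index(3, bc, m, N=32, s=2, t=1) for m in range(4)]
    print(bc, stage2, stage3)
```

In the last stage every index is 0, so all twiddles equal 1 and the final radix-4 butterflies need no multiplication.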

The twiddle index computation for k is different for the first stage and is shown by the dotted line in Figure 5. After the first
butterfly computation, r parallel outputs are produced, as given by (12).

$$Xt[m] = Tw[m] \cdot X[m] \qquad (12)$$

Table IV shows the value of k for all stages for N = 32 and r = 4. The twiddle ROM stores only one-eighth of the complex
exponential, and the actual twiddle values are generated from the ROM values; the details are outside the scope of this paper.
After reshuffling as per (13), the intermediate coefficients are saved in the MBs for the next stage of computation.

$$m' = m \oplus \big(bc[(p - stage + 3)s - 1], \ldots, bc[(p - stage + 2)s]\big) \qquad (13)$$

where $bc[n] = 0$ if $n \le 0$.

$$Bm'[\,xaddr(stage, m') \oplus bc\,] = Xt[m] \qquad (14)$$

Figure. 5. Twiddle generator for bank Bm. t>0 only for first stage of mixed-radix FFT
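The complete read, reshuffle and write-back cycle can be traced by moving position labels instead of data, which makes it possible to compare against Table III. The sketch below rests on several assumptions that are not taken from the text: stages are numbered from 1; `xaddr` sums the terms i = 0 to stage - 2 so that it matches Table II; each output keeps the label of the input at the same butterfly port; and the output reshuffle uses the bc bit window that the next stage uses for its input reshuffle, i.e. bc[(p - stage + 1)s - 1] down to bc[(p - stage)s] with out-of-range bits read as 0. That last choice is the one that reproduces Table III for N = 32 and r = 4. All names are hypothetical.

```python
def xaddr(stage, m, s, p):
    """Address coefficient of (5) for a 1-based stage (assumed indexing)."""
    x = 0
    for i in range(stage - 1):
        x |= m << (s * (p - i - 1))
    return x

def bc_word(bc, lo, s, nbits):
    """(bc[lo+s-1], ..., bc[lo]); bits outside 0..nbits-1 read as 0."""
    return sum(((bc >> n) & 1) << (n - lo)
               for n in range(lo, lo + s) if 0 <= n < nbits)

def run_stage(banks, stage, s, p):
    """Move position labels through one stage and return the new bank state."""
    r, depth = 1 << s, len(banks[0])
    nbits = depth.bit_length() - 1
    out = [row[:] for row in banks]
    for bc in range(depth):
        rd = bc_word(bc, (p - stage + 1) * s, s, nbits)   # input window, (7)
        wr = bc_word(bc, (p - stage) * s, s, nbits)       # assumed output window
        x = [None] * r
        for m in range(r):                                # read as in (6)
            x[m ^ rd] = banks[m][(xaddr(stage, m, s, p) ^ bc) & (depth - 1)]
        for m in range(r):                                # write as in (14)
            mo = m ^ wr
            out[mo][(xaddr(stage, mo, s, p) ^ bc) & (depth - 1)] = x[m]
    return out

# Initial fill for N = 32, r = 4, as in Table I (rows are B0..B3).
state = [list(range(0, 8)), list(range(16, 24)),
         list(range(8, 16)), list(range(24, 32))]
history = []
for stage in (1, 2, 3):
    state = run_stage(state, stage, s=2, p=2)
    history.append(state)
    print("after stage", stage, state)
```

Within one butterfly the set of (bank, address) pairs written equals the set read, so the computation stays in place and no two words of a butterfly ever target the same bank.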

E. Output Order
The FFT processor data output order can be natural (NAT) or bit-reversed (BREV). The proposed scheme simplifies the address
generator for both NAT and BREV output order by implementing a $\log_2 N$-bit-wide output counter (oc). The counter oc is used to
generate the bank select m and the memory address for the output, as shown in Figure 6.

The intermediate states of the MBs for all stages of the FFT (N = 32 and radix = 4) are shown in detail in Table III.

Figure. 6. NAT and BREV mem-address and bank selector generator using output-counter (oc).


IV. COMPARISON
The proposed architecture is compared with the previous work in [1], [2] and [5]. The comparison covers the number of cycles
taken for the FFT calculation for a given radix. The work in [1] cannot operate as a CF-FFT for 8192 points unless 8 parallel
paths are provided. The work in [2] is for fixed radix 2/4, while the proposed architecture can work for any radix $2^s$, where s
is a positive integer. The results show that, for every FFT size listed, the proposed architecture takes half as many computation
cycles as [5] (and as [1], where [1] is applicable) and fewer than half as many as [2]. A detailed comparison is given in Table V.

TABLE V. COMPARISON OF COMPUTATION CYCLES EXCLUDING I/O
(r: radix, C: computation cycles)

        [1]                  [2]           [5]           Proposed
N       r           C        r     C       r     C       r     C
1024    2/2^2/2^3   1024     4     1280    2/8   1024    2/8   512
2048    2/2^2/2^3   2048     2/4   3072    4/8   2048    4/8   1024
4096    2/2^2/2^3   4096     4     6144    8     4096    8     2048
8192    2/2^2/2^3   ------   2/4   14336   2/8   10240   2/8   5120
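The cycle counts in the "Proposed" column are consistent with N/r butterfly operations per stage and p + 1 stages, under the assumption that one butterfly operation completes per cycle. The paper does not state this formula explicitly, so the sketch below is only a consistency check; the function name is hypothetical.

```python
def proposed_cycles(N, r):
    """(N / r) butterfly operations per stage times (p + 1) stages."""
    q = N.bit_length() - 1      # N = 2**q
    s = r.bit_length() - 1      # r = 2**s
    p = (q - 1) // s            # full radix-r stages, from N = 2**(s-t) * (2**s)**p
    return (N // r) * (p + 1)

for N in (1024, 2048, 4096, 8192):
    print(N, proposed_cycles(N, 8))
```

With r = 8 this gives 512, 1024, 2048 and 5120 cycles for the four listed sizes, matching the table.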

V. CONCLUSION
This paper proposed a hardware-efficient architecture for a configurable mixed-radix FFT processor. A generalized addressing
scheme for generic mixed-radix operation in powers of 2, together with the flexibility to configure for variable sizes, allows
parallel processing to increase throughput. The architecture can support 2N memories for a CF-FFT implementation, resulting in a
lower latency of 3N/2 cycles. The proposed architecture was validated for latency, throughput and HW resource utilization in
different technologies at various operating frequencies using Catapult High-Level Synthesis. A discussion of the HLS results is
beyond the scope of this paper. The proposed architecture can be used in applications ranging from high to low bit rates, such as
optical coherent processors, DOCSIS 3.1, 5G and LTE, targeting advanced technology nodes.

REFERENCES
[1] Pei-Yun Tsai and Chung-Yi Lin, “A generalized conflict-free memory addressing scheme for continuous-flow parallel-processing FFT processors with rescheduling,” IEEE Trans. VLSI Systems, vol. 19, no. 12, pp. 2290–2302, Dec. 2011.
[2] B. G. Jo and M. H. Sunwoo, “New continuous-flow mixed-radix (CFMR) FFT processor using novel in-place strategy,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 5, pp. 911–919, May 2005.
[3] L. G. Johnson, “Conflict free memory addressing for dedicated FFT hardware,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 5, pp. 312–316, May 1992.
[4] J. A. Hidalgo, J. Lopez, F. Arguello, and E. L. Zapata, “Area-efficient architecture for fast Fourier transform,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 46, no. 2, pp. 187–193, Feb. 1999.
[5] C. F. Hsiao, Y. Chen, and C. Y. Lee, “A generalized mixed-radix algorithm for memory-based FFT processors,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 1, pp. 26–30, Jan. 2010.
[6] J. Y. Yu and Y. Li, “An efficient conflict-free parallel memory access scheme for dual-butterfly constant geometry radix-2 FFT processor,” in Proc. Int. Conf. Signal Processing, Oct. 2008, pp. 458–461.

This paper was originally presented at DVCon 2016, Bangalore, India.
