
Journal of VLSI Signal Processing 28, 115–128, 2001

© 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.


Implementation of a Communications Channelizer using FPGAs
and RNS Arithmetic

UWE MEYER-BÄSE
Department of Electrical and Computer Engineering, FAMU-FSU College of Engineering, Tallahassee, FL 32310-6046
ANTONIO GARCÍA
Dpto. Ingeniería Informática, Universidad Autónoma de Madrid
FRED TAYLOR
High Speed Digital Architecture Laboratory, University of Florida, Gainesville, FL 32611-6130
Received July 1999; Revised December 1999
Abstract. Field-programmable logic (FPL) devices, often grouped under the popular name field-programmable gate arrays (FPGA), are on the verge of revolutionizing sectors of the digital signal processing (DSP) industry, as programmable DSP microprocessors did nearly two decades ago. Historically, FPGAs were considered to be only a rapid prototyping and low-volume production technology. FPGAs are now attempting to move into the DSP mainstream as their density and performance envelope steadily improve. While evidence now supports the claim that FPGAs can accelerate selected low-end DSP applications (e.g., FIR filters), the technology remains limited in its ability to realize high-end DSP solutions. This is due primarily to systemic weaknesses in FPGA-facilitated arithmetic processing. It will be shown that in such cases, the residue number system (RNS) can become an enabling technology for realizing embedded high-end FPGA-centric DSP solutions. This thesis is developed in the context of a demonstrated RNS/FPGA synergy and the application of the new technology to communication signal processing.
Keywords: field-programmable logic (FPL), field-programmable gate array (FPGA), complex programmable logic devices (CPLD), digital signal processing (DSP), residue number system (RNS), channelizer, zero-IF filter
1. Introduction
Experts generally agree that future signal processing systems will contain deeply embedded DSP elements having a performance envelope at least 10× greater than that possessed by the existing DSP µp art. These designs will normally manifest themselves as an application-specific integrated circuit (ASIC). Market forces require that ASIC solutions be rapidly developed in order to ensure early market entry. This market reality makes FPGAs a potentially attractive facilitating technology in cases where the high non-recurring engineering (NRE) costs of cell-based (standard cell) or custom VLSI solutions cannot be justified. This claim, of course, is predicated on the FPGA solution meeting all other stated performance requirements. It will be shown in the next section that FPGAs are actually multiply-accumulate (MAC) deficient. Since MACs are fundamental to virtually every DSP operation, it will be argued that the path to successful FPGA assimilation into the greater DSP arena is predicated on improving FPGA-enabled MAC performance. One candidate for doing so is the residue number system, or RNS [1].

Portions of the results presented here have been presented at the IEEE ICASSP '97/'98 and SPIE '99 conferences.
Table 1. Adder and multiplier speed and complexity versus wordwidth.

        ADD                       MUL
Bits    8      16     26     32   9 × 9    12 × 12
MSPS    137    73     51     45   71       69
# LE    8      16     26     32   217      328
The RNS MAC has been shown, in numerous academic and industrial studies, to define a compact, high-bandwidth, high-precision DSP solution in a variety of DSP instances. The RNS is just now beginning to appear as a commercial off-the-shelf (COTS) cell-based technology for use in high-end DSP applications. The potential impact of the RNS on FPGA-centric system design is, at this time, unquantified.
FPGAs are being aggressively touted as a viable DSP ASIC technology by FPGA vendors. Vendors and IP providers point to FPGA-enabled designs whose performance far surpasses that of systems designed using DSP µps in implementing some baseline DSP operations, such as finite impulse response (FIR) digital filters. Regardless of the marketing hyperbole, it should be appreciated that FPGAs possess an intrinsically weak general-purpose arithmetic unit that limits their DSP applicability, especially in high-end applications. To illustrate, consider the design of a two's complement ripple-carry adder with fast carry chains. The DSP arithmetic primitives reported in Table 1 are based on a Flex10K device where each logic element (LE) is a $2^3 \times 1$ table. It can be seen that the adder speed reduction is highly correlated with precision. While 8 bits may be marginally acceptable for some 100+ MIPS applications, most DSP solutions require significantly higher precision. Achieving both high precision and high bandwidth with an FPGA is therefore problematic.
Reported FPGA-enabled DSP solutions have, in fact, not been based upon faster arithmetic units, but rather on finding ways of masking arithmetic deficiencies.
Table 2. Comparison of common FPGA resources.

FPGA             Logic unit                   Routing hierarchy           Memory                    Largest part
Xilinx 4000      Configurable logic blocks    5-level routing (adjacent,  32 bits per CLB           40125XV; 64 × 64 CLB
                 (CLB), two 4-bit LUT,        double, quad, long,                                   array, 448 user I/O
                 one 3-bit LUT, F/F           and global)
Altera Flex 10K  Logic element (LE), logic    3-level (local, fast,       Embedded array (EAB)      EPF10K250, 1520 LABs,
                 block of 8 LEs, 4-bit        global)                     of 2K bits, 3-12 EABs     470 user I/O
                 LUT, F/F                                                 per chip
It will be shown in the next section that the two most commonly encountered masking methods are distributed arithmetic (DA) [2] and reduced adder graphs (RAG) [3]. These techniques apply to the implementation of low-order, low-precision fixed-point DSP algorithms having fixed coefficients (e.g., FIRs, IIRs, and DFT). This defines a very narrow window of DSP/FPGA synergy. Implementing run-time, fully programmable, high-bandwidth, high-precision, real and complex arithmetic DSP objects with FPGAs remains a challenging problem today. Unless this arithmetic barrier can be overcome, FPGAs will continue to be primarily a prototyping technology and will never become a viable DSP facilitating, let alone enabling, technology. What is needed is a means of overcoming the arithmetic limitations of FPGAs. This paper presents a promising approach to this challenge based on the recent development of cell-based RNS arithmetic units. It will be shown that, in the RNS, arithmetic operations are implemented as a set of concurrent primitive operations within non-communicating small-wordlength channels. As a result, the RNS is highly synergistic with existing FPGA architectures, which naturally partition a device into independent small-wordlength channels. To illustrate this thesis, the implementation of an advanced communication receiver (channelizer) will be presented. The design will demonstrate that the RNS can enable high-end solutions that existing FPGA design methodology cannot achieve.
2. FPGA Overview
Field programmable logic (FPL) devices are marketed under a variety of names, including field programmable gate arrays (FPGA) and complex programmable logic devices (CPLD). Devices from two technology suppliers, Xilinx and Altera, are compared in Table 2. FPGAs are considered fine-grain devices, consisting of small logic cells (LC) (e.g., Xilinx XC4000) and various routing channels (short, local, and long lines). CPLDs have comparatively larger logic blocks with fast busses connecting array blocks (e.g., Altera FLEX [4]). The historical advantage of FPLs has been their in-circuit programmability and ability to support rapid prototyping. FPLs have been promoted in custom computing machine (CCM) applications, where they have been reported to achieve speed-up factors ranging from 10 to 1000 compared to conventional workstations [5, Table 1]. FPLs can support accelerated arithmetic using fast carry chains (Xilinx XC4000, Altera FLEX) that can be used to implement high-bandwidth MACs. This motivates the claim by FPGA devotees that the technology is not just for prototyping anymore, but rather is a viable DSP facilitating technology. They note that complexity levels have reached 1M gates, and that the technology, if properly interpreted and configured, can:
• exploit algorithm parallelism: implement multiple MAC calls
• maximize gate efficiency: remove zero product terms
• exploit pipelining: each logic cell contains a register and therefore requires no additional pipelining resources
FPGA advocates point to a demonstrated 50× speed-up in pixel averaging, 30× in pattern recognition, and 100× in edge detection over a TMS320C30 solution [6]. While comparing a fixed-point ASIC to a floating-point DSP µp is not a true comparison, such examples do demonstrate that, when properly employed, FPGAs can be a competitive DSP technology. The FPGA DSP claims of superiority to date are actually attributable to application selection rather than to intrinsic technological advantages. Using DA or RAG techniques, engineers have been able to accelerate selected linear time-invariant (LTI) digital filters and transforms (i.e., constant coefficient). A simple audit of the DSP vendor and IP support libraries of Xilinx and Altera illustrates this point. What is important to note is that DA and RAG FPGA-enabled DSP objects are intrinsically of low precision and low order. Absent from this list are programmable high-end objects such as run-time programmable precise filters, adaptive filters, neural nets, and so forth.
3. Residue Number System (RNS)
The silicon area (complexity) associated with a constant-speed fixed-point MAC unit is generally considered to increase geometrically with wordlength. The antithesis is the RNS, which establishes a linear relationship between MAC speed and silicon area [1]. The RNS, therefore, provides an opportunity to overcome the precision barrier in high-performance FPL applications. RNS integer arithmetic is performed concurrently within parallel, non-communicating, small-wordlength channels. An RNS system is defined in terms of a basis set $\{m_1, m_2, \ldots, m_L\}$ of relatively prime positive integers called moduli. The dynamic range of the resulting system is defined by the product of the moduli and is given by $M = \prod_{i=1}^{L} m_i$. RNS arithmetic is defined with respect to the ring isomorphism:

$$\mathbb{Z}_M \cong \mathbb{Z}_{m_1} \times \mathbb{Z}_{m_2} \times \cdots \times \mathbb{Z}_{m_L} \tag{1}$$
Specifically, $\mathbb{Z}_M = \mathbb{Z}/(M)$ corresponds to the ring of integers modulo $M$. The mapping of an integer $X$ into the RNS is defined to be the $L$-tuple $X = (x_1, x_2, \ldots, x_L)$ where $x_i = X \bmod m_i$, for $i = 1, 2, \ldots, L$. This is generally assumed to be a straightforward process that can be directly implemented in hardware using small lookup tables.
Defining $\circ$ to be any one of the algebraic operations $+$, $-$, or $\times$, it follows that if $0 \le Z < M$, then:

$$Z = X \circ Y \bmod M \tag{2}$$

is isomorphic to $Z = (z_1, z_2, \ldots, z_L)$ where:

$$z_i = (x_i \circ y_i) \bmod m_i, \qquad i = 1, 2, \ldots, L \tag{3}$$
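As a minimal sketch of the residue mapping and of the channel-wise arithmetic in Eqs. (2) and (3), the following Python fragment may help. The moduli are the ones used in the CIC design example of Section 6; the names `to_rns` and `rns_op` are illustrative assumptions and do not come from the original design.

```python
from math import prod
from operator import mul, sub

MODULI = (256, 63, 61, 59)   # pairwise relatively prime moduli (from Section 6)
M = prod(MODULI)             # dynamic range M = m_1 * m_2 * ... * m_L


def to_rns(x, moduli=MODULI):
    """Map an integer x to its L-tuple of residues x_i = x mod m_i."""
    return tuple(x % m for m in moduli)


def rns_op(xr, yr, op, moduli=MODULI):
    """Apply +, - or * independently in every channel (no carries between channels)."""
    return tuple(op(a, b) % m for a, b, m in zip(xr, yr, moduli))


X, Y = 1234, 5678
product = rns_op(to_rns(X), to_rns(Y), mul)      # channel-wise multiplication
difference = rns_op(to_rns(X), to_rns(Y), sub)   # channel-wise subtraction
```

Each channel only ever sees operands of at most 8 bits, which is the property that makes the RNS attractive for small-table FPGA fabrics.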
It should be self-evident that RNS arithmetic is performed in parallel within small, non-communicating (i.e., carry-free) channels whose wordwidth is bounded by $n_i = \lceil \log_2(m_i) \rceil$, where $n_i \le 8$ bits (typically). In practice, most RNS arithmetic systems use small RAM or ROM tables to implement the modular mappings $z_i = (x_i \circ y_i) \bmod m_i$ as LUT calls. Using direct LUT operations can, however, create a technological problem. If the address of the LUT is formed by concatenating the arguments $(x_i, y_i)$, then a $2^{2n_i} \times n_i$-bit table is required. A 7-bit modulus, for example, would require a 114K-bit table, which is beyond the current capabilities of a modern FPGA. Specifically, consider again an $n_i$-bit modulus and two residues, say $x_i$ and $y_i$, used to create a product $z_i = (x_i y_i) \bmod m_i$, which is $n_i$ bits wide. If the desired modulus size is on the order of 6 to 8 bits, an unreasonable 12- to 16-bit table lookup (TLU) address space results. The address space requirement can, however, be reduced by nearly half by using the quarter-square algorithm:
$$z_i = (x_i y_i) \bmod m_i \tag{4}$$

$$= \left( \left\lfloor \frac{(x_i + y_i)^2}{4} \right\rfloor - \left\lfloor \frac{(x_i - y_i)^2}{4} \right\rfloor \right) \bmod m_i \tag{5}$$

$$= \left( \phi(x_i, y_i) - \theta(x_i, y_i) \right) \bmod m_i \tag{6}$$

where $\phi(x_i, y_i)$ and $\theta(x_i, y_i)$ are obtained from LUT calls using the sum and difference of the residues as an $(n_i + 1)$-bit wide address. Compared to a direct implementation of a standard RNS multiplier, the table requirements are reduced from $2^{2n_i} \times n_i$ bits to $2 \times 2^{n_i+1} \times n_i = 2^{n_i+2} \times n_i$ bits. The savings for a 7-bit modulus is a factor of 32. What is more important is that the multiplication LUTs can now be contained within an 8-bit FPGA channel. As a result, 7-bit moduli could be used in conjunction with 8-bit FPGA tables to implement a standard RNS multiplier.
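The quarter-square idea can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the two tables are called `phi` and `theta` here, the difference table is addressed by the magnitude $|x_i - y_i|$, and the modulus 127 is simply an arbitrary 7-bit example.

```python
def build_quarter_square_tables(m):
    """Build the two quarter-square tables for modulus m.

    phi is addressed by the sum s = x + y (0 .. 2m-2), theta by the
    magnitude of the difference d = |x - y| (0 .. m-1); both store
    floor(v*v / 4) mod m.
    """
    phi = [(s * s // 4) % m for s in range(2 * m - 1)]
    theta = [(d * d // 4) % m for d in range(m)]
    return phi, theta


def qs_mul(x, y, m, phi, theta):
    """Modular product via two table look-ups and one modular subtraction."""
    return (phi[x + y] - theta[abs(x - y)]) % m


m = 127                                   # a 7-bit modulus (illustrative choice)
phi, theta = build_quarter_square_tables(m)

# Table-size arithmetic from the text, for n = 7 bits
n = 7
direct_bits = 2 ** (2 * n) * n            # 2^(2n) x n bits (114,688 bits)
quarter_square_bits = 2 * 2 ** (n + 1) * n   # 2 x 2^(n+1) x n bits
```

The identity holds exactly because $x + y$ and $x - y$ always have the same parity, so the two floor operations discard the same fraction.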
Conversion from the RNS to integers is performed using either the Chinese Remainder Theorem (CRT) or the mixed-radix conversion (MRC) algorithm. The direct implementation of either form can be awkward, but efficient forms of these algorithms can be found in the literature.
Demonstration RNS systems have been built as custom VLSI [7] (see Fig. 1), GaAs, and LSI systems [1]. These studies have demonstrated the speed-area advantage of the RNS in implementing MAC-intensive algorithms. The 0.8 µm system shown in Fig. 1 contains twenty-four 32-bit MACs. Running at the speed of a TMS320C5x MAC, the RNS MAC's footprint is 1/14th of the C5x's. For small wordlengths, the RNS can provide significant speed-ups [8] using the $2^4 \times 2$-bit tables found in Xilinx XC4000 FPGAs. For larger moduli, the $2^8 \times 8$-bit tables belonging to the Altera FLEX CPLDs are beneficial in designing RNS arithmetic and RNS-to-integer converters. With the ability to support larger moduli, the design of high-precision FPL systems becomes a practical reality.

Figure 1. RNS systolic array chip [7].
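For reference, a direct (unoptimized) CRT reconstruction can be sketched as follows. It is the textbook form, not one of the efficient hardware variants mentioned above; the function name `from_rns` is an assumption, and the three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
from math import prod


def from_rns(residues, moduli):
    """Reconstruct X in [0, M) from its residues using the textbook CRT."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m                      # product of all other moduli
        x += r * Mi * pow(Mi, -1, m)     # pow(Mi, -1, m): inverse of Mi modulo m
    return x % M


MODULI = (256, 63, 61, 59)               # moduli of the Section 6 design example
X = 12345678
residues = tuple(X % m for m in MODULI)
recovered = from_rns(residues, MODULI)
```

The final reduction modulo $M$ is a wide-wordlength operation, which is one reason the direct form is considered awkward in hardware.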
There are several variations of the RNS theme which apply to DSP. One of the popular variants is based on the use of index arithmetic [9]. It is similar, in some respects, to the form taken by the logarithmic number system (LNS). Computation in the index domain is based on the fact that if all the moduli are chosen to be primes $p_i$, then it is known from number theory that there exists a primitive element (i.e., generator $g$) such that:

$$a \equiv g^{\alpha} \bmod p \tag{7}$$

The element $g$ generates all elements in the field $\mathbb{Z}_p$, excluding zero (denoted $\mathbb{Z}_p / \{0\}$). There is, in fact, a one-to-one correspondence between the integers $a$ in $\mathbb{Z}_p / \{0\}$ and the exponents $\alpha$, which are defined in $\mathbb{Z}_{p-1}$. As a point of terminology, the index $\alpha$ with respect to the generator $g$ and integer $a$ is denoted $\alpha = \mathrm{ind}_g(a)$. For notational purposes the element $a = 0$ is denoted $g^{-\infty} = 0$. The structure of this system suggests that arithmetic requires that the exponent be manipulated. This is referred to as index algebra.
Multiplication of RNS numbers can be performed in the index domain using the following procedure:
1. transform $X$ and $Y$ into the index domain (i.e., $X = g^{x}$ and $Y = g^{y}$)
2. add the index values modulo $p - 1$ (i.e., $z = (x + y) \bmod (p - 1)$)
3. transform the sum back to the original domain (i.e., $P = g^{z}$)
If the data being processed is in index form, then multiplication can be performed using only exponent addition mod $(p - 1)$. The advantage gained by index processing is found in the fact that the multiplicative table size, when compared to the standard RNS of comparable moduli size, is reduced from $2^{2n} \times n$ to $2^{n} \times n$ based on $n$-bit moduli. If the modulo adder in step two is replaced by a binary adder, then the multiplier correction table is $2^{(n+1)} \times n$, or twice as large as that requiring modulo adders. In either case, this can be beneficial in FPGA designs where only small tables are generally available.
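The three steps above can be sketched in Python as follows. The prime 61 is one of the moduli used later in the paper; the brute-force generator search and the names `find_generator`, `build_index_tables`, and `index_mul` are assumptions made for illustration only.

```python
def find_generator(p):
    """Return the smallest primitive element g of Z_p (p an odd prime)."""
    for g in range(2, p):
        if len({pow(g, k, p) for k in range(p - 1)}) == p - 1:
            return g
    raise ValueError("no generator found; is p an odd prime?")


def build_index_tables(p):
    """Build the forward (value -> index) and inverse (index -> value) tables."""
    g = find_generator(p)
    antilog = [pow(g, k, p) for k in range(p - 1)]      # k -> g^k mod p
    index = {a: k for k, a in enumerate(antilog)}       # a -> ind_g(a)
    return g, index, antilog


def index_mul(X, Y, p, index, antilog):
    """Multiply via index addition modulo p-1; zero has no index and is handled separately."""
    if X == 0 or Y == 0:
        return 0
    z = (index[X] + index[Y]) % (p - 1)                 # step 2
    return antilog[z]                                   # step 3


p = 61
g, index, antilog = build_index_tables(p)
```

Only the two small conversion tables and a modulo $(p-1)$ adder are needed, which mirrors the table-size argument made in the text.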
The advantage gained in index multiplication, however, is somewhat mitigated when index addition is encountered. Addition can technically be performed by converting index-coded RNS data back into the RNS domain, where the summands can be added. Once the sum is formed, the result can be mapped back into the index domain. Another approach is based on Zech logarithms [10], where a Zech logarithm is defined as:

$$Z(k) = \mathrm{ind}_g\left(1 + g^{k}\right) \iff g^{Z(k)} = 1 + g^{k} \tag{8}$$

The sum of index-coded numbers, say $X = g^{x}$ and $Y = g^{y}$, is expressed as:

$$X + Y = g^{z} = g^{x} + g^{y} = g^{y}\left(1 + g^{x-y}\right) \tag{9}$$

or, in terms of a Zech logarithm:

$$g^{z} = g^{y} \cdot g^{Z(x-y)} \iff z = y + Z(x - y). \tag{10}$$

Adding numbers in the index domain requires one addition, one subtraction, and a Zech LUT. The special case of a zero sum, $X + Y \equiv 0$, corresponds to the case where [11]:

$$X \equiv -Y \bmod p \iff g^{x} \equiv g^{y + (p-1)/2} \bmod p.$$

That is, the sum is zero if, in the index domain, $x = y + (p - 1)/2 \bmod (p - 1)$.
Therefore, implementing a basic DSP object with Zech logarithms will reduce the number of necessary LUTs for FPLs to the minimum of one per MAC cell.
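A hedged Python sketch of index-domain addition with a Zech table is given below. The choice $p = 61$ with generator $g = 2$, the use of `None` to represent the index of zero, and all function names are assumptions for illustration.

```python
def build_tables(p, g):
    """Build index, antilog and Zech tables for prime p and generator g."""
    antilog = [pow(g, k, p) for k in range(p - 1)]       # k -> g^k mod p
    if len(set(antilog)) != p - 1:
        raise ValueError("g is not a generator modulo p")
    index = {a: k for k, a in enumerate(antilog)}        # a -> ind_g(a)
    zech = {}
    for k in range(p - 1):
        s = (1 + antilog[k]) % p
        zech[k] = index[s] if s != 0 else None           # None marks 1 + g^k = 0
    return index, antilog, zech


def index_add(x, y, p, zech):
    """Add two index-coded numbers: z = y + Z(x - y), all modulo p-1."""
    if x is None:
        return y
    if y is None:
        return x
    zk = zech[(x - y) % (p - 1)]
    return None if zk is None else (y + zk) % (p - 1)


P, G = 61, 2
index, antilog, zech = build_tables(P, G)


def to_index(X):
    return None if X == 0 else index[X]


def from_index(i):
    return 0 if i is None else antilog[i]
```

The single entry where the Zech table is undefined is $k = (p-1)/2$, which is exactly the zero-sum special case described above.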
Another RNS variant applies to cases where complex arithmetic is required (e.g., the DFT and communications applications). Traditional logic states that the roots of the quadratic equation $x^2 = -1$ are defined over the complex field. That is, in the complex RNS (CRNS) the roots of $x^2 = -1$ are complex and defined in terms of the imaginary operator $j = \sqrt{-1}$. As a consequence, complex RNS numbers are defined by the two-tuple $Z = X + jY$; complex addition requires two real adds, and complex multiplication is defined by four real products, an addition, and a subtraction (albeit of short wordlength). This condition is radically altered in the quadratic RNS, or QRNS. The QRNS is based on known properties of Gaussian primes of the form $p = 4k + 1$, where $k$ is a positive integer. The importance of this choice of moduli is found in the factorization of the polynomial $x^2 + 1$ given by Gauss. For Gaussian primes, the roots of $x^2 = -1$ are no longer imaginary but rather two real roots, denoted $\hat{\jmath}$ and $-\hat{\jmath}$. Specifically, $\hat{\jmath}$ and $-\hat{\jmath}$ are real integers belonging to the residue class $\mathbb{Z}_p$. Converting an RNS complex number $a + jb$ into the QRNS is accomplished by applying the transform $f: \mathbb{Z}_p^2 \to \mathbb{Z}_p^2$ as follows:

$$f(a + jb) = \left( (a + \hat{\jmath} b) \bmod p,\; (a - \hat{\jmath} b) \bmod p \right) = (A, B) \tag{11}$$

In the QRNS, addition and multiplication are component-wise and are defined to be:

$$(a + jb) + (c + jd) \leftrightarrow (A + C,\; B + D) \bmod p \tag{12}$$

$$(a + jb)(c + jd) \leftrightarrow (AC,\; BD) \bmod p \tag{13}$$

In the QRNS domain, complex multiplication requires only two real multiplications, while conventional complex multiplication requires four real multiplications, a real addition, and a real subtraction. Finally, the conversion of a QRNS digit back into the RNS is defined by:

$$f^{-1}(A, B) = \left[ 2^{-1}(A + B) + j\,(2\hat{\jmath})^{-1}(A - B) \right] \bmod p \tag{14}$$

Figure 2 graphically interprets the mappings between the CRNS and QRNS.

Figure 2. CRNS ↔ QRNS conversion.
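The forward map, the component-wise product, and the inverse map can be sketched as below. The prime $p = 61$ (which has the form $4k + 1$), the brute-force root search, and the names `qrns_root`, `to_qrns`, `qrns_mul`, and `from_qrns` are illustrative assumptions; the three-argument `pow` with exponent $-1$ needs Python 3.8 or later.

```python
def qrns_root(p):
    """Return the smallest j_hat with j_hat^2 = -1 (mod p); exists for primes p = 4k + 1."""
    for r in range(2, p):
        if (r * r + 1) % p == 0:
            return r
    raise ValueError("x^2 = -1 has no root modulo p")


def to_qrns(a, b, p, jh):
    """Forward map: a + jb -> (A, B) = (a + j_hat*b, a - j_hat*b) mod p."""
    return ((a + jh * b) % p, (a - jh * b) % p)


def qrns_mul(u, v, p):
    """Complex multiplication in the QRNS: two independent real products."""
    return ((u[0] * v[0]) % p, (u[1] * v[1]) % p)


def from_qrns(A, B, p, jh):
    """Inverse map: real part 2^-1 (A + B), imaginary part (2 j_hat)^-1 (A - B)."""
    inv2 = pow(2, -1, p)
    inv2j = pow(2 * jh, -1, p)
    return ((inv2 * (A + B)) % p, (inv2j * (A - B)) % p)


p = 61
jh = qrns_root(p)
a, b, c, d = 7, 12, 30, 45
result = from_qrns(*qrns_mul(to_qrns(a, b, p, jh), to_qrns(c, d, p, jh), p), p, jh)
```

The result can be compared against the ordinary complex product reduced modulo $p$, that is $(ac - bd) \bmod p$ and $(ad + bc) \bmod p$.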
4. FPL RNS Implementation
In order to facilitate efficient RNS-centric FPGA designs, a collection of RNS macros was developed for a target Altera technology. For an Altera CPLD design, a VHDL description was used because it provided a flexible design environment that could be precisely controlled and optimized. For the VHDL approach, structural (i.e., component instantiation) and behavioral descriptions yielded similar results. The structural designs, however, produced synthesized results that were easier to post-optimize.
For both standard and index RNS arithmetic, the core element is a modular adder. Several modular adder designs are shown in Fig. 3 [12]. Using only LCs, the design of Fig. 3(a) is realized. The Altera FLEX CPLD contains a number of 2K-bit ROMs and/or RAMs (EABs) which can be configured as $2^{8} \times 8$, $2^{9} \times 4$, $2^{10} \times 2$, or $2^{11} \times 1$ tables and used for modulo $m_i$ correction, as shown in Fig. 3(b). Table 3 summarizes the re-designed 6-, 7-, and 8-bit modulo adders [13].
Although the ROMs shown in Fig. 3 support high-speed LUTs, the ROM itself produces a four-cycle pipeline delay. Furthermore, the number of on-chip ROMs is limited. ROMs, however, are required for
Figure 3. Modular addition with CPLD. (a) MPX-Add and MPX-Add-Pipe. (b) ROM-Pipe.

Table 3. Modulo adder complexity: FLEX10K 3 ns device.

        Pipeline   Bits
Type    stages     6             7             8
MPX     0          41.3 MSPS     46.5 MSPS     33.7 MSPS
                   27 LC         31 LC         35 LC
MPX     2          76.3 MSPS     62.5 MSPS     60.9 MSPS
                   16 LC         18 LC         20 LC
MPX     3          151.5 MSPS    138.9 MSPS    123.5 MSPS
                   27 LC         31 LC         35 LC
ROM     2          86.2 MSPS     86.2 MSPS     86.2 MSPS
                   7 LC          8 LC          9 LC
                   1 EAB         1 EAB         2 EAB
the scaling schemes. Compared to the pipelined design shown in Fig. 3(b), the multiplexed adder (MPX-Add) shown in Fig. 3(a) runs at a reduced speed even if a carry chain is added to each column. The pipelined design requires the same number of LCs as the un-pipelined version, but is expected to run about twice as fast. Maximum throughput occurs when the adders are implemented in two blocks (where each block has eight LCs for Altera FLEX 10K devices) within 6-bit pipelined channels.
Several other RNS basic building blocks are required to support RNS designs. The list includes a modulo adder for the index domain (i.e., a modulo multiplier), Zech MAC cells, and code converters (BIN→RNS and RNS→BIN) based on an ε-CRT algorithm. Altera VHDL software does not allow generic clauses. Therefore, gawk and C programs have been developed for automatic generation of the basic building blocks by specifying the desired blocks. With these library elements, standard and index RNS arithmetic systems can be designed.
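A behavioral sketch of the two modular adder styles may make the distinction concrete. It assumes, as an interpretation of Fig. 3, that the multiplexed adder forms both $x + y$ and $x + y - m$ and selects between them, while the ROM variant corrects a binary sum with a table addressed by the $(n+1)$-bit sum; the function names are hypothetical and pipelining is not modeled.

```python
def mpx_mod_add(x, y, m):
    """Multiplexer-style modular adder: two adders followed by a 2:1 select."""
    s = x + y            # first adder: plain binary sum
    t = s - m            # second adder: sum with the modulus subtracted
    return s if t < 0 else t   # multiplexer picks the in-range result


def build_correction_rom(m):
    """ROM addressed by the (n+1)-bit binary sum, returning the sum mod m."""
    return [s % m for s in range(2 * m - 1)]


def rom_mod_add(x, y, rom):
    """ROM-style modular adder: binary add, then one table look-up."""
    return rom[x + y]


m = 59                   # a 6-bit modulus from the Section 6 design example
rom = build_correction_rom(m)
```

For a 6-bit modulus the correction ROM needs a 7-bit address, which fits comfortably in one $2^{8} \times 8$ EAB configuration.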
5. FPGA Channelizer Implementation
A typical modern communication receiver is shown in Fig. 4. The received analog signal is mixed with a locally generated signal and bandpass filtered. In the process, the received wideband signal is split into quadrature channels (I and Q) that are digitized. The digital section of the receiver is called a channelizer or zero-IF demodulator. The channelizer maps RF (or near RF) directly to baseband. The commercial imperative is to reduce the complexity of the digital portion of the receiver to, ideally, a single chip. For mobile applications, a premium is also placed on power dissipation (active and standby).
Figure 4. IF incoherent receiver with sin/cos mixer.
Figure 5. Harris HSP43220 Hogenauer decimating filter.
The interface between the analog section and the channelizer is therefore based on a maximum data conversion rate and a power/complexity decision.
Converting a signal from RF, or near-RF, rates to baseband is a non-trivial problem. For many typical wireless communications problems, signal decimation rates on the order of $10^3$ or higher need to be achieved. The preferred design methodology is called a Hogenauer architecture [14]. An example of a Hogenauer-enabled channelizer is shown in Fig. 5 as the lowpass filters (LPF). The advantages of the Hogenauer architecture are:
• the preprocessor (called a Hogenauer filter) is a MAC-free multirate lowpass filter and, being MAC-free, is capable of running at a high real-time rate
• the postprocessor is a basic FIR "housekeeping" filter running at a low data rate.
The theoretical foundations of a Hogenauer channelizer are well understood but represent a significant FPGA design challenge. Arithmetic in a Hogenauer filter section must be exact and can often exceed 50-bit wordwidths. Large arithmetic wordwidths immediately create a barrier to FPGA implementation. The RNS, however, provides a mechanism for achieving exact high-precision MAC operations within independent small-wordlength channels. To appreciate the need for the RNS in this case, the mechanics of a Hogenauer channelizer will be briefly reviewed.
6. Hogenauer Filter
A Hogenauer filter or, as it is sometimes called, a cascade integrator comb (CIC) filter, has been proven to be capable of performing high decimation-rate channelization at high input data rates. Figure 6(a) illustrates a three-stage CIC filter consisting of a three-stage integrator, a sampling rate reduction by $R$ (decimation), and a three-stage comb filter. Notice that the only logic elements in the design are registers and adders (i.e., it is MAC-free).
Figure 6. CIC filter. (a) Each stage 26 bits. (b) Detailed design with base removal scaling (BRS).
The transfer function of an $S$-stage CIC system is given by:

$$H(z) = \left( \frac{1 - z^{-RD}}{1 - z^{-1}} \right)^{S} \tag{15}$$

The $S$ poles of the CIC filter are located at $z = 1$ (i.e., DC) and the zeros are distributed along the periphery of the unit circle, appearing with multiplicity $S$ on $2\pi/(RD)$ centers. The $S$ zeros at $z = 1$ are annihilated by the $S$ poles residing at the same location. The result is that the transfer function behaves as a classic $S$-stage moving average filter. The CIC filter's maximum gain occurs at DC (i.e., $z = 1$) and has a value of $B_{\text{grow}} = (RD)^{S}$, or $b = \log_2 B_{\text{grow}}$ in bits. This value can be substantial, as evidenced by the need for a 56-bit dynamic range in the Harris HSP43220 [15] channelizer shown in Fig. 5. Furthermore, it is fundamentally important that the CIC arithmetic be performed exactly, since the integrator section, during run-time, will constantly be incurring modulo $N$ overflows ($N$ is the CIC dynamic range). The comb filter section must compensate for the integrators' modulo $N$ overflows by unwrapping the result modulo $N$ an equal number of times. Any rounding or approximation in this process would be fatal. The Harris HSP43220, for example, uses an exact two's complement 56-bit code to satisfy this requirement. To illustrate, assume that the input wordwidth to the three-stage RNS CIC filter, shown in Fig. 6(a), is 8 bits. For $D = 2$, $R = 32$, or $DR = 2 \times 32 = 64$, an internal wordwidth of $W = 8 + 3 \log_2(64) = 26$ bits is needed to ensure that no run-time overflow will occur. The output wordwidth would normally be a value significantly less than $W$, say 10 bits. Hogenauer [14] noted that it is possible to design each stage of the CIC section to have just enough dynamic range to ensure an arithmetically correct outcome. Figure 7 shows a pruning architecture as suggested by Hogenauer. If the ratio of signal bandwidth to sampling frequency is, for instance, 1/32, then the aliasing suppression is 89.6 dB and the maximum passband attenuation is 0.17 dB [14, Tables 1 and 2]. These facts, along with a high-bandwidth requirement, motivate the use of the RNS to implement CIC filters with FPGAs.

Figure 7. CIC transfer function ($f_s$ is the sampling frequency at the input).
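The claim that exact modular arithmetic lets the comb section undo the integrator overflows can be checked with a short simulation. This is a behavioral sketch under stated assumptions: zero initial conditions, unsigned 8-bit test data, and hypothetical function names. It runs the three-stage, $R = 32$, $D = 2$ filter modulo $2^{26}$ and modulo each RNS modulus, and compares the results against a cascade of plain moving sums computed without any wraparound.

```python
from math import log2


def cic_decimate(x, R, D, S, modulus):
    """S-stage CIC decimator with every addition/subtraction taken modulo `modulus`."""
    for _ in range(S):                       # integrator section (input rate)
        acc, out = 0, []
        for v in x:
            acc = (acc + v) % modulus        # the accumulator is allowed to wrap
            out.append(acc)
        x = out
    x = x[R - 1::R]                          # sampling rate reduction by R
    for _ in range(S):                       # comb section (output rate), delay D
        x = [(v - (x[i - D] if i >= D else 0)) % modulus for i, v in enumerate(x)]
    return x


def moving_sum_reference(x, R, D, S):
    """Same filter as S cascaded length-RD moving sums, with unbounded integers."""
    L = R * D
    for _ in range(S):
        x = [sum(x[max(0, n - L + 1):n + 1]) for n in range(len(x))]
    return x[R - 1::R]


R, D, S, B_IN = 32, 2, 3, 8
W = B_IN + S * int(log2(R * D))              # internal wordwidth: 8 + 3*6 = 26 bits
x = [(37 * n * n + 11 * n) % 256 for n in range(512)]   # arbitrary 8-bit test data

reference = moving_sum_reference(x, R, D, S)
binary_26 = cic_decimate(x, R, D, S, 2 ** W)
rns_channels = [cic_decimate(x, R, D, S, m) for m in (256, 63, 61, 59)]
```

The integrators overflow many times during the run, yet the final outputs agree because every operation is a ring operation modulo the chosen modulus.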
A pipelined FPGA integrator section needs the same number of LCs as an un-pipelined version, and would run about twice as fast. Maximum throughput occurs when the adders are implemented in two blocks (where each block contains 8 LCs for Altera FLEX 10K devices), within six-bit pipelined channels. One additional pipeline delay, for the modulo adder, corresponds to a non-recursive transfer function $A(z) = z^{-2}$, which introduces no significant processing problem. The accumulator, however, is recursive, and an additional delay introduces a second pole at one half the sampling frequency (i.e., $\omega = \pi$ [16, Fig. 1]). Because the transfer function of the pipelined accumulator satisfies $F(z) = z^{-2}/(1 - z^{-2})$, the pole at $\omega = \pi$ can be compensated for by a (modulo $m_i$) comb filter with a delay of one (i.e., $G(z) = (1 + z^{-1}) z^{-2}$). The integrator section, with pole compensation, then becomes $F(z) G(z) = z^{-4}/(1 - z^{-1})$, as desired. In a high-decimation CIC application, it can be assumed that an anti-aliasing filter provides sufficient suppression of signal components near $\omega = \pi$. A second passband located at $\omega = \pi$ is introduced by the recursive pipelined accumulator but introduces no additional aliasing. The six-bit wide pipelined accumulators can then be developed without pole compensation.

Figure 8. BRS and ε-CRT conversion steps.
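The pole-compensation identity $F(z) G(z) = z^{-4}/(1 - z^{-1})$, with $F(z) = z^{-2}/(1 - z^{-2})$ and $G(z) = (1 + z^{-1}) z^{-2}$, can be checked in the time domain. The sketch below uses plain integers and zero initial conditions; the function names are assumptions made for this illustration.

```python
def pipelined_accumulator(x):
    """F(z) = z^-2 / (1 - z^-2):  y[n] = x[n-2] + y[n-2]."""
    y = []
    for n in range(len(x)):
        x_del = x[n - 2] if n >= 2 else 0
        y_del = y[n - 2] if n >= 2 else 0
        y.append(x_del + y_del)
    return y


def pole_compensator(y):
    """G(z) = (1 + z^-1) z^-2:  w[n] = y[n-2] + y[n-3]."""
    return [(y[n - 2] if n >= 2 else 0) + (y[n - 3] if n >= 3 else 0)
            for n in range(len(y))]


def delayed_accumulator(x):
    """z^-4 / (1 - z^-1):  w[n] = x[n-4] + w[n-1]."""
    w, acc = [], 0
    for n in range(len(x)):
        acc += x[n - 4] if n >= 4 else 0
        w.append(acc)
    return w


x = [(5 * n * n + 3 * n + 1) % 17 - 8 for n in range(64)]   # arbitrary signed test data
compensated = pole_compensator(pipelined_accumulator(x))
ideal = delayed_accumulator(x)
impulse_response = pipelined_accumulator([1] + [0] * 7)
```

The impulse response of the pipelined accumulator alone alternates between 0 and 1, which is the time-domain signature of the extra pole at $\omega = \pi$.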
As a design example, consider a three-stage CIC filter having an 8-bit input, a 10-bit output, $D = 2$, and $R = 32$. The required maximal dynamic range is 26 bits. For the RNS implementation, a four-modulus system is chosen, consisting of the relatively prime moduli (256, 63, 61, 59) (i.e., one 8-bit two's complement (TC) modulus and three 6-bit moduli). The output scaling of the RNS system is implemented using the ε-CRT at a cost of 8 tables and 3 TC adders [17, Fig. 1], or (as shown in Fig. 8) with a base removal scaling (BRS) algorithm based on two 6-bit moduli (which occur in the same fashion in the mixed radix conversion scheme [18]) and an ε-CRT for the remaining 2 moduli. This approach uses a total of 5 modulo adders and 9 ROM tables, or 7 tables if the multiplicative inverse ROM and the ε-CRT are combined. The following table shows the speed in MSPS and the LCs and EABs used for the three scaling schemes.

                                BRS-ε-CRT               BRS-ε-CRT
                                (speed data for         (combined
Type            ε-CRT           BRS m_4 only)           ROM)
MSPS            58.8            70.4                    58.8
#LC             34              87                      87
#Table (EAB)    8               9                       7
The decrease in speed to 58.8 MSPS for scaling schemes #1 and #3 is caused by the fact that a 10-bit ε-CRT table address must be placed in different FPGA rows (each row has only one EAB). This, however, introduces no system speed decrease because the scaling is applied at the lower (output) sampling rate. For the BRS-ε-CRT, it is assumed that only the BRS $m_4$ part (see Fig. 8) must run at the input sampling rate, while the BRS $m_3$ and ε-CRT parts run at the output sampling rate. Some additional resources can be saved based on the architecture presented in Fig. 6(b). Here the BRS-ε-CRT is used to reduce the bit-width found in earlier filter sections. The early use of ROMs decreases the possible throughput from 76.3 to 70.4 MSPS, which is the maximum speed of the BRS with $m_4$. At the output, the efficient ε-CRT scheme was employed. The following table summarizes the three implemented filter realizations without including the scaling data.

        TC          RNS               Detailed bit-width
Type    26 bit      8, 6, 6, 6 bit    RNS design
MSPS    49.3        76.3              70.4
#LC     343         559               355
6.1. Modulation and Postprocessing
Referring to Fig. 4, it can be seen that digital modulators exist to the left of the channelizer. A high-speed ADC unit, operating at or near RF frequencies, resides at the analog-digital domain boundary. For sampling rates ≥ 100 MHz, precision is practically limited to 12 bits or less. The output of the ADC can be either binary, or directly mapped into standard or index RNS $L$-tuples. The product modulators can be implemented using a standard or index RNS multiplier. The difference would be that the standard RNS multiplier would require a multiplicative LUT, whereas the index multiplier is simply a modulo $(p_i - 1)$ adder. All these options can be implemented with an FPGA to varying degrees of acceptability. A fast two's complement 12 × 12-bit multiplier can be built using 9 EABs or 328 LCs, and would run at 69 MHz. A comparable index RNS multiplier, based on three 7-bit moduli, would come in at 2 EABs or 260 LCs per modulus, for a total of 6 EABs or 780 LCs, and would run at an 86 MHz rate. This points to an important observation, supported by numerous design studies, which states that for low-resolution cases the RNS benefit is marginal. An RNS advantage is rapidly gained for high-end, high-precision applications (e.g., a CIC filter). In the case under study, due to the assumed short wordlength of the digitized data, it may be preferable to use a traditional two's complement digital mixer and then map the output into the standard RNS for CIC processing. If index RNS is used, data would need to be converted to standard RNS before being CIC processed in the manner developed in the previous section. A standard RNS mixer design is also possible using the quarter-square algorithm.
The channelizer output is a baseband signal sampled
at a highly decimated (low) rate. The
channelizer output can be taken directly from the CIC
section or from a post-processing FIR. The magnitude
frequency response of the CIC section is that of an
S-stage moving average filter (i.e., sin(x)/x). A low data
rate FIR can be used to shape the CIC baseband, which
resides between DC and the first null of the Hogenauer
filter (i.e., $f_{\text{sample}}/(RD)$). The implementation of an FIR
in the RNS is well understood. If implemented using
the standard RNS, data can be accepted directly from
the CIC section, filtered, and presented to a back-end
communications processor. The implementation of an
index RNS FIR is discussed, for instance, in [19]. This
model assumes that the CIC section is implemented
using the index RNS. The advantage of standard or
index RNS arithmetic over a 2's complement
implementation of an FIR is well established. Again, this
advantage increases geometrically with arithmetic
precision for comparable real-time bandwidths.

Figure 9. Cascading of frequency sampling filter to save a factor of R delays for multirate signal processing [20, Sec. 3.4].
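As a small numerical illustration of the CIC magnitude response described above, the following Python sketch evaluates the response of an S-stage moving average of length RD and checks that its first null falls at $f_{\text{sample}}/(RD)$. The parameter values (R = 32, D = 2, S = 3, normalized sample rate) are assumptions chosen for illustration only and are not the values of the channelizer design.

```python
import math


def cic_magnitude(f, fs, R, D, S):
    """Magnitude response of an S-stage CIC filter at frequency f.

    fs is the input sample rate, R the rate change factor and D the
    differential (comb) delay, so each stage acts as a moving average
    over R*D input samples.
    """
    x = math.pi * f / fs
    den = math.sin(x)
    if abs(den) < 1e-12:           # f is a multiple of fs: limit value is R*D
        return float(R * D) ** S
    return abs(math.sin(R * D * x) / den) ** S


# Hypothetical parameters, for illustration only.
FS, R, D, S = 1.0, 32, 2, 3
first_null = FS / (R * D)
dc_gain = cic_magnitude(0.0, FS, R, D, S)
null_gain = cic_magnitude(first_null, FS, R, D, S)

if __name__ == "__main__":
    print("DC gain:", dc_gain)
    print("gain at first null:", null_gain)
```

The DC gain of $(RD)^S$ is also what determines the internal word growth of the CIC section, which is why the post-processing FIR only needs to shape the band between DC and the first null.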
Finally, it is noted that the entire system can be
implemented using the QRNS as developed in Section 4. The
QRNS implements complex arithmetic using a minimal
amount of real arithmetic. The channelizer presented in
Fig. 4 divides the received signal into I and Q channels,
using separate sine and cosine modulators. This
operation can be replaced with a complex exponential that
can, in turn, be directly implemented in a minimally
complex QRNS system, with the individual modular
operations defined as standard or index RNS calls. The
channelizer following the I and Q modulators can also
be implemented in the QRNS, resulting in an end-to-end
QRNS solution.
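To make the QRNS claim concrete, the following Python sketch shows a complex multiplication carried out with only two independent modular multiplications. Section 4 is not reproduced here, so the sketch follows the usual textbook QRNS construction and should be read as an assumption about the details: the modulus $p$ is a prime with $p \equiv 1 \pmod 4$, $\hat{\jmath}$ is a square root of $-1$ modulo $p$, and $a + jb$ is mapped to the pair $(a + \hat{\jmath}b,\ a - \hat{\jmath}b)$. The modulus 13 and the function names are illustrative.

```python
def find_sqrt_minus_one(p):
    """Smallest j with j*j = -1 (mod p); exists when p is prime and p % 4 == 1."""
    for j in range(2, p):
        if (j * j) % p == p - 1:
            return j
    raise ValueError("no square root of -1 modulo %d" % p)


def to_qrns(a, b, p, j):
    """Map the complex residue a + jb to its QRNS pair (z, z*)."""
    return ((a + j * b) % p, (a - j * b) % p)


def from_qrns(pair, p, j):
    """Recover the real and imaginary residues from a QRNS pair."""
    z, zc = pair
    inv2 = pow(2, -1, p)
    invj = pow(j, -1, p)
    a = ((z + zc) * inv2) % p
    b = ((z - zc) * inv2 * invj) % p
    return a, b


def qrns_mul(u, v, p):
    """Complex multiplication in the QRNS: two independent modular products."""
    return ((u[0] * v[0]) % p, (u[1] * v[1]) % p)


P = 13                              # hypothetical modulus, 13 % 4 == 1
J = find_sqrt_minus_one(P)

if __name__ == "__main__":
    product = qrns_mul(to_qrns(3, 2, P, J), to_qrns(1, 4, P, J), P)
    print(from_qrns(product, P, J))   # residues of (3 + 2j)(1 + 4j) modulo 13
```

A conventional complex multiplier needs four real multiplications and two additions; in this representation the two channels never exchange data, which matches the channelized structure of the RNS.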
7. Frequency Sampling Filter
The CIC filters discussed in the last section belong
to a larger class of systems called frequency sampling
filters. These filters can be used with channelizers to
decompose the information spectrum into discrete bands.
This is essential in many multi-user communication
system applications. A classical frequency sampling
filter (FSF) consists of a comb filter cascaded with a
bank of frequency selective resonators [20, 21]. The
resonators independently produce a collection of poles
that annihilate the zeros produced by the comb
prefilter. Gain adjustments are applied to the output of the
resonators so as to approximately profile the
magnitude frequency response of a desired filter. An FSF can
also be created by cascading all-pole filter sections with
all-zero filter (comb) sections as suggested in Fig. 9.
The delay of the comb section $1 - z^{-D}$ is chosen so
that its zeros cancel the poles of the all-pole prefilter
as shown in Fig. 10. It can be observed that wherever
there is a complex pole, there also exists an
annihilating complex zero, which results in an all-zero FIR
with the usual linear phase and constant group delay
properties.

Figure 10. Example of pole/zero compensation for a pole angle of 60° and comb delay D = 6.
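The pole/zero annihilation of the Fig. 10 example (pole angle 60°, comb delay D = 6) can be checked with a few lines of Python. The sketch below is only an illustration under these two stated values: it drives the cascade of the comb $1 - z^{-6}$ and the second-order resonator $1/(1 - z^{-1} + z^{-2})$ with a unit impulse, using integer arithmetic, and returns the resulting impulse response. The function name is hypothetical.

```python
def fsf_impulse_response(n_samples=16, comb_delay=6):
    """Impulse response of (1 - z^-D) / (1 - z^-1 + z^-2) in integer arithmetic."""
    x = [1] + [0] * (n_samples - 1)          # unit impulse
    y = []
    for n in range(n_samples):
        # Comb section: c[n] = x[n] - x[n - D]
        comb = x[n] - (x[n - comb_delay] if n >= comb_delay else 0)
        # Resonator with poles at +/-60 degrees: y[n] = c[n] + y[n-1] - y[n-2]
        y1 = y[n - 1] if n >= 1 else 0
        y2 = y[n - 2] if n >= 2 else 0
        y.append(comb + y1 - y2)
    return y


if __name__ == "__main__":
    print(fsf_impulse_response())
```

The response is nonzero only for the first five samples and is antisymmetric about its center, i.e., the recursive structure behaves as a short linear-phase FIR once the poles have been cancelled.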
Frequency sampling filters are of interest to
designers of multi-rate filter banks due, in part, to their
intrinsic low complexity and linear phase behavior. FSF
designs rely on exact pole-zero annihilation and are often
found in embedded applications. Exact FSF pole-zero
annihilation can be guaranteed by using polynomial
filters defined over an integer ring in the residue number
system (RNS).
The poles of the FSF filter developed in this
paper reside on the periphery of the unit circle. This is
in contrast with the customary practice of forcing the
poles and zeros to reside at interior locations to guard
against possible inexact pole-zero cancellation. It will
be shown that stability is not an issue if the FSF is
implemented using RNS. In addition, by allowing the FSF
poles and zeros to reside on the unit circle, a
multiplierless FSF can be realized with an attendant reduction in
complexity and an increase in data bandwidth.
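The stability argument can be illustrated with a short Python sketch. It runs the all-pole section $1/(1 - z^{-1} + z^{-2})$ followed by the comb $1 - z^{-6}$ entirely modulo a single modulus and compares the result with the equivalent five-tap FIR computed in ordinary integers. The modulus 1021, the input range of ±100 and the function names are assumptions made for this illustration; the only requirement used is that the final FIR output fits in the signed range of the modulus.

```python
import random

M = 1021                       # hypothetical single modulus (signed range -510..510)


def fsf_mod(x, m):
    """Resonator 1/(1 - z^-1 + z^-2) followed by comb 1 - z^-6, all modulo m."""
    w1, w2 = 0, 0              # resonator state w[n-1], w[n-2]
    delay = [0] * 6            # comb delay line holding w[n-1] .. w[n-6]
    out = []
    for xn in x:
        wn = (xn + w1 - w2) % m            # internal state may wrap around
        out.append((wn - delay[-1]) % m)   # comb output w[n] - w[n-6]
        delay = [wn] + delay[:-1]
        w1, w2 = wn, w1
    return out


def fir_direct(x):
    """Equivalent nonrecursive filter with taps 1, 1, 0, -1, -1."""
    h = [1, 1, 0, -1, -1]
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def to_signed(v, m):
    """Map a residue in 0..m-1 to the signed range around zero."""
    return v - m if v > m // 2 else v


random.seed(0)
samples = [random.randint(-100, 100) for _ in range(2000)]
recursive = [to_signed(v, M) for v in fsf_mod(samples, M)]
reference = fir_direct(samples)

if __name__ == "__main__":
    print(recursive == reference)
```

Because every operation is exact in the integer ring, wrap-around inside the marginally stable resonator does not corrupt the output; only the dynamic range of the overall FIR response has to be covered by the modulus.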
To motivate this discussion, consider the filter shown
in Fig. 9. It can be argued that first-order filter sections
produce poles at angles 0° and 180°. Second-order
sections with integer coefficients can produce poles
at angles 60°, 90°, and 120° according to the relationship
$2\cos(2\pi K/D) = 1$, $0$, and $-1$. For sections of higher
order, filter frequency selectivity options are shown in
Table 4. Here the angular frequencies resulting from a
complete search are reported for all polynomials up to
order six having integer coefficients with roots on the
unit circle. It will be shown that the building blocks
listed in Table 4 can be used to efficiently design and
implement FSF filters with integer coefficients having
poles residing on the periphery of the unit circle.
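The building blocks of Table 4 can be regenerated with a short Python sketch. It rests on an observation that is not stated explicitly in the text: monic integer-coefficient polynomials whose roots all lie on the unit circle are products of cyclotomic polynomials, and the subscript of each $C_k(z)$ matches the cyclotomic index $k$. Under that assumption, the coefficients follow from dividing $z^k - 1$ by the lower-order factors, and the pole angles are $360^\circ \cdot m/k$ for $m$ coprime to $k$. The function names are illustrative.

```python
from math import gcd


def poly_divide(num, den):
    """Exact quotient of integer polynomials (highest power first, den monic)."""
    num = list(num)
    quotient = []
    for i in range(len(num) - len(den) + 1):
        c = num[i]
        quotient.append(c)
        for k, d in enumerate(den):
            num[i + k] -= c * d
    if any(num):
        raise ValueError("division is not exact")
    return quotient


def cyclotomic(k):
    """Coefficients of the k-th cyclotomic polynomial, highest power first."""
    poly = [1] + [0] * (k - 1) + [-1]        # z^k - 1
    for d in range(1, k):
        if k % d == 0:
            poly = poly_divide(poly, cyclotomic(d))
    return poly


def pole_angles(k):
    """Non-redundant root angles in degrees (upper half of the unit circle)."""
    return [360.0 * m / k for m in range(0, k // 2 + 1) if gcd(m, k) == 1]


if __name__ == "__main__":
    for k in (1, 2, 6, 4, 3, 12, 10, 8, 5, 18, 14, 7, 9):
        print(k, cyclotomic(k), [round(a, 2) for a in pole_angles(k)])
```

All coefficients produced this way for orders up to six are 0 or ±1, which is what makes the corresponding filter sections multiplier-free.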
Table 4. Filters with integer coefficients producing unique angular pole locations up to order six. Shown
are the filter coefficients and non-redundant angular locations of the roots on the unit circle.

C_k(z)     Order   a_0  a_1  a_2  a_3  a_4  a_5  a_6   Angle 1   Angle 2   Angle 3
C_1(z)       1      1   -1                              0°
C_2(z)       1      1    1                              180°
C_6(z)       2      1   -1    1                         60°
C_4(z)       2      1    0    1                         90°
C_3(z)       2      1    1    1                         120°
C_12(z)      4      1    0   -1    0    1               30°       150°
C_10(z)      4      1   -1    1   -1    1               36°       108°
C_8(z)       4      1    0    0    0    1               45°       135°
C_5(z)       4      1    1    1    1    1               72°       144°
C_18(z)      6      1    0    0   -1    0    0    1     20.00°    100.00°   140.00°
C_14(z)      6      1   -1    1   -1    1   -1    1     25.71°    77.14°    128.57°
C_7(z)       6      1    1    1    1    1    1    1     51.42°    102.86°   154.29°
C_9(z)       6      1    0    0    1    0    0    1     40.00°    80.00°    160.00°

Table 5. Number of used CLBs of Xilinx XC4000 FPGAs (notation: F20D90 means filter pole angle 20.00°,
comb delay D = 90). Total: actual 1572 CLBs; nonrecursive FIR: 11292 CLBs.

              F20D90  F25D70  F36D60  F51D49  F72D40  F90D40  F120D33  F180D14  HB6  III  D4  D5
Theory           122     184     128     164     124      65       86       35  122   31  24  24
Practice         160     271     190     240     190      93      120       53  153   36  33  33
Nonrec. FIR     2256    1836    1924    1140    1039    1287     1260      550

As a design example, an RNS single-modulus filter
bank was developed that covers the frequency range
from 900 to 8000 Hz [22, 23] using a 16 kHz sampling
frequency. The filter bank can, for instance, be used to
implement an adaptive multi-tone receiver.
An integer-coefficient half-band anti-aliasing filter
HB6 [24] and a third-order multiplier-free CIC filter
(a.k.a. Hogenauer filter [14]) were added to the design
to suppress unwanted frequency components, as shown
in Fig. 11. The bandwidth of each resonator can be
independently tuned by the number of stages and delays
in the comb section, where the number of stages and
delays are optimized to meet the desired bandwidth
requirements. All frequency selective filters have two
stages and delays.

Figure 11. Design of a filter bank consisting of a half-band and
CIC prefilter and FSF comb-resonator sections.
The filter bank was prototyped using a Xilinx
XC4000 FPGA with the complexity reported in
Table 5. Using high-level design tools, the number of
used CLBs was typically 20% more than the
theoretical prediction obtained by counting adders, flip-flops,
ROMs and RAMs.
An FSF can be adapted to the signal properties by
changing the comb delay, the channel amplitude, and/or the
number of sections. For instance, the adaptation of
the comb delay can easily be achieved because the
CLBs are used as 32 × 1 memory cells, and a counter
realizes the specific comb delay with the CLB memory
cell.
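The memory-plus-counter arrangement can be modelled in software as a circular buffer whose address counter wraps at the desired delay. The Python sketch below is a behavioral illustration only: the class name, the default depth of 32 and the word-wide storage are assumptions (in the FPGA each 32 × 1 cell stores one bit, so a word-wide delay would use one such cell per bit).

```python
class RamCombDelay:
    """Comb section y[n] = x[n] - x[n - delay] built from a RAM and a counter."""

    def __init__(self, delay, depth=32):
        if not 1 <= delay <= depth:
            raise ValueError("delay must be between 1 and the RAM depth")
        self.ram = [0] * depth      # models the CLB memory cells
        self.delay = delay          # programmable comb delay
        self.addr = 0               # modulo-delay address counter

    def step(self, x):
        delayed = self.ram[self.addr]          # sample written `delay` steps ago
        self.ram[self.addr] = x                # overwrite with the new sample
        self.addr = (self.addr + 1) % self.delay
        return x - delayed


if __name__ == "__main__":
    comb = RamCombDelay(delay=6)
    print([comb.step(v) for v in [1] + [0] * 9])
```

Changing the comb delay then amounts to loading a different wrap value into the counter, without touching the data path.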
8. Conclusion
The RNS was shown to be an enabling technology for
high-end DSP applications implemented with FPGA
devices. The RNS was shown to have several forms
that distribute arithmetic across a number of
independent, non-communicating small-wordlength channels.
As a result, the RNS is completely synergistic with a
typical FPGA architecture. Using collections of logic
and small tables, RNS primitives were added to FPGAs.
These capabilities were used to implement a Hogenauer
CIC filter that was able to achieve both the speed
and the dynamic range required of this
high-end communication system. Specifically, the RNS
brought to the design speed, compactness, and
exactness. All are required of a modern embedded
communication system. Compared to a two's complement
design, the RNS-enabled CIC was 54% faster. Using
a BRS-ε-CRT scaling scheme, a CIC was also
developed that produced a fixed-point output and was 43%
faster than the two's complement design. The outcome
is a new opportunity to develop embedded high-end
communication ASIC systems using FPGAs.
Acknowledgments
The authors would like to thank Altera and Xilinx for
their support under the university programs. A. García
was supported by the Dirección General de Enseñanza
Superior (Spain) under project PB96-1397. The
authors would also like to thank all the students who
contributed to this project. Special thanks to O. Six [22],
S. Dworak [23], J. Buros [25], M. Rösch [26] and
W. Trautmann [27].
References
1. M. Soderstrand, W. Jenkins, G. Jullien, and F. Taylor, Residue
Number System Arithmetic: Modern Applications in Digital
Signal Processing, IEEE Press Reprint Series, IEEE Press,
1986.
2. S. White, "Applications of Distributed Arithmetic to Digital
Signal Processing: A Tutorial Review," IEEE Transactions
on Acoustics, Speech and Signal Processing Magazine, 1989,
pp. 4–19.
3. A. Dempster and M. Macleod, "Use of Minimum-Adder Multiplier
Blocks in FIR Digital Filters," IEEE Transactions on Circuits
and Systems II, vol. 42, 1995, pp. 569–577.
4. Altera Corporation, Data sheet, FLEX 10K CPLD Family,
1996.
5. R. Hartenstein, J. Becker, and R. Kress, "Custom Computing
Machines vs. Hardware/Software Co-Design: From a Globalized
Point of View," in Lecture Notes in Computer Science,
vol. 1142, 1996, pp. 65–76.
6. J. Rosenberg, "DSP Acceleration Using Reconfigurable Coprocessor
FPGA," Atmel Application Note #0724A, 1997.
7. U. Meyer-Bäse, A. Meyer-Bäse, J. Mellott, and F. Taylor, "A
Fast Modified CORDIC Implementation of Radial Basis Neural
Networks," Journal of VLSI Signal Processing, vol. 20, 1998,
pp. 211–218.
8. V. Hamann and M. Sprachmann, "Fast Residual Arithmetics
with FPGAs," in Proceedings of the Workshop on Design
Methodologies for Microelectronics, 1995, pp. 253–255.
9. N. Szabo and R. Tanaka, Residue Arithmetic and its Applications
to Computer Technology, McGraw-Hill, 1967.
10. J. Conway, "A Tabulation of Some Information Concerning
Finite Fields," in Computers in Mathematical Research,
R. Churchhouse and J. Herz (Eds.), North-Holland, 1968.
11. G. Zelniker and F. Taylor, "A Reduced-Complexity Finite Field
ALU," IEEE Transactions on Circuits and Systems, vol. 38,
no. 12, 1991, pp. 1571–1573.
12. M. Bayoumi, G. Jullien, and W. Miller, "A VLSI Implementation
of Residue Adders," IEEE Transactions on Circuits and Systems,
vol. 34, no. 3, 1987, pp. 284–288.
13. A. García, U. Meyer-Bäse, and F. Taylor, "Pipelined Hogenauer
CIC Filters using Field-Programmable Logic and Residue Number
System," in IEEE International Conference on Acoustics,
Speech, and Signal Processing, vol. 5, 1998, pp. 3085–3088.
14. E.B. Hogenauer, "An Economical Class of Digital Filters for
Decimation and Interpolation," IEEE Transactions on Acoustics,
Speech and Signal Processing, vol. 29, no. 2, 1981,
pp. 155–162.
15. Harris Semiconductor, Data sheet, HSP43220 Decimating
Digital Filter, 1992.
16. U. Meyer-Bäse, J. Mellott, and F. Taylor, "Design of RNS
Frequency Sampling Filter Banks," in IEEE International
Conference on Acoustics, Speech, and Signal Processing, vol. 3, 1997,
pp. 2061–2064.
17. M. Griffin, M. Sousa, and F. Taylor, "Efficient Scaling in the
Residue Number System," in IEEE International Conference on
Acoustics, Speech, and Signal Processing, 1989, pp. 1075–1078.
18. G. Jullien, "Residue Number Scaling and Other Operations
Using ROM Arrays," IEEE Transactions on Communications,
vol. 27, 1978, pp. 325–336.
19. U. Meyer-Bäse and F. Taylor, "High-speed Wavelet Implementation
with Field-Programmable Logic," in Aerosense 99, SPIE,
Orlando, 1999, pp. 250–261.
20. U. Meyer-Bäse, The Use of Complex Algorithm in the Realization
of Universal Sampling Receiver using FPGAs, VDI press,
Serie 10, no. 404, 1995 (in German).
21. F. Taylor, Digital Filter Design Handbook, Marcel Dekker, 1983.
22. O. Six, "Design and Implementation of a Xilinx universal
XC-4000 FPGAs board," Master's Thesis, Institute for Data
Technics, Darmstadt University of Technology, 1996.
23. S. Dworak, "Design and Realization of a new Class of Frequency
Sampling Filters for Speech Processing using FPGAs,"
Master's Thesis, Institute for Data Technics, Darmstadt
University of Technology, 1996.
24. D.J. Goodman and M.J. Carey, "Nine Digital Filters for Decimation
and Interpolation," IEEE Transactions on Acoustics, Speech
and Signal Processing, vol. ASSP-25, no. 2, 1977, pp. 121–126.
25. J. Buros, "Conception and Design of Wavelet Processor in
VHDL-FPL technic," Master's Thesis, University of Florida,
Gainesville, 1998.
26. M. Rösch, "Fast Methods for FIR Filtering," Master's Thesis,
University of Florida, Gainesville, 1998.
27. W. Trautmann, "RNS Wavelet Processor Built in FPGA Technology,"
Master's Thesis, University of Florida, Gainesville, 1998.
Uwe Meyer-Bäse received his BSEE, MSEE, and Ph.D. Summa
cum Laude from the Darmstadt University of Technology in 1987,
1989, and 1995, respectively. In 1994 and 1995 he held a post-doc
position in the Inst. of Brain Research in Magdeburg. In 1996 and
1997 he was a Visiting Professor at the University of Gainesville,
FL. From 1998 to 2000 Dr. Meyer-Baese was a research scientist
for ASIC Technologies for the Athena Group, Inc., where he was
responsible for development of high performance architectures for
digital signal processing. He is now a Professor at the FAMU-FSU
College of Engineering in Tallahassee, Florida. During his graduate
studies he worked part time for TEMIC, Siemens, Bosch, and
Blaupunkt. He holds 3 patents, has supervised more than 60 master's
thesis projects in the DSP/FPGA area, and gave four lectures at the
University of Darmstadt in the DSP/FPGA area. In 1997 he received
the Max-Kade Award in Neuroengineering. Dr. Meyer-Baese is an
IEEE, BME, SP and C&S society member.
uwe.meyer-baese@ieee.org
Antonio García received the M.A.Sc. degree in Electronic
Engineering (obtaining the Nation Best Record Award) in 1995, the M.Sc.
degree in Physics (majoring in Electronics) in 1997 and the Ph.D.
degree in Electronic Engineering in 1999, all from the University of
Granada (Spain). From 1999 to 2000 he was an Associate Professor
in the Department of Electronics and Computer Technology at the
University of Granada. He is now an Associate Professor with the
Department of Computer Engineering at the Universidad Autónoma
de Madrid. His research interests include Residue Number System
arithmetic, the application of RNS to high-performance digital signal
processing, and VLSI and FPL implementation of RNS-based
systems. He is a member of IEEE.
agarcia@ieee.org
Fred J. Taylor received his Ph.D. from the University of
Colorado in 1969. Since then he has held professional positions at
Texas Instruments and the University of Texas at El Paso, Cincinnati,
and Florida, where he is currently a Professor of Electrical and
Computer Engineering and Computer and Information Science, along
with being president of the Athena Group, Inc. He has authored
over 100 archived papers and nine books, contributed chapters to four
monographs and encyclopedias, and holds four U.S. patents. His
professional interests include digital design and architecture, digital
signal processing, and engineering education.
fjt@hsdal.ufl.edu
