Distributed Arithmetic (Da)

CHAPTER 4
DISTRIBUTED ARITHMETIC (DA)
4.1 Introduction
Distributed Arithmetic (DA) is so named because the arithmetic operations that
appear in signal processing (e.g., addition, multiplication) are not “lumped” as a
single familiar functional unit but are instead distributed in an often
unrecognizable fashion. The most frequently encountered form of computation in
digital signal processing is a sum of products (in vector analysis parlance,
dot-product or inner-product generation). This is also the computation that DA
executes most efficiently. The motivation for using DA is its computational
efficiency. Its advantages are best exploited in circuit design, but off-the-shelf
hardware can often be configured effectively to perform DA. With careful design, one
may reduce the total gate count in a signal processing arithmetic unit by a figure
seldom smaller than 50 percent and often as large as 80 percent.
DA is basically (but not necessarily) a bit-serial computational operation that
forms an inner (dot) product of a pair of vectors in a single direct step. The
advantage of DA is its efficiency of mechanization. Distributed Arithmetic is
commonly used where computing the inner product of two vectors makes up most of the
computational workload.
This kind of computing profile describes a large portion of signal processing
algorithms, so the potential use of Distributed Arithmetic is enormous. The inner
product is commonly computed using multipliers and adders. When computed
sequentially, the multiplication of two B-bit numbers requires B/2 to B additions
and is time intensive. Alternatively, the multiplication can be computed in parallel
using B/2 to B adders, but this is area intensive (K. Hwang 1979, D. L. Jones 1993:
1077–1086). Whether a K-tap filter is computed serially or in parallel, it requires
at least B/2 additions per multiplication plus K - 1 additions for summing the
products together. In the best case, K(B + 2)/2 - 1 additions are therefore required
for a K-tap filter using multipliers and adders. A competitive alternative to using
a multiplier is Distributed Arithmetic. It compresses the computation of a K-tap
filter from K multiplications and K - 1 additions into a memory table and produces a
result in B bit-times, using B - 1 additions. DA significantly reduces the number of
additions required for filtering (A. Peled, B. Liu 1974: 456–462, S. A. White 1989:
4–19). This reduction is particularly noticeable for filters with high bit
precision. The reduction in computational workload is a result of storing the
pre-computed partial sums of the filter coefficients in the memory table (D. L.
Jones 1993: 1077–1086). Compared with the alternatives, Distributed Arithmetic
requires fewer arithmetic computing resources and no multipliers, which is its main
attraction.
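The operation counts quoted above can be tabulated with a short sketch. It only evaluates the two expressions from the text, K(B + 2)/2 - 1 for the best-case multiplier-based filter and B - 1 for DA; the function names are illustrative and do not come from the cited sources.

```python
def mac_additions_best_case(k: int, b: int) -> float:
    """Best-case additions for a K-tap filter built from multipliers and adders.

    Each B-bit multiplication costs at least B/2 additions, and K - 1 further
    additions sum the products: K*B/2 + (K - 1) = K*(B + 2)/2 - 1.
    """
    return k * (b + 2) / 2 - 1


def da_additions(b: int) -> int:
    """Additions used by bit-serial DA for one output, as stated in the text."""
    return b - 1


if __name__ == "__main__":
    # (K taps, B bits) pairs chosen only for illustration.
    for k, b in [(4, 8), (16, 8), (16, 16)]:
        print(f"K={k:2d} B={b:2d}  multiplier-based >= "
              f"{mac_additions_best_case(k, b):6.1f}  DA = {da_additions(b)}")
```

Note that the DA count does not depend on K: the number of taps affects the size of the memory table, not the number of additions.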
When computational resources are limited, particularly multipliers, Distributed
Arithmetic is used in lieu of the typical multiplier-based filtering structures.
Distributed Arithmetic is one of the most powerful tools for computing the inner
product of two vectors, one of which is constant, i.e. it consists of constant
values. DA exploits the LUT-based computation provided by Field-Programmable Gate
Arrays (FPGAs) by storing in a LUT all the possible results for a set of variable
combinations. Computation at run time consists only of retrieving the results from
the LUT, where they were previously stored. The elements of the variable vector are
used to address the look-up table and retrieve partial sums in a bit-serial manner.
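As a minimal sketch of the idea, assuming integer constants and the convention that address bit m selects constant m (both are choices made here for illustration), the table of all possible partial sums and the address formed from one bit of every variable element could look like this:

```python
def build_lut(constants):
    """Precompute every possible partial sum of the constant vector.

    Entry `addr` holds the sum of constants[m] for each bit m that is set in
    `addr`, so the table has 2**len(constants) words.
    """
    return [sum(c for m, c in enumerate(constants) if (addr >> m) & 1)
            for addr in range(1 << len(constants))]


def address_for_bit(variables, bit):
    """Form a LUT address from bit number `bit` of every variable element."""
    addr = 0
    for m, x in enumerate(variables):
        addr |= ((x >> bit) & 1) << m
    return addr


if __name__ == "__main__":
    constants = [3, -5, 7]                    # illustrative constant vector
    lut = build_lut(constants)
    for addr, word in enumerate(lut):
        print(f"{addr:03b} -> {word}")

    variables = [0b101, 0b011, 0b110]         # illustrative variable vector
    addr = address_for_bit(variables, 0)      # take the LSB of each element
    print(f"LSB slice addresses {addr:03b}, partial sum {lut[addr]}")
```

At run time no multiplication takes place: each bit position of the variable vector costs one table read.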
One computing environment with limited computational resources, particularly
multipliers, is found on older, low-end, low-cost FPGAs. By using Distributed
Arithmetic, these devices can be used for low-latency, area-constrained, high-order
filters. Implementing such a filter with a multiplier-based approach would be
difficult.
One approach to the hardware implementation of digital filters (A. Peled, B. Liu
1974: 456–462) stores the finite number of possible outcomes of an intermediate
arithmetic operation and uses them to obtain the next output sample through repeated
addition and shifting operations, with no multiplications required. The hardware
implementation is highly modular and uses only standard available ICs. The work
shows that cascade and parallel realizations offer significant savings in hardware
complexity and power consumption, and that new higher-order filter designs can be
realized at the same rate of operation as existing realizations. The advantages of
the FPGA approach to digital filter implementation include higher sampling rates
than are available from traditional DSP chips, lower costs than an ASIC for
moderate-volume applications, and more flexibility than the alternative approaches.
Advantages of DA include computational efficiency, a multiplierless architecture, a
reduced need for arithmetic computing resources, and designs that can be implemented
with standard available ICs. Applying DA to filter design also raises challenges
that are active research areas: the number of LUTs/ROM words grows with each added
input bit, increasing the area, and the bit-serial nature of the computation reduces
its speed.
Distributed Arithmetic specifically targets the sum-of-products computation
required by many DSP filtering functions. DA replaces multiplications with a ROM
look-up table (LUT), a technique that maps efficiently onto FPGAs. Using DA in
digital signal processing saves 50% to 80% of the area (Wayne Burleson, Louis Scharf
1989: 158-161).
The Distributed Arithmetic algorithm is a very efficient solution especially suited
to LUT-based FPGA architectures. DA-based computation is well suited to FPGA
implementation because the LUT, together with the shift and add operations, can be
mapped efficiently onto LUT-based FPGA logic structures. In an FIR filter, one of
the convolving sequences comes from the input samples and the other from the fixed
impulse response coefficients of the filter. This property of the FIR filter makes
it possible to apply DA-based techniques for memory-based realizations (Pramod Kumar
Meher, Shrutisagar Chandrasekaran, Abbes Amira 2008).
4.2 DA Algorithm
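The principle of the DA algorithm is as follows (Mrs.Bhagyalakshmi N, Dr.Rekha K
R, Dr.Nataraj K R 2013: 114-118).
The output of a linear time-invariant system is given by Eqn. (4.1).
\[ Y = \sum_{m=1}^{M} A_m X_m \tag{4.1} \]
where A_m is a fixed factor (a coefficient of the FIR filter) and X_m is the input
data (|X_m| < 1). X_m can be expressed in two's complement form as in Eqn. (4.2).
\[ X_m = -X_{m0} + \sum_{n=1}^{N-1} X_{mn} 2^{-n} \tag{4.2} \]
where X_{mn} is 0 or 1, X_{m0} is the sign bit and X_{m,N-1} is the least
significant bit.
Then Y can be expressed as in Eqn. (4.3).
\[ Y = \sum_{m=1}^{M} A_m \left( -X_{m0} + \sum_{n=1}^{N-1} X_{mn} 2^{-n} \right)
     = \sum_{n=1}^{N-1} \left[ \sum_{m=1}^{M} A_m X_{mn} \right] 2^{-n}
     + \sum_{m=1}^{M} A_m (-X_{m0}) \tag{4.3} \]
In Eqn. (4.3), since each X_{mn} is 0 or 1, the bracketed sum
\sum_{m=1}^{M} A_m X_{mn} can take only 2^M different values.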
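Eqn. (4.3) can be checked numerically with a small sketch. It uses exact fractions, a table of the 2^M possible inner sums, and N-bit two's complement fractional inputs; the helper names `to_bits` and `da_output` and the sample values are assumptions made for illustration only.

```python
from fractions import Fraction
from itertools import product


def to_bits(x: Fraction, n_bits: int):
    """Two's complement fractional bits [X_0 (sign), X_1, ..., X_{N-1}] of x in [-1, 1)."""
    half_range = 1 << (n_bits - 1)
    scaled = x * half_range
    if scaled.denominator != 1 or not -half_range <= scaled < half_range:
        raise ValueError("x is not representable in the given word length")
    code = int(scaled) & ((1 << n_bits) - 1)        # wrap negatives into N bits
    return [(code >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]


def da_output(coeffs, xs, n_bits):
    """Evaluate Eqn. (4.3): sum_n [sum_m A_m X_mn] 2^-n + sum_m A_m (-X_m0)."""
    # All 2^M possible values of sum_m A_m X_mn, indexed by the tuple of bits.
    table = {bits: sum(a for a, b in zip(coeffs, bits) if b)
             for bits in product((0, 1), repeat=len(coeffs))}
    words = [to_bits(x, n_bits) for x in xs]
    y = -table[tuple(w[0] for w in words)]          # sign-bit term is subtracted
    for n in range(1, n_bits):
        y += table[tuple(w[n] for w in words)] * Fraction(1, 1 << n)
    return y


if __name__ == "__main__":
    coeffs = [Fraction(1, 4), Fraction(-3, 8), Fraction(1, 2)]    # A_m
    xs = [Fraction(3, 8), Fraction(-1, 2), Fraction(-7, 8)]       # X_m, 4-bit words
    direct = sum(a * x for a, x in zip(coeffs, xs))               # Eqn. (4.1)
    print(da_output(coeffs, xs, 4), direct)
```

Both printed values agree, since Eqn. (4.3) is only a reordering of the sums in Eqn. (4.1) after substituting Eqn. (4.2).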
The basic block diagram for the DA implementation of an FIR filter is shown in Fig
4.1 (Mrs.Bhagyalakshmi N, Dr.Rekha K R, Dr.Nataraj K R 2013: 114-118).
The bits of the N input data samples, each B bits wide, are stored in the bit shift
register, with the LSB at the rightmost position and the MSB at the leftmost. The
data are shifted one bit at a time and fed to the LUT: the outputs of the shift
register form the address that selects a location of the LUT. The LUT holds the
precomputed partial sums of the filter coefficients.
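[Figure 4.1: block diagram. A bit shift register holds the samples X[0] … X[N-1],
each with bits X_{B-1} … X_1 X_0; its outputs address the arithmetic table (LUT);
the LUT output feeds the scaling accumulator, made up of an add/subtract unit (+/-)
and a register, which produces Y.]
Fig 4.1 Distributed Arithmetic Block Diagram.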
The value read from the LUT is either added to or subtracted from the accumulated
result: per Eqn. (4.3), the word addressed by the sign bits is subtracted and all
other words are added. The intermediate results are stored in the register. Addition
or subtraction and accumulation in the register continue until all input bit
positions have been processed, which yields the final output of the filter. The
block diagram makes clear that the multiplication operations are replaced by shift
and add/subtract operations.
Since DA is a LUT-based method, the size of the LUT grows with the number of
coefficients to be handled, which increases the hardware resources, and because the
algorithm is bit-serial, its speed of operation is limited. Improving the existing
DA algorithm is therefore essential for using it effectively.
4.3 FIR Filter Implementation using DA

In the figure below, the input bits X0(n) … X7(n) form the 8-bit address of the
look-up table. The LUT holds the prestored sums of the filter coefficients. The
value read at each incoming address is added to the previously accumulated value to
build up the sum of products of the inputs and the coefficients. The scaling
accumulator shifts the data between additions so that each LUT output is weighted by
its bit position, and the running sum is held in a D flip-flop until the final
result is obtained at Y(n) (J. G. Proakis, D. G. Manolakis 1996).
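[Figure 4.2: block diagram. The inputs X0(n) … X7(n) address the LUT; the LUT output
feeds the scaling accumulator, which contains an adder (+), a shifter (<<) and a D
flip-flop (D, Q) whose output is Y(n).]
LUT contents shown in the figure:
Address    Data
0000       0
0001       C0
0010       C1
....       ....
1111       C0+C1+C2+C3
Fig 4.2 Block Diagram of Serial FIR filter implemented using Distributed Arithmetic.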
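The serial structure can be modelled in software as a rough behavioural sketch. It assumes a 4-tap filter with the LUT layout of the address/data table (address bit k selects C_k), signed two's complement samples processed LSB first, and full-precision accumulation with the scaling written as a left shift of the LUT word. None of these details are taken from the thesis code, and the function names are hypothetical.

```python
def build_lut(coeffs):
    """LUT contents: address bit k set means coefficient C_k is included in the sum."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_fir(coeffs, samples, width):
    """Bit-serial DA FIR filter for signed two's complement `width`-bit samples.

    Returns y[n] = sum_k coeffs[k] * x[n - k], with x[n] = 0 for n < 0.
    """
    lut = build_lut(coeffs)
    mask = (1 << width) - 1
    delay = [0] * len(coeffs)              # delay[k] holds x[n - k] as a bit pattern
    out = []
    for x in samples:
        delay = [x & mask] + delay[:-1]    # shift the new sample into the delay line
        acc = 0
        for bit in range(width):           # one LUT read per bit, LSB first
            addr = 0
            for k, word in enumerate(delay):
                addr |= ((word >> bit) & 1) << k
            term = lut[addr] << bit        # scaling: weight by the bit position
            if bit == width - 1:
                acc -= term                # sign bit carries weight -2^(width-1)
            else:
                acc += term
        out.append(acc)
    return out


def direct_fir(coeffs, samples):
    """Reference multiply-accumulate FIR, used only to check the DA result."""
    return [sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
            for n in range(len(samples))]


if __name__ == "__main__":
    coeffs = [2, -3, 5, 1]                         # C0..C3, illustrative values
    samples = [1, -2, 3, -128, 127, 0, -1]         # signed 8-bit inputs
    print(da_fir(coeffs, samples, 8))
    print(direct_fir(coeffs, samples))
```

The two printed lists are identical; the only operations inside the DA loop are a table read, a shift and an add or subtract.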
4.4 Improved DA Architecture

With an increase in the number of taps, the latency of DA does not increase, while
the size of the LUT does. The latency of the implementation depends on the input
word size: it almost doubles when moving from 8-bit to 16-bit data, and the sampling
rate is halved. When the same change is made to a multiply-add implementation, it
requires larger adders and multipliers, which increases the area but has no effect
on the sampling rate. As the size of the LUT increases, however, the latency of the
design is affected in both of the above methods.
Serial DA is area efficient, but can process only one sample every B+1 clock
cycles. Processing of the next sample starts only when all the bits of the current
sample have been processed. This is the biggest disadvantage of the serial DA
architecture. Using a parallel DA architecture solves the problem (J. G. Proakis,
D. G. Manolakis 1996).
A basic DA architecture for a length-N sum-of-products computation accepts one bit
from each of the N words. If two bits per word are accepted, the computational speed
can be essentially doubled. The maximum speed can be achieved with a fully pipelined
word-parallel architecture as shown in Fig 4.3. For maximum speed, a separate ROM
(with identical content) should be provided for each bit vector Xb[n] (Attri.S,
Sohi. B.S, Chopra.Y.C 2001: 462–466).
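[Figure 4.3: block diagram. Each bit vector Xb[0] … Xb[N-1], for b = 0 … 3,
addresses its own ROM; the ROM outputs are weighted by powers of two (2^0, 2^1 and
2^2 are legible in the figure) and summed by adders (+) to give Y.]
Fig 4.3 Parallel Distributed Arithmetic Architecture.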
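A behavioural sketch of the trade-off follows, assuming signed integer inputs and counting one clock per group of bit vectors read at the same time. The parameter `bits_per_clock` and the function names are hypothetical, and extra load or output cycles (such as the B+1 figure quoted for serial DA) are not modelled.

```python
def build_rom(coeffs):
    """ROM contents: address bit m set means coeffs[m] is included in the sum."""
    return [sum(c for m, c in enumerate(coeffs) if (addr >> m) & 1)
            for addr in range(1 << len(coeffs))]


def parallel_da(coeffs, xs, width, bits_per_clock):
    """Inner product with `bits_per_clock` bit vectors handled per clock.

    Returns (result, clock_cycles). In hardware each bit vector in a group
    would have its own copy of the ROM; here the single list is reused.
    """
    rom = build_rom(coeffs)
    mask = (1 << width) - 1
    words = [x & mask for x in xs]                 # two's complement bit patterns
    acc = 0
    clocks = 0
    for base in range(0, width, bits_per_clock):
        clocks += 1
        for bit in range(base, min(base + bits_per_clock, width)):
            addr = 0
            for m, w in enumerate(words):
                addr |= ((w >> bit) & 1) << m
            term = rom[addr] << bit                # fixed 2^bit weighting
            acc += -term if bit == width - 1 else term   # sign bit is subtracted
    return acc, clocks


if __name__ == "__main__":
    coeffs = [2, -3, 5, 1]                         # illustrative values
    xs = [100, -128, -1, 37]                       # signed 8-bit inputs
    for p in (1, 2, 8):
        print(p, parallel_da(coeffs, xs, 8, p))
```

With these values every setting returns the same inner product, while the clock count falls from 8 (bit-serial) to 4 (two bits per word) to 1 (fully word-parallel), at the cost of one ROM per bit vector read concurrently.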
The Improved Distributed Arithmetic algorithm is similar to Distributed Arithmetic,
with the LUTs split. The number of words in a Distributed Arithmetic LUT is 2^n for
n taps, so it grows exponentially with the number of taps. The LUT splitting
technique effectively reduces the memory usage. When implementing a decimation
filter using Distributed Arithmetic, it is not necessary to use a polyphase
structure, since it brings no benefit.
When the coefficient set is small, it is very convenient to realize the filter
through the rich LUT structure of the FPGA. When it is large, it takes a lot of the
FPGA's storage resources and reduces the calculation speed. In addition, the N-1
cycles result in a long LUT access time and a low computing speed. Shunwen Xiao and
Yajun Chen presented an improvement and optimization of the DA algorithm aimed at
the problems of the coefficient configuration of the FIR filter, the storage
resources and the calculation speed, which makes the memory size smaller and the
operation faster and so improves the computational performance.
[Figure 4.4: block diagram. The 4-bit samples X[n] … X[n-3] address one lookup table
and X[n-4] … X[n-7] address a second lookup table, one bit position (0 to 3) at a
time; each table feeds its own shifter, adder and register under row control, and a
final adder combines the two results into the output.]
Fig 4.4 Block Diagram of Improved Distributed Arithmetic with split LUT.
Fig 4.4 shows the Improved Distributed Arithmetic algorithm with split LUT, which
can be used to implement a filter of higher order, or one whose coefficient set is
too large to implement with a single LUT. In that case it is preferable to use
partial tables and add their results. With pipeline registers added, this
modification does not reduce the speed of the design, yet it significantly reduces
the area, since the size of a LUT grows exponentially with its address space.
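The saving can be illustrated with a short sketch that splits an 8-tap address into two 4-bit partial tables, as in Fig 4.4, and checks that adding the partial outputs reproduces the full-table word. The coefficient values and function names are illustrative assumptions.

```python
def build_lut(coeffs):
    """Full table: address bit k set means coeffs[k] is included in the sum."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def build_split_luts(coeffs, group_size):
    """One partial table per group of `group_size` consecutive coefficients."""
    return [build_lut(coeffs[i:i + group_size])
            for i in range(0, len(coeffs), group_size)]


def split_lookup(tables, addr, group_size):
    """Add the partial-table outputs selected by the slices of a full address."""
    mask = (1 << group_size) - 1
    total = 0
    for g, table in enumerate(tables):
        total += table[(addr >> (g * group_size)) & mask]   # one partial read
    return total


def total_words(tables):
    """Total number of stored words across all partial tables."""
    return sum(len(t) for t in tables)


if __name__ == "__main__":
    coeffs = [2, -3, 5, 1, 7, -4, 6, -8]           # 8 taps, illustrative values
    full = build_lut(coeffs)
    tables = build_split_luts(coeffs, 4)
    same = all(full[a] == split_lookup(tables, a, 4) for a in range(len(full)))
    print(f"full LUT: {len(full)} words, split LUTs: {total_words(tables)} words, "
          f"outputs identical: {same}")
```

For 8 taps the single table needs 2^8 = 256 words, while two 4-input tables need 2 x 2^4 = 32 words plus one extra adder.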
The FIR filter implemented using parallel DA offers high performance, achieved at
the cost of a large area. The advantage lies in reducing the number of clock cycles
by a factor of 2, with a proportionate increase in throughput by a factor of 2.
These advantages are obtained at the cost of doubling the number of required LUTs
and the size of the scaling accumulator needed to store the intermediate results.
4.5 Implementation of Basic Distributed Arithmetic
The simulation and implementation results of the basic DA design, written in
Verilog and targeted at a Spartan3 FPGA, are presented below. The top module of the
basic DA model is shown in Fig 4.5.
Fig 4.5 Top Module of Basic DA Module.
Fig 4.6 Basic DA Module Simulation result.

Fig 4.6 gives the simulation results of the basic DA module. The code is tested
with seven inputs, each 8 bits wide, named x_in0 to x_in6, for 4 different values of
each input. The filter coefficients are defined in the code. For the inputs applied
at 0 ns, the outputs (yr0 to yr6) in the figure above are obtained at approximately
150 ns, i.e. after a delay of 150 ns.
Fig 4.7 represents the RTL Schematic of basic DA Module.
Fig 4.7 RTL Schematic of Basic DA Module.

The device utilization summary of the basic DA design on the FPGA is given in Table
4.1. The table shows that the design takes only 3% of the slices, 2% of the slice
flip-flops and 2% of the LUTs available, resulting in an area-efficient
architecture.
Table 4.1 Design utilization summary of Basic DA Module.
Device Utilization Summary (Estimated values)
Logic Utilization           Used    Available    Utilization
Number of slices             141         3584             3%
No. of Slice Flip-Flops      174         7168             2%
No. of 4 input LUTs          213         7168             2%
No. of bonded IOBs           134          141            95%
No. of GCLKs                   1            8            12%
The timing analysis of the basic DA module is tabulated in Table 4.2.
Table 4.2 Timing Analysis of Basic DA Module.
Speed Grade                                  -5
Minimum period                               7.190 ns (Maximum Frequency: 139.089 MHz)
Minimum input arrival time before clock      13.819 ns
Maximum output required time after clock     6.280 ns
Maximum combinational path delay             No path found
Fig 4.8 shows the Static Power Analysis of the Basic DA Module on the Spartan3
FPGA. The analysis shows that the total power consumed by the complete design is
only 0.06 W.
Fig 4.8 Static Power Analysis of Basic DA Module.

The routed design of Basic DA on FPGA is indicated in Fig 4.9.
Fig 4.9 Routed Designs.
Fig 4.10 compares the simulation results with the results of the hardware
implementation on the FPGA. The hardware results match the simulation results
exactly.
Fig 4.10 Hardware verified Results using chip scope-pro tool.

Table 4.3 below gives the design utilization of the basic DA design on different
FPGAs, namely Spartan3 (XC3S400-5 PQ208), Virtex5 (5VLX110T-3 FF1136) and
Artix 7 (7A100T-3CSG324), taken as a case study.
Table 4.3 Different FPGA’s design utilization summary of Basic DA Module.
Logic Utilization                    XC3S400-5PQ208    5VLX110T-3FF1136    7A100T-3CSG324
Number of Slice Registers                       141                 170               171
Number of Slice LUTs                            174                 202               145
Number of fully used LUT-FF pairs               213                 115               101
Number of bonded IOBs                           134                 134               134
Number of BUFG/BUFGCTRLs                          1                   1                 1
The timing analysis of the design on the various FPGAs is tabulated in Table 4.4.
Table 4.4 Different FPGA’s Timing analysis summary of Basic DA Module.
Timing parameters                    XC3S400-5PQ208    5VLX110T-3FF1136    7A100T-3CSG324
Minimum period (ns)                            7.19               2.787             2.276
Maximum Frequency (MHz)                     139.089             358.791           439.445
Setup time (ns)                              13.819               6.875             4.862
Hold Time (ns)                                 6.28               2.779             0.645
