

15 The Basic Shift-Accumulator 507

Figure 11.41 Linear-phase FIR filter with N = 12

11.14.2 Parallel Implementation of Distributed Arithmetic

Distributed arithmetic can, of course, be implemented in parallel form, i.e., by
allocating a ROM to each of the terms in Equation (11.47). The ROMs, which are
identical, can be addressed in parallel and their values, appropriately shifted,
added using an adder tree as illustrated in Figure 11.42. The critical path is
through a ROM and through the adder tree. The critical path can be broken into
small pieces to achieve very high speed by pipelining.
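
The parallel scheme described above can be sketched in a few lines of Python. Equation (11.47) is not reproduced in this section, so the sketch assumes the usual two's-complement formulation of distributed arithmetic with integer-valued data; the function names (`build_rom`, `bit_plane_address`, `parallel_da`) are hypothetical and chosen only for illustration. Each bit plane of the inputs addresses an identical ROM, and the ROM words are weighted by their bit position and summed, which plays the role of the adder tree.

```python
def build_rom(coeffs):
    """ROM word at each address = sum of the coefficients whose address bit is 1."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << n)]


def bit_plane_address(xs, bit, wd):
    """Form a ROM address from bit `bit` (0 = LSB) of every wd-bit input word."""
    mask = (1 << wd) - 1
    return sum((((x & mask) >> bit) & 1) << i for i, x in enumerate(xs))


def parallel_da(coeffs, xs, wd):
    """Inner product sum(coeffs[i] * xs[i]) using one ROM lookup per bit plane."""
    rom = build_rom(coeffs)  # assumption: every ROM copy holds the same table
    # In hardware all wd lookups happen at once; here they are collected in a list.
    words = [rom[bit_plane_address(xs, bit, wd)] for bit in range(wd)]
    # Adder tree: weight each word by its bit position. The sign-bit plane
    # carries a negative weight in two's complement.
    total = -words[wd - 1] * (1 << (wd - 1))
    for bit in range(wd - 1):
        total += words[bit] * (1 << bit)
    return total


print(parallel_da([3, -5, 7, 2], [5, -3, -8, 7], wd=4))  # -12
```

The sketch performs the additions sequentially; a pipelined hardware adder tree would compute the same sum with registers between the adder levels.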


11.15 The Basic Shift-Accumulator

The shift-accumulator shall perform a shift-and-add operation and a subtraction
in the last time slot. For typical ROM word lengths of 8 to 18 bits, a
ripple-through adder or a carry-look-ahead adder is unsuitable for speed and
complexity reasons. The shift-accumulator, shown in Figure 11.43, uses
carry-save adders instead. This yields a regular hardware structure, with short
delay paths between the clocking elements. Furthermore, the shift-accumulator
can be implemented using a modular (bit-slice) design. The number of bits in the
shift-accumulator can be chosen freely.
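
To illustrate why carry-save adders give short delay paths, the following minimal sketch accumulates several words while keeping the running total as a separate sum word and carry word. It only illustrates the carry-save principle and is not a model of the circuit in Figure 11.43; the function names and the fixed `width` parameter are assumptions made for the example. Every bit position is computed from three local bits, so no carry ripples through the word until the final resolving addition.

```python
def carry_save_add(s, c, x, width):
    """Compress three words (sum, carry, new input) into a new sum and carry word."""
    mask = (1 << width) - 1
    new_s = (s ^ c ^ x) & mask                           # bitwise sum, no carry chain
    new_c = (((s & c) | (s & x) | (c & x)) << 1) & mask  # majority bits, moved up one position
    return new_s, new_c


def accumulate(words, width):
    """Accumulate words modulo 2**width; carries are resolved only once, at the end."""
    mask = (1 << width) - 1
    s = c = 0
    for w in words:
        s, c = carry_save_add(s, c, w & mask, width)
    return (s + c) & mask  # single carry-propagating addition


words = [13, 7, 200, 45]
assert accumulate(words, 8) == sum(words) % 256
```

In each step the pair (sum, carry) represents the same value as an ordinary accumulator would hold, modulo 2**width.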

Figure 11.42 Parallel implementation of distributed arithmetic

Figure 11.43 Shift-accumulator using carry-save adders

In the first time slot, word $F_{W_d-1}$ from the ROM shall be added to the
initially cleared accumulator. In the next time slot, $F_{W_d-2}$ shall be added
to the previous result divided by 2. This division is done by shifting
$F_{W_d-1}$ one step to the right and copying the sign bit. One bit of the
result is obtained during each clock cycle. This procedure is continued until
$F_0$, corresponding to the sign bit of the data, is subtracted. This is done by
adding $-F_0$, i.e., inverting all the bits in $F_0$ using the XOR gates and the
signal s, and adding one bit in the least-significant position. We will explain
later how this last addition is done. After $-F_0$ has been added, the most
significant part of the inner product must be shifted out of the accumulator.
This can be done by accumulating zeros. The number of clock cycles for one inner
product is $W_d + W_{ROM}$. A more efficient scheme is to free the carry-save
adders in the accumulator by loading the sum and carry bits of the carry-save
adders into two shift registers as shown in Figure 11.44 [12, 35]. The outputs
from these can be added by a single carry-save adder.
This scheme effectively doubles the throughput since two inner products are
computed concurrently for a small increase in chip area.
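
The shift-and-add procedure described above can be checked with a small behavioral sketch. It assumes integer coefficients and two's-complement input words that are interpreted as fractions (an input word x of wd bits stands for x / 2**(wd - 1)). The helper names are hypothetical, and exact rational arithmetic is used instead of a bit-level accumulator. The ROM word for the least significant bit plane is added first, the accumulator is halved before each further addition, and the word for the sign-bit plane is subtracted in the last time slot.

```python
from fractions import Fraction


def build_rom(coeffs):
    """ROM word at each address = sum of the coefficients whose address bit is 1."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << n)]


def shift_accumulate(coeffs, xs, wd):
    """Bit-serial inner product, least significant bit plane first."""
    rom = build_rom(coeffs)
    mask = (1 << wd) - 1
    acc = Fraction(0)  # initially cleared accumulator
    for bit in range(wd):
        # One bit from every input forms the ROM address for this time slot.
        addr = sum((((x & mask) >> bit) & 1) << i for i, x in enumerate(xs))
        f = rom[addr]
        if bit < wd - 1:
            acc = acc / 2 + f  # shift right one step, then add
        else:
            acc = acc / 2 - f  # last time slot: subtract the sign-bit word
    return acc


print(shift_accumulate([3, -5, 7, 2], [5, -3, -8, 7], wd=4))  # -3/2
```

With four-bit inputs, the integer inner product is -12, and the fractional interpretation divides it by 2**3, which gives -3/2.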
The result will appear with the least significant part in the output of the
shift-accumulator, and the most significant part in the output of the lower
carry-save adder. Thus, a special end-bit-slice is needed to separate the
relevant bits in the outputs. A special first bit-slice is also needed to copy
the sign bit.

Figure 11.44 Shift-accumulator augmented with two shift registers

Figure 11.45 shows how the carry-save adders are connected to two shift
registers that are loaded bit-parallel. The loading is accomplished via the
multiplexers. The input of the rightmost multiplexer in the lower shift register
is used to set the input of the carry-save adder to 1, in order to get a proper
subtraction of $F_0$. The required number of bit-slices is equal to the word
length, $W_{ROM}$.
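
The subtraction of $F_0$ relies on the two's-complement identity that negating a word equals inverting all of its bits and adding 1 in the least significant position. The sketch below models this at word level. The control signal `s` is taken from the text, while the function names and the `width` parameter are assumptions for illustration; the sketch does not model where in the circuit the extra 1 is injected.

```python
def add_or_subtract(acc, f, s, width):
    """Return acc + f when s == 0 and acc - f when s == 1, modulo 2**width."""
    mask = (1 << width) - 1
    xor_mask = mask if s else 0       # XOR gates invert every bit of f when s == 1
    inverted = (f & mask) ^ xor_mask
    return (acc + inverted + s) & mask  # s also supplies the extra 1 in the LSB


def to_signed(value, width):
    """Interpret a width-bit word as a two's-complement number."""
    return value - (1 << width) if value & (1 << (width - 1)) else value


assert add_or_subtract(20, 7, 0, 8) == 27
assert to_signed(add_or_subtract(20, 7, 1, 8), 8) == 13
assert to_signed(add_or_subtract(5, 9, 1, 8), 8) == -4
```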

Figure 11.45 The complete shift-accumulator

510 Chapter 11 Processing Elements

The first $W_d$ clock cycles are used to accumulate values from the ROM while
the last $W_{ROM}$ clock cycles are used to shift the result out of the shift
registers.
Hence, the required number of clock cycles is

Notice that these two phases can be overlapped with subsequent operations so
that two operations are performed concurrently. In a typical filter
implementation, $W_d$ = 16 to 22 bits and $W_{ROM}$ = 4 to 16 bits. Hence, the
number of clock cycles necessary is $W_d$ in most applications. The latency
between the inputs and outputs is $W_{ROM}$ clock cycles, and a new computation
can start every $W_d$ clock cycles. The word length of the result will be
$W_d + W_{ROM} - 1$ bits. The result is split into two parts: the least
significant part comes from the output of the last full-adder in the
accumulator, and the most significant part is formed as the bit-serial sum of
the carry-register and the sum-register. A special end-bit-slice is needed to
form the desired output.
A local control unit can be integrated into the shift-accumulator. All local
control signals can be generated from a single external synchronization signal
that initiates a new computation.
Each bit-slice is provided with a D flip-flop. Together, these flip-flops form a
shift register that generates delayed versions of the synchronization signal.
The local control signal needed for selection of the least and most significant
parts of the output is generated using this shift register. The control is
therefore independent of the word length of the shift-accumulator. This
simplifies the layout design and decreases the probability of design errors. It
also decreases the probability of timing problems that can occur when a signal
is distributed over a large distance.
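
The delayed synchronization signal can be pictured with a short behavioral sketch. It assumes one D flip-flop per bit-slice and a single-cycle synchronization pulse; the function names are hypothetical, and the sketch does not model how the select signal for the output parts is derived from the chain. Since the chain has one stage per bit-slice, its length follows the word length automatically.

```python
def clock_sync_chain(state, sync_in):
    """Advance the flip-flop chain one clock: each stage takes its neighbour's value."""
    return [sync_in] + state[:-1]


def simulate(num_slices, num_cycles):
    """Feed a one-cycle sync pulse into the chain and record the state after each clock."""
    state = [0] * num_slices
    history = []
    for cycle in range(num_cycles):
        state = clock_sync_chain(state, 1 if cycle == 0 else 0)
        history.append(state)
    return history


for cycle, state in enumerate(simulate(num_slices=4, num_cycles=6)):
    print(cycle, state)
```

The pulse moves one bit-slice further on every clock and leaves the chain after as many cycles as there are slices.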


The amount of memory required becomes very large for long inner products. There
are mainly two ways to reduce the memory requirements. The two methods can be
applied at the same time to obtain a very small amount of memory.

11.16.1 Memory Partitioning

One of several possible ways to reduce the overall memory requirement is to
partition the memory into smaller pieces whose outputs are added before the
shift-accumulator, as shown in Figure 11.46. The amount of memory is reduced
from $2^N$ words to $2 \cdot 2^{N/2}$ words if the original memory is
partitioned into two parts. For example, for N = 10 the memory is reduced from
$2^{10} = 1024$ words to $2 \cdot 2^5 = 64$ words. Hence, this approach reduces
the memory significantly at the cost of an additional adder.

Figure 11.46 Reducing the memory by partitioning
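
As a minimal sketch of the partitioning idea, the code below splits the coefficient set into two halves, builds one small ROM per half, and adds the two ROM outputs. It assumes that the ROM stores sums of coefficients selected by the address bits, as is usual in distributed arithmetic, and that N is divisible by the number of parts; all names are hypothetical.

```python
def build_rom(coeffs):
    """ROM word at each address = sum of the coefficients whose address bit is 1."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << n)]


def rom_words(n, parts=1):
    """Total number of ROM words when n address bits are split evenly over `parts` ROMs."""
    return parts * 2 ** (n // parts)  # assumption: n is divisible by parts


def partitioned_lookup(rom_lo, rom_hi, addr, half):
    """Address the two half-size ROMs and add their outputs (the additional adder)."""
    lo_addr = addr & ((1 << half) - 1)  # lower address bits go to the first ROM
    hi_addr = addr >> half              # upper address bits go to the second ROM
    return rom_lo[lo_addr] + rom_hi[hi_addr]


coeffs = [3, -5, 7, 2, -1, 4]
half = len(coeffs) // 2
full = build_rom(coeffs)
rom_lo, rom_hi = build_rom(coeffs[:half]), build_rom(coeffs[half:])
assert all(partitioned_lookup(rom_lo, rom_hi, a, half) == full[a]
           for a in range(len(full)))
print(rom_words(10), rom_words(10, parts=2))  # 1024 64
```

The check confirms that the two small ROMs plus one adder reproduce every word of the full ROM, while the word count for N = 10 drops from 1024 to 64.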
