result divided by 2. This division is done by shifting F_{W_d-1} one step to the right
and copying the sign bit. One bit of the result is obtained during each clock cycle.
This procedure is continued until F_0, corresponding to the sign bit of the data,
is being subtracted. This is done by adding -F_0, i.e., inverting all the bits in F_0
using the XOR gates and the signal s, and adding one bit in the least-significant
position. We will explain later how this last addition is done. After -F_0 has been
added, the most significant part of the inner product must be shifted out of the
accumulator. This can be done by accumulating zeros. The number of clock cycles
for one inner product is W_d + W_ROM. A more efficient scheme is to free the
carry-save adders in the accumulator by loading the sum and carry bits of the
carry-save adders into two shift registers as shown in Figure 11.44 [12, 35]. The outputs
from these can be added by a single carry-save adder.
This scheme effectively doubles the throughput since two inner products are
computed concurrently for a small increase in chip area.
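The basic shift-and-add procedure described above can be sketched in a few lines of Python. This is only an illustrative model, not the circuit itself: the names `to_bits`, `rom_value`, and `shift_accumulate` are hypothetical, the data words are assumed to be W_d-bit two's-complement integers interpreted as fractions, and exact fractions stand in for the bits that the hardware shifts out one per clock cycle.

```python
from fractions import Fraction

def to_bits(x, w_d):
    """Two's-complement bits of integer x, sign bit first (index 0)."""
    return [(x >> (w_d - 1 - k)) & 1 for k in range(w_d)]

def rom_value(coeffs, bits_k):
    """F_k: sum of the coefficients whose data bit is 1 (the ROM word)."""
    return sum(a for a, b in zip(coeffs, bits_k) if b)

def shift_accumulate(coeffs, xs, w_d):
    """Inner product of coeffs with xs / 2**(w_d - 1), one bit plane per step."""
    bit_rows = [to_bits(x, w_d) for x in xs]
    acc = Fraction(0)
    # Start with the least significant bit plane: add F_k, then divide by 2.
    for k in range(w_d - 1, 0, -1):
        f_k = rom_value(coeffs, [row[k] for row in bit_rows])
        acc = (acc + f_k) / 2
    # The sign-bit plane F_0 is subtracted and not followed by a shift.
    f_0 = rom_value(coeffs, [row[0] for row in bit_rows])
    return acc - f_0

coeffs = [3, -5, 7]
xs = [5, -3, -8]          # 4-bit two's-complement data words
result = shift_accumulate(coeffs, xs, 4)
```

Under these assumptions, `result` equals the direct inner product of `coeffs` with the data values scaled by 2^-(W_d-1), which is a convenient check on the order of the add and shift steps.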
The result will appear with the least significant part in the output of the shift-
accumulator, and the most significant part in the output of the lower carry-save
adder. Thus, a special end-bit-slice is needed to separate the relevant bits in the
outputs. A special first bit-slice is also needed to copy the sign bit.
Figure 11.45 shows how the carry-save adders are connected to two shift
registers that are loaded bit-parallel. The loading is accomplished via the
multiplexers. The input of the rightmost multiplexer in the lower shift register is used to set
the input of the carry-save adder to 1, in order to get a proper subtraction of F_0.
The required number of bit-slices is equal to the word length, W_ROM.
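The idea of keeping the sum and carry bits separate until a final addition can be illustrated with a small Python sketch. It is a simplified word-level model under stated assumptions: the helper names `carry_save_add` and `accumulate_carry_save` are hypothetical, Python integers stand in for the register contents, and the right shift and bit-serial output of the actual circuit are not modeled.

```python
def carry_save_add(s, c, x):
    """One carry-save step: three operands in, a sum word and a carry word out."""
    new_s = s ^ c ^ x                           # bitwise sum without carry propagation
    new_c = ((s & c) | (s & x) | (c & x)) << 1  # carries, moved up one bit position
    return new_s, new_c

def accumulate_carry_save(values):
    """Accumulate values while keeping sum and carry words separate."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_add(s, c, v)
    return s, c

s, c = accumulate_carry_save([13, -7, 22, -3])
total = s + c  # the final addition merges the two words
```

No carry ripples through the word during accumulation; the single addition `s + c` at the end plays the role of the adder that combines the outputs of the two shift registers.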
The first W_d clock cycles are used to accumulate values from the ROM while
the last W_ROM clock cycles are used to shift the result out of the shift registers.
Hence, the required number of clock cycles is N_CL = max{W_d, W_ROM}.
Notice that these two phases can be overlapped with subsequent operations so
that two operations are performed concurrently. In a typical filter implementation
W_d = 16 to 22 bits and W_ROM = 4 to 16 bits. Hence, the number of clock cycles
necessary is W_d in most applications. The latency between the inputs and outputs is
W_ROM clock cycles, and a new computation can start every W_d clock cycles. The word
length of the result will be W_d + W_ROM - 1 bits. The result is split into two parts; the
least significant part comes from the output of the last full-adder in the accumulator
and the most significant part is formed as the bit-serial sum of the carry-register
and the sum-register. A special end-bit-slice is needed to form the desired output.
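The figures quoted in this paragraph can be collected in a short Python helper. This is a sketch for bookkeeping only: the name `shift_accumulator_timing` and the field names are hypothetical, and the start interval is taken as the longer of the two phases, which is an assumption consistent with the statement that W_d cycles suffice when W_d is at least W_ROM.

```python
from typing import NamedTuple

class Timing(NamedTuple):
    accumulate_cycles: int   # cycles spent accumulating ROM values
    shift_out_cycles: int    # cycles spent shifting the result out
    start_interval: int      # cycles between the starts of two computations
    latency: int             # latency between inputs and outputs, per the text
    result_bits: int         # word length of the result

def shift_accumulator_timing(w_d, w_rom):
    """Collect the cycle counts and result word length for given word lengths."""
    return Timing(
        accumulate_cycles=w_d,
        shift_out_cycles=w_rom,
        # Assumption: with overlapped phases the longer phase sets the pace.
        start_interval=max(w_d, w_rom),
        latency=w_rom,
        result_bits=w_d + w_rom - 1,
    )

t = shift_accumulator_timing(w_d=16, w_rom=8)
```

For W_d = 16 and W_ROM = 8, a typical case from the ranges given above, the helper reports a new computation every 16 clock cycles and a 23-bit result.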
A local control unit can be integrated into the shift-accumulator. All local
control signals can be generated from a single external synchronization signal that
initiates a new computation.
Each bit-slice is provided with a D flip-flop which forms a shift register
generating delayed versions of the synchronization signal. The local control signal
needed for selection of the least and most significant parts of the output is
generated using this shift register. The control is therefore independent of the word
length of the shift-accumulator. This simplifies the layout design and decreases
the probability of design errors. It also decreases the probability of timing
problems that can occur when a signal is distributed over a large distance.
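The way a chain of D flip-flops produces delayed copies of the synchronization signal can be sketched as a simple clocked simulation in Python. The function name `delayed_sync` is hypothetical and the model assumes ideal flip-flops that all update on the same clock edge; how the delayed pulses are decoded into the output-select signal is not modeled here.

```python
def delayed_sync(sync, n_slices):
    """Simulate one D flip-flop per bit-slice, connected as a shift register.

    sync is the external synchronization signal, one value per clock cycle.
    Returns the flip-flop contents after each clock edge.
    """
    regs = [0] * n_slices
    history = []
    for bit in sync:
        # Each flip-flop takes the value of its neighbor; the first takes sync.
        regs = [bit] + regs[:-1]
        history.append(list(regs))
    return history

# A single synchronization pulse travelling through four bit-slices.
history = delayed_sync([1, 0, 0, 0, 0], n_slices=4)
```

Because each slice sees the pulse one cycle after its neighbor, adding or removing bit-slices changes the delay automatically, which is the sense in which the control is independent of the word length.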