
LOW POWER ENERGY EFFICIENT PIPELINED MULTIPLY-ACCUMULATE ARCHITECTURE
R Sakthivel, VIT University, Vellore, TN, India, rsakthivel@vit.ac.in
K Sravanthi, VIT University, Vellore, TN, India, sravanthi44@gmail.com
Harish M Kittur, VIT University, Vellore, TN, India, kittur@vit.ac.in

ABSTRACT
This paper proposes and implements an energy-efficient, high-speed
pipelined Multiply and Accumulate (MAC) architecture for DSP
applications. A controller detects the input pattern and bypasses the
multiplier and accumulator units depending on the consecutive input
bits. The architecture supports both signed and unsigned
multiplication, and it includes guard bits to support longer
iterations and saturation circuitry to provide the maximum range at
the output. Bypassing paths/modules makes the architecture more energy
efficient and reduces its critical path. The technique is applied to
the combined MAC architecture, which has proved to be faster than
other conventional MAC architectures, and a power reduction of 28% is
achieved. The proposed MAC architectures are modeled in Verilog HDL
and implemented for operand sizes of 8, 16 and 32 bits. The design has
been synthesized using Cadence RTL Compiler with TSMC 90 nm
technology, and place and route is carried out using Cadence Encounter.
Categories and Subject Descriptors
B.5.0 [REGISTER-TRANSFER-LEVEL
IMPLEMENTATION]: Design Aids; B.6.0 [LOGIC
DESIGN]: Design Aids.
General Terms
Design, Performance.
Keywords
Low power, Critical Path, Multiplier and Accumulator (MAC).
1. INTRODUCTION
With rapid advances in multimedia and communication systems, the
demand for high data transfer rates has increased. Multiplier-and-
accumulator (MAC) units are commonly used blocks in digital signal
processing operations such as filtering, convolution and inner
products. Hence many techniques are emerging for high-speed, low-power
MAC units, which result in high-performance digital signal processing
applications.
A MAC generally consists of a multiplier and an accumulator. The
multiplier unit multiplies the inputs, and its output is added to the
previously accumulated value using an accumulate adder, as shown in
Figure 1. The multiplier plays an important role in the performance of
the MAC unit; it consists of a partial product generation unit (PP
unit), a reduction tree and a final adder.
Figure 1. Block diagram of general MAC architecture with
two and three stage pipelining.
To increase the performance of the MAC, pipelining is used, which
shortens the critical path by inserting registers inside the PP unit
or between the PP unit and the final adder. This creates the two-cycle
and three-cycle MAC architectures (Figure 1).
Several design techniques have been proposed for the multiplier, since
it has the largest delay and lies on the critical path. Inside the PP
unit, partial products can be generated by multipliers such as
Baugh-Wooley, the modified Booth algorithm [1], or some of its
successors [2], and reduced by trees such as Wallace, Dadda, TDM or
HPM. Fast adders such as Kogge-Stone, sparse-tree carry-lookahead and
hybrid adders [3] can be used for fast addition of the PP unit outputs.
ICACCI '12, August 03 - 05 2012, CHENNAI, India
Copyright 2012 ACM 978-1-4503-1196-0/12/08.$10.00.
Many techniques are emerging to achieve low power consumption in
digital systems, from the device level to the system level [10-12].
Low-power techniques can be employed at different levels of the MAC to
reduce power consumption. One such technique is proposed in this paper.
The remainder of this paper is organized as follows: Section 2
describes the Baugh-Wooley multiplier and the HPM reduction tree,
Section 3 describes the combined MAC architecture and its advantages
over the conventional MAC architectures (Figure 1), Section 4
describes our proposed technique, and Section 5 describes the
evaluation methodology and simulation results. Finally, the conclusion
is given in Section 6.
2. BAUGH-WOOLEY MULTIPLIER AND REDUCTION TREE
The Baugh-Wooley algorithm [5] is used for both signed and unsigned
multiplication. It is more power and energy efficient than a modified
Booth multiplier of equal bit width. It comprises three steps:
i) The most significant bit (MSB) of the first N-1
partial-product rows and all bits of the last partial-
product row, except its MSB, are inverted.
ii) A '1' is added to the Nth column.
iii) The MSB of the final result is inverted.
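The three steps can be checked with a small bit-level model. The following Python sketch is illustrative only and is not the Verilog used in this work; the function name baugh_wooley and the restriction to signed (two's-complement) operands are assumptions made for the example. Columns are counted from 0, so the Nth column has weight 2^N.

```python
def baugh_wooley(x: int, y: int, n: int) -> int:
    """Signed n-bit by n-bit product built from the three Baugh-Wooley steps."""
    x_bits = [(x >> j) & 1 for j in range(n)]  # two's-complement bits of x
    y_bits = [(y >> i) & 1 for i in range(n)]  # two's-complement bits of y
    total = 0
    for i in range(n):          # partial-product row i (one row per bit of y)
        for j in range(n):      # bit j within the row (one per bit of x)
            bit = x_bits[j] & y_bits[i]
            # Step i: invert the MSB of the first n-1 rows and every bit
            # of the last row except its MSB.
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            total += bit << (i + j)
    total += 1 << n                      # Step ii: add '1' in column N (weight 2**n)
    total &= (1 << (2 * n)) - 1          # keep the 2N result bits
    total ^= 1 << (2 * n - 1)            # Step iii: invert the MSB of the result
    # Reinterpret the 2N-bit word as a signed integer.
    return total - (1 << (2 * n)) if total >> (2 * n - 1) else total
```

Summing the rows with ordinary integer addition stands in for the reduction tree and final adder; only the bit inversions and the added '1' are specific to the algorithm.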
It is not widely adopted because it cannot be used effectively on
irregular reduction trees such as Wallace, Dadda or TDM. Baugh-Wooley
implementations are based exclusively on reduction arrays; here we use
the High Performance Multiplier (HPM) [6] reduction tree.
The HPM reduction tree is a logarithmic-depth reduction tree with a
regular structure; this regularity makes the size of the reduction
circuit less of a concern when designing a multiplier. It combines the
layout of a simple carry-save addition array with the high speed and
low power of a Dadda-style tree [9].
3. COMBINED MAC ARCHITECTURE
The combined MAC architecture is shown in Figure 2. Instead of the two
carry-propagating stages of the architectures in Figure 1, this
architecture uses a single carry propagation by replacing the final
adder of the first stage with a carry-save adder in the second stage.
The delays of the two stages are now similar, and the critical path
still depends on the PP unit.
The basic MAC operation using the Baugh-Wooley multiplier is shown in
Figure 3. First the product of the two inputs is computed, then the
result is sign extended to the width of the accumulate adder. The
accumulate adder is Ng bits wider than the multiplier output to allow
more multiply-accumulate iterations (2^Ng) without overflow. Finally
the sign-extended product is added to the accumulated value. The
disadvantage is that the P[2N-1] bit has to be computed first, and
only then can the most significant bit be sign extended for the
addition.
The combined MAC uses a carry-save adder composed of full adders [4].
This adder sums the two partial products from the PP unit and the
previous accumulated value. It also avoids the complicated
sign-extension procedure by adding a row of Ng+1 bits of '1' to the
accumulated value. This removes the need to perform carry propagation
and obtain the P[2N-1] bit for sign extension.
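The reason a row of Ng+1 ones can replace sign extension is that, modulo 2^(2N+Ng), ones in bit positions 2N-1 up to 2N+Ng-1 equal -2^(2N-1), which is exactly the correction needed when the MSB of the product is still in its inverted (pre-step-iii) form. The Python sketch below checks this identity numerically; the function names are hypothetical, signed operands are assumed, and it models plain integer arithmetic rather than the carry-save adder itself.

```python
def add_with_sign_extension(acc: int, product: int, n: int, ng: int) -> int:
    """Reference: sign extend the 2N-bit product, then add it to the accumulator."""
    width = 2 * n + ng
    return (acc + product) & ((1 << width) - 1)


def add_with_ones_row(acc: int, product: int, n: int, ng: int) -> int:
    """Same sum without sign extension, using a constant row of Ng+1 ones."""
    width = 2 * n + ng
    # q models the 2N-bit multiplier output before the final MSB inversion.
    q = (product & ((1 << (2 * n)) - 1)) ^ (1 << (2 * n - 1))
    # Ng+1 ones placed at bit positions 2N-1 .. 2N+Ng-1.
    ones_row = ((1 << (ng + 1)) - 1) << (2 * n - 1)
    return (acc + q + ones_row) & ((1 << width) - 1)
```

Both functions return the accumulator as a raw (2N+Ng)-bit word, so they can be compared directly for any operand pair.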
This architecture has advantages over the conventional MAC
architectures in terms of speed, area, power, and energy:
i) When compared to the two-cycle MAC (Figure 1), the
combined MAC requires no final adder.
ii) When compared to the three-cycle MAC (Figure 1), this
architecture removes the final adder and one pipeline stage
without degrading speed.
Figure 2. Block diagram of combined MAC [4]
International Conference on Advances in Computing, Communications and Informatics (ICACCI-2012) 227
Figure 3. Multiply and accumulate operation of inputs X and Y for a 3 cycle MAC architecture shown in Figure 1. [4]
The saturation unit removes the guard bits (Ng) so that the final
result is 2N bits wide. It takes the output of the accumulate adder,
G[2N+Ng-1], as input and performs the algorithm given in [4].
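The exact saturation algorithm is the one in [4] and is not reproduced here. As a rough illustration of what removing guard bits involves, the following Python sketch implements a generic signed saturation: if the guard bits and the MSB of the 2N-bit result all agree, the value already fits and the low 2N bits pass through; otherwise the output is clamped to the most positive or most negative 2N-bit value. The function name saturate_word and the signed-only behavior are assumptions for this example.

```python
def saturate_word(g: int, n: int, ng: int) -> int:
    """Reduce a raw (2N+Ng)-bit accumulator word to a saturated 2N-bit word."""
    width = 2 * n + ng
    sign = (g >> (width - 1)) & 1        # sign bit of the wide accumulator
    top = g >> (2 * n - 1)               # guard bits plus the MSB of the 2N-bit result
    all_ones = (1 << (ng + 1)) - 1
    if top == 0 or top == all_ones:
        # No overflow: the value fits in 2N bits, so drop the guard bits.
        return g & ((1 << (2 * n)) - 1)
    # Overflow: clamp to the most negative or most positive 2N-bit value.
    return (1 << (2 * n - 1)) if sign else (1 << (2 * n - 1)) - 1
```

The input is expected as an unsigned bit pattern in the range 0 to 2^(2N+Ng)-1, as it would appear on the accumulate adder output.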
4. PROPOSED TECHNIQUE
In this technique the multiplier and accumulator units are
bypassed depending on the consecutive input bits. The technique
is as follows:
i) If the incoming input values are zero, then disable the
multiplier and accumulator and reuse the result of the
accumulate adder.
ii) If the subsequent incoming input values are '1', then
disable the multiplier and enable the accumulator.
iii) If the incoming input values are other than zero and
one, then both the multiplier and the accumulator are
enabled.
The proposed technique is implemented on the combined MAC
architecture, as shown in Figure 4. It results in less switching
activity, i.e. reduced dynamic power, and is faster than the
combined MAC architecture.
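The control decisions above can be summarized in a behavioral model. The Python sketch below is one possible reading, not the controller of Figure 4: it assumes that the zero and one checks apply to either operand (a zero operand makes the product zero, and an operand equal to one makes the product equal to the other operand). The class name BypassMac and its activity counters are hypothetical and exist only to show how often each unit is enabled.

```python
from dataclasses import dataclass


@dataclass
class BypassMac:
    acc: int = 0            # accumulated value
    multiplies: int = 0     # cycles in which the multiplier was enabled
    accumulates: int = 0    # cycles in which the accumulator was enabled

    def step(self, x: int, y: int) -> int:
        if x == 0 or y == 0:
            # Case i: product is zero, so both units stay disabled
            # and the previous accumulator result is reused.
            return self.acc
        if x == 1 or y == 1:
            # Case ii: product equals the other operand, so the
            # multiplier is bypassed and only the accumulator runs.
            product = y if x == 1 else x
        else:
            # Case iii: general inputs, both units are enabled.
            self.multiplies += 1
            product = x * y
        self.accumulates += 1
        self.acc += product
        return self.acc
```

Feeding a stream of operand pairs through step and comparing the counters with the stream length gives a rough feel for how much switching activity the bypass can avoid for a given input pattern.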

Figure 4. Proposed MAC architecture with control signal to enable and disable the multiplier and accumulator
5. METHODOLOGY VALIDATION
5.1 Evaluation Methodology
All PP units consist of partial product generation based on the
Baugh-Wooley algorithm and an HPM reduction tree for the partial
products. The accumulate adder is a conditional-sum adder [7], which
is among the fastest adders, extended by 8 guard bits to support loops
of up to 256 iterations in the MAC unit. The final adder is a
Kogge-Stone adder, a fast parallel-prefix adder, used for fast
addition of the PP unit outputs in MAC-2C and MAC-3C.
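The link between 8 guard bits and 256 iterations follows from the product fitting in 2N bits, so 2^8 of them fit in 2N+8 bits. The Python sketch below checks this for the worst-case signed products at each operand size; the function name fits_after is hypothetical, and the check only confirms that 256 iterations are safe, not that 256 is the largest safe count.

```python
def fits_after(iterations: int, n: int, ng: int) -> bool:
    """True if `iterations` worst-case signed products fit in 2N+Ng bits."""
    lo = -(1 << (2 * n + ng - 1))            # most negative accumulator value
    hi = (1 << (2 * n + ng - 1)) - 1         # most positive accumulator value
    x_min = -(1 << (n - 1))                  # most negative N-bit operand
    x_max = (1 << (n - 1)) - 1               # most positive N-bit operand
    most_positive = x_min * x_min            # largest possible product
    most_negative = x_min * x_max            # smallest possible product
    return lo <= iterations * most_negative and iterations * most_positive <= hi
```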
The MAC-2C, MAC-3C, combined MAC and proposed MAC are implemented for
8, 16 and 32 bits in Verilog HDL and synthesized using Cadence RTL
Compiler in 90 nm technology. The simulation result is shown in
Figure 5. Place and route is carried out in Cadence Encounter, as
shown in Figure 6.
5.2 Implementation Results
The results obtained from Cadence RTL Compiler are listed in Table 1.
The critical path is through the PP unit for all four designs. Since
the pipeline registers of the combined MAC are at the bottom of the PP
unit, the combined MAC and MAC-3C have the same delay.
The results in Table 1 show that the proposed technique consumes less
power and is more energy efficient than the combined MAC. This
advantage comes with a small increase in area due to the extra
circuitry shown in Figure 4.
Figure 5. Simulation result of 16 bit combined MAC
Figure 6. Layout of proposed technique for 16 bit combined MAC
6. CONCLUSION
In this paper an energy-efficient, high-speed pipelined Multiply and
Accumulate architecture for DSP applications has been presented. The
proposed technique of bypassing paths in the architecture depending on
consecutive input patterns is applied to the power-efficient combined
MAC. Evaluation with Cadence RTL Compiler in 90 nm technology shows
that the average power of the proposed technique is 28% lower than
that of the combined MAC. Hence it can be used in applications where
power plays a major role.
Table 1. Evaluation results of MAC-2C, MAC-3C, combined MAC and the
proposed MAC for 8, 16 and 32 bit operand sizes.
7. ACKNOWLEDGMENT
We wish to thank our friends and colleagues for their support, and the
anonymous reviewers for their valuable suggestions and remarks.
8. REFERENCES
[1] W.-C. Yeh and C.-W. Jen, "High-speed Booth encoded
parallel multiplier design," IEEE Trans. on Computers, vol.
49, no. 7, pp. 692–701, July 2000.
[2] M. R. Santoro and M. A. Horowitz, "SPIM: A pipeline 64x64
bit iterative multiplier," IEEE J. Solid-State Circuits (JSSC),
vol. 2, no. 1, pp. 487–493, April 1989.
[3] D. H. K. Hoe, C. Martinez and S. J. Vundavalli, "Design and
Characterization of Parallel Prefix Adders using FPGAs," in
Proc. of Intl. Symp. System Theory (SSST), March 2011, pp.
168–172.
[4] Tung Thanh Hoang, Magnus Själander, and Per Larsson-
Edefors, "A High-Speed, Energy-Efficient Two-Cycle
Multiply-Accumulate (MAC) Architecture and Its
Application to a Double-Throughput MAC Unit," IEEE
circuits and systems, December 2010, pp. 3073–3081.
[5] M. Själander and P. Larsson-Edefors, "High-speed and low-
power multipliers using the Baugh-Wooley algorithm and
HPM reduction tree," in Proc. of IEEE Intl. Conf. on
Electronics, Circuits and Systems (ICECS), August 2008, pp.
33–36.
[6] H. Eriksson, P. Larsson-Edefors, M. Sheeran, M. Själander,
D. Johansson, and M. Scholin, "Multiplier reduction tree with
logarithmic logic depth and regular connectivity," in Proc. of
IEEE Intl. Symp. on Circuits and Systems (ISCAS), May
2006, pp. 4–8.
[7] J. Sklansky, "Conditional-sum addition logic," IRE Trans.
Electronic Comput., vol. EC-9, pp. 226–231, 1960.
[8] A. Abdelgawad and M. Bayoumi, "High speed and area-
efficient Multiply Accumulate (MAC) unit for digital signal
processing applications," in Proc. of IEEE Intl. Symp. on
Circuits and Systems (ISCAS), May 2007, pp. 3199–3202.
[9] Bickerstaff K.A.C., Schulte M., Swartzlander, in Proc. of
IEEE Intl. Conf. on Application-Specific Array Processors,
October 1993, pp. 478–489.
[10] V. Moshnyaga, "Reducing switching activity of
subtraction via variable truncation of the most
significant bits," Journal of VLSI Signal Processing, no.
33, 2003, pp. 75–82.
[11] M. Ito, D. Chinnery, and K. Keutzer, "Low power
multiplication algorithm for switching activity reduction
through operand decomposition," Proc. of the 21st Intl.
Conf. on Computer Design (ICCD 2003), 2003, pp. 21–26.
[12] V. Menon, S. Chennupati, N. K. Samala, D. Radhakrishman,
and B. Izadi, "Switching activity minimization in
combinational logic design," Proc. of the Intl. Conf. on
Embedded Systems and Applications, June 2004, pp. 47–53.
[13] C. R. Baugh and B. A. Wooley, "A two's complement
parallel array multiplication algorithm," IEEE Trans.
Comput., vol. C-22, pp. 1045–1047, Dec. 1973.
