IEEE TRANSACTIONS ON COMPUTERS, VOL. 64, NO. 4, APRIL 2015
Abstract: Inexact (or approximate) computing is an attractive paradigm for digital processing at nanometric scales. Inexact computing
is particularly interesting for computer arithmetic designs. This paper deals with the analysis and design of two new approximate 4-2
compressors for utilization in a multiplier. These designs rely on different features of compression, such that imprecision in computation
(as measured by the error rate and the so-called normalized error distance) can be traded off against circuit-based figures of merit of a
design (number of transistors, delay and power consumption). Four different schemes for utilizing the proposed approximate
compressors are proposed and analyzed for a Dadda multiplier. Extensive simulation results are provided and an application of the
approximate multipliers to image processing is presented. The results show that the proposed designs accomplish significant
reductions in power dissipation, delay and transistor count compared to an exact design; moreover, two of the proposed multiplier
designs provide excellent capabilities for image multiplication with respect to average normalized error distance and peak
signal-to-noise ratio (more than 50 dB for the considered image examples).
1 INTRODUCTION
Fig. 2. 4-2 compressor.
the error generated by rounding the result to $n$ bits) are found in the next step. The correction constant ($n + k$ bits) is selected to be as close as possible to the estimated value of the sum of these errors to reduce the error distance.

A truncated multiplier with constant correction has the maximum error if the partial products in the $n - k$ least significant columns are all ones or all zeros. A variable correction truncated multiplier has been proposed in [6]. This method changes the correction term based on column $n - k - 1$. If all partial products in column $n - k - 1$ are one, then the correction term is increased. Similarly, if all partial products in this column are zero, the correction term is decreased.

In [7], a simplified (and thus inaccurate) $2 \times 2$ multiplier block is proposed for building larger multiplier arrays. In the design of a fast multiplier, compressors have been widely used [8], [9], [10] to speed up the partial product reduction tree and decrease power dissipation. Optimized designs of 4-2 exact compressors have been proposed in [8], [11], [12], [13], [14], [15], [16]. Kelly et al. [17] and Ma et al. [18] have also considered compression for approximate multiplication. In [17], an approximate signed multiplier has been proposed for use in arithmetic data value speculation (ADVS); multiplication is performed using the Baugh-Wooley algorithm. However, no new design is proposed for the compressors for the inexact computation. Designs of approximate compressors have been proposed in [18]; however, these designs do not target multiplication. It should be noted that the approach of [7] improves over [17], [18] by utilizing a simplified multiplier block that is amenable to approximate multiplication.

Initially in this paper, two novel approximate 4-2 compressors are proposed and analyzed. It is shown that these simplified compressors have better delay and power consumption than the optimized (exact) 4-2 compressor designs found in the technical literature [8]. These approximate compressors are then used in the reduction module of a Dadda multiplier; four different schemes are proposed for inexact multiplication. Extensive simulation results are provided at circuit-level for figures of merit, such as delay, transistor count, power dissipation, error rate and normalized error distance under CMOS feature sizes of 32, 22 and 16 nm. The application of these multipliers to image processing is then presented. The results of two examples of multiplication of two images are reported; these results show that the third and fourth approximate multipliers yield an output product image that has a very high quality and resemblance to the image generated by an exact multiplier, i.e., excellent values for the average NED and the peak signal-to-noise ratio (PSNR) are found (for the PSNR more than 50 dB). The analysis and simulation results show that the proposed approximate designs for both the compressor and the multiplier are viable candidates for inexact computing.

This paper is organized as follows. Section 2 is a review of existing schemes for (exact) compressors. The two new designs of an approximate 4-2 compressor are presented in Section 3. Multiplication and four different approximate multipliers are proposed in Section 4. Simulation results for the approximate compressors and multipliers are provided in Section 5. The application of the proposed approximate multipliers to image processing is presented in Section 6. Section 7 concludes the manuscript.

2 EXACT COMPRESSORS
The main goal of either multi-operand carry-save addition or parallel multiplication is to reduce $n$ numbers to two numbers; therefore, $n$-2 compressors (or $n$-2 counters) have been widely used in computer arithmetic. An $n$-2 compressor (Fig. 1) is usually a slice of a circuit that reduces $n$ numbers to two numbers when properly replicated. In slice $i$ of the circuit, the $n$-2 compressor receives $n$ bits in position $i$ and one or more carry bits from the positions to the right, such as $i - 1$ or $i - 2$. It produces two output bits in positions $i$ and $i + 1$ and one or more carry bits into the higher positions, such as $i + 1$ or $i + 2$. For the correct operation of the circuit shown in Fig. 1, the following inequality must be satisfied

$$n + c_1 + c_2 + c_3 + \cdots \le 3 + 2c_1 + 4c_2 + 8c_3 + \cdots \qquad (1)$$

where $c_j$ denotes the number of carry bits from slice $i$ to slice $i + j$.

A widely used structure for compression is the 4-2 compressor; a 4-2 compressor (Fig. 2) can be implemented with a carry bit between adjacent slices ($c_1 = 1$). The carry bit from the position to the right is denoted as $c_{in}$ while the carry bit into the higher position is denoted as $c_{out}$. The two output bits in positions $i$ and $i + 1$ are also referred to as the sum and carry respectively.

The following equations give the outputs of the 4-2 compressor, while Table 1 shows its truth table.

$$Sum = x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus C_{in} \qquad (2)$$

$$C_{out} = (x_1 \oplus x_2)\,x_3 + \overline{(x_1 \oplus x_2)}\,x_1 \qquad (3)$$

$$Carry = (x_1 \oplus x_2 \oplus x_3 \oplus x_4)\,C_{in} + \overline{(x_1 \oplus x_2 \oplus x_3 \oplus x_4)}\,x_4 \qquad (4)$$
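As an informal check of (2)-(4), the following Python sketch enumerates all 32 input combinations and confirms that the three outputs encode the number of ones at the inputs, i.e., $x_1 + x_2 + x_3 + x_4 + c_{in} = Sum + 2(Carry + C_{out})$. It also evaluates inequality (1) for the 4-2 compressor ($n = 4$, $c_1 = 1$). The function names (`exact_compressor`, `all_states_consistent`, `satisfies_carry_inequality`) are illustrative choices made here and are not part of the original design.

```python
from itertools import product


def exact_compressor(x1, x2, x3, x4, cin):
    """Return (sum, carry, cout) of the exact 4-2 compressor, following (2)-(4)."""
    p12 = x1 ^ x2                  # x1 XOR x2
    p1234 = p12 ^ x3 ^ x4          # x1 XOR x2 XOR x3 XOR x4
    s = p1234 ^ cin                                # equation (2)
    cout = (p12 & x3) | ((p12 ^ 1) & x1)           # equation (3)
    carry = (p1234 & cin) | ((p1234 ^ 1) & x4)     # equation (4)
    return s, carry, cout


def all_states_consistent():
    """True if sum + 2*(carry + cout) equals the number of input ones in all 32 states."""
    for bits in product((0, 1), repeat=5):
        s, carry, cout = exact_compressor(*bits)
        if s + 2 * (carry + cout) != sum(bits):
            return False
    return True


def satisfies_carry_inequality(n, carries):
    """Evaluate inequality (1); carries[j - 1] holds c_j (hypothetical argument layout)."""
    lhs = n + sum(carries)
    rhs = 3 + sum((2 ** j) * c for j, c in enumerate(carries, start=1))
    return lhs <= rhs


if __name__ == "__main__":
    print(all_states_consistent())
    print(satisfies_carry_inequality(4, [1]))
```

For the 4-2 compressor, inequality (1) is met with equality ($4 + 1 \le 3 + 2$), which is consistent with a single carry bit between adjacent slices being sufficient.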
986 IEEE TRANSACTIONS ON COMPUTERS, VOL. 64, NO. 4, APRIL 2015
TABLE 1
Truth Table of 4-2 Compressor
3.1 Design 1
As shown in Table 1, the carry output in an exact compressor has the same value as the input $c_{in}$ in 24 out of 32 states.
Fig. 4. Optimized 4-2 compressor of [8].
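The count of 24 out of 32 states can be reproduced by enumerating the exact carry expression (4) directly. The short Python sketch below is only an illustration; `exact_carry` and `count_carry_matches` are hypothetical helper names introduced here.

```python
from itertools import product


def exact_carry(x1, x2, x3, x4, cin):
    """Carry output of the exact 4-2 compressor, following (4)."""
    p = x1 ^ x2 ^ x3 ^ x4
    return (p & cin) | ((p ^ 1) & x4)


def count_carry_matches():
    """Number of the 32 input states in which the exact carry equals cin."""
    return sum(
        1
        for x1, x2, x3, x4, cin in product((0, 1), repeat=5)
        if exact_carry(x1, x2, x3, x4, cin) == cin
    )


if __name__ == "__main__":
    print(count_carry_matches(), "of 32 states")
```

The carry follows $c_{in}$ whenever $x_1 \oplus x_2 \oplus x_3 \oplus x_4 = 1$ (16 states) and, in the remaining states, whenever $x_4$ happens to equal $c_{in}$ (8 states).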
MOMENI ET AL.: DESIGN AND ANALYSIS OF APPROXIMATE COMPRESSORS FOR MULTIPLICATION 987
TABLE 2
Truth Table of the First Approximate 4-2 Compressor
In the last step, changing the value of $c_{out}$ in some states may reduce the error distance introduced by the approximate carry and sum, and may also allow further simplification of the proposed design.

$$C_{out}' = \overline{\overline{(x_1 + x_2)} + \overline{(x_3 + x_4)}} \qquad (7)$$

Although the above-mentioned simplifications of carry and sum increase the error rate in the proposed approximate compressor, its design complexity and therefore the power consumption are considerably decreased. This can be seen by comparing (2)-(4) and (5)-(7). Table 2 shows the truth table of the first proposed approximate compressor. It also shows the difference between the inexact output of the proposed approximate compressor and the output of the exact compressor. As shown in Table 2, the proposed design has 12 incorrect outputs out of 32 outputs (thus yielding an error rate of 37.5 percent). This is less than the error rate using the best approximate full-adder cell of [2].

Equations (5)-(7) are the logic expressions for the outputs of the first design of the approximate 4-2 compressor proposed in this manuscript.

The gate level structure of the first proposed design (Fig. 6) shows that the critical path of this compressor still has a delay of $3\Delta$, so it is the same as for the exact compressor of Fig. 5. However, the propagation delay through the gates of this design is lower than the one for the exact compressor. For example, the propagation delay in the XOR gate that generates both the XOR and XNOR signals in [8],

Fig. 6. Gate level implementation of Design 1.

3.2 Design 2
A second design of an approximate compressor is proposed to further increase performance as well as reduce the error rate. Since the carry and $c_{out}$ outputs have the same weight, the proposed equations for the approximate carry and $c_{out}$ in the previous part can be interchanged. In this new design, carry uses the right hand side of (7) and $c_{out}$ is always equal to $c_{in}$; since $c_{in}$ is zero in the first stage, $c_{out}$ and $c_{in}$ will be zero in all stages. So, $c_{in}$ and $c_{out}$ can be ignored in the hardware design. Fig. 7 shows the block diagram of this approximate 4-2 compressor and the expressions below describe its outputs.

$$Sum' = \overline{(x_1 \oplus x_2)} + \overline{(x_3 \oplus x_4)} \qquad (8)$$

$$Carry' = \overline{\overline{(x_1 + x_2)} + \overline{(x_3 + x_4)}} \qquad (9)$$

Note that (9) is the same as (7) and (8) is the same as (6) for $c_{in} = 0$. Fig. 8 shows the gate level implementation of the second proposed design. The delay of the critical path of this approximate design is $2\Delta$, so it is $1\Delta$ less than the previous designs; moreover, a further reduction in the number of gates is accomplished.
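The error rates of the two approximate designs can be checked by exhaustive enumeration. The Python sketch below is a minimal illustration under stated assumptions rather than the authors' evaluation flow. Design 1 is assumed to use $Carry' = c_{in}$ (as implied by the interchange described for Design 2) and $Sum' = \overline{c_{in}}\,(\overline{x_1 \oplus x_2} + \overline{x_3 \oplus x_4})$, an assumed form of (6) chosen so that it reduces to (8) for $c_{in} = 0$. The carry-type output of (7) and (9) is taken to be the function $(x_1 + x_2)(x_3 + x_4)$. An output is counted as incorrect when $Sum' + 2(Carry' + C_{out}')$ differs from the number of ones at the inputs. All function names are hypothetical.

```python
from itertools import product


def xnor(a, b):
    """Complement of a XOR b for single-bit operands."""
    return (a ^ b) ^ 1


def design1_value(x1, x2, x3, x4, cin):
    """Weighted output of approximate Design 1 (assumed forms of (5)-(7))."""
    carry = cin                                   # assumed (5): carry' = cin
    s = (cin ^ 1) & (xnor(x1, x2) | xnor(x3, x4))  # assumed (6)
    cout = (x1 | x2) & (x3 | x4)                  # assumed reading of (7)
    return s + 2 * (carry + cout)


def design2_value(x1, x2, x3, x4):
    """Weighted output of approximate Design 2 with cin = cout = 0."""
    s = xnor(x1, x2) | xnor(x3, x4)               # equation (8)
    carry = (x1 | x2) & (x3 | x4)                 # assumed reading of (9)
    return s + 2 * carry


def count_errors(approx, n_inputs):
    """Return (incorrect outputs, total states) for an approximate compressor."""
    errors = sum(
        1
        for bits in product((0, 1), repeat=n_inputs)
        if approx(*bits) != sum(bits)
    )
    return errors, 2 ** n_inputs


if __name__ == "__main__":
    print("Design 1:", count_errors(design1_value, 5))
    print("Design 2:", count_errors(design2_value, 4))
```

Under these assumptions the enumeration gives 12 incorrect outputs out of 32 for Design 1, in agreement with the 37.5 percent quoted for Table 2, and 4 incorrect outputs out of 16 for Design 2 with $c_{in} = 0$.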
TABLE 3
Truth Table of Second Proposed 4-2 Compressor
TABLE 4
Simulation Results @32 nm)
TABLE 6
Delay Improvement in Reduction Circuitry
TABLE 7
Power Consumption Improvement in Reduction Circuitry
TABLE 8
Transistor Count Improvement in Reduction Circuitry
TABLE 9
Approximate Multipliers and Their Features
Fig. 11. Image multiplication (a) example 1, (b) example 2 (both using an
exact multiplier).
TABLE 11
PSNR and Average NED for First Example
TABLE 12
PSNR and Average NED for Second Example
TABLE 13
Ranking of Approximate Multipliers
multiplier that uses the second proposed approximate compressor for all bits, has the best delay.
With relatively modest reductions in transistor count and power consumption, the third and fourth proposed multipliers have very low average NED values, thus presenting the best tradeoff for energy with accuracy.

Moreover, the application of these approximate multipliers to image processing has confirmed that two of the proposed designs achieve a PSNR of nearly 50 dB in the output generated by multiplying two input images, making them viable for most applications.

Table 13 compares the four proposed approximate designs with four other approximate designs found in the technical literature by ranking them under various metrics. Multiplier 4 is overall the best design with respect to all figures of merit for approximate multiplication as well as the two PSNR examples. Multiplier 5 has the best performance in terms of Max High NED and number of correct outputs; however, its rather poor performance for the other figures of merit causes its ranking to be in the middle once the PSNR examples are considered. Multiplier 3 is the second best design among the schemes considered in this manuscript. It offers overall good performance in most metrics of Table 13. Current and future research addresses the tradeoffs of the different figures of merit in the proposed designs to establish conditions by which combined metrics can be attained. Moreover, physical designs of the approximate multipliers are being pursued to further confirm the analysis presented in this paper.

In conclusion, this paper has shown that by an appropriate design of an approximate compressor, multipliers can be designed for inexact computing; these multipliers offer significant advantages in terms of both circuit-level and error figures of merit. Although not discussed and beyond the scope of this manuscript, the proposed designs may also be useful in other arithmetic circuits for applications in which inexact computing can be used. The provision of an error indicator (as required for other applications) is a topic of current investigation.

REFERENCES
[3] S. Cheemalavagu, P. Korkmaz, K. V. Palem, B. E. S. Akgul, and L. N. Chakrapani, "A probabilistic CMOS switch and its realization by exploiting noise," presented at the IFIP Int. Conf. Very Large Scale Integ., Perth, Australia, Oct. 2005.
[4] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, "Bio-inspired imprecise computational blocks for efficient VLSI implementation of soft-computing applications," IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 57, no. 4, pp. 850-862, Apr. 2010.
[5] M. J. Schulte and E. E. Swartzlander Jr., "Truncated multiplication with correction constant," in Proc. Workshop VLSI Signal Process. VI, 1993, pp. 388-396.
[6] E. J. King and E. E. Swartzlander Jr., "Data dependent truncated scheme for parallel multiplication," in Proc. 31st Asilomar Conf. Signals, Circuits Syst., 1998, pp. 1178-1182.
[7] P. Kulkarni, P. Gupta, and M. D. Ercegovac, "Trading accuracy for power in a multiplier architecture," J. Low Power Electron., vol. 7, no. 4, pp. 490-501, 2011.
[8] C. Chang, J. Gu, and M. Zhang, "Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits," IEEE Trans. Circuits Syst., vol. 51, no. 10, pp. 1985-1997, Oct. 2004.
[9] D. Radhakrishnan and A. P. Preethy, "Low-power CMOS pass logic 4-2 compressor for high-speed multiplication," in Proc. IEEE 43rd Midwest Symp. Circuits Syst., 2000, vol. 3, pp. 1296-1298.
[10] Z. Wang, G. A. Jullien, and W. C. Miller, "A new design technique for column compression multipliers," IEEE Trans. Comput., vol. 44, no. 8, pp. 962-970, Aug. 1995.
[11] J. Gu and C. H. Chang, "Ultra low-voltage, low-power 4-2 compressor for high speed multiplications," in Proc. 36th IEEE Int. Symp. Circuits Syst., Bangkok, Thailand, May 2003, pp. v-321-v-324.
[12] M. Margala and N. G. Durdle, "Low-power low-voltage 4-2 compressors for VLSI applications," in Proc. IEEE Alessandro Volta Memorial Workshop Low-Power Design, 1999, pp. 84-90.
[13] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, 2nd ed. London, U.K.: Oxford Univ. Press, 2010.
[14] K. Prasad and K. K. Parhi, "Low-power 4-2 and 5-2 compressors," in Proc. 35th Asilomar Conf. Signals, Syst. Comput., 2001, vol. 1, pp. 129-133.
[15] M. D. Ercegovac and T. Lang, Digital Arithmetic. Amsterdam, The Netherlands: Elsevier, 2003.
[16] D. Baran, M. Aktan, and V. G. Oklobdzija, "Energy efficient implementation of parallel CMOS multipliers with improved compressors," in Proc. ACM/IEEE 16th Int. Symp. Low Power Electron. Design, 2010, pp. 147-152.
[17] D. Kelly, B. Phillips, and S. Al-Sarawi, "Approximate signed binary integer multipliers for arithmetic data value speculation," in Proc. Conf. Design Architect. Signal Image Process., 2009, pp. 97-104.
[18] J. Ma, K. Man, T. Krilavicius, S. Guan, and T. Jeong, "Implementation of high performance multipliers based on approximate compressor design," presented at the Int. Conf. Electrical and Control Technologies, Kaunas, Lithuania, 2011.
Jie Han (S'02-M'05) received the BSc degree in electronic engineering from Tsinghua University, Beijing, China, in 1999, and the PhD degree from the Delft University of Technology, The Netherlands, in 2004. He is currently an assistant professor in the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada. His research interests include reliability and fault tolerance, nanoelectronic circuits and systems, novel computational models for nanoscale and biological applications. He was nominated for the 2006 Christiaan Huygens Prize of Science by the Royal Dutch Academy of Science (Koninklijke Nederlandse Akademie van Wetenschappen (KNAW) Christiaan Huygens Wetenschapsprijs). His work was recognized by the 125th anniversary issue of Science, for developing theory of fault-tolerant nanocircuits. He was a Technical Program chair and a general chair in IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT) 2012 and 2013, respectively. He was also a technical program committee member in several other international symposia and conferences. He is a member of the IEEE.

Fabrizio Lombardi (M'81-SM'02-F'09) received the BSc (Hons.) degree in electronic engineering in 1977 from the University of Essex, United Kingdom and the master's degree in microwaves and modern optics from Microwave Research Unit, University College London, in 1978, the diploma in microwave engineering in 1978 and the PhD degree from the University of London in 1982. He is currently the holder of the International Test Conference (ITC) Endowed chair professorship at Northeastern University, Boston. During 2007-2010 he was the editor-in-chief of the IEEE Transactions on Computers. He is also an associate editor of the IEEE Transactions on Nanotechnology and the inaugural editor-in-chief of the IEEE Transactions on Emerging Topics in Computing. He currently is an elected member of the Board of Governors of the IEEE Computer Society. His research interests include bio-inspired and nano manufacturing/computing, VLSI design, testing, and fault/defect tolerance of digital systems. He has extensively published in these areas and coauthored/edited seven books. He is a fellow of the IEEE.