
Speed Improvement Algorithm for 16×16 Multipliers using Sizing Optimization

B. Eghbalkhah, B. Afzal and A. Afzali-Kusha


Low-Power High-Performance Nanosystems Laboratory, ECE Department, University of Tehran, Tehran, Iran

Abstract-In this paper, the speed improvement of a 16×16 multiplier is addressed through sizing of the transistors used in the multiplying blocks. A genetic algorithm (GA) is used to calculate the appropriate channel width W for the transistors. Modifying the W/L ratio of the transistors reduces the multiplier delay by up to 16 percent under different supply voltages and technologies, relative to transistors with non-optimized but common W/L ratios. The algorithm is implemented in Matlab, and circuit simulations are performed with HSpice for 0.18 um, 0.13 um, 100 nm and 70 nm static CMOS technologies. The multiplier is simulated with different supply voltages in each technology.

I. INTRODUCTION

Digital signal processors and application-specific integrated circuits rely on the efficient implementation of arithmetic circuits to execute dedicated algorithms such as convolution, correlation and digital filtering [1]. For example, in the transposed direct form structure of FIR filters, m multipliers (where m is the number of filter taps) are required for fast filtering [2]. The multiplier count in an echo-cancellation system is 128, which leads to a considerable circuit delay: assuming a delay of T for each multiplier, the direct form structure causes a delay of 128T in the whole system. Hence, even a small speed improvement in the multipliers increases the speed of these filters effectively.

The speed improvement methods for multipliers fall mainly into two groups. In the first group, the optimization is done by manipulating the sub-blocks of the multiplier, such as full adders and logic gates [3]. In the second group, a new multiplier structure is proposed based on the bus width and other parameters [4]. In this paper, an efficient algorithm is used to minimize the delay of the whole multiplier block by minimizing the delay of the intermediate stages via transistor sizing.

The rest of the paper is organized as follows. Section II discusses the 16×16 multiplier design and its sub-blocks, and Section III calculates the carry and sum generation delays. Section IV introduces the proposed method, based on a genetic algorithm, for minimizing the delay of the intermediate stages. The simulation results are discussed in Section V. Finally, the summary and conclusions are given in Section VI.

II. 16×16 MULTIPLIER STRUCTURE

The multiplier designed in this paper is a 16×16 multiplier with a Wallace tree structure [5]. Fig. 1 illustrates the 32 stages of this multiplier symbolically; each stage has the input vectors uij and ci, where ci is the carry vector generated in the previous stage.

Fig. 1. Stages of Wallace Tree multiplier used in this paper.

In each stage, the uij vectors are assumed to be ready at time zero, but the delay of the ci vectors differs from stage to stage. In the structure shown in Fig. 1, stage n has n elements of uij and (n-2) elements of ci for stages 2 to 16. The 17th stage has 15 elements of uij and 15 elements of ci, while for stage numbers between 18 and 31, (32-n) elements of uij and (32-n+2) elements of ci are needed in that stage to calculate the carry vector for the next stage. The equivalence of the bits within one stage of the Wallace tree structure is exploited to minimize the carry generation delay for the next stage by choosing an appropriate order of summation.

The full adder is the core element of complex arithmetic circuits such as tree-structured multipliers. The full adder circuit used in our multiplier consists of two XOR gates and a carry generation circuit. The corresponding equations for the full adder circuit are as follows:

$$V_{o1} = V_1 \oplus V_2 \tag{1}$$

$$V_s = V_{o1} \oplus V_3 \tag{2}$$

$$V_c = V_1 \cdot \overline{V_{o1}} + V_{o1} \cdot \overline{V_s} \tag{3}$$
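The full adder equations can be checked exhaustively against binary addition. The sketch below is only an illustration: the function names are hypothetical, and the carry expression assumes that the carry circuit passes V1 when Vo1 = 0 and the complement of Vs when Vo1 = 1, which is the reading of equation (3) that yields a correct full adder.

```python
from itertools import product


def full_adder(v1: int, v2: int, v3: int) -> tuple[int, int]:
    """Return (Vs, Vc) computed from the XOR-based full adder equations."""
    vo1 = v1 ^ v2                      # equation (1): first XOR gate
    vs = vo1 ^ v3                      # equation (2): second XOR gate
    # Equation (3), assumed form: select V1 when Vo1 = 0, NOT Vs when Vo1 = 1.
    vc = (v1 & (1 - vo1)) | (vo1 & (1 - vs))
    return vs, vc


def matches_arithmetic() -> bool:
    """Check that 2*Vc + Vs equals V1 + V2 + V3 for all eight input cases."""
    for v1, v2, v3 in product((0, 1), repeat=3):
        vs, vc = full_adder(v1, v2, v3)
        if 2 * vc + vs != v1 + v2 + v3:
            return False
    return True


if __name__ == "__main__":
    for bits in product((0, 1), repeat=3):
        print(bits, full_adder(*bits))
    print("consistent with binary addition:", matches_arithmetic())
```

Because the carry is derived from Vo1 and Vs rather than from the raw inputs, the sum bit drives both the next XOR gate and the carry circuit, which is why its load matters for sizing.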

This work is financially supported by Iranian Telecommunication Research Center (ITRC).

Fig. 2(a)-2(c) illustrate the symbols used for the XOR gate, the carry generation circuit and the full adder constructed from these circuits. Fig. 2(d) shows the addition chain in a stage, in which the sum bit of one full adder is used as an input to the next one in addition to two independent inputs.

1-4244-1278-1/07/$25.00 © 2007 IEEE

Authorized licensed use limited to: Isfahan University of Technology. Downloaded on August 17,2010 at 06:08:43 UTC from IEEE Xplore. Restrictions apply.

Fig. 2. Symbols for (a) XOR gate, (b) carry generation circuit, (c) full adder and (d) summation chain.

The circuit shown in Fig. 3 is used to implement the XOR gates [6]. With some modifications, this XOR gate can also be used as the carry generation circuit, which makes the layout of the multiplier more regular. As illustrated in Fig. 2(d), V1, V2, ..., Vn are the carry bits from the previous stage and the uij elements of the current stage. Vs1 and Vs1n (in general, Vsi and Vsin) are the outputs of the second XOR gate, which must be able to charge the gate capacitors of two NMOS transistors and two PMOS transistors at the input of the XOR gate of the next full adder in the same stage. In addition, these bits must be able to charge the gate capacitors of one NMOS and one PMOS transistor in the carry generation circuits of the first and second full adders. Note that V1, V2, ..., Vn are only used as inputs to the XOR gates.

Fig. 3. XOR gate used in the full adder block [6].

Based on the above, V1, V2, ..., Vn are chosen from the carry bits coming from previous stages, whereas the Vs bits are the ones applied to the inputs of both the XOR gate and the carry generation circuit. As a result, the transistors of the carry generation circuit need a smaller channel width (W) than the transistors in the XOR gates, because the carry bits need less current drive capability. This method also decreases the gate capacitances of the carry generation circuit, and consequently the Vs bits are able to charge the capacitors connected to them in less time.

The objective of this paper is to determine the channel widths of the NMOS and PMOS transistors in the XOR gates (Wnx and Wpx) and in the carry generation circuits (Wnc and Wpc) that increase the speed of the full adder blocks and, as a result, the speed of the multiplier.

III. DELAY CALCULATION

A. Average Delay Calculation

Table I shows the delay values for the generation of Vs and Vc for all possible input values.

Table I
Delay values for Vs and Vc in the full adder

V1   V2   V3   Vs Delay    Vc Delay
0    0    0    2txd        2txs+tcd
0    0    1    txd+txs     txd+tcd
0    1    0    2txs        2txd+tcd
0    1    1    txd+txs     txd+txs+tcs
1    0    0    2txs        2txs+tcd
1    0    1    txd+txs     txd+txs+tcs
1    1    0    2txd        txd+tcs
1    1    1    txd+txs     txd+tcs+txs

txs: charging delay for the XOR gate
txd: discharging delay for the XOR gate
tcs: charging delay for the carry circuit
tcd: discharging delay for the carry circuit

According to the values presented in Table I, the average delays for carry and sum generation in a full adder are:

$$F = t_c = \frac{3t_{xd} + 3t_{xs} + 2t_{cs} + 2t_{cd}}{4} \tag{4}$$

$$t_s = t_{xd} + t_{xs} \tag{5}$$

In this work, tc is minimized using a genetic algorithm to improve the speed of the multiplier. The goal of the algorithm is thus to minimize F(Wnx, Wpx, Wnc, Wpc), the average delay of the carry generation circuit, by finding appropriate channel widths for the transistors. Since there is no simple and accurate relation between the delay and the channel widths of the transistors, a genetic algorithm is used to address this optimization problem.

B. Delay Measurement with Simulations

In order to find the delay of the XOR gate and the carry generation circuit, V2 is fixed to VDD. This is done because both circuits have two inputs and only the effect of one input should be considered in the delay calculation. The block diagram shown in Fig. 4(a), with loading capacitors added, is used to create a quasi-real condition for finding the maximum delay of the XOR gate. This simulation structure is used to find the maximum charging and discharging delays for a single bit. As shown in Fig. 3, an inverter delay is needed to generate VXOR from VXNOR. Thus, in Fig. 4(a), V0 is changed in a way that causes V1n to change, and the delay of V2 is then measured with respect to V1n, so that the inverter delay is included in the measurement. Fig. 4(b) illustrates the simulation structure for the carry generation delay measurement. Similar considerations apply to the measurement of the maximum delay of the carry generation circuit.
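Once the four delays have been measured, the averages of equations (4) and (5), tc = (3txd + 3txs + 2tcs + 2tcd)/4 and ts = txd + txs, follow directly. The sketch below shows that arithmetic only; the function names and the numeric delays are hypothetical and are not values from the paper.

```python
def average_carry_delay(t_xd: float, t_xs: float, t_cs: float, t_cd: float) -> float:
    """Equation (4): F = tc = (3*txd + 3*txs + 2*tcs + 2*tcd) / 4."""
    return (3 * t_xd + 3 * t_xs + 2 * t_cs + 2 * t_cd) / 4


def average_sum_delay(t_xd: float, t_xs: float) -> float:
    """Equation (5): ts = txd + txs."""
    return t_xd + t_xs


if __name__ == "__main__":
    # Hypothetical delays in picoseconds, chosen only to exercise the formulas.
    measured = {"t_xd": 40.0, "t_xs": 60.0, "t_cs": 30.0, "t_cd": 50.0}
    print(average_carry_delay(**measured))                          # 115.0
    print(average_sum_delay(measured["t_xd"], measured["t_xs"]))    # 100.0
```

In an optimization loop, the value returned by `average_carry_delay` would play the role of the goal function F for one candidate set of channel widths.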

Fig. 4. Simulation structure for delay measurement of (a) XOR gate, (b) carry generation circuit.

IV. GENETIC ALGORITHM

A. The Structure of Genes

Each gene consists of four parts, each standing for one of the channel widths in F. For simplicity, the ith part of the rth gene (Cr[i]) is chosen between 0 and 1. The relation between Cr[i] and W is:

$$W = \left(2\,C_r[i] + \alpha\right)\ \mu\text{m} \tag{6}$$

Note that equation (6) is defined so that the maximum value of each W is (2+α) um, which prevents the power consumption of the block from increasing seriously. The value of α depends on the technology used for the circuit implementation and varies between 0.08 for the 70 nm process and 0.35 for the 0.18 um process. The minimum value of each W is therefore at least 0.08 um, and each Cr[i] is represented with 10 bits. These assumptions lead to:

$$\min(\Delta W) \approx 0.002\ \mu\text{m} \tag{7}$$

This amount of variation meets the process limitations. Each gene is made up of 40 bits, and the generation count is chosen to be twice the number of bits in a gene. The values of the genes in the first generation are random values.

B. Fitness Function

The fitness function gr(Wnx, Wpx, Wnc, Wpc) is defined for each gene in a way that minimizes the goal function. Equation (8) defines this fitness function.

$$g_r = \frac{A}{F_r - L + \sigma/5} \tag{8}$$

Here A, L and σ are the average value, the minimum and the variance of Fr, respectively. Clearly, a smaller Fr value makes the corresponding gene more valuable. The term σ/5 is added to the denominator not only to prevent it from becoming zero but also to reduce the large effect of Fr-L in the first generations, in which the variance of Fr is high. The algorithm is implemented in Matlab, and HSpice is called to measure the corresponding delays and the fitness function for each gene. Since the roulette wheel algorithm is chosen as the selection function, a gene with a better fitness value has a greater chance of being selected. After the selection of 40 genes, 40 other genes are generated from them. The crossover process is performed for each gene at four points, as depicted in Fig. 5.

Fig. 5. Crossover process in the ith part of a gene.

C. Mutation

To make the algorithm converge to an appropriate answer and to prevent it from being trapped in local extrema, a mutation function is used. The mutation count in the first generation is 3% of the population, and mutation occurs with probability PM. In subsequent generations the mutation probability is reduced according to $P_M = 0.03 / 1.08^{p-1}$, where p is the generation index, to retain the stability of the algorithm.

V. SIMULATION RESULTS

The genetic algorithm used in our work converges to the optimized answer in fewer than 20 generations. Matlab and HSpice are jointly used to simulate and measure the delay values and to find the optimum answer to the problem. As shown in Fig. 6, the delay of the full adder block is reduced to 0.26 ns, which leads to a delay of 1.4 ns for the 16×16 multiplier with a 1.8 V power supply in 0.18 um CMOS technology. This makes the multiplier much faster than multipliers such as the one introduced in [7], which has a delay of 2.5 ns. As an example, the channel widths (W) of the transistors before and after applying the genetic algorithm, together with the corresponding delay values for the 0.18 um process with a 1.8 V supply voltage, are presented in Table II.
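The building blocks of the genetic algorithm described in Section IV can be sketched in a few lines. This is not the authors' Matlab implementation: the function names are hypothetical, the scaling of the 10-bit code onto [0, 1], the use of the population variance, the population size of 80 and the handling of identical delays are all assumptions, and the delay values in the demo are a placeholder for the HSpice measurement rather than a circuit model.

```python
import random
from statistics import mean, pvariance

BITS_PER_PART = 10   # each Cr[i] is represented with 10 bits
PARTS = 4            # one part per width: Wnx, Wpx, Wnc, Wpc


def decode_widths(gene, alpha):
    """Decode a 40-bit gene into four widths (um) with W = 2*Cr[i] + alpha."""
    if len(gene) != BITS_PER_PART * PARTS:
        raise ValueError("gene must contain 40 bits")
    widths = []
    for i in range(PARTS):
        part = gene[i * BITS_PER_PART:(i + 1) * BITS_PER_PART]
        code = int("".join(str(bit) for bit in part), 2)
        c = code / (2 ** BITS_PER_PART - 1)   # assumption: codes span [0, 1]
        widths.append(2.0 * c + alpha)
    return widths


def fitness_values(delays):
    """Fitness g_r = A / (F_r - L + sigma / 5) for every goal value F_r."""
    a = mean(delays)
    low = min(delays)
    sigma = pvariance(delays)                 # assumption: population variance
    if sigma == 0:
        return [1.0] * len(delays)            # identical delays: equal weights (assumption)
    return [a / (f - low + sigma / 5) for f in delays]


def roulette_select(population, fitness, count, rng):
    """Draw `count` genes with probability proportional to their fitness."""
    return rng.choices(population, weights=fitness, k=count)


def mutation_probability(p):
    """PM = 0.03 / 1.08**(p - 1) for generation index p = 1, 2, ..."""
    return 0.03 / 1.08 ** (p - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed population of 80 genes: 40 selected plus 40 generated from them.
    population = [[rng.randint(0, 1) for _ in range(40)] for _ in range(80)]
    # Placeholder for the HSpice delay measurement; NOT a real delay model.
    delays = [1.0 / sum(decode_widths(g, alpha=0.35)) for g in population]
    parents = roulette_select(population, fitness_values(delays), 40, rng)
    print(len(parents), round(mutation_probability(1), 4))
```

With a 10-bit code, one step of the decoded width is 2/1023 um, roughly 0.002 um, which agrees with the resolution stated in equation (7).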


Table II
Simulation results and comparison with non-optimized values

                       Multiplier Delay (ns)   Wnc (um)    Wpc (um)    Wnx (um)    Wpx (um)
Before Optimization    1.66                    0.35        0.7         0.4         0.8
After Optimization     1.4                     0.386720    0.404648    0.625430    0.677461
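The speed improvement implied by the delays in Table II can be reproduced with a one-line calculation. The helper name below is hypothetical; the two delays are the 1.66 ns and 1.4 ns values from the table.

```python
def improvement_percent(before_ns: float, after_ns: float) -> float:
    """Relative delay reduction, in percent of the non-optimized delay."""
    return 100.0 * (before_ns - after_ns) / before_ns


if __name__ == "__main__":
    # Table II: 0.18 um process, 1.8 V supply.
    print(f"{improvement_percent(1.66, 1.4):.2f} %")   # prints 15.66 %
```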

Table III shows the delay of the multiplier under different conditions (three supply voltages for each of four CMOS technologies) before and after using the genetic algorithm. According to the data in Table III, as the supply voltage of the circuit is reduced, the delay of the multiplier increases.

Table III
Simulation results and comparison for multiplier delay under various supply voltages and different processes

Process   VDD (V)   Delay before opt. (ns)   Delay after opt. (ns)   Improvement
180 nm    2.5       1.53                     1.3                     15.03 %
180 nm    1.8       1.66                     1.4                     15.66 %
180 nm    1.5       2.75                     2.35                    14.55 %
130 nm    1.7       1.37                     1.21                    11.68 %
130 nm    1.3       1.76                     1.53                    13.07 %
130 nm    0.9       3.13                     2.73                    12.78 %
100 nm    1.3       1.13                     0.97                    14.16 %
100 nm    1.0       1.51                     1.30                    13.91 %
100 nm    0.7       2.65                     2.27                    14.34 %
70 nm     0.8       1.11                     0.94                    15.32 %
70 nm     0.7       1.41                     1.20                    14.89 %
70 nm     0.5       3.56                     3.01                    15.45 %

Fig. 6. (a) Charging and (b) discharging delays for the XOR gate and carry generation circuit (axes: V versus T in ps; traces V1, V2 and Vc2).

Table III also shows an 11.68 to 15.66 percent improvement in multiplier speed, which makes our multiplier faster than the one proposed in [8]. Note that our multiplier in the 100 nm process with a 1.3 V supply voltage, with a delay of 0.97 ns, is faster than the multiplier proposed in [8], which has a delay of 1 ns and is implemented in a 90 nm process.

VI. SUMMARY AND CONCLUSION

A speed improvement algorithm for a 16×16 multiplier is introduced in this paper. Transistor sizing is performed using a genetic algorithm to find the best values for the channel widths of the transistors in order to reduce the carry generation delay. Since there is no simple and accurate relation between the delay and the channel widths of the transistors, a genetic algorithm is used to address this optimization problem. Simulation results show an 11.68 to 15.66 percent improvement in the speed of the 16×16 multiplier under various supply voltages and different technology nodes.

REFERENCES

[1] K. E. Khamei, A. Nabavi, and S. Hessabi, "Design of variable fractional delay FIR filters using genetic algorithm," in IEEE Proceedings of Electronics, Circuits and Systems, vol. 1, pp. 48-51, Dec. 2003.
[2] J. Park, K. Muhammad, and K. Roy, "High Performance FIR Filter Design Based on Sharing Multiplication," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 11, no. 2, April 2003.
[3] K. Yano, T. Yamanaka, T. Nishida, M. Saito, K. Shimohigashi, and A. Shimizu, "A 3.8-ns CMOS 16×16-b Multiplier Using Complementary Pass-Transistor Logic," IEEE Journal of Solid-State Circuits, vol. 25, no. 2, April 1990.
[4] R. K. Kolagotla, H. R. Srinivas, and G. F. Burns, "VLSI Implementation of a 200-MHz 16x16 Left-to-Right Carry-Free Multiplier in 0.35 um CMOS Technology for next-generation DSPs," in Proc. IEEE 1997 Custom Integrated Circuits Conf., pp. 469-472, May 1997.
[5] C. S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Transactions on Electronic Computers, vol. EC-13, pp. 14-17, February 1964.
[6] C. C. Yu, W. P. Wang, and B. D. Liu, "A 3-input XOR/XNOR for Low-Voltage Low-Power Applications," in Proceedings of the 2000 IEEE Asia Pacific Conference on Circuits and Systems, pp. 505-508, December 2000.
[7] S. Roberta, W. Snyder, H. Chin, H. Hingarh, S. Leibiger, R. Labri, L. Bouknight, and M. Bisval, "A 2.5 ns ECL 16×16 Multiplier," in IEEE 1990 Custom Integrated Circuits Conference.
[8] B. R. Zeydel, V. G. Oklobdzija, S. Mathew, R. K. Krishnamurthy, and S. Borkar, "A 90 nm 1 GHz 22 mW 16×16-bit 2's complement multiplier for wireless baseband," in IEEE Proc. Int. Symposium on VLSI Circuits, pp. 235-236, June 2003.
