
6B-1

Low-Power Domino Circuits using NMOS Pull-up on Off-critical Paths


Abdulkadir U. Diril, Yuvraj S. Dhillon, and Abhijit Chatterjee
Georgia Institute of Technology, Atlanta, GA
{utku, yuvrajsd, chat}@ece.gatech.edu

Abstract - Domino logic is used extensively in high-speed
microprocessor datapath design. Although domino gates have
small propagation delay, they consume relatively more power.
We propose a scheme to reduce the power consumption of
combinational domino logic blocks while maintaining the
performance. We replace the PMOS precharge transistor with
an NMOS transistor to reduce the overall power consumption of
the gate at the expense of higher delay. We use a heuristic
algorithm to replace the fast, high power gates on the off-critical
paths with slower, low power gates while maintaining the circuit
performance. Our technique reduces the dynamic energy of
ISCAS85 circuits by 16.25% on average.

I. Introduction
Domino logic is used extensively in high-speed circuit
design. The main reason for the higher performance of
domino logic compared to static CMOS is the reduced input
capacitance seen by the driver gates in domino logic [1]. In
CMOS, both PMOS and NMOS transistors are driven at the
input stage whereas in domino logic, only NMOS transistors,
which have lower gate capacitance than PMOS, are driven.
The higher performance of domino logic comes at the
expense of higher power consumption. The switching activity
in domino circuits is, on the average, double that in CMOS
circuits. This leads to higher power consumption even though
the switched capacitance is lower.
The use of dual supply voltage promises to be an effective
way of reducing power consumption in digital circuits
[7][8][9][10]. However, its implementation in CMOS logic
necessitates the use of a level shifter whenever a low voltage
gate drives a high voltage gate. In the absence of a level
shifter, the PMOS transistors in the high voltage gate are not
completely turned off by the high output of the low voltage
gate. This causes significant energy wastage due to high
current flow from supply voltage to ground. The level shifters
limit the logic granularity at which dual voltages can be used,
which in turn reduces the effectiveness of using dual voltages
in CMOS logic. Clustered Voltage Scaling (CVS) [7][8] and
module level voltage scaling [9][10] are used in CMOS to
reduce the overhead of level shifters. Domino logic, however,
does not have this problem since there are no PMOS
transistors that are driven by the previous logic gates, making
it possible to use dual supply voltages at the gate level.
Shieh et al. [2] use dual supply voltages, gate sizing, and
a contention-alleviated static keeper (CASK) to reduce power
consumption in domino circuits while keeping the delay
fixed. This approach needs two separate supply voltages for
the gates and a bias voltage for the CASK circuitry, which is
used to speed up $V_{DDL}$ to $V_{DDH}$ interfaces. Jung et al. [3] use
dual supply voltages and dual threshold voltages together
with a low voltage swing clock in order to reduce the power
consumption of domino circuits. A separate back-biasing
voltage is also used to be able to turn off the pull-up PMOS
completely when a low voltage swing clock is applied to a gate
with high supply voltage.

Adit D. Singh
Auburn University, Auburn, AL
adsingh@eng.auburn.edu

In this paper, we propose a scheme to lower the power
consumption of combinational domino logic blocks while
maintaining the performance. The basic idea is to replace the
standard domino gates on the off-critical paths by low-power
(but higher delay) domino gates. The low-power domino
gates use a novel technique to effectively operate at a lower
supply voltage, as explained in detail in Section 3. In brief,
we replace the PMOS pull-up transistor in Figure 1(a) with an
NMOS transistor (Figure 1(b)) that leads to reduced voltage
swing at the input of the output inverter. This node has a high
capacitance in domino logic gates to eliminate problems due
to charge sharing. So, reducing the voltage applied to this
node reduces the energy consumption when that node is
charging or discharging. Henceforth, we will refer to these
two different types of gates as PPD (PMOS Pull-up Domino)
and NPD (NMOS Pull-up Domino) gates.
The main contributions of this paper are:
• A novel circuit design style for domino logic gates that
  makes it possible to exploit the advantages of dual supply
  voltage usage without the necessity of a second supply
  voltage or level shifters.
• A novel algorithm to determine the high-power gates to
  be swapped with their low-power (but higher delay)
  equivalents for optimal energy saving.

II. Domino Logic Preliminaries


Figure 1(a) shows a 3-input AND gate implemented in domino logic style. A domino gate consists of a precharge transistor (Mp) which charges the input capacitance of the output inverter (Node x in Figure 1(a)) during the precharge phase. During the evaluate phase, the evaluate transistor (Me) is turned on and provides a path for charge flow from Node x to ground depending on the values of the inputs. Domino logic circuits may suffer from charge sharing and charge leakage if they are not designed properly. Generally, Node x is designed to have a higher capacitance than other nodes in order to reduce the effects of charge sharing and charge leakage during the evaluate phase.
To eliminate charge sharing during the precharge phase, the inputs to the gates should all be low. This is achieved by using a CMOS logic inverter at the output stage of every gate that drives the inputs of the following stages low during precharge. In domino logic, there are only two types of switching possible at the output in every evaluate-precharge cycle. The output of a gate either remains at logic 0 during the whole cycle (if the gate evaluates to 0) or a 0 to 1 transition followed by a 1 to 0 transition is observed (if the gate evaluates to 1). Therefore, the energy consumption of a domino gate in an evaluate-precharge cycle is either $E^{00}$ for a 0-0 transition at the output or $E^{010}$ for a 0-1-0 transition at the output. The propagation delay of a domino gate is the delay of the 0-1 transition in the evaluate phase when the gate evaluates to 1.
Due to the absence of a PMOS network, domino logic does not require a level shifter when a low voltage gate drives a high voltage gate. This makes domino logic very suitable for multiple supply voltage based designs. Different voltages can be applied at gate-level granularity. However, the problem of generating multiple supply voltages and routing these voltages to different gates still remains.

Figure 1. Domino logic 3-input AND gate with PMOS (a) and NMOS (b) pull-up
Figure 2. Variation of propagation delay for NPD AND gate with width of extra NMOS transistor

This work was supported by NSF Information Technology Research Contract, CCR 022-0259.
0-7803-8736-8/05/$20.00 ©2005 IEEE.
ASP-DAC 2005

Figure 3. Variation of $E^{010}$ for NPD AND gate with width of extra NMOS transistor

III. NMOS Pull-up Domino Logic


To solve the problem of generating and routing two different voltage supplies, we propose to use NMOS pull-up transistors in the gates that are not on the critical path to obtain a pseudo dual voltage operation as shown in Figure 1(b). Node x in the figure is charged up to $V_{DD} - V_{TH} - \Delta V_{TH}$ because of the limitation of the NMOS transistor when passing a high voltage. The $\Delta V_{TH}$ term is the threshold voltage modification seen on the NMOS pull-up transistor due to the body-bias effect. The reduction of operating voltage at Node x reduces the switching power dissipation in the node capacitance. This reduction is a significant fraction of the total power consumption of the gate since this capacitance is generally larger than the other node capacitances as explained in Section 2. Note that an inverted clock signal is applied as the precharge signal to the NMOS pull-up transistor in NPD.

Figure 4. Variation of $E^{00}$ for NPD AND gate with width of extra NMOS transistor
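As a rough first-order illustration of why the reduced swing at Node x saves energy, the standard switched-capacitor estimate can be written out. This is textbook reasoning rather than a result taken from the SPICE data in this paper; $C_x$ (the capacitance of Node x) is a symbol introduced here for illustration, and short-circuit and leakage currents are ignored.

```latex
\begin{align}
E_{\mathrm{PPD}} &\approx C_x V_{DD}^{2}, \\
E_{\mathrm{NPD}} &\approx C_x V_{DD}\,\bigl(V_{DD} - V_{TH} - \Delta V_{TH}\bigr), \\
\frac{E_{\mathrm{NPD}}}{E_{\mathrm{PPD}}} &\approx 1 - \frac{V_{TH} + \Delta V_{TH}}{V_{DD}}.
\end{align}
```

For example, with assumed values $V_{DD} = 1.8$ V and $V_{TH} + \Delta V_{TH} = 0.6$ V (chosen only to make the arithmetic concrete), the ratio is about 0.67, so the energy drawn from the supply to precharge Node x would drop by roughly one third under this simplified model.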
Applying an input voltage that is smaller than the supply voltage to the inverter results in a large standby current in the inverter because the inverter PMOS transistor is not completely turned off. This problem is solved by adding an always-on NMOS transistor between the power supply and the PMOS transistor. This reduces the source voltage of the PMOS transistor and enables us to turn it completely off when Node x is at high voltage.
Since the rest of the gate (the NMOS network and the CMOS inverter) now has to operate at a lower voltage than VDD, the delay of the NPD gate is increased as compared to the PPD gate. Furthermore, the delays of the gates driven by the NPD gate are also increased because of the reduced driving voltage output by the NPD gate.
The delay increase of the NPD gates can be reduced by increasing the driving capability of the NMOS transistor in the inverter pull-up. We varied the width of this transistor to vary its driving capability. Figures 2, 3 and 4 show the SPICE simulation results of the variation of the delay and energy consumption values, respectively, for a 3 fan-in, 1 fan-out AND NPD gate with different values of NMOS transistor width and with different drivers (NPD and PPD). The delay and energies of the corresponding PPD gate are also shown. We see that there is a trade-off between reduced delay and energy of a gate versus increased gate area as the width of the NMOS varies. We shall show the effect of this trade-off on overall energy savings in Section 6. The area overhead can be reduced by using a single NMOS transistor for the output inverters of the gates which will not be switching at the same time. This reduces the area without decreasing the performance significantly. We observed that putting one NMOS transistor of width W from the power supply to the output inverter of all the gates in a path gives very close propagation delay to the case where each gate has an NMOS transistor of width W between the power supply and its output inverter. The slight increase in delay may be adjusted by increasing the width of the global NMOS properly.
Figures 5, 6, and 7 show the variation of delay and energy values for PPD and NPD three-input AND and OR gates with different fan-outs when driven by an NPD and a PPD gate. As seen in the figures, while $E^{010}$ increases similarly for both NPD and PPD domino gates with increasing fan-outs, propagation delay increases more for NPD gates as the fan-out of the gate increases.

Figure 5. Variation of propagation delay for PPD and NPD AND3 and OR3 gates with different drivers and load capacitances
Figure 6. Variation of energy for PPD and NPD AND3 and OR3 gates with different drivers and load capacitances
Figure 7. Variation of energy for PPD and NPD AND3 and OR3 gates with different drivers and load capacitances

IV. Algorithm for Replacing PPD Gates with NPD Gates
To reduce the energy consumption of combinational
domino logic circuits, we propose replacing the fast, high
energy PPD gates on the non-critical paths with the slow, low
energy NPD gates. The total delay of the circuit remains the
same as the original circuit with only PPD gates.
We first represent the combinational circuit as a directed
acyclic graph (DAG), G(V,E). If the circuit has multiple
primary inputs (PIs), we create a dummy PI vertex, PI_d, which
fans out to the original PIs. Underlying this is the assumption
that all inputs arrive simultaneously. Similarly for POs, we
create a dummy PO vertex, PO_d, which has fan-ins from the
original POs. Each vertex v of the DAG has associated with
it the following information:
(1) The logic function computed by the gate
corresponding to the vertex.
(2) Four delay values: PPdelay: delay if driven by PPD,
mapped to PPD; PNdelay: delay if driven by PPD,
mapped to NPD; NPdelay: delay if driven by NPD,
mapped to PPD; NNdelay: delay if driven by NPD,
mapped to NPD.
(3) Four energy values: PPenergy, PNenergy, NPenergy,
NNenergy, similar to above.
(4) The current delay (v.delay), energy (v.energy), time
slack (v.ts), early start time (v.es), early finish time (v.ef),
late start time (v.ls) and late finish time (v.lf) of the vertex
at any stage of the replacement process. Please refer to [5]
for definitions of early start time, etc.
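The per-vertex record in items (1)-(4) and the dummy PI/PO vertices can be sketched as a small data structure. This is a minimal illustration, not the authors' C++ implementation: the class layout, the `is_npd` flag, and the helpers `connect` and `add_dummy_terminals` are hypothetical names, and gates without predecessors or successors are assumed to stand in for the original PIs and POs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Vertex:
    """One gate of the circuit DAG; field names mirror items (1)-(4)."""
    name: str
    function: str = "DUMMY"   # (1) logic function, e.g. "AND" or "OR"
    # (2) delays: first letter = driver type, second letter = own mapping
    PPdelay: float = 0.0
    PNdelay: float = 0.0
    NPdelay: float = 0.0
    NNdelay: float = 0.0
    # (3) energies, same naming convention
    PPenergy: float = 0.0
    PNenergy: float = 0.0
    NPenergy: float = 0.0
    NNenergy: float = 0.0
    # (4) current state during the replacement process
    delay: float = 0.0
    energy: float = 0.0
    ts: float = 0.0
    es: float = 0.0
    ef: float = 0.0
    ls: float = 0.0
    lf: float = 0.0
    is_npd: bool = False      # hypothetical flag: True once mapped to NPD
    preds: List["Vertex"] = field(default_factory=list, repr=False)
    succs: List["Vertex"] = field(default_factory=list, repr=False)


def connect(driver: Vertex, load: Vertex) -> None:
    """Add the DAG edge driver -> load."""
    driver.succs.append(load)
    load.preds.append(driver)


def add_dummy_terminals(vertices: List[Vertex]):
    """Create PI_d feeding all source vertices and PO_d fed by all sink vertices."""
    sources = [v for v in vertices if not v.preds]
    sinks = [v for v in vertices if not v.succs]
    pi_d, po_d = Vertex("PI_d"), Vertex("PO_d")  # zero delay and energy
    for v in sources:
        connect(pi_d, v)
    for v in sinks:
        connect(v, po_d)
    return pi_d, po_d


# Tiny example: two AND gates feeding one OR gate.
a, b, c = Vertex("a", "AND"), Vertex("b", "AND"), Vertex("c", "OR")
connect(a, c)
connect(b, c)
pi_d, po_d = add_dummy_terminals([a, b, c])
```

Using `eq=False` keeps vertex comparison identity-based, which avoids recursive comparisons through the `preds`/`succs` links.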
The delay/energy consumption values for PPD and NPD
gates of different fan-ins/fan-outs/types/switching activities
when driven by NPD or PPD gates are obtained from SPICE
simulations as explained in Section 5. The dummy vertices
have zero delay/energy consumptions.
The algorithm for replacing PPD gates with NPD gates
consists of two steps:
(a) Initialization: In this step, we first topologically sort the
DAG to get the sorted vertex list, V. V is used to compute
the total circuit delay, T, of the baseline circuit in which each
vertex is mapped to a PPD gate, and to efficiently compute
the time slack, early start time, early finish time, late start
time and late finish time of each vertex using the function
Update_Time_Slacks(V). We also compute the possible
delay and energy values that a vertex can have when the gate
corresponding to the vertex is mapped as NPD or PPD and
driven by an NPD or PPD gate. For example, if vertex v
represents an AND gate with fan-in fi and fan-out fo, we
look up the four possible delay values it can have from the
SPICE table:
(1) v.PPdelay = spice_data[AND][PMOS][fi][fo][TRH]
(2) v.PNdelay = spice_data[AND][NMOS][fi][fo][TRH]
(3) v.NPdelay = spice_data[AND][PMOS][fi][fo][TRL]
(4) v.NNdelay = spice_data[AND][NMOS][fi][fo][TRL]
where the second index of spice_data is the pull-up type of
the gate and the last index refers to the transmission delay
when the gate has a high drive (TRH) or a low drive (TRL)
depending on whether it is driven by a PPD gate or an NPD
gate respectively. The four possible energy values are
computed similarly.
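The time quantities computed in the initialization step follow the usual critical-path (PERT/CPM) forward and backward passes [5]. The sketch below is one plausible shape for such a pass, not the authors' code: the function name, the dictionary-based graph representation, and the example delays are assumptions, and each vertex is given a single fixed delay for simplicity.

```python
def update_time_slacks(order, preds, succs, delay):
    """Forward/backward critical-path pass over a topologically sorted DAG.

    order: vertex names in topological order
    preds, succs: dicts mapping a vertex name to its predecessor/successor names
    delay: dict mapping a vertex name to its current delay
    Returns (T, es, ef, ls, lf, ts).
    """
    es, ef = {}, {}
    for v in order:                      # forward pass: early start/finish
        es[v] = max((ef[p] for p in preds[v]), default=0.0)
        ef[v] = es[v] + delay[v]
    T = max(ef.values())                 # total circuit delay
    ls, lf = {}, {}
    for v in reversed(order):            # backward pass: late start/finish
        lf[v] = min((ls[s] for s in succs[v]), default=T)
        ls[v] = lf[v] - delay[v]
    ts = {v: ls[v] - es[v] for v in order}   # time slack
    return T, es, ef, ls, lf, ts


# Example: dummy PI and PO around two parallel gates with made-up delays.
order = ["PI_d", "a", "b", "PO_d"]
preds = {"PI_d": [], "a": ["PI_d"], "b": ["PI_d"], "PO_d": ["a", "b"]}
succs = {"PI_d": ["a", "b"], "a": ["PO_d"], "b": ["PO_d"], "PO_d": []}
delay = {"PI_d": 0.0, "a": 3.0, "b": 1.0, "PO_d": 0.0}
T, es, ef, ls, lf, ts = update_time_slacks(order, preds, succs, delay)
```

In this example gate `a` lies on the critical path and has no slack, while gate `b` can be slowed by up to two time units without changing the circuit delay.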
With each vertex mapped to a PPD gate, we compute an
energy metric for each vertex, vertex_energy_saving. As
the name suggests, vertex_energy_saving is just the energy
saving obtainable if the vertex gate is changed from PPD to
NPD. This metric is zero for a vertex if either (i) the delay
increase of the vertex due to the change is greater than the
vertex's time slack or (ii) the delay increase of any of the
driven gates (due to the reduced drive provided by the NPD
gate) is greater than its time slack. In case all the delay
increases are less than the corresponding time slacks, the
change in energy of the driven gates is also included in the
metric for the driver gate.
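The metric described above can be sketched as follows, assuming the all-PPD baseline of step (a). The `Gate` class and the numeric values are hypothetical and chosen only for illustration; the function is one reading of conditions (i) and (ii), not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Gate:
    # Delays and energies in arbitrary units (illustrative values only).
    PPdelay: float
    PNdelay: float
    NPdelay: float
    PPenergy: float
    PNenergy: float
    NPenergy: float
    ts: float                                   # time slack in the all-PPD baseline
    succs: List["Gate"] = field(default_factory=list, repr=False)


def vertex_energy_saving(v: Gate) -> float:
    """Energy saved by remapping v from PPD to NPD with every other gate PPD."""
    # (i) v's own delay increase must fit within its slack.
    if v.PNdelay - v.PPdelay > v.ts:
        return 0.0
    saving = v.PPenergy - v.PNenergy
    for p in v.succs:
        # (ii) each driven gate slows down because it now sees a weaker drive.
        if p.NPdelay - p.PPdelay > p.ts:
            return 0.0
        # The energy change of the driven gate is charged to the driver.
        saving += p.PPenergy - p.NPenergy
    return saving


load = Gate(1.0, 1.5, 1.25, 10.0, 7.0, 9.5, ts=0.5)
driver = Gate(1.0, 1.5, 1.25, 10.0, 7.0, 9.5, ts=1.0, succs=[load])
tight = Gate(1.0, 1.5, 1.25, 10.0, 7.0, 9.5, ts=0.25, succs=[load])
```

Here `driver` has enough slack for both its own slowdown and that of `load`, so its metric is the sum of both energy changes, whereas `tight` fails condition (i) and gets a metric of zero.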

Algorithm PPD_NPD
Inputs: Topologically sorted list of circuit vertices, V;
Output: Circuit with off-critical path PPD gates replaced with NPD gates.
For every vertex v in order of decreasing metric value {
    If v has any predecessor mapped to NPD gate {
        high_delay ← v.NNdelay;
        low_energy ← v.NNenergy;
    }
    Else {
        high_delay ← v.PNdelay;
        low_energy ← v.PNenergy;
    }
    If ((high_delay - v.delay) ≤ v.ts) {
        flag ← 0;
        For every successor p of v {
            If p is mapped to a NPD gate {
                If ((v.es + high_delay) ≥ p.es) {
                    If ((v.es + high_delay + p.NNdelay) > p.lf)
                        flag ← 1; break;
                }
                Else if ((p.NNdelay - p.delay) > p.ts)
                    flag ← 1; break;
            }
            Else {
                If ((v.es + high_delay) ≥ p.es) {
                    If ((v.es + high_delay + p.NPdelay) > p.lf)
                        flag ← 1; break;
                }
                Else if ((p.NPdelay - p.delay) > p.ts)
                    flag ← 1; break;
            }
        }
        If (flag = 0) {
            Map v to an NPD gate.
            v.delay ← high_delay;
            v.energy ← low_energy;
            For every successor p of v {
                If p is mapped to a NPD gate {
                    p.delay ← p.NNdelay;
                    p.energy ← p.NNenergy;
                }
                Else {
                    p.delay ← p.NPdelay;
                    p.energy ← p.NPenergy;
                }
            }
            Update_Time_Slacks(V)
        }
    }
}
Figure 8. Algorithm for replacing off-critical path PPD gates with
NPD gates
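The feasibility test at the core of Figure 8 (the part that sets the flag) can be written as a small predicate. This is a sketch under our reading of the figure, in particular of the comparison operators; the `Gate` class, the helper names `npd_delay` and `can_replace`, and the numbers in the example are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Gate:
    delay: float               # current delay
    ts: float                  # current time slack
    es: float                  # early start time
    lf: float                  # late finish time
    PNdelay: float = 0.0       # NPD mapping, PPD driver
    NNdelay: float = 0.0       # NPD mapping, NPD driver
    NPdelay: float = 0.0       # PPD mapping, NPD driver
    is_npd: bool = False
    preds: List["Gate"] = field(default_factory=list, repr=False)
    succs: List["Gate"] = field(default_factory=list, repr=False)


def npd_delay(v: Gate) -> float:
    """Delay v would have as an NPD gate, given its current drivers."""
    return v.NNdelay if any(q.is_npd for q in v.preds) else v.PNdelay


def can_replace(v: Gate) -> bool:
    """True if mapping v to NPD keeps v and all its successors within slack."""
    high_delay = npd_delay(v)
    if high_delay - v.delay > v.ts:
        return False
    for p in v.succs:
        # Delay of p once it is driven by an NPD gate.
        new_delay = p.NNdelay if p.is_npd else p.NPdelay
        if v.es + high_delay >= p.es:
            # v would now determine when p starts: check p's late finish time.
            if v.es + high_delay + new_delay > p.lf:
                return False
        elif new_delay - p.delay > p.ts:
            # p's start is unchanged, but its own slowdown must fit its slack.
            return False
    return True


v = Gate(delay=1.0, ts=1.0, es=0.0, lf=2.0, PNdelay=1.5)
p = Gate(delay=1.0, ts=1.0, es=1.0, lf=4.0, NPdelay=1.25)
v.succs.append(p)
p.preds.append(v)
feasible = can_replace(v)
```

After a successful check, the full procedure would update the delays and energies of `v` and its successors and recompute all slacks before visiting the next vertex.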
(b) PPD to NPD replacement: In this step, we visit every vertex in decreasing order of the vertex_energy_saving metric, and attempt to replace the gate corresponding to the vertex with the NPD equivalent. This might not always be possible, even for a vertex with non-zero metric value, because the metric for the vertex was computed in step (a) under the assumption that all other vertices are mapped to PPD gates and hence the vertex had a lot of slack. As the replacement proceeds, the slack available to a vertex keeps on reducing and might not be sufficiently large to allow PPD-NPD replacement. After every replacement, the slacks for all gates are recomputed using the function Update_Time_Slacks(V). The whole procedure is repeated till all vertices with non-zero metric values have been visited. Figure 8 gives the details of the algorithm used in this step.
After step (b) has been carried out, some PPD gates on off-critical paths have been replaced by NPD gates. This step does not change the total circuit delay since only those PPD gates which have sufficient time slack are replaced by NPD gates.

Table 1. Results
Columns: Circuit | #Gates | Delay (sec) | Initial Energy (J) | Final Energy (J) | % Saving | Fraction of NPD Gates | CPU Time
Circuits: C1908, C2670, C3540, C5315, C7552, Average
#Gates: 794, 1253, 1987
Energy values (J): 8.62E-12, 2.24E-11, 1.91E-11, 2.47E-11, 4.59E-11, 6.04E-11, 7.95E-11
% Saving: 9.13, 27.03, 19.98, 20.24, 24.55 (average 16.25)
Other entries: 10.21, 22.24, 40.88, 11.78

Figure 9. (plot; horizontal axis: Width of Extra NMOS (m))

V. Methodology
We tested our scheme on the ISCAS85 benchmark circuits. The circuits were synthesized using Synopsys Design Compiler to a target library that was reduced to have only two- to four-input AND and OR gates, and INVERTER gates, for simplicity. Circuits were optimized for minimum delay. Since domino logic can only implement non-inverting functions, the resulting circuit was not suitable for mapping to domino logic gates. We implemented the bubble pushing and duplication algorithm [4] to turn the CMOS logic style mapping into a domino logic style mapping, where inverters are present only at the primary inputs.
To do the energy optimization described in Section 4, we need the delay and dynamic energy consumption characteristics of the gates in the circuit. Lookup tables for energy and delay for different values of fan-out capacitances were generated for the gates used in the synthesis using 0.18µ SPICE Level 49 MOSFET models [6]. Delay and energy values for the gates change depending on the magnitude of the driving voltage. Therefore, two simulations were run for every NPD and PPD gate: one for the case when the gate is driven by an NPD gate, and the other for the case when it is driven by a PPD gate.
We applied inputs with a switching activity of 0.1 and a static probability of 0.5 (of the input being high) to all the primary inputs and used Synopsys Design Compiler to get the static probabilities of the internal nodes. The average energy, $E_i$, for gate i was calculated as follows:

$$E_i = p_i^{0} E_i^{00} + p_i^{1} E_i^{010}$$

where $p_i^{0}$ and $p_i^{1}$ are the static probabilities of the output of gate i being 0 and 1 respectively.
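Interpreting the average energy of a gate as the probability-weighted mean of the two per-cycle energies from Section II ($E^{00}$ when the output stays at 0, $E^{010}$ when it makes a 0-1-0 transition), the calculation can be sketched as below. The function names and the numbers are illustrative assumptions, not values from the paper.

```python
def average_energy(p_one: float, e00: float, e010: float) -> float:
    """Average per-cycle energy of one gate.

    p_one: static probability that the gate output is 1
    e00:   energy of a cycle in which the output stays at 0
    e010:  energy of a cycle with a 0-1-0 output transition
    """
    if not 0.0 <= p_one <= 1.0:
        raise ValueError("p_one must be a probability")
    return (1.0 - p_one) * e00 + p_one * e010


def total_energy(gates) -> float:
    """Sum of average energies over an iterable of (p_one, e00, e010) tuples."""
    return sum(average_energy(p, e00, e010) for p, e00, e010 in gates)


# Illustrative numbers in arbitrary energy units.
example_gate = average_energy(0.25, 2.0, 10.0)
example_total = total_energy([(0.25, 2.0, 10.0), (0.5, 1.0, 3.0)])
```

In an optimization flow, `e00` and `e010` would come from the SPICE lookup tables for the gate's current mapping and driver type, so the total changes as gates are remapped.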

VI. Results
We implemented the algorithm in Figure 8 using C++. The compiled program was run for each of the benchmark circuits on a Sun Sparc Ultra-80 machine. As mentioned in Section 4, the delay and energy characteristics of the NPD gates can be varied by changing the width of the inverter NMOS pull-up transistor. This variation in the delay and energy characteristics leads to a variation in the number of PPD gates that are replaced with NPD gates by our algorithm and hence a variation in the overall energy savings obtained. Figure 9 shows that as we increase the width of the inverter NMOS pull-up in the NPD gates, the energy savings go up. This is expected because both the delay and energy of NPD gates go down with this increased width as shown in Figures 2, 3 and 4. This translates into more energy savings due to a larger number of PPD gates being replaced with NPD gates. The designer has to determine the width of the NMOS transistor by properly considering the trade-off between energy saving and area increase. Table 1 shows the results for the case when the inverter NMOS pull-up transistor is sized to be the same size as the inverter PMOS transistor (1.08 µm). The inverter NMOS transistor is sized to be 0.27 µm. Energy values do not include the energy consumed by the clocking network. Although the overall average energy savings is 16.25%, we see that the average energy savings for the bigger circuits (>1000 gates) is 22.95% while that for the smaller circuits (<1000 gates) is 7.32%. Thus, the technique is considerably more effective for large circuits.
Energy savings may be increased further by modifying the threshold voltage of the NMOS pull-up transistor, thus changing the delay and energy consumption characteristics of the low power gates. A lower (higher) threshold voltage will decrease (increase) the delay of the gate while increasing (decreasing) the power consumption. The threshold voltage value for the additional NMOS transistor between the power supply and the output inverter has to be the same as that of the NMOS pull-up transistor for proper operation.

VII. Conclusion
We presented a method to save dynamic energy in domino
logic circuits by replacing the PMOS pull-up transistor by an
NMOS transistor and adding an additional NMOS transistor
between the power supply and the output inverter. This method
gives an average energy saving of 16.25% for the ISCAS85
benchmark circuits. This method makes it possible to exploit
the advantages of using dual supply voltage without the
necessity of a second supply voltage or level shifters. Apart
from the additional NMOS transistor for low energy gates,
the savings are obtained virtually for free.

REFERENCES
[1] J. P. Uyemura, CMOS Logic Circuit Design, Kluwer Academic Publishers, March 1999.
[2] S. J. Shieh, J. S. Wang, Design of low-power domino circuits using multiple supply voltages, IEEE International Conference on Electronics, Circuits and Systems, Sept. 2001, pp. 711 - 714.
[3] S. O. Jung, K. W. Kim, S. M. Kang, Low-swing clock domino logic incorporating dual supply and dual threshold voltages, Design Automation Conference, June 2002, pp. 467 - 472.
[4] M. R. Prasad, D. Kirkpatrick, R. K. Brayton, Domino logic synthesis and technology mapping, Int. Workshop on Logic Synthesis, 1997.
[5] J. D. Wiest, F. K. Levy, A Management Guide to PERT/CPM, Prentice-Hall, 1977.
[6] http://www.tsmc.com, Level 49 Spice parameters for 0.18µ TSMC process.
[7] K. Usami, M. Horowitz, Clustered Voltage Scaling Technique for Low-Power Design, International Symposium on Low Power Design, April 1995, pp. 3 - 8.
[8] M. Igarashi, K. Usami, K. Nogami, F. Minami, Y. Kawasaki, T. Aoki, M. Takano, S. Sonoda, M. Ichida, N. Hatanaka, A low-power design method using multiple supply voltages, International Symposium on Low Power Electronics and Design, Aug. 1997, pp. 36 - 41.
[9] Y. S. Dhillon, A. U. Diril, A. Chatterjee, H. H. S. Lee, Algorithm for achieving minimum energy consumption in CMOS circuits using multiple supply and threshold voltages at the module level, International Conference on Computer Aided Design, Nov. 2003, pp. 693 - 700.
[10] J. Chang, M. Pedram, Energy Minimization Using Multiple Supply Voltages, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Dec. 1997, pp. 436 - 443.
