
Micro transductors 08

Low Power VLSI Design 2


Dr.-Ing. Frank Sill
Department of Electrical Engineering, Federal University of Minas Gerais,

Av. Antônio Carlos 6627, CEP: 31270-010, Belo Horizonte (MG), Brazil

franksill@ufmg.br http://www.cpdee.ufmg.br/~frank/

Agenda

Recap

Power reduction on

Gate level
Architecture level
Algorithm level
System level

Copyright Sill, 2008

Micro transductors 08, Low Power 2

Recap: Problems of Power Dissipation

Continuously increasing performance demands

Increasing power dissipation of technical devices


Today: power dissipation is a main problem

High power dissipation leads to:

Reduced time of operation
Higher weight (batteries)
Reduced mobility
High efforts for cooling
Increasing operational costs
Reduced reliability


Recap: Consumption in CMOS


Water analogy:

Voltage (volt, V) corresponds to water pressure (bar)
Current (ampere, A) corresponds to water quantity per second (liter/s)
Energy corresponds to amount of water

CL = capacitive load

Energy consumption is proportional to capacitive load!



Recap: Energy and Power


Power is the height of the curve

Energy is the area under the curve

[Figure: power (watts) versus time for Approach 1 and Approach 2]

Energy = Power * time for calculation = Power * Delay



Recap: Power Equations in CMOS

$P = f \, C_L \, V_{DD}^2 + V_{DD} \, I_{peak} \, (P_{01} + P_{10}) + V_{DD} \, I_{leak}$

First term: dynamic power (about 40-70 % today and decreasing relatively)
Second term: short-circuit power (about 10 % today and decreasing absolutely)
Third term: leakage power (about 20-50 % today and increasing)


Recap: Levels of Optimization


Level          Savings    Speed      Error
System         > 70 %     Seconds    > 50 %
Algorithm      40-70 %    Minute     25-50 %
Architecture   25-40 %    Minutes    15-30 %
Gate           15-25 %    Hour       10-20 %
Transistor     10-15 %    Hours      5-10 %

after Massoud Pedram

Recap: Logic Restructuring


Logic restructuring: changing the topology of a logic network to reduce transitions
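AND: P01 = P0 * P1 = (1 - PA*PB) * PA*PB

Chain implementation (inputs A, B, C, D, each with signal probability 0.5):
W = A AND B: P01 = (1 - 0.25) * 0.25 = 3/16
X = W AND C: P01 = 7/64 = 0.109
F = X AND D: P01 = 15/256

Tree implementation (same inputs):
Y = A AND B: P01 = 3/16 = 0.188
Z = C AND D: P01 = 3/16
F = Y AND Z: P01 = 15/256

Chain implementation has a lower overall switching activity than tree implementation for random inputs
Source: Timmermann, 2007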

BUT: Ignores glitching effects
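
The comparison of chain and tree implementation can be checked numerically. The following is a minimal sketch, assuming independent inputs with signal probability 0.5 and no glitching; the function names are illustrative and not taken from the slides.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def and_gate(p_a, p_b):
    """Return (probability of output 1, probability of a 0->1 transition)."""
    p_one = p_a * p_b
    return p_one, (1 - p_one) * p_one

def chain_activity():
    # F = ((A AND B) AND C) AND D
    p_w, act_w = and_gate(HALF, HALF)
    p_x, act_x = and_gate(p_w, HALF)
    _, act_f = and_gate(p_x, HALF)
    return act_w + act_x + act_f

def tree_activity():
    # F = (A AND B) AND (C AND D)
    p_y, act_y = and_gate(HALF, HALF)
    p_z, act_z = and_gate(HALF, HALF)
    _, act_f = and_gate(p_y, p_z)
    return act_y + act_z + act_f

print(chain_activity(), tree_activity())
```

Under these assumptions the chain sums to 91/256 (about 0.36) and the tree to 111/256 (about 0.43), which matches the statement that the chain switches less for random inputs.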



Recap: Input Ordering


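AND: P01 = (1 - PA*PB) * PA*PB

Signal probabilities: PA = 0.5, PB = 0.2, PC = 0.1

Ordering 1 (X = A AND B, F = X AND C): P01(X) = (1 - 0.5 x 0.2) * (0.5 x 0.2) = 0.09
Ordering 2 (X = B AND C, F = X AND A): P01(X) = (1 - 0.2 x 0.1) * (0.2 x 0.1) = 0.0196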

Beneficial: postponing introduction of signals with a high transition rate (signals with signal probability close to 0.5)
Source: Irwin, 2000


Recap: Glitching
[Figure: glitching example with inputs A, B, C, internal node X, and output Z; waveforms for the input transition ABC = 101 to 000 under a unit-delay model]
Source: Irwin, 2000

Design Layer: Gate Level

Basic elements:

Logic gates
Sequential elements (flipflops, latches)

Behavior of elements is described in libraries

Dynamic Power and Device Size


Device sizing (= changing gate width)

Affects input capacitance Cin
Affects load capacitance Cload
Affects dynamic power consumption Pdyn

Optimal fanout factor f for Pdyn is smaller than for performance (especially for large loads)

e.g., for Cload = 20, Cin = 1:
fcircuit = 20
fopt_energy = 3.53
fopt_performance = 4.47

[Figure: normalized energy versus fanout f for fcircuit = 1, 2, 5, 10, 20]

Low power: avoid oversizing (f too big) beyond the optimal

Source: Nikolic, UCB

VDD versus Delay and Power


[Figure: relative delay td and relative dynamic power Pdyn versus supply voltage VDD]

Delay (td) and dynamic power consumption (Pdyn) are functions of VDD

Multiple VDD

Main ideas:

Use of different supply voltages within the same design
High VDD for critical parts (high performance needed)
Low VDD for non-critical parts (only low performance demands)

At design phase:

Determine critical path(s) (see the following slides on data paths)
High VDD for gates on those paths
Lower VDD on the other gates (in non-critical paths)
For low VDD: prefer gates that drive large capacitances (yields the largest energy benefits)

Usually two different VDD (but more are possible)

Multiple VDD cont'd

Level converters:

Necessary when a module at the lower supply drives a gate at the higher supply (step-up)
If a gate supplied with VDDL drives a gate supplied with VDDH, then the PMOS never turns off

Possible implementation:

Cross-coupled PMOS transistors
NMOS transistors operate on reduced supply

[Figure: level converter with input Vin, output Vout, and supplies VDDL and VDDH]

No need for level converters for a step-down change in voltage

Reducing overhead:

Conversions at register boundaries
Embedding of the level conversion inside the flipflop

Data Paths

Data propagate through different data paths between registers (flipflops - FF)

Paths mostly differ in propagation delay times


Frequency of clock signal (CLK) depends on the path with the longest delay (critical path)

[Figure: flipflops (FF) clocked by CLK with combinational paths of different length between them]

Data Paths: Slack


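[Figure: gate G1 with inputs A and B drives gate G2, whose other input is C; output Y. Timeline: all inputs of G1 arrived, G1 ready with evaluation (after the delay of G1), all inputs of G2 arrived. The time between G1 being ready and all inputs of G2 having arrived is the slack for G1]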

Multiple VDD in Data Paths


Minimum energy consumption when all logic paths are critical (same delay)
Possible algorithm: clustered voltage scaling
Each path starts with VDDH and switches to VDDL (blue gates) when slack is available
Level conversion in flipflops at end of paths

[Figure: gate network with gates connected with VDDL (blue) and gates connected with VDDH]
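
The idea of switching a path from VDDH to VDDL once slack is available can be sketched for a single path. This is a simplified illustration, not the full clustered voltage-scaling algorithm: the function name `gates_on_vddh` and the `slowdown` factor (how much slower a gate becomes on VDDL) are hypothetical, and interactions between paths that share gates are ignored.

```python
def gates_on_vddh(gate_delays, clock_period, slowdown):
    """Return how many leading gates of one path must stay on VDDH.

    gate_delays: gate delays along the path at VDDH, in path order.
    slowdown: assumed factor by which a gate gets slower on VDDL.
    All gates after the returned index run on VDDL, so the path
    changes supply only once and needs level conversion only at its end.
    """
    for k in range(len(gate_delays) + 1):
        delay = sum(gate_delays[:k]) + slowdown * sum(gate_delays[k:])
        if delay <= clock_period:
            return k
    raise ValueError("path violates timing even with all gates on VDDH")

# Path with some slack: the last two gates can move to VDDL.
print(gates_on_vddh([1.0, 1.0, 1.0, 1.0], clock_period=5.0, slowdown=1.5))
# Short path with a lot of slack: every gate can run on VDDL.
print(gates_on_vddh([1.0, 1.0], clock_period=5.0, slowdown=1.5))
```

A critical path (no slack) keeps all of its gates on VDDH, while a path with enough slack runs entirely on VDDL.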


Design Layer: Architecture Level


Also known as register transfer level (RTL)

Base elements:

Register structures
Arithmetic logic units (ALU)
Memory elements

Only behavior is described (no inner structure)

Clock Gating

Most popular method for power reduction of clock signals and functional units

Gate off clock to idle functional units


Logic for generation of disable signal necessary:

Higher complexity of control logic
Higher power consumption
Timing is critical for avoiding clock glitches at OR gate output
Additional gate delay on clock signal

[Figure: register (Reg) in front of a functional unit; the register clock is the output of an OR gate with inputs clock and disable]

Source: Irwin, 2000
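
The effect of OR-based gating on the register clock can be illustrated with sampled waveforms. This is a minimal sketch under the assumption that the disable signal changes only while the clock is high, which is what keeps the OR output free of glitches; the waveforms and function names are made up for illustration.

```python
def gate_clock(clk, disable):
    """OR-based clock gating: the register clock stays high while disable is 1."""
    return [c | d for c, d in zip(clk, disable)]

def rising_edges(wave):
    """Count 0->1 transitions, i.e. the edges that would trigger a register."""
    return sum(1 for prev, cur in zip(wave, wave[1:]) if prev == 0 and cur == 1)

clk = [0, 1] * 8                         # eight clock cycles
disable = [0] * 3 + [1] * 8 + [0] * 5    # toggles only at samples where clk is 1
gated = gate_clock(clk, disable)

print(rising_edges(clk), rising_edges(gated))
```

In this example the free-running clock has eight rising edges, while the gated register clock sees only four; the other four are suppressed while the functional unit is idle.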

Clock Gating cont'd

Clock gating in a low-power flip-flop

[Figure: flip-flop circuit with gated clock input CLK]
Source: Agarwal, 2007

Clock Gating cont'd

Clock gating based on the state of finite-state machines (FSM)

[Figure: FSM with primary inputs (PI), combinational logic, flip-flops, and primary outputs (PO); clock activation logic and a latch gate the clock CLK of the flip-flops]
Source: L. Benini and G. De Micheli, Dynamic Power Management, Boston: Springer, 1998.

Clock Gating: Example


MPEG4 decoder with 896Kb SRAM

Without clock gating: 30.6 mW
With clock gating: 8.5 mW

90% of flipflops clock-gated
70% power reduction by clock gating

[Figure: bar chart of power (mW) without and with clock gating, broken down by block (VDE, DEU, MIF, DSP/HIF)]
Source: M. Ohashi, Matsushita, 2002

Recap: VDD versus Delay and Power


[Figure: relative delay td and relative dynamic power Pdyn versus supply voltage VDD]

Dynamic power can be traded for delay

A Reference Datapath

[Figure: Input, Register, Combinational logic, Register, Output; registers clocked by CLK; switched capacitance Cref]

Supply voltage = Vref
Total capacitance switched per cycle = Cref
Clock frequency = fclk
Power consumption: $P_{ref} = C_{ref} V_{ref}^2 f_{clk}$

Source: Agarwal, 2007

Parallel Architecture
Each copy processes every Nth input and operates at fclk/N with reduced voltage

Supply voltage: VN ≤ Vref
N = degree of parallelism

[Figure: input register (fclk) feeding N copies of the combinational logic (Copy 1, Copy 2, ..., Copy N), each followed by a register clocked at fclk/N; an N-to-1 multiplexer and an output register (fclk) follow; a multiphase clock generator with mux control is driven by CK]

Source: Agarwal, 2007

Pipelined Architecture

Reduces the propagation time of a block by factor N
Voltage can be reduced at constant clock frequency
Constant throughput

[Figure: a block of area A between registers clocked by CLK is split into N pipeline stages of area A/N, each followed by a register; a timing diagram shows the data advancing with CLK]

Parallel Architecture: Example

Reference Data path (for example)


[Figure: reference datapath with adder and comparator]

Critical path delay: Tadder + Tcomparator (= 25 ns)
fref = 40 MHz
Total capacitance being switched = Cref
VDD = Vref = 5 V
Power for reference datapath: $P_{ref} = C_{ref} V_{ref}^2 f_{ref}$
Source: Irwin, 2000

Parallel Architecture: Example cont'd

Area = 1476 x 1219 µm²

The clock rate can be reduced by half with the same throughput: fpar = fref / 2
Vpar = Vref / 1.7, Cpar = 2.15 Cref
$P_{par} = (2.15 \, C_{ref}) (V_{ref}/1.7)^2 (f_{ref}/2) \approx 0.36 \, P_{ref}$
Source: Irwin, 2000

Pipelined Architecture: Example

fpipe = fref, Cpipe = 1.1 Cref, Vpipe = Vref / 1.7
Voltage can be dropped while maintaining the original throughput
$P_{pipe} = C_{pipe} V_{pipe}^2 f_{pipe} = (1.1 \, C_{ref}) (V_{ref}/1.7)^2 f_{ref} \approx 0.37 \, P_{ref}$

Source: Irwin, 2000
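
The power figures of the parallel and the pipelined example can be reproduced with a short calculation. The sketch below uses only the ratios given on the slides; the helper name `relative_power` is illustrative. With the rounded divisor 1.7 the results come out slightly higher than the quoted 0.36 and 0.37. Those values are matched when the scaled supply is taken as 2.9 V instead of 5 V / 1.7 (about 2.94 V); the 2.9 V figure is an assumption here and is not stated on the slides.

```python
def relative_power(cap_ratio, voltage_ratio, freq_ratio):
    """P / Pref = (C / Cref) * (V / Vref)^2 * (f / fref)."""
    return cap_ratio * voltage_ratio ** 2 * freq_ratio

# Ratios exactly as written on the slides (V = Vref / 1.7).
p_par = relative_power(2.15, 1 / 1.7, 0.5)
p_pipe = relative_power(1.1, 1 / 1.7, 1.0)

# Assumed unrounded supply of 2.9 V at Vref = 5 V.
p_par_29 = relative_power(2.15, 2.9 / 5.0, 0.5)
p_pipe_29 = relative_power(1.1, 2.9 / 5.0, 1.0)

print(round(p_par, 3), round(p_pipe, 3))
print(round(p_par_29, 3), round(p_pipe_29, 3))
```

Either way, both architectures cut dynamic power to a bit more than a third of the reference, and the pipelined version does so with far less area overhead.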

Approximate Trend
                 N-parallel proc.          N-stage pipeline proc.
Capacitance      N*Cref                    Cref
Voltage          Vref/N                    Vref/N
Frequency        fref/N                    fref
Dynamic power    Cref*Vref^2*fref/N^2      Cref*Vref^2*fref/N^2
Chip area        N times                   10-20% increase

Source: G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Kluwer Academic Publishers, 1998.
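
The dynamic power entries of the trend table follow directly from the capacitance, voltage, and frequency entries. As a short worked derivation, using the same idealized scaling as the table (voltage reduced by the full factor N):

```latex
\begin{align*}
P_{par}  &= (N \, C_{ref}) \left(\frac{V_{ref}}{N}\right)^{2} \frac{f_{ref}}{N}
          = \frac{C_{ref} V_{ref}^{2} f_{ref}}{N^{2}} \\
P_{pipe} &= C_{ref} \left(\frac{V_{ref}}{N}\right)^{2} f_{ref}
          = \frac{C_{ref} V_{ref}^{2} f_{ref}}{N^{2}}
\end{align*}
```

Both approaches therefore reach the same approximate power, but the parallel version pays with N times the area.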

Guarded Evaluation

Reduction of switching activity by adding latches at inputs


[Figure: multiplier with inputs A and B, shown without and with input latches that are controlled by a condition signal]

Latch preserves previous value of inputs to suppress activity
Could also use AND gates to mask inputs to zero (forced zero)
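
How a guarding latch suppresses input activity can be shown by counting bit toggles at the input of the guarded unit. This is a minimal sketch with made-up data; `guard` models a transparent latch that holds its last value while the condition is false, and the initial latch content 0 is an assumption.

```python
def guard(values, enable):
    """Pass the input while enable is True, otherwise hold the previous value."""
    held = 0  # assumed initial latch content
    out = []
    for value, en in zip(values, enable):
        if en:
            held = value
        out.append(held)
    return out

def bit_toggles(values):
    """Number of bit flips between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))

a = [3, 5, 6, 1, 7]
condition = [True, False, False, True, True]  # result needed only when True

print(bit_toggles(a), bit_toggles(guard(a, condition)))
```

For this input sequence the unguarded input toggles 9 bits, the guarded one only 3, because the changes in the cycles where the result is not needed never reach the multiplier.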

Precomputation
[Figure: precomputed inputs are stored in register R1, gated inputs in register R2; both feed the combinational logic f(X), which produces the outputs; the precomputation logic g(X) generates the load-disable signal for R2]

Identify logical conditions at inputs that are invariant to the output
Since those inputs don't affect output, disable input transitions
Trade area for energy
Source: Irwin, 2000

Precomputation: Design Issues

Design steps
1. Selection of precomputation architecture
2. Determination of precomputed and gated inputs (register R1 should be much smaller than R2)
3. Search for a good implementation of g(X)
4. Evaluation of potential energy savings based on input statistics (if savings are not sufficient, go to step 2 or 3 and try again)

Also works for multiple-output functions, where g(X) is the product of gj(X) over all j

Source: Irwin, 2000

Precomputation: Example

Binary Comparator
[Figure: n-bit binary value comparator computing A > B; the most significant bits An and Bn are stored in register R1, the remaining bits An-1, Bn-1, ..., A1, B1 in register R2; the load-disable signal for R2 is derived from a comparison of An and Bn]

Can achieve up to 75% power reduction with 3% area overhead and 1 to 5 additional gate delays in worst-case path
Source: Irwin, 2000
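
The comparator is a convenient case to try out precomputation in software. The sketch below is a behavioral model, not the circuit from the slide: whenever the most significant bits differ, they alone decide A > B and the lower bits do not need to be loaded. Function names are illustrative, and unsigned, uniformly distributed operands are assumed.

```python
def compare_with_precomputation(a, b, n):
    """Return (a > b, load_disabled) for n-bit unsigned a and b."""
    msb_a = (a >> (n - 1)) & 1
    msb_b = (b >> (n - 1)) & 1
    if msb_a != msb_b:
        # The MSBs decide the result; the lower bits stay unloaded.
        return msb_a > msb_b, True
    mask = (1 << (n - 1)) - 1
    return (a & mask) > (b & mask), False

def disabled_fraction(n):
    """Fraction of all input pairs for which loading of the lower bits is disabled."""
    pairs = [(a, b) for a in range(2 ** n) for b in range(2 ** n)]
    results = [compare_with_precomputation(a, b, n) for a, b in pairs]
    # The precomputed result must always agree with the plain comparison.
    assert all(res == (a > b) for (a, b), (res, _) in zip(pairs, results))
    return sum(disabled for _, disabled in results) / len(pairs)

print(disabled_fraction(4))
```

For uniformly distributed inputs the lower bits stay unloaded in half of all cycles, which indicates where savings of the reported order can come from.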

Adder Design

Various algorithms exist to implement an integer adder

Ripple, select, skip (x2), Look-ahead, conditional-sum.

Each with its own characteristics of timing and power consumption.

[Figure: block diagrams built from full adders (FA) for ripple carry, carry select (duplicated blocks for carry-in 0 and 1), variable/fixed width carry skip, and carry look-ahead]
Source: Mendelson, Intel

Adder Design
Adder                        Energy (pJ)    Delay (ns)
Ripple Carry                 117            54.27
Constant Width Carry Skip    109            28.38
Variable Width Carry Skip    126            21.84
Carry Lookahead              171            17.13
Carry Select                 216            19.56
Conditional Sum              304            20.05

Adders differ in energy and delay
Different adders for different applications
Also true for other units (multiplier, counter, ...)
Source: Callaway, Swartzlander, "Estimating the power consumption of CMOS adders", 11th Symposium on Computer Arithmetic, 1993.
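
One way to compare the adders for a given application is a combined metric. The sketch below uses the energy-delay product, which is a common figure of merit but is not used on the slide; it is introduced here only as an illustration, with the energy and delay values taken from the slide.

```python
# (energy in pJ, delay in ns) per adder, values from the slide
adders = {
    "Ripple Carry": (117, 54.27),
    "Constant Width Carry Skip": (109, 28.38),
    "Variable Width Carry Skip": (126, 21.84),
    "Carry Lookahead": (171, 17.13),
    "Carry Select": (216, 19.56),
    "Conditional Sum": (304, 20.05),
}

def energy_delay_product(name):
    energy, delay = adders[name]
    return energy * delay

lowest_energy = min(adders, key=lambda name: adders[name][0])
fastest = min(adders, key=lambda name: adders[name][1])
best_edp = min(adders, key=energy_delay_product)

print(lowest_energy, fastest, best_edp, sep="\n")
```

The three criteria pick three different adders, which illustrates the point that the right adder depends on the application.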

Bus Power

Buses are a significant source of power dissipation

50% of dynamic power for interconnect switching (Magen, SLIP 04)
MIT Raw processor's on-chip network consumes 36% of total chip power (Wang et al. 2003)

Caused by:

High switching activities
Large capacitive loading

[Figure: bus drivers with inputs Ain, Bin, Cin, Din and bus receivers with outputs Wout, Xout, Yout, Zout connected to a shared bus]
Source: Irwin, 2000

Bus Power Reduction


For an n-bit bus: $P_{bus} = n \, f_{clk} \, C_{load} \, V_{DD}^2$

Alternative bus structures:

Segmented buses (lower Cload)
Charge recovery buses
Bus multiplexing (lower fclk possible)

Minimizing bus traffic (n):

Code compression
Instruction loop buffers

Minimization of bit switching activity (fclk) by data encoding
Minimize voltage swing (VDD2) using differential signaling
Source: Irwin, 2000
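
The effect of data encoding on bus switching activity can be illustrated with Gray coding of sequential addresses. Gray code is chosen here only as a well-known example; the slide does not name a specific encoding. The sketch counts bit transitions on a 4-bit bus that carries the addresses 0 to 15 in order.

```python
def transitions(words):
    """Total number of bit flips between consecutive bus words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

def gray(value):
    """Binary-reflected Gray code of a non-negative integer."""
    return value ^ (value >> 1)

addresses = list(range(16))
binary_toggles = transitions(addresses)
gray_toggles = transitions([gray(a) for a in addresses])

print(binary_toggles, gray_toggles)
```

For this sequence the plain binary encoding causes 26 bit transitions and the Gray encoding 15, exactly one per step. The benefit depends on the data: for random addresses Gray coding gives no such advantage.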

Reducing Shared Resources


Shared resources incur switching overhead
Local bus structures reduce overhead

[Figure: global bus architecture versus local bus architecture]

Source: Irwin, 2000

Reducing Shared Resources cont'd

Bus segmentation

Another way to reduce shared buses
Control of bus segments by controller blocks (B)

[Figure: shared bus versus segmented bus with controller blocks (B)]

Source: Evgeny Bolotin, Jan 2004

Design Layer: Algorithm Level

Base elements:

Functions
Procedures
Processes
Control structures

Description of design behavior

Coding styles

Use processor-specific instruction style:


Variable types
Function call style
Conditionalized instructions (for ARM)

Follow general guidelines for software coding:

Use table look-up instead of conditionals
Make local copies of global variables so that they can be assigned to registers
Avoid multiple memory look-ups with pointer chains

Source-code Transformations

Minimize power-consuming activity:

Computation:
A*B + A*C
becomes
A*(B+C)

Communication:
for (c = 1..N) { receive(A); B = c*A }
becomes
receive(A); for (c = 1..N) B = c*A

Storage:
for (c = 1..N) B[c] = A[c]*D[c]
for (c = 1..N) F[c] = B[c]-1
becomes
for (c = 1..N) F[c] = A[c]*D[c]-1
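
The storage transformation (fusing two loops so that the intermediate array disappears) can be written out as runnable code. This is a minimal sketch; the function names are illustrative, and the second return value simply reports how many intermediate elements had to be stored.

```python
def unfused(a, d):
    """Two loops: the intermediate array b is written and then read again."""
    b = [x * y for x, y in zip(a, d)]
    f = [value - 1 for value in b]
    return f, len(b)

def fused(a, d):
    """One loop: each product is used immediately, no intermediate array."""
    f = [x * y - 1 for x, y in zip(a, d)]
    return f, 0

a = [1, 2, 3]
d = [4, 5, 6]
print(unfused(a, d))
print(fused(a, d))
```

Both versions compute the same result; the fused one avoids N memory writes and N memory reads for the intermediate values.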

Datapath Energy Consumption


[Figure: stacked bar chart of switched capacitance (nF) for bubble.c, heap.c, and quick.c, broken down into register file, pipeline registers, functional unit, and others]

Algorithms can differ in power dissipation

Source: Irwin, 2000

Adaptive Dynamic Voltage Scaling (DVS)


Slow down processor to fill idle time
More delay allows a lower operational voltage

[Figure: alternating active and idle phases at 3.3 V compared with continuous activity at 2.4 V]

Runtime scheduler determines processor speed and selects appropriate voltage

Transition delay for frequency changes: ~150 µs

Potential to realize 10x energy savings

Adaptive DVS: Example

Task with 100 ms deadline, requires 50 ms CPU time at full speed

Normal system gives 50 ms computation, 50 ms idle/stopped time

Half speed/voltage system gives 100 ms computation, 0 ms idle


Same number of CPU cycles, but: $E = C \, (V_{DD}/2)^2 = E_{ref} / 4$
Dynamic voltage scaling adapts voltage to workload

[Figure: speed versus time for tasks T1 and T2; at full speed each task is followed by idle time, at reduced speed the tasks fill the available time (same work, lower energy)]
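
The example generalizes to any task that finishes before its deadline. The sketch below assumes, as the example does, that the supply voltage can be scaled in proportion to the speed and that the energy per cycle is proportional to the square of the voltage; leakage and voltage limits are ignored. The function name is illustrative.

```python
def dvs_energy_ratio(cpu_time_full_speed, deadline):
    """Energy with voltage scaling relative to full-speed operation.

    The task needs the same number of cycles either way, so the ratio
    is set by the energy per cycle, which scales with voltage squared.
    """
    speed = cpu_time_full_speed / deadline  # fraction of full speed needed
    if speed > 1:
        raise ValueError("task cannot meet its deadline even at full speed")
    return speed ** 2

print(dvs_energy_ratio(50, 100))
```

For the task above (50 ms of work, 100 ms deadline) this gives 0.25, the factor Eref / 4 from the slide.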

Design Layer: System Level

Basic Elements:

Complex modules
Processors
Calculation and control units
Sensors

[Figure: system built from modules such as memories (MEM), ALU, and an MP3 unit]

Dynamic Power Management

Systems are:

Designed to deliver peak performance, but
Not needing peak performance most of the time

Components are idle sometimes

Dynamic power management (DPM):

Puts idle components in low-power non-operational states when idle

Power manager:

Observes and controls the system
Power consumption of power manager is negligible

Processor Sleep Modes

Software power control - power management

DOZE: Most units stopped except on-chip cache memory (cache coherency)
NAP: Cache also turned off, PLL still on, time out or external interrupt to resume
SLEEP: PLL off, external interrupt to resume

Deeper sleep mode consumes less power

Deeper sleep mode requires more latency to resume

Processor Sleep Modes: Example

PowerPC sleep modes


Mode                   66 MHz     80 MHz
No power mgmt          2.18 W     2.54 W
Dynamic power mgmt     1.89 W     2.20 W
DOZE                   307 mW     366 mW
NAP                    113 mW     135 mW
SLEEP                  89 mW      105 mW
SLEEP without PLL      18 mW      19 mW
SLEEP without clock    2 mW       2 mW

10 cycles to wake up from SLEEP

100 µs to wake up from SLEEP+

Source: Irwin, 2000

Transmeta LongRun

Applies adaptive DVS


LongRun policies:

Detection of different workload scenarios
Based on runtime performance information

After detection, corresponding adaptation of:

Processor supply voltage
Processor frequency

Clock frequency always within limits required by supply voltage to avoid clock skew problems

Use of hard-coded core frequency/voltage operating points

Best trade-off between performance and power possible

Transmeta LongRun contd


[Figure: % of max power consumption versus frequency (MHz), with a typical operating region and a peak performance region]

Operating points:

Frequency    Voltage
300 MHz      0.80 V
433 MHz      0.87 V
533 MHz      0.95 V
667 MHz      1.05 V
800 MHz      1.15 V
900 MHz      1.25 V
1000 MHz     1.30 V

Source: Transmeta
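
The listed operating points allow a rough estimate of the power curve. The sketch below assumes that power is purely dynamic and proportional to f * VDD^2, normalized to the fastest operating point; leakage and any other effects in the real chip are ignored, so the numbers are only an approximation of the chart.

```python
# (frequency in MHz, supply voltage in V), values from the slide
operating_points = [
    (300, 0.80), (433, 0.87), (533, 0.95), (667, 1.05),
    (800, 1.15), (900, 1.25), (1000, 1.30),
]

def relative_dynamic_power(points):
    """f * V^2 of each point relative to the last (fastest) point."""
    f_max, v_max = points[-1]
    reference = f_max * v_max * v_max
    return [f * v * v / reference for f, v in points]

for (f, v), rel in zip(operating_points, relative_dynamic_power(operating_points)):
    print(f"{f} MHz at {v:.2f} V: {100 * rel:.0f} % of max dynamic power")
```

Under this model the slowest operating point runs at 30 % of the peak frequency but needs only about 11 % of the peak dynamic power, which is the benefit of lowering the voltage together with the frequency.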

Transmeta LongRun: Example

Source: Transmeta


Battery aware design

Non-linear effects influence life time of batteries

Rate capacity:

If discharging currents are higher than allowed, real capacity falls below nominal capacity

[Figure: capacity (mAh) versus discharge current (mA); standard capacity 1000 mAh at rated current 125 mA]

Battery recovery:

Pulsed discharge increases nominal capacity (as long as there is no rate capacity effect)
Based on recovery times

[Figure: discharge current (mA) and available charge versus time, with idle periods between discharge pulses]
Source: Timmermann, 2007

Battery aware design cont'd

Diffusion model by Rakhmatov, Vrudhula et al.

[Figure: distribution of electro-active species in four states: fully charged battery, after a recent discharge, after recovery, fully discharged]

Analytically very sound but computationally intensive
Cannot be used for online scheduling decisions

Battery aware design: Example 1

Performance of a bipolar lead-acid battery subjected to six current impulses. Pulse length = 3 ms, rest period = 22 ms.

[Figure: current and battery voltage over time]

Source: LaFollette, "Design and performance of high specific power, pulsed discharge, bipolar lead acid batteries", 10th Annual Battery Conference on Applications and Advances, Long Beach, pp. 43-47, January 1995.

Battery aware design: Example 2


[Figure: current (mA) over time for discharge profile A and discharge profile B]

Profile    Aver. current [mA]    Battery lifetime [ms]    Specif. energy [Wh/kg]
A          123.8                 357053                   15.12
B          124.2                 536484                   18.58

Minimum average current ≠ maximum battery lifetime

Source: Timmermann, 2007

Backup


FSM: Clock-Gating

Moore machine: Outputs depend only on the state variables.


If a state has a self-loop in the state transition graph (STG), then the clock can be stopped whenever a self-loop is to be executed.

[Figure: STG with states Si, Sj, Sk; transitions labeled Xi/Zk and Xj/Zk lead into Sk, which has a self-loop labeled Xk/Zk]

Clock can be stopped when the (Xk, Sk) combination occurs.
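
The self-loop rule can be tried out on a small state machine. The two-state machine below is hypothetical and serves only as an illustration: the state register is clocked only when the next state differs from the current one, and the function counts how many clock cycles actually reach the register.

```python
# Hypothetical state transition graph: (state, input) -> next state
TRANSITIONS = {
    ("S0", 0): "S0",  # self-loop, clock can be stopped
    ("S0", 1): "S1",
    ("S1", 0): "S1",  # self-loop, clock can be stopped
    ("S1", 1): "S0",
}

def run(inputs, state="S0"):
    """Return (final state, number of cycles in which the register is clocked)."""
    clocked = 0
    for x in inputs:
        next_state = TRANSITIONS[(state, x)]
        if next_state != state:
            clocked += 1
            state = next_state
    return state, clocked

print(run([0, 0, 1, 0, 0, 0, 1, 0]))
```

For this input sequence only 2 of the 8 cycles need a clock edge at the state register; the state sequence is the same as without gating.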

Trend: Interconnects
Propagation delays of global wires will be a multiple of the clock cycle.

Example (very optimistic): 6-10 clock cycles in 50 nm technology [Benini, 2002]

Source: Tenhunen, 2005

Bus Multiplexing
Number of bus transitions per cycle = 2 (1 + 1/2 + 1/4 + ...) = 4

Source: Irwin, 2000

Resource Sharing and Activity II


Bus Multiplexing

Sharing of long data buses with time multiplexing
Example: S1 uses even cycles, S2 odd cycles

[Figure: dedicated buses from S1 to D1 and from S2 to D2 versus one shared, time-multiplexed bus]

Source: Irwin, 2000

Correlated Data Streams


[Figure: bit switching probabilities (0 to 1) versus bit position (14 = MSB to 0 = LSB) for muxed and dedicated buses]

For a shared (multiplexed) bus, advantages of data correlation are lost (bus carries samples from two uncorrelated data streams)
Bus sharing should not be used for positively correlated data streams
Bus sharing may prove advantageous in a negatively correlated data stream (where successive samples switch sign bits) - more random switching
Source: Irwin, 2000

Disadvantages of Bus Multiplexing

If data bus is shared, advantages of data correlation are lost (bus carries samples from two uncorrelated data streams)
Bus sharing should not be used for positively correlated data streams
Bus sharing may prove advantageous in a negatively correlated data stream (where successive samples switch sign bits) - more random switching

Adaptive DVS cont'd

Implementation

[Figure: FIFO input buffer feeding a variable power-speed system; a workload filter drives the power-speed control knob]