Av. Antônio Carlos 6627, CEP: 31270-010, Belo Horizonte (MG), Brazil
franksill@ufmg.br http://www.cpdee.ufmg.br/~frank/
Agenda
High power dissipation leads to:
Reduced time of operation
High efforts for cooling
[Figure: water analogy with water pressure (bar), water quantity per second (liter/s), and amount of water; load capacitance CL; "Approach 2" plotted over time]
$P = f\,C_L\,V_{DD}^2 + V_{DD}\,I_{peak}\,(P_{0\to1} + P_{1\to0}) + V_{DD}\,I_{leak}$
Dynamic power (40-70 % today and decreasing relatively)
Short-circuit power (10 % today and decreasing absolutely)
Leakage power (20-50 % today and increasing)
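The three terms of the power equation can be evaluated separately to see how the shares come about. The following is a minimal sketch; the function name and the operating point (frequency, capacitance, currents, transition probabilities) are hypothetical values chosen only for illustration, not figures from the slides.

```python
def power_components(f_hz, c_load_f, vdd_v, i_peak_a, p01, p10, i_leak_a):
    """Return (dynamic, short-circuit, leakage) power in watts."""
    p_dyn = f_hz * c_load_f * vdd_v ** 2        # f * CL * VDD^2
    p_sc = vdd_v * i_peak_a * (p01 + p10)       # VDD * Ipeak * (P01 + P10)
    p_leak = vdd_v * i_leak_a                   # VDD * Ileak
    return p_dyn, p_sc, p_leak


# Hypothetical operating point (assumption, for illustration only).
p_dyn, p_sc, p_leak = power_components(
    f_hz=100e6, c_load_f=1e-9, vdd_v=1.2,
    i_peak_a=0.02, p01=0.25, p10=0.25, i_leak_a=0.05,
)
total = p_dyn + p_sc + p_leak

for name, p in (("dynamic", p_dyn), ("short-circuit", p_sc), ("leakage", p_leak)):
    print(f"{name:14s} {p * 1e3:7.1f} mW ({100 * p / total:4.1f} %)")
```

Because only the first term is quadratic in VDD, lowering the supply voltage reduces dynamic power fastest.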
Speed      Error
Seconds    > 50 %
Minute     25-50 %
Minutes    15-30 %
Hour       10-20 %
Hours      5-10 %
Source: after Massoud Pedram
Chain implementation has a lower overall switching activity than tree implementation for random inputs
Source: Timmermann, 2007
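The chain-versus-tree statement can be checked with signal probabilities. The sketch below assumes 2-input AND gates, statistically independent inputs, and a zero-delay model (no glitches). It uses the usual estimate that a node with 1-probability p makes a 0-to-1 transition with probability (1 - p) * p. The function names are illustrative.

```python
def and_out(p_a, p_b):
    """Probability that a 2-input AND output is 1 (independent inputs)."""
    return p_a * p_b


def activity(p_one):
    """Probability of a 0->1 transition at a node with 1-probability p_one."""
    return (1.0 - p_one) * p_one


def chain_activity(probs):
    """Sum of node activities for a chain: ((A.B).C).D ..."""
    total, p = 0.0, probs[0]
    for q in probs[1:]:
        p = and_out(p, q)
        total += activity(p)
    return total


def tree_activity(probs):
    """Sum of node activities for a balanced tree: (A.B).(C.D) ..."""
    total, level = 0.0, list(probs)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            p = and_out(level[i], level[i + 1])
            total += activity(p)
            nxt.append(p)
        if len(level) % 2:              # odd input passes to the next level
            nxt.append(level[-1])
        level = nxt
    return total


inputs = [0.5, 0.5, 0.5, 0.5]           # random inputs
chain = chain_activity(inputs)
tree = tree_activity(inputs)
print(f"chain: {chain:.4f}  tree: {tree:.4f}")
```

For four random inputs the chain sums to 91/256 and the tree to 111/256, which agrees with the slide. The model ignores glitching, which affects the two structures differently.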
Beneficial: postponing introduction of signals with a high transition rate (signals with signal probability close to 0.5)
Source: Irwin, 2000
Recap: Glitching
[Figure: timing diagram of inputs A, B, C and nodes X, Z for the input patterns ABC = 101 and 000 under a unit-delay model]
Source: Irwin, 2000
Basic elements:
Logic
Sequential
[Figure: curves for fcircuit = 1, 2, 5, 10, 20 plotted over fanout f]
Optimal fanout factor f for Pdyn is smaller than for performance (especially for large loads)
Example from the plot: fopt_energy = 3.53, fopt_performance = 4.47
[Figure: relative delay td and relative dynamic power Pdyn versus supply voltage VDD (0.8 to 2.4)]
Delay (td) and dynamic power consumption (Pdyn) are functions of VDD
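The two dependencies can be sketched numerically. The delay model below is the alpha-power law, td proportional to VDD / (VDD - Vth)^alpha, which is an assumption brought in for illustration; the slides do not state which model underlies the plot. The threshold voltage, the exponent, and the reference voltage are hypothetical values. Dynamic power at fixed frequency is taken as proportional to VDD^2.

```python
def relative_delay(vdd, vdd_ref=2.4, vth=0.5, alpha=1.3):
    """Delay relative to vdd_ref, using the alpha-power law (assumed model)."""
    def delay(v):
        return v / (v - vth) ** alpha
    return delay(vdd) / delay(vdd_ref)


def relative_pdyn(vdd, vdd_ref=2.4):
    """Dynamic power relative to vdd_ref at fixed frequency and capacitance."""
    return (vdd / vdd_ref) ** 2


for vdd in (2.4, 2.0, 1.6, 1.2, 0.8):
    print(f"VDD = {vdd:.1f} V  td = {relative_delay(vdd):5.2f}  "
          f"Pdyn = {relative_pdyn(vdd):4.2f}")
```

The delay rises steeply as VDD approaches the threshold voltage, while power falls quadratically.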
Micro transductors 08, Low Power 2 13
Multiple VDD
Main ideas:
Use of different supply voltages within the same design
High VDD for critical parts (high performance needed)
Low VDD for non-critical parts (only low performance demands)
At design phase:
Determine critical path(s) (see next slide)
High VDD for gates on those paths
Lower VDD on the other gates (in non-critical paths)
For low VDD: prefer gates that drive large capacitances (yields the largest energy benefits)
Level converters:
Necessary when a module at the lower supply drives a gate at the higher supply (step-up)
If a gate supplied with VDDL drives a gate supplied with VDDH, the PMOS never turns off
Possible implementation:
[Figure: level-converter circuit with supplies VDDH and VDDL and output Vout]
Data Paths
Data propagate through different data paths between registers (flipflops - FF)
[Figure: data paths between input and output flip-flops (FF), all clocked by CLK]
Copyright Sill, 2008
[Figure: timing diagram over time for gates G1 and G2 (signals C and Y), showing the delay of G1 and the slack for G1]
Minimum energy consumption is reached when all logic paths are critical (same delay)
Possible algorithm: clustered voltage scaling
Each path starts with VDDH and switches to VDDL (blue gates) when slack is available
Level conversion takes place in the flip-flops at the end of the paths
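For a single path, the idea of clustered voltage scaling can be sketched as a search for the earliest point at which the path may switch to VDDL and still meet the clock period. This is a simplified illustration, not the full netlist algorithm. The per-gate delays, the slowdown factor of a gate at VDDL, and the clock period are hypothetical values.

```python
def cvs_switch_point(delays_vddh, slowdown, t_clk):
    """Return k so that gates [0:k] use VDDH and gates [k:] use VDDL.

    delays_vddh: gate delays along the path when supplied with VDDH
    slowdown:    delay multiplier of a gate supplied with VDDL (assumed)
    t_clk:       available time for the whole path
    """
    n = len(delays_vddh)
    for k in range(n + 1):                      # smallest VDDH prefix first
        t_path = sum(delays_vddh[:k]) + slowdown * sum(delays_vddh[k:])
        if t_path <= t_clk:
            return k
    raise ValueError("path misses timing even with all gates at VDDH")


delays = [1.0, 1.0, 1.0, 1.0]                   # hypothetical gate delays (ns)
k = cvs_switch_point(delays, slowdown=1.5, t_clk=5.0)
assignment = ["VDDH"] * k + ["VDDL"] * (len(delays) - k)
print(assignment)
```

The path switches to VDDL only once and never back, so the level conversion can take place in the flip-flop at the end of the path.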
Clock Gating
Most popular method for power reduction of clock signals and functional units
Higher complexity of control logic
Higher power consumption
Timing is critical for avoiding clock glitches at the OR gate output
[Figure: register (Reg) and functional unit driven by the gated clock CLK]
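The effect of OR-based clock gating can be illustrated by counting the rising clock edges that reach a register. The sketch below models waveforms as lists of 0/1 samples. The waveforms and helper names are hypothetical, and the disable signal changes only at sample boundaries where it cannot create an extra edge.

```python
def gated_clock(clk, disable):
    """OR-gate clock gating: the output is held at 1 while disable is 1."""
    return [c | d for c, d in zip(clk, disable)]


def rising_edges(wave):
    """Count 0->1 transitions in a sampled waveform."""
    return sum(1 for a, b in zip(wave, wave[1:]) if a == 0 and b == 1)


clk = [0, 1] * 8                                # eight clock cycles
disable = [0] * 4 + [1] * 8 + [0] * 4           # unit idle during cycles 3 to 6
gclk = gated_clock(clk, disable)

print("edges without gating:", rising_edges(clk))
print("edges with gating:   ", rising_edges(gclk))
```

Four of the eight clock edges are suppressed, so the register and its clock load do not switch in those cycles.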
[Figure: clock-gated circuit with primary inputs (PI), flip-flops, combinational logic, primary outputs (PO), and a latch]
Source: L. Benini and G. De Micheli, Dynamic Power Management, Boston: Springer, 1998.
[Figure: power (mW) of an MPEG4 decoder with 896Kb SRAM, broken down by blocks including VDE and DEU: 30.6 mW versus 8.5 mW with clock gating]
Source: M. Ohashi, Matsushita, 2002
A Reference Datapath
[Figure: Input -> Register -> Combinational logic -> Register -> Output, clocked by CLK]
Supply voltage: Vref
Total capacitance switched per cycle: Cref
Clock frequency: fclk
Power consumption: Pref
Parallel Architecture
Each copy processes every Nth input and operates at fclk/N with reduced voltage
Supply voltage: VN ≤ Vref
N = degree of parallelism
[Figure: input register feeding combinational logic copies 1, 2, ..., each with its own register clocked at fclk/N, followed by an N-to-1 multiplexer and an output register clocked at fclk]
Pipelined Architecture
Reduces the propagation time of a block by factor N
Voltage can be reduced at constant clock frequency
Constant throughput
[Figure: block of area A split into N pipeline stages of area A/N each, clocked by CLK; functionality shown as Data and CLK waveforms]
Critical path delay: $T_{adder} + T_{comparator}$ (= 25 ns), so $f_{ref} = 40$ MHz
Total capacitance being switched: $C_{ref}$
$V_{DD} = V_{ref} = 5$ V
Power for reference datapath: $P_{ref} = C_{ref} V_{ref}^2 f_{ref}$
Source: Irwin, 2000
The clock rate can be reduced by half with the same throughput: $f_{par} = f_{ref} / 2$
$V_{par} = V_{ref} / 1.7$, $C_{par} = 2.15\,C_{ref}$
$P_{par} = (2.15\,C_{ref}) (V_{ref} / 1.7)^2 (f_{ref} / 2) \approx 0.36\,P_{ref}$
Source: Irwin, 2000
$f_{pipe} = f_{ref}$, $C_{pipe} = 1.1\,C_{ref}$, $V_{pipe} = V_{ref} / 1.7$
Voltage can be dropped while maintaining the original throughput
$P_{pipe} = C_{pipe} V_{pipe}^2 f_{pipe} = (1.1\,C_{ref}) (V_{ref} / 1.7)^2 f_{ref} \approx 0.37\,P_{ref}$
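The power ratios of the parallel and the pipelined datapath follow directly from P = C V^2 f. The sketch below evaluates both with the factors quoted on the slides; the function name is illustrative. With the rounded divisor 1.7 the products come out near 0.37 and 0.38, slightly above the quoted 0.36 and 0.37. Presumably the quoted figures were derived from an unrounded voltage ratio.

```python
def relative_power(c_factor, v_divisor, f_factor):
    """P / Pref for C = c_factor*Cref, V = Vref/v_divisor, f = f_factor*fref."""
    return c_factor * (1.0 / v_divisor) ** 2 * f_factor


p_par = relative_power(c_factor=2.15, v_divisor=1.7, f_factor=0.5)
p_pipe = relative_power(c_factor=1.1, v_divisor=1.7, f_factor=1.0)

print(f"parallel:  {p_par:.3f} * Pref")
print(f"pipelined: {p_pipe:.3f} * Pref")
```

Both architectures save roughly the same power at equal throughput. The parallel version pays with about twice the area, the pipelined one with additional registers.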
Approximate Trend
                 N-parallel proc.         N-stage pipeline proc.
Capacitance      N*Cref                   Cref
Voltage          Vref/N                   Vref/N
Frequency        fref/N                   fref
Dynamic power    Cref*Vref^2*fref/N^2     Cref*Vref^2*fref/N^2
Chip area        N times                  10-20% increase
Source: G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Kluwer Academic Publishers, 1998.
Guarded Evaluation
Latch preserves previous value of inputs to suppress activity
Could also use AND gates to mask inputs to zero (forced zero)
Precomputation
[Figure: precomputation architecture with precomputed inputs (register R1), gated inputs (register R2), precomputation logic g(X) generating the load-disable signal, and outputs]
Since those inputs don't affect the output, disable their input transitions
Trade area for energy
Source: Irwin, 2000
Design steps
1. Selection of precomputation architecture
2. Determination of precomputed and gated inputs (register R1 should be much smaller than R2)
3. Search for a good implementation of g(X)
4. Evaluation of potential energy savings based on input statistics (if savings are not sufficient, go to step 2 or 3 and try again)
Also works for multiple output functions where g(X) is the product of gj(X) over all j
Precomputation: Example
Binary Comparator
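[Figure: n-bit binary value comparator computing A > B; the most significant bits An and Bn are stored in R1, the remaining bits An-1, Bn-1, ..., A1, B1 in R2; the load-disable signal for R2 is derived from comparing An and Bn]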
Can achieve up to 75% power reduction with 3% area overhead and 1 to 5 additional gate delays in worst case path
Source: Irwin, 2000
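The comparator example can be sketched in software to show how often the large register stays idle. The model below assumes unsigned n-bit operands. When the most significant bits differ, they already decide A > B, so the load of the lower bits can be disabled. Function names are illustrative and the random input statistics are an assumption.

```python
import random


def compare_with_precomputation(a, b, n):
    """Return (a > b, load_disabled) for unsigned n-bit values a and b."""
    msb = 1 << (n - 1)
    a_msb, b_msb = bool(a & msb), bool(b & msb)
    if a_msb != b_msb:
        # MSBs differ: result is known, lower bits need not be loaded.
        return a_msb, True
    low_mask = msb - 1
    return (a & low_mask) > (b & low_mask), False


def disabled_fraction(n, samples, seed=0):
    """Fraction of random comparisons in which the R2 load is disabled."""
    rng = random.Random(seed)
    disabled = 0
    for _ in range(samples):
        a, b = rng.getrandbits(n), rng.getrandbits(n)
        result, load_disabled = compare_with_precomputation(a, b, n)
        assert result == (a > b)        # precomputation must not change the result
        disabled += load_disabled
    return disabled / samples


print(f"R2 load disabled in {100 * disabled_fraction(8, 10000):.1f} % of cycles")
```

For uniformly random inputs the most significant bits differ in half of all cases, so about 50 % of the loads of R2 are suppressed by a single precomputed bit pair.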
Adder Design
[Figure: block diagrams of ripple-carry, carry-select (duplicated blocks for carry-in 0 and 1), and carry look-ahead adders built from full adders (FA)]
Source: Mendelson, Intel
Adder Design
[Figure: energy (pJ) versus delay (ns) for different adder types]
Adder                        Delay (ns)
Ripple Carry                 54.27
Constant Width Carry Skip    28.38
Variable Width Carry Skip    21.84
Carry Lookahead              17.13
Carry Select                 19.56
Conditional Sum              20.05
Adders differ in energy and delay
Different adders for different applications
Also true for other units (multiplier, counter, ...)
Source: Callaway, Swartzlander, "Estimating the power consumption of CMOS adders", Proceedings of the 11th Symposium on Computer Arithmetic, 1993.
Bus Power
50% of dynamic power for interconnect switching (Magen, SLIP 04)
The on-chip network of the MIT Raw processor consumes 36% of total chip power (Wang et al. 2003)
[Figure: shared bus with bus drivers (inputs Ain, Bin, Cin, Din) and bus receivers (outputs Wout, Xout, Yout, Zout)]
Source: Irwin, 2000
Segmented buses (lower Cload)
Charge recovery buses
Bus multiplexing (lower fclk possible)
Code compression
Instruction loop buffers
Minimization of bit switching activity (fclk) by data encoding
Minimize voltage swing (VDD^2) using differential signaling
Source: Irwin, 2000
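One well-known example of reducing bit switching by data encoding is bus-invert coding. The slides do not name a specific code, so this technique is brought in here purely as an illustration. If more than half of the data lines would toggle, the word is sent inverted and an extra invert line is set. The simplified sketch below decides on the data lines only. The function names and the word sequence are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two words."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width):
    """Return a list of (bus_value, invert_bit) pairs; bus starts at 0."""
    mask = (1 << width) - 1
    prev_bus, encoded = 0, []
    for w in words:
        if hamming(prev_bus, w) > width // 2:
            bus, inv = ~w & mask, 1     # send inverted word
        else:
            bus, inv = w, 0
        encoded.append((bus, inv))
        prev_bus = bus
    return encoded


def count_plain(words):
    """Bit transitions on an unencoded bus that starts at 0."""
    prev, total = 0, 0
    for w in words:
        total += hamming(prev, w)
        prev = w
    return total


def count_encoded(encoded):
    """Bit transitions on data lines plus the invert line."""
    prev_bus, prev_inv, total = 0, 0, 0
    for bus, inv in encoded:
        total += hamming(prev_bus, bus) + (prev_inv ^ inv)
        prev_bus, prev_inv = bus, inv
    return total


words = [0x00, 0xFF, 0x00, 0xFF]        # worst case for an 8-bit plain bus
encoded = bus_invert_encode(words, width=8)
print("plain:", count_plain(words), "encoded:", count_encoded(encoded))
```

The receiver recovers each word by inverting the bus value whenever the invert line is 1. The price is one extra wire and the encoding logic.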
Shared resources incur switching overhead
Local bus structures reduce overhead
Bus segmentation
[Figure: shared bus B versus segmented bus B]
Base elements:
Functions
Procedures
Processes
Control structures
Coding styles
Source-code Transformations
Computation:
  A*B + A*C  ->  A*(B+C)
Communication:
  for (c = 1..N) { receive(A); B = c*A }  ->  receive(A); for (c = 1..N) { B = c*A }
Storage:
  for (c = 1..N) B[c] = A[c]*D[c]
  for (c = 1..N) F[c] = B[c]-1
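The three kinds of transformation can be written out as runnable before/after pairs. This is a sketch under assumptions. The communication rewrite is only valid if `receive` delivers the same value on every call. The fused loop shown for storage is one plausible transformation and is not stated on the slide. Results are collected in lists here so the variants can be compared.

```python
def computation_before(a, b, c):
    return a * b + a * c                # two multiplications, one addition


def computation_after(a, b, c):
    return a * (b + c)                  # one multiplication, one addition


def communication_before(receive, n):
    results = []
    for c in range(1, n + 1):
        a = receive()                   # one transfer per iteration
        results.append(c * a)
    return results


def communication_after(receive, n):
    a = receive()                       # single transfer, hoisted out of the loop
    return [c * a for c in range(1, n + 1)]


def storage_before(a_vals, d_vals):
    b_vals = [a * d for a, d in zip(a_vals, d_vals)]    # intermediate array B
    return [b - 1 for b in b_vals]


def storage_fused(a_vals, d_vals):
    # Hypothetical fused loop: no intermediate array is written or read.
    return [a * d - 1 for a, d in zip(a_vals, d_vals)]


print(computation_before(2, 3, 4), computation_after(2, 3, 4))
print(storage_before([1, 2], [3, 4]), storage_fused([1, 2], [3, 4]))
```

In each pair the result is unchanged while the number of operations, transfers, or memory accesses goes down.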
Slow down the processor to fill idle time
More delay allows a lower operating voltage
[Figure: processor speed over time between T1 and T2. At 3.3 V each task is followed by idle time; at 2.4 V the slower task fills the idle time. Same work, lower energy.]
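The energy benefit of running the same task at 2.4 V instead of 3.3 V can be estimated with a one-line calculation. The sketch assumes that the task needs the same number of clock cycles at both voltages and that energy per cycle is dominated by dynamic energy, which is proportional to VDD^2. Leakage and idle power are ignored.

```python
def relative_energy(v_low, v_high):
    """Energy at v_low relative to v_high for the same number of cycles."""
    return (v_low / v_high) ** 2


ratio = relative_energy(2.4, 3.3)
print(f"energy at 2.4 V: {100 * ratio:.0f} % of the energy at 3.3 V")
```

Under these assumptions the slower run needs roughly half the energy for the same work.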
Basic Elements:
Systems are:
Designed to deliver peak performance, but
Not needing peak performance most of the time
Power manager:
Puts idle components in low-power, non-operational states
Observes and controls the system
Power consumption of the power manager is negligible
Software power control - power management
DOZE: Most units stopped except on-chip cache memory (cache coherency)
NAP: Cache also turned off, PLL still on, time-out or external interrupt to resume
SLEEP: PLL off, external interrupt to resume
Transmeta LongRun
Detection of different workload scenarios
Based on runtime performance information
Adjusted accordingly: processor supply voltage, processor frequency
Clock frequency always stays within the limits required by the supply voltage to avoid clock skew problems
[Figure: LongRun operating points plotted over frequency (300 to 1000 MHz), with a typical operating region and a peak performance region]
Source: Transmeta
Non-linear effects influence the lifetime of batteries
Rate capacity:
If discharge currents are higher than allowed, the real capacity falls below the nominal capacity
Battery recovery:
Pulsed discharge increases nominal capacity
Based on recovery times (as long as there is no rate capacity effect)
[Figure: capacity (mAh) versus discharge current (mA); discharge current over time with idle periods]
Source: Timmermann, 2007
[Figure: distribution of electro-active species in a battery cell after recovery and when fully discharged]
Analytically very sound but computationally intensive
Cannot be used for online scheduling decisions
Performance of a bipolar lead-acid battery subjected to six current impulses. Pulse length=3 ms, rest period=22 ms.
[Figure: current and battery voltage over time]
Source: LaFollette, "Design and performance of high specific power, pulsed discharge, bipolar lead acid batteries", 10th Annual Battery Conference on Applications and Advances, Long Beach, pp. 43-47, January 1995.
[Figure: discharge profile A and discharge profile B]
Profile   Aver. current [mA]   Battery lifetime [ms]   Specif. energy [Wh/kg]
A         123.8                357053                  15.12
B         124.2                536484                  18.58
Backup
FSM: Clock-Gating
If a state has a self-loop in the state transition graph (STG), then the clock can be stopped whenever a self-loop is to be executed.
[Figure: state transition graph with states Si, Sj, Sk and edges labeled Xi/Zk, Xj/Zk, Xk/Zk]
Clock can be stopped when the (Xk, Sk) combination occurs.
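Finding the gating conditions amounts to listing all (state, input) pairs whose next state equals the current state. The sketch below uses a small hypothetical STG with states Si, Sj, Sk. Its topology is an assumption, since the original figure is not recoverable. Outputs are assumed not to require a state update during a self-loop.

```python
def gating_conditions(stg):
    """Return sorted (state, input) pairs that execute a self-loop.

    stg maps (state, input) -> (next_state, output).
    """
    return sorted(
        (state, inp)
        for (state, inp), (next_state, _out) in stg.items()
        if next_state == state
    )


# Hypothetical state transition graph (assumption, for illustration only).
example_stg = {
    ("Si", "Xi"): ("Sk", "Zk"),
    ("Sj", "Xj"): ("Sk", "Zk"),
    ("Sk", "Xk"): ("Sk", "Zk"),         # self-loop: state register need not be clocked
}

conditions = gating_conditions(example_stg)
print("clock can be stopped for:", conditions)
```

The resulting pairs define the activation function of the clock-gating logic.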
Trend: Interconnects
Interconnects
Propagation delays of global wires will be a multiple of the clock cycle.
Bus Multiplexing
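[Figure: dedicated connections S1 to D1 and S2 to D2 versus a shared (multiplexed) bus between sources S1, S2 and destinations D1, D2]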
For a shared (multiplexed) bus, the advantages of data correlation are lost (the bus carries samples from two uncorrelated data streams)
Bus sharing should not be used for positively correlated data streams
Bus sharing may prove advantageous for a negatively correlated data stream (where successive samples switch sign bits) - more random switching
[Figure: plot over bit position, from MSB (14) to LSB (0)]
Source: Irwin, 2000
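The loss of data correlation on a shared bus can be illustrated by counting bit transitions. The sketch assumes 16-bit two's-complement samples and two hypothetical, slowly varying streams with opposite signs. Each stream is positively correlated with itself, but interleaving them on one bus makes the upper bits toggle on every transfer.

```python
def to_word(value, width=16):
    """Two's-complement encoding of a signed value as an unsigned word."""
    return value & ((1 << width) - 1)


def transitions(samples, width=16):
    """Total number of bit flips between successive samples on one bus."""
    words = [to_word(s, width) for s in samples]
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


# Hypothetical slowly varying streams (assumption, for illustration only).
s1 = [1000 + i for i in range(64)]
s2 = [-1000 - i for i in range(64)]

dedicated = transitions(s1) + transitions(s2)   # one bus per stream
shared = transitions([x for pair in zip(s1, s2) for x in pair])     # interleaved

print("dedicated buses:", dedicated, "bit transitions")
print("shared bus:     ", shared, "bit transitions")
```

The shared bus switches far more bits for the same data, so multiplexing trades wiring for switching activity.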
Implementation
Power-Speed Control Knob
Workload Filter
Variable Power-Speed System
FIFO Input Buffer