
Optimizing Power @ Design Time

Interconnect and Clocks

Jan M. Rabaey

Low Power Design Essentials 2008

Chapter 6

Chapter Outline

- Trends and bounds
- An OSI approach to interconnect optimization
  - Physical layer
  - Data link and MAC
  - Network
  - Application
- Clock distribution


ITRS Projections

| Calendar Year | 2012 | 2018 | 2020 |
|---|---|---|---|
| Interconnect One Half Pitch | 35 nm | 18 nm | 14 nm |
| MOSFET Physical Gate Length | 14 nm | 7 nm | 6 nm |
| Number of Interconnect Levels | 12-16 | 14-18 | 14-18 |
| On-Chip Local Clock | 20 GHz | 53 GHz | 73 GHz |
| Chip-to-Board Clock | 15 GHz | 56 GHz | 89 GHz |
| # of Hi Perf. ASIC Signal I/O Pads | 2500 | 3100 | 3100 |
| # of Hi Perf. ASIC Power/Ground Pads | 2500 | 3100 | 3100 |
| Supply Voltage | 0.7-0.9 V | 0.5-0.7 V | 0.5-0.7 V |
| Supply Current | 283-220 A | 396-283 A | 396-283 A |

[Source: ITRS Roadmap, 2004, 2005]

Increasing Impact of Interconnect

- Interconnect is now exceeding transistors in
  - Latency
  - Power dissipation
  - Manufacturing complexity
- Direct consequence of scaling

Communication Dominant Part of Power Budget


[Figure: pie charts of the power budget for three design styles.
- μProcessor: clocks 40%, caches 20%, control 15%, execution units 15%, I/O drivers 10%
- FPGA: interconnect 65%, clock 21%, I/O 9%, CLB 5%
- Signal processor: clock, I/O, memory, logic]

Idealized Wire Scaling Model

| Parameter | Relation | Local Wire | Constant Length | Global Wire |
|---|---|---|---|---|
| W, H, t | | 1/S | 1/S | 1/S |
| L | | 1/S | 1 | 1/S_C |
| C | LW/t | 1/S | 1 | 1/S_C |
| R | L/WH | S | S^2 | S^2/S_C |
| t_p ~ CR | L^2/Ht | 1 | S^2 | S^2/S_C^2 |
| E | CV^2 | 1/(S U^2) | 1/U^2 | 1/(S_C U^2) |

Distribution of Wire Lengths on Chip

[Figure: distribution of wire lengths on chip. © IEEE 1998]

[Ref: J. Davis, C&S98]

Technology Innovations
- Reduce resistivity (e.g. copper)
- Reduce dielectric permittivity (e.g. aerogels or air)
- Novel interconnect media (carbon nanotubes, optical)
- Reduce wire lengths through 3D integration

[Figures: © IEEE 1998; pictures courtesy of IBM and IFC FCRP]

Logic Scaling
[Figure: power P [W] versus delay t_p [s] for logic on log-log axes (delay from 10^-15 s to 10^-3 s), with constant-energy contours from 10^-18 J to 10^-6 J. The power-delay product scales as P t_p ~ 1/S^3.]

[Ref: J. Davis, Proc01]

Interconnect Scaling
[Figure: (Length)^-2, L^-2 [cm^-2], versus delay τ [s] for interconnect on log-log axes, with a secondary axis for length L [cm]. Contours of constant L^-2 τ range from 10^-5 to 10^-13 s/cm^2, labeled with the parameter F = 0.1, 1, 10, 100, 1000. The product scales as L^-2 τ ~ S^2.]

[Ref: J. Davis, Proc01]

Lower Bounds on Interconnect Energy

Shannon's theorem on the maximum capacity of a communication channel (Claude Shannon):

$$C \le B \log_2\left(1 + \frac{P_S}{kTB}\right), \qquad E_{bit} = \frac{P_S}{C}$$

C: capacity in bits/sec; B: bandwidth; $P_S$: average signal power

$$E_{bit}(\min) = E_{bit}(C/B \to 0) = kT \ln 2$$

- Valid for an infinitely long bit transition ($C/B \to 0$)
- Equals about $2.9 \times 10^{-21}$ J/bit at room temperature, i.e. on the order of $kT \approx 4 \times 10^{-21}$ J

[Ref: J. Davis, Proc01]
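The Shannon bound above can be checked numerically. The sketch below solves the capacity formula for $P_S$ at equality, so that $E_{bit} = kT\,(2^{C/B} - 1)/(C/B)$, and evaluates it for a few values of $C/B$. The 300 K temperature and the function names are assumptions made for illustration.

```python
import math

K_BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K

def min_energy_per_bit(temperature_k=300.0):
    """Limit of E_bit for C/B -> 0: kT ln 2."""
    return K_BOLTZMANN * temperature_k * math.log(2)

def energy_per_bit(c_over_b, temperature_k=300.0):
    """E_bit = P_S / C with P_S taken from C = B log2(1 + P_S/(kTB))."""
    kt = K_BOLTZMANN * temperature_k
    # P_S = kTB (2^(C/B) - 1)  =>  E_bit = kT (2^(C/B) - 1) / (C/B)
    return kt * (2.0 ** c_over_b - 1.0) / c_over_b

e_min = min_energy_per_bit()
print(f"kT ln 2 at 300 K = {e_min:.3e} J/bit")
for ratio in (4.0, 1.0, 0.1, 0.001):
    print(f"C/B = {ratio:>6}: E_bit = {energy_per_bit(ratio):.3e} J/bit")
```

The energy per bit decreases monotonically toward $kT \ln 2$ as the channel is used less aggressively (smaller $C/B$), which is the sense in which the bound holds for an infinitely long bit transition.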

Reducing Interconnect Power/Energy


- Same philosophy as with logic: reduce capacitance, voltage (or voltage swing) and/or activity
- A major difference: sending one or more bits from one point to another is fundamentally a communications/networking problem, and it helps to consider it as such
- Abstraction layers are different:
  - For computation: device, gate, logic, micro-architecture
  - For communication: wire, link, network, transport
- Helps to organize along abstraction layers, well understood in the networking world: the OSI protocol stack

OSI Protocol Stack


- Reference model for wired and wireless protocol design
- Also a useful guide for conception and optimization of on-chip communication
- Layered approach allows for orthogonalization of concerns and decomposition of constraints
- No requirement to implement all layers of the stack
- Layered structure need not be maintained in the final implementation

[Figure: OSI stack, top to bottom: Presentation/Application, Session, Transport, Network, Data Link, Physical]

[Ref: M. Sgroi, DAC01]

The Physical Layer


- Transmit bits over physical interconnect medium (wire)
- Physical medium
  - Material choice, repeater insertion
- Signal waveform
  - Discrete levels, pulses, modulated sinusoids
- Voltages
  - Reduced swing
- Timing, synchronization

So far, on-chip communication has been almost exclusively level-based.

[Figure: OSI stack with the Physical layer highlighted]

Repeater Insertion

Optimal repeater insertion results in wire delay linear with L:

$$t_p \propto L \sqrt{(R_d C_d)(r_w c_w)}$$

with $R_d C_d$ and $r_w c_w$ the intrinsic delays of the inverter and the wire, respectively.

But: at a major energy cost!

Repeater Insertion Example


- 1 cm Cu wire in 90 nm technology (on intermediate layers)
  - $r_w$ = 250 Ω/mm; $c_w$ = 200 fF/mm
  - $t_p = 0.69\, r_w c_w L^2$ = 3.45 nsec
- Optimal driver insertion:
  - $t_{p,opt}$ = 0.5 nsec
  - Requires insertion of 13 repeaters
  - Energy per transition 8 times larger than just charging the wire (6 pJ versus 0.75 pJ)!

It pays to back off!
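The arithmetic of the repeater example can be reproduced in a few lines. This is a minimal sketch using only the numbers quoted on the slide; the optimal delay and the two energy values are taken as given rather than derived, and the variable names are illustrative.

```python
R_W = 250.0        # wire resistance per unit length, ohm/mm
C_W = 200e-15      # wire capacitance per unit length, F/mm
LENGTH_MM = 10.0   # 1 cm wire

def unrepeated_delay(r_w, c_w, length):
    """Distributed-RC delay of an unrepeated wire: 0.69 * r_w * c_w * L^2."""
    return 0.69 * r_w * c_w * length ** 2

tp = unrepeated_delay(R_W, C_W, LENGTH_MM)

# Values quoted on the slide for the optimally repeated wire (not derived here).
TP_OPT = 0.5e-9        # s
E_WIRE_ONLY = 0.75e-12 # J per transition, charging the wire alone
E_REPEATED = 6e-12     # J per transition, with 13 repeaters

speedup = tp / TP_OPT
energy_ratio = E_REPEATED / E_WIRE_ONLY

print(f"unrepeated delay: {tp * 1e9:.2f} ns")
print(f"speedup from repeaters: {speedup:.1f}x")
print(f"energy penalty: {energy_ratio:.0f}x")
```

The roughly 7x gain in delay is bought with an 8x increase in energy per transition, which is why backing off from the delay-optimal design pays.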

Wire Energy-Delay Trade-off


[Figure: normalized energy eNorm (0.1 to 1) versus normalized delay dNorm (1 to 8) for an L = 1 cm Cu wire in 90 nm CMOS. The curve starts at (dMin, eMax); the gap between the curve and the "wire energy only" floor is the repeater overhead.]

Multi-dimensional Optimization
- Design parameters: voltage, number of stages, buffer sizes
- Voltage scaling has the largest impact, followed by selection of the number of repeaters
- Transistor sizing is secondary

[Figure: optimal VDD (V) and number of stages versus normalized delay dNorm (1 to 8)]

Reduced Swing

[Figure: transmitter (TX) driving a reduced-swing wire into a receiver (RX)]

$E_{bit} = C\, V_{DD}\, V_{swing}$

Concerns:
- Overhead (area, delay)
- Robustness (supply noise, crosstalk, process variations)
- Repeaters?
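To get a feel for the reduced-swing energy expression, the sketch below evaluates $E_{bit} = C\,V_{DD}\,V_{swing}$ for a few swings. The 2 pF load (a 1 cm wire at 200 fF/mm) and the 1.0 V supply are assumed values chosen for illustration, not figures from this slide.

```python
def energy_per_bit(c_load, v_dd, v_swing):
    """Energy drawn from a V_DD supply to move c_load through v_swing."""
    return c_load * v_dd * v_swing

C_LOAD = 2e-12  # F, assumed: 1 cm wire at 200 fF/mm
V_DD = 1.0      # V, assumed supply

full_swing = energy_per_bit(C_LOAD, V_DD, V_DD)
for v_swing in (1.0, 0.5, 0.2, 0.1):
    e_bit = energy_per_bit(C_LOAD, V_DD, v_swing)
    print(f"swing {v_swing:.1f} V: {e_bit * 1e12:.2f} pJ "
          f"({e_bit / full_swing:.0%} of full swing)")
```

With the charge still drawn from $V_{DD}$, the saving is only linear in the swing. Generating the low level from a separate supply at $V_{swing}$ would make the energy $C V_{swing}^2$, a quadratic saving, at the cost of an extra supply.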

Traditional Level Converter


[Figure: traditional level converter between a VDDL domain and a VDDH domain, with input in, complementary outputs OUT, and load CL]

- Requires two discrete voltage levels
- Asynchronous level conversion adds extra delay

[Ref: H. Zhang, TVLSI00]

Avoiding Extra References


[Figure: reduced-swing driver and receiver operating from a single VDD (transistors P1-P3 and N1-N3, nodes in, in2, out, load CL), with its VTC and transient response]

[Ref: H. Zhang, TVLSI00]

Differential (Clocked) Signaling


[Figure: clocked differential driver and receiver: input in, reference REF, differential wires d and d_b each loaded by CL, clock clk, outputs out and out_b]

- Allows for very low swings (200 mV)
- Robust
- Quadratic energy savings
- But: doubling the wiring, extra clock signal, complexity

[Ref: T. Burd, UCB01]

Lower Bound on Signal Swing?


- Reduction of signal swing translates into higher power dissipation in the receiver: a trade-off between wire and receiver energy dissipation
- Reduced SNR impacts reliability: current on-chip interconnect strategies require a Bit Error Rate (BER) of zero (in contrast to communication and network links)
  - Noise sources: power supply noise, crosstalk
- Swings as low as 200 mV have been reported [Ref: Burd00]; 100 mV is definitely possible
- Further reduction requires crosstalk suppression: shielding, folding

[Figure: crosstalk suppression by shielding (signal wires separated by GND wires) and by folding]

Quasi-Adiabatic Charging
- Uses stepwise approximation of adiabatic (dis)charging
- Capacitors acting as charge reservoir
- Energy drawn from supply reduced by factor N

[Figure: driver with tank capacitors CT1, CT2, ..., CTN-1; the output rises to VDD in steps of VDD/N over time t]

[Ref: L. Svensson, ISLPED96]
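The factor-of-N saving of stepwise charging follows from the fact that charging a capacitance C through a resistive switch by a voltage step ΔV dissipates ½·C·ΔV². The sketch below sums that loss over N equal steps. It assumes ideal tank capacitors that hold their levels exactly, and the 10 pF load and 1.0 V supply are illustrative values.

```python
def stepwise_charging_loss(c_load, v_dd, n_steps):
    """Energy dissipated while charging c_load from 0 to v_dd in n equal steps."""
    step = v_dd / n_steps
    loss = 0.0
    v_now = 0.0
    for i in range(1, n_steps + 1):
        v_target = i * step
        # Each resistive charging step of size dV dissipates 0.5 * C * dV^2.
        loss += 0.5 * c_load * (v_target - v_now) ** 2
        v_now = v_target
    return loss

C_LOAD = 10e-12  # F, assumed pad load
V_DD = 1.0       # V, assumed supply

conventional = stepwise_charging_loss(C_LOAD, V_DD, 1)
for n in (1, 2, 4, 8):
    loss = stepwise_charging_loss(C_LOAD, V_DD, n)
    print(f"N = {n}: loss = {loss * 1e12:.3f} pJ "
          f"({loss / conventional:.3f} of conventional)")
```

The total loss is N·½·C·(V_DD/N)² = ½·C·V_DD²/N, so each doubling of the number of steps halves the dissipation, at the price of a slower transition and extra switches and tank capacitors.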

Charge Redistribution Schemes

[Figure: stacked differential bus pairs B1 and B0 between VDD and GND, with precharge (P) and evaluate (E) switches and receivers RX1 and RX0. Waveforms over the precharge, eval, and precharge phases show levels VDD, 3VDD/4, VDD/2, VDD/4, and GND for B1 = 1 and B0 = 0.]

- Charge recycled from top to bottom
- Precharge phase equalizes differential lines
- Energy/bit = $2C(V_{DD}/N)^2$
- Challenges: receiver design, noise margins

[Ref: H. Yamauchi, JSSC95]

Alternative Communication Schemes


Example: capacitively driven wires

Offers some compelling advantages:
- Reduced swing: swing is VDD/(n+1) without extra supply
- Reduced load: allows for smaller driver
- Reduced delay: capacitor pre-emphasizes edges

Pitchfork capacitors exploit sidewall capacitance.

[Ref: D. Hopkins, ISSCC07]

Signaling Protocols

- Globally asynchronous: self-timed handshaking protocol
- Locally synchronous: a physical-layer interface module supplies the local clock to each processor module
- Allows individual modules to dynamically trade off performance for energy efficiency

[Figure: network connected through handshake signals (din, reqin, ackin, dout, reqout, ackout) to a physical-layer interface module, which connects to a processor module (μProc, ALU, MPY, SRAM) through din, dout, clk, and done]

The Data Link /Media Access Layer


- Reliable transmission over the physical link and sharing of the interconnect medium between multiple sources and destinations (MAC)
- Bundling, serialization, packetizing
- Error detection and correction
- Coding
- Multiple-access schemes

[Figure: OSI stack with the Data Link layer highlighted]

Coding

[Figure: encoder and TX driving an (N+k)-bit link into RX and decoder]

Adding redundancy to the communication link (extra bits) to:
- Reduce transitions (activity encoding)
- Reduce energy/bit (error-correcting coding)

Activity Reduction Through Coding


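Example: bus-invert coding

[Figure: N-bit data D into the encoder, (N+1)-bit bus carrying Denc plus invert bit p, decoder recovering N-bit data D]

Data word D is inverted if its Hamming distance from the previous word is larger than N/2. #T is the number of bus transitions relative to the previous word.

| D | #T | Denc | p | #T |
|---|---|---|---|---|
| 00101010 | | 00101010 | 0 | |
| 00111011 | 2 | 00111011 | 0 | 2 |
| 11010100 | 7 | 00101011 | 1 | 1+1 |
| 00001101 | 5 | 00001101 | 0 | 3+1 |
| 01110110 | 6 | 10001001 | 1 | 2+1 |

[Ref: M. Stan, TVLSI95]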
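The bus-invert rule can be sketched in a few lines and checked against the example words on this slide. This is a minimal sketch, assuming the bus starts at all zeros and that the Hamming distance is measured against the data lines currently on the bus; the function names are illustrative.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def bus_invert_encode(words, n_bits, initial_bus=0):
    """Return a list of (denc, p) pairs for a sequence of n_bits-wide words."""
    mask = (1 << n_bits) - 1
    bus = initial_bus
    encoded = []
    for word in words:
        if hamming(word, bus) > n_bits // 2:
            denc, p = word ^ mask, 1   # send the inverted word, raise p
        else:
            denc, p = word, 0
        encoded.append((denc, p))
        bus = denc
    return encoded

def bus_invert_decode(encoded, n_bits):
    mask = (1 << n_bits) - 1
    return [denc ^ mask if p else denc for denc, p in encoded]

def count_transitions(values):
    """Total bit flips between consecutive bus values."""
    return sum(hamming(a, b) for a, b in zip(values, values[1:]))

N = 8
data = [0b00101010, 0b00111011, 0b11010100, 0b00001101, 0b01110110]
encoded = bus_invert_encode(data, N)

raw_transitions = count_transitions(data)
# Pack p as an extra bus line so its own transitions are counted too.
coded_transitions = count_transitions([(p << N) | d for d, p in encoded])

for d, p in encoded:
    print(f"{d:08b} p={p}")
print("raw:", raw_transitions, "encoded:", coded_transitions)
```

For this sequence the raw bus sees 2 + 7 + 5 + 6 = 20 transitions, while the encoded bus, including the invert line, sees 2 + 2 + 4 + 3 = 11.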

Bus-Invert Coding

[Figure: encoder with a register holding the previous bus value, producing Denc and invert bit p; bus; decoder recovering D]

- Gain: 25% (at best, for random data)
- Overhead:
  - Extra wire (and activity)
  - Encoder, decoder
- Not effective for correlated data

[Ref: M. Stan, TVLSI95]

Other Transition Coding Schemes


- Advanced bus-invert coding (e.g. partition bus into sub-components) (e.g. [M. Stan, TVLSI97])
- Coding for address busses (which often display sequentiality) (e.g. [L. Benini, DATE98])
- Full-fledged channel coding, borrowed from communication links (e.g. [S. Ramprasad, TVLSI99])
- Coding to reduce impact of Miller capacitance between neighboring wires [Ref: Sotiriadis, ASPDAC01]

| bit k-1 | bit k | bit k+1 | Delay factor g |
|---|---|---|---|
| ↑ | ↑ | ↑ | 1 |
| ↑ | ↑ | - | 1 + r |
| ↑ | ↑ | ↓ | 1 + 2r |
| - | ↑ | - | 1 + 2r |
| - | ↑ | ↓ | 1 + 3r |
| ↓ | ↑ | ↓ | 1 + 4r |

Maximum capacitance transition can be avoided by coding.
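The delay factors 1, 1 + r, 1 + 2r, 1 + 3r, and 1 + 4r follow a simple pattern: a neighbor switching in the same direction adds nothing, a quiet neighbor adds r, and a neighbor switching in the opposite direction adds 2r. The sketch below encodes that rule. It assumes r is the ratio of inter-wire capacitance to ground capacitance, which the slide does not define, and the transition encoding (+1, 0, -1) is an illustrative choice.

```python
RISING, QUIET, FALLING = 1, 0, -1

def delay_factor(prev_bit, bit, next_bit, r):
    """Delay factor g of a switching wire given its two neighbors.

    prev_bit, bit, next_bit are RISING, QUIET, or FALLING; bit must switch.
    """
    if bit == QUIET:
        raise ValueError("the victim wire must be switching")
    g = 1.0
    for neighbor in (prev_bit, next_bit):
        # |difference| is 0 (same direction), 1 (quiet), or 2 (opposite).
        g += r * abs(bit - neighbor)
    return g

R = 0.5  # assumed coupling ratio, for illustration only
cases = [
    (RISING, RISING, RISING),
    (RISING, RISING, QUIET),
    (RISING, RISING, FALLING),
    (QUIET, RISING, QUIET),
    (QUIET, RISING, FALLING),
    (FALLING, RISING, FALLING),
]
for case in cases:
    print(case, "g =", delay_factor(*case, R))
```

A code that forbids both neighbors from switching against the middle wire removes the 1 + 4r case and with it the worst-case delay.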

Error-Correcting Codes
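[Figure: N-bit data D into the encoder, (N+k)-bit Denc on the link, decoder recovering N-bit data D]

Example: (4,3,1) Hamming Code

P1 P2 B3 P4 B5 B6 B7 with

$P_1 \oplus B_3 \oplus B_5 \oplus B_7 = 0$
$P_2 \oplus B_3 \oplus B_6 \oplus B_7 = 0$
$P_4 \oplus B_5 \oplus B_6 \oplus B_7 = 0$

e.g. B3 wrong: the three checks evaluate to 1, 1, 0. Read as a binary number with the first check as the least significant bit, this gives 3, the position of the faulty bit.

- Adding redundancy allows for more aggressive scaling of signal swings and/or timing
- Simpler codes such as Hamming prove most effective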
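The Hamming example can be made concrete with a short encoder and syndrome decoder for the 7-bit word P1 P2 B3 P4 B5 B6 B7. This is a minimal sketch: each parity bit covers the positions whose binary index contains it (P1: 3, 5, 7; P2: 3, 6, 7; P4: 5, 6, 7), and the function names are illustrative.

```python
def encode(b3, b5, b6, b7):
    """Build the 7-bit word [P1, P2, B3, P4, B5, B6, B7] with even parity."""
    p1 = b3 ^ b5 ^ b7
    p2 = b3 ^ b6 ^ b7
    p4 = b5 ^ b6 ^ b7
    return [p1, p2, b3, p4, b5, b6, b7]

def syndrome(word):
    """Return the 1-based position of a single-bit error, or 0 if none."""
    p1, p2, b3, p4, b5, b6, b7 = word
    s1 = p1 ^ b3 ^ b5 ^ b7
    s2 = p2 ^ b3 ^ b6 ^ b7
    s4 = p4 ^ b5 ^ b6 ^ b7
    return s1 + 2 * s2 + 4 * s4

def correct(word):
    """Flip the bit indicated by the syndrome, if any."""
    fixed = list(word)
    position = syndrome(fixed)
    if position:
        fixed[position - 1] ^= 1
    return fixed

sent = encode(1, 0, 1, 1)
received = list(sent)
received[2] ^= 1  # corrupt B3 (position 3)

print("sent:     ", sent)
print("received: ", received)
print("syndrome: ", syndrome(received))
print("corrected:", correct(received))
```

Because any single-bit error is located and repaired, the link can tolerate occasional errors, which is what permits lower signal swings or tighter timing.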

Media Access
- Sharing of physical media over multiple data streams increases capacitance and activity (see Chapter 5), but reduces area
- Many multi-access schemes known from communications
  - Time domain: Time-Division Multiple Access (TDMA)
  - Frequency domain: narrow band, code division multiplexing
- Buses based on arbitration-based TDMA most common in today's ICs

Bus Protocols and Energy


- Some lessons from the communications world:
  - When utilization is low, simple schemes are more effective
  - When traffic is intense, reservation of resources minimizes overhead and latency (collisions, resends)
- Combining the two leads to energy efficiency

Example: SiliconBackplane MicroNetwork

[Figure: bus timeline with arbitration and command phases for the current slot]

Independent arbitration for every cycle includes two phases:
- Distributed TDMA for guaranteed latency/bandwidth
- Round robin for random access

[Courtesy: Sonics, Inc]
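The two-phase arbitration idea, a reserved TDMA slot first and round robin as a fallback, can be illustrated with a small model. This is a hypothetical sketch, not the Sonics implementation: the slot table, the integer initiator IDs, and the `arbitrate` function are all assumptions made for illustration.

```python
def arbitrate(cycle, requests, tdma_slots, rr_pointer, num_initiators):
    """Pick the bus owner for one cycle.

    requests:    set of initiator IDs requesting the bus this cycle
    tdma_slots:  list mapping slot index to its reserved owner (or None)
    rr_pointer:  initiator ID that round robin considers first
    Returns (winner or None, updated rr_pointer).
    """
    # Phase 1: the slot's reserved owner wins if it is requesting.
    owner = tdma_slots[cycle % len(tdma_slots)]
    if owner is not None and owner in requests:
        return owner, rr_pointer

    # Phase 2: an unused slot falls back to round robin.
    for offset in range(num_initiators):
        candidate = (rr_pointer + offset) % num_initiators
        if candidate in requests:
            return candidate, (candidate + 1) % num_initiators
    return None, rr_pointer

slots = [0, None, 1, None]  # initiator 0 owns slot 0, initiator 1 owns slot 2
pointer = 0
traffic = [{0, 2}, {1, 2}, {2}, set()]
for cycle, reqs in enumerate(traffic):
    winner, pointer = arbitrate(cycle, reqs, slots, pointer, num_initiators=3)
    print(f"cycle {cycle}: requests={sorted(reqs)} winner={winner}")
```

Reserved slots give their owners guaranteed latency and bandwidth, while slots that go unused are not wasted but handed to whichever initiator is next in the round-robin order.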

The Network Layer


- Topology-independent end-to-end communication over multiple data links (routing, bridging, repeaters)
- Topology
- Static versus dynamic configuration/routing

Becoming more important in today's complex multi-processor designs: the Network-on-a-Chip (NoC)

[Figure: OSI stack with the Network layer highlighted]

[Ref: G. De Micheli, Morgan-Kaufman06]

Network-on-a-Chip (NoC)

- Dedicated networks with reserved links preferable for high-traffic channels, but: limited connectivity, area overhead
- Flexibility an increasing requirement in multi- (many-) core chip implementations

The Network Trade-offs


Interconnect-oriented architecture trades off flexibility, latency, energy and area-efficiency through the following concepts:
- Locality: eliminate global structures
- Hierarchy: expose locality in communication requirements
- Concurrency/Multiplexing

Very similar to architectural space trade-offs.

[Figure: dedicated wiring (local logic) compared with a network-on-a-chip (processors, routers, network wires)]

[Courtesy: B. Dally, Stanford]

Networking Topology
- Homogeneous
  - Crossbar, Butterfly, Torus, Mesh, Tree, ...
- Heterogeneous
  - Hierarchy

[Figure: crossbar, tree, and mesh (FPGA) topologies]

Network Topology Exploration

[Figure: energy x delay versus Manhattan distance for mesh and binary tree networks. Short connections in the tree are redundant.]

[Figure: energy x delay versus Manhattan distance for mesh, binary tree, and mesh + inverse clustering. Inverse clustering complements the mesh.]

[Ref: V. George, Springer01]

Circuit-Switched versus Packet Based


- On-chip reality: wires (bandwidth) are relatively cheap, buffering and routing expensive
- Packet-switched approach versatile
  - Preferred approach in large networks
  - But routers come with large overhead
  - Case study Intel: 18% of power in link, 82% in router
- Circuit-switched approach attractive for high-data-rate quasi-static links
- Hierarchical combination often preferred choice

[Figure: clusters (C) connected by a bus over short distances, with routers (R) linking the buses into hierarchical circuit- and packet-switched networks for longer connections]

Example: The Pleiades Network-on-a-Chip


- Configurable platform for low-energy communication and signal-processing applications (see Chapter 5)
- Allows for dynamic task-level reconfiguration of process network
- Energy-efficient flexible network essential to the concept

[Figure: Pleiades architecture: μP with configuration bus, arithmetic modules, configurable logic, and dedicated arithmetic attached through network interfaces to a configurable interconnect]

[Ref: H. Zhang, JSSC00]

Pleiades Network Layer


- Hierarchical reconfigurable mesh network
- Network statically configured at start of session and ripped up at end
- Structured approach reduces interconnect energy by a factor of 7 over a straightforward crossbar

[Figure: level-1 mesh and level-2 mesh connecting clusters through universal switchboxes and hierarchical switchboxes]

Top Layers of the OSI Stack


- Abstracts communication architecture to system and performs data formatting and conversion
- Establishes and maintains end-to-end communications
  - Flow control, message reordering, packet segmentation and reassembly
- Example: establish, maintain and rip up connections in dynamically reconfigurable Systems-on-a-Chip
  - Important in power management

[Figure: OSI stack with the Presentation/Application, Session, and Transport layers highlighted]

What About Clock Distribution?


- Clock easily the most energy-consuming signal of a chip
  - Largest length
  - Largest fanout
  - Most activity (α = 1)
- Skew control adding major overhead
  - Intermediate clock repeaters
  - De-skewing elements
- Opportunities
  - Reduced swing
  - Alternative clock distribution schemes
  - Avoiding a global clock altogether

Reduced-Swing Clock Distribution


- Similar to reduced-swing interconnect
- Relatively easy to implement
- But: extra delay in flip-flops adds directly to clock period

Example: half-swing clock distribution scheme

[Figure: clock waveforms between VDD and GND. Regular 2-phase clock: NMOS clock and PMOS clock. Half-swing clock: PMOS clock in the upper half and NMOS clock in the lower half of the supply range. © IEEE 1995]

[Ref: H. Kojima, JSSC95]

Alternative Clock Distribution Schemes


- Example: transmission-line based clock distribution
- Canceling skew in perfect transmission line scenario

[Figure: transmission-line clock distribution. © IEEE 2006]

[Ref: V. Prodanov, CICC06]

Summary
- Interconnect is an important component of overall power dissipation
- Structured approach with exploration at different abstraction layers most effective
- A lot to be learned from the communications and networking community, yet techniques must be applied judiciously
  - Cost relationship between active and passive components different
- Some exciting possibilities for the future: 3D integration, novel interconnect materials, optical or wireless I/O

References
Books and Book Chapters
- T. Burd, "Energy-Efficient Processor System Design," http://bwrc.eecs.berkeley.edu/Publications/2001/THESES/energ_eff_process-sys_des/index.htm, UCB, 2001.
- G. De Micheli and L. Benini, "Networks on Chips: Technology and Tools," Morgan Kaufmann, 2006.
- V. George and J. Rabaey, "Low-Energy FPGAs: Architecture and Design," Springer, 2001.
- J. Rabaey, A. Chandrakasan, B. Nikolic, "Digital Integrated Circuits: A Design Perspective," 2nd ed., Prentice Hall, 2003.
- C. Svensson, "Low-Power and Low-Voltage Communication for SoCs," in C. Piguet, "Low-Power Electronics Design," Ch. 14, CRC Press, 2005.
- L. Svensson, "Adiabatic and Clock-Powered Circuits," in C. Piguet, "Low-Power Electronics Design," Ch. 15, CRC Press, 2005.
- G. Yeap, "Special Techniques," in "Practical Low Power Digital VLSI Design," Ch. 6, Kluwer Academic Publishers, 1998.

Articles
- L. Benini et al., "Address bus encoding techniques for system-level power optimization," Proceedings DATE 98, pp. 861-867, Paris, February 1998.
- T. Burd et al., "A Dynamic Voltage Scaled Microprocessor System," IEEE ISSCC Digest of Technical Papers, pp. 294-295, Feb. 2000.
- M. Chang et al., "CMP Network-on-Chip Overlaid with Multi-Band RF Interconnect," International Symposium on High-Performance Computer Architecture, Feb. 2008.
- D.M. Chapiro, "Globally Asynchronous Locally Synchronous Systems," PhD thesis, Stanford University, 1984.
- W. Dally, "Route Packets, Not Wires: On-Chip Interconnect Networks," Proceedings DAC 2001, pp. 684-689, Las Vegas, June 2001.
- J. Davis and J. Meindl, "Is Interconnect the Weak Link?," IEEE Circuits and Systems Magazine, pp. 30-36, March 1998.
- J. Davis et al., "Interconnect Limits on Gigascale Integration (GSI) in the 21st Century," Proceedings of the IEEE, Vol. 89, No. 3, pp. 305-324, March 2001.
- D. Hopkins et al., "Circuit techniques to enable 430Gb/s/mm2 proximity communication," IEEE International Solid-State Circuits Conference, vol. XL, pp. 368-369, February 2007.
- H. Kojima et al., "Half-Swing Clocking Scheme for 75% Power Saving in Clocking Circuitry," IEEE Journal of Solid-State Circuits, vol. 30, no. 4, pp. 432-435, April 1995.
- E. Kusse and J. Rabaey, "Low-energy embedded FPGA structures," Proceedings ISLPED 98, pp. 155-160, Monterey, Aug. 1998.
- V. Prodanov and M. Banu, "GHz Serial Passive Clock Distribution in VLSI using Bidirectional Signaling," Proceedings CICC 06.
- S. Ramprasad et al., "A coding framework for low-power address and data busses," IEEE Transactions on VLSI Systems, Vol. 7, No. 2, pp. 212-221, June 1999.
- M. Sgroi et al., "Addressing the System-on-a-Chip Woes Through Communication-Based Design," Proceedings DAC 2001, pp. 678-683, Las Vegas, June 2001.
- P. Sotiriadis and A. Chandrakasan, "Reducing Bus Delay in Submicron Technology Using Coding," Proceedings ASPDAC Conference, Yokohama, January 2001.

- M. Stan and W. Burleson, "Bus-Invert Coding for Low-Power I/O," IEEE Transactions on VLSI Systems, pp. 48-58, March 1995.
- M. Stan and W. Burleson, "Low-Power Encodings for Global Communication in CMOS VLSI," IEEE Transactions on VLSI Systems, pp. 444-455, Dec. 1997.
- V. Sathe, J.-Y. Chueh, and M. C. Papaefthymiou, "Energy-Efficient GHz-Class Charge-Recovery Logic," IEEE Journal of Solid-State Circuits, vol. 42, no. 1, pp. 38-47, January 2007.
- L. Svensson et al., "A sub-CV2 pad Driver with 10 ns Transition Time," Proc. ISLPED 96, Monterey, Aug. 12-14, 1996.
- D. Wingard, "Micronetwork-Based Integration for SOCs," Proceedings DAC 01, pp. 673-677, Las Vegas, June 2001.
- H. Yamauchi et al., "An Asymptotically Zero Power Charge Recycling Bus," IEEE Journal of Solid-State Circuits, vol. 30, no. 4, pp. 423-431, April 1995.
- H. Zhang, V. George and J. Rabaey, "Low-Swing On-Chip Signaling Techniques: Effectiveness and Robustness," IEEE Transactions on VLSI Systems, Vol. 8, No. 3, pp. 264-272, June 2000.
- H. Zhang et al., "A 1V Heterogeneous Reconfigurable Processor IC for Baseband Wireless Applications," IEEE Journal of Solid-State Circuits, vol. 35, no. 11, pp. 1697-1704, Nov. 2000.
