
Technology Optimization and Performance Projections
David J. Frank
12/4/09
International Winter School for Graduate Students
IIT, Bombay, India

1. Introduction

The End of Scaling?

[Figure: log(Performance) vs. Time (yr)]

The End of Scaling is Optimization

[Figure: log(System Performance) vs. Miniaturization]

Stop when you get to the top.

Outline
1. Review limitations to scaling
2. Optimizing technology within constraints
3. Optimization results
4. Technology projections

1. Limitations to Scaling
1. Electrostatic constraints
2. Quantum mechanical leakage currents
3. Discreteness of matter and energy
4. Thermodynamic limitations
5. Practical and environmental constraints on power
Basic idea of scaling: adjust dimensions, voltages, and doping to achieve a smaller FET with the same electrostatic behavior.

Electrostatic constraints on FET design

1. A good design should have
A. High output resistance
B. High gain
C. Low sensitivity to variations
D. High transconductance
E. High drain current
F. High speed
(A, B, C: long channel behavior. D, E, F: short channel behavior.)

2. Must choose a compromise:
Short, but not so short that 2D effects kill A, B, C.

Quantum Mechanical Tunneling Leakage Currents

Currents increase exponentially as the barriers become thinner. Everything becomes leaky.

Gate insulator tunneling
Subthreshold leakage
Direct source-to-drain tunneling
Drain-to-body tunneling

[Figure: FET 'ON' and FET 'OFF' bias states (Vdd, Gnd), and a source / channel / drain diagram with electrons (e-).]

Atomistic effects
The number of dopant atoms in the depletion layer of a MOSFET has been scaling roughly as $L_{eff}^{1.5}$.
The statistical variation in the number of dopants, N, varies as $N^{1/2}$, so the relative variation grows as $N^{-1/2}$, causing increasing VT uncertainty for small N.

[Figure: 249,403,263 Si atoms; 68,743 donors; 13,042 acceptors]

[D. J. Frank, et al., 1999 Symp. VLSI Tech.]
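The dopant-count argument can be illustrated with a few lines of arithmetic. This is a minimal sketch that assumes Poisson counting statistics (sigma_N = sqrt(N)) and the rough N ~ L_eff^1.5 trend quoted above; the function names and the reference count of 1000 dopants at 100 nm are hypothetical values chosen only for illustration.

```python
import math


def relative_dopant_fluctuation(n_dopants: float) -> float:
    """Relative fluctuation sigma_N / N = 1 / sqrt(N), assuming Poisson statistics."""
    if n_dopants <= 0:
        raise ValueError("n_dopants must be positive")
    return 1.0 / math.sqrt(n_dopants)


def scaled_dopant_count(n_ref: float, l_ref_nm: float, l_new_nm: float) -> float:
    """Scale a reference dopant count with the rough N ~ L_eff**1.5 trend."""
    return n_ref * (l_new_nm / l_ref_nm) ** 1.5


if __name__ == "__main__":
    n_ref = 1000.0  # hypothetical dopant count at a hypothetical L_eff of 100 nm
    for l_eff in (100.0, 50.0, 25.0):
        n = scaled_dopant_count(n_ref, 100.0, l_eff)
        rel = relative_dopant_fluctuation(n)
        print(f"L_eff = {l_eff:5.1f} nm  N ~ {n:7.1f}  sigma_N/N = {rel:.3f}")
```

Under these assumptions, shrinking L_eff by 4x cuts the dopant count by 8x and raises the relative fluctuation by about 2.8x.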

Thermodynamic limitations
The Boltzmann distribution determines the subthreshold slope and leakage current, VT, and diode leakage currents, too.

$I_{subVT} = I_0 \exp\left[ e\,(V_G - V_T)/kT \right]$   (e is the electron charge)

VT can only be scaled by reducing the temperature, which is not acceptable for many applications.
Speed is very sensitive to the VT/VDD ratio.
Irreversible computation => all switching energy is converted to heat.
All leakage currents and IR drops are irreversible => more heat.
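To make the Boltzmann limit concrete, the sketch below evaluates the ideal subthreshold swing, ln(10) kT/e, implied by an exponential of the form I = I0 exp(e(VG - VT)/kT). It assumes an ideal device with no body-effect factor; the function names are illustrative and not taken from the slides.

```python
import math

K_B = 1.380649e-23     # Boltzmann constant, J/K
Q_E = 1.602176634e-19  # electron charge, C


def thermal_voltage(temp_k: float) -> float:
    """kT/e in volts."""
    return K_B * temp_k / Q_E


def ideal_subthreshold_swing_mv(temp_k: float) -> float:
    """Gate voltage (mV) needed to change the current by one decade."""
    return 1000.0 * math.log(10.0) * thermal_voltage(temp_k)


def subthreshold_current(vg: float, vt: float, temp_k: float, i0: float = 1.0) -> float:
    """I = I0 * exp(e (VG - VT) / kT), valid below threshold."""
    return i0 * math.exp((vg - vt) / thermal_voltage(temp_k))


if __name__ == "__main__":
    for temp in (300.0, 200.0, 77.0):
        print(f"T = {temp:5.1f} K  swing = {ideal_subthreshold_swing_mv(temp):5.1f} mV/decade")
```

The swing is just under 60 mV/decade at 300 K and falls linearly with temperature, which is why lowering VT at fixed leakage requires cooling.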

[Figure: source / channel / drain diagram with electron (e-).]

Practical and Environmental Issues

Power consumption and heat removal are limited by practical considerations.
Low power applications are often battery powered.
Many must be lightweight => power < ~few watts.
Disposable batteries can cost >> $500/watt over life of device.
Rechargeables can cost > $50/watt over life of device.
Home electronics is limited to <~1000 W by heating of the room and cost of electricity.
High performance is limited by difficulty of heat removal from chip (~100 W/chip). (Cost of electricity is ~$5/watt over life.)

2. Optimizing Technology within Constraints

Practicality imposes power constraints.
Electrostatics imposes geometric constraints.
Thermodynamics imposes voltage constraints.
Quantum mechanics imposes miniaturization constraints due to tunneling.

Fixed architectural complexity
+ Fixed power constraints
+ Device physics
= Existence of an optimal technology with maximal performance.

[Figure: Power vs. Miniaturization (large to small), split into dynamic power and leakage power. Leakage increases due to tunneling effects.]
[Figure: log(Performance) vs. Miniaturization (large to small). Declining available dynamic power overwhelms speed improvements of scaling.]

Background: Schematic organization of an optimization program

Goal: optimize device technology to maximize chip-level performance, subject to power constraints.

[Figure: flow chart. Inputs: fixed parameters; variables (initial guess). Models: Device Structure, IV Model, Leakage Model, Area Model, Thermal Model, Wiring statistics, Wire Capacitance. Outputs: Delay (with tolerance adjustments and an adjustment for latency of long paths), Leakage Power (with tolerance adjustments), Total Power. A constrained optimizer returns new values (an improved guess) for the variables.]
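The loop in this schematic (guess variables, evaluate delay and power models, let a constrained optimizer propose better values) can be sketched with a toy model. Everything below is an assumption made for illustration: the alpha-power-style delay, the leakage expression, the constants, and the brute-force grid search standing in for the constrained optimizer. It is not the program described in the slides.

```python
def gate_delay(vdd: float, vt: float, alpha: float = 1.3) -> float:
    """Toy delay in arbitrary units (assumed form): VDD / (VDD - VT)**alpha."""
    return vdd / (vdd - vt) ** alpha


def total_power(vdd: float, vt: float, c_sw: float = 1.0,
                i0: float = 10.0, swing: float = 0.1) -> float:
    """Toy power: dynamic C*V^2*f plus leakage V*I0*10^(-VT/swing), arbitrary units."""
    freq = 1.0 / gate_delay(vdd, vt)
    p_dyn = c_sw * vdd ** 2 * freq
    p_leak = vdd * i0 * 10.0 ** (-vt / swing)
    return p_dyn + p_leak


def optimize(p_max: float, steps: int = 60):
    """Grid search for the (VDD, VT) pair with the highest 1/delay and power <= p_max.

    Returns (frequency, vdd, vt), or None if no grid point meets the budget.
    """
    best = None
    for i in range(1, steps + 1):
        vdd = 0.2 + 1.0 * i / steps      # scan VDD from just above 0.2 V to 1.2 V
        for j in range(1, steps):
            vt = vdd * j / steps         # keep 0 < VT < VDD
            if total_power(vdd, vt) > p_max:
                continue                 # violates the power constraint
            freq = 1.0 / gate_delay(vdd, vt)
            if best is None or freq > best[0]:
                best = (freq, vdd, vt)
    return best


if __name__ == "__main__":
    for budget in (0.1, 1.0, 10.0):
        print(budget, optimize(budget))
```

Because the feasible set only grows as the budget rises, the best frequency found never decreases with power, which mirrors the qualitative behavior the later result slides report.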

Models and Approximations

System Assumptions
Processor chip is assumed to have a fixed number of cores, each with a specified number of logic gates.
Only the logic within the cores is considered within the optimizations.
The clock and memory aspects of the chip are assumed to scale in the same way as the logic (delay, power, and area).
Core-to-core and core-to-memory communication is not dealt with.

[Figure: chip components. Logic and Repeaters: treat in detail. Clock and Memory: fudge (treat these by simple scaling from the logic part).]

How much area do the processor cores take?

100% to 25%, generally decreasing with generation:
100%: Alpha 21264 ('96), 15M FETs, L1 cache only
70%: Prescott, 125M FETs
40%: Dothan, 140M FETs
40%: Power4, 174M FETs
~25%: 2 cores, 1.72B FETs

Area usage within a processor core

Approximate area fractions for a high-performance microprocessor core in leading-edge technology (data from M. Scheuermann and M. Wisniewski).

[Figure: pie charts of core area, broken down in successive levels of detail. Categories: Caches (L1); Register files; Buffers & extra latches; Macros; Custom & RLMs; Latches and LCBs; Logic in use; Unused Logic; Unused/caps; Caps, Clock dist., Unused.]

Processors built with nanotechnology are likely to have similar area usage statistics.
Nanotechnology may require additional area allocations for defective circuitry.
Estimates of power and computational densities should take into account realistic area efficiencies.

Optimization Approaches
1. Engineering approach:
Maximize system performance, at fixed power.
Use total logic transition rate (LTR):
LTR = N_gates x (activity factor / logic depth) x (1 / Delay)
Relatively little dependence on architectural details.
2. Business approach:
Maximize Return on Investment (ROI).
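The LTR figure of merit is a one-line formula, shown here as a small helper. The function name and the example inputs (10 million gates, 20 ps gate delay) are hypothetical; the activity-to-depth ratio of 0.012 matches the typical value quoted later in the deck.

```python
def logic_transition_rate(n_gates: float, activity_factor: float,
                          logic_depth: float, gate_delay_s: float) -> float:
    """LTR = N_gates * (activity factor / logic depth) * (1 / delay), in transitions/s."""
    if logic_depth <= 0 or gate_delay_s <= 0:
        raise ValueError("logic_depth and gate_delay_s must be positive")
    return n_gates * (activity_factor / logic_depth) / gate_delay_s


if __name__ == "__main__":
    # Hypothetical inputs: 1e7 gates, activity 0.12, logic depth 10, 20 ps delay.
    ltr = logic_transition_rate(1e7, 0.12, 10.0, 20e-12)
    print(f"LTR = {ltr:.2e} transitions/s")
```

Since LTR only needs a gate count, an activity-to-depth ratio, and a mean gate delay, it can be evaluated without committing to a particular microarchitecture.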

FET Model
Using a general temperature-dependent short-channel FET model in which VT, tD, and tox are coupled, halo doping effects are included, and VT is set by the doping.
Modified alpha power model:

[Equation not legible in this copy: I_D(V_GS), written in terms of W, t_ox, L_CH, E_C, kT/e, and a Fermi-Dirac integral of (V_GS - V_T)/(kT/e).]

[Figure labels: 10W FET, Lg=28nm; 1mW FET, Lg=45nm.]

Circuit Delay Estimation

Basic circuit elements are:
FI=2, FO=1.65 wire-loaded NAND gates for logic
inverters for repeaters, FO ~ 1.2

Delay calculations:

$\tau_1 = \frac{V_{DD}\,(C_{parasitic} + C_{wire} + C_{gateload})}{2\,I^{*}_{Deff}}$
(*) Current is adjusted to account for noise and variations.

$\tau_2 = R_{wire}\,(C_{wire} + C_{gateload})$

$\tau_3 = \frac{L_{wire}}{c/2}$   (propagation delay)

$\tau = \frac{\left(\tau_1^{4/3} + \tau_2^{4/3} + \tau_3^{4/3}\right)^{3/4}}{0.5 + (1 - V_T/V_{DD})\,(1 + \ldots)}$
Final delay empirically merges the separate components [Eble's thesis], with a correction for VT/VDD. The symbol after "1 +" is missing in this copy.
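The three delay components and their empirical merge can be sketched as follows. This reads the slide as a drive delay VDD C / (2 I), a wire RC delay, a time-of-flight delay at c/2, and a merge of the form (sum of tau_i^(4/3))^(3/4). That reading, the function names, and the example values are assumptions, and the VT/VDD correction factor is deliberately left out because its exact form is not recoverable here.

```python
C_LIGHT_CM_PER_S = 2.998e10  # speed of light in vacuum, cm/s


def drive_delay(vdd, c_parasitic, c_wire, c_gateload, i_deff):
    """tau_1 = VDD * (C_parasitic + C_wire + C_gateload) / (2 * I_Deff)."""
    return vdd * (c_parasitic + c_wire + c_gateload) / (2.0 * i_deff)


def wire_rc_delay(r_wire, c_wire, c_gateload):
    """tau_2 = R_wire * (C_wire + C_gateload)."""
    return r_wire * (c_wire + c_gateload)


def flight_delay(l_wire_cm):
    """tau_3 = L_wire / (c / 2), assuming signals travel at half the speed of light."""
    return l_wire_cm / (C_LIGHT_CM_PER_S / 2.0)


def merged_delay(tau1, tau2, tau3, p=4.0 / 3.0):
    """Empirical merge (tau1**p + tau2**p + tau3**p)**(1/p); p = 4/3 is assumed."""
    return (tau1 ** p + tau2 ** p + tau3 ** p) ** (1.0 / p)


if __name__ == "__main__":
    # Hypothetical gate: 1 V supply, 2 fF total load, 50 uA effective current.
    t1 = drive_delay(1.0, 0.5e-15, 1.0e-15, 0.5e-15, 50e-6)
    t2 = wire_rc_delay(200.0, 1.0e-15, 0.5e-15)   # hypothetical 200 ohm wire
    t3 = flight_delay(0.01)                       # hypothetical 100 um wire
    print(t1, t2, t3, merged_delay(t1, t2, t3))
```

A merge with exponent p between 1 and infinity always lies between the largest single component and the plain sum, so it interpolates between "slowest mechanism dominates" and "delays add".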

Power Calculation

$P_{TOT} = P_{DYN} + P_{subVT} + P_{OX} + P_{B2B}$

$P_{DYN} = \frac{\alpha_{lD}}{\tau}\, N_{CKT}\, \tfrac{1}{2}\, C\,(V_H - V_L)\,V_{DD}$

$P_{subVT} = 1.7\, N_{CKT}\, V_{DD}\, I_{off}(V_T, V_{DD}, t_{ox}, \ldots, L_G, W_L)$

$P_{OX} = A_{core}\, D_{ox}(W_L)\, V_{DD}\, J_{ox}(V_T, V_{DD}, t_{ox}, \ldots)$

$P_{B2B} = \tfrac{1}{3}\, A_{core}\, D_{ox}(W_L)\, V_{DD}\, J_{B2B}(F_{Max}, V_{DD})$

$LTR = \frac{\alpha_{lD}}{\tau}\, N_{CKT}$

Note that cross-through power is not included.
The powers are computed separately for logic and for repeaters.
$\tau$ = mean delay for a single loaded logic gate.
$\alpha_{lD}$ is activity factor divided by logic depth. Usually ~0.012 in recent optimizations.
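A short numerical sketch of the dynamic term, reading it as P_DYN = (activity/depth) / delay x N_CKT x (1/2) C (V_H - V_L) V_DD. That reading of the extracted formula is an assumption, as are the function names and every example value except the 0.012 activity-to-depth ratio quoted on the slide. The static terms are passed in as plain numbers because their underlying current models are not reproduced here.

```python
def dynamic_power(alpha_ld, n_ckt, tau_s, c_load_f, v_high, v_low, vdd):
    """P_DYN = (alpha_lD / tau) * N_CKT * 0.5 * C * (V_H - V_L) * V_DD, in watts."""
    return (alpha_ld / tau_s) * n_ckt * 0.5 * c_load_f * (v_high - v_low) * vdd


def total_power(p_dyn, p_subvt, p_ox, p_b2b):
    """P_TOT = P_DYN + P_subVT + P_OX + P_B2B."""
    return p_dyn + p_subvt + p_ox + p_b2b


if __name__ == "__main__":
    # Hypothetical chip: 1e7 gates, 20 ps mean gate delay, 1 fF load, full 1 V swing.
    p_dyn = dynamic_power(0.012, 1e7, 20e-12, 1e-15, 1.0, 0.0, 1.0)
    ltr = (0.012 / 20e-12) * 1e7
    print(f"P_DYN = {p_dyn:.2f} W, energy per transition = {p_dyn / ltr * 1e15:.2f} fJ")
    print(f"P_TOT = {total_power(p_dyn, 1.0, 0.2, 0.1):.2f} W (static terms hypothetical)")
```

Dividing P_DYN by LTR gives the energy per logic transition, (1/2) C (V_H - V_L) V_DD, which is the quantity plotted in the later energy-versus-performance slides.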

Communication and Wiring Models

Assume wire lengths distributed according to Rent's rule.
Units are gate pitches.
r = Rent exponent, 0.6, here.

[Equations not legible in this copy: the wire length distribution i_net(l), involving N_CKT, FO, and the exponent 2r - 3; and the repeater count N_Rptr, built from integrals of l i_net(l) dl and i_net(l) dl with limits L_noRptr, l_R, and l_Max.]

[Figure: log(number of wires) vs. log(length).]
[Figure, from optimizations: # Wiring Levels Required (4 to 12) vs. Number of Logic Gates (1E+5 to 1E+8).]

Repeater Model

Long wires receive repeaters with a spacing that is optimized.
Long wire delay can be absorbed into pipeline depth, but the latency causes inefficiency, so we use a latency penalty factor.

[Figure: Repeater Spacing (um) and Repeater Width (um); Pecon = 10 W/cm2.]
[Figure: Latency Penalty Factor and Repeater Spacing (cm) vs. CPU Core Power (W), 0.001 to 100, for curves labeled 9S to 13S.]

Local Variation Modeling

Variation sources:
Signal coupling noise
Supply noise
Statistical doping variations
LER gate length variations

Consequences modeled:
Increased static power: combine 1 sigma of doping, length, and noise.
Critical path delay distribution: yield-based, using estimated critical path distribution, and 1 sigma of doping and length, and worst case noise.
Single stage functionality: use worst case (~6 sigma) of doping and length, no noise.

Accounting for variations

A complete most-probable worst-case-vector (MPWC) methodology is used to handle both local and global variability.

Two variable example:
[Figure: worst-case vectors in a two-variable space with contours labeled 1, 2, 3, showing the MPWC vector and the Murphy vector.]
Blue curves are contours of constant probability.
Red curves are contours of function to be minimized.
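One way to picture the difference between the two vectors is a linearized toy case. The sketch assumes independent Gaussian variables measured in sigma units and a response that is linear in them; under those assumptions the most probable point at a given degradation lies along the sensitivity gradient, whereas the "Murphy" point is taken here to mean every variable sitting at its own worst-case extreme simultaneously. Both interpretations and all names are illustrative assumptions, not the methodology's actual implementation.

```python
import math


def response(sens, x):
    """Linearized degradation: sum of sensitivity times deviation (in sigmas)."""
    return sum(g * xi for g, xi in zip(sens, x))


def mpwc_vector(sens, n_sigma):
    """Point on the n_sigma probability contour that maximizes the linear response."""
    norm = math.sqrt(sum(g * g for g in sens))
    if norm == 0.0:
        raise ValueError("at least one sensitivity must be nonzero")
    return [n_sigma * g / norm for g in sens]


def murphy_vector(sens, n_sigma):
    """Every variable at n_sigma in its own worst direction at the same time."""
    return [math.copysign(n_sigma, g) for g in sens]


if __name__ == "__main__":
    sens = [3.0, 4.0]  # hypothetical sensitivities to two variables
    for name, vec in (("MPWC", mpwc_vector(sens, 3.0)), ("Murphy", murphy_vector(sens, 3.0))):
        radius = math.sqrt(sum(v * v for v in vec))
        print(f"{name:6s} x = {vec}  response = {response(sens, vec):.1f}  radius = {radius:.2f} sigma")
```

In this toy case the Murphy point sits farther out than the 3 sigma contour (about 4.2 sigma for two variables), so designing to it is more pessimistic than the stated probability level requires.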

Power vs Frequency Variation Windows

Optimizations for high-perf processor, 45 nm node technology, PDSOI.
Power-constrained optimization parameters: VDD, VTn, VTp, LG, repeater size and spacing.
Area is also constrained, to 5.6 cm2, by adjusting the widths.

[Figure: Chip Power (W) vs. Clock Frequency (GHz), with variation windows labeled VDD+5%, VDD-10%, and nominal design point (VDD=1.058 V), and a 230 W level. One point is marked: "This is the point that is calculated, optimized to, and reported."]

These boxes are for 0.675 sigma, for 50% yield.
Power and delay are treated as independent: 25% of yield is lost for each.
The data shows 10% frequency variation per sigma.
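The 0.675 sigma figure can be checked against the normal distribution. The sketch assumes Gaussian variation and treats the 25% loss on each of power and delay as a one-sided tail beyond 0.675 sigma; the helper names are illustrative.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def one_sided_loss(n_sigma: float) -> float:
    """Fraction of parts falling beyond n_sigma on one side."""
    return 1.0 - normal_cdf(n_sigma)


if __name__ == "__main__":
    loss = one_sided_loss(0.675)
    print(f"one-sided loss at 0.675 sigma: {loss:.3f}")
    print(f"yield after subtracting both losses: {1.0 - 2.0 * loss:.3f}")
    # With the quoted 10% frequency variation per sigma:
    print(f"frequency window half-width: {0.675 * 10.0:.2f}%")
```

Subtracting a 25% tail for power and a 25% tail for delay reproduces the 50% yield quoted on the slide, and the quoted 10% per sigma puts the frequency edge of the box about 6.75% from nominal.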

Impact of variability on performance

Atomistic effects are leading to greater device variability.
Increasing variability requires larger design margins.
Designing for larger margins decreases performance.

Increased variability requires:
Higher supply voltages
Less scaled FETs

[Figure: Relative Performance (0.7 to 1.3) vs. Relative Margin (0% to 200%) for P=0.01, P=1, and P=50. 65nm node, dual processor core.]

Summary: Models, Assumptions and Approximations

Power modeling
Dynamic switching energy plus static power mechanisms including sub-threshold current, gate oxide tunneling, and body-to-drain band-to-band tunneling.
Device modeling
Bulk MOSFETs: VT and depletion depth determined by the halo doping; 2D effects are taken into account.
Gate length is fully optimized, not set by the technology node.
Circuit modeling
Delay is for FI=FO=2 or 3 NAND gates, based on model from J.C. Eble's thesis [Ga.Tech. '98].
Capacitance includes gate, parasitic, and wire parts (Rent's rule).
Wire resistance includes temperature dependence and surface scattering in small wires.

Chip-level modeling
Allocate fixed fraction of chip power and area to logic, and assume fixed number of logic gates. Logic part is optimized, and the rest is assumed to scale similarly.
Assume multiple processor cores are interconnected in a way that does not greatly add to the wiring burden.
Long wires are fatter, and receive repeaters with a spacing that is optimized.
Long wire delay is accounted for using a latency penalty factor.
On-chip tolerance/variability and noise are accounted for.

3. Optimization Results
General results
Evaluating specific possible device directions:
Increasing mobility
High-k gate dielectric and metal gates
3D stacking
Better heat sinks
Sub-ambient cooling
Multi-processor tradeoffs

Optimize by technology node

For each node, pre-specify the following parameters: wire half-pitch, gate overlap, halo scale length, contact resistance, LER sigma, ACLV, mobility, gate depletion, k_wire, k_gate.

Optimizations over 7 variables: tox, Lg, ND, <w>, Vdd, Srpt, <wrpt>.

Dual core processor with aggressive air cooling.

Note that LG, tox, VDD, VT, width, etc. are NOT preselected. They are solved for by the optimizations.

Optimization results

[Figure: Gate Length vs Power. Gate Length (nm), 0 to 120, vs. Total Chip Power (W), 0.01 to 100, for the 90 nm, 65 nm, 45 nm, and 32 nm nodes. Dual core processor with aggressive air cooling.]

[Figure: Oxide Thickness vs Power. Equiv. Oxynitride Thickness (nm), 0.5 to 1.6, vs. Total Chip Power (W), 0.01 to 100, for the 90 nm, 65 nm, 45 nm, and 32 nm nodes with oxynitride, plus high-k for 32nm. The high-k case assumes 0.3nm barrier layer, bandedge metal gate, HfO2-like insulator characteristics.]

Optimization results

[Figure: Voltages vs Power. Supply and Threshold Voltage (V), 0.1 to 0.9, vs. Total Chip Power (W), 0.01 to 100, with Vdd and VT curves for the 90, 65, 45, and 32 nm nodes. Dual core processor with aggressive air cooling.]

Supply voltages are lower for low power applications.
High-k lowers VDD ~ 15% at the 45nm generation.

Optimal Power Allocation Fractions

[Figure: Power Allocation (0% to 100%) vs. Chip Power (W), 1 to 300, split into oxide power, subVT power, and dynamic power, each for repeaters and for logic. 45nm technology with microchannel heat sink and water cooling, 4 core chip.]

Active power fraction: 70% at low power to 40% at high power.

Mobility dependence
Enhanced mobility has greatest benefit at high power.
Even for large mobility enhancements, performance boost is modest: 10-15%.

[Figure: Relative Performance vs. Mobility Enhancement Factor for 1 W, 10 W, and 100 W chips. Two panels: 45nm technology, dual core processor, water cooled; 32nm technology, 8 core processor, air cooled.]

Metal-gate workfunction for high-k and oxynitride

[Figure: Performance relative to poly-Si (0.5 to 1.4) vs. Workfunction offset from bandedge (eV), 0.1 to 0.5, for 5 W and 50 W chips with oxynitride and with high-k. 45nm node, dual core processor with aggressive air cooling.]

3D stacking
Multiple layers offer higher performance due to shorter wires.

[Figure: four panels vs. Chip Power (W), RED = 1 layer, GREEN = 2 layers: Relative Performance; Mean Wire Length (um); Chip Area (cm2), showing total Si area and footprint; Mean FET Width (nm). 4 core, 45nm node, water cooling.]

Cooling scenario optimizations

Forced liquid cooling through microchannel fins may permit very high power densities.
Optimized (maximum) performance increases as the ~log of the power.

[Figure: Performance (arb units) vs. Total Chip Power (W), 1 to 10000, for -40C liquid, 18C water, hi-perf. air, and low-cost air cooling. Two panels: 4 core processor design, 45nm technology; 8 core processor design, 32nm technology.]

Optimized over 7 variables: Lg, tox, Nd, <w>, Drptr, <wrptr>, Vdd.
Low temperature case does not include refrigerator power.

Multiprocessor motivation
The energy / performance tradeoff is very steep at the high end.
Lower power, more parallel processors potentially offer more computation for the same total power level.

[Figure: Loaded Switching Energy (fJ) vs. Total Logic Transitions / sec, 1E+12 to 1E+16, for 4 core processors, with "3x" annotations.]

4-processor chips with micro-channel water cooling, optimizing everything.
9 variables: tox, Lg, ND, <w>, Vdd, wHP, Srpt, <wrpt>, xhalo.

Dependence on number of cores

Constant total number of transistors, divided equally among n cores:

[Figure: Performance (arb units), 1 to 10, vs. Total Chip Power (W), 1 to 1000, for 2, 4, 8, and 16 cores.]

4. Future Projections (22 to 11nm)

Device options
General results
Technology projections
Beyond 11nm?

Device Options

PDSOI
IBM's best understood technology.

FinFET
Improved electrostatic control of channel offers shorter gates, lower voltages, higher speed. Workfunction VT control. Not entirely planar.

ETSOI
Somewhat improved electrostatic control compared to PDSOI. More compatible with conventional planar processing. Workfunction VT control.

Shallow Bulk
Shallow junctions, raised S/D.

The ETSOI and FinFET devices were simulated using an in-house scaling model.
Comparable source/drain resistance and parasitic capacitance models were implemented for ETSOI, FinFET, and for shallow bulk MOSFETs.

[Figure: device sketches for FinFET (gate, source, drain), ETSOI (gate, source, drain, buried oxide, substrate), and bulk (gate, source, drain).]

General optimization results: performance vs power

As always, the easiest way to increase performance is to increase the power.

[Figure: Frequency (Hz), 1.E+07 to 1.E+10, vs. Total Chip Power (W), 0.01 to 1000, for the 22nm, 16nm, and 11nm nodes.]

Conditions: PDSOI, 4 core processor chip, constraining total chip power.
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing.

Gate Length and Chip Area vs Power

Lower power requires longer gate lengths, to reduce variability.
Device density increases for each generation. Shrinking area increases power density, too.

[Figure: Gate Length (nm), 0 to 50, and Chip Area (cm2), 0 to 0.8, vs. Total Chip Power (W), 0.01 to 1000, for the 22nm, 16nm, and 11nm nodes. Toxeqv: 0.6 to 0.82 nm.]

Conditions: PDSOI, 4 core processor chip, constraining total chip power.
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing.

Voltage and Energy vs Power

Optimal supply voltage can become quite low for low power constraints, leading to very low energy use per logic transition.

[Figure: Supply Voltage (V), 0 to 1.4, and Energy per logic transition (fJ), 0 to 16, vs. Total Chip Power (W), 0.01 to 1000, for the 22nm, 16nm, and 11nm nodes.]

Conditions: PDSOI, 4 core processor chip, constraining total chip power.
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing.

Energy vs performance trade-off

[Figure: Energy per logic transition (fJ), 0.1 to 10, vs. Clock Frequency (Hz), 1.E+07 to 1.E+10, for the 22nm, 16nm, and 11nm nodes, with "10x" and "4x" annotations.]

Conditions: PDSOI, 4 core processor chip, constraining total chip power.
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing.

Optimal On/Off Ratio

PDSOI nFETs, currents measured at nominal process and bias conditions.

[Figure: Off-current (A/cm), 0.00001 to 0.1, vs. On-current (A/cm), 0.1 to 100, for the 22 nm, 16 nm, and 11 nm nodes, at high bias conditions and low bias conditions, with 1000:1 and 10000:1 on/off ratio lines.]

Conditions: PDSOI, 4 core processor chip, constraining total chip power.
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing.

Performance at constant power density (PDSOI)

Performance increases at 32nm due to hi-k introduction, but then falls as strain diminishes and gate dielectric does not scale further.

[Figure: three panels vs. Technology Node (45nm, 32nm, 22nm, 16nm, 11nm) at 10, 25, and 50 W/cm2: Frequency (GHz), 0 to 6; Supply Voltage (V), 0 to 1; Chip Area (cm2), 0 to 6.]

Conditions: PDSOI, 4 core processor chip, constraining total chip power density.
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing.

Performance at constant power density: comparing technologies

ETSOI and FinFET offer moderate performance advantage over PDSOI for 22nm node and beyond. The industry should transition to FinFET at 16nm to avoid performance loss.

[Figure: Frequency (GHz), 0 to 4.5, vs. Technology Node (45nm, 32nm, 22nm, 16nm, 11nm) at 25 W/cm2 for Bulk, PDSOI, FinFET, and ETSOI.]

Conditions: 4 core processor chip, constraining total chip power at 25 W/cm2.
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing, fin height, sidewall thickness (Fin), Si thickness (ET).

Power density at constant performance

Required power density drops at 32nm due to transition to hi-k, then rises due to decreasing strain and lack of scaling of gate insulator.
Transition to FinFET at 16nm to obtain continued improvement.

[Figure: Power Density (W/cm2), 0 to 60, and Area (cm2) vs. Technology node (45nm, 32nm, 22nm, 16nm, 11nm), for PDSOI (PD) and FinFET (Fin) at performance levels 1e15 (~2.6 GHz) and 1.5e15 (~3.9 GHz).]

Conditions: PDSOI and FinFET, 4 core processor chip, constraining total chip performance.
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing, fin height, sidewall thickness.

Performance at constant area and power

Area = 4 cm2, power = 100 W, fixed.
Number of cores is adjusted to maintain constant area. Chip performance is assumed linear with the number of cores.
Maximizing total chip performance.

[Figure: Relative Performance (0 to 16) and Number of Cores (0 to 80) vs. Technology node (45nm, 32nm, 22nm, 16nm, 11nm), for PDSOI (PD) and FinFET (Fin) at 25 W/cm2 and 50 W/cm2.]

Conditions: PDSOI and FinFET, variable # core processor chip, constraining both chip power and chip area (4 cm2).
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing, fin height, sidewall thickness, and number of cores.

Supply voltage considerations

VDD can be reduced by higher activity, higher mobility, and tighter tolerances.

[Figure: Relative Performance (0.8 to 1.8) vs. Supply Voltage (V), 0.4 to 0.9, at 25 W/cm2, optimizing everything except VDD. Curves: vary Vdd; vary activity factor (0.7X, 1.4X); vary tolerance (0.7X variability); vary mobility (1.4X). Annotations: 18%, 6%.]

Conditions: PDSOI, 4 core processor chip, constraining total chip power density.
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing.

Optimizations vs gate length for FinFETs and ETSOI

Everything except gate length is optimized. The gate length is scanned.
11 nm node, 25 W/cm2 power density constraints.

Shorter gate lengths necessitate:
Higher DIBL
Higher VDD
Thinner tSi
Ultimately, lower performance

[Figure: four panels vs. Gate Length (nm), 0 to 60, for ETSOI and FinFET: DIBL (V); Supply Voltage, Vdd (V); Silicon Thickness (nm); Rel. Performance.]

DIBL Scaling enables VDD Scaling

[Figure: Supply Voltage, VDD (V) vs. DIBL (V/V), and Gate Length (nm) vs. Technology Node (22nm, 16nm, 11nm), for Bulk, FinFET (Fin), and ETSOI (ET). One set of panels at 25 W/cm2 and one at 1 W.]

Novel devices for 11nm and beyond

III-V FinFETs
Higher mobility improves drive current.

Tunnel FETs
Improved subthreshold slope enables low VDD and low energy operation.
To properly model this device, one has to be able to calculate the tunneling barrier shapes and band-edge alignments.
We are in the process of developing a compact model for TFETs for the optimizer, but results are not yet available. As an interim measure, we can alter the Boltzmann constant in the conventional FET model, to see the impact of steeper subthreshold slope.
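The interim trick of altering the Boltzmann constant can be pictured with the ideal swing formula, ln(10) kT/e. The sketch below simply scales k by a factor `k_scale`; the factor values and the function name are hypothetical, and this is only a picture of the idea, not the optimizer's FET model.

```python
import math

K_B = 1.380649e-23     # Boltzmann constant, J/K
Q_E = 1.602176634e-19  # electron charge, C


def swing_mv_per_decade(temp_k: float, k_scale: float = 1.0) -> float:
    """Ideal swing ln(10) * (k_scale * k) * T / e, in mV per decade of current."""
    return 1000.0 * math.log(10.0) * (k_scale * K_B) * temp_k / Q_E


if __name__ == "__main__":
    for k_scale in (1.0, 0.75, 0.5):  # hypothetical scale factors
        print(f"k_scale = {k_scale:4.2f}  swing = {swing_mv_per_decade(300.0, k_scale):5.1f} mV/decade")
```

Halving the effective Boltzmann constant at 300 K gives a swing of about 30 mV/decade, which emulates a steep-slope device while leaving the rest of a conventional model untouched.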

Carbon Nanotube Transistors
Ballistic current flow in the channel should enable very high switching speeds for these devices.
A compact model for CNTs suitable for the optimizer is being presented at IEDM this year, but results are not available yet.

Comparing devices: Energy vs Performance

[Figure: Energy per logic transition (fJ), 0.01 to 10, vs. Frequency (Hz), 1.E+06 to 1.E+11, for PDSOI, FinFET, CNFET, and TFET. General trends, not exact results.]

Summary

CMOS scaling is limited by electrostatic, quantum mechanical, discreteness, thermodynamic and practical effects.
Optimization can and should be used to find the best design points in the midst of these various constraints.
Example: low power needs somewhat less scaled devices.

Technology performance projections based on optimization for PDSOI, FinFETs, ETSOI, and Bulk MOSFETs show:
Density improvements should continue, as long as wiring density continues to improve.
Performance improvements are likely to be rather modest, even with a switch to FinFETs for 16 and/or 11nm nodes.

Exploratory devices (CNTs and/or TFETs) may offer substantial performance advantages, someday.

Acknowledgements

Wilfried Haensch
Leland Chang
Paul Solomon
Steve Koester
Lan Wei
Philip Wong
Ghavam Shahidi

Mike Scheuermann
Phillip Restle
Omer Dokumaci
Mary Wisniewski
Steve Kosonocky
Yuan Taur
Bob Dennard


Extra slides


Generalized heat sink model

Two level heat flow model:
Flow in the silicon wafer
Flow in the heat sink material

In each layer, the flow can be:
3D (spherical) for spots smaller than thickness
2D (cylindrical) at distances larger than the thickness

In silicon layer, inhomogeneous power dissipation is accounted for, to estimate maximum junction temperature at hottest point.

Parameters: thermal sheet resistance of the Si wafer and of the heat sink (subscripts Si and HS); RSi, thermal contact resistance of Si wafer; RHS, thermal contact resistance of heat sink.

[Figure: layer stack: heat sources, Si wafer, interface, heat spreader (e.g., SiC or Cu), interface to final coolant (e.g., air or water).]
[Figure: comparison of simplified analytic model with detailed numerical model. Temperature Rise (K) vs. Hot spot size (cm). The analytic model is red; the numerical (FE) model data is blue.]

III-V FinFETs
III-V FinFETs are modeled by increasing the mobility in the conventional model.
Increased mobility enables an improved energy/performance tradeoff by reducing the voltage needed for high performance designs.

Drive current multiplier:
1x (= Si)
2x (~GaAs)
4x (~InGaAs)

[Figure: Energy/transition (fJ), 0.1 to 10, vs. Performance (transitions/sec), 1.E+13 to 1.E+16, and Supply Voltage (V), 0 to 1.4, vs. Frequency (GHz), for FinFET, III-V 2x, and III-V 4x.]

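Beating the sub-threshold slope limit

[Equations not legible in this copy: tunneling probability T and sub-threshold swing S in terms of the bandgap E_G, the field E, and a constant B, including $S \sim V_{gs}^2 / (2V_{gs} + D(V_{gs}, V_{ds}))$. Krishna K. Bhuwalka et al. 2005]

[Figure: band diagram (Ec, Ev) with source, gate, and drain regions (P++, n++); Log Ids vs. Vgs, showing S against the 60mV/dec limit and the high Vth, low power space.]

Tunnel FETs show strong voltage dependence of sub-threshold slope.
On-current not yet on par with conventional high performance FETs at comparable voltages.

[Haensch]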

TFET Heterostructures

Planar SiGe H-TFET
SiGe heterostructures offer low effective bandgap, which improves tunneling.
[Figure: device cross-section (gate, poly, source, drain, p++ SiGe, n+ Si, p-Si, buried oxide) and Si/SiGe/Si band diagrams for the off-state (Vg = 0) and the on-state (Vg > 0): high on current, ambipolar device suppressed.]

III-V H-TFET
III-V materials offer lower effective masses and more heterojunctions, to further improve tunneling.
Nanowire geometry offers:
Optimum electrostatics with gate all around
New material combinations
Integration onto silicon possible
Scaling to quantum capacitance limit
Improved Ion

[Koester, Riel, Koswatta]

CNFET: Good or Bad?

Carbon Nanotube Field Effect Transistor (CNFET)
1D devices
Better transport characteristics
Worse parasitics
Leakage, variations, etc.

The judgment highly depends on the application.
Our approach is to build an optimizer for a full technology, with proper consideration of device properties and system needs:
Device modeling
Circuit performance benchmarking
System application consideration

[Lan Wei, Stanford]

Wire Model
Assumed constant 2:1 height to width wiring with equal lines and spaces (0.062 kBEOL fF/um).

Wire resistivity, with W the wire width and T the temperature in K:

$\rho(W,T) = \rho_{300K} \left[ \frac{70\,\mathrm{nm}}{W} + \frac{\ln\left(1 + e^{(T-40)/10}\right)}{26} \right]$

Consequences of wire resistance model:

[Figure: Relative Performance change due to R(T). Relative Performance (0.5 to 2.5) vs. Junction Temperature (K), 50 to 400, for 10 W, 1 W, and 0.1 W, each with the model and with constant R.]
[Figure: Performance loss due to scattering. LTR with scatt. / LTR w/o scatt. (0.85 to 1.00) vs. Total Power (W), 0.001 to 10.]
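The wire resistivity expression on this slide can be explored numerically. The sketch assumes the garbled formula reads rho(W,T) = rho_300K [70 nm / W + ln(1 + e^((T - 40)/10)) / 26], with T in kelvin and W in nanometers; that reading, and the function name, are assumptions. The helper returns the bracketed factor only, so it does not depend on the value of rho_300K.

```python
import math


def resistivity_factor(width_nm: float, temp_k: float) -> float:
    """rho(W, T) / rho_300K under the assumed form of the wire model.

    size_term models surface scattering in narrow wires (70 nm / W);
    temp_term is a smooth temperature dependence normalized to ~1 at 300 K.
    """
    if width_nm <= 0:
        raise ValueError("width_nm must be positive")
    size_term = 70.0 / width_nm
    temp_term = math.log1p(math.exp((temp_k - 40.0) / 10.0)) / 26.0
    return size_term + temp_term


if __name__ == "__main__":
    for width in (35.0, 70.0, 700.0):
        for temp in (77.0, 300.0, 400.0):
            print(f"W = {width:6.1f} nm  T = {temp:5.1f} K  "
                  f"rho/rho_300K = {resistivity_factor(width, temp):.3f}")
```

With this form the temperature term equals 1 at 300 K to within rounding, so a very wide wire at room temperature recovers rho_300K, while a 70 nm wire has roughly twice that resistivity.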
