Performance Projections
David J. Frank
12/4/09
International Winter School for Graduate Students
IIT, Bombay, India
1
1. Introduction
The End of Scaling?
[Figure: log(Performance) versus Time (yr)]
3
The End of Scaling is Optimization
Stop when you get to the top.
[Figure: log(System Performance) versus Miniaturization]
4
Outline
1. Review limitations to Scaling
2. Optimizing technology within constraints
3. Optimization results
4. Technology projections
5
1. Limitations to Scaling
1. Electrostatic constraints
2. Quantum mechanical leakage currents
3. Discreteness of matter and energy
4. Thermodynamic limitations
5. Practical and environmental constraints on power
6
Electrostatic constraints on FET design
A. High output resistance
B. High gain
C. Low sensitivity to variations
(A to C: long channel behavior)
D. High transconductance
E. High drain current
F. High speed
(D to F: short channel behavior)
2. Must choose a compromise:
Short, but not so short that 2D effects kill A, B, C.
[Figure: FET cross-section with Gate, S, and D labeled]
7
Quantum Mechanical Tunneling Leakage Currents
... becomes leaky.
[Figure labels: e-, Channel, Drain]
8
Atomistic effects
249,403,263 Si atoms
68,743 donors
13,042 acceptors
10
Practical and Environmental Issues
• Power consumption and heat removal are
limited by practical considerations.
• Low power applications are often battery
powered
– Many must be lightweight => power < ~few watts.
– Disposable batteries can cost >> $500/watt over life of
device.
– Rechargeables can cost > $50/watt over life of device.
• Home electronics is limited to <~1000W by
heating of the room and cost of electricity.
• High performance is limited by difficulty of
heat removal from chip (~100 W/chip). (Cost
of electricity is ~$5/watt over life.)
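The ~$5/watt lifetime electricity figure can be sanity-checked with simple arithmetic. The sketch below assumes a hypothetical electricity price of $0.10/kWh and 5 years of continuous operation; neither number comes from the slides, and the function name is illustrative.

```python
def lifetime_energy_cost_per_watt(price_per_kwh, years, duty_cycle=1.0):
    """Cost of supplying 1 W for the given lifetime (assumed inputs, not from the slides)."""
    hours = years * 365.0 * 24.0 * duty_cycle   # operating hours over the device life
    kwh_per_watt = hours / 1000.0               # 1 W for `hours` hours, expressed in kWh
    return kwh_per_watt * price_per_kwh

# Assumed: $0.10 per kWh, 5 years, always on.
cost = lifetime_energy_cost_per_watt(price_per_kwh=0.10, years=5.0)
print(f"~${cost:.2f} per watt over life")
```

With these assumed inputs the result lands in the same range as the ~$5/watt quoted above; a different price or lifetime shifts it proportionally.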
11
2. Optimizing Technology within Constraints
Fixed architectural complexity
+ Fixed power constraints
+ Device physics
= Existence of an optimal technology with maximal performance.
Practicality imposes power constraints.
Electrostatics imposes geometric constraints.
Thermodynamics imposes voltage constraints.
Quantum mechanics imposes miniaturization constraints due to tunneling.
[Figure: Power versus Miniaturization (Large to Small), showing dynamic power and leakage power; leakage increases due to tunneling effects.]
[Figure: log(Performance) versus Miniaturization (Large to Small); declining available dynamic power overwhelms speed improvements of scaling.]
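A toy calculation can illustrate why a fixed power budget plus rising leakage implies an optimum. This is a minimal sketch with made-up functional forms and constants (`leak_scale`, `size0`, and the linear switching energy are assumptions, not the models used in this talk): leakage grows exponentially as devices shrink, and whatever power remains is spent on switching.

```python
import math

def toy_performance(size, total_power=1.0, leak_scale=2.0, size0=0.2):
    """Toy transitions/sec at a relative feature size (smaller = more miniaturized)."""
    leakage = leak_scale * math.exp(-size / size0)  # assumed: leakage rises as size shrinks
    dynamic = max(0.0, total_power - leakage)       # power left over for switching
    switching_energy = size                         # assumed: energy per transition ~ size
    return dynamic / switching_energy

# Scan relative sizes from 0.10 to 2.00 and pick the best one.
sizes = [0.10 + 0.005 * i for i in range(381)]
best_size = max(sizes, key=toy_performance)
print(f"best relative size ~ {best_size:.3f}")
```

Performance is zero at the smallest sizes (leakage consumes the whole budget) and falls off again at large sizes (each transition costs more energy), so the maximum sits in between.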
12
Background: Schematic organization of an
optimization program
Goal: optimize device technology to maximize chip-level performance, subject to power constraints.
[Diagram: fixed parameters and an initial guess for the variables feed the models (Area Model, Wiring statistics, Thermal Model, Device Structure, IV Model, Leakage Model, Wire Capacitance); the constrained optimizer returns new values as an improved guess.]
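The loop in this schematic (guess, evaluate models, let a constrained optimizer propose an improved guess) can be sketched as follows. Everything here is a hypothetical stand-in: the two variables, the toy delay and leakage formulas, the power budget, and the simple pattern search are assumptions chosen for brevity, not the optimizer or models used by the author.

```python
import math

POWER_BUDGET = 1.0  # hypothetical fixed power constraint (arbitrary units)

def evaluate(vdd, vt):
    """Toy stand-ins for the IV, leakage, and power models.
    Returns (performance, power), or None for an unphysical point."""
    if vt < 0.0 or vdd <= vt:
        return None
    delay = vdd / (vdd - vt) ** 1.3              # assumed alpha-power-like delay
    performance = 1.0 / delay
    dynamic = vdd ** 2 * performance             # ~ C * V^2 * f with C = 1
    leakage = 10.0 * vdd * math.exp(-vt / 0.04)  # assumed sub-threshold-like leakage
    return performance, dynamic + leakage

def optimize(guess, step=0.1, min_step=1e-4, max_iter=100000):
    """Pattern search: accept a move only if it is feasible and improves performance."""
    best = tuple(guess)
    best_perf, _ = evaluate(*best)
    for _ in range(max_iter):
        if step < min_step:
            break
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] += delta
                result = evaluate(*trial)
                if result is None:
                    continue
                perf, power = result
                if power <= POWER_BUDGET and perf > best_perf:
                    best, best_perf, improved = tuple(trial), perf, True
        if not improved:
            step /= 2.0                          # refine the search when stuck
    return best, best_perf

(vdd_opt, vt_opt), perf_opt = optimize((0.5, 0.3))
print(f"vdd={vdd_opt:.3f}, vt={vt_opt:.3f}, performance={perf_opt:.3f}")
```

The initial guess must itself satisfy the power constraint; a production optimizer would handle infeasible starts and many more variables.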
13
Models and Approximations
System Assumptions
Processor chip is assumed to have a fixed number of cores, each with
a specified number of logic gates.
Only the logic within the cores is considered within the optimizations.
The clock and memory aspects of the chip are assumed to scale in
the same way as the logic (delay, power, and area).
Core-to-core and core-to-memory communication is not dealt with.
[Figure: chip contents labeled Logic, Repeaters, Clock, and Memory]
14
How much area do the processor cores take?
100% to 25%, generally decreasing with generation.
[Figure: chip examples with core area fractions of 70%, 40%, and ~25%; labels include "Power4, 174M FETs" and "2 cores, 1.72B FETs"]
15
Area usage within a processor core
Approximate area fractions for a high-performance
microprocessor core in leading-edge technology
[Figure: pie charts of area fractions. Categories: Caches (L1), Register files, Latches and LCBs, Buffers & extra latches, Macros, Custom & RLMs, and Caps, Clock dist., Unused.]
Processors built with nanotechnology are likely to ...
16
Optimization Approaches
1. Engineering approach:
Maximize system performance, at fixed power.
Use total logic transition rate (LTR),
LTR = Ngates x activity factor/logic depth x 1/Delay
Relatively little dependence on architectural details.
2. Business approach:
Maximize Return on Investment (ROI).
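The LTR metric defined above is easy to evaluate directly. The sketch below is illustrative only: the gate count, activity factor, logic depth, and delay are assumed example values, and the function name is not from the source.

```python
def logic_transition_rate(n_gates, activity_factor, logic_depth, delay_s):
    """LTR = Ngates x (activity factor / logic depth) x (1 / Delay), in transitions/sec."""
    return n_gates * (activity_factor / logic_depth) * (1.0 / delay_s)

# Assumed example: 1e7 gates, activity 0.3, logic depth 25, 20 ps loaded gate delay.
ltr = logic_transition_rate(n_gates=1e7, activity_factor=0.3,
                            logic_depth=25, delay_s=20e-12)
print(f"LTR ~ {ltr:.2e} transitions/sec")
```

Because only the gate count, the ratio of activity to logic depth, and the gate delay enter, the metric carries relatively little dependence on architectural details.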
17
FET Model
Using a general temperature-dependent short-channel FET model in
which VT, tD, and tox are coupled, halo doping effects are included, and
VT is set by the doping.
Modified alpha power model:
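[Equation: $I_D(V_{GS})$ in terms of $W$, $\varepsilon$, $t_{ox}$, $\eta kT/e$, $\mu(E_\perp)/\mu_0$, $E_C$, $L_{CH}$, an exponent $\gamma_s$, and $F_\alpha\left(\frac{V_{GS}-V_T}{\eta kT/e}\right)$, where $F_\alpha$ is the Fermi-Dirac integral of order $\alpha$]
[Figure: examples labeled "10W FET, Lg=28nm" and "1mW FET, Lg=45nm"]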
18
Circuit Delay Estimation
Basic circuit elements are:
FI=2, FO=1.65 wire-loaded NAND gates for logic
inverters for repeaters, FO ~ 1.2
Delay calculations:
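$\tau_3 = L_{wire} / (c/2)$ (propagation delay)
[Equation: the final delay $\tau$ empirically merges the separate components $\tau_1$, $\tau_2$, and $\tau_3$, using 4/3 powers of the components and an overall 3/4 power]
[Equation: correction for $V_T/V_{DD}$ in terms of $0.5$, $(1 - V_T/V_{DD})$, and $(1+\alpha)$] [Eble's thesis]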
19
Power Calculation
$P_{TOT} = P_{DYN} + P_{subVT} + P_{OX} + P_{B2B}$
$P_{DYN} = \frac{\alpha}{l_D} N_{CKT} \cdot \frac{1}{2} C (V_H - V_L) V_{DD} \cdot \frac{1}{\tau}$
$P_{B2B} = \frac{1}{3} A_{core} D_{ox} (WL) V_{DD} J_{B2B}(F_{Max}, V_{DD})$
$LTR = \frac{\alpha}{l_D} N_{CKT} \frac{1}{\tau}$
Note that cross-through power is not included.
The powers are computed separately for logic and for repeaters.
$\tau$ = mean delay for a single loaded logic gate
$\alpha / l_D$ is activity factor divided by logic depth. Usually ~0.012 in recent optimizations.
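The dynamic power term on this slide is the logic transition rate times the energy per transition, P_DYN = (α/l_D) · N_CKT · ½ C (V_H − V_L) V_DD · (1/τ). A minimal sketch, using assumed example values for the gate count, capacitance, voltages, and delay (only the ~0.012 value of α/l_D comes from the slide):

```python
def dynamic_power(alpha_over_ld, n_ckt, cap_f, v_high, v_low, v_dd, tau_s):
    """P_DYN = (alpha / l_D) * N_CKT * 0.5 * C * (V_H - V_L) * V_DD / tau, in watts."""
    ltr = alpha_over_ld * n_ckt / tau_s                     # transitions per second
    energy_per_transition = 0.5 * cap_f * (v_high - v_low) * v_dd
    return ltr * energy_per_transition

# Assumed example: 1e7 circuits, 2 fF loaded capacitance, full 1.0 V swing, 20 ps delay.
p_dyn = dynamic_power(alpha_over_ld=0.012, n_ckt=1e7, cap_f=2e-15,
                      v_high=1.0, v_low=0.0, v_dd=1.0, tau_s=20e-12)
print(f"P_DYN ~ {p_dyn:.2f} W")
```

The static terms (sub-threshold, oxide, and band-to-band tunneling) would be added separately to obtain the total.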
20
Communication and Wiring Models
Assume wire lengths distributed according to Rent's rule.
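[Equation: wire-length distribution $i_{net}(l)$, with prefactor $4FO/(3+FO)$ and power-law terms of exponent $2r-3$ in $l$ and in an $N_{CKT}$-dependent term]
$L_{noRptr} = \dfrac{\int_1^{l_R} l \, i_{net}(l) \, dl}{\int_1^{l_R} i_{net}(l) \, dl}$
$N_{Rptr} = \dfrac{1}{l_R} \int_{l_R}^{l_{Max}} l \, i_{net}(l) \, dl$
Units are gate pitches.
r = Rent exponent, 0.6, here.
[Figure: log(number of wires) versus log(length), with $l_R$ and the $N_{CKT}$-dependent maximum length marked]
[Figure, from optimizations: # Wiring Levels Required (4 to 12) versus Number of Logic Gates (1E+5 to 1E+8)]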
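The two wiring integrals, the mean length of unrepeated wires (the ratio of ∫ l·i_net dl to ∫ i_net dl from 1 to l_R) and the repeater count (∫ l·i_net dl from l_R to l_Max, divided by l_R), can be evaluated numerically for any distribution. In the sketch below the distribution is a hypothetical pure power law l^(2r−3) with r = 0.6, used only as a stand-in; it is not normalized and is not the full Rent's-rule expression from the slide, and l_R and l_Max are assumed values in gate pitches.

```python
import math

def integrate_log(f, a, b, n=2000):
    """Composite Simpson integration of f(l) dl over [a, b], carried out in u = ln(l)."""
    ua, ub = math.log(a), math.log(b)
    h = (ub - ua) / n                      # n must be even for Simpson's rule
    total = 0.0
    for k in range(n + 1):
        l = math.exp(ua + k * h)
        weight = 1 if k in (0, n) else (4 if k % 2 else 2)
        total += weight * f(l) * l         # dl = l du
    return total * h / 3.0

def wire_stats(i_net, l_r, l_max):
    """Mean unrepeated wire length and repeater count for a given distribution."""
    mean_unrepeated = (integrate_log(lambda l: l * i_net(l), 1.0, l_r)
                       / integrate_log(i_net, 1.0, l_r))
    n_repeaters = integrate_log(lambda l: l * i_net(l), l_r, l_max) / l_r
    return mean_unrepeated, n_repeaters

r = 0.6                                    # Rent exponent quoted on the slide
i_net = lambda l: l ** (2 * r - 3)         # hypothetical, unnormalized stand-in
mean_len, n_rptr = wire_stats(i_net, l_r=100.0, l_max=2000.0)
print(f"mean unrepeated length ~ {mean_len:.2f} gate pitches")
```

Because the stand-in distribution is unnormalized, the mean length is meaningful but the repeater count is only in arbitrary units.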
21
Repeater Model
Long wires receive repeaters with a spacing that is
optimized.
Long wire delay can be absorbed into pipeline depth, but
the latency causes inefficiency, so we use a latency
penalty factor: γ.
[Figure: optimized Repeater Spacing (um) and Repeater Width (um)]
22
Local Variation Modeling
• Variation sources:
– Signal Coupling noise
– Supply noise
– Statistical doping variations
– LER gate length variations
• Consequences modeled:
– Increased static power
• combine 1 sigma of doping, length, and noise
– Critical path delay distribution
• yield-based, using estimated critical path
distribution,
• and 1 sigma of doping and length, and worst case
noise.
– Single stage functionality
• use worst case (~6 sigma) of doping and length,
no noise.
23
Accounting for variations
• A complete most-probable worst-case-vector methodology is
used to handle both local and global variability.
Worst-case vectors:
[Figure: MPWC vector and Murphy vector in parameter space. Blue curves are contours of constant probability (1σ, 2σ, 3σ). Red curves are contours of the function to be minimized.]
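For intuition, the difference between the two worst-case vectors can be worked out for a linear response. This is a sketch under stated assumptions: the response is taken to be linear in independent, sigma-normalized parameters, the MPWC vector is taken as the point on the n-sigma probability contour that minimizes the response, and the "Murphy" vector is interpreted here as every parameter sitting at its own n-sigma worst case. The slides do not define either vector in detail, so both interpretations and all names are assumptions.

```python
import math

def mpwc_vector(gradient, n_sigma):
    """Point on the n-sigma sphere that minimizes a linear response g . x."""
    norm = math.sqrt(sum(g * g for g in gradient))
    return [-n_sigma * g / norm for g in gradient]

def murphy_vector(gradient, n_sigma):
    """Every parameter simultaneously at its individual n-sigma worst case (assumed meaning)."""
    return [-n_sigma * math.copysign(1.0, g) for g in gradient]

def linear_response(gradient, x):
    """Change in the response for a parameter shift x (in sigma units)."""
    return sum(g * xi for g, xi in zip(gradient, x))

gradient = [3.0, 4.0]                      # hypothetical sensitivities per sigma
mpwc = mpwc_vector(gradient, 3.0)
murphy = murphy_vector(gradient, 3.0)
print(linear_response(gradient, mpwc), linear_response(gradient, murphy))
```

Under these assumptions the Murphy corner lies outside the 3σ probability contour, so it is both less probable and more pessimistic than the MPWC point.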
24
Power vs Frequency Variation Windows
Optimizations for high-perf processor, 45 nm node technology, PDSOI
Power-constrained optimization parameters: VDD, VTn, VTp, LG, repeater size and spacing.
Area is also constrained, to 5.6 cm2, by adjusting the widths.
[Figure: Chip Power (W) variation windows. One point is the one that is calculated; the boxes are for ±0.675 sigma, for 50% yield; a VDD+5% case is marked.]
Increased variability requires:
Higher supply voltages
Less scaled FETs
[Figure: curve versus Relative Margin (0% to 200%); 65nm node, dual processor core]
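The pairing of ±0.675 sigma with 50% yield follows from the normal distribution: half of the population lies within about 0.675 standard deviations of the mean. A quick check, assuming a single normally distributed parameter (the function name is illustrative):

```python
import math

def central_yield(n_sigma):
    """Fraction of a normal distribution lying within +/- n_sigma of the mean."""
    return math.erf(n_sigma / math.sqrt(2.0))

window_yield = central_yield(0.675)
print(f"yield within +/-0.675 sigma ~ {window_yield:.3f}")
```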
26
Summary: Models, Assumptions and
Approximations
Power modeling
Dynamic switching energy plus static power mechanisms
including sub-threshold current, gate oxide tunneling, and
body-to-drain band-to-band tunneling.
Device modeling
Bulk MOSFETs: VT, and depletion depth determined by the
halo doping, 2D effects are taken into account.
Gate length is fully optimized, not set by the technology
node.
Circuit modeling
Delay is for FI=FO=2 or 3 NAND gates, based on model
from J.C. Eble's thesis [Ga.Tech. '98].
Capacitance includes gate, parasitic, and wire parts (Rent's
rule).
Wire resistance includes temperature dependence and
surface scattering in small wires.
27
Summary: Models, Assumptions and
Approximations
Chip-level modeling
Allocate fixed fraction of chip power and area to logic, and
assume fixed number of logic gates. Logic part is
optimized, and the rest is assumed to scale similarly.
Assume multiple processor cores are interconnected in a
way that does not greatly add to the wiring burden.
Long wires are fatter, and receive repeaters with a spacing
that is optimized.
Long wire delay is accounted for using a latency penalty
factor.
On-chip tolerance/variability and noise are accounted for.
28
3. Optimization Results
• General results
• Evaluating specific possible device
directions
– Increasing mobility
– High-k gate dielectric and metal gates
– 3D stacking
– Better heat sinks
– Sub-ambient cooling
– Multi-processor tradeoffs
29
Optimize by technology node
For each node, pre-specify the following parameters:
• Wire half-pitch,
• gate overlap,
• halo scalelength,
• contact resistance,
• LER sigma,
• ACLV,
• mobility,
• gate depletion,
• k_wire,
• k_gate
Optimizations over 7 variables: tox, Lg, ND, <w>, Vdd, Srpt, <wrpt>
Dual core processor with aggressive air cooling
Note that the LG, tox, VDD, VT, width, etc. are NOT preselected.
They are solved for by the optimizations.
30
Optimization results
[Figure: two panels versus Total Chip Power (W), 0.01 to 100; curves labeled "Oxynitride" and "High-k, for 32nm"]
31
Optimization results
[Figure: Voltages vs Power; y-axis: Supply and Threshold Voltage (V)]
32
Optimal Power Allocation Fractions
Active power fraction: 70% at low power to 40% at high power.
[Figure: Power Allocation (0% to 100%) versus Chip Power (W), 1 to 300. Components: Oxide pwr, SubVT pwr, and Dyn. pwr, each for rptrs and for logic.]
45nm technology with microchannel heat sink and water cooling. 4 core chip.
33
Mobility dependence
Enhanced mobility has greatest benefit at high power.
Even for large mobility enhancements, performance boost is
modest: 10-15%.
[Figure: Relative Performance versus Mobility Enhancement Factor (1 to 3) for 1 W, 10 W, and 100 W chips. One panel: 45nm technology, dual core processor, water cooled. Other panel: 32nm technology, 8 core processor, air cooled.]
34
Metal-gate workfunction for high-k
and oxynitride
[Figure: Performance relative to poly-Si]
[Figure: Tot Si area and Footprint versus Chip Power (W), 1 to 100. (4 core, 45nm node, water cooling.)]
36
Cooling scenario optimizations
Forced liquid cooling through microchannel fins may permit very high
power densities.
Optimized (maximum) performance increases as the ~log of the power.
[Figure: Performance (arb units)]
4-processor chips with micro-channel water cooling, optimizing 'everything'.
9 variables: tox, Lg, ND, ..., <wrpt>, xhalo
[Figure: Loaded Switching Energy (fJ) versus Total Logic Transitions / sec, 1E+12 to 1E+16]
38
Dependence on number of cores
Constant total number of transistors, divided equally among n cores:
[Figure: Performance (arb units)]
39
4. Future Projections (22 → 11nm)
• Device options
• General results
• Technology projections
• Beyond 11nm?
40
Device Options
• PDSOI
– IBM’s best understood technology.
• FinFET
– Improved electrostatic control of channel offers shorter gates, lower voltages, higher speed. Workfunction VT control. Not entirely planar.
• ETSOI
– Somewhat improved electrostatic control compared to PDSOI. More compatible with conventional planar processing. Workfunction VT control.
• Shallow Bulk
– Shallow junctions, raised S/D.
The ETSOI and FinFET devices simulated using an in-house scaling model.
Comparable source/drain resistance and parasitic capacitance models were implemented for ETSOI, FinFET, and for shallow bulk MOSFETs.
[Figures: FinFET (Source, Gate, Drain); ETSOI (Gate, Source, Drain, Buried Oxide, Substrate); Bulk (Gate, Source, Drain)]
41
General optimization results –
performance vs power
As always, the easiest way to increase performance is ...
[Figure: Frequency (Hz), 1.E+07 to 1.E+10, versus Total Chip Power (W), 0.01 to 1000]
[Figures: Gate Length (nm) and further panels versus Total Chip Power (W), 0.01 to 1000, for the 22nm, 16nm, and 11nm nodes]
[Figure: curves for the 22nm, 16nm, and 11nm nodes versus Clock Frequency (Hz), 1.E+07 to 1.E+10, annotated 10x and 4x]
[Figure: Off-current (A/cm) versus On-current (A/cm) for 22 nm, 16 nm, and 11 nm, under high bias and low bias conditions, with 1000:1 and 10000:1 lines]
[Figures: Frequency (GHz) and a second panel versus Technology Node (45nm to 11nm) at 10 W/cm2, 25 W/cm2, and 50 W/cm2]
Conditions: PDSOI, 4 core processor chip, constraining total chip power density
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater
size and spacing.
47
Performance at constant power density –
comparing technologies
ETSOI and FinFET offer moderate performance advantage over PDSOI for 22nm node
and beyond. The industry should transition to FinFET at 16nm to avoid performance loss.
[Figure: Frequency (GHz) versus Technology Node (45nm to 11nm) at 25 W/cm2 for Bulk, PDSOI, FinFET, and ETSOI]
[Figure: Area (cm2) versus Technology node (45nm to 11nm) for PD 1e15, PD 1.5e15, Fin 1e15, and Fin 1.5e15]
Conditions: PDSOI and FinFET, 4 core processor chip, constraining total chip performance
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing.
49
Performance at constant area and power
Area = 4 cm2, power = 100 W, fixed.
Number of cores is adjusted to maintain constant area. Chip performance is assumed linear with the number of cores.
Maximizing total chip performance.
[Figure: Relative Performance versus Technology node (45nm to 11nm) for PDSOI and FinFET at 25 W/cm2 and 50 W/cm2]
[Figure: Number of Cores versus Technology node for PDSOI and FinFET at 25 W/cm2 and 50 W/cm2]
Conditions: PDSOI and FinFET, variable # core processor chip, constraining both ...
Shorter gate lengths necessitate:
Higher DIBL
Thinner tSi
[Figures: Rel. Performance, DIBL, and Supply Voltage VDD (V) for FinFET, ETSOI, and Bulk at 25 W/cm2, plotted versus DIBL (V/V) and versus Technology Node (22nm to 11nm)]
54
Comparing devices – Energy vs Performance
[Figure: Energy per logic transition (fJ) versus Frequency (Hz), 1.E+06 to 1.E+11, for PDSOI, FinFET, CNFET, and TFET]
55
Summary
• CMOS scaling is limited by electrostatic, quantum mechanical,
discreteness, thermodynamic and practical effects.
• Optimization can and should be used to find the best design points
in the midst of these various constraints.
– Example: low power needs somewhat less scaled devices.
• Technology performance projections based on optimization for
PDSOI, FinFETs, ETSOI, and Bulk MOSFETs show:
– Density improvements should continue, as long as wiring density
continues to improve
– Performance improvements are likely to be rather modest, even with a
switch to FinFETs for 16 and/or 11nm nodes.
• Exploratory devices (CNTs and/or TFETs) may offer substantial
performance advantages, someday….
56
Acknowledgements
• Wilfried Haensch • Mike Scheuermann
• Leland Chang • Phillip Restle
• Paul Solomon • Omer Dokumaci
• Steve Koester • Mary Wisniewski
• Lan Wei • Steve Kosonocky
• Philip Wong • Yuan Taur
• Ghavam Shahidi • Bob Dennard
57
Extra slides
58
Generalized heat sink model
• Two level heat flow model:
– Flow in the silicon wafer
– Flow in the heat sink material
• In each layer, the flow can be:
– 3D (spherical) for spots smaller than thickness
– 2D (cylindrical) at distances larger than the thickness
• In silicon layer, inhomogeneous power dissipation is accounted for, to estimate maximum junction temperature at hottest point.
ρSi – thermal sheet resistance of Si wafer
ρHS – thermal sheet resistance of heat sink
RSi – thermal contact resistance of Si wafer
RHS – thermal contact resistance of heat sink
[Diagram: Heat sources; Si wafer; Interface; Heat spreader (e.g., SiC or Cu); Interface to final coolant (e.g., air or water)]
[Figure: Temperature Rise (K), comparing the simplified analytic model with a detailed numerical (FE) model; Kai’s data is blue.]
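The two flow regimes named above have simple textbook forms that give a feel for the magnitudes. The sketch below is not the author's heat sink model: it uses standard steady-state conduction formulas for hemispherical spreading from a surface hot spot and for radial (cylindrical) spreading in a layer, where the thermal sheet resistance is 1/(k · thickness). The conductivity of about 150 W/(m·K) for silicon and all dimensions are assumed example values.

```python
import math

def spherical_resistance(k, r_inner, r_outer):
    """Hemispherical (3D) spreading resistance between two radii, in K/W."""
    return (1.0 / r_inner - 1.0 / r_outer) / (2.0 * math.pi * k)

def cylindrical_resistance(k, thickness, r_inner, r_outer):
    """Radial (2D) spreading resistance in a layer, via its thermal sheet resistance."""
    sheet_resistance = 1.0 / (k * thickness)       # K/W per square
    return sheet_resistance * math.log(r_outer / r_inner) / (2.0 * math.pi)

K_SI = 150.0          # W/(m K), assumed room-temperature value for silicon
THICKNESS = 0.5e-3    # m, assumed wafer thickness
SPOT_RADIUS = 0.1e-3  # m, assumed hot-spot radius (smaller than the thickness)
OUTER_RADIUS = 5e-3   # m, assumed extent of lateral spreading

# 3D flow near the spot out to one thickness, then 2D flow beyond it.
r_3d = spherical_resistance(K_SI, SPOT_RADIUS, THICKNESS)
r_2d = cylindrical_resistance(K_SI, THICKNESS, THICKNESS, OUTER_RADIUS)
print(f"3D: {r_3d:.2f} K/W, 2D: {r_2d:.2f} K/W")
```

Multiplying the summed resistance by the hot-spot power gives a rough temperature rise in the silicon layer; the contact resistances and the heat sink layer would add further terms.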
59
III-V FinFETs
III-V FinFETs are modeled by increasing the mobility in the conventional model.
Increased mobility enables an improved energy/performance tradeoff by
reducing the voltage needed for high performance designs.
[Figure: FinFET versus III-V FinFETs with 2x (~GaAs) and 4x (~InGaAs) mobility; axes include Frequency (GHz), 0 to 16, and Performance (transitions/sec), 1.E+13 to 1.E+16]
60
Beating the sub-threshold slope limit
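[Figure: band diagram of a gated P++ / n++ device (Source, Gate, Drain; Ec, Ev) at Vg = 0 and Vg > 0, for Si and SiGe. Annotations: ambipolar device suppressed; improved Ion; high on current.]
[Equation: tunneling probability T, exponential in a term with constant B, band gap $E_G$, field $E$, and length $l$]
$S \sim \dfrac{V_{gs}^2}{2V_{gs} + D(V_{gs}, V_{ds})}$
Krishna K. Bhuwalka et al. 2005
[Figure: Log Ids with slope S. Annotations: high Vth, low power space; scaling to quantum capacitance limit.]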
[Figure: Relative Performance versus Junction Temperature (K), 50 to 400, for 0.1W, 1W, and 10W chips, each also shown with const R; the wire resistivity model ρ(W,T) is scaled from ρ300K.]
64