DJF - Scaling and Optimization - IIT PDF

Technology Optimization and
Performance Projections
David J. Frank
12/4/09
International Winter School for Graduate Students
IIT, Bombay, India
1
1. Introduction
The End of Scaling?
[Figure: log(Performance) vs. Time (yr)]
3
The End of Scaling is Optimization
Stop when you get to the top.
[Figure: log(System Performance) vs. Miniaturization]
4
Outline
1. Review limitations to Scaling
2. Optimizing technology within constraints
3. Optimization results
4. Technology projections
5
1. Limitations to Scaling
1. Electrostatic constraints
2. Quantum mechanical leakage currents
3. Discreteness of matter and energy
4. Thermodynamic limitations
5. Practical and environmental constraints on power
Basic idea of Scaling:
Adjust dimensions, voltages, & doping to achieve a smaller FET with the same electrostatic behavior.
6
Electrostatic constraints on FET design
1. A good design should have
Long channel behavior:
A. High output resistance
B. High gain
C. Low sensitivity to variations
Short channel behavior:
D. High transconductance
E. High drain current
F. High speed
2. Must choose a compromise: Short, but not so short that 2D effects kill A, B, C.
[Figure: FET cross-section showing Gate, S, and D]
7
Quantum Mechanical Tunneling Leakage Currents
Currents increase exponentially as the barriers become thinner.
Everything becomes leaky:
– Gate insulator tunneling
– Subthreshold leakage
– Direct source-to-drain tunneling
– Drain-to-body tunneling
[Figure: FET 'ON' and FET 'OFF' bias conditions (Vdd, Gnd), with electrons (e-) leaking across the source, channel, and drain]
8
Atomistic effects
The number of dopant atoms in the depletion layer of a MOSFET has been scaling roughly as Leff^1.5.
Statistical variation in the number of dopants, N, varies as N^(1/2), causing increasing VT uncertainty for small N.
[Figure: atomistic device simulation with 249,403,263 Si atoms, 68,743 donors, and 13,042 acceptors]
[D. J. Frank, et al., 1999 Symp. VLSI Tech.]
9

Thermodynamic limitations
The Boltzmann distribution determines the subthreshold slope and leakage current, VT, and diode leakage currents, too.
I_{subVT} = I_0 \, e^{e(V_G - V_T)/\eta kT}
VT can only be scaled by reducing the temperature, which is not acceptable for many applications.
Speed is very sensitive to the VT/VDD ratio.
Irreversible computation => All switching energy is converted to heat.
All leakage currents and IR drops are irreversible => More heat.
[Figure: electron (e-) crossing the barrier from Source through Channel to Drain]
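The Boltzmann factor in the subthreshold current expression can be turned into numbers with a short sketch. The function names and the 300 K, 77 K, and 100 mV inputs below are illustrative assumptions, not values from the talk; the code only evaluates the exponential e(VG - VT)/ηkT as written on the slide.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C


def subthreshold_swing_mv_per_decade(temp_k, eta=1.0):
    """Gate voltage (mV) needed to change I_subVT by a factor of 10."""
    return eta * K_B * temp_k / Q_E * math.log(10.0) * 1e3


def leakage_increase(delta_vt, temp_k, eta=1.0):
    """Factor by which I_subVT grows when VT is lowered by delta_vt volts."""
    return math.exp(Q_E * delta_vt / (eta * K_B * temp_k))


if __name__ == "__main__":
    # Hypothetical operating points chosen only for illustration.
    print(subthreshold_swing_mv_per_decade(300.0))   # room temperature, ideal eta
    print(subthreshold_swing_mv_per_decade(77.0))    # liquid-nitrogen temperature
    print(leakage_increase(0.1, 300.0))              # VT lowered by 100 mV
```

With η = 1 the swing at 300 K comes out just under 60 mV/decade, so lowering VT by 100 mV at room temperature raises the subthreshold leakage by well over an order of magnitude.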
10
Practical and Environmental Issues
• Power consumption and heat removal are
limited by practical considerations.
• Low power applications are often battery
powered
– Many must be lightweight => power < ~few watts.
– Disposable batteries can cost >> $500/watt over life of
device.
– Rechargeables can cost > $50/watt over life of device.
• Home electronics is limited to <~1000W by
heating of the room and cost of electricity.
• High performance is limited by difficulty of
heat removal from chip (~100 W/chip). (Cost
of electricity is ~$5/watt over life.)
11
2. Optimizing Technology within Constraints
Fixed architectural complexity
+ Fixed power constraints
+ Device physics
= Existence of an optimal technology with maximal performance.
Practicality imposes power constraints.
Electrostatics imposes geometric constraints.
Thermodynamics imposes voltage constraints.
Quantum mechanics imposes miniaturization constraints due to tunneling.
[Figure, left panel: Power vs. Miniaturization (Large to Small); leakage power increases due to tunneling effects while dynamic power declines]
[Figure, right panel: log(Performance) vs. Miniaturization (Large to Small); declining available dynamic power overwhelms the speed improvements of scaling]
12
Background: Schematic organization of an optimization program
Goal: optimize device technology to maximize chip-level performance, subject to power constraints.
[Figure: optimizer flowchart. Fixed parameters and an initial guess for the variables feed the Area Model, Thermal Model, and Device Structure. Together with wiring statistics, these drive the IV Model, Leakage Model, and Wire Capacitance, which give Delay and Leakage Power. After tolerance adjustments, the delay is adjusted for latency of long paths and the Total Power is computed. The constrained optimizer then returns new values: an improved guess.]
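The loop in the schematic (guess the variables, evaluate delay and power, keep the best point that meets the power constraint) can be illustrated with a toy grid search. This is a minimal sketch, not the optimizer described in the talk: the alpha-power-law current, the leakage expression, and every parameter value (gate count, load capacitance, drive and leakage currents, swing) are hypothetical assumptions, and only VDD and VT are searched.

```python
import math


def toy_chip(vdd, vt, n_gates=1e8, c_load=1e-15, activity=0.012,
             i_drive=1e-4, alpha=1.3, i_leak0=1e-6, swing=0.09):
    """Return (performance, power) of a toy chip at one (VDD, VT) point.

    All default parameter values are illustrative assumptions.
    """
    if vdd <= vt:
        return 0.0, math.inf
    i_on = i_drive * (vdd - vt) ** alpha       # alpha-power-law on-current, A
    delay = c_load * vdd / (2.0 * i_on)        # loaded gate delay, s
    ltr = n_gates * activity / delay           # logic transitions per second
    p_dyn = ltr * 0.5 * c_load * vdd ** 2      # switching power, W
    i_off = i_leak0 * 10.0 ** (-vt / swing)    # subthreshold leakage per gate, A
    p_leak = n_gates * vdd * i_off             # static power, W
    return ltr, p_dyn + p_leak


def optimize(power_limit, steps=60):
    """Grid-search VDD and VT for maximum performance with power <= power_limit."""
    best = None
    for i in range(1, steps + 1):
        vdd = 0.2 + 1.0 * i / steps            # from just above 0.2 V up to 1.2 V
        for j in range(1, steps + 1):
            vt = 0.05 + 0.5 * j / steps        # from just above 0.05 V up to 0.55 V
            perf, power = toy_chip(vdd, vt)
            if power <= power_limit and (best is None or perf > best["perf"]):
                best = {"perf": perf, "vdd": vdd, "vt": vt, "power": power}
    return best


if __name__ == "__main__":
    for limit in (1.0, 10.0, 100.0):
        print(limit, optimize(limit))
```

Even in this crude model the optimal supply voltage falls as the power budget shrinks. A real optimizer would replace the grid search with a constrained solver and the toy expressions with full device, wiring, and thermal models.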
13
Models and Approximations
System Assumptions
Processor chip is assumed to have a fixed number of cores, each with
a specified number of logic gates.
Only the logic within the cores is considered within the optimizations.
The clock and memory aspects of the chip are assumed to scale in
the same way as the logic (delay, power, and area).
Core-to-core and core-to-memory communication is not dealt with.
[Figure: chip divided into Repeaters, Clock, Memory, and Logic, annotated "Treat in detail" and "Fudge"]
Treat these by simple scaling from the logic part.
14
How much area do the processor cores take?
100% to 25%, generally decreasing with generation:
– Alpha 21264 ('96), 15M FETs, L1 cache only: 100%
– Prescott, 125M FETs: 70%
– Dothan, 140M FETs: 40%
– Power4, 174M FETs: 40%
– 2 cores, 1.72B FETs: ~25%
15
Area usage within a processor core
Approximate area fractions for a high-performance microprocessor core in leading-edge technology (data from M. Scheuermann and M. Wisniewski).
[Figure: nested pie charts of core area. Categories: Caches (L1); Register files; Macros; Custom & RLMs; Buffers & extra latches; Latches and LCBs; Logic in use; Unused Logic; Caps, Clock dist., Unused]
Processors built with nanotechnology are likely to have similar area usage statistics.
Nanotechnology may require additional area allocations for defective circuitry.
Estimates of power and computational densities should take into account realistic area efficiencies.
16
Optimization Approaches
1. Engineering approach:
Maximize system performance, at fixed power.
Use total logic transition rate (LTR),
LTR = Ngates × (activity factor / logic depth) × (1 / Delay)
Relatively little dependence on architectural details.
2. Business approach:
Maximize Return on Investment (ROI).
17
FET Model
Using a general temperature-dependent short-channel FET model in which VT, tD, and tox are coupled, halo doping effects are included, and VT is set by the doping.
Modified alpha power model:
I_D(V_{GS}) = \frac{W_{eff}\,\varepsilon_I}{t_{ox}} \, \frac{\eta kT}{e} \left( \frac{\eta kT/e}{F_I E_C L_{CH}} \right)^{\gamma} \mu_0 \left( \frac{\mu(E_\perp)}{\mu_0} \right)^{s} E_C \, F_\alpha\!\left( \frac{V_{GS} - V_T}{\eta kT/e} \right)
where F_\alpha is the Fermi-Dirac integral of order α.
[Figure: model curves labeled "10W FET, Lg=28nm" and "1mW FET, Lg=45nm"]
18
Circuit Delay Estimation
Basic circuit elements are:
FI=2, FO=1.65 wire-loaded NAND gates for logic
inverters for repeaters, FO ~ 1.2
Delay calculations:
\tau_1 = \frac{V_{DD}\,(C_{parasitic} + C_{wire} + C_{gateload})}{2\, I_{Deff}^{*}}
* Current is adjusted to account for noise and variations.
Correction for VT/VDD [Eble's thesis]: 0.5 + (1 − VT/VDD) (1 + α)
\tau_2 = R_{wire}\,(C_{wire} + C_{gateload})
\tau_3 = \frac{L_{wire}}{c/2}   (propagation delay)
\tau = \tau_1 + \left( \tau_2^{4/3} + \tau_3^{4/3} \right)^{3/4}
Final delay empirically merges the separate components.
19
Power Calculation
P_{TOT} = P_{DYN} + P_{subVT} + P_{OX} + P_{B2B}
P_{DYN} = \frac{\alpha}{l_D} N_{CKT} \, \frac{1}{2} C (V_H - V_L) V_{DD} \, \frac{1}{\tau}
P_{subVT} = 1.7 \, N_{CKT} V_{DD} \, I_{off}(V_T, V_{DD}, t_{ox}, \eta, L_G, W_L)
P_{OX} = A_{core} D_{ox}(W_L) V_{DD} \, J_{ox}(V_T, V_{DD}, t_{ox}, \eta)
P_{B2B} = \frac{1}{3} A_{core} D_{ox}(W_L) V_{DD} \, J_{B2B}(F_{Max}, V_{DD})
LTR = \frac{\alpha}{l_D} N_{CKT} \frac{1}{\tau}
Note that cross-through power is not included.
The powers are computed separately for logic and for repeaters.
τ = mean delay for a single loaded logic gate
α/l_D is activity factor divided by logic depth. Usually ~0.012 in recent optimizations.
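The relation between the logic transition rate and the dynamic power can be checked numerically: P_DYN is LTR times the switching energy ½ C (VH − VL) VDD. The sketch below assumes a full-rail swing (VH − VL = VDD) unless told otherwise, and the gate count, delay, and capacitance in the example are hypothetical values chosen for easy arithmetic; only the activity factor of 0.012 comes from the slide.

```python
def logic_transition_rate(n_ckt, alpha_over_ld, tau):
    """LTR = (alpha / l_D) * N_CKT / tau, in logic transitions per second."""
    return alpha_over_ld * n_ckt / tau


def dynamic_power(n_ckt, alpha_over_ld, tau, c_load, vdd, v_swing=None):
    """P_DYN = LTR * (1/2) * C * (V_H - V_L) * V_DD, in watts.

    v_swing stands for (V_H - V_L); a full-rail swing is assumed by default.
    """
    if v_swing is None:
        v_swing = vdd
    ltr = logic_transition_rate(n_ckt, alpha_over_ld, tau)
    return ltr * 0.5 * c_load * v_swing * vdd


if __name__ == "__main__":
    # Hypothetical chip: 1e8 gates, 10 ps loaded gate delay, 1 fF load, 1 V supply.
    print(logic_transition_rate(1e8, 0.012, 10e-12))      # transitions per second
    print(dynamic_power(1e8, 0.012, 10e-12, 1e-15, 1.0))  # watts
```

Because the delay appears in both expressions, any change that speeds up the gates at fixed voltage raises LTR and dynamic power by the same factor.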
20
Communication and Wiring Models
Assume wire lengths distributed according to Rent's rule.
i_{net}(l) = \frac{4\,FO}{3 + FO} \left[ l^{2r-3} - \left( 2\sqrt{N_{CKT}} \right)^{2r-3} \right]
L_{noRptr} = \frac{\int_1^{l_R} l \, i_{net}(l) \, dl}{\int_1^{l_R} i_{net}(l) \, dl}
N_{Rptr} = \int_{l_R}^{l_{Max}} \frac{l}{l_R} \, i_{net}(l) \, dl
Units are gate pitches.
r = Rent exponent, 0.6, here.
[Figure: log(number of wires) vs. log(length), with l_R and 2\sqrt{N_{CKT}} marked on the length axis]
[Figure, from optimizations: # Wiring Levels Required (4 to 12) vs. Number of Logic Gates (1E+5 to 1E+8)]
21
Repeater Model
Long wires receive repeaters with a spacing that is optimized.
Long wire delay can be absorbed into pipeline depth, but the latency causes inefficiency, so we use a latency penalty factor: γ.
[Figure, left panel: Repeater Spacing (um) and Repeater Width (um) vs. Latency Penalty Factor; Pecon = 10 W/cm2]
[Figure, right panel: Repeater Spacing (cm) vs. CPU Core Power (W), curves labeled 9S, 10S, 11S, 12S, 13S]
22
Local Variation Modeling
• Variation sources:
– Signal Coupling noise
– Supply noise
– Statistical doping variations
– LER gate length variations
• Consequences modeled:
– Increased static power
• combine 1 sigma of doping, length, and noise
– Critical path delay distribution
• yield-based, using estimated critical path
distribution,
• and 1 sigma of doping and length, and worst case
noise.
– Single stage functionality
• use worst case (~6 sigma) of doping and length,
no noise.
23
Accounting for variations
• A complete most-probable worst-case-vector methodology is used to handle both local and global variability.
Two variable example:
Blue curves are contours of constant probability (1σ, 2σ, 3σ).
Red curves are contours of function to be minimized.
[Figure: worst-case vectors in a two-variable plane, showing the MPWC vector and the Murphy vector]
24
Power vs Frequency Variation Windows
Optimizations for high-perf processor, 45 nm node technology, PDSOI
Power-constrained optimization parameters: VDD, VTn, VTp, LG, repeater size and spacing.
Area is also constrained, to 5.6 cm2, by adjusting the widths.
Nominal design point: VDD = 1.058 V. This is the point that is calculated, optimized to, and reported.
These boxes are for ±0.675 sigma, for 50% yield.
Power and delay are treated as independent: 25% of yield is lost for each.
The data shows 10% frequency variation per sigma.
[Figure: Chip Power (W) vs. Clock Frequency (GHz), with variation boxes for VDD+5%, the nominal design point, and VDD-10%; 230 W is marked on the power axis]
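The ±0.675 sigma window can be checked against the normal distribution, assuming the variations are Gaussian. The helper names and the 4 GHz nominal frequency in this sketch are hypothetical; the 0.675 sigma window and the 10% frequency variation per sigma come from the slide.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def fraction_inside(n_sigma):
    """Probability that a normal variable lies within +/- n_sigma of its mean."""
    return normal_cdf(n_sigma) - normal_cdf(-n_sigma)


def one_sided_tail(n_sigma):
    """Probability that a normal variable lies more than n_sigma above its mean."""
    return 1.0 - normal_cdf(n_sigma)


def frequency_window(f_nominal, n_sigma=0.675, rel_sigma=0.10):
    """Frequency range spanned by +/- n_sigma when one sigma is rel_sigma of nominal."""
    half_width = f_nominal * n_sigma * rel_sigma
    return f_nominal - half_width, f_nominal + half_width


if __name__ == "__main__":
    print(fraction_inside(0.675))    # share of parts inside the box along one axis
    print(one_sided_tail(0.675))     # share of parts beyond one edge of the box
    print(frequency_window(4.0))     # GHz, for a hypothetical 4 GHz nominal point
```

About half of a normal population falls inside ±0.675 sigma and about a quarter lies beyond either edge, which matches the 50% and 25% figures quoted for the boxes.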

25
Impact of variability on performance
– Atomistic effects are leading to greater device variability.
– Increasing variability requires larger design margins.
– Designing for larger margins decreases performance.
Increased variability requires:
– Higher supply voltages
– Less scaled FETs
[Figure: Relative Performance vs. Relative Margin (0% to 200%) for P=0.01, P=1, and P=50; 65nm node, dual processor core]
26
Summary: Models, Assumptions and
Approximations
Power modeling
Dynamic switching energy plus static power mechanisms
including sub-threshold current, gate oxide tunneling, and
body-to-drain band-to-band tunneling.
Device modeling
Bulk MOSFETs: VT, and depletion depth determined by the
halo doping, 2D effects are taken into account.
Gate length is fully optimized, not set by the technology
node.
Circuit modeling
Delay is for FI=FO=2 or 3 NAND gates, based on model
from J.C. Eble's thesis [Ga.Tech. '98].
Capacitance includes gate, parasitic, and wire parts (Rent's
rule).
Wire resistance includes temperature dependence and
surface scattering in small wires.
27
Summary: Models, Assumptions and
Approximations
Chip-level modeling
Allocate fixed fraction of chip power and area to logic, and
assume fixed number of logic gates. Logic part is
optimized, and the rest is assumed to scale similarly.
Assume multiple processor cores are interconnected in a
way that does not greatly add to the wiring burden.
Long wires are fatter, and receive repeaters with a spacing
that is optimized.
Long wire delay is accounted for using a latency penalty
factor.
On-chip tolerance/variability and noise are accounted for.
28
3. Optimization Results
• General results
• Evaluating specific possible device
directions
– Increasing mobility
– High-k gate dielectric and metal gates
– 3D stacking
– Better heat sinks
– Sub-ambient cooling
– Multi-processor tradeoffs
29
Optimize by technology node
For each node, pre-specify the following parameters:
• Wire half-pitch,
• gate overlap,
• halo scalelength,
• contact resistance,
• LER sigma,
• ACLV,
• mobility,
• gate depletion,
• k_wire,
• k_gate
Optimizations over 7 variables: tox, Lg, ND, <w>, Vdd, Srpt, <wrpt>
Dual core processor with aggressive air cooling
Note that the LG, tox, VDD, VT, width, etc. are NOT preselected. They are solved for by the optimizations.
30
Optimization results
[Figure, left panel: Gate Length vs Power. Gate Length (nm) vs. Total Chip Power (W) for the 90 nm, 65 nm, 45 nm, and 32 nm nodes]
[Figure, right panel: Oxide Thickness vs Power. Equiv. Oxynitride Thickness (nm) vs. Total Chip Power (W) for the 90 nm, 65 nm, 45 nm, and 32 nm nodes with oxynitride, plus high-k for 32nm]
(High-k case assumes 0.3nm barrier layer, bandedge metal gate, HfO2-like insulator characteristics.)
Dual core processor with aggressive air cooling
31
Optimization results
[Figure: Voltages vs Power. Supply and Threshold Voltage (V) for the 90, 65, 45, and 32 nm nodes (Vdd and VT curves). Dual core processor with aggressive air cooling]
– Supply voltages are lower for low power applications.
– High-k lowers VDD ~ 15% at the 45nm generation.
32
Optimal Power Allocation Fractions
Active power fraction: 70% at low power to 40% at high power.
[Figure: Power Allocation (0% to 100%) vs. Chip Power (W), split into oxide, subVT, and dynamic power, each for repeaters and for logic]
45nm technology with microchannel heat sink and water cooling. 4 core chip.
33
Mobility dependence
Enhanced mobility has greatest benefit at high power.
Even for large mobility enhancements, performance boost is modest: 10-15%.
[Figure, left panel: performance boost vs. Mobility Enhancement Factor (1 to 3) for 1 W, 10 W, and 100 W chips; 45nm technology, dual core processor, water cooled]
[Figure, right panel: performance boost vs. Mobility Enhancement Factor (1 to 3) for 1 W, 10 W, and 100 W chips; 32nm technology, 8 core processor, air cooled]
34
Metal-gate workfunction for high-k and oxynitride
[Figure: Performance relative to poly-Si vs. Workfunction offset from bandedge (eV), for 5W and 50W chips with oxynitride and with high-k]
45nm node, dual core processor with aggressive air cooling
35
3D stacking
Multiple layers offer higher performance due to shorter wires.
[Figure: four panels vs. Chip Power (W), RED = 1 Layer, GREEN = 2 Layers: Relative Performance; Mean FET Width (nm); Mean Wire Length (um); Chip Area (cm2), showing total Si area and footprint]
(4 core, 45nm node, water cooling.)
36
Cooling scenario optimizations
Forced liquid cooling through microchannel fins may permit very high power densities.
Optimized (maximum) performance increases as the ~log of the power.
[Figure, left panel: Performance (arb units) vs. Total Chip Power (W) for Low-Cost Air, Hi-Perf. Air, 18C Water, and -40C Liquid; 4 core processor design, 45nm technology]
[Figure, right panel: Performance (arb units) vs. Total Chip Power (W) for the same cooling cases; 8 core processor design, 32nm technology]
Optimized over 7 variables: Lg, tox, Nd, <w>, Drptr, <wrptr>, Vdd.
Low temperature case does not include refrigerator power.
37
Multiprocessor motivation
The energy / performance tradeoff is very steep at the high end.
Lower power, more parallel processors potentially offer more computation for the same total power level.
[Figure: Loaded Switching Energy (fJ) vs. Total Logic Transitions / sec (1E+12 to 1E+16) for 4 core processors, with 3x markers]
4-processor chips with micro-channel water cooling, optimizing 'everything'.
9 variables: tox, Lg, ND, <w>, Vdd, wHP, Srpt, <wrpt>, xhalo
38
Dependence on number of cores
Constant total number of transistors, divided equally among n cores:
[Figure: Performance (arb units) for 2 cores, 4 cores, 8 cores, and 16 cores]
39
4. Future Projections (22 → 11nm)
• Device options
• General results
• Technology projections
• Beyond 11nm?
40
Device Options
• PDSOI
– IBM’s best understood technology.
• FinFET
– Improved electrostatic control of channel offers shorter gates, lower voltages, higher speed. Workfunction VT control. Not entirely planar.
• ETSOI
– Somewhat improved electrostatic control compared to PDSOI. More compatible with conventional planar processing. Workfunction VT control.
• Shallow Bulk
– Shallow junctions, raised S/D.
The ETSOI and FinFET devices were simulated using an in-house scaling model.
Comparable source/drain resistance and parasitic capacitance models were implemented for ETSOI, FinFET, and for shallow bulk MOSFETs.
[Figure: cross-sections of FinFET (gate, source, drain), ETSOI (gate, source, drain, buried oxide, substrate), and bulk (gate, source, drain) devices]
41
General optimization results – performance vs power
As always, the easiest way to increase performance is to increase the power.
[Figure: Frequency (Hz) vs. Total Chip Power (W) for the 22nm, 16nm, and 11nm nodes]
Conditions: PDSOI, 4 core processor chip, constraining total chip power
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing.
42
Gate Length and Chip Area vs Power
Lower power requires longer gate lengths, to reduce variability.
Device density increases for each generation. Shrinking area increases power density, too.
[Figure, left panel: Gate Length (nm) vs. Total Chip Power (W) for the 22nm, 16nm, and 11nm nodes; Toxeqv: 0.6 – 0.82 nm]
[Figure, right panel: Chip Area (cm2) vs. Total Chip Power (W) for the 22nm, 16nm, and 11nm nodes]

43
Voltage and Energy vs Power
Optimal supply voltage can become quite low for low power constraints, leading to very low energy use per logic transition.
[Figure, left panel: Supply Voltage (V) vs. Total Chip Power (W) for the 22nm, 16nm, and 11nm nodes]
[Figure, right panel: Energy per logic transition (fJ) vs. Total Chip Power (W) for the 22nm, 16nm, and 11nm nodes]

44
Energy vs performance trade-off
[Figure: energy vs. Clock Frequency (Hz) for the 22nm, 16nm, and 11nm nodes, with 10x and 4x markers]

45
Optimal On/Off Ratio
PDSOI nFETs, currents measured at nominal process and bias conditions
[Figure: Off-current (A/cm) vs. On-current (A/cm) for the 22 nm, 16 nm, and 11 nm nodes, with 1000:1 and 10000:1 ratio lines and the regions labeled "High bias conditions" and "Low bias conditions"]

46
Performance at constant power density (PDSOI)
Performance increases at 32nm due to hi-k introduction, but then falls as strain diminishes and gate dielectric does not scale further.
[Figure: three panels vs. Technology Node (45nm, 32nm, 22nm, 16nm, 11nm) at 10 W/cm2, 25 W/cm2, and 50 W/cm2: Frequency (GHz); Supply Voltage (V); Chip Area (cm2)]
Conditions: PDSOI, 4 core processor chip, constraining total chip power density
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing.
47
Performance at constant power density – comparing technologies
ETSOI and FinFET offer moderate performance advantage over PDSOI for 22nm node and beyond. The industry should transition to FinFET at 16nm to avoid performance loss.
[Figure: Frequency (GHz) vs. Technology Node for Bulk, PDSOI, FinFET, and ETSOI at 25 W/cm2]
Conditions: 4 core processor chip, constraining total chip power at 25 W/cm2
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing, fin height, sidewall thickness (Fin), Si thickness (ET).
48
Power density at constant performance
Required power density drops at 32nm due to transition to hi-k, then rises due to decreasing strain and lack of scaling of gate insulator.
Transition to FinFET at 16nm to obtain continued improvement.
[Figure, main panel: Power Density (W/cm2) vs. Technology node (45nm, 32nm, 22nm, 16nm, 11nm) for PD 1e15, PD 1.5e15, Fin 1e15, and Fin 1.5e15; curves labeled ~2.6 GHz and ~3.9 GHz]
[Figure, second panel: Area (cm2) vs. Technology node for the same PDSOI and FinFET cases]
Conditions: PDSOI and FinFET, 4 core processor chip, constraining total chip performance
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing, fin height, sidewall thickness
49
Performance at constant area and power
Area = 4 cm2, power = 100 W, fixed.
Number of cores is adjusted to maintain constant area. Chip performance is assumed linear with the number of cores.
Maximizing total chip performance.
[Figure, main panel: total chip performance vs. Technology node (45nm, 32nm, 22nm, 16nm, 11nm) for PDSOI and FinFET at 25 W/cm2 and 50 W/cm2]
[Figure, second panel: Number of Cores vs. Technology node for the same cases]
Conditions: PDSOI and FinFET, variable # core processor chip, constraining both chip power and chip area (4 cm2).
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing, fin height, sidewall thickness, and number of cores.
50
Supply voltage considerations
VDD can be reduced by higher activity, higher mobility, and tighter tolerances.
Optimizing everything except VDD.
[Figure: curves vs. Supply Voltage (V) at 25 W/cm2: Vary Vdd, Vary actf, Vary Tol, Vary mu; annotations: 0.7X Variability, 1.4X Mobility, 0.7X and 1.4X Activity factor, 18%, 6%]
Conditions: PDSOI, 4 core processor chip, constraining total chip power density
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing.
51
Optimizations vs gate length for FinFETs and ETSOI
Everything except gate length is optimized. The gate length is scanned.
25 W/cm2 power density constraints. 11 nm node.
Shorter gate lengths necessitate:
– Higher DIBL
– Higher VDD
– Thinner tSi
Ultimately, lower performance
[Figure: four panels vs. Gate Length (nm) for ETSOI and FinFET: Rel. Performance; DIBL (V); Supply Voltage, Vdd (V); Silicon Thickness (nm)]
52
DIBL Scaling enables VDD Scaling
[Figure, top row (1W): Supply Voltage, VDD (V) vs. DIBL (V/V), and Gate Length (nm) vs. Technology Node (22nm, 16nm, 11nm), for FinFET, ETSOI, and Bulk]
[Figure, bottom row (25W/cm2): Supply Voltage, VDD (V) vs. DIBL (V/V), and Gate Length (nm) vs. Technology Node (22nm, 16nm, 11nm), for FinFET, ETSOI, and Bulk]

53
Novel devices for 11nm and beyond
• III-V FinFETs
– Higher mobility improves drive current
• Tunnel FETs
– Improved subthreshold slope enables low VDD and low energy
operation.
– To properly model this device, we have to be able to calculate the tunneling
barrier shapes and band-edge alignments.
– We are in the process of developing a compact model for TFETs for the
optimizer, but results are not yet available. As an interim measure, we
can alter the Boltzmann constant in the conventional FET model, to see
the impact of steeper subthreshold slope.
• Carbon Nanotube Transistors
– Ballistic current flow in the channel should enable very high switching
speeds for these devices.
– A compact model for CNTs suitable for the optimizer is being presented
at IEDM this year, but results are not available yet.
54
Comparing devices – Energy vs Performance
[Figure: energy vs. Frequency (Hz) for PDSOI, FinFET, CNFET, and TFET]
[General trends, not exact results.]
55
Summary
• CMOS scaling is limited by electrostatic, quantum mechanical,
discreteness, thermodynamic and practical effects.
• Optimization can and should be used to find the best design points
in the midst of these various constraints.
– Example: low power needs somewhat less scaled devices.
• Technology performance projections based on optimization for
PDSOI, FinFETs, ETSOI, and Bulk MOSFETs show:
– Density improvements should continue, as long as wiring density
continues to improve
– Performance improvements are likely to be rather modest, even with a
switch to FinFETs for 16 and/or 11nm nodes.
• Exploratory devices (CNTs and/or TFETs) may offer substantial
performance advantages, someday….
56
Acknowledgements
• Wilfried Haensch • Mike Scheuermann
• Leland Chang • Phillip Restle
• Paul Solomon • Omer Dokumaci
• Steve Koester • Mary Wisniewski
• Lan Wei • Steve Kosonocky
• Philip Wong • Yuan Taur
• Ghavam Shahidi • Bob Dennard
57
Extra slides
58
Generalized heat sink model
• Two level heat flow model:
– Flow in the silicon wafer
– Flow in the heat sink material
• In each layer, the flow can be:
– 3D (spherical) for spots smaller than thickness
– 2D (cylindrical) at distances larger than the thickness
• In silicon layer, inhomogeneous power dissipation is accounted for, to estimate maximum junction temperature at hottest point.
ρSi – thermal sheet resistance of Si wafer
ρHS – thermal sheet resistance of heat sink
RSi – thermal contact resistance of Si wafer
RHS – thermal contact resistance of heat sink
[Figure: layer stack of heat sources, Si wafer, interface, heat spreader (e.g., SiC or Cu), and interface to final coolant (e.g., air or water)]
[Figure: Temperature Rise (K) vs. Hot spot size (cm). Comparison of simplified analytic model with detailed numerical model. This model is red. Kai’s FE data is blue.]
59
III-V FinFETs
III-V FinFETs are modeled by increasing the mobility in the conventional model.
Increased mobility enables an improved energy/performance tradeoff by reducing the voltage needed for high performance designs.
Drive current multiplier:
– 1x (=Si)
– 2x (~GaAs)
– 4x (~InGaAs)
[Figure, main panel: Energy/transition (fJ) vs. Performance (transitions/sec) for FinFET, III-V 2x, and III-V 4x]
[Figure, second panel: Supply Voltage (V) vs. Frequency (GHz) for Fin, 2x, and 4x]
60
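Beating the sub-threshold slope limit
T \sim \frac{E^2}{\sqrt{E_G}} \, e^{-B \sqrt{E_G^3}/E} \sim \frac{\sqrt{E_G^3}}{l^2} \, e^{-B \sqrt{E_G}\, l}
S \sim \frac{V_{gs}^2}{2 V_{gs} + D(V_{gs}, V_{ds})}
Krishna K. Bhuwalka et al. 2005
Tunnel FETs show strong voltage dependence of sub-threshold slope
On-current not yet on par with conventional high performance FETs at comparable voltages
[Figure: band diagram (Ec, Ev) across the P++ source, gated region, and n++ drain; sketch of Log Ids vs. Vgs marking the slope S, the 60mV/dec line, and the high Vth, low power space]
[Haensch]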
61
TFET Heterostructures
SiGe heterostructures offer low effective bandgap, which improves tunneling.
III-V materials offer lower effective masses and more heterojunctions, to further improve tunneling.
Nanowire geometry offers:
– Optimum electrostatics with gate all around
– New material combinations
– Integration onto silicon possible
– Scaling to quantum capacitance limit
– Improved Ion
[Figure: Planar SiGe H-TFET (poly gate, p++ SiGe source, p-Si body, n+ Si drain, buried oxide) with Si/SiGe band diagrams along x: ON-state (Vg > 0), high on current, ambipolar device suppressed; Off-state (Vg = 0), ambipolar device suppressed]
[Figure: III-V H-TFET]
[Koester, Riel, Koswatta]

62
CNFET: Good or Bad?
• Carbon Nanotube Field Effect Transistor (CNFET)
– 1D devices
– Better transport characteristics ☺
– Worse parasitics
– Leakage, Variations, etc…
• CNFET: Good or Bad?
– The judgment highly depends on the application
– Our approach is to build an optimizer for a full technology,
with proper consideration of device properties and system
needs.
• Device modeling
• Circuit performance benchmarking
• System application consideration
[Lan Wei, Stanford]

63
Wire Model
Assumed constant 2:1 height to width wiring with equal lines and spaces (0.062 kBEOL fF/um).
\rho(W, T) = \rho_{300K} \cdot \left[ \frac{70\,\mathrm{nm}}{W} + \frac{\ln\left( 1 + e^{(T-40)/10} \right)}{26} \right]
Consequences of wire resistance model:
[Figure: Relative Performance change due to R(T) vs. Junction Temperature (K) for 10W, 1W, and 0.1W, each compared with the const R case]
[Figure: Performance loss due to scattering. LTR with Scatt / LTR w/o Scatt. vs. Total Power (W)]
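The wire resistivity model above is easy to evaluate directly. This is a minimal sketch of the ratio ρ(W, T)/ρ300K, with W in nm and T in K as the 70 nm and (T − 40)/10 terms suggest; the function name and the sample widths and temperatures are illustrative assumptions.

```python
import math


def wire_resistivity_ratio(width_nm, temp_k):
    """Return rho(W, T) / rho_300K for a wire of the given width and temperature."""
    size_term = 70.0 / width_nm                                      # surface scattering
    temp_term = math.log1p(math.exp((temp_k - 40.0) / 10.0)) / 26.0  # temperature part
    return size_term + temp_term


if __name__ == "__main__":
    # Hypothetical wire widths and temperatures chosen only for illustration.
    for width_nm in (1000.0, 70.0, 35.0):
        for temp_k in (77.0, 300.0, 400.0):
            print(width_nm, temp_k, wire_resistivity_ratio(width_nm, temp_k))
```

At 300 K the temperature term is almost exactly 1, so a very wide wire recovers ρ300K, while a 70 nm wire has roughly twice that resistivity. Cooling reduces only the temperature term, which is why narrow wires benefit less from low-temperature operation.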
64
