DJF - Scaling and Optimization - IIT

Technology Optimization and
Performance Projections
David J. Frank
12/4/09
International Winter School for Graduate Students
IIT, Bombay, India
1. Introduction
The End of Scaling?
[Figure: log(Performance) vs. Time (yr).]
The End of Scaling is Optimization
[Figure: log(System Performance) vs. Miniaturization. Stop when you get to the top.]
Outline
1. Review limitations to Scaling
2. Optimizing technology within constraints
3. Optimization results
4. Technology projections
1. Limitations to Scaling
1. Electrostatic constraints
2. Quantum mechanical leakage currents
3. Discreteness of matter and energy
4. Thermodynamic limitations
5. Practical and environmental constraints on power
Basic idea of Scaling: Adjust dimensions, voltages, & doping to achieve smaller FET with same electrostatic behavior.
Electrostatic constraints on FET design
1. A good design should have
A. High output resistance
B. High gain
C. Low sensitivity to variations
D. High transconductance
E. High drain current
F. High speed
(A, B, C: long channel behavior. D, E, F: short channel behavior.)
2. Must choose a compromise: Short, but not so short that 2D effects kill A, B, C.
Quantum Mechanical Tunneling Leakage Currents

Currents increase exponentially as the barriers become thinner.
Everything becomes leaky:
Gate insulator tunneling
Subthreshold leakage
Direct source-to-drain tunneling
Drain-to-body tunneling
[Figure: leakage paths in the FET 'ON' and FET 'OFF' states between Vdd and Gnd; band diagram with electrons (e-) passing from Source through Channel to Drain.]
Atomistic effects
The number of dopant atoms in the depletion layer of a MOSFET has been scaling roughly as $L_{eff}^{1.5}$.
Statistical variation in the number of dopants, $N$, varies as $N^{1/2}$, causing increasing VT uncertainty for small N.
[Figure: 249,403,263 Si atoms; 68,743 donors; 13,042 acceptors.]
[D. J. Frank, et al., 1999 Symp. VLSI Tech.]
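A minimal sketch of why small dopant counts hurt: if the spread in N goes as $N^{1/2}$, the relative spread goes as $N^{-1/2}$, and N itself shrinks as $L_{eff}^{1.5}$. The reference count and lengths below are hypothetical values chosen only for illustration, and treating the count as Poisson-like is an assumption, not something stated on the slide.

```python
import math


def relative_dopant_fluctuation(n_dopants: float) -> float:
    """Relative spread sigma_N / N, assuming sigma_N = sqrt(N)."""
    return 1.0 / math.sqrt(n_dopants)


def scaled_dopant_count(n_ref: float, l_ref_nm: float, l_new_nm: float) -> float:
    """Scale a reference dopant count assuming N ~ L_eff**1.5."""
    return n_ref * (l_new_nm / l_ref_nm) ** 1.5


if __name__ == "__main__":
    n_ref = 1000.0   # hypothetical dopant count at the reference length
    l_ref = 100.0    # hypothetical reference L_eff in nm
    for l_new in (100.0, 50.0, 25.0):
        n = scaled_dopant_count(n_ref, l_ref, l_new)
        rel = relative_dopant_fluctuation(n)
        print(f"L_eff = {l_new:5.1f} nm  N = {n:7.1f}  sigma_N/N = {rel:.3f}")
```

Shrinking L_eff by 4x in this toy cuts N by 8x and raises the relative fluctuation by about 2.8x.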
Thermodynamic limitations
The Boltzmann distribution determines the subthreshold slope and leakage current, VT, and diode leakage currents, too.
$I_{subVT} = I_0\, e^{e(V_G - V_T)/kT}$
VT can only be scaled by reducing the temperature, which is

not acceptable for many applications.
Speed is very sensitive to VT/VDD ratio.
Irreversible computation => All
switching energy is converted to heat.
All leakage currents and IR drops are
irreversible => More heat.
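The Boltzmann factor in the subthreshold current fixes how many millivolts of gate voltage are needed per decade of current. The sketch below evaluates that ideal swing, $\ln(10)\,kT/e$, and the resulting current ratio; the function names and the prefactor `i0` are illustrative assumptions.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C


def thermal_voltage(temp_k: float) -> float:
    """kT/e in volts."""
    return K_B * temp_k / Q_E


def ideal_swing_mv_per_decade(temp_k: float) -> float:
    """Gate-voltage change per decade of subthreshold current, in mV."""
    return 1000.0 * math.log(10.0) * thermal_voltage(temp_k)


def subthreshold_current(vg: float, vt: float, temp_k: float, i0: float = 1.0) -> float:
    """I = I0 * exp(e (VG - VT) / kT), the ideal Boltzmann-limited form."""
    return i0 * math.exp((vg - vt) / thermal_voltage(temp_k))


if __name__ == "__main__":
    for temp in (300.0, 200.0, 77.0):
        print(f"T = {temp:5.1f} K  swing = {ideal_swing_mv_per_decade(temp):.1f} mV/decade")
```

At room temperature the swing is just under 60 mV per decade, which is why VT cannot be reduced much without either accepting more leakage or lowering the temperature.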
Practical and Environmental Issues

Power consumption and heat removal are
limited by practical considerations.
Low power applications are often battery
powered
Many must be lightweight => power < ~few watts.
Disposable batteries can cost >> $500/watt over life of
device.
Rechargeables can cost > $50/watt over life of device.
Home electronics is limited to <~1000W by

heating of the room and cost of electricity.
High performance is limited by difficulty of
heat removal from chip (~100 W/chip). (Cost
of electricity is ~$5/watt over life.)
2. Optimizing Technology within Constraints

Practicality imposes power
constraints.
Electrostatics imposes
geometric constraints
Thermodynamics imposes
voltage constraints.
Quantum mechanics imposes
miniaturization constraints due to
tunneling.
Fixed architectural complexity

+ Fixed power constraints
+ Device physics
= Existence of an optimal technology with maximal performance.
[Figure: Power vs. Miniaturization (Large to Small). Leakage power increases due to tunneling effects, leaving less dynamic power.]
[Figure: log(Performance) vs. Miniaturization (Large to Small). Declining available dynamic power overwhelms speed improvements of scaling.]
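The existence of an optimum can be illustrated with a deliberately crude toy model. Everything here is an assumption made for illustration: miniaturization is a single number `s`, switching speed per unit of dynamic power grows linearly with `s`, and leakage power grows exponentially with `s`. None of these forms or constants come from the slides.

```python
import math


def toy_performance(s: float, p_total: float = 100.0,
                    p_leak0: float = 1.0, s0: float = 1.0) -> float:
    """Toy performance at fixed total power (all forms are hypothetical).

    Leakage grows exponentially with miniaturization s; whatever power is
    left is dynamic power, and each watt of it buys more switching as s grows.
    """
    p_leak = p_leak0 * math.exp(s / s0)
    p_dyn = max(p_total - p_leak, 0.0)
    return s * p_dyn


def best_miniaturization(s_values):
    """Return the grid point with the highest toy performance."""
    return max(s_values, key=toy_performance)


S_GRID = [0.05 * i for i in range(1, 93)]  # s from 0.05 to 4.6

if __name__ == "__main__":
    best = best_miniaturization(S_GRID)
    print(f"best s = {best:.2f}, performance = {toy_performance(best):.1f}")
```

The maximum sits in the interior of the range: beyond it, leakage eats the power budget faster than scaling improves speed.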
Background: Schematic organization of an optimization program
Goal: optimize device technology to maximize chip-level performance, subject to power constraints.
[Flowchart: fixed parameters and variables (initial guess) feed the models: Device Structure, IV Model, Leakage Model, Area Model, Thermal Model, Wiring statistics, and Wire Capacitance. These produce Delay (with tolerance adjustments, and adjusted for latency of long paths), Leakage Power (with tolerance adjustments), and Total Power. The constrained optimizer returns new values (improved guess) for the variables.]
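The loop in the schematic (guess, evaluate models, check the power constraint, improve the guess) can be sketched as follows. This is not the author's program: the delay and power models are hypothetical stand-ins with made-up constants, only two variables are optimized, and a simple pattern search replaces whatever constrained optimizer was actually used.

```python
def delay_model(vdd: float, vt: float) -> float:
    """Hypothetical gate delay: falls as the overdrive (vdd - vt) grows."""
    return vdd / (vdd - vt) ** 1.3


def power_model(vdd: float, vt: float) -> float:
    """Hypothetical total power: dynamic (C*V^2*f with C = 1) plus leakage."""
    dynamic = vdd ** 2 / delay_model(vdd, vt)
    leakage = 50.0 * vdd * 10.0 ** (-vt / 0.1)  # assumes 100 mV/decade
    return dynamic + leakage


def objective(x, power_limit: float) -> float:
    """Performance (1/delay), or -inf if the guess violates a constraint."""
    vdd, vt = x
    if vt <= 0.0 or vdd <= vt + 0.05:
        return float("-inf")
    if power_model(vdd, vt) > power_limit:
        return float("-inf")
    return 1.0 / delay_model(vdd, vt)


def optimize(x0, power_limit: float, step: float = 0.1, min_step: float = 1e-4):
    """Pattern search: try +/- step on each variable, shrink step when stuck."""
    x = list(x0)
    best = objective(x, power_limit)
    while step > min_step:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                trial[i] += delta
                value = objective(trial, power_limit)
                if value > best:
                    x, best, improved = trial, value, True
        if not improved:
            step *= 0.5
    return x, best


if __name__ == "__main__":
    (vdd, vt), perf = optimize((0.6, 0.3), power_limit=1.0)
    print(f"vdd = {vdd:.3f} V, vt = {vt:.3f} V, performance = {perf:.3f}")
```

A coordinate-wise search like this can stall on the constraint boundary, so it is only meant to show the structure of the loop, not to find the true constrained optimum.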
Models and Approximations

System Assumptions
Processor chip is assumed to have a fixed number of cores, each with
a specified number of logic gates.
Only the logic within the cores is considered within the optimizations.
The clock and memory aspects of the chip are assumed to scale in
the same way as the logic (delay, power, and area).
Core-to-core and core-to-memory communication is not dealt with.
[Figure: chip schematic. Repeaters and Logic: treat in detail. Clock and Memory: fudge, i.e., treat these by simple scaling from the logic part.]
How much area do the processor cores take?

100% to 25%, generally decreasing with generation:
70%: Prescott, 125M FETs
100%: Alpha 21264 ('96), 15M FETs, L1 cache only
40%: Dothan, 140M FETs
40%: Power4, 174M FETs
~25%: 2 cores, 1.72B FETs
Area usage within a processor core

Approximate area fractions for a high-performance microprocessor core in leading-edge technology (data from M. Scheuermann and M. Wisniewski).
[Figure: pie charts of core area by category: Caches (L1), Register files, Latches and LCBs, Buffers & extra latches, Macros, Custom & RLMs, Logic in use, Unused Logic, and Caps, Clock dist., Unused.]
Processors built with nanotechnology are likely to have similar area usage statistics.
Nanotechnology may require additional area allocations for defective circuitry.
Estimates of power and computational densities should take into account realistic area efficiencies.
Optimization Approaches
1. Engineering approach:
Maximize system performance, at fixed power.
Use total logic transition rate (LTR),
$LTR = N_{gates} \times (\text{activity factor} / \text{logic depth}) \times 1/\text{Delay}$
Relatively little dependence on architectural details.
2. Business approach:
Maximize Return on Investment (ROI).
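As a quick illustration of the engineering figure of merit, the logic transition rate can be evaluated directly from its definition. The gate count, activity factor, logic depth, and delay below are hypothetical inputs, not values from the optimizations.

```python
def logic_transition_rate(n_gates: float, activity_factor: float,
                          logic_depth: float, gate_delay_s: float) -> float:
    """LTR = N_gates * (activity factor / logic depth) / delay, in transitions/s."""
    return n_gates * (activity_factor / logic_depth) / gate_delay_s


if __name__ == "__main__":
    # Hypothetical chip: 1e7 gates, activity 0.24, logic depth 20, 20 ps gate delay.
    ltr = logic_transition_rate(1e7, 0.24, 20.0, 20e-12)
    print(f"LTR = {ltr:.2e} transitions/s")
```

Because LTR depends only on gate count, activity, depth, and delay, it can be compared across technologies without committing to a specific architecture.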
FET Model
Using a general temperature-dependent short-channel FET model in which VT, tD, and tox are coupled, halo doping effects are included, and VT is set by the doping.
Modified alpha power model:
[Equation: drain current $I_D(V_{GS})$ in terms of $W_{eff}$, $t_{ox}$, $L_{CH}$, the critical field $E_C$, $kT/e$, and a Fermi-Dirac integral evaluated at $(V_{GS} - V_T)/(kT/e)$.]
[Figure: example devices: 10W FET, Lg=28nm; 1mW FET, Lg=45nm.]
Circuit Delay Estimation

Basic circuit elements are:
FI=2, FO=1.65 wire-loaded NAND gates for logic
inverters for repeaters, FO ~ 1.2
Delay calculations:
$\tau_1 = \dfrac{V_{DD}\,(C_{parasitic} + C_{wire} + C_{gateload})}{2\, I^*_{Deff}}$
Current $I^*_{Deff}$ is adjusted to account for noise and variations.
$\tau_2 = R_{wire}\,(C_{wire} + C_{gateload})$
$\tau_3 = L_{wire} / (c/2)$
Final delay empirically merges the separate components [Eble's thesis]: the propagation delay combines $\tau_1$, $\tau_2^{4/3}$, and $\tau_3^{4/3}$ under an overall power of $3/4$, with a correction for $V_T/V_{DD}$.
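The two dominant delay components can be evaluated numerically as a sanity check: the drive-limited term, $V_{DD} C_{total} / (2 I_{Deff})$, and the wire RC term, $R_{wire}(C_{wire} + C_{gateload})$. The component values below are hypothetical, and the empirical merging of the components is deliberately left out of this sketch.

```python
def drive_delay(vdd: float, c_parasitic: float, c_wire: float,
                c_gateload: float, i_deff: float) -> float:
    """Drive-limited delay: VDD * (Cpar + Cwire + Cgate) / (2 * I_Deff)."""
    return vdd * (c_parasitic + c_wire + c_gateload) / (2.0 * i_deff)


def wire_rc_delay(r_wire: float, c_wire: float, c_gateload: float) -> float:
    """Wire RC delay: Rwire * (Cwire + Cgate)."""
    return r_wire * (c_wire + c_gateload)


if __name__ == "__main__":
    # Hypothetical stage: 1 V supply, 1 fF per capacitance, 100 uA drive, 100 ohm wire.
    t1 = drive_delay(1.0, 1e-15, 1e-15, 1e-15, 100e-6)
    t2 = wire_rc_delay(100.0, 1e-15, 1e-15)
    print(f"drive delay = {t1 * 1e12:.2f} ps, wire RC delay = {t2 * 1e12:.2f} ps")
```

For short local wires like this one, the drive-limited term dominates; the RC term matters for long wires, which is where repeaters come in.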
Power Calculation
$P_{TOT} = P_{DYN} + P_{subVT} + P_{OX} + P_{B2B}$
$P_{DYN} = \dfrac{\lambda}{\tau_D}\, N_{CKT}\, \tfrac{1}{2}\, C\, (V_H - V_L)\, V_{DD}$
$P_{subVT} = 1.7\, N_{CKT}\, V_{DD}\, I_{off}(V_T, V_{DD}, t_{ox}, \ldots, L_G, W_L)$
$P_{OX} = A_{core}\, D_{ox}(W_L)\, V_{DD}\, J_{ox}(V_T, V_{DD}, t_{ox}, \ldots)$
$P_{B2B} = \tfrac{1}{3}\, A_{core}\, D_{ox}(W_L)\, V_{DD}\, J_{B2B}(F_{Max}, V_{DD})$
$LTR = \dfrac{\lambda}{\tau_D}\, N_{CKT}$
Note that cross-through power is not included.
The powers are computed separately for logic and for repeaters.
$\tau_D$ = mean delay for a single loaded logic gate
$\lambda$ is activity factor divided by logic depth. Usually ~0.012 in recent optimizations.
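A short sketch of the power bookkeeping: dynamic power is the logic transition rate times the energy per transition, $\tfrac{1}{2} C (V_H - V_L) V_{DD}$, and the total adds the three static terms. The static powers are passed in as plain numbers here because their current-density models are not reproduced; all numerical inputs are hypothetical.

```python
def dynamic_power(ltr: float, c_load: float, v_high: float,
                  v_low: float, vdd: float) -> float:
    """P_DYN = LTR * (1/2) * C * (V_H - V_L) * V_DD."""
    return ltr * 0.5 * c_load * (v_high - v_low) * vdd


def total_power(p_dyn: float, p_subvt: float, p_ox: float, p_b2b: float) -> float:
    """P_TOT = P_DYN + P_subVT + P_OX + P_B2B."""
    return p_dyn + p_subvt + p_ox + p_b2b


if __name__ == "__main__":
    # Hypothetical: 6e15 transitions/s, 2 fF load, full 1 V swing.
    p_dyn = dynamic_power(6e15, 2e-15, 1.0, 0.0, 1.0)
    p_tot = total_power(p_dyn, p_subvt=2.0, p_ox=0.5, p_b2b=0.1)
    print(f"P_DYN = {p_dyn:.2f} W, P_TOT = {p_tot:.2f} W, "
          f"active fraction = {p_dyn / p_tot:.0%}")
```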
Communication and Wiring Models

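Assume wire lengths distributed according to Rent's rule.
Units are gate pitches.
r = Rent exponent, 0.6, here.
[Equations: the wire-length distribution $i_{net}(l)$, the repeater threshold length $l_R$, the unrepeated wire length $L_{noRptr}$, and the repeater count $N_{Rptr}$, expressed through integrals of $i_{net}(l)\,dl$ and $l\, i_{net}(l)\,dl$ up to $l_{Max}$, with prefactors involving $FO$, $r$, and $N_{CKT}$.]
[Figure: log(number of wires) vs. log(length).]
[Figure, from optimizations: # Wiring Levels Required (4 to 12) vs. Number of Logic Gates (1E+5 to 1E+8).]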
Repeater Model
Long wires receive repeaters with a spacing that is optimized.
Long wire delay can be absorbed into pipeline depth, but the latency causes inefficiency, so we use a latency penalty factor.
[Figures: Repeater Spacing (um), Repeater Width (um), and Latency Penalty Factor; Repeater Spacing (cm) vs. CPU Core Power (W) for curves labeled 9S to 13S; Pecon = 10 W/cm2.]
Local Variation Modeling
Variation sources:
Signal Coupling noise

Supply noise
Statistical doping variations
LER gate length variations
Consequences modeled:
Increased static power
combine 1 sigma of doping, length, and noise
Critical path delay distribution

yield-based, using estimated critical path
distribution,
and 1 sigma of doping and length, and worst case
noise.
Single stage functionality

use worst case (~6 sigma) of doping and length,
no noise.
Accounting for variations
A complete most-probable worst-case-vector methodology is

used to handle both local and global variability.
Two variable example:
[Figure: worst-case vectors in a two-variable space with 1, 2, and 3 sigma contours, showing the MPWC vector and the Murphy vector.]
Blue curves are contours of constant probability.
Red curves are contours of function to be minimized.
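The difference between the two vectors can be sketched for the simplest case. This example assumes independent Gaussian variables and a function that is linear in them (so its contours are straight lines); the actual methodology may handle nonlinear functions and correlations. Under those assumptions the most probable point on a k-sigma probability contour that shifts the function the most is proportional to $\sigma_i^2 g_i$, while the Murphy vector pushes every variable to its own k-sigma extreme.

```python
import math


def mpwc_vector(gradient, sigmas, k_sigma: float):
    """Most-probable worst-case point on the k-sigma ellipse for a linear function.

    Maximizes sum(g_i * x_i) subject to sum((x_i / sigma_i)**2) == k_sigma**2.
    """
    norm = math.sqrt(sum((s * g) ** 2 for g, s in zip(gradient, sigmas)))
    return [k_sigma * s * s * g / norm for g, s in zip(gradient, sigmas)]


def murphy_vector(gradient, sigmas, k_sigma: float):
    """Every variable at its own k-sigma extreme, in the harmful direction."""
    return [k_sigma * s * math.copysign(1.0, g) for g, s in zip(gradient, sigmas)]


def linear_shift(gradient, vector) -> float:
    """Change in the linearized function caused by the given variable shifts."""
    return sum(g * x for g, x in zip(gradient, vector))


if __name__ == "__main__":
    grad, sig, k = [1.0, 1.0], [1.0, 1.0], 3.0   # hypothetical sensitivities
    print("MPWC shift:  ", round(linear_shift(grad, mpwc_vector(grad, sig, k)), 3))
    print("Murphy shift:", round(linear_shift(grad, murphy_vector(grad, sig, k)), 3))
```

The Murphy vector always predicts a shift at least as large as the MPWC vector, which is why designing to it is overly pessimistic when several independent variations are combined.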
Power vs Frequency Variation Windows

Optimizations for high-perf processor, 45 nm node technology, PDSOI
Power-constrained optimization parameters: VDD, VTn, VTp, LG, repeater size and spacing.
Area is also constrained, to 5.6 cm2, by adjusting the widths.
[Figure: Chip Power (W) vs. Clock Frequency (GHz), with variation windows labeled VDD+5%, Nominal, and VDD-10%, and a 230 W marker. Nominal design point: VDD=1.058 V. This is the point that is calculated, optimized to, and reported.]
These boxes are for 0.675 sigma, for 50% yield.
Power and delay are treated as independent: 25% of yield is lost for each.
The data shows 10% frequency variation per sigma.
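The 0.675 sigma figure can be checked against the normal distribution: a one-sided cut at 0.675 sigma removes about 25% of parts. The sketch below assumes Gaussian statistics; it shows both the simple subtraction of two 25% losses, which gives the 50% yield quoted above, and the product of two independent pass probabilities, which would give a somewhat higher number.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def one_sided_loss(k_sigma: float) -> float:
    """Fraction of parts beyond a one-sided cut at k_sigma."""
    return 1.0 - normal_cdf(k_sigma)


if __name__ == "__main__":
    loss = one_sided_loss(0.675)
    print(f"loss per criterion        = {loss:.3f}")
    print(f"yield, losses subtracted  = {1.0 - 2.0 * loss:.3f}")
    print(f"yield, independent passes = {(1.0 - loss) ** 2:.3f}")
    # With 10% frequency variation per sigma, the window edge sits at:
    print(f"frequency margin          = {0.675 * 10.0:.2f}%")
```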
Impact of variability on performance

Atomistic effects are leading to greater device variability.
Increasing variability requires larger design margins.
Designing for larger margins decreases performance.
Increased variability requires:
Higher supply voltages
Less scaled FETs
[Figure: Relative Performance (0.7 to 1.3) vs. Relative Margin (0% to 200%) for P=0.01, P=1, and P=50. 65nm node, dual processor core.]
Summary: Models, Assumptions and Approximations
Power modeling
Dynamic switching energy plus static power mechanisms
including sub-threshold current, gate oxide tunneling, and
body-to-drain band-to-band tunneling.
Device modeling
Bulk MOSFETs: VT, and depletion depth determined by the
halo doping, 2D effects are taken into account.
Gate length is fully optimized, not set by the technology
node.
Circuit modeling
Delay is for FI=FO=2 or 3 NAND gates, based on model
from J.C. Eble's thesis [Ga.Tech. '98].
Capacitance includes gate, parasitic, and wire parts (Rent's
rule).
Wire resistance includes temperature dependence and
surface scattering in small wires.
Summary: Models, Assumptions and Approximations (continued)
Chip-level modeling
Allocate fixed fraction of chip power and area to logic, and
assume fixed number of logic gates. Logic part is
optimized, and the rest is assumed to scale similarly.
Assume multiple processor cores are interconnected in a
way that does not greatly add to the wiring burden.
Long wires are fatter, and receive repeaters with a spacing
that is optimized.
Long wire delay is accounted for using a latency penalty
factor.
On-chip tolerance/variability and noise are accounted for.
3. Optimization Results
General results
Evaluating specific possible device
directions
Increasing mobility
High-k gate dielectric and metal gates
3D stacking
Better heat sinks
Sub-ambient cooling
Multi-processor tradeoffs
Optimize by technology node

For each node, pre-specify
the following parameters:
Wire half-pitch,
gate overlap,
halo scalelength,
contact resistance,
LER sigma,
ACLV,
mobility,
gate depletion,
k_wire,
k_gate
Optimizations over 7 variables:

tox, Lg, ND, <w>, Vdd, Srpt, <wrpt>
Dual core processor with

aggressive air cooling
Note that the LG, tox, VDD, VT, width, etc. are NOT preselected.
They are solved for by the optimizations.
Gate Length vs Power
[Figure: Gate Length (nm, 0 to 120) vs. Total Chip Power (W, 0.01 to 100) for the 90 nm, 65 nm, 45 nm, and 32 nm nodes. Dual core processor with aggressive air cooling.]
Oxide Thickness vs Power
[Figure: Equiv. Oxynitride Thickness (nm, 0.5 to 1.6) vs. chip power (0.01 to 100) for the 90 nm, 65 nm, 45 nm, and 32 nm nodes with oxynitride, and for 32nm with high-k.]
(High-k case assumes 0.3nm barrier layer, bandedge metal gate, HfO2-like insulator characteristics.)
Voltages vs Power
[Figure: Supply and Threshold Voltage (V, 0.1 to 0.9) vs. chip power (0.01 to 100), showing Vdd and VT for the 90, 65, 45, and 32 nm nodes. Dual core processor with aggressive air cooling.]
Supply voltages are lower for low power applications.
High-k lowers VDD ~ 15% at the 45nm generation.
Optimal Power Allocation Fractions

[Figure: Power Allocation (0% to 100%) vs. Chip Power (W, 1 to 300), split into Dyn. pwr, SubVT pwr, and Oxide pwr for logic and for rptrs. 45nm technology with microchannel heat sink and water cooling. 4 core chip.]
Active power fraction: 70% at low power to 40% at high power.
Mobility dependence
Enhanced mobility has greatest benefit at high power.
Even for large mobility enhancements, performance boost is
modest: 10-15%.
[Figure: relative performance vs. Mobility Enhancement Factor (1 to 2.5) for 1 W, 10 W, and 100 W chips. 45nm technology, dual core processor, water cooled.]
[Figure: relative performance vs. Mobility Enhancement Factor (1 to 3) for 1W, 10W, and 100W chips. 32nm technology, 8 core processor, air cooled.]
Metal-gate workfunction for high-k and oxynitride
[Figure: Performance relative to poly-Si (0.5 to 1.4) vs. Workfunction offset from bandedge (eV, 0.1 to 0.5) for 5W and 50W chips with oxynitride and with high-k. 45nm node, dual core processor with aggressive air cooling.]
3D stacking
Multiple layers offer higher performance
due to shorter wires.
[Figures, RED = 1 Layer, GREEN = 2 Layers, each vs. Chip Power (W, 10 to 100): Mean Wire Length (um); Chip Area (cm2), showing Tot Si area and Footprint; Mean FET Width (nm).]
(4 core, 45nm node, water cooling.)
Cooling scenario optimizations
[Figure: Performance (arb units) vs. power (1 to 10000, log scale) for -40C Liquid, 18C Water, Hi-Perf. Air, and Low-Cost Air cooling. 4 core processor design, 45nm technology.]
[Figure: the same comparison for an 8 core processor design, 32nm technology.]
Forced liquid cooling through microchannel fins may permit very high power densities.
Optimized (maximum) performance increases as the ~log of the power.
Optimized over 7 variables: Lg, tox, Nd, <w>, Drptr, <wrptr>, Vdd.
Low temperature case does not include refrigerator power.
Multiprocessor motivation
The energy / performance tradeoff is very steep at the high end.
Lower power, more parallel processors potentially offer more
computation for the same total power level.
[Figure: Loaded Switching Energy (fJ) vs. Total Logic Transitions / sec (1E+12 to 1E+16) for 4 core processors, with 3x steps marked.]
4-processor chips with micro-channel water cooling, optimizing everything.
9 variables: tox, Lg, ND, <w>, Vdd, wHP, Srpt, <wrpt>, xhalo
Dependence on number of cores

Constant total number of transistors, divided equally among n cores:
[Figure: log-log plot comparing 2 cores, 4 cores, 8 cores, and 16 cores.]
4. Future Projections (22 to 11nm)
Device options
General results
Technology projections
Beyond 11nm?
Device Options
PDSOI
IBM's best understood technology.
FinFET
Improved electrostatic control of channel offers shorter gates, lower voltages, higher speed.
Workfunction VT control. Not entirely planar.
ETSOI
Somewhat improved electrostatic control compared to PDSOI. More compatible with conventional planar processing. Workfunction VT control.
Shallow Bulk
Shallow junctions, raised S/D.
The ETSOI and FinFET devices simulated using an in-house scaling model.
Comparable source/drain resistance and parasitic capacitance models were implemented for ETSOI, FinFET, and for shallow bulk MOSFETs.
[Figures: cross-sections of the FinFET (Gate, Source, Drain), ETSOI (Gate, Source, Drain, Buried Oxide, Substrate), and Bulk (Gate, Source, Drain) devices.]
General optimization results: performance vs power
[Figure: Frequency (Hz, 1.E+07 to 1.E+10) vs. Total Chip Power (W, 0.01 to 1000) for 22nm, 16nm, and 11nm.]
As always, the easiest way to increase performance is to increase the power.
Conditions: PDSOI, 4 core processor chip, constraining total chip power

Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths,
repeater size and spacing.
Gate Length and Chip Area vs Power

Lower power requires longer gate lengths, to reduce variability.
Device density increases for each generation. Shrinking area increases power density, too.
[Figure: Gate Length (nm, 0 to 50) vs. Total Chip Power (W, 0.01 to 1000) for 22nm, 16nm, and 11nm. Toxeqv: 0.6 - 0.82 nm.]
[Figure: Chip Area (cm2, 0 to 0.8) vs. Total Chip Power (W, 0.01 to 1000) for 22nm, 16nm, and 11nm.]
Voltage and Energy vs Power

Optimal supply voltage can become quite low for low power constraints, leading to very
low energy use per logic transition.
[Figure: Supply Voltage (V, 0 to 1.4) vs. Total Chip Power (W, 0.01 to 1000) for 22nm, 16nm, and 11nm.]
[Figure: Energy per logic transition (fJ, 0 to 16) vs. Total Chip Power (W, 0.01 to 1000) for 22nm, 16nm, and 11nm.]
Energy vs performance trade-off

[Figure: energy (0.1 to 10) vs. Clock Frequency (Hz, 1.E+07 to 1.E+10) for 22nm, 16nm, and 11nm, with 10x and 4x spans marked.]
Optimal On/Off Ratio

PDSOI nFETs, currents measured at nominal process and bias conditions
[Figure: Off-current (A/cm, 0.00001 to 0.1) vs. On-current (A/cm, 0.1 to 100) for 22 nm, 16 nm, and 11 nm, at high bias conditions and low bias conditions, with 1000:1 and 10000:1 on/off ratio lines.]
Performance at constant power density (PDSOI)

Performance increases at 32nm due to hi-k introduction, but then falls as strain diminishes and gate dielectric does not scale further.
[Figures, each vs. Technology Node (45nm, 32nm, 22nm, 16nm, 11nm) at 10 W/cm2, 25 W/cm2, and 50 W/cm2: Frequency (GHz); Supply Voltage (V); Chip Area (cm2).]
Conditions: PDSOI, 4 core processor chip, constraining total chip power density
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing.
Performance at constant power density

comparing technologies
ETSOI and FinFET offer moderate performance advantage over PDSOI for 22nm node
and beyond. The industry should transition to FinFET at 16nm to avoid performance loss.
[Figure: Frequency (GHz, 0 to 4.5) vs. Technology Node (45nm, 32nm, 22nm, 16nm, 11nm) for Bulk, PDSOI, FinFET, and ETSOI at 25 W/cm2.]
Conditions: 4 core processor chip, constraining total chip power at 25 W/cm2
Optimizing: VDD, tox, dopings (for VTs), LG, p:n width ratio, mean widths, repeater size and spacing, fin height, sidewall thickness (Fin), Si thickness (ET).
Power density at constant performance

Required power density drops at 32nm due to transition to hi-k, then rises due to decreasing strain and lack of scaling of gate insulator.
Transition to FinFET at 16nm to obtain continued improvement.
[Figures, each vs. Technology node (45nm, 32nm, 22nm, 16nm, 11nm) for PDSOI and FinFET at performance levels 1e15 (~2.6 GHz) and 1.5e15 (~3.9 GHz): Power Density (W/cm2); Area (cm2).]
Conditions: PDSOI and FinFET, 4 core processor chip, constraining total chip performance
size and spacing, fin height, sidewall thickness
Performance at constant area and power

Area = 4 cm2, power = 100 W, fixed.
Number of cores is adjusted to maintain constant area. Chip performance is assumed linear with the number of cores.
Maximizing total chip performance.
[Figures, each vs. Technology node (45nm, 32nm, 22nm, 16nm, 11nm) for PDSOI and FinFET at 25 W/cm2 and 50 W/cm2: total chip performance; Number of Cores.]
Conditions: PDSOI and FinFET, variable # core processor chip, constraining both chip power and chip area (4 cm2).
size and spacing, fin height, sidewall thickness, and number of cores.
Supply voltage considerations

VDD can be reduced by higher activity, higher mobility, and tighter tolerances.
[Figure: curves vs. Supply Voltage (V, 0.4 to 0.9) at 25 W/cm2, optimizing everything except VDD: Vary Vdd, Vary actf (activity factor, 0.7X and 1.4X), Vary Tol (0.7X variability), and Vary mu (1.4X mobility). Annotations: 18% and 6%.]
Conditions: PDSOI, 4 core processor chip, constraining total chip power density
size and spacing.
Optimizations vs gate length for FinFETs and ETSOI

Everything except gate length is optimized. The gate length is scanned.
25 W/cm2 power density constraints. 11 nm node.
Shorter gate lengths necessitate:
Higher DIBL
Thinner tSi
Higher VDD
Ultimately, lower performance
[Figures, each vs. Gate Length (nm, 20 to 60) for ETSOI and FinFET: DIBL (V); Silicon Thickness (nm); Supply Voltage, Vdd (V); Rel. Performance.]
DIBL Scaling enables VDD Scaling
[Figures for Bulk, FinFET (Fin), and ETSOI (ET) at the 22nm, 16nm, and 11nm nodes, for the 25W/cm2 and 1W cases: Supply Voltage, VDD (V) vs. DIBL (V/V); Gate Length (nm) vs. Technology Node.]
Novel devices for 11nm and beyond
III-V FinFETs
Higher mobility improves drive current
Tunnel FETs
Improved subthreshold slope enables low VDD and low energy
operation.
To properly model this device, have to be able to calculate the tunneling
barrier shapes and band-edge alignments.
We are in the process of developing a compact model for TFETs for the
optimizer, but results are not yet available. As an interim measure, we
can alter the Boltzmann constant in the conventional FET model, to see
the impact of steeper subthreshold slope.
Carbon Nanotube Transistors

Ballistic current flow in the channel should enable very high switching
speeds for these devices.
A compact model for CNTs suitable for the optimizer is being presented
at IEDM this year, but results are not available yet.
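The interim TFET measure described above (altering the Boltzmann constant in the conventional FET model) can be sketched as a scale factor on kT. This is only an illustration of the idea; the scale factor values and the six-decade on/off target are hypothetical, and a real TFET has a voltage-dependent slope that this trick does not capture.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C


def swing_mv_per_decade(temp_k: float, k_scale: float = 1.0) -> float:
    """Subthreshold swing ln(10) * (k_scale * k) * T / e, in mV per decade."""
    return 1000.0 * math.log(10.0) * k_scale * K_B * temp_k / Q_E


def gate_swing_for_decades(decades: float, temp_k: float, k_scale: float = 1.0) -> float:
    """Gate voltage (V) needed to change the subthreshold current by N decades."""
    return decades * swing_mv_per_decade(temp_k, k_scale) / 1000.0


if __name__ == "__main__":
    for scale in (1.0, 0.5, 0.25):   # hypothetical effective-k scale factors
        s = swing_mv_per_decade(300.0, scale)
        v = gate_swing_for_decades(6.0, 300.0, scale)
        print(f"k_scale = {scale:4.2f}  swing = {s:5.1f} mV/dec  6 decades need {v:.3f} V")
```

Halving the effective constant halves the voltage needed for a given on/off ratio, which is what opens the door to lower VDD.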
Comparing devices: Energy vs Performance
[Figure: energy (0.01 to 10) vs. Frequency (Hz, 1.E+06 to 1.E+11) for PDSOI, FinFET, TFET, and CNFET. General trends, not exact results.]
Summary
CMOS scaling is limited by electrostatic, quantum mechanical,

discreteness, thermodynamic and practical effects.
Optimization can and should be used to find the best design points
in the midst of these various constraints.
Example: low power needs somewhat less scaled devices.
Technology performance projections based on optimization for

PDSOI, FinFETs, ETSOI, and Bulk MOSFETs show:
Density improvements should continue, as long as wiring density
continues to improve
Performance improvements are likely to be rather modest, even with a
switch to FinFETs for 16 and/or 11nm nodes.
Exploratory devices (CNTs and/or TFETs) may offer substantial

performance advantages, someday.
Acknowledgements
Wilfried Haensch
Leland Chang
Paul Solomon
Steve Koester
Lan Wei
Philip Wong
Ghavam Shahidi
Mike Scheuermann
Phillip Restle
Omer Dokumaci
Mary Wisniewski
Steve Kosonocky
Yuan Taur
Bob Dennard
Extra slides
Generalized heat sink model
Two level heat flow model:

Flow in the silicon wafer
Flow in the heat sink material
In each layer, the flow can be:

3D (spherical) for spots smaller
than thickness
2D (cylindrical) at distances
larger than the thickness
In silicon layer, inhomogeneous

power dissipation is accounted
for, to estimate maximum junction
temperature at hottest point.
Comparison of simplified analytic model with detailed numerical model.
[Figure: Temperature Rise (K) vs. Hot spot size (cm). This model is red. FE model data is blue.]
[Figure: layer stack: Heat sources, Si wafer, Interface, Heat spreader (e.g., SiC or Cu), Interface to final coolant (e.g., air or water).]
Si thermal sheet resistance of Si wafer
HS thermal sheet resistance of heat sink
RSi thermal contact resistance of Si wafer
RHS thermal contact resistance of heat sink
III-V FinFETs
III-V FinFETs are modeled by increasing the mobility in the conventional model.
Increased mobility enables an improved energy/performance tradeoff by
reducing the voltage needed for high performance designs.
[Figure: Energy/transition (fJ) vs. Performance (transitions/sec, 1.E+13 to 1.E+16) for FinFET, III-V 2x, and III-V 4x.]
[Figure: Supply Voltage (V, 0 to 1.4) vs. Frequency (GHz) for Fin, 2x, and 4x.]
Drive current multiplier: 1x (=Si), 2x (~GaAs), 4x (~InGaAs)
Beating the sub-threshold slope limit

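[Equations: tunneling probability $T$ as an exponential function of the band gap $E_G$ and the tunneling length $l$; subthreshold swing $S \sim \dfrac{V_{gs}^2}{2V_{gs} + D(V_{gs}, V_{ds})}$. Krishna K. Bhuwalka et al. 2005]
[Figure: band diagram (Ec, Ev) of a gated tunnel FET with P++ source and n++ drain; Log Ids vs. Vgs, showing the 60mV/dec line and the high Vth, low power space.]
Tunnel FETs show strong voltage dependence of sub-threshold slope
On-current not yet on par with conventional high performance FETs at comparable voltages
[Haensch]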
TFET Heterostructures
Planar SiGe H-TFET
SiGe heterostructures offer low effective bandgap, which improves tunneling.
Integration onto silicon possible
III-V H-TFET
III-V materials offer lower effective masses and more heterojunctions, to further improve tunneling.
New material combinations
Nanowire Geometry Offers:
Optimum electrostatics with gate all around
Scaling to quantum capacitance limit
Improved Ion
[Figures: planar SiGe H-TFET cross-section (poly gate, p++ SiGe source, drain, buried oxide); Si/SiGe band diagrams in the off-state (Vg = 0) and the on-state (Vg > 0, high on current), with the ambipolar device suppressed.]
[Koester, Riel, Koswatta]
CNFET: Good or Bad?
Carbon Nanotube Field Effect Transistor (CNFET)
1D devices
Better transport characteristics
Worse parasitics
Leakage, Variations, etc
The judgment highly depends on the application
Our approach is to build an optimizer for a full technology, with proper consideration of device properties and system needs.
Device modeling
Circuit performance benchmarking
System application consideration
[Lan Wei, Stanford]
Wire Model
Assumed constant 2:1 height to width wiring with equal lines and spaces. ( 0.062 kBEOL fF/um )
$\rho(W,T) = \rho_{300K}\left[\dfrac{70\,\mathrm{nm}}{W} + \dfrac{\ln\left(1 + e^{(T-40)/10}\right)}{26}\right]$
Consequences of wire resistance model:
[Figure: Relative Performance change due to R(T) (0.5 to 2.5) vs. Junction Temperature (K, 50 to 400) for 10W, 1W, and 0.1W, each with and without constant R.]
[Figure: Performance loss due to scattering: LTR with Scatt / LTR w/o Scatt. (0.85 to 1.00) vs. Total Power (W, 0.001 to 10).]
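A small numerical sketch of the wire resistivity model, under the assumption that it reads as a size-dependent scattering term, 70 nm / W, plus a temperature term, ln(1 + e^((T-40)/10)) / 26, both relative to the 300 K bulk value. That reading is an interpretation of the slide's formula, and the wire widths below are hypothetical.

```python
import math


def relative_resistivity(width_nm: float, temp_k: float) -> float:
    """Resistivity relative to the 300 K bulk value (assumed form of the model)."""
    scattering = 70.0 / width_nm                                       # surface scattering
    thermal = math.log1p(math.exp((temp_k - 40.0) / 10.0)) / 26.0     # equals ~1 at 300 K
    return scattering + thermal


if __name__ == "__main__":
    for width in (1000.0, 140.0, 70.0, 35.0):       # hypothetical wire widths in nm
        for temp in (300.0, 200.0, 77.0):
            print(f"W = {width:6.0f} nm  T = {temp:5.1f} K  "
                  f"rho/rho_300K = {relative_resistivity(width, temp):.2f}")
```

In this form, cooling helps wide wires a great deal but narrow wires much less, because the scattering term does not fall with temperature.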
