Design Tradeoffs
6.004x Computation Structures
Part 1 Digital Circuits
Copyright 2015 MIT EECS
Optimization metrics: area vs. time.
Advanced Micro Devices (with permission)
Tunneling current through the gate oxide: SiO2 is a very good insulator, but when very thin (< 20 Å) electrons can tunnel across.

[Diagram: FinFET transistor structure]
[CMOS inverter: input V_IN, output V_OUT driving a load capacitance C; the output switches once per clock period t_CLK = 1/f_CLK, so C discharges and then recharges each cycle.]

Energy is the integral of power, E = ∫ P(t) dt, and the capacitor current is I = C dV_OUT/dt.

Energy dissipated to discharge C through the NFET (first half-cycle), times f_CLK to get average power:

P_NFET = f_CLK ∫_0^{t_CLK/2} i_NFET(t) V_OUT dt
       = f_CLK C ∫_0^{V_DD} V_OUT dV_OUT
       = f_CLK C V_DD^2 / 2

Likewise for recharging C through the PFET (second half-cycle):

P_PFET = f_CLK ∫_{t_CLK/2}^{t_CLK} i_PFET(t) (V_DD − V_OUT) dt
       = f_CLK C ∫_0^{V_DD} (V_DD − V_OUT) d(V_DD − V_OUT)
       = f_CLK C V_DD^2 / 2
Power dissipated = f C V_DD^2 per node = f N C V_DD^2 per chip, where
f = frequency of charge/discharge
N = number of changing nodes/chip
Back of the envelope:
f ~ 1 GHz = 1e9 cycles/sec
N ~ 1e8 changing nodes/cycle
C ~ 1 fF = 1e-15 farads/node
V ~ 1 V
P = f N C V^2 ≈ (1e9)(1e8)(1e-15)(1)^2 = 100 Watts
L08: Design Tradeoffs, Slide #5
What if we could eliminate unnecessary transitions? When the output of a CMOS gate doesn't change, the gate doesn't dissipate much power!

[ALU datapath: A[31:0] and B[31:0] feed four units: add (ALUFN[0]), boole (ALUFN[3:0]), shift (ALUFN[1:0]), and cmp (ALUFN[2:1], flags Z, V, N); ALUFN[5:4] selects which unit drives ALU[31:0].]

Fix: insert D latches (D-Q cells with enable G) on the inputs to each unit, opened only when ALUFN[5:4] selects that unit (e.g., ALUFN[5:4] == 10 for shift, ALUFN[5:4] == 11 for boole). Signals in the shifter's region then make transitions only when the ALU is doing a shift operation.

Variations: Dynamically adjust t_CLK or V_DD (either overall or in specific regions) to accommodate workload.

Must computation consume energy? See §6.5 of the Notes.
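To see why gating helps, here is a toy Python sketch (not the slide's circuit; the input values and the `selected` pattern are made up) that counts transitions on a unit's inputs with and without a latch holding them:

```python
# Count how often a signal changes value from one cycle to the next.
def count_transitions(values):
    return sum(a != b for a, b in zip(values, values[1:]))

inputs   = [3, 7, 7, 1, 4, 4, 9]   # value arriving at the unit each cycle
selected = [1, 0, 0, 0, 1, 0, 1]   # 1 = this unit's opcode is selected

ungated = inputs                   # no latch: every change propagates
latched, held = [], inputs[0]
for v, sel in zip(inputs, selected):
    if sel:
        held = v                   # latch open: pass the new input through
    latched.append(held)           # latch closed: hold old value, no toggling

print(count_transitions(ungated), count_transitions(latched))  # 4 2
```

Fewer transitions downstream means less f C V^2 dissipation in that region.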
[Ripple-carry adder: a chain of N full-adder (FA) cells; stage i adds A_i, B_i, and the carry from stage i−1, producing S_i and a carry-out that ripples onward. Worst-case path: from the low-order carry-in through every FA to S_{N−1}.]

Θ(N) is read "order N" and tells us that the latency of our adder grows in proportion to the number of bits in the operands.
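The ripple-carry structure can be sketched in Python (an illustrative bit-level model, not a netlist); note how the carry threads through all N stages:

```python
# One full-adder cell: sum and carry-out from two bits and a carry-in.
def full_adder(a, b, ci):
    s = a ^ b ^ ci
    co = (a & b) | (ci & (a ^ b))   # CO = G + P*CI (P taken as XOR here)
    return s, co

def ripple_add(a_bits, b_bits, cin=0):
    """a_bits, b_bits: lists of bits, LSB first. Returns (sum_bits, carry_out)."""
    s_bits, c = [], cin
    for a, b in zip(a_bits, b_bits):   # carry chain: N sequential stages
        s, c = full_adder(a, b, c)
        s_bits.append(s)
    return s_bits, c

# 6 + 3 = 9; bits are LSB first.
print(ripple_add([0, 1, 1, 0], [1, 1, 0, 0]))  # ([1, 0, 0, 1], 0)
```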
Performance/Cost Analysis

"Order of" notation: g(n) = Θ(f(n)) is a two-sided bound; g(n) = O(f(n)) is the corresponding upper bound.

Example:
n^2 + 2n + 3 = Θ(n^2)
since
n^2 < n^2 + 2n + 3 < 2n^2
almost always (i.e., for all sufficiently large n)
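A quick numeric check of the claimed bounds (here "almost always" means n ≥ 3):

```python
# Verify n^2 <= n^2 + 2n + 3 <= 2*n^2 over a range of n.
ok = all(n*n <= n*n + 2*n + 3 <= 2*n*n for n in range(3, 1000))
print(ok)  # True
```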
[Carry-select adder: a 16-bit adder computes S[15:0] from A[15:0] and B[15:0]; two more 16-bit adders compute A[31:16] + B[31:16] in parallel, one with carry-in 0 and one with carry-in 1, producing two candidate S[31:16] values.]

Once the low half computes the actual value of the carry-in to the high half, use it to select the correct version of the high-half addition.

[The idea applies recursively with unequal slice widths, e.g. A[4:0]+B[4:0] → S[4:0], A[11:5]+B[11:5] → S[11:5], A[20:12]+B[20:12] → S[20:12], and the top slice → COUT, S[31:21].]

Select input is heavily loaded, so buffer for speed.
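The select trick can be sketched behaviorally in Python (illustrative; `half = 16` matches the slide's 32-bit split, and the function names are mine):

```python
def add_bits(a, b, cin, width):
    """Add two width-bit integers; return (sum mod 2**width, carry_out)."""
    t = a + b + cin
    return t & ((1 << width) - 1), t >> width

def carry_select_add(a, b, width=32, half=16):
    lo_a, hi_a = a & ((1 << half) - 1), a >> half
    lo_b, hi_b = b & ((1 << half) - 1), b >> half
    lo_sum, lo_carry = add_bits(lo_a, lo_b, 0, half)
    hi_if_0 = add_bits(hi_a, hi_b, 0, width - half)  # speculative: carry-in 0
    hi_if_1 = add_bits(hi_a, hi_b, 1, width - half)  # speculative: carry-in 1
    hi_sum, cout = hi_if_1 if lo_carry else hi_if_0  # the "select"
    return (hi_sum << half) | lo_sum, cout

print(carry_select_add(0x0000FFFF, 0x00000001))  # (65536, 0)
```

Both high-half candidates are computed in parallel with the low half, so the high half adds no latency beyond the final mux.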
Carry out of a full adder: CO = G + P·CI, where G = A·B (generate) and P = A + B (propagate). CO logic using only 3 NAND2 gates! ("Think I'll borrow that for my FA circuit!")

Adjacent blocks combine: for a high half H and low half L,
G_HL = G_H + P_H·G_L
P_HL = P_H·P_L

[P/G generation: each FA cell also outputs its G and P; pairs of (G, P) feed GP blocks, the 1st level of lookahead, producing (G_HL, P_HL).]
[Lookahead tree (8 bits shown): FA cells for bits 7..0 produce (G_i, P_i); first-level GP blocks combine pairs into (G7-6, P7-6), (G5-4, P5-4), (G3-2, P3-2), (G1-0, P1-0); a second level produces (G7-4, P7-4) and (G3-0, P3-0); the root produces (G7-0, P7-0). Tree depth is Θ(log N).]
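The GP combining rule can be exercised directly (a sketch; the function names are mine, the formulas are the slide's):

```python
def gp(a, b):
    """Per-bit generate/propagate: G = A AND B, P = A OR B."""
    return a & b, a | b

def gp_combine(high, low):
    """Combine (G, P) of a high block with the adjacent low block:
       G_HL = G_H + P_H*G_L,  P_HL = P_H*P_L."""
    gh, ph = high
    gl, pl = low
    return gh | (ph & gl), ph & pl

# Carry out of bits 1..0 for A = 0b01, B = 0b11 with cin = 0:
g1p1 = gp(0, 1)                    # bit 1: A1=0, B1=1 -> (0, 1)
g0p0 = gp(1, 1)                    # bit 0: A0=1, B0=1 -> (1, 1)
g10, p10 = gp_combine(g1p1, g0p0)
cin = 0
c2 = g10 | (p10 & cin)             # C = G + P*cin, the carry into bit 2
print(c2)  # 1  (indeed 1 + 3 = 4 carries out of the low two bits)
```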
[Downward pass: C blocks use the (G, P) values computed on the way up to produce every carry. Each C block takes a block's (G_L, P_L) and an incoming carry c_in and computes the carry into the next block, e.g.
C4 = G3-0 + P3-0·C0
C6 = G5-4 + P5-4·C4
and at the leaves C1 = G0 + P0·C0, etc., starting from C0 = cin at the root.]

Notice that the inputs on the left of each C block are the same as the inputs on the right of each corresponding GP block.
[Putting it together: each FA cell is augmented with G and P outputs, and the GP and C blocks merge into combined GP/C cells (inputs G_H, P_H, G_L, P_L, C_in; outputs G_HL, P_HL, and the carries C_j), so a single tree performs both the upward G/P pass and the downward carry pass.]

tPD = Θ(log N)
Binary Multiplication*

The binary multiplication table:

  × | 0 1
  0 | 0 0
  1 | 0 1

"Hey, that looks like an AND gate!"

[Example: A3 A2 A1 A0 × B3 B2 B1 B0.]

Easy part: forming partial products (just an AND gate since B_i is either 0 or 1)
Hard part: adding M N-bit partial products
Combinational Multiplier

[Array multiplier for 4-bit operands: AND gates form the partial-product bits a_i·b_j; rows of full adders (FA) and half adders (HA) sum the shifted rows, producing z7..z0.]

Latency = Θ(N)
Throughput = Θ(1/N)
Hardware = Θ(N^2)
2's Complement Multiplication

Step 1: in two's complement operands the high-order bit has weight −2^{N−1}, so we must sign-extend the partial products and subtract the last one:

                            X3   X2   X1   X0
                          × Y3   Y2   Y1   Y0
  -------------------------------------------
  X3Y0 X3Y0 X3Y0 X3Y0 X3Y0 X2Y0 X1Y0 X0Y0
+ X3Y1 X3Y1 X3Y1 X3Y1 X2Y1 X1Y1 X0Y1
+ X3Y2 X3Y2 X3Y2 X2Y2 X1Y2 X0Y2
− X3Y3 X3Y3 X2Y3 X1Y3 X0Y3
  -------------------------------------------
    Z7   Z6   Z5   Z4   Z3   Z2   Z1   Z0
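The sign-extend-and-subtract recipe can be checked behaviorally (a sketch; `sign_extend` and the bit-width bookkeeping are my own framing of the slide's table):

```python
def sign_extend(v, width, out_width):
    """Sign-extend a width-bit value to out_width bits."""
    if (v >> (width - 1)) & 1:
        v |= ((1 << out_width) - 1) ^ ((1 << width) - 1)
    return v

def multiply_twos_complement(x, y, n=4):
    """Multiply two n-bit two's-complement values given as raw bit patterns."""
    mask = (1 << 2 * n) - 1
    total = 0
    for j in range(n):
        pp = (x << j) if (y >> j) & 1 else 0       # partial-product row j
        pp = sign_extend(pp, n + j, 2 * n)         # extend X's sign bit
        total = total - pp if j == n - 1 else total + pp  # subtract last row
    return total & mask

# -2 (0b1110) * 3 (0b0011) = -6 -> 0b11111010
print(hex(multiply_twos_complement(0b1110, 0b0011)))  # 0xfa
```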
2's Complement Multiplier

[Adder array as before (a3..a0 × b3..b0), modified to sign-extend the partial products and subtract the last one, producing z7..z0.]

"Gotta break that long carry chain!"
[Multiplier array restructured to break the long carry chain: each row's carries are passed down to the next row instead of rippling along it, with a final row to combine them, producing z7..z0.]

Latency = Θ(N)
Throughput = Θ(1)
Hardware = Θ(N^2)
Sequential Multiplier

[Datapath: an N-bit carry-save adder, a product register P (N+1 bits, sign bit SN), and an M-bit register B holding the multiplier; B's LSB decides whether A is added each step.]

Repeat M times {
  P ← P + (B_LSB == 1 ? A : 0)
  shift SN,P,B right one bit
}
Done: (N+M)-bit result in P,B

Latency = Θ(N)
Throughput = Θ(1/N)
Hardware = Θ(N)
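The Repeat-M-times loop maps directly to Python (an unsigned sketch; the register names P, B, A follow the slide, and the shift moves P's LSB into B's MSB):

```python
def sequential_multiply(a, b, n_bits, m_bits):
    """Multiply unsigned a (n_bits wide) by b (m_bits wide): M add/shift steps."""
    p = 0                                        # product accumulator P
    for _ in range(m_bits):                      # Repeat M times
        if b & 1:                                # B_LSB == 1 ?
            p += a                               # P <- P + A
        b = ((p & 1) << (m_bits - 1)) | (b >> 1) # shift P,B right one bit:
        p >>= 1                                  #   P's LSB becomes B's MSB
    return (p << m_bits) | b                     # (N+M)-bit result in P,B

print(sequential_multiply(13, 11, 4, 4))  # 143
```

One adder reused M times: hardware drops from Θ(N^2) to Θ(N), at the cost of Θ(1/N) throughput.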
Summary

Power dissipation can be controlled by dynamically varying t_CLK or V_DD, or by selectively eliminating unnecessary transitions.

Functions with N inputs have a minimum latency of Ω(log N) if the output depends on all the inputs. But it can take some doing to find an implementation that achieves this bound.

Performing operations in slices is a good way to reduce hardware costs (but latency increases).

Pipelining can increase throughput (but latency increases).

Asymptotic analysis only gets you so far: factors of 10 matter in real life, and typically N isn't a parameter that's changing within a given design.