Pipeline Timings
12th Jan, 2006
Pipelined Processors

Parallel architectures:
- Function-parallel
  - Instruction level (ILP): pipelined processors, VLIW processors, superscalar processors
  - Thread level
  - Process level
- Data-parallel
Intel's terminology: intra ILP, inter ILP

Anshul Kumar, CSE
slide 2
Processor Performance
- MIPS and MFLOPS: may not truly represent performance
- Execution time of a program: the true measure of performance
- SPEC rating: acceptable
slide 3
Pipeline stages: IF | RF | EX/AG | M | WB

Execution time T = N * CPI * t, where
  N   = number of instructions
  CPI = cycles per instruction (average)
  t   = clock cycle time
slide 4
The architecture influences all three factors: N, CPI and t
slide 5
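As a quick sanity check of T = N * CPI * t, a short sketch with illustrative numbers (the values below are made up, not from the slides):

```python
# Execution time T = N * CPI * t.
def exec_time(N, CPI, t):
    return N * CPI * t

# 1M instructions, average CPI of 1.5, 1 ns clock -> 1.5 ms
T = exec_time(N=1_000_000, CPI=1.5, t=1e-9)
print(T)
```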
[Figure: combinational logic (Comb) between registers (Reg), driven by a common Clock; Pmax = maximum propagation delay]
slide 6
Ideal Pipelining
- Total instruction processing time: Tinst
- Divided into S stages
- Clock period t = Tinst / S
- CPI = 1
- Effective time per instruction Teff = CPI * t = Tinst / S
slide 7
Pipelining with Interruptions
- S stages
- Frequency of interruptions: b
- t = Tinst / S
- CPI = 1 + (S - 1) * b
- Teff = CPI * t = [1 + (S - 1) * b] * Tinst / S
slide 8
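The formulas above are easy to tabulate. The sketch below uses Tinst = 90 ns and b = 0.2 (the values of the later example) purely as sample inputs:

```python
# Teff for an S-stage pipeline whose flow is interrupted with frequency b.
def teff(S, Tinst, b):
    t = Tinst / S               # ideal clock period
    cpi = 1 + (S - 1) * b       # each interruption wastes S-1 slots
    return cpi * t              # effective time per instruction

for S in (1, 2, 5, 10):
    print(S, teff(S, Tinst=90, b=0.2))
```

Note that Teff keeps improving with S here only because clocking overhead is not yet modeled.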
[Graph: Teff versus number of stages S, for S = 1 to 10 (Teff axis 0 to 12)]
slide 9
[Figure: registers (Reg) around combinational logic (Comb), driven by Clock]
t = Pmax + c
  Pmax = maximum propagation delay
  c    = clocking overhead
slide 10
Clocking Overhead
- Fixed overhead c: set-up time, output delay
- Variable overhead: clock skew, modeled by a stretching factor k
t = Tinst / S + k * Tinst / S + c
  = (1 + k) * Tinst / S + c
slide 11
Teff = [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c]
slide 12
[Graph: Teff versus number of stages S, for S = 1 to 15 (Teff axis 0 to 15)]
slide 13
Pipeline stages: IF | RF | AG | T | DF | EX | PA
slide 14
Example: stage-wise delays of an instruction
  PC -> MAR       4 ns
  Cache Dir       6 ns
  Cache Data     10 ns
  Data -> IR      3 ns
  Decode          6 + 6 ns
  Gen Addr        9 ns
  Addr -> MAR     3 ns
  Cache Dir       6 ns
  Cache Data     10 ns
  Data -> ALU     3 ns
  Execute         7 + 7 + 8 ns
  Put Away        2 ns
slide 15
Optimal Pipelining
Tinst = 4 + 6 + 10 + 3 + 12 + 9 + 3 + 6 + 10 + 3 + 22 + 2 = 90 ns
b = 0.2
c = 4 ns
k = 5%
slide 16
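Ignoring cycle quantization, the optimum S for these parameters can be found by sweeping the Teff formula directly (a sketch only; the following slides optimize over realizable segmentations instead):

```python
# Teff(S) = [1 + (S-1)*b] * [(1+k)*Tinst/S + c] with the parameters
# above: Tinst = 90 ns, b = 0.2, c = 4 ns, k = 5%.
def teff(S, Tinst=90.0, b=0.2, c=4.0, k=0.05):
    t = (1 + k) * Tinst / S + c
    return (1 + (S - 1) * b) * t

best = min(range(1, 31), key=teff)
print(best, teff(best))   # optimum near S = 10
```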
Example: Tseg = 10 ns
S = 10
t = (1 + k) * Tseg + c = 1.05 * 10 + 4 = 14.5 ns
S * t = 145 ns
(stage delays as listed earlier)
slide 17
Example: Tseg = 13 ns
S = 9
t = (1 + k) * Tseg + c = 1.05 * 13 + 4 = 17.65 ns
S * t ≈ 159 ns
(stage delays as listed earlier)
slide 18
Example: Tseg = 20 ns
S = 5
t = (1 + k) * Tseg + c = 1.05 * 20 + 4 = 25 ns
S * t = 125 ns
(stage delays as listed earlier)
slide 19
Comparison

 S | Tseg (ns) | t (ns) | S * t (ns) | Teff (ns)
 9 |    13     | 17.65  |    159     |   45.89
10 |    10     | 14.50  |    145     |   40.60
 5 |    20     | 25.00  |    125     |   45.00
slide 20
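The table rows can be reproduced from t = (1 + k) * Tseg + c and Teff = [1 + (S - 1) * b] * t:

```python
# Recompute the comparison table (b = 0.2, k = 0.05, c = 4 ns).
def row(S, Tseg, b=0.2, k=0.05, c=4.0):
    t = (1 + k) * Tseg + c              # clock period for this segmentation
    return t, S * t, (1 + (S - 1) * b) * t

for S, Tseg in ((9, 13), (10, 10), (5, 20)):
    t, total, teff = row(S, Tseg)
    print(f"S={S:2d} Tseg={Tseg:2d} t={t:.2f} S*t={total:.0f} Teff={teff:.2f}")
```

S = 10 wins on Teff even though S = 5 has the smallest total latency S * t.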
Cycle Quantization
- Stage delays are not integral multiples of the clock period
- Total overhead = clocking overhead + quantization overhead
- S * t ≥ Tinst + S * c (ignoring k)
- Quantization overhead = S * (t - c) - Tinst
- It reduces as the clock period becomes smaller
slide 21
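For the three segmentations of the example (and ignoring k, as the slide does), the quantization overhead works out as:

```python
# Quantization overhead = S*(t - c) - Tinst, with t = Tseg + c (k ignored).
# Tinst = 90 ns and c = 4 ns are the example's parameters.
def quant_overhead(S, Tseg, Tinst=90.0, c=4.0):
    t = Tseg + c                 # clock period without the k stretch
    return S * (t - c) - Tinst   # = S*Tseg - Tinst

for S, Tseg in ((10, 10), (9, 13), (5, 20)):
    print(S, Tseg, quant_overhead(S, Tseg))
```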
Wave Pipelining
- Omit inter-stage registers
- Reduced clocking overhead
slide 22
Wave Pipeline

Conventional pipeline:
- Registers separate adjoining stages
- Clock period > max propagation delay
- Inter-stage data stored in registers

Wave pipeline:
- No registers between adjoining stages
- Clock period less than max propagation delay
- Waves of data propagate through the combinational network (effectively, data is stored in the combinational circuit delay!)
slide 23
No pipelining
[Timing diagram: Reg X -> combinational logic -> Reg Y; one value of X processed per clock period]
slide 24
Conventional pipelining
[Timing diagram: Reg X -> logic -> Reg Y -> logic -> Reg Z -> logic -> Reg W; intermediate values registered every cycle]

Wave pipelining
[Timing diagram: Reg X -> combinational logic -> Reg W; several waves of data in flight between the registers]
slide 26
Timing
[Figure: Reg -> Comb ckt -> Reg, common Clock; input X, output Y]
T = clock period
p = propagation delay
s = set-up time
Constraint: T ≥ p + s
slide 27
[Figure: same path, but the clock reaches the two registers with skew δ]
Clock skew = δ
Constraint: T ≥ p + s + 2δ
slide 28
slide 29
[Figure: Reg -> Comb ckt -> Reg with minimum and maximum propagation delays pmin and pmax]
Output Y is valid in a window determined by pmin and pmax.
Constraint: T ≥ (pmax - pmin) + s + 4δ
slide 30
Wave pipeline, with Y captured n cycles after launch:
  nT ≥ pmax + s + 2δ          (slowest wave must arrive and set up before capture)
  pmin ≥ (n - 1)T + 2δ        (fastest wave must not overrun the previous capture)
Subtracting the two: T ≥ (pmax - pmin) + s + 4δ
slide 31
Comparison

Conventional pipeline (n stages):
  T ≥ pmax/n + s + 2δ   (plus cycle quantization overhead)
  nT ≥ pmax + ns + 2nδ

Wave pipeline:
  T ≥ (pmax - pmin) + s + 4δ
  nT ≥ pmax + s + 2δ
slide 32
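The two sets of bounds can be compared numerically. The delay and skew values below are illustrative, not from the slides (delta denotes the clock skew δ):

```python
# Minimum clock period bounds from the comparison slide.
def conventional_T(pmax, s, delta, n):
    # each of the n stages must fit pmax/n plus set-up time and skew
    return pmax / n + s + 2 * delta

def wave_T(pmax, pmin, s, delta):
    # bound obtained by combining the two wave-pipelining constraints
    return (pmax - pmin) + s + 4 * delta

print(conventional_T(pmax=90, s=2, delta=1, n=10))
print(wave_T(pmax=90, pmin=80, s=2, delta=1))
```

With a tight delay spread (pmax close to pmin), the wave-pipeline bound can beat the conventional one without paying per-stage set-up and skew n times.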
slide 33
Additional Reading
Wayne P. Burleson, Maciej Ciesielski, Fabian Klass, and Wentai Liu, "Wave-Pipelining: A Tutorial and Research Survey," IEEE Transactions on VLSI Systems, vol. 6, no. 3, pp. 464-474, September 1998.
slide 34