Professional Documents
Culture Documents
Dynamic scheduling
OS
Compiler
CPU
Firmware
Out-of-order execution
Scoreboard
Dynamic scheduling with WAW/WAR
I/O
Memory
Tomasulos algorithm
Add register renaming to fix WAW/WAR
Digital Circuits
Gates & Transistors
Next unit
Adding speculation and precise state
Dynamic load scheduling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
F D E+ E+ E+ W
F D d* d* E* E* E* E* E* W
F p* p* D E+ E+ E+ W
p2,p3,p4
p2,p4,p5
p2,p5,p6
p4,4,p7
regfile
insn buffer
D
D$
Ready Table
P2 P3 P4 P5 P6 P7
Yes Yes
Yes Yes Yes
Yes
Yes Yes Yes Yes
Yes
Yes Yes Yes Yes Yes Yes
add p2,p3,p4
sub p2,p4,p5 and div p4,4,p7
mul p2,p5,p6
Time
Register Renaming
To eliminate WAW and WAR hazards
Example
Names: r1,r2,r3
Locations: p1,p2,p3,p4,p5,p6,p7
Original mapping: r1p1, r2p2, r3p3, p4p7 are free
MapTable
FreeList
Raw insns
Renamed insns
r1
p1
p4
p4
p4
p4,p5,p6,p7
p5,p6,p7
p6,p7
p7
add
sub
mul
div
add
sub
mul
div
r2
p2
p2
p2
p2
r3
p3
p3
p5
p6
r2,r3,r1
r2,r1,r3
r2,r3,r3
r1,4,r1
p2,p3,p4
p2,p4,p5
p2,p5,p6
p4,4,p7
Renaming
+Removes WAW and WAR dependences
+Leaves RAW intact!
CS/ECE 752 (Wood): Dynamic Scheduling I
Before We Continue
If we can do this in software
why build complex (slow-clock, high-power) hardware?
+ Performance portability
Dont want to recompile for new machines
+ More information available
Memory addresses, branch directions, cache misses
+ More registers available (??)
Compiler may not have enough to fix WAR/WAW hazards
+ Easier to speculate and recover from mis-speculation
Flush instead of recover code
But compiler has a larger scope
Compiler does as much as it can (not much)
Hardware does the rest
CS/ECE 752 (Wood): Dynamic Scheduling I
10
11
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
blt r1,r2,0
// loop
// A in f0
// i in r1
// N*4 in r2
12
I$
B
P
D$
In-order pipeline
Often written as F,D,X,W (multi-cycle X includes M)
Example pipeline: 1-cycle int (including mem), 3-cycle pipelined FP
13
ldf X(r1),f1
c1 c2 c3
mulf f0,f1,f2 c3 c4+ c7
stf f2,Z(r1)
c7 c8 c9
addi r1,4,r1
c8 c9 c10
ldf X(r1),f1 c10 c11 c12
mulf f0,f1,f2 c12 c13+ c16
stf f2,Z(r1) c16 c17 c18
Down: insns
Across: pipeline stages
In boxes: cycles
Basically: stages cycles
Convenient for out-of-order
14
I$
B
P
D$
In-order pipeline
Structural hazard: 1 insn register (latch) per stage
1 insn per stage per cycle (unless pipeline is replicated)
Younger insn cant pass older insn without clobbering it
Out-of-order pipeline
Implement passing functionality by removing structural hazard
CS/ECE 752 (Wood): Dynamic Scheduling I
15
Instruction Buffer
insn buffer
regfile
I$
B
P
D$
D1
D2
16
I$
B
P
D$
17
I$
B
P
D$
S
E*
E*
E
+
E
+
E*
E/
F-regfile
CS/ECE 752 (Wood): Dynamic Scheduling I
18
19
20
21
Fetched R1 R2 R
insns
Regfile
value
Reg Status T
op
FU Status
T1
==
==
==
==
T2
==
==
== CAMs
==
T
FU
22
Scoreboard Pipeline
New pipeline structure: F, D, S, X, W
F (fetch)
Same as it ever was
D (dispatch)
Structural or WAW hazard ? stall : allocate scoreboard entry
S (issue)
RAW hazard ? wait : read registers, go to execute
X (execute)
Execute operation, notify scoreboard when done
W (writeback)
WAR hazard ? wait : write register, free scoreboard entry
W and RAW-dependent S in same cycle
W and structural-dependent D in same cycle
CS/ECE 752 (Wood): Dynamic Scheduling I
23
Fetched R1 R2 R
insns
Regfile
value
Reg Status T
op
FU Status
T1
==
==
==
==
T2
==
==
==
==
T
FU
24
Fetched R1 R2 R
insns
Regfile
value
Reg Status T
op
FU Status
T1
==
==
==
==
T2
==
==
==
==
T
FU
25
26
Fetched R1 R2 R
insns
Regfile
value
Reg Status T
op
FU Status
T1
==
==
==
==
T2
==
==
==
==
T
FU
Execute insn
27
Fetched R1 R2 R
insns
Regfile
value
Reg Status T
op
FU Status
T1
==
==
==
==
T2
==
==
==
==
T
FU
28
f0
f1
f2
r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
Reg Status
Reg T
R1
R2
T1
T2
no
no
no
no
no
29
Scoreboard: Cycle 1
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
no
yes ldf
no
no
no
Reg Status
Reg T
f0
f1
f2
r1
c1
R1
R2
T1
T2
f1
r1
LD
allocate
30
Scoreboard: Cycle 2
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
c1
c2
c2
R1
R2
T1
T2
r1
f0
f1
LD
no
yes ldf f1
no
yes mulf f2
no
Reg Status
Reg T
f0
f1
f2
r1
LD
FP1
allocate
31
Scoreboard: Cycle 3
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c3
c2
c3
no
yes ldf f1
yes stf yes mulf f2
no
f2
f0
Reg Status
Reg T
f0
f1
f2
r1
R2
T1
T2
r1
r1
f1
FP1
-
LD
LD
FP1
allocate
32
Scoreboard: Cycle 4
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
Reg Status
Reg T
c1
c2
c3
c4
c2
c4
c3
c4
R1
R2
T1
T2
r1
f2
f0
r1
f1
FP1
-
LD
yes addi r1
no
yes stf yes mulf f2
no
f0
f1
f2
r1
LD
FP1
ALU
f1 written clear
allocate
free
f0 (LD) is ready issue mulf
33
Scoreboard: Cycle 5
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
yes
yes
yes
yes
no
addi
ldf
stf
mulf
Reg Status
Reg T
c1
c2
c3
c4
c5
c2
c4
c3
c5
c4
R1
R2
T1
T2
r1
f1
f2
r1
f2
f0
r1
r1
f1
FP1
-
ALU
-
c5
f0
f1
f2
r1
LD
FP1
ALU
allocate
34
Scoreboard: Cycle 6
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
yes
yes
yes
yes
no
addi
ldf
stf
mulf
c1
c2
c3
c4
c5
c2 c3 c4
c4 c5+
R1
R2
T1
T2
r1
f1
f2
r1
f2
f0
r1
r1
f1
FP1
-
ALU
-
c5
Reg Status
Reg T
f0
f1
f2
r1
c6
LD
FP1
ALU
35
Scoreboard: Cycle 7
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
yes
yes
yes
yes
no
addi
ldf
stf
mulf
Reg Status
Reg T
c1
c2
c3
c4
c5
c2 c3 c4
c4 c5+
c5
c6
R1
R2
T1
T2
r1
f1
f2
r1
f2
f0
r1
r1
f1
FP1
-
ALU
-
f0
f1 LD
f2 FP1
r1 ALU
W wait: WAR hazard w/ stf (r1)
How to tell? Untagged r1 in FuStatus
Requires CAM
36
Scoreboard: Cycle 8
Insn Status
Insn
c1
c2
c3
c4
c5
c8
c2 c3 c4
c4 c5+ c8
c8
c5 c6
R1
R2
T1
T2
addi r1
ldf f1
stf -
r1
f2
r1
r1
FP1
ALU
-
mulf f2
f0
f1
LD
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
yes
yes
yes
no
yes
Reg Status
Reg T
f0
f1 LD
f2 FP1 FP2
r1 ALU
W wait
Scoreboard: Cycle 9
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
c1
c2
c3
c4
c5
c8
c2 c3 c4
c4 c5+ c8
c8 c9
c5 c6 c9
c9
Reg Status
Reg T
f0
f1
f2
r1
LD
FP2
ALU
r1 written clear
no
yes ldf f1
yes stf no
yes mulf f2
R1
R2
T1
T2
f2
r1
r1
ALU
-
f0
f1
LD
free
r1 (ALU) is ready issue ldf
38
Scoreboard: Cycle 10
Insn Status
Insn
ldf X(r1),f1
c1
mulf f0,f1,f2 c2
stf f2,Z(r1)
c3
addi r1,4,r1
c4
ldf X(r1),f1
c5
mulf f0,f1,f2 c8
stf f2,Z(r1) c10
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
no
yes ldf f1
yes stf no
yes mulf f2
c2 c3 c4
c4 c5+ c8
c8 c9 c10
c5 c6 c9
c9 c10
Reg Status
Reg T
f0
f1
f2
r1
LD
FP2
R1
R2
T1
T2
f2
r1
r1
FP2
f0
f1
LD
39
In-Order
D
X
Scoreboard
W
D
S
X
ldf X(r1),f1
c1 c2 c3 c1 c2 c3
mulf f0,f1,f2 c3 c4+ c7 c2 c4 c5+
stf f2,Z(r1)
c7 c8 c9 c3 c8 c9
addi r1,4,r1
c8 c9 c10 c4 c5 c6
ldf X(r1),f1 c10 c11 c12 c5 c9 c10
mulf f0,f1,f2 c12 c13+ c16 c8 c11 c12+
stf f2,Z(r1) c16 c17 c18 c10 c15 c16
W
c4
c8
c10
c9
c11
c15
c17
Big speedup?
Only 1 cycle advantage for scoreboard
Why? addi WAR hazard
Scoreboard issued addi earlier (c8 c5)
But WAR hazard delayed W until c9
Delayed issue of second iteration
CS/ECE 752 (Wood): Dynamic Scheduling I
40
In-Order
D
X
c1
c7
c11
c12
c14
c16
c20
c2+
c8+
c12
c13
c15
c17+
c21
Scoreboard
W
D
S
X
c7
c11
c13
c14
c16
c20
c22
c1
c2
c3
c4
c5
c6
c7
c2
c8
c12
c5
c13
c15
c19
c3+
c9+
c13
c6
c14
c16+
c20
W
c8
c12
c14
c13
c15
c19
c21
Assume
5 cycle cache miss on first ldf
Ignore FUST structural hazards
Little relative advantage
addi WAR hazard (c7 c13) stalls second iteration
41
Scoreboard Redux
The good
+ Cheap hardware
InsnStatus + FuStatus + RegStatus ~ 1 FP unit in area
+ Pretty good performance
1.7X for FORTRAN (scientific array) programs
42
Register Renaming
Register renaming (in hardware)
Change register names to eliminate WAR/WAW hazards
An elegant idea (like caching & pipelining)
Key: think of registers (r1,f0) as names, not storage locations
+ Can have more locations than names
+ Can have multiple active versions of same name
43
FreeList
Raw insns
Renamed insns
r1
p1
p4
p4
p4
p4,p5,p6,p7
p5,p6,p7
p6,p7
p7
add
sub
mul
div
add
sub
mul
div
r2
p2
p2
p2
p2
r3
p3
p3
p5
p6
r2,r3,r1
r2,r1,r3
r2,r3,r3
r1,4,r1
p2,p3,p4
p2,p4,p5
p2,p5,p6
p4,4,p7
Renaming
+Removes WAW and WAR dependences
+Leaves RAW intact!
CS/ECE 752 (Wood): Dynamic Scheduling I
44
45
Map Table
T: tag (RS#) that will write this register
46
op
Reservation Stations
T1
==
==
==
==
T2
==
==
==
==
T
CDB.T
Fetched
insns
Regfile
value
V1
V2
CDB.V
Map Table
FU
47
48
op
Reservation Stations
T1
==
==
==
==
T2
==
==
==
==
T
CDB.T
Fetched
insns
Regfile
value
V1
V2
CDB.V
Map Table
FU
49
op
Reservation Stations
T1
==
==
==
==
T2
==
==
==
==
T
CDB.T
Fetched
insns
Regfile
value
V1
V2
CDB.V
Map Table
FU
50
op
Reservation Stations
T1
==
==
==
==
T2
==
==
==
==
T
CDB.T
Fetched
insns
Regfile
value
V1
V2
CDB.V
Map Table
FU
51
op
Reservation Stations
T1
==
==
==
==
T2
==
==
==
==
T
CDB.T
Fetched
insns
Regfile
value
V1
V2
CDB.V
Map Table
FU
52
Fetched R1 R2 R
insns
Regfile
value
Reg Status T
op
FU Status
T1
==
==
==
==
T2
==
==
==
==
T
FU
53
And Tomasulo
op
Reservation Stations
T1
==
==
==
==
T2
==
==
==
==
T
CDB.T
Fetched
insns
Regfile
value
V1
V2
CDB.V
Map Table
FU
54
55
Map Table
Reg T
f0
f1
f2
r1
RS#2
RS#4
Reservation Stations
T FU busy op
R
T1
T2
2
4
[r1]
RS#2 [f0]
LD
FP1
yes ldf f1
yes mulf f2
V1
V2
56
1
2
3
4
5
ALU
LD
ST
FP1
FP2
CDB
T
V
f0
f1
f2
r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
Reservation Stations
T FU busy op
R
Map Table
Reg T
T1
T2
V1
V2
no
no
no
no
no
57
Tomasulo: Cycle 1
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
ALU
LD
ST
FP1
FP2
f0
f1
f2
r1
c1
Reservation Stations
T FU busy op
R
1
2
3
4
5
no
yes ldf
no
no
no
Map Table
Reg T
f1
RS#2
T1
T2
V1
V2
[r1]
CDB
T
V
allocate
58
Tomasulo: Cycle 2
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c2
Reservation Stations
T FU busy op
R
1
2
3
4
5
ALU
LD
ST
FP1
FP2
no
yes ldf f1
no
yes mulf f2
no
Map Table
Reg T
f0
f1
f2
r1
RS#2
RS#4
T1
T2
V1
V2
[r1]
RS#2 [f0] -
CDB
T
V
allocate
59
Tomasulo: Cycle 3
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c3
c2
c3
Reservation Stations
T FU busy op
R
1
2
3
4
5
ALU
LD
ST
FP1
FP2
no
yes ldf f1
yes stf yes mulf f2
no
f0
f1
f2
r1
T1
Map Table
Reg T
T2
V1
RS#2
RS#4
V2
[r1]
RS#4 [r1]
RS#2 [f0] -
CDB
T
V
allocate
60
Tomasulo: Cycle 4
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c3
c4
c2
c4
c3
c4
Reservation Stations
T FU busy op
R
1
2
3
4
5
ALU
LD
ST
FP1
FP2
yes addi r1
no
yes stf yes mulf f2
no
Map Table
Reg T
CDB
T
V
f0
f1
f2
r1
RS#2 [f1]
RS#2
RS#4
RS#1
T1
T2
V1
[r1] -
V2
allocate
free
RS#4 [r1]
RS#2 [f0] CDB.V RS#2 ready
grab CDB value
61
Tomasulo: Cycle 5
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c3
c4
c5
c2
c4
c3
c5
c4
c5
Reservation Stations
T FU busy op
R
1
2
3
4
5
ALU
LD
ST
FP1
FP2
yes
yes
yes
yes
no
addi
ldf
stf
mulf
r1
f1
f2
Map Table
Reg T
f0
f1
f2
r1
RS#2
RS#4
RS#1
T1
T2
V1
V2
RS#4
-
RS#1
-
[r1]
[f0]
[r1]
[f1]
CDB
T
V
allocate
62
Tomasulo: Cycle 6
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c3
c4
c5
c6
c2 c3 c4
c4 c5+
c5
Map Table
Reg T
f0
f1
f2
r1
c6
RS#2
RS#4RS#5
RS#1
Reservation Stations
T FU busy op
R
T1
T2
V1
V2
1
2
3
4
5
RS#4
-
RS#1
RS#2
[r1]
[f0]
[f0]
[r1]
[f1]
-
ALU
LD
ST
FP1
FP2
yes
yes
yes
yes
yes
addi
ldf
stf
mulf
mulf
CDB
T
V
r1
f1
f2
f2
allocate
63
Tomasulo: Cycle 7
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c3
c4
c5
c6
c2 c3 c4
c4 c5+
c5
c7
c6
Map Table
Reg T
c7
CDB
T
V
f0
RS#1 [r1]
f1 RS#2
f2 RS#5
r1 RS#1
no W wait on WAR: scoreboard would
anyone who needs old r1 has RS copy
D stall on store RS: structural
addi finished (W)
clear r1 RegStatus
CDB broadcast
Reservation Stations
T FU busy op
R
T1
T2
V1
V2
1
2
3
4
5
RS#4
-
RS#1
RS#2
[f0]
[f0]
ALU
LD
ST
FP1
FP2
no
yes
yes
yes
yes
ldf
stf
mulf
mulf
f1
f2
f2
64
Tomasulo: Cycle 8
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c3
c4
c5
c6
c2 c3 c4
c4 c5+ c8
c8
c5 c6 c7
c7 c8
Reservation Stations
T FU busy op
R
1
2
3
4
5
ALU
LD
ST
FP1
FP2
no
yes ldf f1
yes stf no
yes mulf f2
T1
T2
Map Table
Reg T
CDB
T
V
f0
f1
f2
r1
RS#4 [f2]
RS#2
RS#5
V1
V2
RS#4 -
[r1]
CDB.V [r1] RS#4 ready
grab CDB value
RS#2 [f0] 65
Tomasulo: Cycle 9
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c3
c4
c5
c6
c2 c3 c4
c4 c5+ c8
c8 c9
c5 c6 c7
c7 c8 c9
c9
Reservation Stations
T FU busy op
R
1
2
3
4
5
ALU
LD
ST
FP1
FP2
no
no
yes stf no
yes mulf f2
Map Table
Reg T
CDB
T
V
f0
f1
f2
r1
RS#2 [f1]
RS#2
RS#5
T1
T2
V1
[f2] [r1]
V2
66
Tomasulo: Cycle 10
Insn Status
Insn
ldf X(r1),f1
c1
mulf f0,f1,f2 c2
stf f2,Z(r1)
c3
addi r1,4,r1
c4
ldf X(r1),f1
c5
mulf f0,f1,f2 c6
stf f2,Z(r1) c10
ALU
LD
ST
FP1
FP2
CDB
T
V
f0
c2 c3 c4
f1
c4 c5+ c8
f2 RS#5
c8 c9 c10
r1
c5 c6 c7
c7 c8 c9 stf finished (W)
c9 c10
no output register no CDB broadcast
Reservation Stations
T FU busy op
R
1
2
3
4
5
Map Table
Reg T
no
no
yes stf no
yes mulf f2
T1
V1
V2
RS#5 -
[r1]
[f0] [f1]
T2
free allocate
67
Scoreboard
D
S
X
ldf X(r1),f1
c1 c2 c3
mulf f0,f1,f2 c2 c4 c5+
stf f2,Z(r1)
c3 c8 c9
addi r1,4,r1
c4 c5 c6
ldf X(r1),f1
c5 c9 c10
mulf f0,f1,f2 c8 c11 c12+
stf f2,Z(r1) c10 c15 c16
Hazard
Insn buffer
FU
RAW
WAR
WAW
Scoreboard
stall in D
wait in S
wait in S
wait in W
stall in D
Tomasulo
W
D
S
c4 c1 c2 c3 c4
c8 c2 c4 c5+ c8
c10 c3 c8 c9 c10
c9 c4 c5 c6 c7
c11 c5 c7 c8 c9
c15 c6 c9 c10+ c13
c17 c10 c13 c14 c15
Tomasulo
stall in D
wait in S
wait in S
none
none
68
Scoreboard
D
S
X
ldf X(r1),f1
c1 c2
mulf f0,f1,f2 c2 c8
stf f2,Z(r1)
c3 c12
addi r1,4,r1
c4 c5
ldf X(r1),f1
c8 c13
mulf f0,f1,f2 c12 c15
stf f2,Z(r1) c13 c19
c3+
c9+
c13
c6
c14
c16+
c20
Tomasulo
W
D
S
c8
c12
c14
c13
c15
c19
c21
c1
c2
c3
c4
c5
c6
c7
c2
c8
c12
c5
c7
c9
c13
c3+
c9+
c13
c6
c8
c10+
c14
c8
c12
c14
c7
c9
c13
c15
Assume
5 cycle cache miss on first ldf
Ignore FUST and RS structural hazards
+ Advantage Tomasulo
No addi WAR hazard (c7) means iterations run in parallel
69
RS: N tag/value w-ports (D), N value r-ports (S), 2N tag CAMs (W)
Select logic: WN priority encoder (S)
MT: 2N r-ports (D), N w-ports (D)
RF: 2N r-ports (D), N w-ports (W)
CDB: N (W)
Which are the expensive pieces?
70
Split design
Divide RS into N banks: 1 per FU?
Implement N separate W/N1 encoders
+ Simpler: N * logW/N
Less scheduling flexibility
71
op
Reservation Stations
T1
==
==
==
==
T2
==
==
==
==
CDB.T
Fetched
insns
Regfile
value
V1
V2
CDB.V
Map Table
T
FU
72
No Bypassing
D
S
X
Bypassing
W
D
S
ldf X(r1),f1
c1 c2 c3 c4 c1 c2 c3 c4
mulf f0,f1,f2 c2 c4 c5+ c8 c2 c3 c4+ c7
stf f2,Z(r1)
c3 c8 c9 c10 c3 c6 c7 c8
addi r1,4,r1
c4 c5 c6 c7 c4 c5 c6 c7
ldf X(r1),f1
c5 c7 c8 c9 c5 c7 c7 c9
mulf f0,f1,f2 c6 c9 c10+ c13 c6 c9 c8+ c13
stf f2,Z(r1) c10 c13 c14 c15 c10 c13 c11 c15
Modern scheduler
Split CDB tag and value, move tag broadcast to S
ldf tag broadcast now in cycle 2 mulf S in cycle 3
How do multi-cycle operations work? How do cache misses work?
CS/ECE 752 (Wood): Dynamic Scheduling I
73
74