2
Goal of this Chapter
Make Superscalar Processors Work
Evolution
1. Detect hazards in hardware and enforce pipeline stalls
• basic correctness
2. Out of order completion
3. Scoreboard – out of order execution
4. Superscalar – multiple instruction issue using available functional units in parallel, hardware hazard check
5. Wider superscalar – add more functional units
6. Register renaming
7. Speculative execution - branch prediction and damage control
8. Giving up on further ILP (multithreaded, multicore processors)
Make CPI < 1, or its reciprocal IPC > 1
3
Remember
[H&P p.177]
“The goal of both our software and hardware techniques is
to exploit parallelism by preserving program order only
where it affects the outcome of the program. Detecting
and avoiding hazards ensures that necessary program
order is preserved.” (sequential consistency)
4
Recall from Pipelining
5
Recall Simple Data Hazard Resolution:
In-order issue, in-order completion

Time (clock cycles) →, instructions in program order ↓

  lw  r1, 0(r2)   Ifetch Reg    ALU    DMem   Reg
  sub r4, r1, r6         Ifetch Reg    Bubble ALU    DMem   Reg
  and r6, r2, r7                Ifetch Bubble Reg    ALU    DMem   Reg
  or  r8, r2, r9                       Bubble Ifetch Reg    ALU    DMem   Reg

Extend to multiple instruction issue?
What if the load had a longer delay? Can the AND issue?
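The stall above is what the hazard-detection unit enforces in the decode stage. A minimal sketch, assuming a classic 5-stage pipeline with forwarding, where only a load followed immediately by a dependent instruction forces a bubble (function and signal names hypothetical):

```python
def load_use_stall(ex_is_load, ex_dest, id_src1, id_src2):
    """Classic 5-stage interlock: stall the instruction in ID if the
    instruction in EX is a load whose destination matches one of its sources."""
    return ex_is_load and ex_dest is not None and ex_dest in (id_src1, id_src2)

# lw r1,0(r2) in EX while sub r4,r1,r6 is in ID -> insert one bubble
print(load_use_stall(True, "r1", "r1", "r6"))   # True
# and r6,r2,r7 behind the bubble no longer conflicts
print(load_use_stall(False, "r1", "r2", "r7"))  # False
```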
6
In-Order Issue, Out-of-order Completion

[Figure: a short Add (Ifetch, Reg, ALU, Reg) issued behind a long memory instruction (Ifetch, Reg, DMem, DMem′, Reg) writes back first — completion is out of order even though issue is in order]
7
Ideas to Reduce Stalls

  Dynamic ILP:
    Technique                                Reduces
    Dynamic scheduling                       Data hazard stalls
    Dynamic branch prediction                Control stalls
    Issuing multiple instructions per cycle  Ideal CPI
    Speculation                              Data and control stalls
    Dynamic memory disambiguation            Data hazard stalls involving memory

  Static ILP:
    Loop unrolling                           Control hazard stalls
    Basic compiler pipeline scheduling       Data hazard stalls
    Compiler dependence analysis             Ideal CPI and data hazard stalls
    Software pipelining, trace scheduling    Ideal CPI and data hazard stalls
    Compiler speculation                     Ideal CPI, data and control stalls
8
Instruction-Level Parallelism (ILP)
• Basic Block (BB) ILP is quite small
– BB: a straight-line code sequence with no branches in except to the
entry and no branches out except at the exit
– average dynamic branch frequency 15% to 25%
=> 4 to 7 instructions execute between a pair of branches
– Plus instructions in BB likely to depend on each other
• To obtain substantial performance enhancements, we
must exploit ILP across multiple basic blocks
• Simplest: loop-level parallelism to exploit parallelism
among iterations of a loop
– Vector is one way
– If not vector, then either dynamic via branch prediction or static via
loop unrolling by compiler
9
Review Data Dependence and Hazards
Data Dependence ⇒ Potential Hardware Hazard
10
Data Dependence and Hazards
• Dependences are a property of programs
• Presence of dependence indicates potential for a
hazard, but actual hazard and length of any stall is a
property of the pipeline
• Importance of data dependences:
1) indicates the possibility of a hazard
2) determines order in which results must be calculated
3) sets an upper bound on how much parallelism can
possibly be exploited
• Next, look at HW schemes to avoid hazard
11
ILP and Data Hazards
• program order: order instructions would execute in if
executed sequentially 1 at a time as determined by
original source program
• HW/SW goal: exploit parallelism by preserving the
appearance of program order
– modify order only in a manner that cannot be observed by the program
– must not affect the outcome of the program
• Ex: Instructions involved in a name dependence can
execute simultaneously if name used in instructions is
changed so instructions do not conflict
– Register renaming resolves name dependence for regs
– Either by compiler or by HW
– add r1, r2, r3
– sub r2, r4,r5
– and r3, r2, 1
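Renaming for the three-instruction example can be sketched as follows (helper hypothetical): each destination gets a fresh physical register and sources read the current map, so the WAR on r2 and the reuse of r3 disappear while the true RAW from sub to and is preserved.

```python
def rename(instrs, n_arch=32):
    """Rename (dest, src1, src2) triples of architectural registers
    to fresh physical registers; returns the renamed triples."""
    map_ = {f"r{i}": f"p{i}" for i in range(n_arch)}  # initial identity map
    next_phys = n_arch
    out = []
    for dest, s1, s2 in instrs:
        ps1 = map_.get(s1, s1)   # sources read the current mapping
        ps2 = map_.get(s2, s2)
        pd = f"p{next_phys}"     # destination gets a fresh physical reg
        next_phys += 1
        map_[dest] = pd
        out.append((pd, ps1, ps2))
    return out

code = [("r1", "r2", "r3"),   # add r1, r2, r3
        ("r2", "r4", "r5"),   # sub r2, r4, r5  (WAR on r2 vs. the add)
        ("r3", "r2", "#1")]   # and r3, r2, 1   (true RAW: reads the new r2)
print(rename(code))
```

In the output, add reads the old p2 while sub writes a new p33, so both WAR and WAW conflicts are gone; and correctly reads p33.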
12
Control Dependencies
13
Control Dependence Ignored
14
Exception Behavior
• Preserving exception behavior => any changes in
instruction execution order must not change how
exceptions are raised in program (=> no new exceptions)
• Example:
DADDU R2,R3,R4
BEQZ R2,L1
NOP
LW R1,0(R2)
L1:
• Problem with moving LW before BEQZ?
15
Data Flow
16
Advantages of
Dynamic Scheduling
• Handles cases when dependences unknown at compile
time
– (e.g., because they may involve a memory reference)
• It simplifies the compiler
• Allows code that compiled for one pipeline to run
efficiently on a different pipeline
• Enables hardware speculation, a technique with significant
performance advantages, which builds on dynamic
scheduling
17
HW Schemes: Instruction Parallelism
• Key idea: Allow instructions behind stall to proceed
DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14
• Enables out-of-order execution
and allows out-of-order completion
• Will distinguish when an instruction begins execution and
when it completes execution; between 2 times, the
instruction is in execution
• In a dynamically scheduled pipeline, all instructions pass
through issue stage in order (in-order issue)
18
Data Hazards: An Example
dest src1 src2
I1 DIVD f6, f6, f4
I2 LD f2, 45(r3)
RAW Hazards
WAR Hazards
WAW Hazards
20
Complex Pipelining

[Figure: in-order front end (IF, ID, Issue) feeding parallel functional units — ALU, Mem, Fadd, Fmul, Fdiv — reading GPRs and FPRs and sharing WB; below it, the base scalar pipeline: PC → Inst. Mem → Decode (GPRs) → X1 → X2 / Data Mem → X3 → W]
22
Complex Pipeline

[Figure repeated without Fdiv: IF, ID, Issue feeding ALU, Mem, Fadd, Fmul units with GPRs/FPRs and a shared WB]
23
When is it Safe to Issue an
Instruction?
Suppose a data structure keeps track of all the
instructions in all the functional units
24
Scoreboard for In-order Issues
25
In-Order Issue Limitations: an example
latency 1 2
1 LD F2, 34(R2) 1

In-order:      1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
Out-of-order:  1 (2,1) 4 4 . . . . 2 3 . . 3 5 . . . 5 6 6

The in-order restriction prevents instruction 4 from being dispatched.

[Figure: pipeline with IF, ID, Issue feeding ALU, Mem, Fadd, Fmul units, all writing back at WB]
28
How many instructions can be in
the pipeline?
Which features of an ISA limit the number of
instructions in the pipeline?
Number of Registers
29
Overcoming the Lack of Register
Names
Floating-point pipelines often cannot be kept filled
with a small number of registers.
The IBM 360 had only 4 floating-point registers.
30
Little’s Law
N (instructions in flight) = throughput × latency

[Figure: window of in-flight instructions spanning Issue → Execution → WB]

Example:
• 4 floating point registers
• 8 cycles per floating point operation
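Applying Little's Law to the slide's numbers (with only 4 registers to name in-flight results, throughput is capped at N / latency):

```python
# Little's Law: instructions in flight N = throughput (ops/cycle) * latency (cycles)
registers_in_flight = 4    # only 4 FP registers can name pending results
latency = 8                # cycles per floating-point operation

max_throughput = registers_in_flight / latency
print(max_throughput)      # best case 0.5 ops/cycle
print(1 / max_throughput)  # i.e., CPI of at least 2 for FP code
```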
31
Instruction-level Parallelism via Renaming
latency 1 2
1 LD F2, 34(R2) 1

In-order:      1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
Out-of-order:  1 (2,1) 4 4 5 . . . 2 (3,5) 3 6 6

Any antidependence can be eliminated by renaming
(renaming requires additional storage).
Can it be done in hardware? Yes!
32
Register Renaming

[Figure: the same issue pipeline (IF, ID, Issue; ALU, Mem, Fadd, Fmul), now with register renaming]
34
A Dynamic Algorithm:
Tomasulo’s Algorithm
• For IBM 360/91 (before caches!)
• Goal: High Performance without special compilers
• Small number of floating point registers (4 in 360) prevented
interesting compiler scheduling of operations
– This led Tomasulo to try to figure out how to get more effective registers —
renaming in hardware!
• Why Study 1966 Computer?
• The descendants of this design have flourished!
– Alpha 21264, HP PA-8000, MIPS R10000, Pentium III/4,
PowerPC 604, …
35
Tomasulo Algorithm
• Control & buffers distributed with Function Units (FU)
– FU buffers called “reservation stations”; have pending operands
• Registers in instructions replaced by values or pointers to reservation
stations (RS);
– form of register renaming ;
– avoids WAR, WAW hazards
– More reservation stations than registers, so can do optimizations compilers
can’t
• Results to FU from RS, not through registers, over Common Data Bus
that broadcasts results to all FUs
• Loads and stores treated as FUs with RSs as well
• Integer instructions can get ahead & go past branches, allowing
FP ops beyond basic block in FP queue
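The tag-and-broadcast mechanism can be sketched as a toy model (data structures simplified and hypothetical, not the 360/91's actual design): a register holds either a value or the name of the reservation station that will produce it; a completing station broadcasts (tag, value) on the CDB and every waiter captures it.

```python
class RS:
    """A reservation station: op plus value (Vj/Vk) or producer tag (Qj/Qk)."""
    def __init__(self, name):
        self.name, self.busy = name, False
        self.op = self.vj = self.vk = self.qj = self.qk = None

def lookup(regs, r):
    v = regs[r]
    return (None, v) if isinstance(v, str) else (v, None)  # tag vs. value

def issue(rs, op, srcs, dest, regs):
    """Issue op to station rs; sources are copied values or producer tags."""
    rs.busy, rs.op = True, op
    (rs.vj, rs.qj), (rs.vk, rs.qk) = (lookup(regs, s) for s in srcs)
    regs[dest] = rs.name          # rename: dest now names this station

def broadcast(tag, value, stations, regs):
    """Common Data Bus: the result goes to all waiting stations and registers."""
    for rs in stations:
        if rs.qj == tag: rs.vj, rs.qj = value, None
        if rs.qk == tag: rs.vk, rs.qk = value, None
    for r, v in regs.items():
        if v == tag: regs[r] = value

regs = {"F0": 1.0, "F2": 2.0, "F4": 3.0}
add1, mult1 = RS("Add1"), RS("Mult1")
issue(mult1, "MULTD", ["F2", "F4"], "F0", regs)   # F0 <- F2 * F4
issue(add1, "ADDD", ["F0", "F2"], "F2", regs)     # F0 not ready: waits on tag
print(add1.qj)                                    # 'Mult1'
broadcast("Mult1", 6.0, [add1, mult1], regs)      # multiply completes
print(add1.vj, add1.qj)                           # 6.0 None
```

Note how the WAR/WAW hazards vanish: after issue, ADDD's operands are a tag and a captured value, so later writes to F0 or F2 cannot disturb it.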
36
Tomasulo Organization

[Figure: FP Op Queue and FP Registers at top; Load Buffers (Load1–Load6) from memory and Store Buffers to memory; reservation stations Add1–Add3 feeding the FP adders and Mult1–Mult2 feeding the FP multipliers; the Common Data Bus broadcasts results to all units]
38
Three Stages of Tomasulo Algorithm
1. Issue — get instruction from FP Op Queue; if a matching reservation station is free, issue it with its operand values (or the tags of their producers)
2. Execute — when both operands are ready, execute; if not ready, watch the Common Data Bus for them (checks RAW hazards)
3. Write result — broadcast the result on the CDB to all waiting reservation stations and registers; free the reservation station
39
Tomasulo Example
Instruction stream (3 load buffers; 3 FP adder RS; 2 FP mult RS; Time = FU count-down; a clock-cycle counter advances with each table)
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 Load1 No
LD F2 45+ R3 Load2 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
Tomasulo Example Cycle 1
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 Load1 Yes 34+R2
LD F2 45+ R3 Load2 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
41
Tomasulo Example Cycle 2
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 Load1 Yes 34+R2
LD F2 45+ R3 2 Load2 Yes 45+R3
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
42
Tomasulo Example Cycle 3
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 Load1 Yes 34+R2
LD F2 45+ R3 2 Load2 Yes 45+R3
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes MULTD R(F4) Load2
Mult2 No
Tomasulo Example Cycle 4 (reservation stations)
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 Yes SUBD M(A1) Load2
Add2 No
Add3 No
Mult1 Yes MULTD R(F4) Load2
Mult2 No
Tomasulo Example Cycle 5 (reservation stations)
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
2 Add1 Yes SUBD M(A1) M(A2)
Add2 No
Add3 No
10 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 6 (reservation stations)
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
1 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD M(A2) Add1
Add3 No
9 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 7 (reservation stations)
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
0 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD M(A2) Add1
Add3 No
8 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 8 (reservation stations)
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
2 Add2 Yes ADDD (M-M) M(A2)
Add3 No
7 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
48
Tomasulo Example Cycle 9
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
1 Add2 Yes ADDD (M-M) M(A2)
Add3 No
6 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
49
Tomasulo Example Cycle 10
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
0 Add2 Yes ADDD (M-M) M(A2)
Add3 No
5 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 11 (reservation stations)
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
4 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 12 (reservation stations)
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
3 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
52
Tomasulo Example Cycle 13
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
2 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
53
Tomasulo Example Cycle 14
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
1 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
54
Tomasulo Example Cycle 15
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
0 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 16 (reservation stations)
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
40 Mult2 Yes DIVD M*F4 M(A1)
57
Tomasulo Example Cycle 55
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
1 Mult2 Yes DIVD M*F4 M(A1)
58
Tomasulo Example Cycle 56
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5 56
ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 Yes DIVD M*F4 M(A1)
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 Yes DIVD M*F4 M(A1)
Tomasulo Drawbacks
• Complexity
– design delays of 360/91, MIPS 10000, Alpha 21264,
IBM PPC 620, but not in silicon!
• Many associative stores (CDB) at high speed
• Performance limited by Common Data Bus
– Each CDB must go to multiple functional units
⇒ high capacitance, high wiring density
– Number of functional units that can complete per cycle limited to
one!
» Multiple CDBs ⇒ more FU logic for parallel associative stores
• Non-precise interrupts!
– We will address this later
61
Tomasulo Loop Example
Loop: L.D F0 0(R1)
MULT.D F4 F0 F2
S.D F4 0(R1)
DSUBI R1 R1 #8
BNE R1 R0 Loop
62
Loop Carried Name Dependencies
Two iterations shown – what the compiler generated
Loop: L.D F0 0(R1)
MULT.D F4 F0 F2
S.D F4 0(R1)
DSUBI R1 R1 #8
BNE R1 R0 Loop
63
Loop Carried Name Dependencies
65
Loop Example
(ITER = iteration count; store buffers added alongside the load buffers)
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 Load1 No
1 MULTD F4 F0 F2 Load2 No
1 SD F4 0 R1 Load3 No
2 LD F0 0 R1 Store1 No
2 MULTD F4 F0 F2 Store2 No
2 SD F4 0 R1 Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 No SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
0 80 Fu
67
Loop Example Cycle 2
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 80
1 MULTD F4 F0 F2 2 Load2 No
Load3 No
Store1 No
Store2 No
Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
2 80 Fu Load1 Mult1
68
Loop Example Cycle 3
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 80
1 MULTD F4 F0 F2 2 Load2 No
1 SD F4 0 R1 3 Load3 No
Store1 Yes 80 Mult1
Store2 No
Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
3 80 Fu Load1 Mult1
73
Loop Example Cycle 8
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 80
1 MULTD F4 F0 F2 2 Load2 Yes 72
1 SD F4 0 R1 3 Load3 No
2 LD F0 0 R1 6 Store1 Yes 80 Mult1
2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2
2 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8
Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
8 72 Fu Load2 Mult2
74
Loop Example Cycle 9
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 Load1 Yes 80
1 MULTD F4 F0 F2 2 Load2 Yes 72
1 SD F4 0 R1 3 Load3 No
2 LD F0 0 R1 6 Store1 Yes 80 Mult1
2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2
2 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8
Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
9 72 Fu Load2 Mult2
75
Loop Example Cycle 10
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No
1 MULTD F4 F0 F2 2 Load2 Yes 72
1 SD F4 0 R1 3 Load3 No
2 LD F0 0 R1 6 10 Store1 Yes 80 Mult1
2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2
2 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
4 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8
Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
10 64 Fu Load2 Mult2
76
Loop Example Cycle 11
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No
1 MULTD F4 F0 F2 2 Load2 No
1 SD F4 0 R1 3 Load3 Yes 64
2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1
2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2
2 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
3 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8
4 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
11 64 Fu Load3 Mult2
82
Loop Example Cycle 17
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No
1 MULTD F4 F0 F2 2 14 15 Load2 No
1 SD F4 0 R1 3 Load3 Yes 64
2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2
2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2
2 SD F4 0 R1 8 Store3 Yes 64 Mult1
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
17 64 Fu Load3 Mult1
83
Loop Example Cycle 18
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No
1 MULTD F4 F0 F2 2 14 15 Load2 No
1 SD F4 0 R1 3 18 Load3 Yes 64
2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2
2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2
2 SD F4 0 R1 8 Store3 Yes 64 Mult1
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
18 64 Fu Load3 Mult1
84
Loop Example Cycle 19
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No
1 MULTD F4 F0 F2 2 14 15 Load2 No
1 SD F4 0 R1 3 18 19 Load3 Yes 64
2 LD F0 0 R1 6 10 11 Store1 No
2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2
2 SD F4 0 R1 8 19 Store3 Yes 64 Mult1
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
19 56 Fu Load3 Mult1
85
Loop Example Cycle 20
Instruction status: Exec Write
ITER Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 Yes 56
1 MULTD F4 F0 F2 2 14 15 Load2 No
1 SD F4 0 R1 3 18 19 Load3 Yes 64
2 LD F0 0 R1 6 10 11 Store1 No
2 MULTD F4 F0 F2 7 15 16 Store2 No
2 SD F4 0 R1 8 19 20 Store3 Yes 64 Mult1
Reservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:
Add1 No LD F0 0 R1
Add2 No MULTD F4 F0 F2
Add3 No SD F4 0 R1
Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8
Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
20 56 Fu Load1 Mult1
87
Why can Tomasulo overlap iterations of
loops?
• Register renaming
– Multiple iterations use different physical destinations for registers
(dynamic loop unrolling).
• Reservation stations
– Permit instruction issue to advance past integer control flow operations
– Also buffer old values of registers - totally avoiding the WAR stall that we
saw in the scoreboard.
88
Tomasulo’s scheme offers 2 major
advantages
(1) the distribution of the hazard detection logic
– distributed reservation stations and the CDB
– If multiple instructions waiting on single result, & each instruction has
other operand, then instructions can be released simultaneously by
broadcast on CDB
"Multicast result forwarding"
– If a centralized register file were used, the units would have to read their
results from the registers when register buses are available.
(2) the elimination of stalls for WAW and WAR hazards
89
What about Precise Interrupts?
• State of machine looks as if no instruction beyond faulting
instructions has issued
• Tomasulo had:
90
Relationship between precise
interrupts and speculation:
• Speculation: guess and check
• Important for branch prediction:
– Need to “take our best shot” at predicting branch direction.
• If we speculate and are wrong, need to back up and
restart execution to point at which we predicted
incorrectly:
– This is exactly same as precise exceptions!
• Technique for both precise interrupts/exceptions and
speculation: in-order completion or commit
91
HW support for precise interrupts
• Need HW buffer for results of uncommitted instructions:
reorder buffer
– 3 fields: instr, destination, value
– Use reorder buffer number instead of reservation station when
execution completes
– Supplies operands between execution complete & commit
– (Reorder buffer can be operand source ⇒ more registers, like RS)
– Instructions commit in order
– Once instruction commits, result is put into register
– As a result, easy to undo speculated instructions
on mispredicted branches or exceptions

[Figure: FP Op Queue and FP Regs feeding reservation stations and FP adders, with the Reorder Buffer sitting between execution and the register file]
92
Four Steps of Speculative
Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue
If reservation station and reorder buffer slot free, issue instr & send operands
& reorder buffer no. for destination (this stage sometimes called “dispatch”)
2. Execution—operate on operands (EX)
When both operands ready then execute; if not ready, watch CDB for result;
when both in reservation station, execute; checks RAW (sometimes called
“issue”)
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting FUs
& reorder buffer; mark reservation station available.
4. Commit—update register (or memory) with reorder result
When instr. at head of reorder buffer & result present, update register with
result (or store to memory) and remove instr from reorder buffer. Mispredicted
branch flushes reorder buffer (sometimes called “graduation”)
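The commit step (step 4) can be sketched as a toy model (structure hypothetical, not a real design): results wait in the reorder buffer and update architectural state only from the head, so a mispredicted branch simply flushes everything younger than it.

```python
from collections import deque

def commit(rob, regs):
    """Retire finished instructions in order from the head of the ROB.
    Each entry: (dest_reg_or_None, value_or_None, mispredicted_branch?)."""
    while rob and rob[0][1] is not None:      # head result must be present
        dest, value, mispredict = rob.popleft()
        if dest is not None:
            regs[dest] = value                # architectural state updated in order
        if mispredict:
            rob.clear()                       # flush all younger, speculative work
            break

regs = {}
rob = deque([("F0", 6.0, False),   # done
             ("F2", 8.0, False),   # done
             (None, 0, True),      # branch, resolved as mispredicted
             ("F4", 9.0, False)])  # speculative result: must be discarded
commit(rob, regs)
print(regs)        # {'F0': 6.0, 'F2': 8.0}
print(len(rob))    # 0
```

Because F4's update never reaches the register file, the misprediction is undone simply by discarding ROB entries.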
93
What are the hardware complexities with
reorder buffer (ROB)?

[Figure: ROB between the FP Op Queue and FP Regs; each entry holds Dest Reg, Result, and a Valid bit, plus the Program Counter; a compare network associatively matches operands against entries. How are exceptions handled?]
94
Tomasulo Summary
• Reservation stations: implicit register renaming to a larger
set of registers + buffering of source operands
– Prevents registers as bottleneck
– Avoids WAR, WAW hazards of Scoreboard
– Allows loop unrolling in HW
• Not limited to basic blocks
(integer units gets ahead, beyond branches)
• Today, helps cache misses as well
– Don’t stall for L1 Data cache miss (insufficient ILP for L2 miss?)
• Lasting Contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
• 360/91 descendants are Pentium III++; PowerPC 604;
MIPS R10000; HP-PA 8000; Alpha 21264
95
Tomasulo Algorithm and Branch
Prediction
• 360/91 predicted branches, but did not speculate: pipeline
stopped until the branch was resolved
– No speculation; only instructions that can complete
• Speculation with Reorder Buffer allows execution past
branch, and then discard if branch fails
– just need to hold instructions in buffer until branch can commit
96
Case for Branch Prediction when
Issue N instructions per clock cycle
1. Branches will arrive up to n times faster in an n-issue
processor
2. Amdahl’s Law => relative impact of the control stalls will
be larger with the lower potential CPI in an n-issue
processor
98
7 Branch Prediction Schemes
1. 1-bit Branch-Prediction Buffer
2. 2-bit Branch-Prediction Buffer
3. Correlating Branch Prediction Buffer
4. Tournament Branch Predictor
99
Dynamic Branch Prediction
100
Dynamic Branch Prediction
(Jim Smith, 1981)
[Figure, shown twice: 4-state FSM with states Predict Taken (strong/weak) and Predict Not Taken (strong/weak); a taken branch (T) moves toward strongly taken, a not-taken branch (NT) toward strongly not taken, so the prediction flips only after two consecutive mispredictions]
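The 2-bit scheme above is a saturating counter; a minimal sketch (table size hypothetical):

```python
class TwoBitPredictor:
    """2-bit saturating counter per branch: states 0,1 predict not-taken,
    states 2,3 predict taken; two wrong guesses needed to flip."""
    def __init__(self, entries=1024):
        self.table = [2] * entries          # start weakly taken

    def predict(self, pc):
        return self.table[pc % len(self.table)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.table)
        c = self.table[i]
        self.table[i] = min(c + 1, 3) if taken else max(c - 1, 0)

bp = TwoBitPredictor()
outcomes = [True] * 8 + [False] + [True] * 8   # loop branch: one NT at loop exit
correct = 0
for t in outcomes:
    correct += (bp.predict(0x40) == t)
    bp.update(0x40, t)
print(correct, "of", len(outcomes))            # only the single exit mispredicts
```

The point of the second bit: the one not-taken exit only weakens the counter, so the branch is still predicted taken when the loop is re-entered.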
102
Correlating Branches
• Idea: the taken/not-taken history of recently executed
branches is related to the behavior of the next branch
(as well as that branch's own history)
– The behavior of recent branches then selects between, say,
4 predictions of the next branch, updating just that prediction
– One of these 4 predictors is selected by the global branch history
• (2,2) predictor: 2-bit global, 2-bit local

[Figure: the branch address (4 bits) selects a row of 2-bit local predictors; the 2-bit global branch history (e.g., 01 = not taken then taken) selects which predictor in the row supplies the prediction]
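A toy sketch of the (2,2) idea (row count hypothetical): the global history picks one of four 2-bit counters per branch-address row. The demo shows it learning a strictly alternating branch, which a single 2-bit counter predicts only about half the time.

```python
class Correlating22:
    """(2,2) predictor: rows indexed by branch address, 4 columns selected
    by the last two global outcomes; each cell is a 2-bit saturating counter."""
    def __init__(self, rows=16):
        self.table = [[1] * 4 for _ in range(rows)]  # start weakly not-taken
        self.ghist = 0                               # 2-bit global history

    def _cell(self, pc):
        return pc % len(self.table), self.ghist

    def predict(self, pc):
        r, c = self._cell(pc)
        return self.table[r][c] >= 2

    def update(self, pc, taken):
        r, c = self._cell(pc)
        v = self.table[r][c]
        self.table[r][c] = min(v + 1, 3) if taken else max(v - 1, 0)
        self.ghist = ((self.ghist << 1) | taken) & 0b11  # shift in outcome

bp = Correlating22()
outcomes = [i % 2 == 0 for i in range(100)]   # strict T, NT, T, NT pattern
hits = 0
for t in outcomes:
    hits += (bp.predict(0x10) == t)
    bp.update(0x10, t)
print(hits, "of 100")   # after two warm-up misses, the alternation is learned
```

Each history value steers the branch to its own counter, so "taken after 10" and "not taken after 01" are learned independently.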
103
Accuracy of Different Schemes
(Figure 3.8, p. 200)
104
Re-evaluating Correlation
105
BHT Accuracy
• Mispredict because either:
– Wrong guess for that branch
– Got branch history of wrong branch when index the table
• 4096-entry table: programs vary from 1% misprediction
(nasa7, tomcatv) to 18% (eqntott), with spice at 9% and
gcc at 12%
• For SPEC92,
4096 entries are about as good as an infinite table
106
Tournament Predictors
108
Tournament Predictor in Alpha 21264
110
Accuracy of Branch Prediction

[Bar chart: prediction accuracy on SPEC benchmarks, three bars per benchmark for profile-based, 2-bit counter, and tournament predictors —
tomcatv 99%/99%/100%, doduc 95%/84%/97%, fpppp 86%/82%/98%,
li 88%/77%/98%, espresso 86%/82%/96%, gcc 88%/70%/94%]

[Line chart: conditional branch misprediction rate (0%–9%) versus predictor size (0–128 Kbits) for local, correlating, and tournament predictors — the tournament predictor is lowest at every size]
113
MIPS Conditional Move
Move Conditional on not Zero
MOVN rd, rs, rt # if rt != 0 then rd <= rs
Move Conditional on Zero
MOVZ rd, rs, rt # if rt = 0 then rd <= rs
also floating point versions
114
MAX function without branches
r1=MAX(r2,r3)
if (r2>r3) r1=r2; else r1=r3;
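The conditional move turns the control dependence into a data dependence. The same select can be written without a branch by materializing the condition as 0/1 and blending — a sketch of the idea behind SLT + MOVN/MOVZ (Python used only for illustration, not MIPS semantics):

```python
def max_branchless(r2, r3):
    """Select without a branch: compute a 0/1 condition, then blend.
    Mirrors SLT producing the condition and MOVN/MOVZ doing the select."""
    cond = int(r2 > r3)                 # like SLT: 1 if r2 > r3, else 0
    return cond * r2 + (1 - cond) * r3  # data flow only, no control flow

print(max_branchless(7, 3))   # 7
print(max_branchless(2, 9))   # 9
```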
115
Pitfall: Sometimes bigger and
dumber is better
• 21264 uses tournament predictor (29 Kbits) with 1K local
predictors
• Earlier 21164 uses a simple 2-bit predictor with 2K entries
(or a total of 4 Kbits)
• SPEC95 benchmarks, 21264 outperforms
– 21264 avg. 11.5 mispredictions per 1000 instructions
– 21164 avg. 16.5 mispredictions per 1000 instructions
• Reversed for transaction processing (TP) !
– 21264 avg. 17 mispredictions per 1000 instructions
– 21164 avg. 15 mispredictions per 1000 instructions
– TP code much larger, and the 21164 holds 2X as many branch
predictions based on local behavior (2K vs. 1K local predictors in the
21264)
116
Limitations of BHTs
Only predicts branch direction; therefore, cannot redirect the
fetch stream until after the branch target is determined.

  A  PC Generation/Mux
  P  Instruction Fetch Stage 1
  F  Instruction Fetch Stage 2
  B  Branch Address Calc/Begin Decode
  I  Complete Decode
  J  Steer Instructions to Functional units
  R  Register File Read
  E  Integer Execute
     (+ another 6 stages in the remainder of the execute pipeline)

Correctly predicted taken-branch penalty: fetch redirected only after stage B.
Jump register (JR) penalty: target known only after register read (stage R).
[Figure: branch predictor (BP) consulted alongside the PC to supply the target]
129
BTB is only for Control Instructions
130
Branch Target Buffer (BTB)

[Figure: 2^k-entry direct-mapped BTB indexed by the I-Cache PC (can also be associative); each entry holds a valid bit, the entry PC (tag), and the predicted target PC]
132
Combining BTB and BHT
• BTB entries are considerably more expensive than BHT, but can
redirect fetches at earlier stage in pipeline and can accelerate
indirect branches (JR)
• BHT can hold many more entries and is more accurate
[Figure: the BTB is consulted at stage A (PC Generation/Mux); the BHT sits at a later stage (B, Branch Address Calc/Begin Decode) and corrects the fetch stream when the BTB misses a predicted-taken branch]
133
Uses of Jump Register (JR)
• Switch statements (jump to address of matching case)
134
Subroutine Return Stack
Small structure to accelerate JR for subroutine returns;
typically much more accurate than BTBs.

  fa() { fb(); }
  fb() { fc(); }
  fc() { fd(); }

Push the call's return address when a function call executes;
pop it when a subroutine return is decoded.

  Stack (k entries, typically k = 8–16):  &fd()  &fc()  &fb()
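The push/pop discipline can be sketched as a small circular stack (the size k and wrap-on-overflow behavior are assumptions; real designs vary):

```python
class ReturnStack:
    """k-entry return-address predictor: push on call, pop on return.
    Overflow silently wraps, as small hardware stacks typically do."""
    def __init__(self, k=8):
        self.buf, self.top, self.k = [0] * k, 0, k

    def push(self, ret_addr):            # on executing a function call
        self.buf[self.top % self.k] = ret_addr
        self.top += 1

    def pop(self):                       # on decoding a subroutine return
        self.top -= 1
        return self.buf[self.top % self.k]

rs = ReturnStack()
rs.push(0x100)   # fa calls fb: return point in fa
rs.push(0x200)   # fb calls fc
rs.push(0x300)   # fc calls fd
print(hex(rs.pop()), hex(rs.pop()), hex(rs.pop()))  # 0x300 0x200 0x100
```

Because calls and returns nest, a LIFO stack predicts the JR target exactly as long as the call depth stays within k.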
135
Mispredict Recovery
Out-of-order execution?
136
In-Order Commit for Precise Exceptions

[Figure: in-order front end → out-of-order execute → in-order commit; an exception detected at commit kills all younger in-flight instructions and injects the handler PC into fetch]
137
Branch Misprediction in Pipeline

[Figure: a misprediction detected at execute kills the younger in-flight instructions — which cannot yet have committed — and injects the correct PC into fetch]
138
Recovering ROB/Renaming Table

[Figure: reorder buffer feeding commit; functional units and load/store units broadcast <tag, result>; a rename table (r1 → ti, r2 → tj, …) maps into a unified register file, with snapshots of the rename table saved at each branch for mispredict recovery]
• One regfile for both committed and speculative values (no data in ROB)
• During decode, instruction result allocated new physical register, source
regs translated to physical regs through rename table
• Instruction reads data from regfile at start of execute (not in decode)
• Write-back updates reg. busy bits on instructions in ROB (assoc. search)
• Snapshots of rename table taken at every branch to recover mispredicts
• On exception, renaming undone in reverse order of issue (MIPS R10000)
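The decode-time renaming described above can be sketched as follows (a simplified model in the spirit of the MIPS R10000 scheme; names and sizes hypothetical): every destination takes a fresh physical register from a free list, and the previous mapping (LPRd in the tables below) is remembered so it can be recycled at commit.

```python
from collections import deque

class Renamer:
    """Unified physical regfile: allocate a fresh physical register per
    destination; the displaced mapping is freed only when the instruction
    commits (so it can be restored on an exception or mispredict)."""
    def __init__(self, n_arch=8, n_phys=16):
        self.table = {f"r{i}": f"P{i}" for i in range(n_arch)}
        self.free = deque(f"P{i}" for i in range(n_arch, n_phys))

    def rename(self, dest, srcs):
        srcs_p = [self.table[s] for s in srcs]   # sources read current mapping
        new_p = self.free.popleft()              # fresh physical destination
        old_p = self.table[dest]                 # LPRd: recycled at commit
        self.table[dest] = new_p
        return new_p, srcs_p, old_p

    def commit(self, old_p):
        self.free.append(old_p)   # the old value is now dead: recycle it

rn = Renamer()
pd, srcs, lprd = rn.rename("r1", ["r3"])   # ld r1, 0(r3)
print(pd, srcs, lprd)                      # P8 ['P3'] P1
rn.commit(lprd)                            # at commit, P1 returns to the free list
```

Keeping LPRd until commit is what makes recovery possible: undoing the rename is just putting old mappings back in reverse order.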
142
Pipeline Design with Physical Regfile

[Figure: PC → Fetch → Decode & Rename → Reorder Buffer → Commit; branch prediction at fetch and branch resolution at execute send kill signals and predictor updates; execution units (Branch Unit, ALU, MEM → D$ via Store Buffer) read and write a unified Physical Reg. File; fetch/decode/rename are in order, execution is out of order, commit is in order]
143
Lifetime of Physical Registers
• Physical regfile holds committed and speculative values
• Physical registers decoupled from ROB entries (no data in ROB)
145
Physical Register Management
[Diagram, stepped one instruction at a time across slides 146-152: rename table, physical register file, free list, and ROB.]
Program being renamed:
ld r1, 0(r3)
add r3, r1, #4
sub r6, r7, r6
add r3, r3, r6
ld r6, 0(r1)
Initial state: rename table R1->P8, R3->P7, R6->P5, R7->P6 (P5=<R6>, P6=<R7>, P7=<R3>, P8=<R1> hold the committed values); free list P0, P1, P3, P2, P4, ...
Renaming proceeds in program order: read the current mappings of the sources, then allocate a fresh physical register for the destination:
ld r1, 0(r3)   ->  ld P0, 0(P7)    (R1: P8 -> P0)
add r3, r1, #4 ->  add P1, P0, #4  (R3: P7 -> P1)
sub r6, r7, r6 ->  sub P3, P6, P5  (R6: P5 -> P3)
add r3, r3, r6 ->  add P2, P1, P3  (R3: P1 -> P2)
ld r6, 0(r1)   ->  ld P4, 0(P0)    (R6: P3 -> P4)
Each ROB entry records use/ex bits, the opcode, presence bits and names of the source physical registers (p1 PR1, p2 PR2), the architectural destination Rd, its previous mapping LPRd, and the newly allocated PRd. On commit, LPRd is returned to the free list: committing the first ld frees P8 (slide 151); committing the first add frees P7 (slide 152).
152
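The free-list bookkeeping in the walkthrough above can be captured in a few lines. A minimal Python sketch of a rename stage (class and method names are illustrative, not from the slides), driven by the slides' example program:

```python
from collections import deque

class RenameStage:
    """Minimal rename stage: architectural -> physical mapping plus a free list.
    (Illustrative sketch, not a cycle-accurate model.)"""
    def __init__(self, mapping, free):
        self.table = dict(mapping)     # e.g. {"r1": "P8", ...}
        self.free = deque(free)        # free physical registers, allocation order
        self.rob = []                  # (dest, old_phys, new_phys) per instruction

    def rename(self, dest, srcs):
        renamed_srcs = [self.table[s] for s in srcs]  # read sources first
        old = self.table[dest]
        new = self.free.popleft()                     # allocate a fresh phys reg
        self.table[dest] = new                        # then overwrite the mapping
        self.rob.append((dest, old, new))
        return new, renamed_srcs

    def commit_oldest(self):
        # Committing frees the *previous* mapping of the destination register.
        dest, old, new = self.rob.pop(0)
        self.free.append(old)
        return old

rs = RenameStage({"r1": "P8", "r3": "P7", "r6": "P5", "r7": "P6"},
                 ["P0", "P1", "P3", "P2", "P4"])
print(rs.rename("r1", ["r3"]))        # ld  r1, 0(r3)  -> ('P0', ['P7'])
print(rs.rename("r3", ["r1"]))        # add r3, r1, #4 -> ('P1', ['P0'])
print(rs.rename("r6", ["r7", "r6"]))  # sub r6, r7, r6 -> ('P3', ['P6', 'P5'])
print(rs.rename("r3", ["r3", "r6"]))  # add r3, r3, r6 -> ('P2', ['P1', 'P3'])
print(rs.commit_oldest())             # committing the ld frees P8
```

Note that the second add of r3 naturally gets a different physical register than the first: WAW and WAR hazards disappear because every write gets a fresh name.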
Reorder Buffer Holds
Active Instruction Window
[Diagram: the ROB as a window over the instruction stream, oldest instructions at the commit end, newest at the fetch end, execution proceeding out of order in between. From cycle t to cycle t + 1 the window slides down by one committed instruction:]
... (older instructions)
ld r1, (r3)      <- commit
add r3, r1, r2
sub r6, r7, r9   <- execute
add r3, r3, r6
ld r6, (r1)
add r6, r6, r3
st r6, (r1)      <- fetch
ld r6, (r1)
(newer instructions) ...
153
Superscalar Register Renaming
• During decode, instructions allocated new physical destination register
• Source operands renamed to physical register with newest value
• Execution unit only sees physical register numbers
156
In-Order Memory Queue
• Execute all loads and stores in program order
=> Load and store cannot leave ROB for execution until
all previous loads and stores have completed
execution
157
Conservative O-o-O Load Execution
st r1, (r2)
ld r3, (r4)
• Split store execution into two phases: address calculation and data write
• The load can execute before the store only once the addresses of all previous uncommitted stores are known and none matches the load address
158
Address Speculation
st r1, (r2)
ld r3, (r4)
• Guess that r4 != r2 and execute the load before the store address is known
• Hold completed but uncommitted load/store addresses in program order; if it later turns out that r4 == r2, squash the load and all following instructions
159
Memory Dependence Prediction
(Alpha 21264)
st r1, (r2)
ld r3, (r4)
• If we later find r4 == r2, squash the load and all following instructions, but mark the load instruction as store-wait
• Subsequent executions of a store-wait load wait until all previous stores have completed
160
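The store-wait policy above amounts to a one-bit predictor per load. A minimal sketch (the data structures are illustrative; the real 21264 keeps a hardware table of store-wait bits that it clears periodically so predictions can adapt):

```python
# Memory dependence prediction a la Alpha 21264: a load that was once caught
# depending on an earlier in-flight store is marked "store-wait" and from then
# on waits for prior stores instead of issuing speculatively.
# (Illustrative sketch; real hardware indexes a bit table by load PC.)

store_wait = set()   # PCs of loads that previously mis-speculated

def should_speculate(load_pc):
    """Issue the load ahead of unresolved stores unless it is marked store-wait."""
    return load_pc not in store_wait

def on_misspeculation(load_pc):
    """Load turned out to conflict with an earlier store: squash and mark it."""
    store_wait.add(load_pc)

assert should_speculate(0x40)      # first encounter: optimistic by default
on_misspeculation(0x40)            # r4 == r2 after all -> squash and mark
assert not should_speculate(0x40)  # future executions wait for prior stores
```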
Speculative Loads / Stores
Just like register updates, stores should not modify
the memory until after the instruction is committed
161
Speculative Store Buffer
[Diagram: a speculative store buffer alongside the L1 data cache. Each entry holds a Valid bit, a Speculative bit, an address Tag, and Data. A load address probes the store buffer and the cache tags in parallel; a store-buffer hit forwards the buffered data in place of the cache's. Committed stores drain from the buffer to the cache over the store commit path.]
• On store execute:
– mark entry valid and speculative, and save data and tag of instruction.
• On store commit:
– clear speculative bit and eventually move data to cache
• On store abort:
– clear valid bit
162
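The three events listed above (store execute, commit, abort) map onto a tiny model. A Python sketch of a store buffer that forwards to loads (the structure is illustrative; real hardware matches address tags associatively):

```python
# Sketch of a speculative store buffer in front of the data cache.
# Loads check the buffer first and must take the *youngest* matching entry.
# (Illustrative model: the "L1 cache" is just a dict.)

class SpeculativeStoreBuffer:
    def __init__(self, cache):
        self.entries = []          # oldest first: [addr, data, speculative]
        self.cache = cache

    def store_execute(self, addr, data):
        self.entries.append([addr, data, True])   # mark valid and speculative

    def store_commit(self):
        addr, data, _ = self.entries.pop(0)       # oldest store commits
        self.cache[addr] = data                   # clear spec bit, drain to cache

    def store_abort(self):
        self.entries.pop()                        # youngest store squashed

    def load(self, addr):
        for a, d, _ in reversed(self.entries):    # youngest match wins
            if a == addr:
                return d
        return self.cache.get(addr)

cache = {0x100: 1}
ssb = SpeculativeStoreBuffer(cache)
ssb.store_execute(0x100, 2)
ssb.store_execute(0x100, 3)
assert ssb.load(0x100) == 3       # forward youngest buffered store, not cache
ssb.store_commit()                # oldest store drains to cache
assert cache[0x100] == 2
ssb.store_abort()                 # squash the remaining speculative store
assert ssb.load(0x100) == 2       # back to the committed value
```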
Speculative Store Buffer
[Same datapath as the previous slide.]
• If a load address hits in both the speculative store buffer and the data cache, use the store buffer's data: it is younger than the cache's
• If the same address appears twice in the store buffer, forward from the youngest entry
163
Datapath: Branch Prediction
and Speculative Execution
[Datapath diagram: PC -> Fetch -> Decode & Rename -> Reorder Buffer -> Commit, with the register file feeding the Branch Unit, ALU, MEM/D$, and Store Buffer in the execute stage. Branch prediction steers fetch; branch resolution kills wrong-path instructions throughout the pipeline and updates the predictors.]
164
Dynamic Branch Prediction Summary
167
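The workhorse of the dynamic schemes summarized here is the two-bit saturating counter, indexed by branch PC. A minimal Python sketch (the table size and modulo indexing are illustrative):

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters: 0-1 predict not-taken, 2-3 taken.
    (Illustrative sketch; real predictors hash the PC and may add history bits.)"""
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [1] * entries          # start weakly not-taken

    def predict(self, pc):
        return self.table[pc % self.entries] >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

bp = TwoBitPredictor()
for _ in range(3):
    bp.update(0x80, True)        # loop branch trains to strongly taken
assert bp.predict(0x80)
bp.update(0x80, False)           # single loop exit...
assert bp.predict(0x80)          # ...does not flip the prediction (hysteresis)
```

The two-bit hysteresis is what lets a loop branch mispredict only once per loop visit instead of twice.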
Getting CPI < 1:
Issuing Multiple Instructions/Cycle
• Vector Processing: Explicit coding of independent loops as
operations on large vectors of numbers
– Multimedia instructions being added to many processors
• Superscalar: varying no. instructions/cycle (1 to 8),
scheduled by compiler or by HW (Tomasulo)
– IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4
• (Very) Long Instruction Words (V)LIW:
fixed number of instructions (4-16) scheduled by the
compiler; put ops into wide templates (TBD)
– Intel Architecture-64 (IA-64) 64-bit address
» Renamed: “Explicitly Parallel Instruction Computer (EPIC)”
– Will discuss in a few lectures
• The anticipated success of multiple issue led to quoting Instructions Per
Clock cycle (IPC) instead of CPI
168
Getting CPI < 1: Issuing
Multiple Instructions/Cycle
• Superscalar MIPS: 2 instructions, 1 FP & 1 integer
– Fetch 64-bits/clock cycle; Int on left, FP on right
– Can only issue 2nd instruction if 1st instruction issues
– More ports for FP registers to do FP load & FP op in a pair
Type              Pipe Stages
Int. instruction  IF ID EX MEM WB
FP instruction    IF ID EX MEM WB
Int. instruction     IF ID EX MEM WB
FP instruction       IF ID EX MEM WB
Int. instruction        IF ID EX MEM WB
FP instruction          IF ID EX MEM WB
• 1 cycle load delay expands to 3 instructions in SS
– instruction in right half can’t use it, nor instructions in next slot
169
Multiple Issue Issues
170
Multiple Issue Challenges
• While Integer/FP split is simple for the HW, get CPI of 0.5 only
for programs with:
– Exactly 50% FP operations AND No hazards
• If more instructions issue at same time, greater difficulty of
decode and issue:
– Even 2-scalar => examine 2 opcodes and 6 register specifiers, and decide if 1 or 2
instructions can issue (N-issue needs ~O(N^2) comparisons)
– Register file: x-way issue needs 2x read ports and x write ports per cycle
– Rename logic: must be able to rename same register multiple times in one
cycle! For instance, consider 4-way issue:
add r1, r2, r3 add p11, p4, p7
sub r4, r1, r2 sub p22, p11, p4
lw r1, 4(r4) lw p23, 4(p22)
add r5, r1, r2 add p12, p23, p4
Imagine doing this transformation in a single cycle!
– Result buses: Need to complete multiple instructions/cycle
» So, need multiple buses with associated matching logic at every
reservation station.
» Or, need multiple forwarding paths
171
Dynamic Scheduling in Superscalar
The easy way
• How to issue two instructions and keep in-order
instruction issue for Tomasulo?
– Assume 1 integer + 1 floating point
– 1 Tomasulo control for integer, 1 for floating point
• Issue at 2X the clock rate, so that issue remains in order
• Only loads/stores might cause dependency between
integer and FP issue:
– Replace load reservation station with a load queue;
operands must be read in the order they are fetched
– Load checks addresses in Store Queue to avoid RAW violation
– Store checks addresses in Load Queue to avoid WAR,WAW
172
How much to speculate?
174
Limits to ILP
175
Limits to ILP
Initial HW Model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:
1. Register renaming – infinite virtual registers
=> all register WAW & WAR hazards are avoided
2. Branch prediction – perfect; no mispredictions
3. Jump prediction – all jumps perfectly predicted
2 & 3 => machine with perfect speculation & an
unbounded buffer of instructions available
4. Memory-address alias analysis – addresses are known
& a store can be moved before a load provided
addresses not equal
Also:
unlimited number of instructions issued/clock cycle;
perfect caches;
1 cycle latency for all instructions (even FP multiply and divide);
176
Upper Limit to ILP: Ideal Machine
(H&P-3ed Figure 3.35, page 242)
[Bar chart, instruction issues per cycle on the ideal machine:]
gcc       54.8
espresso  62.6
li        17.9
fpppp     75.2
doducd    118.7
tomcatv   150.1
FP programs: roughly 75 - 150 IPC; integer programs: roughly 18 - 63 IPC
177
More Realistic HW: Branch Impact
[Bar charts for progressively more realistic models. With realistic branch prediction, integer programs fall to roughly 6 - 12 IPC; limiting the number of rename registers brings them to roughly 5 - 15, and with few registers and no heap values renamed, to roughly 4 - 9. With a 1K-entry tournament predictor, a 16-entry return predictor, 64 extra rename registers, and issue limited only by the window, FP programs reach roughly 8 - 45 IPC.]
182
Moore's Law & Processor Speed Over
Time
185
Timeline
187
P6 Pipeline: PentiumPro, M, Core Dual
• Note
– translation (renaming) CISC instructions to RISC uops
– out of order execution
– in order graduation for precise interrupts and
correct speculative execution
188
Pentium 4 Integer ALU
• Operates 2x clock rate
(e.g. 3.3GHz processor -> 6.6GHz integer ALU)
• Serialize data-dependent instructions, yet issue and
graduate together
DADD r1,r1,r2
DADD r1,r1,r3 #issued with instruction above
DADD r1,r1,r4
DADD r1,r1,r5 #second cycle, issued with above
189
Instructions to micro-ops
190
Functional Unit Utilization Low
• 5 FUs
• Potential for 3 uops to complete per cycle
• Yet zero uops complete in about half of all cycles
191
Endosymbiotic theory
• Eukaryotic cells engulf / are invaded by bacteria
3.2 billion years ago, become organelles (mitochondria,
chloroplasts), forming a symbiotic union
– mitochondria are the powerhouses in our cells
192
Head to Head ILP competition
[Table comparing the processors head to head: Microarchitecture, Fetch/Issue/Execute width, Functional Units, Clock Rate (GHz), Transistors and Die Size, Power.]
193
Performance on SPECint2000
194
Performance on SPECfp2000
195
Normalized Performance: Efficiency
Rank (1 = best)  Itanium 2  Pentium 4  Athlon  Power5
Int/Trans            4          2         1       3
FP/Trans             4          2         1       3
Int/area             4          2         1       3
FP/area              4          2         1       3
Int/Watt             4          3         1       2
FP/Watt              2          4         3       1
196
No Silver Bullet for ILP
• No obvious over all leader in performance
• The AMD Athlon leads on SPECInt performance
followed by the Pentium 4, Itanium 2, and Power5
• Itanium 2 and Power5, which perform similarly on
SPECFP, clearly dominate the Athlon and Pentium 4 on
SPECFP
• Itanium 2 is the most inefficient processor both for Fl.
Pt. and integer code for all but one efficiency measure
(SPECFP/Watt)
• Athlon and Pentium 4 both make good use of transistors
and area in terms of efficiency
• IBM Power5 is the most effective user of energy on
SPECFP and essentially tied on SPECINT
197
Limits to ILP
• Doubling issue rates above today’s 3-6 instructions per
clock, say to 6 to 12 instructions, probably requires a
processor to
– Issue 3 or 4 data memory accesses per cycle,
– Resolve 2 or 3 branches per cycle,
– Rename and access more than 20 registers per cycle, and
– Fetch 12 to 24 instructions per cycle.
• Complexities of implementing these capabilities likely
means sacrifices in maximum clock rate
– E.g, widest issue processor is the Itanium 2, but it also has the slowest
clock rate, despite the fact that it consumes the most power!
198
Limits to ILP
• Most techniques for increasing performance increase power
consumption
• The key question is whether a technique is energy efficient: does
it increase power consumption faster than it increases
performance?
• Multiple-issue techniques are all energy inefficient:
1. Issuing multiple instructions incurs overhead in logic that grows
faster than the issue rate grows
2. There is a growing gap between peak issue rate and sustained performance
• The number of transistors switching scales with the peak issue rate, while
performance scales with the sustained rate; the growing gap between peak and
sustained performance means increasing energy per unit of performance
199
Recall from Pipelining
200
Ideas to Reduce Stalls
Technique                                   Reduces
Dynamic (HW):
  Dynamic scheduling                        Data hazard stalls
  Dynamic branch prediction                 Control stalls
  Issuing multiple instructions per cycle   Ideal CPI
  Speculation                               Data and control stalls
  Dynamic memory disambiguation             Data hazard stalls involving memory
Static (Compiler):
  Loop unrolling                            Control hazard stalls
  Basic compiler pipeline scheduling        Data hazard stalls
  Compiler dependence analysis              Ideal CPI and data hazard stalls
  Software pipelining and trace scheduling  Ideal CPI and data hazard stalls
  Compiler speculation                      Ideal CPI, data and control stalls
201
Review Data Dependence and Hazards
Data Dependence Potential Hardware Hazard
202
Static Branch Prediction Performance
203
Recall: Branch Impact
[Chart recap: with realistic branch prediction, integer programs reach roughly 6 - 12 IPC]
205
FP Loop: Where are the Hazards?
• Source loop: for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s;
• First translate into MIPS code:
– To simplify, assume 8 is the lowest address
206
FP Loop Showing Stalls
1 Loop: L.D F0,0(R1) ;F0=vector element
2 stall
3 ADD.D F4,F0,F2 ;add scalar in F2
4 stall
5 stall
6 S.D F4,0(R1) ;store result
7 DSUBUI R1,R1,8 ;decrement pointer 8B (DW)
8 BNEZ R1,Loop ;branch R1!=zero
9 NOP ;delayed branch slot
Instruction Instruction Latency in
producing result using result clock cycles
FP ALU op Another FP ALU op 3
FP ALU op Store double 2
Load double FP ALU op 1
This loop takes 9 clock cycles per iteration. Rescheduling it (DSUBUI into the load-delay slot, S.D into the branch-delay slot with its offset changed to 8(R1)) gets it down to 6 clock cycles: 3 doing real work, 3 for loop overhead and stall. How can we make it faster?
208
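The stall accounting can be checked mechanically. A small Python sketch (a deliberately simplified single-issue, in-order model; the instruction encoding and latency table are illustrative) reproduces the 9 cycles of the numbered listing above:

```python
# Count cycles for an in-order, single-issue pipeline given the extra latency
# between a producer and a consuming instruction class (the latency table above).
# Instruction = (op, dest, sources). (Illustrative sketch; ignores structural
# hazards and models initial register values as available at cycle 0.)

latency = {("L.D", "ADD.D"): 1,      # load double  -> FP ALU op
           ("ADD.D", "S.D"): 2,      # FP ALU op    -> store double
           ("ADD.D", "ADD.D"): 3}    # FP ALU op    -> another FP ALU op

def schedule(instrs):
    ready = {}                        # reg -> (cycle produced, producer op)
    cycle = 0
    for op, dest, srcs in instrs:
        cycle += 1                    # issue one instruction per cycle
        for s in srcs:
            if s in ready:            # stall until the source is usable
                prod_cycle, prod_op = ready[s]
                cycle = max(cycle, prod_cycle + 1 + latency.get((prod_op, op), 0))
        if dest:
            ready[dest] = (cycle, op)
    return cycle

loop = [("L.D",    "F0", []),
        ("ADD.D",  "F4", ["F0"]),
        ("S.D",    None, ["F4"]),
        ("DSUBUI", "R1", ["R1"]),
        ("BNEZ",   None, ["R1"]),
        ("NOP",    None, [])]
assert schedule(loop) == 9            # matches the stalled listing above
```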
Unroll Loop Four Times
(straightforward way)
1  Loop: L.D   F0,0(R1)     ; 1-cycle stall after each L.D
3        ADD.D F4,F0,F2     ; 2-cycle stall after each ADD.D
6        S.D   F4,0(R1)     ; drop DSUBUI & BNEZ
7        L.D   F6,-8(R1)
9        ADD.D F8,F6,F2
12       S.D   F8,-8(R1)    ; drop DSUBUI & BNEZ
13       L.D   F10,-16(R1)
15       ADD.D F12,F10,F2
18       S.D   F12,-16(R1)  ; drop DSUBUI & BNEZ
19       L.D   F14,-24(R1)
21       ADD.D F16,F14,F2
24       S.D   F16,-24(R1)
25       DSUBUI R1,R1,#32   ; alter to 4*8
26       BNEZ  R1,LOOP
27       NOP
27 clock cycles, or 6.75 per iteration. Can we rewrite the loop to minimize the stalls?
210
Unrolled Loop That Minimizes Stalls
1  Loop: L.D   F0,0(R1)
2        L.D   F6,-8(R1)
3        L.D   F10,-16(R1)
4        L.D   F14,-24(R1)
5        ADD.D F4,F0,F2
6        ADD.D F8,F6,F2
7        ADD.D F12,F10,F2
8        ADD.D F16,F14,F2
9        S.D   F4,0(R1)
10       S.D   F8,-8(R1)
11       S.D   F12,-16(R1)
12       DSUBUI R1,R1,#32
13       BNEZ  R1,LOOP
14       S.D   8(R1),F16    ; 8-32 = -24, in the branch-delay slot
14 clock cycles, or 3.5 per iteration.
What assumptions were made when the code was moved?
– OK to move the store past DSUBUI even though the store uses the register DSUBUI changes (compensate by adjusting the offset)
– OK to move loads before stores: do they get the right data?
– When is it safe for the compiler to make such changes?
211
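The same unrolling transformation applies in any language. A Python sketch of x[i] += s unrolled by four, with a cleanup loop for trailing elements (the function name and data are illustrative; the point is the transformation, not a Python speedup):

```python
def add_scalar_unrolled(x, s):
    """x[i] += s, unrolled by 4: one iteration of loop overhead per 4 elements,
    plus a cleanup loop for the leftovers. (Illustrative sketch.)"""
    n = len(x)
    i = 0
    while i + 4 <= n:                 # unrolled body: 4 independent updates
        x[i]     += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += 4
    while i < n:                      # cleanup when n is not a multiple of 4
        x[i] += s
        i += 1
    return x

assert add_scalar_unrolled([1, 2, 3, 4, 5, 6], 10) == [11, 12, 13, 14, 15, 16]
```

The four updates in the unrolled body are independent, which is exactly what gives a scheduler (hardware or compiler) room to overlap them.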
Compiler Perspectives on Code Movement
• Compiler concerned about dependencies in program
• Existence of a Hardware hazard depends on pipeline
• Try to schedule to avoid hazards that cause performance losses
• (True) Data dependencies (RAW)
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data dependent
on instruction i.
• If dependent, can’t execute in parallel
• Easy to determine for registers (fixed names)
• Hard for memory (“memory disambiguation” problem):
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
212
Where are the name dependencies?
1 Loop:L.D F0,0(R1)
3 ADD.D F4,F0,F2
6 S.D 0(R1),F4
7 L.D F0,-8(R1)
9 ADD.D F4,F0,F2
12 S.D -8(R1),F4
13 L.D F0,-16(R1)
15 ADD.D F4,F0,F2
18 S.D -16(R1),F4
19 L.D F0,-24(R1)
21 ADD.D F4,F0,F2
24 S.D -24(R1),F4
25 DSUBUI R1,R1,#32
26 BNEZ R1,LOOP
27 NOP
213
Where are the name dependencies?
1 Loop:L.D F0,0(R1)
3 ADD.D F4,F0,F2
6 S.D 0(R1),F4
7 L.D F0,-8(R1)
9 ADD.D F4,F0,F2
12 S.D -8(R1),F4
13 L.D F0,-16(R1)
15 ADD.D F4,F0,F2
18 S.D -16(R1),F4
19 L.D F0,-24(R1)
21 ADD.D F4,F0,F2
24 S.D -24(R1),F4
25 DSUBUI R1,R1,#32
26 BNEZ R1,LOOP
27 NOP
1 Loop:L.D F0,0(R1)
3 ADD.D F4,F0,F2
6 S.D 0(R1),F4
7 L.D F6,-8(R1)
9 ADD.D F8,F6,F2
12 S.D -8(R1),F8
13 L.D F10,-16(R1)
15 ADD.D F12,F10,F2
18 S.D -16(R1),F12
19 L.D F14,-24(R1)
21 ADD.D F16,F14,F2
24 S.D -24(R1),F16
25 DSUBUI R1,R1,#32
26 BNEZ R1,LOOP
27 NOP
The name dependencies on F0 and F4 in the first listing are removed in the second by register renaming: F0 -> F6, F10, F14 and F4 -> F8, F12, F16.
216
VLIW: Very Large Instruction Word
• Each “instruction” has explicit coding for multiple
operations
– In IA-64, grouping called a “bundle”
– In Transmeta, grouping called a “molecule” (with “atoms” as ops)
• Tradeoff instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction
word are independent => execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
» 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
– Need compiling technique that schedules across several branches
217
Classic VLIW
• Compiler responsible for instruction scheduling, w.r.t.
instruction latencies
• Insert NOPs where there would be stalls => large code size
• One operation per functional unit
• Any pipeline tweak requires new compilation
• Unroll loops to get ILP
218
Example of a VLIW Architecture: IA-64.
Suggested Reading for Interest
219
IA-64 Instruction Group
220
IA64 Instructions & Registers
• 128 registers – 7-bit register addresses
– 128 x 65-bit integer registers (64 data bits plus the NaT bit)
– 128 x 82-bit FP registers
• 3 operand instructions
• 14 bit opcode
• 6 bit predicate
221
Instruction Bundles
222
Bundles
{ .mii // template: one memory op, two integer ops
ld4 r28=[r8] // Load a 4-byte value
add r9=2,r1 // 2+r1 and put in r9
add r30=1,r1 // 1+r1 and put in r30
}
223
Templates
225
Control Speculation
226
Control Speculation
227
Ambiguous Memory Dependencies
228
Data Speculation
229
Moving Up Loads + Uses: Recovery Code
recover:
ld8 r6 = [r8] ;; // Reload r6 from [r8]
add r5 = r6,r7 // Re-execute the add
br back // Jump back to main code
230
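The advanced-load / check pattern can be sketched abstractly: the hoisted load records its address in an ALAT-like table, intervening stores invalidate matching entries, and the check at the original load site either succeeds cheaply or sends control to recovery code. This Python analogue is illustrative, not actual IA-64 semantics:

```python
# Abstract sketch of data speculation: ld.a records the load address in an
# ALAT-like table, stores invalidate matching entries, and chk.a decides
# whether the speculated value is still good or recovery must run.
# (Illustrative model: memory is a dict, the ALAT maps reg -> address.)

memory = {0x200: 7}
alat = {}

def ld_advanced(reg, addr):
    alat[reg] = addr        # remember where we speculatively loaded from
    return memory[addr]

def store(addr, value):
    memory[addr] = value
    for reg, a in list(alat.items()):
        if a == addr:       # a store hit the speculated address
            del alat[reg]

def chk_advanced(reg, addr):
    """True if the speculation survived; False means run recovery code."""
    return alat.get(reg) == addr

v = ld_advanced("r6", 0x200)           # hoisted load
store(0x300, 1)                        # unrelated store: speculation survives
assert chk_advanced("r6", 0x200) and v == 7
store(0x200, 9)                        # aliasing store invalidates the entry
assert not chk_advanced("r6", 0x200)   # must reload and re-run dependent uses
```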
ld.c, chk.a and the ALAT
231
ld.c, chk.a and the ALAT
232
Not a Thing (NaT)
233
If-conversion
Original:                         Predicated (if-converted):
if (r4) {                         cmp.ne p1, p0 = r4, 0 ;;  // set predicate
    add r1 = r2, r3               (p1) add r1 = r2, r3
    ld8 r6 = [r5]                 (p1) ld8 r6 = [r5]
}

if (r1)                           cmp.ne p1, p2 = r1, 0 ;;  // set predicates
    r2 = r3 + r4                  (p1) add r2 = r3, r4
else
    r7 = r6 - r5                  (p2) sub r7 = r6, r5
234
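The effect of predication can be mimicked in plain software by computing the predicates first and letting them select which results are kept. A Python sketch of the if/else example above (function and variable names are illustrative):

```python
def if_converted(r1, r3, r4, r5, r6, r2, r7):
    """Branch-free rendering of: if (r1) r2 = r3 + r4; else r7 = r6 - r5.
    Both operations are present in the instruction stream; the predicate
    selects which result is kept. (Illustrative sketch of the idea.)"""
    p1 = (r1 != 0)                   # cmp.ne p1, p2 = r1, 0
    p2 = not p1
    r2 = (r3 + r4) if p1 else r2     # (p1) add r2 = r3, r4
    r7 = (r6 - r5) if p2 else r7     # (p2) sub r7 = r6, r5
    return r2, r7

assert if_converted(1, 3, 4, 0, 0, 99, 88) == (7, 88)   # p1 true: only r2 updated
assert if_converted(0, 3, 4, 5, 9, 99, 88) == (99, 4)   # p2 true: only r7 updated
```

Both arms occupy issue slots regardless of the condition, which is the trade-off: no branch misprediction, but wasted work on the nullified path.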
Trace Scheduling
• Two steps:
– Trace Selection
» Find likely sequence of basic blocks (trace)
of (statically predicted or profile predicted)
long sequence of straight-line code
– Trace Compaction
» Squeeze trace into few VLIW instructions
» Need bookkeeping code in case prediction is wrong
• This is a form of compiler-generated speculation
– Compiler must generate recovery code to handle cases in which execution does
not go according to speculation.
– Needs extra registers: undo bad guesses by discarding unused results
• Subtle compiler bugs may result in wrong answer:
no hardware speculation
235
Superscalar v. VLIW
Superscalar advantages:
• Smaller code size
• Binary compatibility across generations of hardware
VLIW advantages:
• Simplified hardware for decoding and issuing instructions
• No interlock hardware (compiler checks?)
• More registers, but simplified hardware for register ports
(multiple independent register files?)
236
Problems with First Generation VLIW
237
Intel/HP IA-64 “Explicitly Parallel
Instruction Computer (EPIC)”
• IA-64: instruction set architecture; EPIC is type
– EPIC = 2nd generation VLIW?
• Itanium™ is name of first implementation (2001)
– Highly parallel and deeply pipelined hardware: 6-wide, 10-stage pipeline
at 800MHz on a 0.18 µm process
• 128 64-bit integer registers + 128 82-bit floating point registers
– Not separate register files per functional unit as in old VLIW
• Hardware checks dependencies
(interlocks => binary compatibility over time)
• Predicated execution (select 1 out of 64 1-bit flags)
=> 40% fewer mispredictions?
238
3rd Generation Itanium
ISSCC abstract: 14.4 A 1.5GHz Third Generation Itanium® Processor.
J. Stinson, S. Rusu (Intel, Santa Clara, CA)
A third-generation 1.5GHz Itanium® processor implements the Explicitly Parallel
Instruction Computing (EPIC) architecture and features an on-die 6MB, 24-way
set-associative L3 cache. The 374mm2 die contains 410M transistors and is
implemented in a dual-VT 0.13µm technology with 6-level Cu interconnects
and FSG dielectric, and dissipates 130W.
• 1.5 GHz
• 410 million transistors
• 6MB 24-way set associative L3 cache
• 6-level copper interconnect, 0.13 micron
• 130W (i.e. lasts 17s on an AA NiCd)
239
Comments on Itanium
• Remarkably, the Itanium has many of the features
more commonly associated with the dynamically-
scheduled pipelines
– strong emphasis on branch prediction, register renaming,
scoreboarding, a deep pipeline with many stages before
execution (to handle instruction alignment, renaming, etc.), and
several stages following execution to handle exception detection
• Surprising that an approach whose goal is to rely on
compiler technology and simpler HW seems to be at
least as complex as dynamically scheduled
processors!
240
Performance of IA-64 Itanium
(Source: Microprocessor Report Jan 2002)
• ITANIUM (800 MHz):
• SPECint2000(base): 358
• SPECfp2000(base): 703
• POWER4 (1.3 GHz):
• SPECint2000(base): 790
• SPECfp2000(base): 1,098
• SUN UltraSPARC III (1.05 GHz)
• SPECint2000(base): 537
• SPECfp2000(base): 701
241
Summary#1: Hardware versus Software
Speculation Mechanisms
242
Summary#2: Hardware versus Software
Speculation Mechanisms cont’d
• Compiler-based approaches may benefit from the
ability to see further in the code sequence, resulting in
better code scheduling
• HW-based speculation with dynamic scheduling does
not require different code sequences to achieve good
performance for different implementations of an
architecture
– may be the most important in the long run?
243
Summary #3: Software Scheduling
244
Recall Forms of Parallelism
• Pipelining (Appendix A)
• ILP – Dynamically Scheduled (hardware)
• ILP – Statically Scheduled (involving the compiler)
245
ILP
• Limits to ILP (power efficiency, compilers, dependencies
…) seem to limit to 3 to 6 issue for practical options
• Explicitly parallel (Data level parallelism or Thread level
parallelism) is next step to performance
• Coarse-grain vs. fine-grained multithreading
– Only on big stalls vs. every clock cycle
• Simultaneous Multithreading is fine-grained multithreading
based on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse rename registers
• Balance of ILP and TLP decided in marketplace
246
Commentary
• Itanium architecture does not represent a significant
breakthrough in scaling ILP or in avoiding the problems of
complexity and power consumption
• Instead of pursuing more ILP, architects are increasingly
focusing on TLP implemented with single-chip multiprocessors
• In 2000, IBM announced the 1st commercial single-chip,
general-purpose multiprocessor, the Power4, which contains 2
Power3 processors and an integrated L2 cache
– Since then, Sun Microsystems, AMD, and Intel have switched to a focus on
single-chip multiprocessors rather than more aggressive uniprocessors.
• Right balance of ILP and TLP is unclear today
– Perhaps right choice for server market, which can exploit more TLP, may
differ from desktop, where single-thread performance may continue to be a
primary requirement
247
And in conclusion …
• Limits to ILP (power efficiency, compilers, dependencies
…) seem to limit to 3 to 6 issue for practical options
• Explicitly parallel (Data level parallelism or Thread level
parallelism) is next step to performance
• Coarse-grain vs. fine-grained multithreading
– Only on big stalls vs. every clock cycle
• Simultaneous Multithreading is fine-grained multithreading
based on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse rename registers
• Itanium/EPIC/VLIW is not a breakthrough in ILP
• Balance of ILP and TLP unclear in marketplace
248
Knowing When to Give Up
The more instruction level parallelism you use, the more
you pay for each incremental gain in performance
249
Beyond ILP
Thread Level Parallelism
• Diminishing returns for finding ILP in code that was
designed to be sequential
250
TLP Architectures
• Tera Computer
– Unique instruction set
– Compute engine (supercomputer)
– Many threads, ~1000
– No data cache: when one thread stalls for a memory access, just get on with another thread
• Intel Xeon, Pentium 4 Hyperthreading (HT)
– IA-32 instruction set
– Two threads, ~40% utilization increase
Xeon:
– Server applications
– Support for cache coherence to build parallel processor systems
– L1 uop trace cache 8KB, L2 512KB, L3 2MB on chip (max)
• Eleven Engineering XiNc (8-hardware-thread communications processor)
251