Pipeline Hazards
Hazards
Lecture 7 - 1
Pipelining Lessons
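[Figure: pipelined task timeline. Horizontal axis: time, from 6 PM to 9 PM, with segment lengths of 20, 30, and 40; vertical axis: task order (tasks B, C, and D are recoverable).]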
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
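The lessons above can be checked with a little arithmetic. The sketch below is a simplified model, not part of the original slides: the stage times and task count are hypothetical, and it assumes every task after the first completes one slowest-stage time after its predecessor.

```python
def sequential_time(stage_times, n_tasks):
    """Total time when each task finishes all stages before the next starts."""
    return n_tasks * sum(stage_times)


def pipelined_time(stage_times, n_tasks):
    """Total time when tasks overlap.

    Simplified model: the first task takes sum(stage_times) to pass through
    the pipeline, and every later task completes one slowest-stage time
    after the previous one.
    """
    return sum(stage_times) + (n_tasks - 1) * max(stage_times)


stages = [30, 40, 20]  # hypothetical stage times, arbitrary units
tasks = 4              # hypothetical number of tasks

seq = sequential_time(stages, tasks)
pipe = pipelined_time(stages, tasks)
print(seq, pipe, round(seq / pipe, 2))  # 360 210 1.71
```

The latency of a single task is still `sum(stages)` in both cases. The speedup of about 1.71 falls short of the stage count of 3 because the stages are unbalanced and the pipeline has to fill and drain.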
A common solution is to stall the pipeline until the hazard is resolved, inserting one or more bubbles (idle clock cycles) into the pipeline.
[Figures: pipeline timing diagrams in which successive instructions, listed in instruction order, pass through the Mem, Reg, ALU, Mem, and Reg stages (labeled IM, Reg, ALU, DM, and Reg in the last diagram).]
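\[
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]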
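With \(\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall CPI}\):

\[
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]

With an ideal CPI of 1:

\[
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]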
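\[
\text{SpeedUp}_A = \frac{\text{Pipeline Depth}}{1 + 0} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline Depth}
\]

\[
\text{SpeedUp}_B = \frac{\text{Pipeline Depth}}{1 + 0.2} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.05} = \frac{\text{Pipeline Depth}}{1.2} \times 1.05 = 0.875 \times \text{Pipeline Depth}
\]

\[
\frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline Depth}}{0.875 \times \text{Pipeline Depth}} \approx 1.14
\]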
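The comparison of the two machines can be reproduced numerically. This is a minimal sketch assuming an ideal CPI of 1; the function name is hypothetical, the depth of 5 is an arbitrary placeholder that cancels out of the ratio, and the stall CPI of 0.2 is read off the 1.2 denominator used in the worked example.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup over the unpipelined machine, assuming an ideal CPI of 1.

    clock_ratio = unpipelined clock cycle / pipelined clock cycle.
    """
    return depth / (1.0 + stall_cpi) * clock_ratio


depth = 5  # hypothetical; cancels out of the A/B ratio
speedup_a = pipeline_speedup(depth, stall_cpi=0.0)
speedup_b = pipeline_speedup(depth, stall_cpi=0.2, clock_ratio=1.05)

print(round(speedup_b / depth, 3))      # fraction of the depth machine B achieves
print(round(speedup_a / speedup_b, 2))  # how much faster machine A is
```

Carrying the intermediate value 0.875 through without rounding gives a ratio of about 1.14, so machine A stays ahead even though machine B has the faster clock.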
[Figure: pipeline diagram (Mem, Reg, ALU, Mem, Reg) of a data hazard on R1; the one instruction recovered from the figure is XOR R10,R11,R1, which reads R1.]
[Figure: one instruction stores into register Ri and a later instruction reads from Ri.]
[Figure: pipeline diagram (Mem, Reg, ALU, Mem, Reg) for instructions that use R1.]
WAR (Write After Read): Instr j tries to write an operand before Instr i reads it.
Instr i: ADD R1, R2, R3
Instr j: LW R2, 0(R5)
Can't happen in the DLX 5-stage pipeline because:
- All instructions take 5 stages,
- Reads are always in stage 2, and
- Writes are always in stage 5.
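WAW (Write After Write): Instr j tries to write an operand before Instr i writes it.
Can't happen in the DLX 5-stage pipeline because:
- All instructions take 5 stages, and
- Writes are always in stage 5.
WAR and WAW hazards will appear in later, more complicated pipelines.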
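The three dependence types (RAW, WAR, WAW) can be told apart just by comparing register names. The sketch below is illustrative only: the function name and the `(dest, sources)` tuple encoding are hypothetical. It reports dependences between two instructions; whether a dependence becomes an actual hazard depends on the pipeline, which is why WAR and WAW cannot occur in the DLX 5-stage pipeline.

```python
def classify_dependences(first, second):
    """Name the register dependences between an earlier and a later instruction.

    Each instruction is a (dest, sources) pair: dest is a register name or
    None, and sources is a set of register names.
    """
    dest_i, srcs_i = first
    dest_j, srcs_j = second
    found = []
    if dest_i is not None and dest_i in srcs_j:
        found.append("RAW")  # j reads what i writes
    if dest_j is not None and dest_j in srcs_i:
        found.append("WAR")  # j writes what i reads
    if dest_i is not None and dest_i == dest_j:
        found.append("WAW")  # both write the same register
    return found


add = ("R1", {"R2", "R3"})  # Instr i: ADD R1, R2, R3
lw = ("R2", {"R5"})         # Instr j: LW  R2, 0(R5)
print(classify_dependences(add, lw))  # ['WAR']
```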
[Figure: datapath showing the D/A, A/M, and M/W buffers, the ALU input MUX, and Data Memory.]
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the following sequence, in which SUB, AND, and OR all read the R1 loaded by the first instruction.]
LOAD R1,0(R2)
SUB R4,R1,R6
AND R6,R1,R7
OR R8,R1,R9
[Figure: a second pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence LOAD R1,0(R2); SUB R4,R1,R6; AND R6,R1,R7; OR R8,R1,R9.]
Stall
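A stall of this kind can be detected by comparing each load's destination with the sources of the next instruction. The sketch below is a simplified illustration, not the DLX interlock logic: the function name and tuple encoding are hypothetical, and it assumes a 5-stage pipeline with forwarding in which only the instruction immediately after a load has to wait one cycle.

```python
def count_load_use_stalls(program):
    """Count one-cycle stalls for loads followed directly by a consumer.

    program is a list of (opcode, dest, sources) tuples. Only adjacent
    pairs are checked, on the assumption that forwarding covers every
    other consumer of the loaded value.
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_srcs = curr
        if prev_op == "LOAD" and prev_dest in curr_srcs:
            stalls += 1
    return stalls


program = [
    ("LOAD", "R1", {"R2"}),       # LOAD R1,0(R2)
    ("SUB", "R4", {"R1", "R6"}),  # SUB  R4,R1,R6
    ("AND", "R6", {"R1", "R7"}),  # AND  R6,R1,R7
    ("OR", "R8", {"R1", "R9"}),   # OR   R8,R1,R9
]
print(count_load_use_stalls(program))  # 1
```

Under these assumptions only the SUB stalls; AND and OR read R1 late enough to receive it without waiting.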
[Figure: pipelined datapath with Instr. Memory, Reg File, Sign Ext (16 to 32 bits), the Zero? test, MUX, Data Memory, and LMD, separated by the F/D, D/A, A/M, and M/W buffers; the WB Stage is labeled.]
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg; stages IF, ID, EX) for the instructions at addresses 44, 48, 52, and 80:]
44 AND R12,R2,R5
48 OR R13,R6,R2
52 ADD R14,R2,R2
80 LD R4,R7,100
DLX Solution: Get New PC earlier
- Move Zero test to ID stage
- Additional ADDER to calculate New PC (taken PC) in ID stage
- 1 clock cycle penalty for branch, in contrast to 3 cycles
Hazards CS510 Computer Architectures Lecture 7 - 26
[Figure: pipelined datapath with Instr. Memory, Add, the Zero? test, Reg File, Sign Ext (16 to 32 bits), MUXes, Data Memory, and LMD, separated by the F/D, D/A, A/M, and M/W buffers; the Mem Stage and WB Stage are labeled. Annotation: when a branch instruction is in the Execute stage, the Next Address is available here.]
(1) STALL
Stall until branch direction is clear
[Figure: pipeline timing (IF, ID, EX, MEM, WB) for a branch instruction and the instructions that follow it, showing a 3 cycle penalty.]
Delayed Branch
Where to get instructions to fill branch delay slot?
- From before the branch instruction
- From the target address: only valuable when branch TAKEN
- From fall through: only valuable when branch NOT TAKEN

Canceling branches allow more slots to be filled.
Delayed Branch
From before:
    ADD R1, R2, R3
    if R2=0 then
        [Delay slot]

From target:
    SUB R4, R5, R6
    ADD R1, R2, R3
    if R1=0 then
        [Delay slot]
- Improves performance when TAKEN (loop)
- Must be alright to execute the rescheduled instruction if NOT TAKEN
- May need to duplicate the instruction if it is the target of another branch instruction

From fall through:
    ADD R1, R2, R3
    if R1=0 then
        [Delay slot]
    SUB R4, R5, R6
- Improves performance when NOT TAKEN
- Must be alright to execute the instruction if TAKEN
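For the "From before" case, the basic legality check is that the branch condition must not read the register written by the instruction being moved. The sketch below covers only that one condition; the function name is hypothetical and a real scheduler would check more than this.

```python
def can_fill_from_before(candidate_dest, branch_sources):
    """True if an instruction ahead of the branch may move into the delay slot.

    The move is safe, as far as this check goes, only when the branch
    condition does not read the register the candidate instruction writes.
    """
    return candidate_dest not in branch_sources


# ADD R1, R2, R3 followed by "if R2=0": the branch does not read R1.
print(can_fill_from_before("R1", {"R2"}))  # True

# ADD R1, R2, R3 followed by "if R1=0": the branch needs the ADD result.
print(can_fill_from_before("R1", {"R1"}))  # False
```

When the check fails, as in the second call, the slot has to be filled from the target or from the fall-through path instead.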
Canceling Branch
Used when delayed branch scheduling (i.e., filling the delay slot) cannot be done due to:
- Restrictions on scheduling instructions into the delay slots
- Limitations on the ability to predict at compile time whether the branch will be TAKEN or NOT TAKEN

A canceling branch allows the delay slot to be filled even if the instruction placed there does not meet the requirements: if the branch goes in the direction that was not predicted, the delay-slot instruction is canceled (turned into a no-op).
Scheduling scheme     Branch penalty
Stall pipeline        3
Predict Taken         1
Predict Not Taken     1
Delayed branch        0.5
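The effect of each branch penalty on CPI can be estimated as ideal CPI plus branch frequency times penalty. The sketch below is a rough model under stated assumptions: the branch frequency of 0.14 is an illustrative value that does not come from the slides, the ideal CPI is taken as 1, and the penalty is charged to every branch.

```python
BRANCH_PENALTY = {
    "Stall pipeline": 3,
    "Predict Taken": 1,
    "Predict Not Taken": 1,
    "Delayed branch": 0.5,
}


def pipeline_cpi(branch_frequency, penalty, ideal_cpi=1.0):
    """CPI when every branch pays the given penalty in cycles."""
    return ideal_cpi + branch_frequency * penalty


branch_frequency = 0.14  # assumed fraction of instructions that are branches

for scheme, penalty in BRANCH_PENALTY.items():
    cpi = pipeline_cpi(branch_frequency, penalty)
    print(f"{scheme}: CPI = {cpi:.2f}")
```

With these assumptions, stalling gives a CPI of about 1.42 and the delayed branch about 1.07, which shows why reducing the penalty from 3 cycles matters.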
If the branch is almost always NOT TAKEN, R4 is not needed on the taken path, and R5 and R6 are not modified in the following instruction(s), this move can increase speed.
If the branch is almost always TAKEN, R7 is not needed, and R8 and R9 are not modified on the fall-through path, this move can increase speed.
Frequency of Misprediction
[Figure: bar charts of misprediction frequency for the benchmarks doduc, gcc, compress, espresso, hydro2d, swm256, mdljsp2, tomcatv, alvinn, and ora. The first chart shows the Always taken scheme; the second compares Profile-based and Direction-based prediction.]
Pipelining Summary
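- Just overlap tasks; easy if tasks are independent
- Speedup <= Pipeline Depth; if ideal CPI is 1, then:

\[
\text{Speedup} = \frac{\text{Pipeline Depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]

- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data: need forwarding, compiler scheduling
  - Control: dynamic prediction, delayed branch slot, static (compiler) prediction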