Lec 07 Hazards

Lecture 7
Pipeline Hazards
Hazards
CS510 Computer Architectures
Lecture 7 - 1
Pipelining Lessons
[Figure: pipelined task timeline. Horizontal axis: Time, 6 PM to 9 PM, with segment durations 20, 30, 40, 40, 40, 40. Vertical axis: Task Order (B, C, D).]
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
It's Not That Easy to Achieve the Promised Performance

Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
- Structural hazards: HW cannot support this combination of instructions
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: pipelining of branches and other instructions that change the PC
The common solution is to stall the pipeline until the hazard is resolved, inserting one or more bubbles, i.e., idle clock cycles, in the pipeline
Structural Hazards / Memory

[Figure: pipeline diagram over clock cycles CC1 to CC9 for LOAD followed by Instr 1 to Instr 4 in instruction order, each passing through Mem, Reg, ALU, Mem, Reg. Annotation: operation on memory by 2 different instructions in the same clock cycle.]
Structural Hazards with Single-Port Memory

[Figure: pipeline diagram over clock cycles for LOAD, Instr 1, Instr 2, three stall cycles, then Instr 3 in instruction order, each instruction passing through Mem, Reg, ALU, Mem, Reg.]

3-cycle stall with 1-port memory
Avoiding Structural Hazard with Dual-Port Memory

[Figure: pipeline diagram over clock cycles for LOAD followed by Instr 1 to Instr 5 in instruction order, each passing through IM, Reg, ALU, DM, Reg.]

No stall with 2-port memory
Speed Up Equation for Pipelining

$$\text{Speedup from pipelining} = \frac{\text{Ave Instr Time}_{\text{unpipelined}}}{\text{Ave Instr Time}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock Cycle}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock Cycle}_{\text{pipelined}}}$$

$$\text{Ideal CPI} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{Pipeline depth}}$$

where Pipeline depth is the number of pipeline stages.

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

Ideal CPI for pipelined machines is almost always 1

$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr} = 1 + \text{Pipeline stall clock cycles per instr}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
Dual-Port vs Single-Port Memory

- Machine A: 2-port memory (needs no stall for Load); same clock cycle as the unpipelined machine
- Machine B: 1-port memory (needs a 3-cycle stall for Load); clock rate 1.05 times faster than the unpipelined machine
- Ideal CPI = 1 for both
- Loads are 40% of instructions executed

SpeedUpA = [Pipeline Depth / (1 + 0)] x (clock_unpipe / clock_pipe) = Pipeline Depth
SpeedUpB = [Pipeline Depth / (1 + 0.4 x 3)] x (clock_unpipe / (clock_unpipe / 1.05)) = (Pipeline Depth / 2.2) x 1.05 = 0.48 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth / (0.48 x Pipeline Depth) = 2.1

Machine A is about 2.1 times faster
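
The comparison can be checked by evaluating the speedup formula directly. The following is a minimal sketch, not part of the original slides: the function name `pipeline_speedup` and the pipeline depth of 5 are assumptions for illustration (the A/B ratio does not depend on the depth).

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    """Speedup = ideal_cpi * depth / (ideal_cpi + stall_cpi) * clock_ratio.

    clock_ratio is Clock Cycle_unpipelined / Clock Cycle_pipelined.
    """
    return ideal_cpi * depth / (ideal_cpi + stall_cpi) * clock_ratio


depth = 5  # assumed depth, for illustration only

# Machine A: no load stalls, same clock as the unpipelined machine.
speedup_a = pipeline_speedup(depth, stall_cpi=0.0, clock_ratio=1.0)

# Machine B: 40% loads x 3 stall cycles each, clock 1.05 times faster.
speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 3, clock_ratio=1.05)

print(f"A: {speedup_a:.2f}  B: {speedup_b:.2f}  A/B: {speedup_a / speedup_b:.2f}")
```

With these parameters the stall CPI of Machine B is 0.4 x 3 = 1.2, so its denominator is 2.2 and the ratio A/B is 2.2 / 1.05.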

Data Hazard on Registers

[Figure: pipeline diagram over clock cycles for the sequence below, each instruction passing through Mem, Reg, ALU, Mem, Reg, with the uses of R1 marked.]

ADD R1,R2,R3
SUB R4,R1,R3
AND R6,R1,R7
OR  R8,R1,R9
XOR R10,R11,R1

Registers can be made to read and store in the same cycle: data is stored in the first half of the clock cycle, and that data can be read in the second half of the same clock cycle.

[Figure: one clock cycle of register Ri, with "Store into Ri" in the first half and "Read from Ri" in the second half.]

[Figure: pipeline diagram over clock cycles for ADD R1,R2,R3; SUB R4,R1,R3; AND R6,R1,R7; OR R8,R1,R9; XOR R10,R11,R1, each passing through Mem, Reg, ALU, Mem, Reg, with the dependence on R1 marked.]

Needs to stall 2 cycles

Three Generic Data Hazards

Instr_i followed by Instr_j

Read After Write (RAW): Instr_j tries to read an operand before Instr_i writes it
    Instr_i: LW  R1, 0(R2)
    Instr_j: SUB R4, R1, R5

Write After Read (WAR): Instr_j tries to write an operand before Instr_i reads it
    Instr_i: ADD R1, R2, R3
    Instr_j: LW  R2, 0(R5)
Can't happen in the DLX 5-stage pipeline because:
- All instructions take 5 stages,
- Reads are always in stage 2, and
- Writes are always in stage 5

Write After Write (WAW): Instr_j tries to write an operand before Instr_i writes it, which leaves the wrong result (that of Instr_i, not Instr_j)
    Instr_i: LW  R1, 0(R2)
    Instr_j: LW  R1, 0(R3)
Can't happen in the DLX 5-stage pipeline because:
- All instructions take 5 stages, and
- Writes are always in stage 5

Will see WAR and WAW in later, more complicated pipes
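
The three hazard classes can be identified from the register sets each instruction reads and writes. The following is a minimal sketch under that assumption; the `Instr` tuple and `data_hazards` function are hypothetical names, and the result is the set of potential hazards between two instructions. Whether a given pipeline actually exposes them depends on its stage timing, as the DLX notes above point out.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str
    writes: frozenset  # registers written by the instruction
    reads: frozenset   # registers read by the instruction


def data_hazards(first: Instr, second: Instr) -> set:
    """Return the potential hazard types when `second` follows `first`."""
    hazards = set()
    if first.writes & second.reads:
        hazards.add("RAW")  # second reads what first writes
    if first.reads & second.writes:
        hazards.add("WAR")  # second writes what first reads
    if first.writes & second.writes:
        hazards.add("WAW")  # both write the same register
    return hazards


lw_r1 = Instr("LW R1, 0(R2)", frozenset({"R1"}), frozenset({"R2"}))
sub = Instr("SUB R4, R1, R5", frozenset({"R4"}), frozenset({"R1", "R5"}))
add = Instr("ADD R1, R2, R3", frozenset({"R1"}), frozenset({"R2", "R3"}))
lw_r2 = Instr("LW R2, 0(R5)", frozenset({"R2"}), frozenset({"R5"}))
lw_r1_again = Instr("LW R1, 0(R3)", frozenset({"R1"}), frozenset({"R3"}))

print(data_hazards(lw_r1, sub))          # RAW example from the slides
print(data_hazards(add, lw_r2))          # WAR example from the slides
print(data_hazards(lw_r1, lw_r1_again))  # WAW example from the slides
```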
Forwarding to Avoid Data Hazards

[Figure: pipeline diagram over clock cycles for ADD R1,R2,R3; SUB R4,R1,R3; AND R6,R1,R7; OR R8,R1,R9; XOR R10,R11,R1, each passing through Mem, Reg, ALU, Mem, Reg, with forwarding paths for R1.]
HW Change for Forwarding

[Figure: datapath modified for forwarding. Labeled components: D/A Buffer, A/M Buffer, M/W Buffer, ALU, MUXes, Zero? test, Data Memory.]
Load Delay Due to Data Hazard

[Figure: pipeline diagram over clock cycles for the sequence below, each instruction passing through IM, Reg, ALU, DM, Reg.]

LOAD R1,0(R2)
SUB  R4,R1,R6
AND  R6,R1,R7
OR   R8,R1,R9

Load Delay = 2 cycles
Load Delay with Forwarding

[Figure: pipeline diagram over clock cycles for the sequence below, each instruction passing through IM, Reg, ALU, DM, Reg, with forwarding from the load.]

LOAD R1,0(R2)
SUB  R4,R1,R6
AND  R6,R1,R7
OR   R8,R1,R9

Load Delay with Forwarding = 1 cycle
We need to add HW, called a Pipeline Interlock
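
The interlock condition can be sketched as a check on adjacent instructions. This is a simplified model, not the actual hardware: it assumes full forwarding, so the only remaining stall is one cycle when the instruction immediately after a load reads the loaded register. The `Instr` tuple and `load_use_stalls` function are hypothetical names.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]     # destination register, if any
    srcs: Tuple[str, ...]   # source registers


def load_use_stalls(program):
    """Count 1-cycle interlock stalls under the assumed forwarding model."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        # Stall only when a load is immediately followed by a consumer
        # of the loaded register; later consumers are covered by forwarding.
        if prev.op == "LOAD" and prev.dest in cur.srcs:
            stalls += 1
    return stalls


program = [
    Instr("LOAD", "R1", ("R2",)),      # LOAD R1,0(R2)
    Instr("SUB", "R4", ("R1", "R6")),  # SUB  R4,R1,R6
    Instr("AND", "R6", ("R1", "R7")),  # AND  R6,R1,R7
    Instr("OR", "R8", ("R1", "R9")),   # OR   R8,R1,R9
]
print(load_use_stalls(program))
```

Only SUB, which directly follows the load, triggers the interlock in this model; AND and OR read R1 late enough for forwarding to supply it.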
Software Scheduling to Avoid Load Hazards

Try to produce fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code (with forwarding):
    LW  Rb,b
    LW  Rc,c
    (stall: RAW)
    ADD Ra,Rb,Rc
    (stall: RAW)
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    (stall: RAW)
    SUB Rd,Re,Rf
    (stall: RAW)
    SW  d,Rd

Fast code:
    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
    (stall: RAW)
    SW  d,Rd
Compiler Avoiding Load Stalls

% loads stalling pipeline

Benchmark   Unscheduled   Scheduled
gcc         54%           31%
spice       42%           14%
tex         65%           25%
Pipelined DLX Datapath

[Figure: pipelined DLX datapath. Stages: IF, ID, EX, Mem, WB, separated by the F/D, D/A, A/M, and M/W buffers. Labeled components: PC, Add (+4), Instr. Memory, Reg File, Sign Ext (16 to 32 bits), ALU, Zero? test, SMD, Data Memory, LMD, MUXes. Annotations: branch address calculation; decide condition; branch decision for target address.]
Control Hazard on Branches: Three Stall Cycles

[Figure: pipeline diagram over clock cycles CC1 to CC9 for the sequence below in program execution order, each instruction passing through IM, Reg, ALU, DM, Reg. Annotation: branch target available.]

40 BEQ R1,R3,36
44 AND R12,R2,R5
48 OR  R13,R6,R2
52 ADD R14,R2,R2
80 LD  R4,R7,100

The instructions following the branch (44, 48, 52) shouldn't be executed when the branch condition is true!
Branch Delay = 3 cycles
Control Hazard on Branches: Three Stall Cycles

Branch instruction      IF  ID  EX  MEM WB
Branch successor            IF  ID  EX  MEM
Branch successor + 1            IF  ID  EX
Branch successor + 2                IF  ID

- We don't know yet that the instruction being executed is a branch. Fetch the branch successor.
- Now we know the instruction being executed is a branch. But stall until the branch target address is known.
- Now the target address is available.

3 wasted clock cycles for the TAKEN branch
Branch Stall Impact

If CPI = 1, 30% branch, stall 3 cycles => new CPI = 1.9: about half of the ideal speed

Two-part solution:
- Determine sooner whether the branch is TAKEN or NOT TAKEN, AND
- Compute the TAKEN branch address (branch target) earlier

DLX branch tests if register = 0 or ≠ 0

DLX solution: get the new PC earlier
- Move the Zero test to the ID stage
- Additional ADDER to calculate the new PC (taken PC) in the ID stage
- 1 clock cycle penalty for a branch, in contrast to 3 cycles
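
The CPI figure follows from adding the average branch stall to the ideal CPI. A minimal sketch of that arithmetic, where the function name `cpi_with_branch_stalls` is an assumption for illustration:

```python
def cpi_with_branch_stalls(branch_freq, penalty_cycles, ideal_cpi=1.0):
    """CPI = ideal CPI + branch frequency x stall cycles per branch."""
    return ideal_cpi + branch_freq * penalty_cycles


cpi_three_cycle = cpi_with_branch_stalls(0.30, 3)  # 3-cycle branch stall
cpi_one_cycle = cpi_with_branch_stalls(0.30, 1)    # penalty reduced to 1 cycle

print(f"3-cycle penalty: CPI = {cpi_three_cycle:.2f}")
print(f"1-cycle penalty: CPI = {cpi_one_cycle:.2f}")
```

With the same 30% branch frequency, cutting the penalty from 3 cycles to 1 lowers the CPI from 1.9 to 1.3.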
Pipelined DLX Datapath

[Figure: revised pipelined DLX datapath. Stages: IF, ID, EX, Mem, WB, separated by the F/D, D/A, A/M, and M/W buffers. Labeled components: PC, Add (+4), Instr. Memory, Reg File, Sign Ext (16 to 32 bits), an additional Add, Zero? test, ALU, SMD, Data Memory, LMD, MUXes. Annotations: "To get target addr. earlier"; "To get the condition earlier. Target address available after ID."; "When a branch instruction is in the Execute stage, the next address is available here."]
Branch Behavior in Programs

Conditional branch frequencies
- integer average: 14 to 16%
- floating point: 3 to 12%
Forward and backward taken branches
- forward taken: 60%
- backward taken: 85%
- average over all conditional branches: 67%
4 Branch Hazard Alternatives

(1) Stall until branch direction is clear
(2) Predict branch NOT TAKEN
(3) Predict branch TAKEN
(4) Delayed branch
4 Branch Hazard Alternatives: (1) STALL

Stall until branch direction is clear

Branch instruction      IF  ID     EX     MEM    WB
Branch successor            stall  stall  stall  IF  ID  EX  MEM
Branch successor + 1                                 IF  ID  EX
Branch successor + 2                                     IF  ID

3 cycle penalty

Revised DLX pipeline (get the branch address at EX)

Branch instruction      IF  ID     EX  MEM WB
Branch successor            stall  IF  ID  EX  MEM WB
Branch successor + 1                   IF  ID  EX  MEM
Branch successor + 2                       IF  ID

1 cycle penalty (Branch Delay Slot)

(2) Predict Branch NOT TAKEN

- Execute successor instructions in sequence
- PC+4 is already calculated, so use it to get the next instruction
- Flush instructions in the pipeline if the branch is actually TAKEN
- Advantage of late pipeline state update
- 47% of DLX branches are NOT TAKEN on average

NOT TAKEN branch: no penalty
instruction i (branch)  IF  ID  EX  MEM WB
instruction i+1             IF  ID  EX  MEM WB
instruction i+2                 IF  ID  EX  MEM WB

TAKEN branch: 1 cycle penalty
instruction i (branch)  IF  ID  EX  MEM WB
instruction i+1             IF  (flush this instruction in progress)
instruction T                   IF  ID  EX  MEM WB

(3) Predict Branch TAKEN

- 53% of DLX branches are TAKEN on average
- Branch target address is available after ID in DLX
- DLX still incurs a 1 cycle branch penalty for a TAKEN branch
- Other machines: branch target known before outcome

NOT TAKEN branch: 2 cycle penalty in DLX (1 in other machines)
instruction i (branch)  IF  ID     EX  MEM WB
Instruction T               stall  IF
Instruction i+1                        IF  ID  EX  MEM WB

TAKEN branch: 1 cycle penalty in DLX (0 in other machines)
instruction i (branch)  IF  ID     EX  MEM WB
Instruction T               stall  IF  ID  EX  MEM WB
Instruction T+1                        IF  ID  EX  MEM WB

The TAKEN address is not available during the stall cycle; it is available when Instruction T is fetched.
(4) Delayed Branch

Delay the branch so that it takes place AFTER a successor instruction

Delayed branch of length n:
    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n
    branch target if taken

A 1-slot delayed branch allows proper decision and branch target address in the 5-stage DLX pipeline with the control hazard improvement
Delayed Branch
Where to get instructions to fill the branch delay slot?
- Before the branch instruction
- From the target address: only valuable when the branch is TAKEN
- From fall through: only valuable when the branch is NOT TAKEN
- Canceling branches allow more slots to be filled

Compiler effectiveness for a single delayed branch slot:
- Fills about 60% of delayed branch slots
- About 80% of instructions executed in delayed branch slots are useful in computation
- About 50% (60% x 80%) of slots usefully filled
Delayed Branch
From before
    Before scheduling:
        ADD R1, R2, R3
        if R2=0 then
            [Delay slot]
    After scheduling:
        if R2=0 then
            ADD R1, R2, R3
    - Always improves performance
    - Branch must not depend on the rescheduled instructions

From target
    Before scheduling:
        SUB R4, R5, R6      (at the branch target)
        ADD R1, R2, R3
        if R1=0 then
            [Delay slot]
    After scheduling:
        ADD R1, R2, R3
        if R1=0 then
            SUB R4, R5, R6
    - Improves performance when TAKEN (loop)
    - Must be all right to execute the rescheduled instructions if NOT TAKEN
    - May need to duplicate the instruction if it is the target of another branch instruction

From fall through
    Before scheduling:
        ADD R1, R2, R3
        if R1=0 then
            [Delay slot]
        SUB R4, R5, R6
    After scheduling:
        ADD R1, R2, R3
        if R1=0 then
            SUB R4, R5, R6
    - Improves performance when NOT TAKEN
    - Must be all right to execute the rescheduled instructions if TAKEN
Limitations on Delayed Branch

Difficulty in finding useful instructions to fill the delayed branch slots

Solution: Squashing
- The delayed branch is associated with a branch prediction
- Instructions in the predicted path are executed in the delayed branch slot
- If the branch outcome is mispredicted, instructions in the delayed branch slot are squashed (discarded)
Canceling Branch
Used when delayed branch scheduling, i.e., filling the delay slot, cannot be done due to
- Restrictions on scheduling instructions into the delay slots
- Limitations on the ability to predict at compile time whether the branch will be TAKEN or NOT TAKEN

The instruction includes the direction in which the branch was predicted
- When the branch behaves as predicted, the instructions in the delay slot are executed
- When the branch is incorrectly predicted, the instructions in the delay slot are turned into No-OPs

A canceling branch allows the delay slot to be filled even if the instruction placed there does not meet the requirements
Evaluating Branch Alternatives

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{\text{CPI}} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

Conditional and unconditional branches collectively have 14% frequency; 65% of branches are TAKEN

Scheduling scheme   Branch penalty  CPI                       Speedup vs unpipelined  Speedup vs stall
Stall pipeline      3               1 + 0.14 x 3 = 1.42       5/1.42 = 3.5            1.0
Predict Taken       1               1 + 0.14 x 1 = 1.14       5/1.14 = 4.4            1.26
Predict Not Taken   1               1 + 0.14 x 0.65 = 1.09    5/1.09 = 4.5            1.29
Delayed branch      0.5             1 + 0.14 x 0.5 = 1.07     5/1.07 = 4.6            1.31
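
The comparison of schemes can be recomputed from the speedup formula. The following is a minimal sketch: the names `EFFECTIVE_PENALTY` and `evaluate` are assumptions for illustration, and the Predict Not Taken entry uses the 1-cycle penalty weighted by the 65% of branches that are TAKEN. Small differences in the last digit relative to the slide's figures are to be expected from rounding.

```python
PIPELINE_DEPTH = 5
BRANCH_FREQ = 0.14
TAKEN_FRACTION = 0.65

# Average stall cycles per branch for each scheduling scheme.
EFFECTIVE_PENALTY = {
    "Stall pipeline": 3.0,
    "Predict Taken": 1.0,
    "Predict Not Taken": 1.0 * TAKEN_FRACTION,  # penalty paid only when TAKEN
    "Delayed branch": 0.5,
}


def evaluate(penalties, depth=PIPELINE_DEPTH, branch_freq=BRANCH_FREQ):
    """Return {scheme: (CPI, speedup vs unpipelined)}."""
    results = {}
    for scheme, penalty in penalties.items():
        cpi = 1.0 + branch_freq * penalty
        results[scheme] = (cpi, depth / cpi)
    return results


results = evaluate(EFFECTIVE_PENALTY)
stall_speedup = results["Stall pipeline"][1]
for scheme, (cpi, speedup) in results.items():
    print(f"{scheme:18s} CPI={cpi:.2f}  speedup={speedup:.2f}  "
          f"vs stall={speedup / stall_speedup:.2f}")
```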
Static(Compiler) Prediction of Taken/Untaken Branches

Code Motion

        LW   R1, 0(R2)
        SUB  R1, R1, R3     (depends on LW, need to stall)
        BEQZ R1, L
        OR   R4, R5, R6
        ADD  R10, R4, R3
L:      ADD  R7, R8, R9

- If the branch is almost always NOT TAKEN, R4 is not needed on the taken path, and R5 and R6 are not modified in the following instruction(s), moving OR R4, R5, R6 up ahead of the branch can increase speed
- If the branch is almost always TAKEN, R7 is not needed on the fall-through path, and R8 and R9 are not modified on the fall-through path, moving ADD R7, R8, R9 up ahead of the branch can increase speed
Static(Compiler) Prediction of Taken/Untaken Branches

Improves strategy for placing instructions in the delay slot

Two strategies:
- Direction-based prediction: TAKEN for backward branches, NOT TAKEN for forward branches
- Profile-based prediction: record branch behaviors, predict each branch based on the prior run(s)

[Figure: bar charts of misprediction rate and frequency of misprediction per benchmark (doduc, gcc, compress, espresso, hydro2d, swm256, mdljsp2, tomcatv, alvinn, ora). Axis ranges: 0% to 70% and 0% to 14%. Series: "Always taken" and "Taken backwards, Not Taken Forwards".]
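
The two static strategies can be sketched as follows. This is an illustrative model, not a compiler implementation: `predict_by_direction` and `build_profile` are hypothetical names, addresses are plain integers, and the trace is made-up data standing in for a prior run.

```python
from collections import defaultdict


def predict_by_direction(branch_pc, target_pc):
    """Direction-based: backward branches TAKEN, forward branches NOT TAKEN."""
    return target_pc < branch_pc


def build_profile(trace):
    """Profile-based: predict each branch's majority outcome in a prior run.

    trace is an iterable of (branch_pc, taken) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # pc -> [not_taken, taken]
    for pc, taken in trace:
        counts[pc][1 if taken else 0] += 1
    return {pc: taken > not_taken
            for pc, (not_taken, taken) in counts.items()}


# Hypothetical trace: the branch at 0x40 is usually taken, 0x80 is not.
trace = [(0x40, True), (0x40, True), (0x40, False), (0x80, False)]
profile = build_profile(trace)

print(predict_by_direction(0x40, 0x20))  # backward branch
print(predict_by_direction(0x40, 0x80))  # forward branch
print(profile)
```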
Evaluating Static Branch Prediction Strategies

- Misprediction rate ignores the frequency of branches
- Instructions between mispredicted branches is a better metric

[Figure: bar chart of instructions per mispredicted branch (log scale, 1 to 100000) per benchmark (doduc, gcc, compress, espresso, hydro2d, swm256, mdljsp2, tomcatv, alvinn, ora). Series: Profile-based and Direction-based.]
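
The metric combines the branch frequency with the misprediction rate. A minimal sketch, assuming the average number of instructions between mispredictions is the reciprocal of their product; the function name and the two example programs are hypothetical, not measured data.

```python
def instructions_per_misprediction(branch_freq, mispredict_rate):
    """Average instructions executed per mispredicted branch."""
    mispredictions_per_instr = branch_freq * mispredict_rate
    if mispredictions_per_instr == 0:
        return float("inf")  # no branches, or no mispredictions
    return 1.0 / mispredictions_per_instr


# Two hypothetical programs with the same 10% misprediction rate
# but different branch frequencies.
branchy = instructions_per_misprediction(branch_freq=0.20, mispredict_rate=0.10)
sparse = instructions_per_misprediction(branch_freq=0.05, mispredict_rate=0.10)

print(f"branchy: {branchy:.0f}  sparse: {sparse:.0f}")
```

Both programs have the same misprediction rate, yet the one with fewer branches runs four times as many instructions between mispredictions, which is why the rate alone is misleading.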
Pipelining Summary
- Just overlap tasks; easy if tasks are independent
- Speedup <= Pipeline Depth; if ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data: need forwarding, compiler scheduling
  - Control: dynamic prediction, delayed branch slot, static (compiler) prediction