
Lecture 7

Pipeline Hazards

Hazards

CS510 Computer Architectures

Lecture 7 - 1

Pipelining Lessons
[Figure: task order (B, C, D) versus time from 6 PM to 9 PM; interval widths along the time axis are 20, 30, 40, 40, 40, 40.]

- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
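The lessons above can be checked with a small timing model. This is a minimal sketch, assuming a lock-step pipeline in which every stage takes as long as the slowest one; the function names and the stage times are hypothetical and chosen only for illustration.

```python
def sequential_time(stage_times, n_tasks):
    """Total time when each task finishes all stages before the next one starts."""
    return n_tasks * sum(stage_times)


def pipelined_time(stage_times, n_tasks):
    """Total time when stages overlap and all stages advance in lock-step.

    The step length is set by the slowest stage. The first task needs
    len(stage_times) steps to fill the pipeline; every later task
    completes one step after the previous one.
    """
    step = max(stage_times)
    return (len(stage_times) + n_tasks - 1) * step


balanced = [30, 30, 30]    # hypothetical stage times (minutes)
unbalanced = [20, 30, 40]  # same total work, uneven stages

for stages in (balanced, unbalanced):
    for n in (1, 4, 1000):
        speedup = sequential_time(stages, n) / pipelined_time(stages, n)
        print(stages, n, round(speedup, 2))
```

In this model a single task gains nothing, the speedup approaches the number of stages only for many tasks with balanced stages, and the unbalanced pipeline stays well below that bound.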


It's Not That Easy to Achieve the Promised Performance

Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
- Structural hazards: HW cannot support this combination of instructions
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: pipelining of branches and other instructions that change the PC

Common solution is to stall the pipeline until the hazard is resolved, inserting one or more bubbles, i.e., idle clock cycles, in the pipeline


Structural Hazards / Memory

[Figure: pipeline diagram, clock cycles CC1 to CC9 versus instruction order (LOAD, Instr 1 to Instr 4); each instruction passes through Mem, Reg, ALU, Mem, Reg.]

Operation on memory by 2 different instructions in the same clock cycle

Structural Hazards with Single-Port Memory


[Figure: pipeline diagram, clock cycles CC1 to CC9 versus instruction order (LOAD, Instr 1, Instr 2, Stall, Stall, Stall, Instr 3); each instruction passes through Mem, Reg, ALU, Mem, Reg.]

3-cycle stall with 1-port memory

Avoiding Structural Hazard with Dual-Port Memory


[Figure: pipeline diagram, clock cycles CC1 to CC9 versus instruction order (LOAD, Instr 1 to Instr 5); each instruction passes through IM, Reg, ALU, DM, Reg.]

No stall with 2-port memory

Speed Up Equation for Pipelining


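\[
\text{Speedup from pipelining}
= \frac{\text{Ave Instr Time}_{\text{unpipelined}}}{\text{Ave Instr Time}_{\text{pipelined}}}
= \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock Cycle}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock Cycle}_{\text{pipelined}}}
= \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]

\[
\text{Ideal CPI} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{Pipeline depth}}
\qquad (\text{Pipeline depth} = \text{number of pipeline stages})
\]

\[
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]

Ideal CPI for pipelined machines is almost always 1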

Speed Up Equation for Pipelining


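\[
\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr}
= 1 + \text{Pipeline stall clock cycles per instr}
\]

\[
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]

\[
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]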
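The speedup equation can be written as a small helper for experimenting with stall rates and clock ratios. This is a sketch only; the function name and parameter names are assumptions made for illustration.

```python
def pipeline_speedup(depth, stall_cpi, clock_unpipelined=1.0,
                     clock_pipelined=1.0, ideal_cpi=1.0):
    """Speedup = (Ideal CPI x depth / CPI_pipelined) x (clock cycle ratio).

    stall_cpi is the average number of pipeline stall cycles per instruction.
    """
    cpi_pipelined = ideal_cpi + stall_cpi
    clock_ratio = clock_unpipelined / clock_pipelined
    return (ideal_cpi * depth / cpi_pipelined) * clock_ratio


# A 5-stage pipeline with no stalls reaches the full depth.
print(pipeline_speedup(5, 0.0))
# A quarter of a stall cycle per instruction already costs a fifth of it.
print(pipeline_speedup(5, 0.25))
```

With the default clock ratio of 1, the result depends only on the pipeline depth and the stall cycles per instruction.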


Dual-Port vs Single-Port Memory


- Machine A: 2-port memory (needs no stall for loads); same clock cycle as the unpipelined machine
- Machine B: 1-port memory (needs a 3-cycle stall for each load); clock rate 1.05 times faster than the unpipelined machine
- Ideal CPI = 1 for both
- Loads are 40% of instructions executed

SpeedUpA = [Pipeline Depth / (1 + 0)] x (clock_unpipe / clock_pipe) = Pipeline Depth
SpeedUpB = [Pipeline Depth / (1 + 0.4 x 3)] x (clock_unpipe / (clock_unpipe / 1.05)) = (Pipeline Depth / 2.2) x 1.05 ≈ 0.48 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth / (0.48 x Pipeline Depth) ≈ 2.1

Machine A is about 2.1 times faster
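The comparison can be recomputed directly from the stated parameters (40% loads, a 3-cycle stall per load on Machine B, a 1.05 times faster clock on Machine B). Evaluating 1 + 0.4 x 3 gives a stall-adjusted CPI of 2.2 for Machine B. The sketch below expresses each speedup as a fraction of the pipeline depth; the function name is hypothetical.

```python
def speedup_per_stage(load_fraction, load_stall_cycles, clock_rate_gain):
    """Speedup divided by pipeline depth, assuming ideal CPI = 1."""
    cpi = 1.0 + load_fraction * load_stall_cycles
    return clock_rate_gain / cpi


machine_a = speedup_per_stage(0.4, 0, 1.00)  # 2-port memory, no load stalls
machine_b = speedup_per_stage(0.4, 3, 1.05)  # 1-port memory, 3-cycle load stall

print(f"A: {machine_a:.2f} x depth")
print(f"B: {machine_b:.2f} x depth")
print(f"A / B: {machine_a / machine_b:.2f}")
```

The small clock-rate advantage of Machine B does not come close to making up for its load stalls.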



Data Hazard on Registers


[Figure: pipeline diagram, clock cycles CC1 to CC9, each instruction passing through Mem, Reg, ALU, Mem, Reg; R1 is marked at the register stage of the ADD.]

ADD R1,R2,R3
SUB R4,R1,R3
AND R6,R1,R7
OR  R8,R1,R9
XOR R10,R11,R1

Data Hazard on Registers


Registers can be made to read and store in the same cycle: data is stored in the first half of the clock cycle, and that data can be read in the second half of the same clock cycle

[Figure: one clock cycle of register Ri; "Store into Ri" in the first half, "Read from Ri" in the second half.]

Data Hazard on Registers


[Figure: pipeline diagram, clock cycles CC1 to CC9, each instruction passing through Mem, Reg, ALU, Mem, Reg; R1 is marked at the register stage of the ADD.]

ADD R1,R2,R3
SUB R4,R1,R3
AND R6,R1,R7
OR  R8,R1,R9
XOR R10,R11,R1

Needs to Stall 2 cycles
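The 2-cycle stall can be reproduced with a simple timing model. This is a sketch under stated assumptions: a 5-stage pipeline with no forwarding, register reads in stage 2, register writes in stage 5, and (optionally) a register file that writes in the first half of a cycle and reads in the second half. The function name and the tuple encoding of instructions are hypothetical.

```python
def count_stalls(program, split_register_file=True):
    """Return the stall cycles inserted before each instruction.

    program: list of (dest, sources) tuples in program order.
    A producer fetched in cycle c writes its result in cycle c + 4.
    A consumer fetched in cycle c' reads its sources in cycle c' + 1.
    With a split-cycle register file the read may share the write cycle,
    so c' >= c + 3; otherwise the read must come later, so c' >= c + 4.
    """
    gap = 3 if split_register_file else 4
    earliest_fetch = {}  # register -> earliest fetch cycle for a reader
    stalls = []
    next_fetch = 1
    for dest, sources in program:
        fetch = max([next_fetch] + [earliest_fetch.get(r, 0) for r in sources])
        stalls.append(fetch - next_fetch)
        if dest is not None:
            earliest_fetch[dest] = fetch + gap
        next_fetch = fetch + 1
    return stalls


program = [
    ("R1", ("R2", "R3")),    # ADD R1,R2,R3
    ("R4", ("R1", "R3")),    # SUB R4,R1,R3
    ("R6", ("R1", "R7")),    # AND R6,R1,R7
    ("R8", ("R1", "R9")),    # OR  R8,R1,R9
    ("R10", ("R11", "R1")),  # XOR R10,R11,R1
]
print(count_stalls(program))
print(count_stalls(program, split_register_file=False))
```

In this model only the SUB has to wait: once it has been delayed, the later readers of R1 reach their register-read stage after the ADD has written back.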



Three Generic Data Hazards


Instr_i followed by Instr_j

Read After Write (RAW): Instr_j tries to read an operand before Instr_i writes it

Instr_i: LW  R1, 0(R2)
Instr_j: SUB R4, R1, R5

Three Generic Data Hazards


Instr_i followed by Instr_j

Write After Read (WAR): Instr_j tries to write an operand before Instr_i reads it

Instr_i: ADD R1, R2, R3
Instr_j: LW  R2, 0(R5)

Can't happen in the DLX 5-stage pipeline because:
- All instructions take 5 stages,
- Reads are always in stage 2, and
- Writes are always in stage 5

Three Generic Data Hazards


Instr_i followed by Instr_j

Write After Write (WAW): Instr_j tries to write an operand before Instr_i writes it
- Leaves the wrong result (the value written by Instr_i, not Instr_j)

Instr_i: LW R1, 0(R2)
Instr_j: LW R1, 0(R3)

Can't happen in the DLX 5-stage pipeline because:
- All instructions take 5 stages, and
- Writes are always in stage 5

Will see WAR and WAW in later, more complicated pipes
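The three hazard types follow mechanically from which registers each instruction reads and writes. The sketch below classifies the dependences between an earlier and a later instruction; whether a dependence becomes an actual hazard depends on the pipeline, as noted above for DLX. The `Instr` record and the function name are hypothetical.

```python
from collections import namedtuple

# text: assembly text; writes/reads: frozensets of register names
Instr = namedtuple("Instr", ["text", "writes", "reads"])


def classify_hazards(first, second):
    """Return the potential data hazards when `first` precedes `second`."""
    hazards = set()
    if first.writes & second.reads:
        hazards.add("RAW")  # second reads what first writes
    if first.reads & second.writes:
        hazards.add("WAR")  # second writes what first reads
    if first.writes & second.writes:
        hazards.add("WAW")  # both write the same register
    return hazards


lw_r1 = Instr("LW R1, 0(R2)", frozenset({"R1"}), frozenset({"R2"}))
sub = Instr("SUB R4, R1, R5", frozenset({"R4"}), frozenset({"R1", "R5"}))
add = Instr("ADD R1, R2, R3", frozenset({"R1"}), frozenset({"R2", "R3"}))
lw_r2 = Instr("LW R2, 0(R5)", frozenset({"R2"}), frozenset({"R5"}))
lw_r1b = Instr("LW R1, 0(R3)", frozenset({"R1"}), frozenset({"R3"}))

print(classify_hazards(lw_r1, sub))     # RAW pair from the slides
print(classify_hazards(add, lw_r2))     # WAR pair from the slides
print(classify_hazards(lw_r1, lw_r1b))  # WAW pair from the slides
```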


Forwarding to Avoid Data Hazards


[Figure: pipeline diagram, clock cycles CC1 to CC9, each instruction passing through Mem, Reg, ALU, Mem, Reg, with the result of the ADD forwarded to the instructions that use R1.]

ADD R1,R2,R3
SUB R4,R1,R3
AND R6,R1,R7
OR  R8,R1,R9
XOR R10,R11,R1

HW Change for Forwarding


[Figure: datapath for forwarding, showing the D/A, A/M and M/W buffers, the MUXes at the ALU inputs, the ALU, the Zero? test and the Data Memory.]

Load Delay Due to Data Hazard


[Figure: pipeline diagram over time (clock cycles), each instruction passing through IM, Reg, ALU, DM, Reg.]

LOAD R1,0(R2)
SUB  R4,R1,R6
AND  R6,R1,R7
OR   R8,R1,R9

Load delay = 2 cycles

Load Delay with Forwarding


[Figure: pipeline diagram over time (clock cycles), each instruction passing through IM, Reg, ALU, DM, Reg, with forwarding from the load.]

LOAD R1,0(R2)
SUB  R4,R1,R6
AND  R6,R1,R7
OR   R8,R1,R9

Load delay with forwarding = 1 cycle

We need to add HW, called a pipeline interlock

Software Scheduling to Avoid Load Hazards


Try to produce fast code for

    a = b + c;
    d = e - f;

assuming a, b, c, d, e, and f are in memory.

Slow code (with forwarding):    Fast code:
    LW  Rb,b                        LW  Rb,b
    LW  Rc,c                        LW  Rc,c
    ADD Ra,Rb,Rc                    LW  Re,e
    SW  a,Ra                        ADD Ra,Rb,Rc
    LW  Re,e                        LW  Rf,f
    LW  Rf,f                        SW  a,Ra
    SUB Rd,Re,Rf                    SUB Rd,Re,Rf
    SW  d,Rd                        SW  d,Rd

[Slide annotations: "Stall" and "RAW" markers between adjacent dependent instructions.]
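The benefit of reordering can be checked by counting load-use pairs in both sequences. This is a minimal sketch, assuming the common model in which forwarding removes every stall except the one cycle needed when an instruction uses a register loaded by the instruction immediately before it; the tuple encoding and the function name are hypothetical.

```python
def load_use_stalls(program):
    """Count 1-cycle stalls caused by using a loaded value in the very next instruction.

    program: list of (opcode, dest, sources) tuples in program order.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "LW" and prev_dest in cur_sources:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]
print(load_use_stalls(slow), load_use_stalls(fast))
```

Under this model the reordered sequence executes the same eight instructions with no load-use stalls, because an independent instruction now sits between each load and its first use.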

Compiler Avoiding Load Stalls


% loads stalling pipeline

Program   Unscheduled   Scheduled
gcc       54%           31%
spice     42%           14%
tex       65%           25%

Pipelined DLX Datapath


[Figure: pipelined DLX datapath with IF, ID, EX, Mem and WB stages; PC with Add +4, Instr. Memory, Reg File, Sign Ext (16 to 32 bits), ALU with input MUXes, Zero? test, Data Memory, SMD and LMD registers, and the F/D, D/A, A/M and M/W buffers. Branch address calculation and the condition decision are annotated, together with the point where the branch decision for the target address is made.]

Control Hazard on Branches: Three Stall Cycles

[Figure: pipeline diagram, clock cycles CC1 to CC9 versus program execution order, each instruction passing through IM, Reg, ALU, DM, Reg; the point where the branch target becomes available is marked.]

40 BEQ R1,R3,36
44 AND R12,R2,R5
48 OR  R13,R6,R2
52 ADD R14,R2,R2
80 LD  R4,R7,100

The instructions that follow the branch shouldn't be executed when the branch condition is true!

Branch delay = 3 cycles

Control Hazard on Branches: Three Stall Cycles

[Figure: pipeline timing (IF, ID, EX, MEM, WB) for the branch instruction, the branch successor, branch successor + 1 and branch successor + 2.]

- We don't know yet that the instruction being executed is a branch. Fetch the branch successor.
- Now we know the instruction being executed is a branch, but stall until the branch target address is known.
- Now the target address is available.

3 wasted clock cycles for the TAKEN branch

Branch Stall Impact


- If CPI = 1, 30% branches, stall 3 cycles => new CPI = 1 + 0.3 x 3 = 1.9, about half of the ideal speed
- Two-part solution:
  - Determine whether the branch is TAKEN or NOT TAKEN sooner, AND
  - Compute the TAKEN branch address (branch target) earlier
- DLX branch tests whether a register is = 0 or ≠ 0

DLX solution: get the new PC earlier
- Move the Zero test to the ID stage
- Additional ADDER to calculate the new PC (taken PC) in the ID stage
- 1 clock cycle penalty for a branch, in contrast to 3 cycles
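The effect of the shorter branch penalty on CPI can be computed directly from the figures on this slide (base CPI of 1, 30% branches). The helper name below is an assumption for illustration.

```python
def cpi_with_branches(branch_fraction, branch_penalty, base_cpi=1.0):
    """CPI when every branch costs `branch_penalty` extra cycles."""
    return base_cpi + branch_fraction * branch_penalty


three_cycle = cpi_with_branches(0.30, 3)  # target resolved late
one_cycle = cpi_with_branches(0.30, 1)    # zero test and adder moved to ID

print(round(three_cycle, 2), round(one_cycle, 2))
print(round(three_cycle / one_cycle, 2))  # gain from resolving the branch earlier
```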

Pipelined DLX Datapath


[Figure: pipelined DLX datapath with IF, ID, EX, Mem and WB stages; the Zero? test and an additional adder (Add) are placed in the ID stage; PC with Add +4, Instr. Memory, Reg File, Sign Ext (16 to 32 bits), ALU with input MUXes, Data Memory, SMD and LMD registers, and the F/D, D/A, A/M and M/W buffers.]

- The additional adder is there to get the target address earlier; the target address is available after ID.
- The Zero? test is moved to get the condition earlier.
- When a branch instruction is in the Execute stage, the next address is already available.

Branch Behavior in Programs


Conditional branch frequencies
- integer average: 14 to 16%
- floating point: 3 to 12%

Forward and backward taken branches
- forward taken: 60%
- backward taken: 85%
- the average of all conditional branches: 67%

4 Branch Hazard Alternatives


- Stall until branch direction is clear
- Predict branch NOT TAKEN
- Predict branch TAKEN
- Delayed branch

4 Branch Hazard Alternatives: (1) STALL

Stall until branch direction is clear

Branch instruction      IF  ID     EX     MEM    WB
Branch successor            stall  stall  stall  IF  ID  EX  MEM
Branch successor + 1                                 IF  ID  EX
Branch successor + 2                                     IF  ID

3 cycle penalty

Revised DLX pipeline (get the branch address at EX)

Branch instruction      IF  ID     EX  MEM  WB
Branch successor            stall  IF  ID   EX   MEM  WB
Branch successor + 1                   IF   ID   EX   MEM
Branch successor + 2                        IF   ID

1 cycle penalty (branch delay slot)

4 Branch Hazard Alternatives: (2) Predict Branch NOT TAKEN

- Execute successor instructions in sequence
- PC+4 is already calculated, so use it to get the next instruction
- Flush instructions in the pipeline if the branch is actually TAKEN
- Advantage of late pipeline state update
- 47% of DLX branches are NOT TAKEN on average

NOT TAKEN branch: no penalty

instruction i       IF  ID  EX  MEM  WB
instruction i+1         IF  ID  EX   MEM  WB
instruction i+2             IF  ID   EX   MEM  WB

TAKEN branch: 1 cycle penalty

instruction i       IF  ID  EX  MEM  WB
instruction i+1         IF  (flush this instruction in progress)
instruction T               IF  ID   EX   MEM  WB

4 Branch Hazard Alternatives: (3) Predict Branch TAKEN

- 53% of DLX branches are TAKEN on average
- Branch target address is available after ID in DLX
- DLX still incurs a 1 cycle branch penalty for a TAKEN branch
- Other machines: branch target known before outcome

NOT TAKEN branch: 2 cycle penalty in DLX (1 in other machines)

instruction i       IF  ID     EX  MEM  WB
Instruction T           stall  IF
Instruction i+1                    IF   ID  EX  MEM  WB

The TAKEN address is not available during the stall cycle.

TAKEN branch: 1 cycle penalty in DLX (0 in other machines)

instruction i       IF  ID     EX  MEM  WB
Instruction T           stall  IF  ID   EX  MEM  WB
Instruction T+1                    IF   ID  EX   MEM  WB

The TAKEN address is available after the stall cycle.

4 Branch Hazard Alternatives: (4) Delayed Branch

Delayed branch: delay the branch so that it takes place AFTER a successor instruction

    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n
    branch target if taken

Sequential successors 1 to n form a delayed branch of length n.

A 1-slot delayed branch allows the proper decision and the branch target address in the 5-stage DLX pipeline with the control hazard improvement

Delayed Branch
Where to get instructions to fill the branch delay slot?
- Before the branch instruction
- From the target address: only valuable when the branch is TAKEN
- From fall through: only valuable when the branch is NOT TAKEN
- Canceling branches allow more slots to be filled

Compiler effectiveness for a single delayed branch slot:
- Fills about 60% of delayed branch slots
- About 80% of instructions executed in delayed branch slots are useful in computation
- About 50% (60% x 80%) of slots usefully filled

4 Branch Hazard Alternatives: Delayed Branch

From before:
    ADD R1, R2, R3
    if R2=0 then
        [delay slot]
becomes
    if R2=0 then
        ADD R1, R2, R3
- Always improves performance
- Branch must not depend on the rescheduled instruction

From target:
    SUB R4, R5, R6
    ADD R1, R2, R3
    if R1=0 then
        [delay slot]
becomes
    ADD R1, R2, R3
    if R1=0 then
        SUB R4, R5, R6
- Improves performance when TAKEN (loop)
- Must be all right to execute the rescheduled instruction if NOT TAKEN
- May need to duplicate the instruction if it is the target of another branch instruction

From fall through:
    ADD R1, R2, R3
    if R1=0 then
        [delay slot]
    SUB R4, R5, R6
becomes
    ADD R1, R2, R3
    if R1=0 then
        SUB R4, R5, R6
- Improves performance when NOT TAKEN
- Must be all right to execute the instruction if TAKEN

Limitations on Delayed Branch


Difficulty in finding useful instructions to fill the delayed branch slots

Solution: squashing
- The delayed branch is associated with a branch prediction
- Instructions in the predicted path are executed in the delayed branch slot
- If the branch outcome is mispredicted, instructions in the delayed branch slot are squashed (discarded)

Canceling Branch
Used when delayed branch scheduling, i.e., filling the delay slot, cannot be done due to
- Restrictions on scheduling instructions into the delay slots
- Limitations on the ability to predict at compile time whether a branch will be TAKEN or NOT TAKEN

The instruction includes the direction in which the branch was predicted
- When the branch behaves as predicted, the instructions in the delay slot are executed
- When the branch is incorrectly predicted, the instructions in the delay slot are turned into no-ops

A canceling branch allows the delay slot to be filled even if the instruction placed there does not meet the requirements

Evaluating Branch Alternatives


\[
\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{\text{CPI}}
= \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}
\]

Conditional and unconditional branches collectively have 14% frequency; 65% of branches are TAKEN

Scheduling scheme    Branch penalty   CPI                       Speedup vs unpipelined   Speedup vs stall
Stall pipeline       3                1 + 0.14 x 3    = 1.42    5 / 1.42 = 3.5           1.0
Predict Taken        1                1 + 0.14 x 1    = 1.14    5 / 1.14 = 4.4           1.25
Predict Not Taken    1                1 + 0.14 x 0.65 = 1.09    5 / 1.09 = 4.6           1.30
Delayed branch       0.5              1 + 0.14 x 0.5  = 1.07    5 / 1.07 = 4.7           1.33
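The comparison of the four schemes can be recomputed from the speedup formula. This is a sketch using the figures given on the slide (14% branch frequency, pipeline depth 5, 65% of branches taken); the dictionary layout and the function name are assumptions.

```python
BRANCH_FREQUENCY = 0.14
PIPELINE_DEPTH = 5

# Average penalty per branch, in cycles, for each scheme.
schemes = {
    "Stall pipeline": 3,
    "Predict taken": 1,
    "Predict not taken": 0.65,  # 1-cycle penalty on the 65% of branches that are taken
    "Delayed branch": 0.5,
}


def evaluate(average_penalty):
    """Return (CPI, speedup over the unpipelined machine)."""
    cpi = 1 + BRANCH_FREQUENCY * average_penalty
    return cpi, PIPELINE_DEPTH / cpi


stall_cpi, _ = evaluate(schemes["Stall pipeline"])
for name, penalty in schemes.items():
    cpi, speedup = evaluate(penalty)
    print(f"{name:18s} CPI={cpi:.2f}  speedup={speedup:.2f}  vs stall={stall_cpi / cpi:.2f}")
```

Because the code works with unrounded CPI values, its last digit may differ slightly from figures computed from CPI values rounded to two decimals.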

Static(Compiler) Prediction of Taken/Untaken Branches


Code motion

        LW   R1, 0(R2)
        SUB  R1, R1, R3     <- depends on LW; needs a stall
        BEQZ R1, L
        OR   R4, R5, R6
        ADD  R10, R4, R3
    L:  ADD  R7, R8, R9

- If the branch is almost always NOT TAKEN, R4 is not needed on the taken path, and R5 and R6 are not modified in the following instruction(s), moving OR R4, R5, R6 up to follow the LW can increase speed
- If the branch is almost always TAKEN, R7 is not needed on the fall-through path, and R8 and R9 are not modified on the fall-through path, moving ADD R7, R8, R9 up to follow the LW can increase speed

Static(Compiler) Prediction of Taken/Untaken Branches


Improves the strategy for placing instructions in the delay slot

Two strategies
- Direction-based prediction: TAKEN for backward branches, NOT TAKEN for forward branches
- Profile-based prediction: record branch behavior, predict each branch based on the prior run(s)
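Direction-based prediction needs only the branch address and its target. The sketch below applies the rule to a small trace and measures the misprediction rate; the trace is hypothetical data made up for illustration, as are the function names.

```python
def predict_taken(branch_pc, target_pc):
    """Direction-based static prediction: backward taken, forward not taken."""
    return target_pc < branch_pc


def misprediction_rate(trace):
    """trace: iterable of (branch_pc, target_pc, was_taken) tuples."""
    trace = list(trace)
    wrong = sum(predict_taken(pc, target) != taken for pc, target, taken in trace)
    return wrong / len(trace)


# Hypothetical trace: a loop branch at 40 jumping back to 20 (taken 9 times,
# then falling through once), and a forward branch at 60 to 100 that is
# taken once in four executions.
trace = (
    [(40, 20, True)] * 9 + [(40, 20, False)]
    + [(60, 100, False)] * 3 + [(60, 100, True)]
)
print(round(misprediction_rate(trace), 3))
```

The rule mispredicts only the loop exit and the one taken forward branch in this trace, which is why it tends to suit loop-heavy code.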
[Figure: frequency of misprediction (misprediction rate) per benchmark: doduc, gcc, compress, espresso, hydro2d, swm256, mdljsp2, tomcatv, alvinn, ora. Series: "Always taken" and "Taken backwards, Not taken forwards". Axis scales: 0% to 70% and 0% to 14%.]

Evaluating Static Branch Prediction Strategies


- Misprediction rate ignores the frequency of branches
- Instructions between mispredicted branches is a better metric
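The two metrics can rank programs differently. This sketch converts a branch frequency and a misprediction rate into the average number of instructions between mispredicted branches; the two example programs are hypothetical, as is the function name.

```python
def instructions_per_misprediction(branch_fraction, misprediction_rate):
    """Average number of instructions executed per mispredicted branch."""
    mispredictions_per_instruction = branch_fraction * misprediction_rate
    return 1.0 / mispredictions_per_instruction


# Hypothetical program X: many branches, predicted fairly well.
x = instructions_per_misprediction(branch_fraction=0.20, misprediction_rate=0.10)
# Hypothetical program Y: few branches, predicted worse.
y = instructions_per_misprediction(branch_fraction=0.05, misprediction_rate=0.20)

print(round(x), round(y))
```

Program Y has twice the misprediction rate of X, yet it runs twice as many instructions between mispredictions, because branches are four times rarer in Y.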
[Figure: instructions per mispredicted branch (log scale, 1 to 100000) per benchmark: doduc, gcc, compress, espresso, hydro2d, swm256, mdljsp2, tomcatv, alvinn, ora. Series: "Profile-based" and "Direction-based".]

Pipelining Summary
- Just overlap tasks; easy if tasks are independent
- Speedup <= Pipeline depth; if ideal CPI is 1, then:

\[
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]

- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data: need forwarding, compiler scheduling
  - Control: dynamic prediction, delayed branch slot, static (compiler) prediction
