
Embedded Systems Design

Pipelining and Instruction Scheduling
Dr. Husain Parvez
Karachi Institute of Economics and Technology
husain.parvez@pafkiet.edu.pk
Fall 2016

Presentation Credit
Computer Organization and Design, David A. Patterson, John L. Hennessy (Third Edition), Chapter 6

Pipelining Introduction

What is Pipelining?

An implementation technique in which multiple instructions are overlapped in execution, much like an assembly line.

Why Pipelining?
To increase throughput.
To maximize hardware utilization.

Pipelining Laundry Example

Four people A, B, C and D want to wash their clothes.
Clothes need to be washed, dried, folded, and placed.
Washer, Drier, Folder and Placer each take 30 minutes.

Pipelining Laundry Example

The pipelined approach takes 3.5 hours, because the four stages work in parallel on different loads.

Pipelining Laundry Example


Ideally, the pipelined approach would be 4 times faster than the non-pipelined approach.
But at the start and end, the pipeline is not completely full. That is why the pipelined version is only 2.3 times faster than the non-pipelined one (3.5 hours instead of 8 hours).
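The 3.5 hour and 2.3 times figures can be checked with a short calculation. This is a minimal sketch: the function names are illustrative, and it assumes every stage takes the same time and nothing ever stalls.

```python
def sequential_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """Each load finishes every stage before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """The first load needs `stages` time slots; each later load adds one."""
    return (stages + loads - 1) * stage_minutes


if __name__ == "__main__":
    seq = sequential_minutes(4, 4, 30)   # four people, four 30-minute stages
    pipe = pipelined_minutes(4, 4, 30)
    print(seq / 60, "hours without pipelining")
    print(pipe / 60, "hours with pipelining")
    print("speedup:", round(seq / pipe, 1))
```

With many more loads the fill and drain time matters less, and the ratio moves toward the number of stages.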

Pipelining in Microprocessors
The same pipelining principle applies to microprocessors for the following five tasks:

Instruction fetch (from memory)
Instruction decode (and read registers)
Execute instruction (or calculate address)
Memory (access operand from memory)
Write back (write results into registers)

In this lecture, the MIPS pipeline has five stages.

A simple RISC Pipeline

On each cycle, another instruction is fetched and begins its 5-cycle execution.
If an instruction is started every cycle, performance increases by up to 5 times (compared to the non-pipelined version).

Pipelining principle
Improve performance by increasing instruction throughput
[Figure: program execution order (in instructions) against time for three loads:
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg.
Single-cycle non-pipelined execution: a new instruction starts every 8 ns.
Pipelined execution: a new instruction starts every 2 ns.]

The time of a single pipeline cycle is the maximum time required by any phase.

Pipelining speedup?
Ideal speedup = number of stages
Do we achieve this?
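Whether the ideal speedup is reached can be explored numerically. The sketch below assumes the timings of the lw example (8 ns per instruction without pipelining, five stages clocked at 2 ns) and a pipeline that never stalls; the function name and default values are illustrative only.

```python
def speedup(n_instructions: int, stages: int = 5,
            single_cycle_ns: float = 8.0, stage_ns: float = 2.0) -> float:
    """Speedup of a stall-free pipeline over single-cycle execution."""
    non_pipelined = n_instructions * single_cycle_ns
    # Fill the pipeline once, then complete one instruction per clock.
    pipelined = (stages + n_instructions - 1) * stage_ns
    return non_pipelined / pipelined


if __name__ == "__main__":
    for n in (3, 1000, 1_000_000):
        print(n, "instructions:", round(speedup(n), 2))
```

Under these assumptions the speedup approaches 4 rather than 5, because the clock is set by the slowest stage and the five stages are not perfectly balanced.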

A typical MIPS Pipelined datapath


[Figure: pipelined MIPS datapath. The pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB separate the stages containing the PC and instruction memory, the register file and sign extend, the ALU, and the data memory.]

Pipelining (What makes it easy)


All instructions are the same length.
Easier to fetch instructions in the first pipeline stage, and to decode them in the second stage.
Pipelining is more challenging when instruction sizes vary from 1 to 17 bytes.

Just a few instruction formats

Source register fields are located in the same place in each instruction.
Thus, the second stage can determine the type of instruction while reading the source registers.
If instruction formats were not symmetric, stage 2 would have to be split.

Memory operands appear only in loads and stores

Pipelining (What makes it hard)


What makes it hard?
Structural hazards: suppose we had only one memory
Control hazards: need to worry about branch instructions
Data hazards: an instruction depends on a previous instruction
Exceptions:

Hazards
Hazards are situations that prevent the next instruction in
the instruction stream from executing during its designated
clock cycle.

Hazard types:
Structural Hazards
Same resource is needed multiple times in the same cycle

Data Hazards
Data dependencies limit pipelining

Control Hazards
Next executed instruction may not be the next specified
instruction

Structural hazards
Examples:
Two accesses to a single-ported memory
Two operations need the same functional unit at the same time
Two operations need the same functional unit in successive cycles, but the unit is not pipelined
Solutions:
stalling
add more hardware

Structural hazards
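Non-pipelined units: two instructions need the same non-pipelined FU

[Figure: instruction stream against time, pipeline stages IF, ID, OF, EX, WB]
Instruction 1: IF ID OF EX WB
Instruction 2: IF ID OF EX EX WB
Instruction 3: IF ID OF (stall) EX EX WB
Instruction 4: IF ID (stall) OF EX WB
Instruction 5: IF (stall) ID OF EX WB
Instruction 6: IF ID OF EX WB
Instruction 7: IF ID OF EX WB

The FUs of instructions 2 and 3 take two execution cycles.
(stall) marks the stall cycle.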
A pipeline stall delays all the remaining instructions

Note: this example pipeline differs from the 5-stage MIPS pipeline

Structural hazards

A processor with one memory port will generate a conflict on memory references.

Data hazards
Data dependencies:
RaW (read-after-write)
WaW (write-after-write)
WaR (write-after-read)

Hardware solution:
Forwarding / Bypassing
Detection logic
Stalling

Software solution: Scheduling

Data dependences
Three types: RaW, WaR and WaW

add r1, r2, 5    ; r1 := r2+5
sub r4, r1, r3   ; RaW of r1

add r1, r2, 5
sub r2, r4, 1    ; WaR of r2

add r1, r2, 5
sub r1, r1, 1    ; WaW of r1

st  r1, 5(r2)    ; M[r2+5] := r1 (st=store)
ld  r5, 0(r4)    ; r5 := M[r4+0] (ld=load)
                 ; memory RaW if 5+r2 = 0+r4
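The three register dependence types can be detected mechanically by comparing the registers each instruction reads and writes. The sketch below is illustrative only: the `Instr` record and the `dependences` function are hypothetical names, and memory dependences (the st/ld case) are not handled.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    text: str
    dest: str               # register written ("" if none)
    srcs: Tuple[str, ...]   # registers read


def dependences(first: Instr, second: Instr) -> Set[str]:
    """Register dependences of `second` on the earlier instruction `first`."""
    found = set()
    if first.dest and first.dest in second.srcs:
        found.add("RaW")    # second reads what first wrote
    if second.dest and second.dest in first.srcs:
        found.add("WaR")    # second overwrites what first read
    if first.dest and first.dest == second.dest:
        found.add("WaW")    # both write the same register
    return found


if __name__ == "__main__":
    add = Instr("add r1, r2, 5", "r1", ("r2",))
    for later in (Instr("sub r4, r1, r3", "r4", ("r1", "r3")),
                  Instr("sub r2, r4, 1", "r2", ("r4",)),
                  Instr("sub r1, r1, 1", "r1", ("r1",))):
        print(later.text, "->", sorted(dependences(add, later)))
```

For the third pair the sketch reports RaW as well as WaW, because sub r1, r1, 1 also reads r1.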

RaW dependence Bypass circuitry


add r1, r2, 5    ; r1 := r2+5
sub r4, r1, r3   ; RaW of r1

Without bypass circuitry

time ->
add r1, r2, 5     IF  ID  OF  EX  WB
sub r4, r1, r3        IF  ID  --  --  OF  EX  WB      (-- = stall)

With bypass circuitry

time ->
add r1, r2, 5     IF  ID  OF  EX  WB
sub r4, r1, r3        IF  ID  OF  EX  WB

Saves two cycles

Forwarding/By-pass circuitry

Forwarding path from the output of the EX stage of add to the input of the EX stage of sub

Forwarding
Forwarding cannot prevent all pipeline stalls.
For example, when a sub needs the result of the load immediately before it, there is one cycle in which no instruction is issued.

Code reordering to prevent stalls

; Load variable B
; Load variable E
; Add B and E
; Store result at memory for variable A
; Load variable F
; Add B and F
; Store result at memory for variable C

Code reordering to prevent stalls
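The effect of reordering can be sketched by counting load-use stalls before and after. The instruction sequence below follows the comments listed above, but the registers ($t0 to $t5) and the reordering shown are assumptions made for illustration, as is the rule that a load followed immediately by a consumer of its result costs one stall on a forwarding pipeline.

```python
from typing import List, Optional, Tuple

# (opcode, register written or None, registers read)
Instr = Tuple[str, Optional[str], Tuple[str, ...]]

ORIGINAL: List[Instr] = [
    ("lw",  "$t1", ("$t0",)),         # Load variable B
    ("lw",  "$t2", ("$t0",)),         # Load variable E
    ("add", "$t3", ("$t1", "$t2")),   # Add B and E
    ("sw",  None,  ("$t3", "$t0")),   # Store result at memory for variable A
    ("lw",  "$t4", ("$t0",)),         # Load variable F
    ("add", "$t5", ("$t1", "$t4")),   # Add B and F
    ("sw",  None,  ("$t5", "$t0")),   # Store result at memory for variable C
]

# Assumed reordering: the load of F is moved up, ahead of the first add.
REORDERED: List[Instr] = [ORIGINAL[i] for i in (0, 1, 4, 2, 3, 5, 6)]


def load_use_stalls(code: List[Instr]) -> int:
    """Count loads whose result is read by the very next instruction."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls


if __name__ == "__main__":
    print("original stalls: ", load_use_stalls(ORIGINAL))
    print("reordered stalls:", load_use_stalls(REORDERED))
```

Under these assumptions the original order stalls after each load that feeds an add, while the reordered version places an independent instruction in between.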

Control hazards (also Branch hazard)


Control operations may change the sequential
flow of instructions

branch
jump
call (jump and link)
return
(exception/interrupt and rti / return from interrupt)

Branch example
[Figure: program execution order (in instructions) against time (in clock cycles CC 1 to CC 9) for the sequence
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
Each instruction passes through IM, Reg, DM, Reg.]

Branching (Solution-1)
Squash pipeline:
When we decide to branch, other instructions are already in the pipeline!
We are predicting "branch not taken".
We need to add hardware for flushing instructions if we are wrong.

Branch with predict not taken

[Figure: pipeline timing diagram (stages IF, ID, EX, MEM, WB against clock cycles) for a "Branch L" instruction under predict not taken, showing the instructions that follow it and the target label L:.]

Branching (Solution-2)
Intelligent Predictor:
Some branches are predicted as taken, some as untaken.
Dynamic hardware predictors predict at run time whether a branch will be taken or not.
Can predict with up to 90% accuracy.
When the guess is wrong, the pipeline must be emptied.

Longer pipelines can exacerbate the problem.
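One common dynamic scheme is a 2-bit saturating counter kept per branch. The slides do not say which predictor is meant, so the sketch below is only an illustration of the idea; the class name, the initial state and the example branch history are assumptions.

```python
from typing import Iterable, Optional


class TwoBitPredictor:
    """Saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state: int = 0) -> None:
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes: Iterable[bool],
             predictor: Optional[TwoBitPredictor] = None) -> float:
    """Fraction of outcomes predicted correctly, updating after each branch."""
    if predictor is None:
        predictor = TwoBitPredictor()
    outcomes = list(outcomes)
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


if __name__ == "__main__":
    # Hypothetical loop branch: taken 9 times, then not taken, repeated 10 times.
    history = ([True] * 9 + [False]) * 10
    print("accuracy:", accuracy(history))
```

Because two wrong guesses are needed to flip the prediction, a single loop exit does not disturb the prediction for the next run of the loop.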

Branching (Solution-3)
Delayed branch instruction:
The delayed branch always executes the next sequential instruction. (The branch takes effect after that one instruction.)
Hidden from the programmer, because the assembler can handle it automatically.

Extending MIPS architecture to Handle Multi-Cycle Operations
(Section A.5, Computer Architecture: A Quantitative Approach)

MIPS pipeline for multi-cycle operations


How can the MIPS pipeline be extended to handle multi-cycle operations?
It is highly impractical to require that multi-cycle operations complete in 1 cycle. Doing so would mean:
Slower clock
Enormous amount of logic in

Multi-cycle operations may include


Floating Point and Integer multiplier
Floating Point Adder
Floating Point and Integer Divider

MIPS pipeline for multi-cycle operations


Multi-cycle operations can be incorporated in the same pipeline.
The EX cycle may be repeated as many times as required.
The number of repetitions can vary for different operations.
There may be multiple multi-cycle functional units:
Multiplier: fully pipelined (depth = 7)
Floating-point adder: fully pipelined (depth = 4)
Divider: non-pipelined

MIPS pipeline for multi-cycle operations


Pipeline stages of different Multi-cycle operations can be

What problems can arise with multi-cycle operations?

1. The divide unit is not fully pipelined, so structural hazards can occur. An instruction must be stalled at issue if the previous instruction using the unit has not yet completed.
2. Instructions have varying running times, so the number of register writes required in a cycle can be larger than 1. Possible solution: increase the number of write ports, but the extra write port would rarely be used.

MIPS pipeline for multi-cycle operations


Pipeline stages of different Multi-cycle operations can be

3. Because of the longer latency of operations, stalls for RAW hazards will be more frequent.
4. Instructions complete in a different order than they were issued, thus causing problems in exception handling.

MIPS pipeline for multi-cycle operations

5. WAW hazards are possible, because instructions do not reach the WB stage in order.
NOTE: WAR hazards are not possible, because register reads always occur in the ID stage.
What can be the solution to resolve these problems?
A simpler solution: insert stalls to bring the WB stages back in sequence.

Multi-cycle code sequence

The figure shows a typical multi-cycle code sequence (showing the stalls arising from RAW hazards).
Longer pipelines substantially raise the number of stalls.
Note that each instruction in the above sequence is dependent on the previous instruction.
Each instruction proceeds as soon as its data is available.
The pipeline has full support for bypassing and forwarding.
S.D must be stalled an extra cycle so that its MEM operation does not conflict with the MEM of ADD.D. Otherwise, thanks to forwarding, the MEM of S.D would have been possible in cycle 16.

Multi-cycle Implementation
Interlocks implemented in hardware detect when to stall the pipeline.
To avoid two writes in the same cycle:
Track the use of the write port in the ID stage.
Stall an instruction before it is issued for execution.
If the instruction in ID needs to use the write port at the same time as an already issued instruction, then the instruction in ID is stalled for one cycle.
To avoid the possibility of a RAW hazard:
Stall the instruction for the RAW hazard.
To avoid the possibility of a WAW hazard:
If a RAW dependence comes before the WAW, then the stall for the RAW will handle it.
A WAW hazard occurs only in the case of useless instructions; normally there will be no two consecutive writes. (But in some rare cases WAW can arise.)
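The write-port check can be sketched as a reservation table consulted in ID. This is a simplified model under stated assumptions: a single write port, in-order issue, and a hypothetical `latency` parameter giving the number of cycles from issue to WB; the class and method names are illustrative.

```python
class WritePortTracker:
    """Records which future cycles already have the write port reserved."""

    def __init__(self) -> None:
        self.reserved = set()   # absolute cycle numbers of pending writes

    def issue(self, cycle: int, latency: int) -> int:
        """Issue at `cycle` or later and return the cycle actually used.

        The instruction waits in ID while its write-back cycle
        (issue cycle + latency) collides with an earlier reservation.
        """
        while cycle + latency in self.reserved:
            cycle += 1          # stall in ID for one cycle and retry
        self.reserved.add(cycle + latency)
        return cycle


if __name__ == "__main__":
    port = WritePortTracker()
    print(port.issue(1, 7))   # long-latency operation, writes in cycle 8
    print(port.issue(4, 4))   # would also write in cycle 8, so it is delayed
    print(port.issue(6, 1))   # writes in cycle 7, no conflict
```

A hardware implementation would typically keep this information in a small shift register rather than a set, but the check performed in ID is the same.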

Software solution to avoid Pipeline stalls
Instruction Scheduling

Instruction Scheduling
The compiler or programmer schedules instructions (i.e. modifies the sequence of code) to minimize hardware stalls.
Without changing the meaning of the code, the compiler rearranges the order of instructions to avoid pipeline stalls.

Scheduling Constraints
The scheduled/optimized program must generate the same result
as the original program generates.
All the operations executed in the original program must be
executed in scheduled/optimized program.
No over-usage of resources. Assignment of resources in a cycle
must comply with the available resources

Scheduling Constraints
Data Dependence:
The compiler or programmer must schedule the code keeping in view the following data dependences.
True dependence: write -> read (RAW hazard)
1. a =
2.   = a
Output dependence: write -> write (WAW hazard)
1. a =
2. a =
Anti dependence: read -> write (WAR hazard)
1.   = a
2. a =
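These constraints can be turned into a simple legality check for a proposed schedule. The sketch below is a hedged illustration: the tuple representation and the names `depends` and `is_legal_order` are assumptions, and resource limits are not modelled.

```python
from typing import List, Optional, Tuple

# (variable written or None, variables read)
Instr = Tuple[Optional[str], Tuple[str, ...]]


def depends(earlier: Instr, later: Instr) -> bool:
    """True if `later` has a true, anti or output dependence on `earlier`."""
    e_dst, e_src = earlier
    l_dst, l_src = later
    true_dep = e_dst is not None and e_dst in l_src       # write -> read
    anti_dep = l_dst is not None and l_dst in e_src       # read -> write
    output_dep = e_dst is not None and e_dst == l_dst     # write -> write
    return true_dep or anti_dep or output_dep


def is_legal_order(program: List[Instr], order: List[int]) -> bool:
    """`order` lists the original instruction indices in their new sequence."""
    if sorted(order) != list(range(len(program))):
        return False    # every operation must be executed exactly once
    position = {idx: pos for pos, idx in enumerate(order)}
    for i in range(len(program)):
        for j in range(i + 1, len(program)):
            if depends(program[i], program[j]) and position[i] > position[j]:
                return False    # a dependent pair was swapped
    return True


if __name__ == "__main__":
    program = [("a", ("b",)), ("c", ("a",)), ("d", ("e",))]
    print(is_legal_order(program, [0, 2, 1]))   # independent instruction moved up
    print(is_legal_order(program, [1, 0, 2]))   # breaks the true dependence on a
```

Any order accepted by this check keeps each dependent pair in its original relative order, which is what guarantees the same result as the original program.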

Pipelined MIPS Instructions


In MIPS, most instructions execute in 1 cycle.
Conditional branches require 2 cycles to complete.

addiu $t1, $t1, 1
addiu $t2, $t2, 1
beq   $t2, $t3, label

The above code requires 4 cycles to complete (1 for each add, 2 for the branch).

MIPS Branch Delay Slots


We can insert a nop to make the second branch cycle explicit.

addiu $t1, $t1, 1
addiu $t2, $t2, 1
beq   $t2, $t3, label
nop

The nop executes in the branch's second cycle.
This cycle is called the branch delay slot.

MIPS Branch Delay Slots (Cont.)


The code can be improved by scheduling something useful in the delay slot.

addiu $t2, $t2, 1
beq   $t2, $t3, label
addiu $t1, $t1, 1

This code is equivalent to the original code: the final state of the machine is the same.
But this code is 25% faster (it takes 3 cycles instead of 4).

MIPS Branch Delay Slots (Cont.)


Not every instruction can go in the delay slot.
If we try

addiu $t1, $t1, 1
beq   $t2, $t3, label
addiu $t2, $t2, 1

this code is no longer correct: it uses the wrong value of $t2 in the comparison.
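The rule behind the two examples can be written as a small check. This is a sketch of one necessary condition only: it assumes the candidate instruction originally sits before the branch and ignores any other instructions between them; the function name is hypothetical.

```python
from typing import Iterable


def can_move_into_delay_slot(candidate_dest: str,
                             branch_srcs: Iterable[str]) -> bool:
    """An instruction that precedes the branch may be moved into the delay
    slot only if the branch does not read the register it writes."""
    return candidate_dest not in set(branch_srcs)


if __name__ == "__main__":
    branch_srcs = ("$t2", "$t3")                          # beq $t2, $t3, label
    print(can_move_into_delay_slot("$t1", branch_srcs))   # addiu $t1, $t1, 1
    print(can_move_into_delay_slot("$t2", branch_srcs))   # addiu $t2, $t2, 1
```

Moving the $t1 update is safe because the comparison never looks at $t1, while moving the $t2 update changes the value the branch compares.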

Pipelined multi-cycle MIPS


Instruction scheduling becomes more complex when multi-cycle instructions are also supported by the pipeline.
If an instruction has a delay of 4 cycles, four unrelated instructions can be scheduled in those slots.
