Embedded Systems Design: Pipelining and Instruction Scheduling

Dr. Husain Parvez
Karachi Institute of Economics and Technology
husain.parvez@pafkiet.edu.pk
Fall 2016
Presentation Credit
Computer Organization and Design, David A. Patterson, John L. Hennessy (Third Edition), Chapter 6
Pipelining Introduction
What is Pipelining?
An implementation technique in which multiple instructions are overlapped in execution, much like an assembly line.
Why Pipelining?
To increase throughput.
To maximize hardware utilization.
Pipelining Laundry Example
Four persons A, B, C and D want to wash their clothes.
Clothes need to be washed, dried, folded, and placed.
Washer, Drier, Folder and Placer each take 30 minutes.
The non-pipelined approach takes 8 hours (4 loads x 4 steps x 30 minutes).
The pipelined approach takes 3.5 hours, because the four steps work in parallel on different loads.
Ideally, the pipelined approach would be 4 times faster than the non-pipelined approach.
But at the start and end, the pipeline is not completely full. That is why the pipelined version is only 2.3 times faster than the non-pipelined one.
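The laundry numbers above can be checked with a few lines of Python. This is a minimal sketch; the function names are made up for illustration, and it assumes every step takes exactly 30 minutes and a new load can start as soon as the washer is free.

```python
LOADS = 4            # persons A, B, C and D
STAGES = 4           # wash, dry, fold, place
STAGE_MINUTES = 30

def sequential_minutes(loads, stages, stage_minutes):
    # Each load finishes every step before the next load starts.
    return loads * stages * stage_minutes

def pipelined_minutes(loads, stages, stage_minutes):
    # The first load needs `stages` steps; each later load finishes one step after it.
    return (stages + loads - 1) * stage_minutes

seq = sequential_minutes(LOADS, STAGES, STAGE_MINUTES)
pipe = pipelined_minutes(LOADS, STAGES, STAGE_MINUTES)
print(f"sequential: {seq / 60} h, pipelined: {pipe / 60} h, speedup: {seq / pipe:.2f}")
# prints "sequential: 8.0 h, pipelined: 3.5 h, speedup: 2.29"
```

The speedup stays below 4 because only 4 loads pass through the pipeline, so the fill and drain time is a large share of the total.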
Pipelining in Microprocessors
The same pipelining principle applies to microprocessors for the following five tasks:
Instruction fetch (from memory)
Instruction decode (and read registers)
Execute instruction (or calculate address)
Memory (access operand from memory)
Write back (write results into registers)
In this lecture, MIPS has five pipeline stages.
A simple RISC Pipeline
On each cycle, another instruction is fetched and begins its 5-cycle execution.
If an instruction is started every cycle, performance increases by up to 5 times (compared to the non-pipelined version).
Pipelining principle
Improve performance by increasing instruction throughput
[Figure: program execution order (in instructions) versus time for three loads, lw $1, 100($0), lw $2, 200($0) and lw $3, 300($0), each passing through Instruction fetch, Reg, ALU, Data access, Reg.]
Single-cycle non-pipelined execution: each instruction takes 8 ns, and the next instruction starts only after the previous one has finished.
Pipelined execution: a new instruction starts every 2 ns.
The time of a single pipeline cycle is the maximum time required by any one stage (phase).
Pipelining speedup?
Ideal speedup = number of stages
Do we achieve this?
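A rough way to answer the question is to compute the speedup for different program lengths. The sketch below assumes the figures from the timing diagram (8 ns per instruction without pipelining, five stages and a 2 ns cycle with pipelining) and ignores all hazards; the function names are illustrative only.

```python
def nonpipelined_time_ns(n_instructions, instr_ns=8):
    # Every instruction runs to completion before the next one starts.
    return n_instructions * instr_ns

def pipelined_time_ns(n_instructions, n_stages=5, cycle_ns=2):
    # Fill the pipeline once, then complete one instruction per cycle.
    return (n_stages + n_instructions - 1) * cycle_ns

def speedup(n_instructions):
    return nonpipelined_time_ns(n_instructions) / pipelined_time_ns(n_instructions)

for n in (3, 1000, 1_000_000):
    print(n, round(speedup(n), 2))
# prints 3 1.71, then 1000 3.98, then 1000000 4.0
```

Under these assumptions the speedup approaches 8 ns / 2 ns = 4 rather than 5: the stages are not perfectly balanced, and short programs also pay for filling the pipeline.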
A typical MIPS Pipelined datapath

[Figure: MIPS pipelined datapath. The pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB separate the five stages. Components shown: PC, instruction memory, adders, register file (read register 1 and 2, write register, write data, read data 1 and 2), sign extend (16 to 32 bits), shift left 2, ALU (zero flag, ALU result), data memory (address, write data, read data) and multiplexers.]
Pipelining (What makes it easy)

All instructions are the same length.
Easier to fetch instructions in the first pipeline stage, and to decode them in the second stage.
Pipelining is more challenging when instruction sizes vary from 1 to 17 bytes.
Just a few instruction formats
Source register fields are located in the same place in each instruction.
Thus, the second stage can determine the type of instruction while reading the source registers.
If instruction formats were not symmetric, stage 2 would have to be split.
Memory operands appear only in loads and stores
Pipelining (What makes it hard)

What makes it hard?
Structural hazards : suppose we had only one memory
Control hazards: need to worry about branch
instructions
Data hazards: an instruction depends on a previous
instruction
Exceptions:
Hazards
Hazards are situations that prevent the next instruction in
the instruction stream from executing during its designated
clock cycle.
Hazard types:
Structural Hazards
Same resource is needed multiple times in the same cycle
Data Hazards
Data dependencies limit pipelining
Control Hazards
Next executed instruction may not be the next specified
instruction
Structural hazards
Examples:
Two accesses to a single-ported memory
Two operations need the same functional unit at the same time
Two operations need the same functional unit in successive cycles, but the unit is not pipelined
Solutions:
stalling
add more hardware
Structural hazards
Non-pipelined units
[Figure: instruction stream versus time through the stages IF, ID, OF, EX, WB. Two instructions need the same non-pipelined FU, and the FUs of these two instructions take two execution cycles, so a stall cycle is inserted before the second one can enter EX.]
A pipeline stall delays all the remaining instructions
Note: this example pipeline differs from the 5-stage MIPS pipeline
Structural hazards
A processor with only one memory port will generate a conflict whenever an instruction fetch and a data memory reference fall in the same cycle.
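The memory-port conflict can be located mechanically. The sketch below assumes the five-stage pipeline IF, ID, EX, MEM, WB, one instruction issued per cycle with no stalls, and a single port shared by instruction fetch and data access; the function name and the list-of-opcodes input are hypothetical.

```python
def memory_port_conflicts(program):
    """Return (cycle, memory_instr_index, fetched_instr_index) for each clash."""
    memory_ops = {"lw", "sw"}
    conflicts = []
    for i, op in enumerate(program):
        if op in memory_ops:
            mem_cycle = i + 4            # instruction i is fetched in cycle i + 1
            fetch_index = mem_cycle - 1  # instruction being fetched in that cycle
            if fetch_index < len(program):
                conflicts.append((mem_cycle, i, fetch_index))
    return conflicts

print(memory_port_conflicts(["lw", "add", "sub", "and", "or"]))
# prints [(4, 0, 3)]
```

Here the load uses the memory in cycle 4, exactly when the fourth instruction should be fetched, so either the fetch stalls or a second port (for example separate instruction and data memories) is needed.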
Data hazards
Data dependencies:
RaW (read-after-write)
WaW (write-after-write)
WaR (write-after-read)
Hardware solution:
Forwarding / Bypassing
Detection logic
Stalling
Software solution: Scheduling
Data dependences
Three types: RaW, WaR and WaW
add r1, r2, 5     ; r1 := r2+5
sub r4, r1, r3    ; RaW of r1

add r1, r2, 5
sub r2, r4, 1     ; WaR of r2

add r1, r2, 5
sub r1, r1, 1     ; WaW of r1

st  r1, 5(r2)     ; M[r2+5] := r1 (st=store)
ld  r5, 0(r4)     ; r5 = M[r4 + 0] (ld=load)
                  ; memory RaW if 5+r2 = 0+r4
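The three register dependence types can be told apart by comparing destination and source registers. This is a minimal sketch; the `(dest, sources)` tuple layout and the function name are assumptions made for illustration, and memory dependences are not handled.

```python
def dependences(first, second):
    """Classify register dependences of `second` on an earlier instruction `first`."""
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = []
    if dest1 in srcs2:
        found.append("RaW")   # second reads what first wrote
    if dest2 in srcs1:
        found.append("WaR")   # second overwrites what first read
    if dest1 == dest2:
        found.append("WaW")   # both write the same register
    return found

add = ("r1", ["r2"])                            # add r1, r2, 5
print(dependences(add, ("r4", ["r1", "r3"])))   # sub r4, r1, r3 -> ['RaW']
print(dependences(add, ("r2", ["r4"])))         # sub r2, r4, 1  -> ['WaR']
print(dependences(add, ("r1", ["r1"])))         # sub r1, r1, 1  -> ['RaW', 'WaW']
```

Note that the third pair carries a RaW dependence in addition to the WaW one, because sub also reads r1.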
RaW dependence: bypass circuitry
add r1, r2, 5     ; r1 := r2+5
sub r4, r1, r3    ; RaW of r1
Without bypass circuitry
[Figure: add passes through IF, ID, OF, EX, WB; sub is fetched in the next cycle, but its operand fetch (OF) is delayed until add has written r1 in WB.]
With bypass circuitry
[Figure: sub follows add through IF, ID, OF, EX, WB one cycle behind, with no delay.]
Saves two cycles
Forwarding/Bypass circuitry
Forwarding path from the output of the EX stage of add to the input of the EX stage of sub
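The two-cycle saving can be reproduced with a small calculation. This sketch uses the IF, ID, OF, EX, WB pipeline of the example and assumes that a register written in WB can only be read in a later cycle, which is consistent with the "saves two cycles" figure; the helper names are invented for illustration.

```python
STAGES = ["IF", "ID", "OF", "EX", "WB"]   # pipeline used in this example

def cycle_in_stage(issue_cycle, stage, stall=0):
    # Cycle in which an instruction occupies `stage`, delayed by `stall` cycles.
    return issue_cycle + STAGES.index(stage) + stall

def raw_stalls(bypass):
    producer_issue, consumer_issue = 1, 2   # add, then sub right behind it
    stall = 0
    while True:
        if bypass:
            # The result goes from the producer's EX output to the consumer's EX input.
            ok = cycle_in_stage(consumer_issue, "EX", stall) > cycle_in_stage(producer_issue, "EX")
        else:
            # Assumption: the register is readable only after the producer's WB cycle.
            ok = cycle_in_stage(consumer_issue, "OF", stall) > cycle_in_stage(producer_issue, "WB")
        if ok:
            return stall
        stall += 1

print(raw_stalls(bypass=False), raw_stalls(bypass=True))   # prints 2 0
```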
Forwarding
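Forwarding cannot prevent all pipeline stalls.
For example, a sub that immediately follows the load producing its operand must still wait.
[Figure: pipeline diagram with a bubble; no instruction is issued in the stall cycle.]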
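Why a load still stalls can be seen by comparing when a value is produced with when it is needed. The sketch assumes the five-stage MIPS pipeline (IF, ID, EX, MEM, WB) with full forwarding, where an ALU result is ready at the end of EX and a loaded value at the end of MEM; the names are illustrative.

```python
READY_STAGE = {"alu": 2, "load": 3}   # stage index of EX and MEM in IF, ID, EX, MEM, WB
EX_INDEX = 2

def stalls_with_forwarding(producer_kind, distance=1):
    """Stall cycles for a consumer issued `distance` instructions after its producer."""
    ready_cycle = READY_STAGE[producer_kind]   # producer is fetched in cycle 0
    needed_cycle = distance + EX_INDEX         # consumer's EX cycle without stalls
    # The value must be produced in a cycle strictly before the consumer's EX.
    return max(0, ready_cycle + 1 - needed_cycle)

print(stalls_with_forwarding("alu"))                # prints 0
print(stalls_with_forwarding("load"))               # prints 1
print(stalls_with_forwarding("load", distance=2))   # prints 0
```

The last line is the opening for the compiler: if one independent instruction sits between the load and its use, the bubble disappears.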
Code reordering to prevent stalls
; Load variable B
; Load variable E
; Add B and E
; Store result at memory for variable A
; Load variable F
; Add B and F
; Store result at memory for variable C
Code reordering to prevent stalls
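The effect of reordering can be illustrated by counting load-use stalls before and after. The slides list only the comments, so the register names and offsets below are hypothetical, chosen to match the seven commented steps; the counter is a simplification that charges one bubble whenever an instruction uses the result of the load directly before it.

```python
# Each entry: (text, destination register or None, source registers).
ORIGINAL = [
    ("lw  $t1, 0($t0)",   "$t1", ["$t0"]),          # load variable B
    ("lw  $t2, 4($t0)",   "$t2", ["$t0"]),          # load variable E
    ("add $t3, $t1, $t2", "$t3", ["$t1", "$t2"]),   # add B and E
    ("sw  $t3, 12($t0)",  None,  ["$t3", "$t0"]),   # store result for A
    ("lw  $t4, 8($t0)",   "$t4", ["$t0"]),          # load variable F
    ("add $t5, $t1, $t4", "$t5", ["$t1", "$t4"]),   # add B and F
    ("sw  $t5, 16($t0)",  None,  ["$t5", "$t0"]),   # store result for C
]
# Move the third load up so that no add directly follows the load it depends on.
REORDERED = [ORIGINAL[i] for i in (0, 1, 4, 2, 3, 5, 6)]

def load_use_stalls(program):
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_text, prev_dest, _ = prev
        if prev_text.startswith("lw") and prev_dest in cur[2]:
            stalls += 1   # one bubble even with forwarding
    return stalls

print(load_use_stalls(ORIGINAL), load_use_stalls(REORDERED))   # prints 2 0
```

The move is legal here because the hoisted load reads a different address from the one the earlier store writes, and no register it touches is used in between.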
Control hazards (also called branch hazards)

Control operations may change the sequential
flow of instructions
branch
jump
call (jump and link)
return
(exception/interrupt and rti / return from interrupt)
Branch example
[Figure: program execution order (in instructions) versus time (in clock cycles CC 1 to CC 9); the stages are drawn as IM, Reg, DM, Reg for each of the following instructions.]
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
Branching (Solution-1)
Squash pipeline:
When we decide to branch, other instructions are already in the pipeline!
We predict that the branch is not taken and keep fetching sequentially.
Hardware must be added to flush those instructions if the prediction is wrong.
Branch with predict not taken
[Figure: clock-cycle diagram for "Branch L" under predict not taken: the branch, the sequential instructions fetched after it, and the target instruction at L: each pass through IF, ID, EX, MEM, WB.]
Intelligent Predictor:
Some branches are predicted as taken, some as untaken.
Dynamic hardware predictors decide at run time whether a branch will be taken or not.
They can predict with up to 90% accuracy.
When the guess is wrong, the pipeline must be emptied.
Longer pipelines can exacerbate the problem.
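The cost of wrong guesses can be estimated with a simple CPI model. In the sketch below only the 90% accuracy comes from the text above; the 20% branch frequency, the penalty values and the function name are assumptions chosen for illustration.

```python
def effective_cpi(branch_fraction, accuracy, penalty_cycles, base_cpi=1.0):
    # Every mispredicted branch costs `penalty_cycles` flushed pipeline slots.
    return base_cpi + branch_fraction * (1.0 - accuracy) * penalty_cycles

# Assumed workload: 20% of the instructions are branches, predicted with 90% accuracy.
for penalty in (1, 3, 10):   # a deeper pipeline resolves the branch later
    print(penalty, round(effective_cpi(0.20, 0.90, penalty), 2))
# prints 1 1.02, then 3 1.06, then 10 1.2
```

The penalty grows with the number of stages between fetch and branch resolution, which is why longer pipelines suffer more from each misprediction.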
Delayed branch instruction:
The delayed branch always executes the next
sequential instruction. (Branch taken after one
instruction)
Hidden from programmer, because assembler
can automatically handle it.
Extending MIPS architecture to Handle Multi-Cycle Operations
(Section A.5, Computer Architecture: A Quantitative Approach)
MIPS pipeline for multi-cycle operations

How can the MIPS pipeline be extended to handle multi-cycle operations?
It is highly impractical to require that multi-cycle operations complete in 1 cycle. Doing so would mean:
Slower clock
Enormous amount of logic in
Multi-cycle operations may include

Floating Point and Integer multiplier
Floating Point Adder
Floating Point and Integer Divider

Multi-cycle operations can be incorporated in same pipeline.
The EX cycle may be repeated as many times as required
Number of repetitions can vary for different operations
There may be multiple multi-cycle functional units.
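[Figure: MIPS pipeline with several functional units in the EX stage: one fully pipelined unit of depth 7, one fully pipelined unit of depth 4, and one non-pipelined unit.]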

Pipeline stages of different Multi-cycle operations can be
What can be possible problems with multi-cycle operations?
1. The divide unit is not fully pipelined, so structural hazards can occur. Issuing an instruction must be stalled if the previous instruction using the unit has not yet completed.
2. Instructions have varying running times, so the number of register writes required in a cycle can be larger than 1. Possible solution: increase the number of write ports, but the extra write port would rarely be used.
3. Because of the longer latency of operations, stalls for RAW hazards will be more frequent.
4. Instructions complete in a different order than they were issued, thus causing problems in exception handling.
5. WAW hazards are possible, because instructions do not reach the WB stage in order.
NOTE: A WAR hazard is not possible, because register reads always occur in the ID stage.
What can be the solution to resolve these problems?
A simpler solution: insert stalls to bring the WB in sequence.
Multi-cycle code sequence
Figure shows a typical multi-cycle code sequence (showing the stalls arising from RAW hazards).
Longer pipelines substantially raise the number of stalls.
Note that each instruction in the above sequence is dependent on the previous instruction.
Each instruction proceeds as soon as its data is available.
The pipeline has full support for bypassing and forwarding.
S.D must be stalled an extra cycle, so that its MEM operation does not conflict with the MEM of the ADD.D. Otherwise, due to forwarding, the MEM of S.D would have been possible in cycle 16.
Multi-cycle Implementation
Interlocks implemented in hardware that detect when to stall
pipeline.
To avoid two writes in the same cycle:
Track the use of the write port in the ID stage.
Stall an instruction before it is issued for execution.
If instruction in ID needs to use the write-port at the same time as
an already issued instruction, then instruction in ID is stalled for
one cycle.
To avoid possibility of RAW hazard:
Stall the instruction for RAW hazard
To avoid possibility of WAW hazard:
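If a RAW hazard comes before the WAW hazard, the stall for the RAW hazard also takes care of the WAW hazard.
A WAW hazard otherwise occurs only with useless instructions, whose result is overwritten before it is read; normally there will be no two consecutive writes to the same register. (But in some rare cases WAW can arise)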
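The write-port check described above can be sketched as a reservation table that is consulted in ID. The code assumes one register write port, in-order issue, and that every instruction writes its register in WB, two stages after its last EX cycle (EX..., MEM, WB); the function name and the latency values are illustrative, not taken from the slides.

```python
def issue_with_write_port_check(latencies):
    """Return the ID cycle chosen for each instruction.

    `latencies` holds the number of EX cycles per instruction, in program order.
    """
    reserved = set()     # cycles in which the write port is already booked
    id_cycles = []
    cycle = 1
    for ex_cycles in latencies:
        while cycle + ex_cycles + 2 in reserved:
            cycle += 1   # stall in ID for one cycle and check again
        reserved.add(cycle + ex_cycles + 2)
        id_cycles.append(cycle)
        cycle += 1       # the next instruction reaches ID one cycle later
    return id_cycles

# A 4-cycle operation followed by four single-cycle operations.
print(issue_with_write_port_check([4, 1, 1, 1, 1]))   # prints [1, 2, 3, 5, 6]
```

The fourth instruction would reach WB in cycle 7 together with the long operation, so it is held in ID for one cycle; everything behind it is delayed by the same amount.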
Software solution to avoid pipeline stalls
Instruction Scheduling
Compiler or programmer schedules instructions (i.e. modifies the
sequence of code) to minimize the hardware stalls.
Without changing the meaning of the code, the compiler rearranges the order of instructions to avoid pipeline stalls.
Scheduling Constraints
The scheduled/optimized program must generate the same result
as the original program generates.
All the operations executed in the original program must be
executed in scheduled/optimized program.
No over-usage of resources. Assignment of resources in a cycle
must comply with the available resources
Scheduling Constraints
Data Dependence:
Compiler or programmer must schedule the code keeping in view
the following data dependences.
True dependence: write -> read (RAW hazard)
1. a =
2.   = a
Output dependence: write -> write (WAW hazard)
1. a =
2. a =
Anti dependence: read -> write (WAR hazard)
1. = a
2. a =
Pipelined MIPS Instructions

In MIPS, most instructions execute in 1 cycle.
Conditional branches require 2 cycles to complete
addiu $t1, $t1, 1
addiu $t2, $t2, 1
beq   $t2, $t3, label
The above code requires 4 cycles to complete (1 for each add, 2 for the branch)
MIPS Branch Delay Slots

We can insert a nop to make the second branch cycle explicit
addiu $t1, $t1, 1
addiu $t2, $t2, 1
beq   $t2, $t3, label
nop
The nop executes in the branch's second cycle.
This cycle is called the branch delay slot.
MIPS Branch Delay Slots (Cont.)

The code can be improved by scheduling something useful in the
delay slots
addiu $t2, $t2, 1
beq   $t2, $t3, label
addiu $t1, $t1, 1
This code is equivalent to the original code.

The final state of the machine is the same.
But this code is 25% faster (takes 3 cycles instead of 4 cycles)
MIPS Branch Delay Slots (Cont.)

Any instruction can go in the delay slot, but only if moving it there preserves the meaning of the program.
If we try
addiu $t1, $t1, 1
beq   $t2, $t3, label
addiu $t2, $t2, 1
This code is no longer correct.
The branch now compares the old value of $t2, before it has been incremented.
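The difference between a legal and an illegal delay-slot fill can be checked with a tiny interpreter. This is a sketch under simplifying assumptions: straight-line code, one delay slot, one cycle per instruction slot, and a made-up tuple encoding of the instructions; the starting register values are chosen only to expose the bug.

```python
def run(program, regs):
    """Execute an addiu/beq/nop sequence; return (registers, branch taken, cycles)."""
    regs = dict(regs)
    taken = False
    for op, *args in program:
        if op == "addiu":
            dest, src, imm = args
            regs[dest] = regs[src] + imm
        elif op == "beq":
            a, b = args
            taken = regs[a] == regs[b]   # compared when the branch executes
        # "nop" does nothing. The instruction after beq sits in the delay slot
        # and executes whether or not the branch is taken.
    return regs, taken, len(program)

start = {"$t1": 0, "$t2": 4, "$t3": 5}
original = [("addiu", "$t1", "$t1", 1), ("addiu", "$t2", "$t2", 1),
            ("beq", "$t2", "$t3"), ("nop",)]
good = [("addiu", "$t2", "$t2", 1), ("beq", "$t2", "$t3"),
        ("addiu", "$t1", "$t1", 1)]
bad = [("addiu", "$t1", "$t1", 1), ("beq", "$t2", "$t3"),
       ("addiu", "$t2", "$t2", 1)]

for name, prog in (("original", original), ("good", good), ("bad", bad)):
    print(name, run(prog, start))
```

With these starting values the original and the good schedule both take the branch, in 4 and 3 cycles respectively, while the bad schedule compares 4 with 5 and falls through, even though its final register contents look the same.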
Pipelined multi-cycle MIPS

Instruction scheduling becomes more complex when multi-cycle instructions are also supported by the pipeline.
If an instruction has a delay of 4 cycles, four independent instructions can be scheduled in those slots.
