Pipelining

Communication Networks Institute
Prof. Dr.-Ing. C. Wietfeld
Working slides for students of the course only!

Disclosure to 3rd parties is strictly prohibited!
Computer Systems | Unit 4 Pipelining| Winterterm 2015 | Dipl.-Ing. Ralf Burda
Slide 1

Computer Systems
Prof. Dr.-Ing. Christian Wietfeld
Dipl.-Ing. Dipl.-Kfm. Ralf Burda
Slide 2

Pipelined Processors
The drawback of the simple processor architecture is that
each instruction requires several cycles to execute.
Improvement: Pipelined execution of instructions
Instruction execution is split into multiple phases

Different phases of multiple instructions are executed in parallel
A phase is called a pipeline stage
All pipeline stages together form a pipeline
Pipelines bring the same idea to processors that Henry Ford
brought to the automobile industry: the assembly line
Slide 3

No Pipeline
Slide 4

Pipelining
Slide 5

Super Pipelining
Slide 6

A Simple Pipeline
Instruction n:     Fetch and decode | Execution
Instruction n+1:                    | Fetch and decode | Execution
Instruction n+2:                                       | Fetch and decode
Time ->
28.10.2015
Slide 7

Pipelines
Pipeline stages are connected by clocked pipeline registers
(also called latches).
Each pipeline stage's logic delay is at most one clock period.
In the optimal case, each instruction requires k clock cycles to
pass a pipeline with k stages.
If a new instruction enters the pipeline in each cycle, k
instructions are handled in parallel inside the pipeline and
one instruction leaves the pipeline in each cycle (in the ideal
case).
Slide 8

General Pipeline Architecture

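[Figure: general pipeline architecture. Input -> Register -> stage logic -> Register -> stage logic -> Register -> stage logic -> Register -> Output; all registers are driven by a common clock signal.]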
Slide 9

Definitions
The latency is the time that an instruction requires to pass all
(relevant) pipeline stages. A pipeline with k stages shows a
latency of k clock cycles in the ideal case.
The throughput of a pipeline specifies the number of
instructions that can leave the pipeline in a single cycle. This
value represents the (theoretical) performance of a pipeline
Slide 10

Speedup
We assume n instructions and k steps that are required to
execute one instruction
A processor without pipeline requires n*k clock cycles
A processor with pipeline requires k+n-1 clock cycles
We assume an ideal pipeline with a latency of k and a throughput of 1
Speedup: S = (n*k) / (k+n-1)

For a very large number of instructions (n -> infinity), the speedup
approaches the number of pipeline stages (S -> k).
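As a quick numerical check of the formulas above, the following Python sketch evaluates both cycle counts and the resulting speedup. The function names and the choice k = 5 are illustrative assumptions, not part of the slides.

```python
def cycles_without_pipeline(n: int, k: int) -> int:
    """n instructions, each needing k steps, executed one after another."""
    return n * k


def cycles_with_pipeline(n: int, k: int) -> int:
    """Ideal pipeline: k cycles to fill, then one instruction per cycle."""
    return k + n - 1


def speedup(n: int, k: int) -> float:
    """S = (n*k) / (k+n-1)"""
    return cycles_without_pipeline(n, k) / cycles_with_pipeline(n, k)


# Assumed example: a 5-stage pipeline and growing instruction counts.
k = 5
for n in (1, 10, 100, 10_000):
    print(f"n={n:>6}  S={speedup(n, k):.3f}")
```

For a single instruction the speedup is 1 (no gain), and it moves toward k as n grows.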
Slide 11

Basic Pipeline
Instruction fetch,
Instruction decode,
Operand fetch from the register file
(the memory where all registers are located)
Instruction execution inside the ALU (Arithmetic Logic Unit)
Write back of the result to the register file
Sometimes, instruction decode and operand fetch are combined in
one single pipeline stage.
Load/Store instructions require an address calculation and at least one
(additional) memory access stage.
Slide 12

The DLX Pipeline

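5-deep pipeline, one stage per master clock cycle:
IF  -- Instruction Fetch
ID  -- Instruction Decode/Register Fetch
EX  -- Execute/Address Calculation
MEM -- Memory Access
WB  -- Write Back

[Figure: five consecutive instructions, each entering the pipeline one master clock cycle after its predecessor; a marker highlights the current CPU cycle.]
IF   ID   EX   MEM  WB
     IF   ID   EX   MEM  WB
          IF   ID   EX   MEM  WB
               IF   ID   EX   MEM  WB
                    IF   ID   EX   MEM  WB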

Slide 13

Slide 14

Instruction Fetch
[Figure: instruction fetch (IF) stage. Components: PC, MUX, adder (Add, constant 4), I-cache; 32-bit paths. Outputs are latched in the IF/ID registers: PC and instruction register.]
Slide 15

Instruction Decode & Operand Fetch
[Figure: instruction decode/register fetch (ID) stage. Inputs from the IF/ID registers: PC and instruction register. Components: register file with 5-bit register addressing, 5-bit result register selector and 32-bit registers write value; sign extension from 16 to 32 bits. Outputs are latched in the ID/EX registers: PC, ALU input register 1, ALU input register 2, immediate register.]
Slide 16

Execute
[Figure: execution/effective address calculation (EX) stage. Inputs from the ID/EX registers: PC, ALU input register 1, ALU input register 2, immediate register. Components: two MUXes, ALU, zero test (Zero?). Outputs are latched in the EX/MEM registers: conditional register (true/false, 1 bit), ALU output register (32 bits), store value register.]
Slide 17

Memory Access and Write Back
[Figure: memory access/branch completion (MEM) and write back (WB) stages. Inputs from the EX/MEM registers: conditional register (true/false), ALU output register, store value register. Labeled signals: jump/branch target address, load/store address, ALU result value. Components: D-cache, MUX. Outputs are latched in the MEM/WB registers: ALU result register and load memory data register.]
Slide 18

Problems of Processor Pipelines

Pipeline conflicts:
Resource conflicts:
Occur if two pipeline stages require the same resource at the same
time
Data conflicts:
An operand is currently not available at the required position
Control conflicts:
Appear at control flow instructions
Resolving any kind of pipeline conflict reduces the throughput of
the pipeline
Slide 19

Resource Conflict
Instruction  Cycle number
             1   2   3   4   5   6   7   8   9   10
LOAD         IF  ID  EX  ME  WB
i+1              IF  ID  EX  ME  WB
i+2                  IF  ID  EX  ME  WB
i+3                      IF  ID  EX  ME  WB
i+4                          IF  ID  EX  ME  WB
Given: microprocessor with shared memory for instructions and data.
Problem: Access conflict during cycle 4 (ME of LOAD and IF of i+3 both access the memory).
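The access conflict in the example above can be located mechanically. The sketch below assumes the five-stage pipeline of the slides, one instruction issued per cycle without stalls, and a single shared memory port; the function name and the tuple layout of the program are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "ME", "WB"]


def memory_conflicts(program):
    """Find cycles in which a data access (ME) and a fetch (IF) collide.

    program: list of (name, accesses_data_memory).
    Instruction number i (0-based) enters IF in cycle i + 1.
    Returns a list of (cycle, instruction_in_ME, instruction_in_IF).
    """
    conflicts = []
    me_offset = STAGES.index("ME")  # ME comes 3 cycles after IF
    for i, (name, accesses_data) in enumerate(program):
        if not accesses_data:
            continue
        me_cycle = (i + 1) + me_offset
        fetch_index = me_cycle - 1  # instruction whose IF falls into me_cycle
        if fetch_index < len(program):
            conflicts.append((me_cycle, name, program[fetch_index][0]))
    return conflicts


PROGRAM = [("LOAD", True), ("i+1", False), ("i+2", False),
           ("i+3", False), ("i+4", False)]
print(memory_conflicts(PROGRAM))
```

For the sequence of the slide, the only reported conflict is in cycle 4 between LOAD and i+3.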
Slide 20

Solution of Resource Conflicts

Instruction  Cycle number
             1   2   3   4   5   6   7   8   9   10
LOAD         IF  ID  EX  ME  WB
i+1              IF  ID  EX  ME  WB
i+2                  IF  ID  EX  ME  WB
i+3                      O   IF  ID  EX  ME  WB
i+4                              IF  ID  EX  ME  WB
O = bubble: the fetch of i+3 is delayed by one cycle.
Slide 21

Data Dependencies
Assume two instructions I1 and I2, with I1 preceding I2:
A true dependence (s_t) exists if I1 generates a result that is
required by I2
An anti dependence (s_a) exists if I1 reads a register that is
overwritten by I2
An output dependence (s_o) exists if both instructions write to
the same destination
Anti and output dependencies are called false dependencies.
Slide 22

Example Dependency Graph

Formal structure:
IND: OPERATION DEST, OP1, OP2
S1: ADD  R1, R2, 2     R1 = R2+2
S2: ADD  R4, R1, R3    R4 = R1+R3
S3: MULT R3, R5, 3     R3 = R5*3
S4: MULT R3, R6, 3     R3 = R6*3
Dependency graph:
S1 -> S2: True Dependence (R1)
S2 -> S3: Anti Dependence (R3)
S3 -> S4: Output Dependence (R3)
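The three kinds of dependence in the example above can be derived from the destination and source registers alone. The following sketch is a simplified pairwise check under the assumption that immediates can be ignored and that no intermediate instruction redefines a register; the data layout and function name are illustrative.

```python
from itertools import combinations

# (label, destination register, source registers); immediates are omitted.
PROGRAM = [
    ("S1", "R1", {"R2"}),        # R1 = R2 + 2
    ("S2", "R4", {"R1", "R3"}),  # R4 = R1 + R3
    ("S3", "R3", {"R5"}),        # R3 = R5 * 3
    ("S4", "R3", {"R6"}),        # R3 = R6 * 3
]


def dependences(program):
    """Return (earlier, later, kind, register) for every dependent pair."""
    found = []
    for (a, a_dst, a_src), (b, b_dst, b_src) in combinations(program, 2):
        if a_dst in b_src:       # later instruction reads the earlier result
            found.append((a, b, "true", a_dst))
        if b_dst in a_src:       # later instruction overwrites an earlier source
            found.append((a, b, "anti", b_dst))
        if a_dst == b_dst:       # both write the same destination
            found.append((a, b, "output", a_dst))
    return found


for dep in dependences(PROGRAM):
    print(dep)
```

Because every pair is compared, the check also reports the anti dependence between S2 and S4 on R3 in addition to the three dependences between neighbouring instructions.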
Slide 23

Data Conflicts
Data conflicts can occur if two instructions with data
dependencies are located close to each other.
What counts as "close" depends on the pipeline structure and the
actual instructions.
Three kinds of data conflicts can occur:
Read after write (RAW), caused by a true dependence
Write after read (WAR), caused by an anti dependence
Write after write (WAW), caused by an output dependence
Slide 24

Data Conflicts
Instruction       Cycle number
                  1   2   3   4   5   6   7   8   9
ADD R1, R2, R3    IF  ID  EX  ME  WB
SUB R4, R1, R5        IF  ID  EX  ME  WB
AND R6, R4, R1            IF  ID  EX  ME  WB
OR  R7, R1, R6                IF  ID  EX  ME  WB
XOR R8, R1, R4                    IF  ID  EX  ME  WB
Dependencies on R1:
SUB, AND: conflict. R1 is read (ID phase) before the new value has been written (WB phase of ADD).
OR: no conflict, if the registers are written at the beginning of the cycle (WB phase of ADD) and read at its end (ID phase of OR).
XOR: no conflict.
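The reasoning about R1 can be reproduced by comparing the cycle in which a register is written (WB) with the cycle in which a later instruction reads it (ID). The sketch assumes the five-stage pipeline above, no stalls and no forwarding; the function name and the flag `split_cycle_regfile` (write at the beginning, read at the end of the same cycle) are hypothetical.

```python
ID_STAGE, WB_STAGE = 2, 5  # positions within IF ID EX ME WB (1-based)

# (opcode, destination, sources) for the sequence of the slide.
PROGRAM = [
    ("ADD", "R1", {"R2", "R3"}),
    ("SUB", "R4", {"R1", "R5"}),
    ("AND", "R6", {"R4", "R1"}),
    ("OR",  "R7", {"R1", "R6"}),
    ("XOR", "R8", {"R1", "R4"}),
]


def raw_conflicts(program, split_cycle_regfile=True):
    """Return (producer, consumer, register) for every RAW conflict.

    Instruction number i (0-based) enters IF in cycle i + 1.
    """
    conflicts = []
    for i, (producer, dest, _) in enumerate(program):
        write_cycle = (i + 1) + WB_STAGE - 1
        for j in range(i + 1, len(program)):
            consumer, _, sources = program[j]
            if dest not in sources:
                continue
            read_cycle = (j + 1) + ID_STAGE - 1
            same_cycle_ok = split_cycle_regfile and read_cycle == write_cycle
            if read_cycle <= write_cycle and not same_cycle_ok:
                conflicts.append((producer, consumer, dest))
    return conflicts


print(raw_conflicts(PROGRAM))
print(raw_conflicts(PROGRAM, split_cycle_regfile=False))
```

Besides the conflicts on R1 discussed on the slide, the same check also flags the dependences on R4 (SUB to AND) and R6 (AND to OR) that are present in this sequence.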
Slide 25

Solutions for Data Conflicts

Software solutions:
Insertion of NOPs (no operation instructions)
Instruction reordering
Hardware solution:
Data forwarding
[Figure: data forwarding. A three-stage shift register with parallel output, control logic, and a MUX in front of each ALU input.]
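The software solution "insertion of NOPs" can be sketched as a small pass over the instruction list. The assumptions are a five-stage pipeline without forwarding in which a register written in WB can be read in ID during the same cycle, so a consumer must follow its producer at a distance of at least 3 instructions. The function name and data layout are illustrative only.

```python
MIN_DISTANCE = 3  # WB is stage 5, ID is stage 2: 5 - 2 = 3 instruction slots

# (opcode, destination, sources)
PROGRAM = [
    ("ADD", "R1", {"R2", "R3"}),
    ("SUB", "R4", {"R1", "R5"}),
    ("AND", "R6", {"R4", "R1"}),
    ("OR",  "R7", {"R1", "R6"}),
    ("XOR", "R8", {"R1", "R4"}),
]


def insert_nops(program):
    """Insert NOPs so that no instruction reads a register too early."""
    out = []
    for opcode, dest, sources in program:
        needed = 0
        # Inspect the slots that are closer than MIN_DISTANCE.
        for back in range(1, MIN_DISTANCE):
            if back <= len(out):
                prev_dest = out[-back][1]
                if prev_dest is not None and prev_dest in sources:
                    needed = max(needed, MIN_DISTANCE - back)
        out.extend([("NOP", None, set())] * needed)
        out.append((opcode, dest, sources))
    return out


print([opcode for opcode, _, _ in insert_nops(PROGRAM)])
```

Under these assumptions the five instructions grow to eleven slots, which illustrates why reordering or hardware forwarding is usually preferred over NOPs.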
Slide 28

New Problem: Load Instructions

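Instruction       Cycle number
                  1   2   3   4   5   6   7   8
LW  R1, 4(R2)     IF  ID  EX  ME  WB
ADD R4, R1, R3        IF  ID  EX  ME  WB
AND R5, R6, R7            IF  ID  EX  ME  WB
OR  R7, R6, R8                IF  ID  EX  ME  WB
Dependence: ADD needs R1 at the start of its EX phase (cycle 4), but LW delivers the value only at the end of its ME phase (cycle 4).
With load instructions, data forwarding alone is not sufficient.
One pipeline stall in addition to data forwarding is necessary.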
Slide 29

Load Instructions
Instruction       Cycle number
                  1   2   3   4   5   6   7   8   9
LW  R1, 4(R2)     IF  ID  EX  ME  WB
ADD R4, R1, R3        IF  ID  O   EX  ME  WB
AND R5, R6, R7            IF  O   ID  EX  ME  WB
OR  R7, R6, R8                O   IF  ID  EX  ME  WB
O = bubble
With load instructions, data forwarding alone is not sufficient.
One pipeline bubble in addition to data forwarding is necessary.
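The load case can be expressed as a simple rule: with forwarding, only an instruction that directly follows a load and reads the loaded register needs one bubble. The sketch below counts these bubbles; it assumes full forwarding into the EX stage, and the function name and data layout are hypothetical.

```python
# (opcode, destination, sources); 4(R2) means the address register R2 is read.
PROGRAM = [
    ("LW",  "R1", {"R2"}),
    ("ADD", "R4", {"R1", "R3"}),
    ("AND", "R5", {"R6", "R7"}),
    ("OR",  "R7", {"R6", "R8"}),
]


def load_use_stalls(program):
    """Count bubbles caused by a load directly followed by a consumer."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_opcode, prev_dest, _ = previous
        _, _, sources = current
        if prev_opcode == "LW" and prev_dest in sources:
            stalls += 1  # loaded value is available only after ME
    return stalls


print(load_use_stalls(PROGRAM))
```

Moving an independent instruction between LW and ADD would bring the count to zero, which connects this case to instruction reordering.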
Slide 30

Instruction Reordering
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R7, R8
OR  R9, R10, R11
XOR R12, R13, R14
RAW conflict between ADD and SUB (register R1).
The conflict can be removed by reordering the instructions.
Conditions:
- The reordering must not introduce any new dependence
- The reordering must not change the result of the program
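One way to check the two conditions is to require that every pair of dependent instructions keeps its original relative order. The sketch below applies this to the sequence above; the reordered sequence (AND and OR moved between ADD and SUB) is an assumed example, and the function names are illustrative.

```python
# (opcode, destination, sources); opcodes serve as unique labels here.
ORIGINAL = [
    ("ADD", "R1",  {"R2", "R3"}),
    ("SUB", "R4",  {"R1", "R5"}),
    ("AND", "R6",  {"R7", "R8"}),
    ("OR",  "R9",  {"R10", "R11"}),
    ("XOR", "R12", {"R13", "R14"}),
]


def depends(earlier, later):
    """True for a true, anti or output dependence between two instructions."""
    _, e_dst, e_src = earlier
    _, l_dst, l_src = later
    return e_dst in l_src or l_dst in e_src or e_dst == l_dst


def is_legal_reordering(original, reordered):
    """Dependent instructions must appear in their original order."""
    position = {instr[0]: i for i, instr in enumerate(reordered)}
    for i, earlier in enumerate(original):
        for later in original[i + 1:]:
            if depends(earlier, later) and position[earlier[0]] > position[later[0]]:
                return False
    return True


by_name = {instr[0]: instr for instr in ORIGINAL}
REORDERED = [by_name[n] for n in ("ADD", "AND", "OR", "SUB", "XOR")]
print(is_legal_reordering(ORIGINAL, REORDERED))
```

In the assumed reordering, SUB follows ADD at a distance of three instructions, so its ID phase coincides with the WB phase of ADD.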
Slide 32

Control Dependencies
Branching          IF  ID  EX  ME  WB
Instruction i+1        IF  ID  EX  ME  WB
Instruction i+2            IF  ID  EX  ME  WB
Instruction i+3                IF  ID  EX  ME  WB
During the decoding (ID phase) it is determined
whether the instruction is a branch or not.
If so, the following instructions depend on the outcome
of the branch.
Slide 33

Control Dependencies
[Figure: pipeline diagram of the branch and instructions i+1 to i+3. The branch passes IF ID EX ME WB without delay; the following instructions are held back by bubbles (O) before they pass IF ID EX ME WB.]
During the decoding (ID phase) it is determined
whether the instruction is a branch or not.
If so, the following instructions depend on the outcome
of the branch.
The pipeline must be filled with bubbles until the
result of the branch is known.
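The cost of these bubbles can be estimated with a simple model: every branch adds a fixed number of bubble cycles to the ideal one cycle per instruction. The sketch below is only an estimate under stated assumptions; the branch fraction of 20 % and the penalty of 3 bubbles are hypothetical values, not figures from the slides.

```python
def effective_cpi(branch_fraction: float, bubbles_per_branch: int,
                  base_cpi: float = 1.0) -> float:
    """Average cycles per instruction when each branch inserts bubbles."""
    return base_cpi + branch_fraction * bubbles_per_branch


def throughput_loss(branch_fraction: float, bubbles_per_branch: int) -> float:
    """Fraction of the ideal throughput that is lost to branch bubbles."""
    return 1.0 - 1.0 / effective_cpi(branch_fraction, bubbles_per_branch)


# Hypothetical workload: every fifth instruction is a branch, 3 bubbles each.
print(effective_cpi(0.2, 3))
print(round(throughput_loss(0.2, 3), 3))
```

The model makes it plausible why reducing the number of bubbles per branch, for example with a branch delay slot, pays off.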
Slide 34

Possible Improvement: Branch Delay Slot

AND R4, R2, R3
ADD R1, R2, R3
BEQ R1, R5, offset
NOP
Branch delay slot: the instruction following the
branch will be executed in any case.
By reordering, it is possible to make effective
use of the branch delay slot.
Slide 35

Branch Delay Slot

AND R4, R2, R3
ADD R1, R2, R3
BEQ R1, R5, offset
Branch delay slot: the instruction following the
branch will be executed in any case.
By reordering, it is possible to make effective
use of the branch delay slot.
The AND instruction can be moved into the slot, since
the branch does not depend on its result and it is
executed regardless of the branch outcome.
Resulting order: ADD, BEQ, AND.
Slide 36

Variation of the von Neumann Idea: The Harvard Architecture
Separate memories and interconnects for code and data
The bottleneck caused by the common von Neumann interconnect and
memory is eliminated
Used especially by Digital Signal Processors (DSPs)
Nearly all modern processors comprise separate first-level caches:
The instruction cache supports read accesses only and is connected directly to
the fetch logic of the processor
The data cache supports both read and write accesses and is connected to
the data path of the processor
In general, both caches are connected to a common second-level cache or
directly to the common memory
Slide 37
