
Communication Networks Institute

Prof. Dr.-Ing. C. Wietfeld

Working slides for students of the course only!


Disclosure to 3rd parties is strictly prohibited!

Computer Systems | Unit 4 Pipelining| Winterterm 2015 | Dipl.-Ing. Ralf Burda


Computer Systems
Communication Networks Institute
Prof. Dr.-Ing. Christian Wietfeld

Dipl.-Ing. Dipl.-Kfm. Ralf Burda


Pipelined Processors
The drawback of the simple processor architecture is that
each instruction requires several cycles to execute.
Improvement: Pipelined execution of instructions

Instruction execution is split into multiple phases


Different phases of multiple instructions are executed in parallel
A phase is called a pipeline stage
All pipeline stages together form a pipeline

Pipelines bring to processors the same idea that Henry Ford
brought to the automobile industry: the assembly line


No Pipeline


Pipelining


Super Pipelining


A Simple Pipeline
Time step 1:  Instruction n: fetch and decode
Time step 2:  Instruction n: execution    | Instruction n+1: fetch and decode
Time step 3:  Instruction n+1: execution  | Instruction n+2: fetch and decode

(Time runs from top to bottom.)
28.10.2015

Pipelines
Pipeline stages are connected by clocked pipeline registers
(also called latches)
The logic delay of each pipeline stage is at most one clock period
In the optimal case, each instruction requires k clock cycles to
pass a pipeline with k stages
If a new instruction enters the pipeline in each cycle, k
instructions are handled in parallel inside the pipeline and
one instruction leaves the pipeline in each cycle (in the ideal
case)
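
The behaviour described above can be sketched as a chain of clocked registers. This is a minimal illustration, not a model of a real processor; the function name `run_pipeline` and the instruction labels are assumptions made for this example.

```python
def run_pipeline(instructions, k):
    """Shift instructions through k pipeline registers, one step per clock.

    Returns a list of (cycle, instruction) pairs giving the cycle in which
    each instruction occupies the last stage and therefore completes.
    """
    stages = [None] * k            # stages[0] is the first pipeline stage
    pending = list(instructions)   # instructions that have not entered yet
    retired = []
    cycle = 0
    while len(retired) < len(instructions):
        cycle += 1
        # Clock edge: each register hands its content to the next stage,
        # and a new instruction (if any) enters the first stage.
        entering = pending.pop(0) if pending else None
        stages = [entering] + stages[:-1]
        if stages[-1] is not None:
            retired.append((cycle, stages[-1]))
    return retired


if __name__ == "__main__":
    for cycle, name in run_pipeline(["i1", "i2", "i3", "i4"], k=5):
        print(f"cycle {cycle}: {name} completes")
```

In this sketch the first instruction completes after k cycles, and from then on one instruction completes per cycle, which is the ideal case named on the slide.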


General Pipeline Architecture


[Figure: general pipeline architecture. Components shown: input, four pipeline registers, output, and a common clock signal driving the registers.]


Definitions
The latency is the time that an instruction requires to pass all
(relevant) pipeline stages. A pipeline with k stages has a
latency of k clock cycles in the ideal case.
The throughput of a pipeline specifies the number of
instructions that can leave the pipeline in a single cycle. This
value represents the (theoretical) performance of a pipeline.


Speedup
We assume n instructions and k steps that are required to
execute one instruction
A processor without pipeline requires n*k clock cycles
A processor with pipeline requires k+n-1 clock cycles
We assume an ideal pipeline with a latency of k and a throughput of 1

Speedup: S = (n*k) / (k+n-1)


As the number of instructions n grows toward infinity, the speedup
approaches the number of pipeline stages (S -> k).
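
The speedup formula can be checked numerically. The following is a small sketch; the function name `speedup` and the chosen values of n and k are assumptions for illustration only.

```python
def speedup(n, k):
    """Ideal pipeline speedup for n instructions and k stages.

    Without a pipeline: n * k cycles. With a pipeline: k + n - 1 cycles.
    """
    return (n * k) / (k + n - 1)


if __name__ == "__main__":
    k = 5
    for n in (1, 10, 100, 10_000):
        print(f"n = {n:>6}: S = {speedup(n, k):.3f}")
```

For a single instruction there is no gain (S = 1); as n grows, S moves toward k but never exceeds it.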


Basic Pipeline
Instruction fetch,
Instruction decode,
Operand fetch from the register file
(the memory where all registers are located)
Instruction execution inside the ALU (Arithmetic Logic Unit)
Write back of the result to the register file
Sometimes, instruction decode and operand fetch are combined in
one single pipeline stage.
Load/Store instructions require an address calculation and at least one
(additional) memory access stage.

The DLX Pipeline


5-deep pipeline, one stage per master clock cycle:

IF  -- Instruction Fetch
ID  -- Instruction Decode/Register Fetch
EX  -- Execute/Address Calculation
MEM -- Memory Access
WB  -- Write Back

IF  ID  EX  MEM WB
    IF  ID  EX  MEM WB
        IF  ID  EX  MEM WB
            IF  ID  EX  MEM WB
                IF  ID  EX  MEM WB

(The figure marks one column as the current CPU cycle.)
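
The staggered diagram of the five stages can be generated programmatically. This is an illustrative sketch; the helper name `pipeline_diagram` is an assumption, and the stage names are taken from the legend of this slide.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(n_instructions, stages=STAGES):
    """Return one text row per instruction; instruction i starts in cycle i + 1."""
    width = max(len(s) for s in stages) + 1   # column width incl. one space
    rows = []
    for i in range(n_instructions):
        cells = [""] * i + list(stages)       # i empty columns, then the stages
        rows.append("".join(c.ljust(width) for c in cells).rstrip())
    return rows


if __name__ == "__main__":
    print("\n".join(pipeline_diagram(5)))
```

Reading a column of the printed diagram from top to bottom shows which stage each instruction occupies in that cycle.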



Instruction Fetch
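[Figure: instruction fetch (IF) stage. Components shown: PC, adder with the constant 4, MUX, I-cache, and the IF/ID registers holding the PC and the instruction register; data paths are 32 bits wide.]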


Instruction Decode & Operand Fetch

[Figure: instruction decode/register fetch (ID) stage. Components shown: IF/ID registers (PC, instruction register); register file with 5-bit register addressing, a 5-bit result register selector, and a 32-bit registers write value; sign extension of the 16-bit immediate to 32 bits; ID/EX registers (PC, ALU input register 1, ALU input register 2, immediate register).]


Execute
[Figure: execution/effective address calculation (EX) stage. Components shown: ID/EX registers (PC, ALU input register 1, ALU input register 2, immediate register); two MUXes; ALU; "Zero?" test; EX/MEM registers (true/false conditional register, ALU output register, store value register).]


Memory Access and Write Back

[Figure: memory access/branch completion (MEM) and write back (WB) stages. Components shown: EX/MEM registers (true/false conditional register, ALU output register, store value register); jump/branch target address; load/store address; D-cache; MEM/WB registers (ALU result register, load memory data register); MUX; ALU result value.]


Problems of Processor Pipelines


Pipeline conflicts:
  Resource conflicts:
    Occur if two pipeline stages require the same resource at the same
    time
  Data conflicts:
    An operand is not yet available at the position where it is required
  Control conflicts:
    Occur at control-flow instructions

Resolving any kind of pipeline conflict reduces the throughput of
the pipeline


Resource Conflict
Instruction   Cycle number
              1  2  3  4  5  6  7  8  9  10
LOAD          IF ID EX ME WB
i+1              IF ID EX ME WB
i+2                 IF ID EX ME WB
i+3                    IF ID EX ME WB
i+4                       IF ID EX ME WB

Given: a microprocessor with shared memory for instructions and data.
Problem: access conflict during cycle 4 (LOAD is in ME while i+3 is in IF).
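
The conflict in cycle 4 can be found mechanically by comparing the cycle in which a memory instruction is in ME with the cycle in which a later instruction is in IF. The sketch below assumes one instruction is issued per cycle starting in cycle 1 and that only instructions flagged as memory accesses use the shared memory in ME; the function name `memory_conflicts` and the program encoding are assumptions for this example.

```python
STAGES = ["IF", "ID", "EX", "ME", "WB"]


def memory_conflicts(program):
    """Find cycles in which ME and IF both need the shared memory.

    program: list of (name, accesses_data_memory) in issue order.
    Returns a list of (cycle, instruction_in_ME, instruction_in_IF).
    """
    conflicts = []
    for i, (name, uses_memory) in enumerate(program):
        if not uses_memory:
            continue
        # Instruction i is in IF in cycle i + 1, so it is in ME three cycles later.
        me_cycle = (i + 1) + STAGES.index("ME")
        fetch_index = me_cycle - 1          # instruction whose IF falls in me_cycle
        if fetch_index < len(program):
            conflicts.append((me_cycle, name, program[fetch_index][0]))
    return conflicts


if __name__ == "__main__":
    program = [("LOAD", True), ("i+1", False), ("i+2", False),
               ("i+3", False), ("i+4", False)]
    print(memory_conflicts(program))
```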


Solution of Resource Conflicts


Instruction   Cycle number
              1  2  3  4  5  6  7  8  9  10
LOAD          IF ID EX ME WB
i+1              IF ID EX ME WB
i+2                 IF ID EX ME WB
i+3                    O  IF ID EX ME WB
i+4                          IF ID EX ME WB

O = bubble


Data Dependencies
Assume two instructions I1 and I2, where I1 precedes I2:
A true dependence (st) exists if I1 generates a result that is
required by I2
An anti-dependence (sa) exists if I1 reads a register that is
overwritten by I2
An output dependence (so) exists if both instructions write to
the same destination

Anti- and output dependencies are called false dependencies.


Example Dependency Graph


Formal structure:
IND: OPERATION, DEST, OP1, OP2

S1: ADD  R1,R2,2;     R1 = R2+2
S2: ADD  R4,R1,R3;    R4 = R1+R3
S3: MULT R3,R5,3;     R3 = R5*3
S4: MULT R3,R6,3;     R3 = R6*3

Dependency graph:
S1 -> S2: true dependence (R1)
S2 -> S3: anti-dependence (R3)
S3 -> S4: output dependence (R3)
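
The three kinds of dependence can be found by comparing destination and source registers of each instruction pair. The following is a simplified sketch: it checks every pair in program order and ignores intervening redefinitions of a register. The tuple encoding and the function name `dependences` are assumptions for this example; immediate operands are left out of the source sets.

```python
from itertools import combinations

# (name, destination register, set of source registers)
PROGRAM = [
    ("S1", "R1", {"R2"}),
    ("S2", "R4", {"R1", "R3"}),
    ("S3", "R3", {"R5"}),
    ("S4", "R3", {"R6"}),
]


def dependences(program):
    """Return (earlier, later, kind, register) for every dependent pair."""
    found = []
    for (n1, d1, s1), (n2, d2, s2) in combinations(program, 2):
        if d1 in s2:                 # later instruction reads the earlier result
            found.append((n1, n2, "true", d1))
        if d2 in s1:                 # later instruction overwrites an earlier source
            found.append((n1, n2, "anti", d2))
        if d1 == d2:                 # both write the same destination
            found.append((n1, n2, "output", d1))
    return found


if __name__ == "__main__":
    for dep in dependences(PROGRAM):
        print(dep)
```

Besides the three edges labelled in the graph, this pairwise check also reports an anti-dependence from S2 to S4 on R3, since S2 reads R3 and S4 overwrites it as well.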

Data Conflicts
Data conflicts can occur if two instructions with data
dependencies are located close to each other.
How close is "close" depends on the pipeline structure and the
actual instructions.
Three kinds of data conflicts can occur:
Read after write (RAW), caused by a true dependence
Write after read (WAR), caused by an anti-dependence
Write after write (WAW), caused by an output dependence


Data Conflicts
Cycle:          1  2  3  4  5  6  7  8  9
ADD R1 R2 R3    IF ID EX ME WB
SUB R4 R1 R5       IF ID EX ME WB
AND R6 R4 R1          IF ID EX ME WB
OR  R7 R1 R6             IF ID EX ME WB
XOR R8 R1 R4                IF ID EX ME WB

Dependencies from R1 (written by ADD):
SUB, AND: The new value is read before it has been written.
OR: No conflict if the registers are written at the
beginning of the WB phase and read at its end.
XOR: No conflict.

Data Conflicts
Cycle:          1  2  3  4  5  6  7  8  9
ADD R1 R2 R3    IF ID EX ME WB
SUB R4 R1 R5       IF ID EX ME WB
AND R6 R4 R1          IF ID EX ME WB
OR  R7 R1 R6             IF ID EX ME WB
XOR R8 R1 R4                IF ID EX ME WB

Dependencies from R4 (written by SUB):
AND: The new value is read before it has been written.
XOR: No conflict if the registers are written at the
beginning of the WB phase and read at its end.

Data Conflicts
Cycle:          1  2  3  4  5  6  7  8  9
ADD R1 R2 R3    IF ID EX ME WB
SUB R4 R1 R5       IF ID EX ME WB
AND R6 R4 R1          IF ID EX ME WB
OR  R7 R1 R6             IF ID EX ME WB
XOR R8 R1 R4                IF ID EX ME WB

Dependencies from R6 (written by AND):
OR: The new value is read before it has been written.
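
The classification on the three preceding slides follows from comparing the cycle in which a consumer reads its operands (ID) with the cycle in which the producer writes its result (WB). The sketch below assumes one instruction is issued per cycle, no stalls, and no forwarding; the function name `raw_status` and the status strings are assumptions for this example.

```python
# (mnemonic, destination register, source registers)
PROGRAM = [
    ("ADD", "R1", ("R2", "R3")),
    ("SUB", "R4", ("R1", "R5")),
    ("AND", "R6", ("R4", "R1")),
    ("OR",  "R7", ("R1", "R6")),
    ("XOR", "R8", ("R1", "R4")),
]


def raw_status(program):
    """Classify each true dependence by its read (ID) and write (WB) cycles."""
    result = []
    for i, (producer, dest, _) in enumerate(program):
        wb_cycle = i + 5                  # IF in cycle i + 1, WB four cycles later
        for j in range(i + 1, len(program)):
            consumer, _, sources = program[j]
            if dest not in sources:
                continue
            id_cycle = j + 2              # ID is the second stage
            if id_cycle < wb_cycle:
                status = "conflict"       # value is read before it is written
            elif id_cycle == wb_cycle:
                status = "ok if written early and read late in the cycle"
            else:
                status = "ok"
            result.append((dest, producer, consumer, status))
    return result


if __name__ == "__main__":
    for entry in raw_status(PROGRAM):
        print(entry)
```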


Solutions for Data Conflicts


Software solutions:
  Insertion of NOPs (no-operation instructions)
  Instruction reordering

Hardware solution:
  Data forwarding

[Figure: data forwarding hardware. Components shown: three-stage shift register with parallel output, two MUXes at the ALU, control logic.]


New Problem: Load Instructions


Cycle:          1  2  3  4  5  6  7  8
LW  R1 4(R2)    IF ID EX ME WB
ADD R4 R1 R3       IF ID EX ME WB      (depends on R1)
AND R5 R6 R7          IF ID EX ME WB
OR  R7 R6 R8             IF ID EX ME WB

With load instructions, data forwarding alone is not sufficient,
because the loaded value is available only after the ME stage.
One pipeline stall in addition to data forwarding is necessary.


Load Instructions
Cycle:          1  2  3  4  5  6  7  8  9
LW  R1 4(R2)    IF ID EX ME WB
ADD R4 R1 R3       IF ID O  EX ME WB
AND R5 R6 R7          IF O  ID EX ME WB
OR  R7 R6 R8             O  IF ID EX ME WB

O = bubble

With load instructions, data forwarding alone is not sufficient.
One pipeline bubble in addition to data forwarding is necessary.
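
The number of bubbles needed for a true dependence can be estimated from the distance between producer and consumer. This sketch assumes the five-stage pipeline of these slides, forwarding into the EX stage, and a register file that is written early and read late within the same cycle; the function name `stall_cycles` and its parameters are assumptions for this example.

```python
def stall_cycles(distance, producer_is_load, forwarding):
    """Bubbles needed before the consumer can proceed.

    distance: 1 means the consumer directly follows the producer.
    """
    if forwarding:
        # An ALU result is ready at the end of EX and can feed the next EX.
        # A loaded value is ready only at the end of ME, one cycle later.
        needed_distance = 2 if producer_is_load else 1
    else:
        # Without forwarding, the consumer's ID must not come before
        # the producer's WB, which requires a distance of three.
        needed_distance = 3
    return max(0, needed_distance - distance)


if __name__ == "__main__":
    print("LW then ADD, forwarding:   ", stall_cycles(1, True, True))
    print("ADD then SUB, forwarding:  ", stall_cycles(1, False, True))
    print("ADD then SUB, no forwarding:", stall_cycles(1, False, False))
```

Under these assumptions the load followed directly by a dependent ADD needs exactly one bubble, as shown in the diagram.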


Instruction Reordering
ADD R1 R2 R3
SUB R4 R1 R5
AND R6 R7 R8
OR R9 R10 R11
XOR R12 R13 R14

RAW conflict between ADD and SUB.
Can be removed by reordering the instructions.


Instruction Reordering
ADD R1 R2 R3
SUB R4 R1 R5
AND R6 R7 R8
OR R9 R10 R11
XOR R12 R13 R14

RAW conflict between ADD and SUB.
Can be removed by reordering the instructions.
Conditions:
- The reordering must not introduce any new dependence.
- The reordering must not change the result of the program.
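
Whether a given reordering respects the existing dependences can be checked pair by pair. The sketch below is one possible check for straight-line register code without memory accesses or branches; the names `depends` and `is_legal_reordering` and the tuple encoding are assumptions for this example.

```python
# (mnemonic, destination register, set of source registers)
PROGRAM = [
    ("ADD", "R1", {"R2", "R3"}),
    ("SUB", "R4", {"R1", "R5"}),
    ("AND", "R6", {"R7", "R8"}),
    ("OR", "R9", {"R10", "R11"}),
    ("XOR", "R12", {"R13", "R14"}),
]


def depends(first, second):
    """True if any true, anti- or output dependence links the two instructions."""
    _, d1, s1 = first
    _, d2, s2 = second
    return d1 in s2 or d2 in s1 or d1 == d2


def is_legal_reordering(original, reordered):
    """Every dependent pair must keep its original relative order.

    Assumes each mnemonic occurs only once in the program.
    """
    position = {instr[0]: idx for idx, instr in enumerate(reordered)}
    for i, first in enumerate(original):
        for second in original[i + 1:]:
            if depends(first, second) and position[first[0]] > position[second[0]]:
                return False
    return True


if __name__ == "__main__":
    add, sub, and_, or_, xor = PROGRAM
    print(is_legal_reordering(PROGRAM, [add, and_, or_, xor, sub]))  # SUB moved to the end
    print(is_legal_reordering(PROGRAM, [sub, add, and_, or_, xor]))  # SUB before ADD
```

Moving SUB behind the three independent instructions keeps the ADD before it while separating the two far enough that the RAW conflict no longer requires bubbles.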


Control Dependencies
Branching          IF ID EX ME WB
Instruction i+1       IF ID EX ME WB
Instruction i+2          IF ID EX ME WB
Instruction i+3             IF ID EX ME WB

During decoding (ID phase), it is determined whether the
instruction is a branch.
If so, the following instructions depend on the outcome
of the branch.


Control Dependencies
[Figure: pipeline diagram. The branch instruction passes through IF ID EX ME WB; the following instructions (i+1, i+2, i+3) are delayed by bubbles (O) before they pass through IF ID EX ME WB.]

During decoding (ID phase), it is determined whether the
instruction is a branch.
If so, the following instructions depend on the outcome
of the branch.
The pipeline must be filled with bubbles until the
result of the branch is known.
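
The cost of these bubbles can be estimated with the usual cycles-per-instruction calculation. This is a rough sketch; the function name `effective_cpi` and the numbers in the example (20 % branches, 3 bubbles per branch) are hypothetical and not taken from the slides.

```python
def effective_cpi(branch_fraction, bubbles_per_branch, base_cpi=1.0):
    """Average cycles per instruction when every branch inserts bubbles."""
    return base_cpi + branch_fraction * bubbles_per_branch


if __name__ == "__main__":
    # Hypothetical workload: every fifth instruction is a branch.
    cpi = effective_cpi(branch_fraction=0.2, bubbles_per_branch=3)
    print(f"CPI = {cpi:.2f}, throughput = {1 / cpi:.2f} instructions per cycle")
```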

Possible Improvement: Branch Delay Slot


AND R4 R2 R3
ADD R1 R2 R3
BEQ R1 R5 offset
NOP

Branch delay slot: the instruction following the
branch is executed in any case.
By reordering, it is possible to make effective
use of the branch delay slot.


Branch Delay Slot


AND R4 R2 R3
ADD R1 R2 R3
BEQ R1 R5 offset

Branch delay slot: the instruction following the
branch is executed in any case.
By reordering, it is possible to make effective
use of the branch delay slot.
The AND instruction can be moved into the slot, since
it does not depend on the branch result and neither
ADD nor BEQ uses its result (R4).

Variation of the von Neumann Idea: The Harvard Architecture
Separate memories and interconnects for code and data
This removes the bottleneck caused by the common von Neumann
interconnect and memory
Used especially by Digital Signal Processors (DSPs)
Nearly all modern processors have separate first-level caches:
The instruction cache supports read accesses only and is connected directly to
the fetch logic of the processor
The data cache supports both read and write accesses and is connected to
the data path of the processor
In general, both caches are connected to a common second-level cache or
directly to the common memory
