
PIPELINING AND VECTOR PROCESSING

Parallel Processing

Pipelining & Hazards

RISC Pipeline

Vector Processor

Array Processor

1
Parallel Processing
PARALLEL PROCESSING

Execution of concurrent events in the computing process to achieve faster computational speed

Levels of Parallel Processing

- Job or Program level

- Task or Procedure level

- Inter-Instruction level

- Intra-Instruction level

2
Basics of Parallel Computing (contd.)
Consider another scenario: you have to get a color image printed on a stack of papers. Would you rather:
- For each sheet, print red, then green, then blue, and only then take up the next paper? or
- As soon as you complete printing red on a paper, advance it to the next color, and meanwhile take a new paper for printing red?
3
Parallel Processing
PARALLEL COMPUTERS

Architectural Classifications
Flynn's classification [1966]
Based on the multiplicity of Instruction Streams and
Data Streams
Instruction Stream
Sequence of Instructions read from memory
Data Stream
Operations performed on the data in the processor

                           Number of Data Streams
                           Single      Multiple
Number of      Single      SISD        SIMD
Instruction
Streams        Multiple    MISD        MIMD
4
Flynn's Classification
SISD (Single Instruction Single Data): Uniprocessors.
MISD (Multiple Instruction Single Data): No practical examples exist.
SIMD (Single Instruction Multiple Data): Specialized processors.
MIMD (Multiple Instruction Multiple Data): General purpose, commercially important.
5
SISD

Control unit --(IS)--> Processing Unit --(DS)--> Memory Module
6
Parallel Processing

SISD COMPUTER SYSTEMS

Characteristics

- Standard von Neumann machine


- Instructions and data are stored in memory
- One operation at a time

Limitations

Von Neumann bottleneck

- Maximum speed of the system is limited by the memory bandwidth (bits/sec or bytes/sec)
- Limitation on memory bandwidth: memory is shared by CPU and I/O

7
SIMD
Control unit --(IS)--> Processing Units 1..n, each --(DS)--> its own Memory Module
(a single instruction stream is broadcast to n processing units, each operating on its own data stream)

8
Parallel Processing
TYPES OF SIMD COMPUTERS

Array Processors

- The control unit broadcasts instructions to all PEs, and all active PEs execute the same instructions
- ILLIAC IV, GF-11, Connection Machine, DAP, MPP

Systolic Arrays

- Regular arrangement of a large number of very simple processors constructed on VLSI circuits
- CMU Warp, Purdue CHiP

9
MIMD
Control units 1..n --(IS)--> Processing Units 1..n --(DS)--> Memory Modules 1..n
(each processor has its own control unit, instruction stream, and data stream)

10
Classification for MIMD
Computers
Shared Memory:
Processors communicate through a shared
memory.
Typically processors connected to each
other and to the shared memory through a
bus.
Distributed Memory:
Processors do not share any physical
memory.
Processors connected to each other
through a network.

11
Shared Memory
Shared memory located at a
centralized location:
May consist of several interleaved
modules -- same distance (access
time) from any processor.
Also called Uniform Memory Access
(UMA) model.

12
Distributed Memory
Memory is distributed to each processor:
Improves scalability.
Non-Uniform Memory Access (NUMA)
(a) Message passing architectures No
processor can directly access another
processors memory.
(b) Distributed Shared Memory (DSM)
Memory is distributed, but the address
space is shared.

13
UMA vs. NUMA Computers

(a) UMA Model: processors P1..Pn, each with a cache, share one main memory over a common bus.
(b) NUMA Model: processors P1..Pn, each with a cache and its own main memory, communicate through a network.


14
Parallel Processing Demands Concurrent Execution
An efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process.
- Parallel events may occur in multiple resources during the same interval of time.
- Simultaneous events may occur at the same time.
- Pipelined events may occur in overlapped time spans.

15
Parallel Computer Structure
Emphasis on parallel processing.
Basic architectural features of parallel computers are:
- Pipelined computers perform overlapped computation to exploit temporal parallelism.
- Array processors use multiple synchronized arithmetic logic units to achieve spatial parallelism.
- Multiprocessor systems achieve asynchronous parallelism through a set of interactive processors with shared resources.

16
Pipelined Computer
Segments: IF ID EX MEM
- Multiple pipeline cycles.
- A pipeline cycle can be set equal to the delay of the slowest stage.
- Operations of all stages are synchronized under a common clock.
- Interface latches are used between the segments to hold the intermediate results.

17
Scalar Pipeline
Instruction preprocessing (IF, ID, OF over the instruction stream from memory) feeds two paths: a scalar fetch loads the scalar registers, which feed scalar pipelines SP1..SPn, while a vector fetch loads the vector registers, which feed vector pipelines VP1..VPn.

Functional structure of a pipeline computer with scalar & vector capabilities

18
Array processor
Synchronous parallel computer with multiple arithmetic logic units [same function at the same time].
- By replication of ALUs, the system can achieve spatial parallelism.
- An appropriate data routing mechanism must be established among the PEs.
- Scalar & control-type instructions are directly executed in the control unit.
- Each PE consists of one ALU with registers & local memory.
19
Array processor
Vector instructions are broadcast to PEs for
distributed execution over different component
operands fetched directly from the local
memory.
IF & decode are done by the control unit.
Array processors designed with associative memory are called associative processors.
Array processors are much more difficult to program than pipelined machines.

20
The control unit (with a control processor and control memory) handles scalar processing and I/O; vector instructions and data are broadcast over the data bus to PE1..PEn, each a processor P with its own memory M, linked by an inter-PE connection network [data routing].

Functional structure of SIMD array processor with concurrent scalar processing in control unit

21
Multiprocessor Systems
Multiprocessor systems lead to improved throughput, reliability, & flexibility.
The entire system must be controlled by a single integrated OS providing interaction between processors & their programs at various levels.
Each processor has its own local memory & I/O devices.
Inter-processor communication can be done through shared memories or through an interrupt network.
22
Multiprocessor Systems
There are three different types of inter-processor communication:
- Time-shared common bus
- Crossbar switch network
- Multiport memories
A centralized computing system is one in which all H/W & S/W resources are housed in the same computing center with negligible communication delays among subsystems.
Multiprocessors may be loosely coupled or tightly coupled.

23
Memory modules 1..n form the shared memory, connected to processors P1..Pn (each with a local memory LM) through an interprocessor-memory interconnection network [bus, crossbar, or multiport]; an input-output interconnection network links the I/O channels, and an inter-processor interrupt network links the processors.

Functional design of an MIMD multiprocessor system

24


Traditional Pipeline Concept

Laundry Example
- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold.
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes
25
Traditional Pipeline
Concept
[Timeline, 6 PM to midnight: loads A, B, C, D each take 30 + 40 + 20 minutes, one after another]
Sequential laundry takes 6 hours for 4 loads.
If they learned pipelining, how long would laundry take?
26
Traditional Pipeline
Concept
[Timeline, 6 PM onward: after the first 30-minute wash, the 40-minute dryer paces tasks A, B, C, D in overlapped order, ending with a 20-minute fold]
Pipelined laundry takes 3.5 hours for 4 loads.
27
A Pipeline Example
Consider two alternate ways in which
an engineering college can work:
Approach 1. Admit a batch of students, and the next batch is admitted only after the already admitted batch completes (i.e. admit once every 4 years).
Approach 2. Admit students every year.
In the second approach:
- The average number of students graduating per year increases four times.
28
Pipelining

First Year Second Year Third Year Fourth Year


First Year Second Year Third Year Fourth Year
First Year Second Year Third Year Fourth Year

29
Pipelined Execution
Time

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB


Program Flow
IFetch Dcd Exec Mem WB

30
Advantages of Pipelining
An n-stage pipeline:
Can improve performance up to n times.
Not much investment in hardware:
No replication of hardware resources
necessary.
The principle deployed is to keep the
units as busy as possible.
Transparent to the programmers:
Easy to use
31
Basic Pipelining Terminologies
Pipeline cycle (or Processor cycle):
The time required to move an instruction
one step further in the pipeline.
Not to be confused with clock cycle.

Synchronous pipeline:
Pipeline cycle is constant (clock-driven).
Asynchronous pipeline:
Time for moving from stage to stage varies
Handshaking communication between stages
32
Synchronous Pipeline
- Transfers between stages are simultaneous.
- One task or operation enters the pipeline
per cycle.

Input → [L] S1 [L] S2 [L] ... [L] Sk [L] → Output
(stages S1..Sk separated by latches L; a common clock drives all transfers; τm = stage delay, d = latch delay)
33
Asynchronous Pipeline
- Transfers performed when individual stages are
ready.
- Handshaking protocol between processors.
Input → S1 → S2 → ... → Sk → Output
(Ready/Ack handshaking signals between adjacent stages)
- Different amounts of delay may be experienced at different stages.
- Can display variable throughput rate.

34
A Few Pipeline Concepts

Stages Si and Si+1 are separated by a latch; τm = stage delay, d = latch delay.

Pipeline cycle: τ = max {τm} + d
Latch delay: d
Pipeline frequency: f = 1 / τ

35
Example on Clock period
Suppose the time delays of the 4 stages are τ1 = 60 ns, τ2 = 50 ns, τ3 = 90 ns, τ4 = 80 ns, & the interface latch has a delay of d = 10 ns. What is the value of the pipeline cycle time?

τ = max {τm} + d
Hence the cycle time of this pipeline is: τ = 90 + 10 = 100 ns
Clock frequency of the pipeline: f = 1/100 ns = 10 MHz

If it is non-pipelined, then the total delay = 60 + 50 + 90 + 80 = 280 ns
36
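The arithmetic above can be checked with a short script (the stage and latch delays are the slide's values):

```python
# Pipeline cycle time is set by the slowest stage plus the latch delay.
stage_delays_ns = [60, 50, 90, 80]   # tau_1..tau_4 from the example
latch_delay_ns = 10                  # d

cycle_ns = max(stage_delays_ns) + latch_delay_ns   # tau = max{tau_m} + d
freq_mhz = 1000 / cycle_ns                         # f = 1/tau (ns -> MHz)
non_pipelined_ns = sum(stage_delays_ns)            # all four stages in sequence

print(cycle_ns, freq_mhz, non_pipelined_ns)        # 100 10.0 280
```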
Pipelining
PIPELINE SPEEDUP
n: Number of tasks to be performed

Conventional Machine (Non-Pipelined)
tn: Clock cycle
t1: Time required to complete the n tasks
t1 = n * tn

Pipelined Machine (k stages)
tp: Clock cycle (time to complete each suboperation)
tk: Time required to complete the n tasks
tk = (k + n - 1) * tp

Speedup
Sk: Speedup
Sk = n * tn / [(k + n - 1) * tp]

lim (n → ∞) Sk = tn / tp   ( = k, if tn = k * tp )
37
Pipelining
PIPELINE AND MULTIPLE FUNCTION UNITS

Example
- 4-stage pipeline
- suboperation in each stage; tp = 20 ns
- 100 tasks to be executed
- 1 task in a non-pipelined system: 20 * 4 = 80 ns

Pipelined System
(k + n - 1) * tp = (4 + 100 - 1) * 20 = 2060 ns
Non-Pipelined System
n * k * tp = 100 * 80 = 8000 ns
Speedup
Sk = 8000 / 2060 = 3.88

A 4-stage pipeline is basically identical to a system with 4 identical function units: instructions Ii, Ii+1, Ii+2, Ii+3 handled by multiple functional units P1, P2, P3, P4.

38
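The speedup formula can be checked numerically; the sketch below plugs in the example's values (k = 4 stages, n = 100 tasks, tp = 20 ns, tn = k*tp = 80 ns):

```python
def pipeline_speedup(k, n, tn, tp):
    """Sk = n*tn / ((k + n - 1)*tp): n tasks on a k-stage pipeline."""
    return (n * tn) / ((k + n - 1) * tp)

s = pipeline_speedup(k=4, n=100, tn=80, tp=20)
print(round(s, 2))                                              # 3.88

# As n grows, Sk approaches tn/tp, which equals k when tn = k*tp.
print(round(pipeline_speedup(k=4, n=10**6, tn=80, tp=20), 2))   # 4.0
```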
Drags on Pipeline Performance

Things are actually not so easy, due to the following factors:
Difficult to balance the stages
Pipeline overheads: latch delays
Hazards

39
Exercise
Consider an unpipelined processor:
Takes 4 cycles for ALU and other operations
5 cycles for memory operations.
Assume the relative frequencies:
ALU and other=60%,
memory operations=40%
Cycle time =1ns
Compute speedup due to pipelining:
Ignore effects of branching.
Assume pipeline overhead = 0.2ns
40
Solution
Average instruction execution time for
large number of instructions:
Unpipelined = 1 ns * (60% * 4 + 40% * 5) = 4.4 ns
Pipelined = 1 + 0.2 = 1.2 ns
Speedup = 4.4 / 1.2 = 3.7 times

41
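The exercise arithmetic can be verified directly (instruction mix and cycle counts as given above):

```python
cycle_time_ns = 1.0
overhead_ns = 0.2
# (fraction of instructions, cycles needed) for each instruction class
mix = [(0.60, 4),   # ALU and other operations
       (0.40, 5)]   # memory operations

unpipelined_ns = cycle_time_ns * sum(f * c for f, c in mix)
pipelined_ns = cycle_time_ns + overhead_ns   # one cycle per instruction + overhead

speedup = unpipelined_ns / pipelined_ns
print(round(unpipelined_ns, 1), round(speedup, 1))   # 4.4 3.7
```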
Drags on Pipeline Performance
Factors that can degrade pipeline
performance:
Unbalanced stages
Pipeline overheads
Clock skew
Hazards

Hazards cause the worst drag on the performance of a pipeline.
42
Pipeline Hazards
What is a pipeline hazard?
A situation that prevents an instruction from executing during its designated clock cycle.
There are 3 classes of hazards:
Structural Hazards
Data Hazards
Control Hazards

43
Pipeline Hazards
Hazards can result in incorrect
operations:
Structural hazards: Two instructions
requiring the same hardware unit at same
time.
Data hazards: Instruction depends on result
of a prior instruction that is still in pipeline
Data dependency
Control hazards: Caused by delay in decisions
about changes in control flow (branches and
jumps).
Control dependency
44
Structural Hazards
Arise from resource conflicts among
instructions executing concurrently:
Same resource is required by two (or more)
concurrently executing instructions at the same
time.
Easy ways to avoid structural hazards:
- Duplicate resources (sometimes not practical)
- Memory interleaving (lower & higher order)
Examples of resolution of structural hazards:
- An ALU to perform an arithmetic operation and an adder to increment PC.
- Separate data cache and instruction cache accessed simultaneously in the same cycle.

45


Structural Hazard: Example

IF ID EXE MEM WB
IF ID EXE MEM WB
IF ID EXE MEM WB
IF ID EXE MEM WB

46
An Example of a Structural
Hazard
Load             Mem Reg ALU DM Reg
Instruction 1        Mem Reg ALU DM Reg
Instruction 2            Mem Reg ALU DM Reg
Instruction 3                Mem Reg ALU DM Reg
Instruction 4                    Mem Reg ALU DM Reg

Time → Would there be a hazard here?

47
How is it Resolved?
Load             Mem Reg ALU DM Reg
Instruction 1        Mem Reg ALU DM Reg
Instruction 2            Mem Reg ALU DM Reg
Stall                        Bubble Bubble Bubble Bubble Bubble
Instruction 3                    Mem Reg ALU DM Reg

Time → A pipeline can be stalled by inserting a bubble or NOP.

48
Data Hazards
Occur when an instruction under execution
depends on:
Data from an instruction ahead in pipeline.
Example: A=B+C; D=A+E;

A=B+C; IF ID EXE MEM WB


D=A+E; IF ID EXE MEM WB

Dependent instruction uses old data: results in wrong computations.

49
Types of Data Hazards
Data hazards are of three types:
Read After Write (RAW)
Write After Read (WAR)
Write After Write (WAW)
With an in-order execution machine:
WAW, WAR hazards cannot occur.
Assume instruction i is issued before instruction j.

50
Read after Write (RAW) Hazards
Hazard between two instructions i & j may occur when j attempts to read some data object that has been modified by i:
- Instruction j tries to read its operand before instruction i writes it.
- j would incorrectly receive an old or incorrect value.
Example:
i: ADD R1, R2, R3
j: SUB R4, R1, R6

Instruction i is a write instruction issued before j; instruction j is a read instruction issued after i.

51
Read after Write (RAW)
Hazards
Instn. I writes into its range R(I); Instn. J reads from its domain D(J).

R(I) ∩ D(J) ≠ Ø for RAW
52
RAW Dependency: More
Examples
Example program (a):
i1: load r1, addr;
i2: add r2, r1,r1;
Program (b):
i1: mul r1, r4, r5;
i2: add r2, r1, r1;
In both cases, i2 does not get its operand until i1 has completed writing the result.
- In (a) this is due to a load-use dependency.
- In (b) this is due to a define-use dependency.
53
Write after Read (WAR) Hazards
Hazard may occur when j attempts to modify (write) some data object that is going to be read by i:
- Instruction j tries to write its destination operand before instruction i reads it.
- i would incorrectly receive a new or incorrect value.
Example:
i: ADD R1, R2, R3
j: SUB R2, R4, R6

Instruction i is a read instruction issued before j; instruction j is a write instruction issued after i.

54
Write after Read (WAR)
Hazards
Instn. J writes into its range R(J); Instn. I reads from its domain D(I).

D(I) ∩ R(J) ≠ Ø for WAR
55
Write After Write (WAW) Hazards
WAW hazard:
- Both i & j want to modify the same data object.
- Instruction j tries to write an operand before instruction i writes it.
- Writes are performed in the wrong order. (How can this happen???)
Example:
i: DIV F1, F2, F3
j: SUB F1, F4, F6

Instruction i is a write instruction issued before j; instruction j is a write instruction issued after i.

56
Write After Write (WAW)
Hazards
Instn. I writes into its range R(I); Instn. J also writes into its range R(J).

R(I) ∩ R(J) ≠ Ø for WAW
57
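The three domain/range conditions above can be captured in a few lines; the set-based encoding of each instruction's reads (domain D) and writes (range R) is an illustrative assumption:

```python
def hazards(i, j):
    """i, j: (reads, writes) register sets, with i issued before j.
    Returns which of RAW / WAR / WAW the pair can raise."""
    d_i, r_i = i
    d_j, r_j = j
    found = set()
    if r_i & d_j:
        found.add("RAW")   # R(I) & D(J) nonempty: j reads what i writes
    if d_i & r_j:
        found.add("WAR")   # D(I) & R(J) nonempty: j writes what i reads
    if r_i & r_j:
        found.add("WAW")   # R(I) & R(J) nonempty: both write the same object
    return found

# i: ADD R1, R2, R3   j: SUB R4, R1, R6  ->  RAW on R1
print(hazards(({"R2", "R3"}, {"R1"}), ({"R1", "R6"}, {"R4"})))   # {'RAW'}
# i: DIV F1, F2, F3   j: SUB F1, F4, F6  ->  WAW on F1
print(hazards(({"F2", "F3"}, {"F1"}), ({"F4", "F6"}, {"F1"})))   # {'WAW'}
```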
WAR and WAW Dependency:
More Examples
Example program (a):
i1: mul r1, r2, r3;
i2: add r2, r4, r5;
Example program (b):
i1: mul r1, r2, r3;
i2: add r1, r4, r5;
Both cases have a dependence between i1 and i2:
- in (a), r2 must be read before it is written into (WAR)
- in (b), r1 must be written by i2 after it has been written into by i1 (WAW)
58
Inter-Instruction Dependences
Data dependence (Read-after-Write, RAW):
r3 ← r1 op r2
r5 ← r3 op r4

Anti-dependence (Write-after-Read, WAR), a false dependency:
r3 ← r1 op r2
r1 ← r4 op r5

Output dependence (Write-after-Write, WAW), a false dependency:
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7

Control dependence
59
Data Dependencies : Summary
Data dependencies in straight-line code:
- RAW (Read After Write, flow dependency): true dependency; cannot be overcome. Two forms: load-use dependency and define-use dependency.
- WAR (Write After Read, anti dependency): false dependency; can be eliminated by register renaming.
- WAW (Write After Write, output dependency): false dependency; can be eliminated by register renaming.

60
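Register renaming, which eliminates the false dependencies, can be sketched as a toy pass; the tuple encoding and the physical register names (p1, p2, ...) are illustrative assumptions:

```python
def rename(program):
    """program: list of (dest, src1, src2) architectural register names.
    Every write gets a fresh physical register, so WAR and WAW vanish
    while true RAW dependences are preserved through the mapping."""
    mapping = {}   # architectural register -> latest physical register
    counter = 0
    renamed = []
    for dest, src1, src2 in program:
        # Reads use the latest mapping (or the original name if never written).
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        counter += 1
        phys = f"p{counter}"        # fresh destination for every write
        mapping[dest] = phys
        renamed.append((phys, s1, s2))
    return renamed

# WAW example from the slides: i1 and i2 both write r1.
prog = [("r1", "r2", "r3"),   # i1: mul r1, r2, r3
        ("r1", "r4", "r5")]   # i2: add r1, r4, r5
print(rename(prog))           # [('p1', 'r2', 'r3'), ('p2', 'r4', 'r5')]
```

After renaming, the two writes target different physical registers, so they can complete in any order.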
Solutions to Data Hazards
- Operand forwarding
- By S/W (NOP insertion)
- Reordering the instructions

61
Recollect Data Hazards
What causes them?
Pipelining changes the order of read/write
accesses to operands.
Order differs from that of an unpipelined
machine.
Example:
ADD R1, R2, R3
SUB R4, R1, R5
For MIPS, ADD writes the register in WB but SUB needs it in ID. This is a data hazard.

62
Illustration of a Data Hazard

ADD R1, R2, R3     Mem Reg ALU DM Reg
SUB R4, R1, R5         Mem Reg ALU DM Reg
AND R6, R1, R7             Mem Reg ALU DM
OR R8, R1, R9                  Mem Reg ALU
XOR R10, R1, R11                   Mem Reg

Time → The ADD instruction causes a hazard in the next 3 instructions because the register is not written until after those 3 read it.
63
Forwarding
Simplest solution to data hazard:
forwarding
Result of the ADD instruction not really
needed:
until after ADD actually produces it.
Can we move the result from EX/MEM
register to the beginning of ALU (where
SUB needs it)?
Yes!

64
Forwarding
cont
Generally speaking:
Forwarding occurs when a result is
passed directly to the functional
unit that requires it.
Result goes from the output of one pipeline stage to the input of another.

65
Forwarding Technique

The EXECUTE (ALU) stage and the WRITE RESULT stage are separated by latches; a forwarding path routes the ALU result from the output latch directly back to the ALU input.

66
When Can We Forward?
ADD R1, R2, R3     Mem Reg ALU DM Reg    (SUB gets info. from EX/MEM pipe register)
SUB R4, R1, R5         Mem Reg ALU DM Reg
AND R6, R1, R7             Mem Reg ALU DM    (AND gets info. from MEM/WB pipe register)
OR R8, R1, R9                  Mem Reg ALU   (OR gets info. by forwarding from register file)
XOR R10, R1, R11                   Mem Reg

Time → If the line goes forward you can do forwarding; if it's drawn backward, it's physically impossible.
67
Handling data hazard by S/W
The compiler introduces NOPs between two dependent instructions.
NOP = a piece of code which keeps a gap between two instructions.
Detection of the dependency is left entirely to the S/W.
Advantage: it enables an easy technique called instruction reordering.

68
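The software fix can be sketched as a small pass that pads RAW-dependent pairs with NOPs; the tuple encoding and the assumed 2-instruction gap before a result becomes usable are illustrative, not from the slides:

```python
NOP = ("NOP", None, None, None)
GAP = 2   # assumed: a result is usable only 2 instructions after its writer

def insert_nops(program):
    """program: list of (op, dest, src1, src2). Insert NOPs so that no
    instruction reads a register written within the last GAP slots."""
    out = []
    for ins in program:
        _, _, s1, s2 = ins
        while any(prev[1] is not None and prev[1] in (s1, s2)
                  for prev in out[-GAP:]):
            out.append(NOP)        # keep a gap between writer and reader
        out.append(ins)
    return out

prog = [("ADD", "R1", "R2", "R3"),
        ("SUB", "R4", "R1", "R5")]   # RAW on R1
for ins in insert_nops(prog):
    print(ins)
```

On this input the pass emits ADD, two NOPs, then SUB; reordering (next slide) would instead fill those slots with useful independent instructions.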
Instruction Reordering
ADD R1 , R2 , R3
SUB R4 , R1 , R5 Before
XOR R8 , R6 , R7
AND R9 , R10 , R11

ADD R1 , R2 , R3
XOR R8 , R6 , R7 After
AND R9 , R10 , R11
SUB R4 , R1 , R5
69
Control Hazards
Result from branch and other instructions that
change the flow of a program (i.e. change PC).
Example:
1: If(cond){
2: s1}
3: s2

The statement in line 2 is control dependent on the statement at line 1.
Until condition evaluation completes:
It is not known whether s1 or s2 will execute next.

70
Can You Identify Control
Dependencies?

1: if(cond1){
2: s1;
3: if(cond2){
4: s2;}
5: }

71
Solutions to Branch Hazards
Three simplest methods of dealing
with branches:
Flush Pipeline:
Redo the instructions following a branch, once an instruction is detected to be a branch during the ID stage.
Branch Not Taken:
A slightly higher performance scheme is to
assume every branch to be not taken.
Another scheme is delayed branch.

72
Four Simple Branch Hazard
Solutions
#1: Stall
Stall until the branch direction is clear, flushing the pipe.
#2: Predict Branch Not Taken
Execute successor instructions in sequence
as if there is no branch
undo instructions in pipeline if branch
actually taken
47% branches not taken on average

73
Four Simple Branch Hazard
Solutions cont

#3: Predict Branch Taken


53% branches taken on average.
But branch target address not available
after IF in MIPS
MIPS still incurs 1 cycle branch penalty
even with predict taken
Other machines: if the branch target is known before the branch outcome is computed, significant benefits can accrue.
74
Four Simple Branch Hazard
Solutions cont

#4: Delayed Branch


Insert unrelated successor in the branch
delay slot
branch instruction
sequential successor1
sequential successor2
........ Branch delay of length n
sequential successorn
branch target if taken
1 delay slot required in a 5-stage pipeline

75
Delayed Branch
Simple idea: Put an instruction that would be executed
anyway right after a branch.

Branch                        IF ID EX MEM WB
Delay slot instruction            IF ID EX MEM WB
Branch target OR successor            IF ID EX MEM WB

Question: What instruction do we put in the delay slot?
Answer: one that can safely be executed no matter what the branch does. The compiler decides this.
76
