Parallel Processing
RISC Pipeline
Vector Processor
Array Processor
PARALLEL PROCESSING
- Inter-instruction level
- Intra-instruction level
Basics of Parallel Computing (cont.)
Consider another scenario:
You have to get a color image printed on a
stack of papers.
Would you rather:
- For each sheet, print red, then green, then
blue, and only then take up the next sheet? or
- As soon as you complete printing red on a
sheet, advance it to the next color, and
meanwhile take a new sheet for printing red?
PARALLEL COMPUTERS
Architectural Classifications
Flynn's classification [1966]:
Based on the multiplicity of Instruction Streams and
Data Streams.
Instruction Stream:
Sequence of instructions read from memory.
Data Stream:
Sequence of operations performed on the data in the
processor.
[Figure: single-unit (SISD) organization -- Control Unit --IS--> Processing Unit --DS--> Memory Module]
Characteristics
Limitations
SIMD
[Figure: SIMD organization -- one Control Unit broadcasts the instruction stream (IS) to Processing Units 1..n, each with its own data stream (DS) to a Memory Module]
TYPES OF SIMD COMPUTERS
Array Processors
Systolic Arrays
MIMD
[Figure: MIMD organization -- Processing Units 1..n, each driven by its own Control Unit (separate IS) and each with its own data stream (DS) to a Memory Module]
Classification of MIMD Computers
Shared Memory:
Processors communicate through a shared
memory.
Typically, processors are connected to each
other and to the shared memory through a
bus.
Distributed Memory:
Processors do not share any physical
memory.
Processors are connected to each other
through a network.
Shared Memory
Shared memory is located at a
centralized location:
May consist of several interleaved
modules -- all at the same distance (access
time) from any processor.
Also called the Uniform Memory Access
(UMA) model.
Distributed Memory
Memory is distributed to each processor:
Improves scalability.
Non-Uniform Memory Access (NUMA):
(a) Message passing architectures: no
processor can directly access another
processor's memory.
(b) Distributed Shared Memory (DSM):
memory is distributed, but the address
space is shared.
UMA vs. NUMA Computers
[Figure: UMA -- processors P1..Pn, each with a cache, share one Main Memory over a common bus. NUMA -- each processor P1..Pn has its own local Main Memory, and the processors communicate through a network]
Parallel Computer Structure
Emphasis is on parallel processing.
Basic architectural features of parallel
computers are:
Pipelined computers, which perform overlapped
computation to exploit temporal parallelism.
Array processors, which use multiple synchronized
arithmetic logic units to achieve spatial parallelism.
Multiprocessor systems, which achieve asynchronous
parallelism through a set of interacting processors
with shared resources.
Pipelined Computer
Segments: IF -> ID -> EX -> MEM
Multiple pipeline cycles.
A pipeline cycle can be set equal to the
delay of the slowest stage.
Operations of all stages are synchronized
under a common clock.
Interface latches are used between the
segments to hold the intermediate results.
Scalar and Vector Pipelines
[Figure: instructions are fetched (IF) from memory and decoded (ID); scalar data is fetched into scalar registers and processed by scalar pipelines SP1..SPn, while vector data is fetched into vector registers and processed by vector pipelines VP1..VPn]
Array Processing
[Figure: array processor organization -- a Control Processor with its Control Unit and I/O drives processing elements PE1..PEn; each PE contains a processor P and a local memory M]
Multiprocessor Organization
[Figure: processors P1..Pn, each with a local memory LM1..LMn, connected to shared memory modules 1..n through an interprocessor-memory interconnection network (bus, crossbar, or multiport); also shown are I/O channels with an input-output interconnection network, and an interprocessor interrupt network]
Laundry Example
Ann, Brian, Cathy, and Dave
each have one load of clothes
to wash, dry, and fold.
Washer takes 30 minutes,
dryer takes 40 minutes,
and folding takes 20 minutes.
Traditional Pipeline Concept
[Figure: timeline from 6 PM to midnight -- loads A, B, C, D run one after another, each taking 30 + 40 + 20 minutes]
Sequential laundry takes 6
hours for 4 loads.
Traditional Pipeline Concept (cont.)
[Figure: timeline from 6 PM -- tasks A, B, C, D overlap in pipelined order; after the first 30-minute wash, a new load completes every 40 minutes (the slowest stage), with a final 20-minute fold]
Pipelined laundry takes
3.5 hours for 4 loads.
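The arithmetic behind the two timelines can be checked with a short sketch (stage times taken from the slides; the pipeline rate is set by the slowest stage):

```python
# Laundry stage delays in minutes (from the slides): wash, dry, fold.
stages = [30, 40, 20]
loads = 4

# Sequential: each load finishes all stages before the next begins.
sequential_min = loads * sum(stages)                     # 4 * 90 = 360

# Pipelined: the first load takes the full 90 minutes; after that,
# one load completes every slowest-stage interval (the 40-min dryer).
pipelined_min = sum(stages) + (loads - 1) * max(stages)  # 90 + 3*40 = 210

print(sequential_min / 60, "hours vs", pipelined_min / 60, "hours")
# 6.0 hours vs 3.5 hours
```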
A Pipeline Example
Consider two alternate ways in which
an engineering college can work:
Approach 1: Admit a batch of students,
and admit the next batch only after the
already admitted batch completes (i.e.,
admit once every 4 years).
Approach 2: Admit students every year.
In the second approach:
The average number of students graduating per
year increases four times.
Pipelining
Pipelined Execution
[Figure: instructions progressing through the pipeline stages over time]
Advantages of Pipelining
An n-stage pipeline:
Can improve performance up to n times.
Not much investment in hardware:
No replication of hardware resources is
necessary.
The principle deployed is to keep the
units as busy as possible.
Transparent to the programmer:
Easy to use.
Basic Pipelining Terminology
Pipeline cycle (or processor cycle):
The time required to move an instruction
one step further in the pipeline.
Not to be confused with the clock cycle.
Synchronous pipeline:
Pipeline cycle is constant (clock-driven).
Asynchronous pipeline:
Time for moving from stage to stage varies.
Handshaking communication between stages.
Synchronous Pipeline
- Transfers between stages are simultaneous.
- One task or operation enters the pipeline
per cycle.
[Figure: stages S1, S2, ..., Sk separated by latches L and driven by a common clock; each stage has delay τ_m and each latch has delay d]
Asynchronous Pipeline
- Transfers are performed when individual stages are
ready.
- Handshaking protocol between stages.
A Few Pipeline Concepts
[Figure: adjacent stages Si and Si+1 with stage delay τ_m and latch delay d]
Pipeline cycle: τ
Latch delay: d
τ = max {τ_m} + d
Pipeline frequency: f = 1 / τ
Example on Clock Period
Suppose the time delays of the 4 stages are
τ1 = 60 ns, τ2 = 50 ns, τ3 = 90 ns, τ4 = 80 ns, and the
interface latch has a delay of d = 10 ns. What is
the value of the pipeline cycle time?
Using τ = max {τ_m} + d,
the cycle time of this pipeline is
τ = 90 + 10 = 100 ns.
Clock frequency of the pipeline: f = 1/τ = 1/(100 ns) = 10 MHz.
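The same calculation as a quick sketch (stage and latch delays from the slide):

```python
# Stage delays τ1..τ4 and latch delay d, in ns (from the slide).
stage_delays_ns = [60, 50, 90, 80]
latch_delay_ns = 10

# Pipeline cycle: τ = max{τ_m} + d
cycle_ns = max(stage_delays_ns) + latch_delay_ns  # 90 + 10 = 100 ns

# Pipeline frequency: f = 1/τ, converted from 1/ns to MHz.
freq_mhz = 1000 / cycle_ns                        # 10.0 MHz

print(cycle_ns, freq_mhz)
```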
PIPELINE SPEEDUP
n: number of tasks to be performed
k: number of pipeline stages
tp: pipeline cycle time
tn: time to complete one task without pipelining
Sk: speedup
Sk = n * tn / ((k + n - 1) * tp)
lim (n -> ∞) Sk = tn / tp  ( = k, if tn = k * tp )
PIPELINE AND MULTIPLE FUNCTION UNITS
Example:
- 4-stage pipeline
- suboperation in each stage: tp = 20 ns
- 100 tasks to be executed
- 1 task in a non-pipelined system: tn = 4 * 20 = 80 ns
Pipelined system:
(k + n - 1) * tp = (4 + 100 - 1) * 20 = 2060 ns
Non-pipelined system:
n * k * tp = 100 * 4 * 20 = 8000 ns
Speedup:
Sk = 8000 / 2060 = 3.88
A 4-stage pipeline is basically equivalent to a system
with 4 identical function units.
[Figure: four identical function units processing instructions Ii, Ii+1, Ii+2, Ii+3 in parallel]
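The speedup formula and this worked example can be put together in a small sketch:

```python
def pipeline_speedup(n, k, tp, tn):
    """Sk = n*tn / ((k + n - 1) * tp), from the speedup slide."""
    return (n * tn) / ((k + n - 1) * tp)

# Slide values: 100 tasks, 4 stages, tp = 20 ns, tn = 4*20 = 80 ns.
print(round(pipeline_speedup(n=100, k=4, tp=20, tn=80), 2))    # 3.88

# For large n, Sk approaches tn/tp, which equals k when tn = k*tp.
print(round(pipeline_speedup(n=10**6, k=4, tp=20, tn=80), 3))  # 4.0
```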
Exercise
Consider an unpipelined processor:
Takes 4 cycles for ALU and other operations,
5 cycles for memory operations.
Assume the relative frequencies:
ALU and other = 60%,
memory operations = 40%.
Cycle time = 1 ns.
Compute the speedup due to pipelining:
Ignore the effects of branching.
Assume pipeline overhead = 0.2 ns.
Solution
Average instruction execution time for a
large number of instructions:
Unpipelined = 1 ns * (60% * 4 + 40% * 5) = 4.4 ns
Pipelined = 1 + 0.2 = 1.2 ns
Speedup = 4.4 / 1.2 = 3.7 times
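The solution can be verified with a short sketch (all numbers from the exercise):

```python
# Unpipelined: 4 cycles for ALU/other ops (60%), 5 for memory ops (40%),
# with a 1 ns cycle time; pipelining adds 0.2 ns of overhead per cycle.
alu_cycles, mem_cycles = 4, 5
alu_frac, mem_frac = 0.60, 0.40
cycle_ns, overhead_ns = 1.0, 0.2

unpipelined_ns = cycle_ns * (alu_frac * alu_cycles + mem_frac * mem_cycles)
pipelined_ns = cycle_ns + overhead_ns  # one instruction per stretched cycle

print(unpipelined_ns, pipelined_ns, round(unpipelined_ns / pipelined_ns, 1))
# 4.4 1.2 3.7
```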
Drags on Pipeline Performance
Factors that can degrade pipeline
performance:
Unbalanced stages
Pipeline overheads
Clock skew
Hazards
Pipeline Hazards
Hazards can result in incorrect
operation:
Structural hazards: two instructions
requiring the same hardware unit at the same
time.
Data hazards: an instruction depends on the result
of a prior instruction that is still in the pipeline.
Data dependency
Control hazards: caused by delays in decisions
about changes in control flow (branches and
jumps).
Control dependency
Structural Hazards
Arise from resource conflicts among
instructions executing concurrently:
The same resource is required by two (or more)
concurrently executing instructions at the same
time.
Easy ways to avoid structural hazards:
Duplicate resources (sometimes not practical)
Memory interleaving (lower & higher order)
Examples of resolution of structural hazards:
An ALU to perform an arithmetic operation and a
separate adder to increment the PC.
Separate data cache and instruction cache.
[Figure: four instructions in flight, each one cycle apart:
Cycle:  1    2    3    4    5    6    7    8
I1:     IF   ID   EXE  MEM  WB
I2:          IF   ID   EXE  MEM  WB
I3:               IF   ID   EXE  MEM  WB
I4:                    IF   ID   EXE  MEM  WB]
An Example of a Structural Hazard
[Figure: a Load followed by Instructions 1-4, each one cycle apart; every instruction passes through instruction memory (Mem), register read (Reg), ALU, data memory (DM), and register write (Reg). With a single memory, the Load's data-memory access overlaps a later instruction's instruction fetch.]
Would there be a hazard here?
How is it Resolved?
[Figure: the same sequence, but the conflicting instruction is stalled for one cycle -- a bubble propagates through the pipeline before its fetch proceeds]
A pipeline can be stalled by
inserting a bubble or NOP.
Data Hazards
Occur when an instruction under execution
depends on:
Data from an instruction ahead of it in the pipeline.
Example: A=B+C; D=A+E;
Types of Data Hazards
Data hazards are of three types:
Read After Write (RAW)
Write After Read (WAR)
Write After Write (WAW)
With an in-order execution machine:
WAW and WAR hazards cannot occur.
Assume instruction i is issued before
instruction j.
Read After Write (RAW) Hazards
A hazard between two instructions i and j
may occur when j attempts to read some
data object that has been modified by i:
Instruction j tries to read its operand before
instruction i writes it.
j would incorrectly receive an old or incorrect
value.
Example:
i: ADD R1, R2, R3
j: SUB R4, R1, R6
Here i is a write instruction issued before j;
j is a read instruction issued after i.
Read After Write (RAW) Hazards (cont.)
[Figure: instruction i writes its range R(i) while the later instruction j reads its domain D(j); a RAW hazard exists when R(i) overlaps D(j)]
Write After Read (WAR) Hazards
[Figure: the later instruction j writes its range R(j) while instruction i reads its domain D(i); a WAR hazard exists when R(j) overlaps D(i)]
Write After Write (WAW) Hazards
[Figure: instructions i and j both write; a WAW hazard exists when the range R(i) overlaps the range R(j), with j issued after i]
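The three hazard types can be expressed directly with the domain/range sets from the figures: D(x) for the registers an instruction reads, R(x) for those it writes. A minimal sketch (the helper function is illustrative, not from the slides):

```python
def classify_hazards(i_reads, i_writes, j_reads, j_writes):
    """Hazards between instruction i and a later instruction j,
    using D(x) = registers read and R(x) = registers written."""
    hazards = set()
    if i_writes & j_reads:    # RAW: R(i) overlaps D(j)
        hazards.add("RAW")
    if i_reads & j_writes:    # WAR: D(i) overlaps R(j)
        hazards.add("WAR")
    if i_writes & j_writes:   # WAW: R(i) overlaps R(j)
        hazards.add("WAW")
    return hazards

# i: ADD R1, R2, R3   j: SUB R4, R1, R6   (example from the RAW slide)
print(classify_hazards({"R2", "R3"}, {"R1"}, {"R1", "R6"}, {"R4"}))
# {'RAW'}
```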
Data Dependencies: Summary
Data dependencies in straight-line code:
- Load-Use dependency
- Define-Use dependency
Recollect Data Hazards
What causes them?
Pipelining changes the order of read/write
accesses to operands.
The order differs from that of an unpipelined
machine.
Example:
ADD R1, R2, R3
SUB R4, R1, R5
For MIPS, ADD writes the register in WB,
but SUB needs it in ID.
Illustration of a Data Hazard
[Figure: ADD R1, R2, R3 followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; and XOR R10, R1, R11, each one cycle behind the previous one]
The ADD instruction causes a hazard in the next 3 instructions
because the register is not written until after those 3 read it.
Forwarding
Simplest solution to a data hazard:
forwarding.
The result of the ADD instruction is not really
needed:
until after ADD actually produces it.
Can we move the result from the EX/MEM
register to the beginning of the ALU (where
SUB needs it)?
Yes!
Forwarding (cont.)
Generally speaking:
Forwarding occurs when a result is
passed directly to the functional
unit that requires it.
The result goes from the output of one
pipeline stage to the input of another.
Forwarding Technique
[Figure: the ALU result latched after the EXECUTE stage is fed back along a forwarding path to the ALU input, bypassing the WRITE stage]
When Can We Forward?
[Figure: ADD R1, R2, R3 followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; and XOR R10, R1, R11:
- SUB gets R1 from the EX/MEM pipe register.
- AND gets R1 from the MEM/WB pipe register.
- OR gets R1 by forwarding within the register file.]
Instruction Reordering
Before:
ADD R1, R2, R3
SUB R4, R1, R5
XOR R8, R6, R7
AND R9, R10, R11
After:
ADD R1, R2, R3
XOR R8, R6, R7
AND R9, R10, R11
SUB R4, R1, R5
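Why the reordering helps can be sketched with a simplified stall model (the `needs_stall` helper and its 2-slot dependence window are assumptions for illustration, not from the slides):

```python
def reads(instr):          # instr = (op, dest, src1, src2)
    return {instr[2], instr[3]}

def dest(instr):
    return instr[1]

def needs_stall(seq, window=2):
    """Simplified model: a stall is needed if an instruction reads a
    register written fewer than `window` slots earlier."""
    for idx, instr in enumerate(seq):
        for back in range(1, window + 1):
            if idx - back >= 0 and dest(seq[idx - back]) in reads(instr):
                return True
    return False

before = [("ADD", "R1", "R2", "R3"),
          ("SUB", "R4", "R1", "R5"),
          ("XOR", "R8", "R6", "R7"),
          ("AND", "R9", "R10", "R11")]
after = [before[0], before[2], before[3], before[1]]  # SUB moved last

print(needs_stall(before), needs_stall(after))   # True False
```

Moving the two independent instructions between ADD and SUB pushes the dependent SUB outside the stall window, so no cycles are lost.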
Control Hazards
Result from branch and other instructions that
change the flow of a program (i.e., change the PC).
Example:
1: if (cond) {
2:    s1; }
3: s2;
Can You Identify Control Dependencies?
1: if (cond1) {
2:    s1;
3:    if (cond2) {
4:       s2; }
5: }
Solutions to Branch Hazards
Three simplest methods of dealing
with branches:
Flush the pipeline:
Redo the instructions following a branch
once an instruction is detected to be a
branch during the ID stage.
Branch not taken:
A slightly higher-performance scheme is to
assume every branch to be not taken.
Another scheme is the delayed branch.
Four Simple Branch Hazard Solutions
#1: Stall
Until the branch direction is clear -- flushing the pipe.
#2: Predict Branch Not Taken
Execute successor instructions in sequence
as if there were no branch;
undo instructions in the pipeline if the branch is
actually taken.
47% of branches are not taken on average.
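A rough illustration of the cost of the predict-not-taken scheme. Only the 47%-not-taken figure comes from the slides; the branch frequency and misprediction penalty are assumed values for the sketch:

```python
branch_freq = 0.20     # assumed: fraction of instructions that branch
taken_frac = 1 - 0.47  # slides: 47% of branches are not taken on average
penalty = 1            # assumed: cycles lost when a branch is taken

# Under predict-not-taken, only taken branches pay the penalty.
stalls_per_instr = branch_freq * taken_frac * penalty
cpi = 1 + stalls_per_instr
print(round(cpi, 3))   # 1.106
```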
Four Simple Branch Hazard Solutions (cont.)
Delayed Branch
Simple idea: put an instruction that would be executed
anyway right after a branch.
Branch:                     IF  ID  EX  MEM  WB
Branch target OR successor:     IF  ID  EX  MEM  WB