
Parallel distributed computing techniques
Instructor:
Pham Tran Vu

Students:
Le Trong Tan
Mai Van Ninh
Phung Quang Chinh
Nguyen Duc Canh
Dang Trung Tan

Contents
Motivation of Parallel Computing Techniques
Parallel Computing Techniques
Message-passing computing
Pipelined Computations
Embarrassingly Parallel Computations
Partitioning and Divide-and-Conquer Strategies
Synchronous Computations
Load Balancing and Termination Detection

Contents
Motivation of Parallel Computing Techniques
Parallel Computing Techniques
Message-passing computing
Pipelined Computations
Embarrassingly Parallel Computations
Partitioning and Divide-and-Conquer Strategies
Synchronous Computations
Load Balancing and Termination Detection

Motivation of Parallel Computing Techniques

Demand for Computational Speed


Continual demand for greater computational speed from a computer system than is currently possible.
Areas requiring great computational speed include numerical modeling and simulation of scientific and engineering problems.
Computations must be completed within a reasonable time period.

Contents
Motivation of Parallel Computing Techniques
Parallel Computing Techniques
Message-passing computing
Pipelined Computations
Embarrassingly Parallel Computations
Partitioning and Divide-and-Conquer Strategies
Synchronous Computations
Load Balancing and Termination Detection

Message-Passing Computing
Basics of message-passing programming using user-level message-passing libraries.
Two primary mechanisms needed:
A method of creating separate processes for execution on different computers
A method of sending and receiving messages


Message-Passing Computing
Static process creation:
Basic MPI way: the same source file is compiled to suit each processor, producing one executable per processor (Processor 0 through Processor n-1).

[Figure: source file compiled into executables for Processor 0 ... Processor n-1.]
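As a hedged illustration of this SPMD style (MPI is assumed; the file name and launch command are only examples), the same program can branch on its rank:

/* spmd.c - minimal SPMD sketch (illustrative; assumes a standard MPI installation) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);                  /* one copy of this executable starts per processor */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* how many processes were started? */
    if (rank == 0)
        printf("master of %d processes\n", nprocs);
    else
        printf("slave %d\n", rank);
    MPI_Finalize();
    return 0;
}

Compiled once and launched with, for example, mpirun -np 4 ./spmd.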

Message-Passing Computing
Dynamic process creation:
PVM way: a process running on Processor 1 calls spawn() at run time to start a new process (process 2) on Processor 2.

[Figure: time line showing Processor 1 calling spawn() and the start of execution of process 2 on Processor 2.]

Message-Passing Computing

Method of sending and receiving messages?
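As a hedged illustration (MPI is assumed here; the slides do not show the exact calls at this point), a point-to-point exchange could look like:

/* Point-to-point message passing sketch (illustrative MPI) */
#include <mpi.h>

void exchange(int rank)
{
    int x = 42;
    if (rank == 0)
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* send one int to process 1 */
    else if (rank == 1)
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                        /* receive it from process 0 */
}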


Contents
Motivation of Parallel Computing Techniques
Parallel Computing Techniques
Message-passing computing
Pipelined Computations
Embarrassingly Parallel Computations
Partitioning and Divide-and-Conquer Strategies
Synchronous Computations
Load Balancing and Termination Detection

Pipelined Computation
Problem divided into a series of
tasks that have to be completed
one after the other (the basis of
sequential programming).
Each task executed by a separate
process or processor.


Pipelined Computation
Where pipelining can be used to good effect
1. If more than one instance of the complete problem is to be executed
2. If a series of data items must be processed, each requiring multiple operations
3. If information to start the next process can be passed forward before the process has completed all its internal operations

Pipelined Computation

Execution time = m + p - 1 cycles for a p-stage pipeline processing m instances of the problem.
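A minimal sketch of one pipeline stage (MPI assumed; an adding pipeline is used only to illustrate the pattern): each process receives the partial result from the previous stage, does its own work, and forwards the result.

/* One stage of a simple adding pipeline (illustrative MPI sketch) */
#include <mpi.h>

void pipeline_stage(int rank, int nprocs, int my_value)
{
    int accum = 0;
    if (rank > 0)                                    /* every stage but the first waits for input */
        MPI_Recv(&accum, 1, MPI_INT, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    accum += my_value;                               /* this stage's own work */
    if (rank < nprocs - 1)                           /* forward the result to the next stage */
        MPI_Send(&accum, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
}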

Pipelined Computation

[Figures: further pipelined computation examples; images not preserved in this text.]

Contents
Motivation of Parallel Computing Techniques
Parallel Computing Techniques
Message-passing computing
Pipelined Computations
Embarrassingly Parallel Computations
Partitioning and Divide-and-Conquer Strategies
Synchronous Computations
Load Balancing and Termination Detection

Ideal Parallel Computation


A computation that can obviously be divided into a number of completely independent parts, each of which can be executed by a separate processor.
Each process can do its task without any interaction with the other processes.

Ideal Parallel Computation


Practical embarrassingly parallel computation with static process creation and a master-slave approach.


Ideal Parallel Computation


Practical embarrassingly parallel computation with dynamic process creation and a master-slave approach.


Embarrassingly parallel examples


Geometrical Transformations of Images
Mandelbrot set
Monte Carlo Method


Geometrical Transformations of Images


Performed on the coordinates of each pixel to move the position of the pixel without affecting its value.
The transformation of each pixel is totally independent of the other pixels.
Some geometrical operations:
Shifting
Scaling
Rotation
Clipping
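A hedged sketch of one such operation, a shift by (delta_x, delta_y), assuming each process is given a strip of rows of the image (the function name and interface are illustrative):

/* Shift the pixels of one process's strip of rows (illustrative sketch).
   Each pixel (x, y) maps to (x + delta_x, y + delta_y); its value is unchanged,
   and every pixel is handled independently of the others. */
void shift_strip(int first_row, int last_row, int width,
                 int delta_x, int delta_y, int new_x[], int new_y[])
{
    for (int y = first_row; y <= last_row; y++)
        for (int x = 0; x < width; x++) {
            int k = (y - first_row) * width + x;   /* index within this strip */
            new_x[k] = x + delta_x;
            new_y[k] = y + delta_y;
        }
}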


Geometrical Transformations of Images


Partitioning into regions for individual processes
[Figure: a 640 x 480 image partitioned either into 80 x 80 square regions (one per process) or into 640 x 10 row regions (one per process).]


Mandelbrot Set
Set of points in a complex plane that are quasi-stable when computed by iterating the function

    z(k+1) = z(k)^2 + c

where z(k+1) is the (k + 1)th iteration of the complex number z = a + bi, and c is a complex number giving the position of the point in the complex plane. The initial value for z is zero.
Iterations are continued until the magnitude of z is greater than 2 or the number of iterations reaches an arbitrary limit.
The magnitude of z is the length of the vector, given by

    z_length = sqrt(a^2 + b^2)
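A minimal sketch of the per-point iteration implied by this definition (an illustrative helper, not the book's exact cal_pixel routine):

/* Number of iterations of z(k+1) = z(k)^2 + c, starting from z = 0, before
   |z| exceeds 2 or the iteration limit is reached (illustrative sketch). */
int mandelbrot_iterations(double c_real, double c_imag, int max_iter)
{
    double z_real = 0.0, z_imag = 0.0;
    int iter = 0;
    while (iter < max_iter && z_real * z_real + z_imag * z_imag <= 4.0) {
        double temp = z_real * z_real - z_imag * z_imag + c_real;  /* real part of z^2 + c */
        z_imag = 2.0 * z_real * z_imag + c_imag;                   /* imaginary part of z^2 + c */
        z_real = temp;
        iter++;
    }
    return iter;   /* comparing |z|^2 with 4 avoids taking a square root */
}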

Mandelbrot Set

[Figures: images of the Mandelbrot set; not preserved in this text.]

Mandelbrot Set
c.real = real_min + x * (real_max - real_min)/disp_width
c.imag = imag_min + y * (imag_max - imag_min)/disp_height

Static Task Assignment
Simply divide the region into a fixed number of parts, each computed by a separate processor.
Not very successful because different regions require different numbers of iterations and time.
Dynamic Task Assignment
Have processors request new regions after computing previous regions.

Mandelbrot Set
Dynamic Task Assignment: have processors request new regions after computing previous regions.

[Figure: dynamic task assignment; image not preserved in this text.]

Monte Carlo Method


Another embarrassingly parallel computation.
Monte Carlo methods make use of random selections.
Example: to calculate pi, a circle with unit radius is formed within a square of side 2 x 2. The ratio of the area of the circle to the area of the square is given by

    (pi * 1^2) / (2 * 2) = pi / 4


Monte Carlo Method


One quadrant of the construction can be described by the integral

    integral from 0 to 1 of sqrt(1 - x^2) dx = pi / 4

Random pairs of numbers (xr, yr) are generated, each between 0 and 1. A pair is counted as inside the circle if sqrt(xr^2 + yr^2) <= 1; that is, if xr^2 + yr^2 <= 1.
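A minimal sequential sketch of this estimate (illustrative; the quality and seeding of rand() are ignored):

/* Estimate pi by sampling random points in the unit quadrant (illustrative sketch) */
#include <stdlib.h>

double estimate_pi(int samples)
{
    int in_circle = 0;
    for (int i = 0; i < samples; i++) {
        double xr = (double)rand() / RAND_MAX;   /* random x between 0 and 1 */
        double yr = (double)rand() / RAND_MAX;   /* random y between 0 and 1 */
        if (xr * xr + yr * yr <= 1.0)            /* inside the quarter circle? */
            in_circle++;
    }
    return 4.0 * in_circle / samples;            /* the ratio of areas is pi/4 */
}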


Monte Carlo Method


Alternative method to compute an integral:
Use random values of x to compute f(x) and sum the values of f(x):

    area ≈ (1/N) * [f(xr1) + f(xr2) + ... + f(xrN)] * (x2 - x1)

where xr1, ..., xrN are randomly generated values of x between x1 and x2.
The Monte Carlo method is very useful if the function cannot be integrated numerically (for example, because it has a large number of variables).

Monte Carlo Method


Example: computing an integral over [x1, x2].

Sequential code

Routine randv(x1, x2) returns a pseudorandom number between x1 and x2.
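A minimal sketch in this spirit, assuming the integrand f(x) = x^2 - 3x used in the Wilkinson and Allen text (a simple version of randv is also assumed):

/* Sequential Monte Carlo integration sketch (assumed integrand f(x) = x*x - 3*x). */
#include <stdlib.h>

double randv(double x1, double x2)
{
    return x1 + (x2 - x1) * rand() / RAND_MAX;   /* pseudorandom value between x1 and x2 */
}

double monte_carlo_integral(double x1, double x2, int N)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        double xr = randv(x1, x2);               /* random x between x1 and x2 */
        sum += xr * xr - 3 * xr;                 /* accumulate f(xr) */
    }
    return (sum / N) * (x2 - x1);                /* estimate of the integral */
}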


Monte Carlo Method


Parallel Monte Carlo integration

[Figure: a master collects partial sums from slave processes; the slaves obtain random numbers, on request, from a separate random-number process.]


Contents
Motivation of Parallel Computing Techniques
Parallel Computing Techniques
Message-passing computing
Pipelined Computations
Embarrassingly Parallel Computations
Partitioning and Divide-and-Conquer Strategies
Synchronous Computations
Load Balancing and Termination Detection

Partitioning simply divides the problem into parts.

It is the basis of all parallel programming.

Partitioning can be applied to the program data (data partitioning, or domain decomposition) and to the functions of a program (functional decomposition).

It is much less common to find concurrent functions in a problem, so data partitioning is the main strategy for parallel programming.


A sequence of numbers, x0, ..., xn-1, is to be added (n: number of items, p: number of processors).
The sequence is divided into p parts of n/p numbers each: x0 ... x(n/p)-1, xn/p ... x2(n/p)-1, ..., x(p-1)n/p ... xn-1.
Each part is added to give a partial sum, and the partial sums are then added to give the final sum.

[Figure: partitioning a sequence of numbers into parts and adding them.]



Characterized by dividing the problem into subproblems of the same form as the larger problem. Further divisions into still smaller sub-problems are usually done by recursion.
Recursive divide and conquer is amenable to parallelization because separate processes can be used for the divided parts. Also, the data is usually naturally localized.

A sequential recursive definition for adding a list of numbers is:

int add(int *s, int n)                        /* add the n numbers in list s */
{
    if (n <= 2)                               /* base case: at most two numbers left */
        return (n == 1) ? s[0] : s[0] + s[1];
    else {
        int half = n / 2;                     /* divide s into two parts, s1 and s2 */
        int part_sum1 = add(s, half);         /* recursive calls to add the sublists */
        int part_sum2 = add(s + half, n - half);
        return (part_sum1 + part_sum2);
    }
}


[Figure: tree construction - the initial problem is divided repeatedly into smaller problems, down to the final tasks at the leaves.]

[Figure: dividing an original list x0 ... xn-1 among eight processes P0 to P7. P0 starts with the whole list and hands half to P4; quarters then go to P2 and P6, and eighths to P1, P3, P5, and P7, so each process ends up with one part as its final task.]

Many possibilities:
Operations on sequences of numbers, such as simply adding them together
Several sorting algorithms can often be partitioned or constructed in a recursive fashion
Numerical integration
N-body problem


Bucket Sort
One bucket is assigned to hold the numbers that fall within each region. The numbers in each bucket are then sorted using a sequential sorting algorithm.

n: number of items
m: number of buckets

Sequential sorting time complexity: O(n log(n/m)).

Works well if the original numbers are uniformly distributed across a known interval, say 0 to a - 1.


Simple approach
Assign one processor for each bucket.


Partition the sequence into m regions, one region for each processor.
Each processor maintains p small buckets and separates the numbers in its region into its own small buckets.
The small buckets are then emptied into the p final buckets for sorting, which requires each processor to send one small bucket to each of the other processors (bucket i to processor i).
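A hedged sketch of the local separation step (each processor scattering the numbers of its region into p small buckets, assuming values lie between 0 and a - 1; the names are illustrative):

/* Separate one processor's region of numbers into p small buckets (illustrative).
   Small bucket i will later be sent to processor i. */
void fill_small_buckets(const int *region, int region_len, int a, int p,
                        int **small_bucket, int *bucket_count)
{
    for (int i = 0; i < p; i++)
        bucket_count[i] = 0;                          /* start with empty buckets */
    for (int j = 0; j < region_len; j++) {
        int b = (int)((long long)region[j] * p / a);  /* which of the p ranges? */
        small_bucket[b][bucket_count[b]++] = region[j];
    }
}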

Introduces new message-passing operation - all-to-all broadcast.


Sends data from each process to every other process.


The all-to-all routine actually transfers the rows of an array to columns: it transposes a matrix.
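In MPI this exchange could use MPI_Alltoall, sketched here under the simplifying assumption that every small bucket holds the same number of items (unequal buckets would need MPI_Alltoallv):

/* Each process sends its i-th small bucket to process i and receives one small
   bucket from every other process (illustrative; fixed bucket size assumed). */
#include <mpi.h>

void exchange_small_buckets(int *send_buckets, int *recv_buckets, int bucket_size)
{
    MPI_Alltoall(send_buckets, bucket_size, MPI_INT,
                 recv_buckets, bucket_size, MPI_INT, MPI_COMM_WORLD);
}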


Contents
Motivation of Parallel Computing Techniques
Parallel Computing Techniques
Message-passing computing
Pipelined Computations
Embarrassingly Parallel Computations
Partitioning and Divide-and-Conquer Strategies
Synchronous Computations
Load Balancing and Termination Detection

Synchronous Computations

Synchronous
Barrier
Barrier Implementation
Centralized Counter implementation
Tree Barrier Implementation
Butterfly Barrier

Synchronized Computations
Fully synchronous
Data Parallel Computations
Synchronous Iteration (Synchronous Parallelism)

Locally synchronous
Heat Distribution Problem
Sequential Code
Parallel Code

Barrier

A basic mechanism for synchronizing processes - inserted at the point in each process where it must wait.
All processes can continue from this point when all the processes have reached it.
Processes reach the barrier at different times.


Barrier

[Figure: processes reaching the barrier at different times; each waits until all have arrived.]

Barrier Implementation
Centralized counter implementation (linear barrier)
Tree barrier implementation
Butterfly barrier
Local synchronization
Deadlock


Centralized Counter Implementation
Has two phases:
Arrival phase (trapping)
Departure phase (release)
A process enters the arrival phase and does not leave this phase until all processes have arrived in this phase.
Then the processes move to the departure phase and are released.


Example code
Master:
    for (i = 0; i < n; i++)    /* count slaves as they reach the barrier */
        recv(Pany);
    for (i = 0; i < n; i++)    /* release slaves */
        send(Pi);
Slave processes:
    send(Pmaster);
    recv(Pmaster);
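For comparison, message-passing libraries also provide a barrier directly; in MPI the same effect is a single call (sketch):

/* MPI's built-in barrier: every process in the communicator blocks here
   until all of them have made this call. */
#include <mpi.h>

void synchronize_all(void)
{
    MPI_Barrier(MPI_COMM_WORLD);
}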


Tree Barrier Implementation

Suppose 8 processes, P0, P1, P2, P3, P4, P5, P6, P7:
First stage:
P1 sends message to P0 (when P1 reaches its barrier)
P3 sends message to P2 (when P3 reaches its barrier)
P5 sends message to P4 (when P5 reaches its barrier)
P7 sends message to P6 (when P7 reaches its barrier)
Second stage:
P2 sends message to P0 (P2 and P3 have reached their barriers)
P6 sends message to P4 (P6 and P7 have reached their barriers)
Third stage:
P4 sends message to P0 (P4, P5, P6, and P7 have reached their barriers)
P0 terminates the arrival phase (when P0 reaches its barrier and has received the message from P4)

Tree Barrier Implementation


Release with a reverse tree construction.

[Figure: tree barrier.]

Butterfly Barrier

This would be used if data were exchanged between the processes.

Local Synchronization
Suppose a process Pi needs to be synchronized and to exchange data with process Pi-1 and process Pi+1.
This is not a perfect three-process barrier because process Pi-1 will only synchronize with Pi and continue as soon as Pi allows. Similarly, process Pi+1 only synchronizes with Pi.

Synchronized Computations
Fully synchronous
In fully synchronous, all processes involved in the computation
must be synchronized.

Data Parallel Computations


Synchronous Iteration (Synchronous Parallelism)

Locally synchronous
In locally synchronous, processes only need to synchronize
with a set of logically nearby processes, not all processes
involved in the computation

Heat Distribution Problem


Sequential Code
Parallel Code

Data Parallel Computations


The same operation is performed on different data elements simultaneously (SIMD).
Data parallel programming is very convenient for two reasons:
The first is its ease of programming (essentially only one program).
The second is that it can scale easily to larger problem sizes.


Synchronous Iteration
Each iteration is composed of several processes that start together at the beginning of the iteration. The next iteration cannot begin until all processes have finished the previous iteration. Using forall:

for (j = 0; j < n; j++)           /* for each synchronous iteration */
    forall (i = 0; i < N; i++) {  /* N processes, each using */
        body(i);                  /* a specific value of i */
    }


Synchronous Iteration
Solving a General System of Linear Equations by Iteration
Suppose the equations are of a general form with n equations and n unknowns, where the unknowns are x0, x1, x2, ..., xn-1 (0 <= i < n):

    an-1,0 x0 + an-1,1 x1 + an-1,2 x2 + ... + an-1,n-1 xn-1 = bn-1
    ...
    a2,0 x0 + a2,1 x1 + a2,2 x2 + ... + a2,n-1 xn-1 = b2
    a1,0 x0 + a1,1 x1 + a1,2 x2 + ... + a1,n-1 xn-1 = b1
    a0,0 x0 + a0,1 x1 + a0,2 x2 + ... + a0,n-1 xn-1 = b0

Synchronous Iteration
By rearranging the ith equation:

    ai,0 x0 + ai,1 x1 + ai,2 x2 + ... + ai,n-1 xn-1 = bi

to

    xi = (1/ai,i)[bi - (ai,0 x0 + ai,1 x1 + ai,2 x2 + ... + ai,i-1 xi-1 + ai,i+1 xi+1 + ... + ai,n-1 xn-1)]

or, equivalently,

    xi = (1/ai,i)[bi - sum over j != i of ai,j xj]
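A hedged sketch of one Jacobi-style iteration sweep based on this rearrangement (sequential; the convergence test is omitted and the interface is illustrative):

/* One Jacobi iteration sweep: compute each new x[i] from the previous values using
   x_i = (1/a_ii) * (b_i - sum over j != i of a_ij * x_j).  Illustrative sketch;
   a is the n x n coefficient matrix stored row by row in a flat array. */
void jacobi_sweep(int n, const double *a, const double *b,
                  const double *x_old, double *x_new)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            if (j != i)
                sum += a[i * n + j] * x_old[j];
        x_new[i] = (b[i] - sum) / a[i * n + i];
    }
}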


Heat Distribution Problem

An area has known temperatures along each of its edges. Find the temperature distribution within.
Divide the area into a fine mesh of points, hi,j. The temperature at an inside point is taken to be the average of the temperatures of the four neighbouring points.
The temperature of each point is found by iterating the equation

    hi,j = (hi-1,j + hi+1,j + hi,j-1 + hi,j+1) / 4        (0 < i < n, 0 < j < n)



Heat Distribution Problem

[Figure: the mesh of points hi,j, with fixed temperatures along the edges; each interior point is the average of its four neighbours.]

Sequential Code
Using a fixed number of iterations:

for (iteration = 0; iteration < limit; iteration++) {
    for (i = 1; i < n; i++)
        for (j = 1; j < n; j++)
            g[i][j] = 0.25 * (h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1]);
    for (i = 1; i < n; i++)       /* update points */
        for (j = 1; j < n; j++)
            h[i][j] = g[i][j];
}


Parallel Code
With a fixed number of iterations, for process Pi,j (except for the boundary points):

for (iteration = 0; iteration < limit; iteration++) {
    g = 0.25 * (w + x + y + z);
    send(&g, Pi-1,j);    /* non-blocking sends */
    send(&g, Pi+1,j);
    send(&g, Pi,j-1);
    send(&g, Pi,j+1);
    recv(&w, Pi-1,j);    /* synchronous receives */
    recv(&x, Pi+1,j);
    recv(&y, Pi,j-1);
    recv(&z, Pi,j+1);    /* the sends and receives form a local barrier with the four neighbours */
}

Contents
Motivation of Parallel Computing Techniques
Parallel Computing Techniques
Message-passing computing
Pipelined Computations
Embarrassingly Parallel Computations
Partitioning and Divide-and-Conquer Strategies
Synchronous Computations
Load Balancing and Termination Detection

Load Balancing & Termination Detection

Load Balancing & Termination Detection

Content

Load Balancing: used to distribute computations fairly across processors in order to obtain the highest possible execution speed.

Termination Detection: detecting when a computation has been completed; more difficult when the computation is distributed.


Load Balancing


Load Balancing & Termination Detection

Load Balancing

Static Load Balancing


Load balancing can be attempted statically, before the execution of any process.

Dynamic Load Balancing

Load balancing can be attempted dynamically, during the execution of the processes.


Static Load Balancing

Round robin algorithm - passes out tasks in sequential order of processes, coming back to the first when all processes have been given a task.
Randomized algorithm - selects processes at random to take tasks.
Recursive bisection - recursively divides the problem into subproblems of equal computational effort while minimizing message passing.
Simulated annealing - an optimization technique.
Genetic algorithm - another optimization technique.

Static Load Balancing

Several fundamental flaws exist with static load balancing, even if a mathematical solution exists:
It is very difficult to estimate accurately the execution times of the various parts of a program without actually executing the parts.
Communication delays vary under different circumstances.
Some problems have an indeterminate number of steps to reach their solution.


Dynamic Load Balancing


Centralized dynamic load balancing

Tasks are handed out from a centralized location, using a master-slave structure.
The master process(or) holds the collection of tasks to be performed.
Tasks are sent to the slave processes. When a slave process completes one task, it requests another task from the master process.
(Terms used: work pool, replicated worker, processor farm.)
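A hedged sketch of the work-pool exchange (MPI assumed; the tags, the task encoding, and the placeholder computation are all illustrative):

/* Centralized work pool sketch (illustrative MPI).  The master hands out task
   indices on request and answers with a termination tag when the pool is empty. */
#include <mpi.h>

#define TASK_TAG 1
#define DONE_TAG 2

void master(int ntasks, int nslaves)
{
    int next = 0, msg, stopped = 0;
    MPI_Status status;
    while (stopped < nslaves) {
        MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);      /* request (or result) from any slave */
        if (next < ntasks) {
            MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TASK_TAG, MPI_COMM_WORLD);
            next++;                             /* one more task handed out */
        } else {
            MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, DONE_TAG, MPI_COMM_WORLD);
            stopped++;                          /* this slave will now terminate */
        }
    }
}

void slave(void)
{
    int task, result = 0;
    MPI_Status status;
    MPI_Send(&result, 1, MPI_INT, 0, TASK_TAG, MPI_COMM_WORLD);      /* first request */
    MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    while (status.MPI_TAG == TASK_TAG) {
        result = task * task;                                        /* placeholder work */
        MPI_Send(&result, 1, MPI_INT, 0, TASK_TAG, MPI_COMM_WORLD);  /* result doubles as next request */
        MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    }
}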


Centralized dynamic load balancing

[Figure: master process holding a queue of tasks; slave processes request tasks and return results.]

Termination

Computation terminates when:
The task queue is empty, and
Every process has made a request for another task without any new tasks being generated.
It is not sufficient to terminate when the task queue is empty while one or more processes are still running, because a running process may provide new tasks for the task queue.


Decentralized dynamic load balancing


Fully Distributed Work Pool


Processes to execute
tasks from each other
Task
could
be
transferred by:
- Receiver-initiated
- Sender-initiated


Process Selection
Algorithms for selecting a process:
Round robin algorithm - process Pi requests tasks from process Px, where x is given by a counter that is incremented after each request, using modulo n arithmetic (n processes), excluding x = i.
Random polling algorithm - process Pi requests tasks from process Px, where x is a number selected randomly between 0 and n - 1 (excluding i).
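Small illustrative helpers for both selection rules (the names and interfaces are assumptions):

#include <stdlib.h>

/* Round robin: advance a counter modulo n, skipping this process's own id i */
int next_round_robin(int *counter, int n, int i)
{
    int x = *counter % n;
    if (x == i)
        x = (x + 1) % n;       /* never request from ourselves */
    *counter = x + 1;          /* continue from here on the next request */
    return x;
}

/* Random polling: pick a process uniformly from 0 .. n-1, excluding i */
int next_random_poll(int n, int i)
{
    int x = rand() % (n - 1);  /* a value in 0 .. n-2 */
    return (x >= i) ? x + 1 : x;
}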


Distributed Termination Detection Algorithms

Termination Conditions
Application-specific local termination conditions exist
throughout the collection of processes, at time t.
There are no messages in transit between processes at
time t.
Second condition necessary because a message in
transit might restart a terminated process. More difficult
to recognize. The time that it takes for messages to
travel between processes will not be known in advance.


Using Acknowledgment Messages


Each process is in one of two states:
Inactive - without any task to perform
Active
The process that sent the task that made it enter the active state becomes its parent.


Using Acknowledgment Messages


When a process receives a task, it immediately sends an acknowledgment message, except if the process it receives the task from is its parent process. It only sends an acknowledgment message to its parent when it is ready to become inactive, i.e. when:
Its local termination condition exists (all tasks are completed), and
It has transmitted all its acknowledgments for tasks it has received, and
It has received all its acknowledgments for tasks it has sent out.
A process must become inactive before its parent process. When the first process becomes idle, the computation can terminate.

Load balancing/termination detection


Example: finding the shortest distance between two points on a graph.


References:
Barry Wilkinson and Michael Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Second Edition, Prentice Hall, 2005.
