
Parallel distributed computing techniques
Instructor:
Pham Tran Vu

Students:
Le Trong Tan
Mai Van Ninh
Phung Quang Chinh
Nguyen Duc Canh
Dang Trung Tan

Contents
Motivation of Parallel Computing Techniques
Parallel Computing Techniques
Message-passing computing
Pipelined Computations
Embarrassingly Parallel Computations
Partitioning and Divide-and-Conquer Strategies
Synchronous Computations
Load Balancing and Termination Detection

Contents
Motivation of Parallel Computing Techniques
Parallel Computing Techniques
Message-passing computing
Pipelined Computations
Embarrassingly Parallel Computations
Partitioning and Divide-and-Conquer Strategies
Synchronous Computations
Load Balancing and Termination Detection

Motivation of Parallel Computing Techniques

Demand for Computational Speed


Continual demand for greater computational speed from a computer system than is currently possible.
Areas requiring great computational speed include numerical modeling and simulation of scientific and engineering problems.
Computations must be completed within a reasonable time period.

Contents
Motivation of Parallel Computing Techniques
Parallel Computing Techniques
Message-passing computing
Pipelined Computations
Embarrassingly Parallel Computations
Partitioning and Divide-and-Conquer Strategies
Synchronous Computations
Load Balancing and Termination Detection

Message-Passing Computing
Basics of message-passing programming using user-level message-passing libraries.
Two primary mechanisms needed:
A method of creating separate processes for execution on different computers
A method of sending and receiving messages


Message-Passing Computing
Static process creation:
Basic MPI way: the same source file is compiled to suit each processor, producing one executable per processor (Processor 0 through Processor n-1).

[Figure: source file compiled into executables for Processor 0 ... Processor n-1.]
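As a hedged illustration of this SPMD style (MPI is assumed; the file name and launch command are only examples), the same program can branch on its rank:

/* spmd.c - minimal SPMD sketch (illustrative; assumes a standard MPI installation) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);                  /* one copy of this executable starts per processor */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* how many processes were started? */
    if (rank == 0)
        printf("master of %d processes\n", nprocs);
    else
        printf("slave %d\n", rank);
    MPI_Finalize();
    return 0;
}

Compiled once and launched with, for example, mpirun -np 4 ./spmd.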

Message-Passing Computing
Dynamic process creation:
PVM way: a process running on Processor 1 calls spawn() at run time to start a new process (process 2) on Processor 2.

[Figure: time line showing Processor 1 calling spawn() and the start of execution of process 2 on Processor 2.]

Message-Passing Computing

Method of sending and receiving messages?
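As a hedged illustration (MPI is assumed here; the slides do not show the exact calls at this point), a point-to-point exchange could look like:

/* Point-to-point message passing sketch (illustrative MPI) */
#include <mpi.h>

void exchange(int rank)
{
    int x = 42;
    if (rank == 0)
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* send one int to process 1 */
    else if (rank == 1)
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                        /* receive it from process 0 */
}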


Contents
Motivation of Parallel Computing Techniques
Parallel Computing Techniques
Message-passing computing
Pipelined Computations
Embarrassingly Parallel Computations
Partitioning and Divide-and-Conquer Strategies
Synchronous Computations
Load Balancing and Termination Detection

Pipelined Computation
Problem divided into a series of
tasks that have to be completed
one after the other (the basis of
sequential programming).
Each task executed by a separate
process or processor.


Pipelined Computation
Where pipelining can be used to good effect
1. If more than one instance of the complete problem is to be executed
2. If a series of data items must be processed, each requiring multiple operations
3. If information to start the next process can be passed forward before the process has completed all its internal operations

Pipelined Computation

Execution time = m + p - 1 cycles for a p-stage pipeline processing m instances of the problem.
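A minimal sketch of one pipeline stage (MPI assumed; an adding pipeline is used only to illustrate the pattern): each process receives the partial result from the previous stage, does its own work, and forwards the result.

/* One stage of a simple adding pipeline (illustrative MPI sketch) */
#include <mpi.h>

void pipeline_stage(int rank, int nprocs, int my_value)
{
    int accum = 0;
    if (rank > 0)                                    /* every stage but the first waits for input */
        MPI_Recv(&accum, 1, MPI_INT, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    accum += my_value;                               /* this stage's own work */
    if (rank < nprocs - 1)                           /* forward the result to the next stage */
        MPI_Send(&accum, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
}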

Pipelined Computation

[Figures: further pipelined computation examples; images not preserved in this text.]

Contents
Motivation of Parallel Computing Techniques
Parallel Computing Techniques
Message-passing computing
Pipelined Computations
Embarrassingly Parallel Computations
Partitioning and Divide-and-Conquer Strategies
Synchronous Computations
Load Balancing and Termination Detection

Ideal Parallel Computation


A computation that can obviously be divided into a number of completely independent parts, each of which can be executed by a separate processor.
Each process can do its task without any interaction with the other processes.

Ideal Parallel Computation


Practical embarrassingly parallel computation with static process creation and a master-slave approach.


Ideal Parallel Computation


Practical embarrassingly parallel computation with dynamic process creation and a master-slave approach.


Embarrassingly parallel examples


Geometrical Transformations of Images
Mandelbrot set
Monte Carlo Method


Geometrical Transformations of Images


Performed on the coordinates of each pixel to move the position of the pixel without affecting its value.
The transformation of each pixel is totally independent of the other pixels.
Some geometrical operations:
Shifting
Scaling
Rotation
Clipping
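A hedged sketch of one such operation, a shift by (delta_x, delta_y), assuming each process is given a strip of rows of the image (the function name and interface are illustrative):

/* Shift the pixels of one process's strip of rows (illustrative sketch).
   Each pixel (x, y) maps to (x + delta_x, y + delta_y); its value is unchanged,
   and every pixel is handled independently of the others. */
void shift_strip(int first_row, int last_row, int width,
                 int delta_x, int delta_y, int new_x[], int new_y[])
{
    for (int y = first_row; y <= last_row; y++)
        for (int x = 0; x < width; x++) {
            int k = (y - first_row) * width + x;   /* index within this strip */
            new_x[k] = x + delta_x;
            new_y[k] = y + delta_y;
        }
}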


Geometrical Transformations of Images


Partitioning into regions for individual processes
[Figure: a 640 x 480 image partitioned either into 80 x 80 square regions (one per process) or into 640 x 10 row regions (one per process).]


Mandelbrot Set
Set of points in a complex plane that are quasi-stable when computed by iterating the function

    z(k+1) = z(k)^2 + c

where z(k+1) is the (k + 1)th iteration of the complex number z = a + bi, and c is a complex number giving the position of the point in the complex plane. The initial value for z is zero.
Iterations are continued until the magnitude of z is greater than 2 or the number of iterations reaches an arbitrary limit.
The magnitude of z is the length of the vector, given by

    z_length = sqrt(a^2 + b^2)
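A minimal sketch of the per-point iteration implied by this definition (an illustrative helper, not the book's exact cal_pixel routine):

/* Number of iterations of z(k+1) = z(k)^2 + c, starting from z = 0, before
   |z| exceeds 2 or the iteration limit is reached (illustrative sketch). */
int mandelbrot_iterations(double c_real, double c_imag, int max_iter)
{
    double z_real = 0.0, z_imag = 0.0;
    int iter = 0;
    while (iter < max_iter && z_real * z_real + z_imag * z_imag <= 4.0) {
        double temp = z_real * z_real - z_imag * z_imag + c_real;  /* real part of z^2 + c */
        z_imag = 2.0 * z_real * z_imag + c_imag;                   /* imaginary part of z^2 + c */
        z_real = temp;
        iter++;
    }
    return iter;   /* comparing |z|^2 with 4 avoids taking a square root */
}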

Mandelbrot Set

[Figures: images of the Mandelbrot set; not preserved in this text.]

Mandelbrot Set
c.real = real_min + x * (real_max - real_min)/disp_width
c.imag = imag_min + y * (imag_max - imag_min)/disp_height

Static Task Assignment
Simply divide the region into a fixed number of parts, each computed by a separate processor.
Not very successful because different regions require different numbers of iterations and time.
Dynamic Task Assignment
Have processors request new regions after computing previous regions.

Mandelbrot Set
Dynamic Task Assignment: have processors request new regions after computing previous regions.

[Figure: dynamic task assignment; image not preserved in this text.]

Monte Carlo Method


Another embarrassingly parallel computation.
Monte Carlo methods make use of random selections.
Example: to calculate pi, a circle with unit radius is formed within a square of side 2 x 2. The ratio of the area of the circle to the area of the square is given by

    (pi * 1^2) / (2 * 2) = pi / 4


Monte Carlo Method


One quadrant of the construction can be described by the integral

    integral from 0 to 1 of sqrt(1 - x^2) dx = pi / 4

Random pairs of numbers (xr, yr) are generated, each between 0 and 1. A pair is counted as inside the circle if sqrt(xr^2 + yr^2) <= 1; that is, if xr^2 + yr^2 <= 1.
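A minimal sequential sketch of this estimate (illustrative; the quality and seeding of rand() are ignored):

/* Estimate pi by sampling random points in the unit quadrant (illustrative sketch) */
#include <stdlib.h>

double estimate_pi(int samples)
{
    int in_circle = 0;
    for (int i = 0; i < samples; i++) {
        double xr = (double)rand() / RAND_MAX;   /* random x between 0 and 1 */
        double yr = (double)rand() / RAND_MAX;   /* random y between 0 and 1 */
        if (xr * xr + yr * yr <= 1.0)            /* inside the quarter circle? */
            in_circle++;
    }
    return 4.0 * in_circle / samples;            /* the ratio of areas is pi/4 */
}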


Monte Carlo Method


Alternative method to compute an integral:
Use random values of x to compute f(x) and sum the values of f(x):

    area ≈ (1/N) * [f(xr1) + f(xr2) + ... + f(xrN)] * (x2 - x1)

where xr1, ..., xrN are randomly generated values of x between x1 and x2.
The Monte Carlo method is very useful if the function cannot be integrated numerically (for example, because it has a large number of variables).

Monte Carlo Method


Example: computing an integral over [x1, x2].

Sequential code

Routine randv(x1, x2) returns a pseudorandom number between x1 and x2.
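A minimal sketch in this spirit, assuming the integrand f(x) = x^2 - 3x used in the Wilkinson and Allen text (a simple version of randv is also assumed):

/* Sequential Monte Carlo integration sketch (assumed integrand f(x) = x*x - 3*x). */
#include <stdlib.h>

double randv(double x1, double x2)
{
    return x1 + (x2 - x1) * rand() / RAND_MAX;   /* pseudorandom value between x1 and x2 */
}

double monte_carlo_integral(double x1, double x2, int N)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        double xr = randv(x1, x2);               /* random x between x1 and x2 */
        sum += xr * xr - 3 * xr;                 /* accumulate f(xr) */
    }
    return (sum / N) * (x2 - x1);                /* estimate of the integral */
}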


Monte Carlo Method


Parallel Monte Carlo integration

[Figure: a master collects partial sums from slave processes; the slaves obtain random numbers, on request, from a separate random-number process.]


Contents
Motivation of Parallel Computing Techniques
Parallel Computing Techniques
Message-passing computing
Pipelined Computations
Embarrassingly Parallel Computations
Partitioning and Divide-and-Conquer Strategies
Synchronous Computations
Load Balancing and Termination Detection

Partitioning simply divides the problem into parts.

It is the basis of all parallel programming.

Partitioning can be applied to the program data (data partitioning, or domain decomposition) and to the functions of a program (functional decomposition).

It is much less common to find concurrent functions in a problem, so data partitioning is the main strategy for parallel programming.


A sequence of numbers, x0, ..., xn-1, is to be added (n: number of items, p: number of processors).
The sequence is divided into p parts of n/p numbers each: x0 ... x(n/p)-1, xn/p ... x2(n/p)-1, ..., x(p-1)n/p ... xn-1.
Each part is added to give a partial sum, and the partial sums are then added to give the final sum.

[Figure: partitioning a sequence of numbers into parts and adding them.]



Characterized by dividing the problem into subproblems of the same form as the larger problem. Further divisions into still smaller sub-problems are usually done by recursion.
Recursive divide and conquer is amenable to parallelization because separate processes can be used for the divided parts. Also, the data is usually naturally localized.

A sequential recursive definition for adding a list of numbers is:

int add(int *s, int n)                        /* add the n numbers in list s */
{
    if (n <= 2)                               /* base case: at most two numbers left */
        return (n == 1) ? s[0] : s[0] + s[1];
    else {
        int half = n / 2;                     /* divide s into two parts, s1 and s2 */
        int part_sum1 = add(s, half);         /* recursive calls to add the sublists */
        int part_sum2 = add(s + half, n - half);
        return (part_sum1 + part_sum2);
    }
}


[Figure: tree construction - the initial problem is divided repeatedly into smaller problems, down to the final tasks at the leaves.]

[Figure: dividing an original list x0 ... xn-1 among eight processes P0 to P7. P0 starts with the whole list and hands half to P4; quarters then go to P2 and P6, and eighths to P1, P3, P5, and P7, so each process ends up with one part as its final task.]

Many possibilities:
Operations on sequences of numbers, such as simply adding them together
Several sorting algorithms can often be partitioned or constructed in a recursive fashion
Numerical integration
N-body problem


Bucket Sort
One bucket is assigned to hold the numbers that fall within each region. The numbers in each bucket are then sorted using a sequential sorting algorithm.

n: number of items
m: number of buckets

Sequential sorting time complexity: O(n log(n/m)).

Works well if the original numbers are uniformly distributed across a known interval, say 0 to a - 1.


Simple approach
Assign one processor for each bucket.


Partition the sequence into m regions, one region for each processor.
Each processor maintains p small buckets and separates the numbers in its region into its own small buckets.
The small buckets are then emptied into the p final buckets for sorting, which requires each processor to send one small bucket to each of the other processors (bucket i to processor i).
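A hedged sketch of the local separation step (each processor scattering the numbers of its region into p small buckets, assuming values lie between 0 and a - 1; the names are illustrative):

/* Separate one processor's region of numbers into p small buckets (illustrative).
   Small bucket i will later be sent to processor i. */
void fill_small_buckets(const int *region, int region_len, int a, int p,
                        int **small_bucket, int *bucket_count)
{
    for (int i = 0; i < p; i++)
        bucket_count[i] = 0;                          /* start with empty buckets */
    for (int j = 0; j < region_len; j++) {
        int b = (int)((long long)region[j] * p / a);  /* which of the p ranges? */
        small_bucket[b][bucket_count[b]++] = region[j];
    }
}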

Introduces new message-passing operation - all-to-all broadcast.


Sends data from each process to every other process.


The all-to-all routine actually transfers the rows of an array to columns: it transposes a matrix.
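In MPI this exchange could use MPI_Alltoall, sketched here under the simplifying assumption that every small bucket holds the same number of items (unequal buckets would need MPI_Alltoallv):

/* Each process sends its i-th small bucket to process i and receives one small
   bucket from every other process (illustrative; fixed bucket size assumed). */
#include <mpi.h>

void exchange_small_buckets(int *send_buckets, int *recv_buckets, int bucket_size)
{
    MPI_Alltoall(send_buckets, bucket_size, MPI_INT,
                 recv_buckets, bucket_size, MPI_INT, MPI_COMM_WORLD);
}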


Contents
Motivation of Parallel Computing Techniques
Parallel Computing Techniques
Message-passing computing
Pipelined Computations
Embarrassingly Parallel Computations
Partitioning and Divide-and-Conquer Strategies
Synchronous Computations
Load Balancing and Termination Detection

Synchronous Computations

Synchronous
Barrier
Barrier Implementation
Centralized Counter implementation
Tree Barrier Implementation
Butterfly Barrier

Synchronized Computations
Fully synchronous
Data Parallel Computations
Synchronous Iteration (Synchronous Parallelism)

Locally synchronous
Heat Distribution Problem
Sequential Code
Parallel Code

Barrier

A basic mechanism for synchronizing processes - inserted at the point in each process where it must wait.
All processes can continue from this point when all the processes have reached it.
Processes reach the barrier at different times.


Barrier

[Figure: processes reaching the barrier at different times; each waits until all have arrived.]

Barrier Implementation
Centralized counter implementation (linear barrier)
Tree barrier implementation
Butterfly barrier
Local synchronization
Deadlock


Centralized Counter Implementation
Has two phases:
Arrival phase (trapping)
Departure phase (release)
A process enters the arrival phase and does not leave this phase until all processes have arrived in this phase.
Then the processes move to the departure phase and are released.


Example code
Master:
    for (i = 0; i < n; i++)    /* count slaves as they reach the barrier */
        recv(Pany);
    for (i = 0; i < n; i++)    /* release slaves */
        send(Pi);
Slave processes:
    send(Pmaster);
    recv(Pmaster);
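For comparison, message-passing libraries also provide a barrier directly; in MPI the same effect is a single call (sketch):

/* MPI's built-in barrier: every process in the communicator blocks here
   until all of them have made this call. */
#include <mpi.h>

void synchronize_all(void)
{
    MPI_Barrier(MPI_COMM_WORLD);
}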


Tree Barrier Implementation

Suppose 8 processes, P0, P1, P2, P3, P4, P5, P6, P7:
First stage:
P1 sends message to P0 (when P1 reaches its barrier)
P3 sends message to P2 (when P3 reaches its barrier)
P5 sends message to P4 (when P5 reaches its barrier)
P7 sends message to P6 (when P7 reaches its barrier)
Second stage:
P2 sends message to P0 (P2 and P3 have reached their barriers)
P6 sends message to P4 (P6 and P7 have reached their barriers)
Third stage:
P4 sends message to P0 (P4, P5, P6, and P7 have reached their barriers)
P0 terminates the arrival phase (when P0 reaches its barrier and has received the message from P4)

Tree Barrier Implementation


Release with a reverse tree construction.

[Figure: tree barrier.]

Butterfly Barrier

This would be used if data were exchanged between the processes.

Local Synchronization
Suppose a process Pi needs to be synchronized and to exchange data with process Pi-1 and process Pi+1.
This is not a perfect three-process barrier because process Pi-1 will only synchronize with Pi and continue as soon as Pi allows. Similarly, process Pi+1 only synchronizes with Pi.

Synchronized Computations
Fully synchronous
In fully synchronous, all processes involved in the computation
must be synchronized.

Data Parallel Computations


Synchronous Iteration (Synchronous Parallelism)

Locally synchronous
In locally synchronous, processes only need to synchronize
with a set of logically nearby processes, not all processes
involved in the computation

Heat Distribution Problem


Sequential Code
Parallel Code

Data Parallel Computations


The same operation is performed on different data elements simultaneously (SIMD).
Data parallel programming is very convenient for two reasons:
The first is its ease of programming (essentially only one program).
The second is that it can scale easily to larger problem sizes.


Synchronous Iteration
Each iteration is composed of several processes that start together at the beginning of the iteration. The next iteration cannot begin until all processes have finished the previous iteration. Using forall:

for (j = 0; j < n; j++)           /* for each synchronous iteration */
    forall (i = 0; i < N; i++) {  /* N processes, each using */
        body(i);                  /* a specific value of i */
    }


Synchronous Iteration
Solving a General System of Linear Equations by Iteration
Suppose the equations are of a general form with n equations and n unknowns, where the unknowns are x0, x1, x2, ..., xn-1 (0 <= i < n):

    an-1,0 x0 + an-1,1 x1 + an-1,2 x2 + ... + an-1,n-1 xn-1 = bn-1
    ...
    a2,0 x0 + a2,1 x1 + a2,2 x2 + ... + a2,n-1 xn-1 = b2
    a1,0 x0 + a1,1 x1 + a1,2 x2 + ... + a1,n-1 xn-1 = b1
    a0,0 x0 + a0,1 x1 + a0,2 x2 + ... + a0,n-1 xn-1 = b0

Synchronous Iteration
By rearranging the ith equation:

    ai,0 x0 + ai,1 x1 + ai,2 x2 + ... + ai,n-1 xn-1 = bi

to

    xi = (1/ai,i)[bi - (ai,0 x0 + ai,1 x1 + ai,2 x2 + ... + ai,i-1 xi-1 + ai,i+1 xi+1 + ... + ai,n-1 xn-1)]

or, equivalently,

    xi = (1/ai,i)[bi - sum over j != i of ai,j xj]
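A hedged sketch of one Jacobi-style iteration sweep based on this rearrangement (sequential; the convergence test is omitted and the interface is illustrative):

/* One Jacobi iteration sweep: compute each new x[i] from the previous values using
   x_i = (1/a_ii) * (b_i - sum over j != i of a_ij * x_j).  Illustrative sketch;
   a is the n x n coefficient matrix stored row by row in a flat array. */
void jacobi_sweep(int n, const double *a, const double *b,
                  const double *x_old, double *x_new)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            if (j != i)
                sum += a[i * n + j] * x_old[j];
        x_new[i] = (b[i] - sum) / a[i * n + i];
    }
}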


Heat Distribution Problem

An area has known temperatures along each of its edges. Find the temperature distribution within.
Divide the area into a fine mesh of points, hi,j. The temperature at an inside point is taken to be the average of the temperatures of the four neighbouring points.
The temperature of each point is found by iterating the equation

    hi,j = (hi-1,j + hi+1,j + hi,j-1 + hi,j+1) / 4        (0 < i < n, 0 < j < n)



Heat Distribution Problem

[Figure: the mesh of points hi,j, with fixed temperatures along the edges; each interior point is the average of its four neighbours.]

Sequential Code
Using a fixed number of iterations:

for (iteration = 0; iteration < limit; iteration++) {
    for (i = 1; i < n; i++)
        for (j = 1; j < n; j++)
            g[i][j] = 0.25 * (h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1]);
    for (i = 1; i < n; i++)       /* update points */
        for (j = 1; j < n; j++)
            h[i][j] = g[i][j];
}


Parallel Code
With a fixed number of iterations, for process Pi,j (except for the boundary points):

for (iteration = 0; iteration < limit; iteration++) {
    g = 0.25 * (w + x + y + z);
    send(&g, Pi-1,j);    /* non-blocking sends */
    send(&g, Pi+1,j);
    send(&g, Pi,j-1);
    send(&g, Pi,j+1);
    recv(&w, Pi-1,j);    /* synchronous receives */
    recv(&x, Pi+1,j);
    recv(&y, Pi,j-1);
    recv(&z, Pi,j+1);    /* the sends and receives form a local barrier with the four neighbours */
}

Contents
Motivation of Parallel Computing Techniques
Parallel Computing Techniques
Message-passing computing
Pipelined Computations
Embarrassingly Parallel Computations
Partitioning and Divide-and-Conquer Strategies
Synchronous Computations
Load Balancing and Termination Detection

Load Balancing & Termination Detection

Load Balancing & Termination Detection

Content

Load Balancing: used to distribute computations fairly across processors in order to obtain the highest possible execution speed.

Termination Detection: detecting when a computation has been completed; more difficult when the computation is distributed.


Load Balancing


Load Balancing & Termination Detection

Load Balancing

Static Load Balancing


Load balancing can be attempted statically, before the execution of any process.

Dynamic Load Balancing

Load balancing can be attempted dynamically, during the execution of the processes.


Static Load Balancing

Round robin algorithm - passes out tasks in sequential order of processes, coming back to the first when all processes have been given a task.
Randomized algorithm - selects processes at random to take tasks.
Recursive bisection - recursively divides the problem into subproblems of equal computational effort while minimizing message passing.
Simulated annealing - an optimization technique.
Genetic algorithm - another optimization technique.

Static Load Balancing

Several fundamental flaws exist with static load balancing, even if a mathematical solution exists:
It is very difficult to estimate accurately the execution times of the various parts of a program without actually executing the parts.
Communication delays vary under different circumstances.
Some problems have an indeterminate number of steps to reach their solution.


Dynamic Load Balancing


Centralized dynamic load balancing

Tasks are handed out from a centralized location, using a master-slave structure.
The master process(or) holds the collection of tasks to be performed.
Tasks are sent to the slave processes. When a slave process completes one task, it requests another task from the master process.
(Terms used: work pool, replicated worker, processor farm.)
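A hedged sketch of the work-pool exchange (MPI assumed; the tags, the task encoding, and the placeholder computation are all illustrative):

/* Centralized work pool sketch (illustrative MPI).  The master hands out task
   indices on request and answers with a termination tag when the pool is empty. */
#include <mpi.h>

#define TASK_TAG 1
#define DONE_TAG 2

void master(int ntasks, int nslaves)
{
    int next = 0, msg, stopped = 0;
    MPI_Status status;
    while (stopped < nslaves) {
        MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);      /* request (or result) from any slave */
        if (next < ntasks) {
            MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TASK_TAG, MPI_COMM_WORLD);
            next++;                             /* one more task handed out */
        } else {
            MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, DONE_TAG, MPI_COMM_WORLD);
            stopped++;                          /* this slave will now terminate */
        }
    }
}

void slave(void)
{
    int task, result = 0;
    MPI_Status status;
    MPI_Send(&result, 1, MPI_INT, 0, TASK_TAG, MPI_COMM_WORLD);      /* first request */
    MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    while (status.MPI_TAG == TASK_TAG) {
        result = task * task;                                        /* placeholder work */
        MPI_Send(&result, 1, MPI_INT, 0, TASK_TAG, MPI_COMM_WORLD);  /* result doubles as next request */
        MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    }
}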


Centralized dynamic load balancing

[Figure: master process holding a queue of tasks; slave processes request tasks and return results.]

Termination

Computation terminates when:
The task queue is empty, and
Every process has made a request for another task without any new tasks being generated.
It is not sufficient to terminate when the task queue is empty while one or more processes are still running, because a running process may provide new tasks for the task queue.


Decentralized dynamic load balancing


Fully Distributed Work Pool


Processes to execute
tasks from each other
Task
could
be
transferred by:
- Receiver-initiated
- Sender-initiated


Process Selection
Algorithms for selecting a process:
Round robin algorithm - process Pi requests tasks from process Px, where x is given by a counter that is incremented after each request, using modulo n arithmetic (n processes), excluding x = i.
Random polling algorithm - process Pi requests tasks from process Px, where x is a number selected randomly between 0 and n - 1 (excluding i).
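Small illustrative helpers for both selection rules (the names and interfaces are assumptions):

#include <stdlib.h>

/* Round robin: advance a counter modulo n, skipping this process's own id i */
int next_round_robin(int *counter, int n, int i)
{
    int x = *counter % n;
    if (x == i)
        x = (x + 1) % n;       /* never request from ourselves */
    *counter = x + 1;          /* continue from here on the next request */
    return x;
}

/* Random polling: pick a process uniformly from 0 .. n-1, excluding i */
int next_random_poll(int n, int i)
{
    int x = rand() % (n - 1);  /* a value in 0 .. n-2 */
    return (x >= i) ? x + 1 : x;
}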


Distributed Termination Detection Algorithms

Termination Conditions
Application-specific local termination conditions exist
throughout the collection of processes, at time t.
There are no messages in transit between processes at
time t.
Second condition necessary because a message in
transit might restart a terminated process. More difficult
to recognize. The time that it takes for messages to
travel between processes will not be known in advance.


Using Acknowledgment Messages


Each process is in one of two states:
Inactive - without any task to perform
Active
The process that sent the task that made it enter the active state becomes its parent.


Using Acknowledgment Messages


When a process receives a task, it immediately sends an acknowledgment message, except if the process it receives the task from is its parent process. It only sends an acknowledgment message to its parent when it is ready to become inactive, i.e. when:
Its local termination condition exists (all tasks are completed), and
It has transmitted all its acknowledgments for tasks it has received, and
It has received all its acknowledgments for tasks it has sent out.
A process must become inactive before its parent process. When the first process becomes idle, the computation can terminate.

Load balancing/termination detection


Example: finding the shortest distance between two points on a graph.


References:
Barry Wilkinson and Michael Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Second Edition, Prentice Hall, 2005.
