
OpenMP

Parallel Processing
OpenMP
A specification for parallelizing programs in a
shared memory environment.
Provides a set of pragmas, runtime routines,
and environment variables that programmers
can use to specify shared-memory
parallelism in Fortran, C, and C++ programs.
All you need to do is insert the appropriate
pragmas in the source program, and then
compile the program with an OpenMP-supporting
compiler and the appropriate compiler option.
OpenMP Pragmas
The OpenMP specification defines a
set of pragmas.
A pragma is a compiler directive on
how to process the block of code that
follows the pragma.
The most basic pragma is
#pragma omp parallel, which
denotes a parallel region.
Fork Join Model
An OpenMP-aware compiler uses the OpenMP pragmas to
generate an executable that will run in parallel using
multiple threads.
OpenMP uses the fork-join model of parallel execution.
An OpenMP program begins as a single thread of
execution, called the initial thread. When a thread
encounters a parallel construct, it creates a new team
of threads composed of itself and zero or more
additional threads, and becomes the master of the
new team.
All members of the new team (including the master)
execute the code inside the parallel construct.
There is an implicit barrier at the end of the parallel
construct. Only the master thread continues execution
of user code beyond the end of the parallel construct.
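A minimal sketch of the fork-join pattern, assuming an OpenMP-aware compiler (e.g. built with an option such as -fopenmp):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    printf("serial: initial thread only\n");      /* before the fork */

    #pragma omp parallel                          /* fork: a team of threads is created */
    {
        printf("parallel: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                             /* implicit barrier, then join */

    printf("serial again: only the master continues\n");
    return 0;
}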
Fork Join Model
OpenMP
Most OpenMP constructs apply to a
structured block.
Structured block: a block of one or
more statements with one point of
entry at the top and one point of exit
at the bottom.
What is a Thread
A thread is an independent sequence of execution of program code:
a block of code with one entry and one exit.
OpenMP threads are mapped onto physical cores.
It is possible to map more than one thread onto a core,
but in practice it is best to have a one-to-one mapping.
Write a program that prints
hello world.

#include <stdio.h>

void main()
{
    int ID = 0;
    printf("hello(%d) ", ID);
    printf("world(%d) \n", ID);
}
Write a multithreaded program
that prints hello world.
#include <omp.h>
#include <stdio.h>

void main()
{
    #pragma omp parallel
    {
        int ID = 0;
        printf("hello(%d) ", ID);
        printf("world(%d) \n", ID);
    }
}
Multithreaded program that
prints hello world with ID
#include <omp.h>
#include <stdio.h>

void main()
{
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        printf("hello(%d) ", ID);
        printf("world(%d) \n", ID);
    }
}
Some Useful OpenMP
Functions
Environment variable OMP_NUM_THREADS
Runtime function omp_set_num_threads(n)
Runtime function omp_get_num_threads()
Returns the number of threads in the parallel region
Returns 1 if called outside a parallel region
Runtime function omp_get_thread_num()
Returns the id of the thread in the team
Value in [0, n-1], where n = #threads
Master thread always has id 0
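A small sketch exercising these routines; requesting 4 threads here is only an example:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(4);                 /* overrides OMP_NUM_THREADS for what follows */

    printf("outside: %d thread(s)\n", omp_get_num_threads());   /* prints 1 */

    #pragma omp parallel
    {
        int id = omp_get_thread_num();      /* 0 .. n-1, master is 0 */
        int n  = omp_get_num_threads();     /* number of threads in the team */
        printf("thread %d of %d\n", id, n);
    }
    return 0;
}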
OpenMP Memory Model
OpenMP is a multi-threading, shared
address model.
Threads communicate by sharing
variables.
Unintended sharing of data causes
race conditions:
the program's outcome changes
as the threads are scheduled differently.
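A minimal sketch of such a race: every thread increments the same shared variable without synchronization, so the final value can change from run to run (the variable name counter is illustrative):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int counter = 0;                  /* shared by default */

    #pragma omp parallel
    {
        for (int i = 0; i < 100000; i++)
            counter++;                /* unsynchronized read-modify-write: a data race */
    }

    printf("counter = %d\n", counter);    /* often does not equal threads * 100000 */
    return 0;
}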
OpenMP Memory Model
Shared and Private Variable
SHARED ( list )
All variables in list will be considered shared.
Every OpenMP thread has access to all these variables.
PRIVATE ( list )
Every OpenMP thread will have its own private copy of the variables in list.
No other OpenMP thread has access to this private copy.
Example: #pragma omp parallel private(a,b,c)
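A minimal sketch of the two clauses; the names a and b are illustrative:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int a = 10;     /* shared: one copy seen by all threads    */
    int b = 20;     /* private: each thread gets its own copy  */

    #pragma omp parallel shared(a) private(b)
    {
        b = omp_get_thread_num();     /* private copies start uninitialized, so assign first */
        printf("a=%d (same for all), b=%d (per thread)\n", a, b);
    }
    return 0;
}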
Synchronization Problem:
False Sharing
False Sharing
Non-shared data in the same cache line means
each update invalidates the cache line,
in essence sloshing independent
data back and forth between threads.
Cache Line: Data is transferred
between memory and cache in blocks
of fixed size, called cache lines.
Modify Pi Program
False Sharing
In a multiprocessor system, each processor
has a local cache.
The memory system must guarantee
cache coherence.
False sharing occurs when threads on
different processors modify variables
that reside on the same cache line.
This invalidates the cache line and forces
an update, which hurts performance.
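A minimal sketch of the problem and of the common padding workaround, assuming 64-byte cache lines (the padding factor and the names sum_bad/sum_ok are illustrative):

#include <omp.h>

#define NTHREADS 4
#define PAD 8                        /* assume a 64-byte cache line: 8 doubles */

double sum_bad[NTHREADS];            /* adjacent elements share a cache line */
double sum_ok[NTHREADS][PAD];        /* padding keeps each thread's element on its own line */

int main(void)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (int i = 0; i < 1000000; i++) {
            sum_bad[id]   += 1.0;    /* false sharing: repeated cache-line invalidations */
            sum_ok[id][0] += 1.0;    /* no false sharing                                  */
        }
    }
    return 0;
}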
False Sharing
Synchronization: Critical
The critical construct provides mutual exclusion:
only one thread at a time can execute the
enclosed structured block.
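A minimal sketch of critical protecting a shared accumulator; the variable names are illustrative:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    double res = 0.0;                        /* shared accumulator */

    #pragma omp parallel
    {
        double tmp = omp_get_thread_num() * 1.0;   /* illustrative per-thread work */

        #pragma omp critical
        {
            res += tmp;                      /* only one thread at a time executes this block */
        }
    }
    printf("res = %f\n", res);
    return 0;
}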
Synchronization: Atomic
Atomic provides mutual exclusion, but
only applies to the update of a memory
location
(the update of X in the following example):
#pragma omp parallel
{
    double tmp, B;
    B = DOIT();
    tmp = big_ugly(B);
    #pragma omp atomic
    X += tmp;
}
Synchronization: Barrier
Synchronization: nowait
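A minimal sketch of the nowait clause, assuming the two loops are independent (the array names A and B are illustrative): it removes the barrier implied at the end of a worksharing loop, so threads proceed without waiting.

#include <omp.h>

#define N 1000
double A[N], B[N];

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp for nowait           /* no barrier after this loop */
        for (int i = 0; i < N; i++)
            A[i] = i * 0.5;

        #pragma omp for                  /* implied barrier after this loop */
        for (int i = 0; i < N; i++)
            B[i] = i * 2.0;              /* independent of A, so the nowait above is safe */
    }
    return 0;
}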
Loop Directive: For
Splits the iterations of the for-loop among the
threads of the enclosing parallel region, so each
thread handles a different portion of the loop.
#pragma omp for
for(int n=0; n<10; ++n)
{
    printf(" %d", n);
}
printf(".\n");
Sample Output
0567182349
Loop Directive: For
The previous code is equivalent to:

int this_thread = omp_get_thread_num();
int num_threads = omp_get_num_threads();
int my_start = this_thread * 10 / num_threads;
int my_end   = (this_thread + 1) * 10 / num_threads;
for(int n = my_start; n < my_end; ++n)
    printf(" %d", n);
Scheduling Clauses for
Loops
schedule(static [,chunk])
Deals out blocks of iterations of size
chunk to each thread.
Scheduling is done at compile time.
Default schedule technique.

#pragma omp for schedule(static)
for(int n=0; n<10; ++n)
    printf(" %d", n);
printf(".\n");
Scheduling Clauses for
Loops
schedule(dynamic [,chunk])
Each thread grabs chunk iterations off
a queue until all iterations have been
handled.
Most of the work is done at run time.

#pragma omp for schedule(dynamic)
for(int n=0; n<10; ++n)
    printf(" %d", n);
printf(".\n");

#pragma omp for schedule(dynamic, 3)
for(int n=0; n<10; ++n)
    printf(" %d", n);
printf(".\n");
Scheduling Clauses for
Loops
schedule(guided[,chunk])
Threads dynamically grab blocks of
iterations. The size of the block starts
large and shrinks down to size chunk
as the calculation proceeds.
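For illustration, a guided schedule can be written in the same style as the static and dynamic examples above (the chunk size of 2 is an arbitrary choice):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp for schedule(guided, 2)   /* block size starts large, shrinks toward 2 */
        for(int n=0; n<10; ++n)
            printf(" %d", n);
    }
    printf(".\n");
    return 0;
}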
Working with loop: Some
Tricks
Serial version with a loop-carried dependence on j:

int i, j, A[MAX];
j = 5;
for (i = 0; i < MAX; i++)
{
    j += 2;
    A[i] = big(j);
}

Parallel version with the dependence removed:

int i, A[MAX];
#pragma omp parallel for
for (i = 0; i < MAX; i++)
{
    int j = 5 + 2*(i+1);
    A[i] = big(j);
}
Working with loop: Some
Tricks
Use collapse to program Nested Loops
#pragma omp parallel for collapse(2)
for (int i=0; i<N; i++) {
for (int j=0; j<M; j++) {
.....
}
}
Will form a single loop of length NxM
and then parallelize that.
Working with loop: Some
Tricks
We are combining values into a
single accumulation variable (ave);
there is a true dependence between
loop iterations that can't be trivially
removed.

double ave = 0.0, A[MAX];
int i;
for (i = 0; i < MAX; i++)
{
    ave += A[i];
}
Reduction
OpenMP reduction clause:
reduction (op : list)
Inside a parallel or a work-sharing
construct:
A local copy of each list variable is made and
initialized depending on the op (e.g. 0 for
+).
Updates occur on the local copy.
Local copies are reduced into a single value
and combined with the original global value.
Reduction

double ave = 0.0, A[MAX];
int i;
#pragma omp parallel for reduction(+:ave)
for (i = 0; i < MAX; i++) {
    ave += A[i];
}
ave = ave/MAX;
Reduction Operators and
Initial Values
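The local copy of each reduction variable is initialized to the identity value of its operator. A minimal sketch using several of the common C operators, with the initialization values (per the OpenMP specification) noted in comments:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int sum = 0, prod = 1, all = 1, any = 0;

    #pragma omp parallel for reduction(+:sum) reduction(*:prod) \
                             reduction(&&:all) reduction(||:any)
    for (int i = 1; i <= 10; i++) {
        sum  += i;                    /* +  : local copy initialized to 0 */
        prod *= (i % 3) + 1;          /* *  : local copy initialized to 1 */
        all  = all && (i > 0);        /* && : local copy initialized to 1 */
        any  = any || (i > 8);        /* || : local copy initialized to 0 */
    }
    printf("sum=%d prod=%d all=%d any=%d\n", sum, prod, all, any);
    return 0;
}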
Pi Program with Reduction
Operator
#include <omp.h>
static long num_steps = 100000;
double step;
void main()
{
    int i;
    double pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    #pragma omp parallel
    {
        double x;
        #pragma omp for reduction(+:sum)
        for (i = 0; i < num_steps; i++) {
            x = (i + 0.5) * step;
            sum = sum + 4.0/(1.0 + x*x);
        }
    }
    pi = step * sum;
}
Synchronization: Barrier
Each thread waits until all threads
arrive.
#pragma omp parallel shared(A, B, C) private(id)
{
    id = omp_get_thread_num();
    A[id] = big_calc1(id);
    #pragma omp barrier
    #pragma omp for
    for (i = 0; i < N; i++) { C[i] = big_calc3(i, A); }
    #pragma omp for nowait
    for (i = 0; i < N; i++) { B[i] = big_calc2(C, i); }
    A[id] = big_calc4(id);
}
Synchronization: Master
Construct
The master construct denotes a
structured block that is only
executed by the master thread.
The other threads just skip it (no
synchronization is implied).
#pragma omp parallel
{
    do_many_things();
    #pragma omp master
    { exchange_boundaries(); }
    #pragma omp barrier
    do_many_other_things();
}
Synchronization: Single
Construct
The single construct denotes a block of
code that is executed by only one thread
(not necessarily the master thread).
A barrier is implied at the end of the single
block (the barrier can be removed with a
nowait clause).
#pragma omp parallel
{
    do_many_things();
    #pragma omp single
    { exchange_boundaries(); }
    do_many_other_things();
}
Synchronization: Sections
Construct
The sections worksharing construct
gives a different structured block to
each thread.
Sometimes it is handy to indicate that
"this and this can run in parallel".
There is a barrier at the end of the
omp sections; use the nowait
clause to turn off the barrier.
Synchronization: Sections
Construct
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        X_calculation();
        #pragma omp section
        y_calculation();
    }
}
Synchronization: Sections
Construct
#pragma omp parallel              // starts a new team
{
    Work0();                      // this function would be run by all threads
    #pragma omp sections          // divides the team into sections
    {                             // everything herein is run only once
        { Work1(); }
        #pragma omp section
        { Work2();
          Work3(); }
        #pragma omp section
        { Work4(); }
    }
    Work5();                      // this function would be run by all threads
}
Synchronization: Locks
The OpenMP runtime library provides
a lock type, omp_lock_t, along with
routines to initialize, set, unset, and
destroy locks.
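A minimal sketch of the lock routines protecting a shared counter (the counter is illustrative):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_lock_t lock;
    int counter = 0;

    omp_init_lock(&lock);            /* create the lock */

    #pragma omp parallel
    {
        omp_set_lock(&lock);         /* acquire: other threads block here */
        counter++;                   /* protected update                  */
        omp_unset_lock(&lock);       /* release                           */
    }

    omp_destroy_lock(&lock);         /* free the lock's resources */
    printf("counter = %d\n", counter);
    return 0;
}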
