
OpenMP Course: Examples

Author: Michael Bane Version: 1.1 Date: Nov 2013

Introduction
Please refer to the Getting Started slides http://wiki.rcs.manchester.ac.uk/community/OpenMP/course which outline how to obtain code skeletons (for questions 2 onwards) and how to compile and run your codes.

Question 1 - Simple Replication


Along the lines of the ubiquitous "hello world" program, your task is to write an OpenMP program that will output a statement identifying each thread and the total number of threads. There is no skeleton code for this, so open your favourite editor (emacs, gedit) and create a file from scratch.

Write this program, use an environment variable to set the number of threads, and run this short program a few times for the same number of threads. What do you notice? Ensure your code can be compiled and run in serial.

Advanced Users: Amend your code so that it sets the number of threads in the code itself. Which takes precedence: environment variables or clauses in your code?
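As a reference point, a minimal C version might look like the sketch below (the file name and the exact wording of the output are up to you; the #ifdef _OPENMP guard keeps the same file compilable in serial):

    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    int main(void)
    {
    #ifdef _OPENMP
        #pragma omp parallel
        {
            /* each thread reports its own id and the size of the team */
            printf("Hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
    #else
        printf("Hello from the only thread (serial build)\n");
    #endif
        return 0;
    }

Typically you would set the thread count with, for example, export OMP_NUM_THREADS=4 before running. For the advanced part, a call such as omp_set_num_threads(4), or a num_threads(4) clause on the parallel directive, can then be compared against the environment variable.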

Question 1b - Simple Parallelisation


In the ~/training/ROPENMP/exercises directory there is a serial code, loops.f90/loops.c. Compile and run the serial code to ensure you know the expected results. Then edit the code to parallelise as many loops as you can, recompile and run: do you get the correct results?
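For reference, the directive syntax for an independent loop looks like this in C (a generic sketch, not a fragment of loops.c itself):

    /* a loop whose iterations are independent, so it is safe to share out */
    #pragma omp parallel for default(none) shared(a, b, n)
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];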

Question 2 - Scheduling
The code lu_serial.f90/c performs the matrix multiplication A = L U, where L is a lower triangular matrix and U an upper triangular one; that is,

L(i,j) = 0 for i < j,    U(i,j) = 0 for i > j

The main part of the code is given by (the C example is similar):

    do i=1,n
      do j=1,n
        do k=1,i   ! only up to i
          a(i,j) = a(i,j) + l(i,k)*u(k,j)
        end do
      end do
    end do

Assuming n=10, and using two threads, determine how much work each thread will perform for each of the following scheduling methods:

SCHEDULE (STATIC)
SCHEDULE (STATIC,1)

Add OpenMP directives to the code to parallelise the above loop (across the i loop) and use OMP timers to determine how long it takes for each of the above schedules on (say) 1, 2, 4 and 8 threads. Try other SCHEDULE directive options and consider how the efficiency depends on the scheduling involved. Try setting the SCHEDULE to "runtime" and then running the code for different values of the appropriate environment variable. What do you notice? NB you may wish to increase the value of the variable n to magnify any effects.
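A sketch of how the C version of this might look, assuming a, l, u and n are the arrays and size from lu_serial.c (the schedule clause is the part to vary; omp_get_wtime provides the timer):

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(runtime) default(none) shared(a, l, u, n)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k <= i; k++)      /* only up to i, as in the Fortran above */
                a[i][j] += l[i][k] * u[k][j];
    double t1 = omp_get_wtime();
    printf("LU multiply took %f s\n", t1 - t0);

With schedule(runtime) the schedule is taken from the OMP_SCHEDULE environment variable, e.g. export OMP_SCHEDULE="static,1".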

Question 3a: Prime Numbers


The skeleton code, primes_serial.c/f90, contains comments to help you create a parallel code that will give each OpenMP thread a fixed number of integers to search for prime numbers. Having found its primes, each thread should update two shared arrays (a 1D array containing the number of primes for each thread, and a 2D array containing the primes themselves for each thread). Still within the single PARALLEL region, output the primes found on each thread. Run this a few times, on varying numbers of threads. What do you see (in terms of output)?

Finally, construct a single vector holding the prime numbers in order, before terminating the parallel region and outputting them.

FORTRAN: you also need the chkprime.f90 subroutine, for example:
ifort -O0 -openmp primes_open.f90 chkprime.f90 -o primes.exe
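One possible shape for the parallel region is sketched below in C; MAXTHREADS, MAXPRIMES, lo, hi and is_prime() are placeholders rather than names from primes_serial.c:

    int nprimes[MAXTHREADS];              /* 1D shared array: number of primes per thread */
    int primes[MAXTHREADS][MAXPRIMES];    /* 2D shared array: the primes found per thread */

    #pragma omp parallel default(none) shared(nprimes, primes, lo, hi)
    {
        int me  = omp_get_thread_num();
        int nt  = omp_get_num_threads();
        int len = (hi - lo + 1) / nt;                       /* fixed block of integers per thread */
        int start = lo + me * len;
        int end   = (me == nt - 1) ? hi : start + len - 1;  /* last thread takes any remainder */

        nprimes[me] = 0;
        for (int n = start; n <= end; n++)
            if (is_prime(n))                                /* chkprime-style test, assumed */
                primes[me][nprimes[me]++] = n;

        /* still inside the PARALLEL region: report what this thread found */
        printf("thread %d found %d primes\n", me, nprimes[me]);
    }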

Question 3b:
Take your solution from Question 3a and use a CRITICAL directive or an ORDERED clause+directive to output the primes in ascending order.
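If you take the ORDERED route, the output step might look roughly like this (a sketch reusing the placeholder arrays from the 3a sketch; it must sit inside the PARALLEL region, and nthreads stands for the team size):

    #pragma omp for ordered schedule(static,1)
    for (int t = 0; t < nthreads; t++) {
        #pragma omp ordered
        {
            /* the ordered blocks execute in iteration order, so the blocks of
               primes come out thread by thread, in ascending order of value   */
            for (int i = 0; i < nprimes[t]; i++)
                printf("%d\n", primes[t][i]);
        }
    }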

Question 4a: Integration Example


In this example we will integrate a function using the trapezoidal rule. An outline serial code has been provided as trap_serial.f90/c. You should add a PARALLEL region in which:
1. each thread determines its thread number;
2. the DO/for loop is performed across all threads, with the local summation being written safely to part of the mySum vector;
3. upon completion of the local sums, each thread computes the global sum (combining the components of the mySum vector).
This example is a little contrived, but the main exercise is to understand (a) the data scoping clauses and (b) how work can be spread across the available threads; a sketch of the structure is given below. Once your code compiles, run it on various numbers of threads and determine the speedup and efficiency. How do you know if you have the right answer?
Advanced Users: What solution can you find (other than using REDUCTION) for any problems you spot? You may need to consider cache lines.
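A sketch of the structure being described, in C; mySum, f(), a, h, n and MAXTHREADS stand in for whatever names trap_serial.c actually uses:

    double mySum[MAXTHREADS];                 /* one partial sum per thread */

    #pragma omp parallel default(none) shared(mySum, a, h, n)
    {
        int me = omp_get_thread_num();        /* 1. this thread's number */
        mySum[me] = 0.0;

        #pragma omp for                       /* 2. loop iterations shared across the team */
        for (int i = 1; i < n; i++)
            mySum[me] += f(a + i * h);        /* safe: each thread writes only its own slot */

        /* the implicit barrier at the end of the for ensures all partial sums are ready */
        double globalSum = 0.0;               /* 3. global sum, formed (redundantly) on every thread */
        for (int t = 0; t < omp_get_num_threads(); t++)
            globalSum += mySum[t];

        if (me == 0)
            printf("integral = %f\n", globalSum);
    }

For the advanced part, note that adjacent elements of mySum will usually sit on the same cache line, so repeated updates from different threads can cause false sharing; accumulating into a private scalar and writing mySum[me] once, or padding the array, are typical remedies to consider.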

Question 4b:
For this exercise we wish to use the REDUCTION technique to form the global sum directly, rather than adding partial sums together ourselves. First, copy your solution to Question 4a to a new file. Using this new file, amend mySum to be a scalar, amend the PARALLEL construct accordingly and output the value of the integral outside the PARALLEL region. Once your code compiles, run it on various numbers of threads and determine the speedup and efficiency. How do these compare to Q4a?
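With a REDUCTION clause the same calculation collapses to something like this (a sketch; mySum is now a scalar, and f, a, h, n are the same placeholders as before):

    double mySum = 0.0;

    #pragma omp parallel for default(none) shared(a, h, n) reduction(+:mySum)
    for (int i = 1; i < n; i++)
        mySum += f(a + i * h);            /* OpenMP combines the per-thread copies for you */

    printf("integral = %f\n", mySum);     /* output outside the PARALLEL region */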

Question 5 - Where to parallelise


The object of this exercise is to determine whether it is safe to parallelise every DO loop that you see. Consider the code loopy.f (not .f90, sorry!) or loopy.c. Follow these steps (note that, at this stage, we are NOT interested in parallelising the call to the initial subroutine):
1. determine the correct result by compiling and running serially;
2. parallelise every loop (we've put a skeleton OMP compound directive in to help you);
3. run the program on 1, 2, 4 and 8 threads in batch and compare the results.

4. now rewrite your code to give the correct results irrespective of the number of threads used, while running as fast as you can (the sketch below illustrates the kind of loop-carried problem to look for).
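A small hypothetical example of a loop-carried dependence (not taken from loopy.f/c):

    /* NOT safe to parallelise as written: iteration i reads a[i-1], which may
       be computed by a different thread, in a different order                 */
    for (int i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];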

Question 6 - Granularity
Let's look at where to put OpenMP directives for matrix-matrix multiplication. Copy the file matrix_mult.f90 or matmult.c to your own file and parallelise the A=B*C loop (don't bother with parallelising the calls to initial and zero) at the outer or middle or inner loop levels (try each, but just one at a time, perhaps ending up with 3 different files). Now, choose a suitable number of threads and submit the compiled code(s) to the batch system. Which loop level do you think will benefit most from the parallelisation, and why? Do your results bear this out?
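For reference, the outer-loop version might look like the sketch below (the middle and inner versions simply move the directive down a level); the array names are placeholders for those in matmult.c:

    /* outer-loop version: each thread gets a block of rows of A */
    #pragma omp parallel for default(none) shared(a, b, c, n)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            a[i][j] = 0.0;
            for (int k = 0; k < n; k++)
                a[i][j] += b[i][k] * c[k][j];
        }

When making your prediction, bear in mind that moving the directive inwards means the fork-join and work-sharing overhead is paid once per i (middle loop) or once per (i,j) pair (inner loop), rather than once in total.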

Question 7 - Granularity
Stage 1
The aim of this example is to highlight the effects of fusing loops and replicating some computations to improve performance. Looking at performance.f90/c we see that there are two main loops separated by a scalar operation. Firstly, put the relevant PARALLEL DO/for constructs around each of the DO/for loops. Save this as perf_1.f90/c.

Stage 2
Now write another version (perf_2.f90/c) but with both the loops and the scalar operation in a single PARALLEL construct using the appropriate construct to ensure that the computation of alpha is only performed on one thread.
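A sketch of the stage-2 shape in C; the loop bodies and the expression for alpha below are stand-ins rather than the ones in performance.f90/c, and SINGLE is one suitable choice of construct:

    #pragma omp parallel default(none) shared(x, y, alpha, n)
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            x[i] = 1.0 / (i + 1.0);      /* stand-in for the first loop */

        #pragma omp single               /* the scalar operation runs on one thread only; */
        alpha = 2.0 * x[n/2];            /* the implicit barrier then makes it visible    */

        #pragma omp for
        for (int i = 0; i < n; i++)
            y[i] = alpha * x[i];         /* stand-in for the second loop */
    }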

Stage 3
Finally, note that we can actually allow all threads to compute their own values of alpha and use this in the following DO/for loop. Write another version (perf_3.f90/c) of this code so that all threads compute their own value of alpha. Which of the above methods do you think will be the fastest? Time them all on 2, 4 and 8 threads and ensure you've got the correct results. What do you find?
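For stage 3 the same sketch becomes the following: alpha is now private and every thread computes its own copy, and the implicit barrier at the end of the first worksharing loop still guarantees that x is complete first (again, the loop bodies are stand-ins and alpha is assumed declared before the region):

    #pragma omp parallel default(none) shared(x, y, n) private(alpha)
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            x[i] = 1.0 / (i + 1.0);

        alpha = 2.0 * x[n/2];            /* redundant work: every thread computes alpha */

        #pragma omp for
        for (int i = 0; i < n; i++)
            y[i] = alpha * x[i];
    }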

Question 8 - Run time libraries

Add some OpenMP directives to the start_dynamic.f90/c code so that it reads in the number of threads to use and then sets the number of threads (OMP_NUM_THREADS) to this value. Run the program with the environment variable OMP_DYNAMIC set to TRUE. Now set the environment variable OMP_DYNAMIC to FALSE and repeat the above step: what do you notice? Now add a run-time library call which will ensure the requested number of threads are used.
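The run-time library routines involved are roughly these (a sketch of the calls rather than the full start_dynamic code):

    int nthreads;
    printf("How many threads? ");
    scanf("%d", &nthreads);

    omp_set_num_threads(nthreads);   /* request this many threads for subsequent regions */
    omp_set_dynamic(0);              /* run-time call: disable dynamic adjustment so the
                                        requested number is actually used                */

    #pragma omp parallel
    {
        #pragma omp single
        printf("running on %d threads (dynamic = %d)\n",
               omp_get_num_threads(), omp_get_dynamic());
    }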

Question 9: Determine pi
Take the code given in pi.f90/c and convert it to a simple OpenMP program and then run it on a varying number of processors. What kind of speedup and efficiency do you see when using 4 and 8 processors?
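A compact OpenMP version of the usual mid-point approach is sketched below; the actual pi.f90/c skeleton may differ in detail (loop bounds, variable names):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const long n = 100000000;        /* number of intervals */
        const double h = 1.0 / n;
        double sum = 0.0;

        double t0 = omp_get_wtime();
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < n; i++) {
            double x = (i + 0.5) * h;    /* mid-point of interval i */
            sum += 4.0 / (1.0 + x * x);
        }
        double pi = h * sum;
        double t1 = omp_get_wtime();

        printf("pi = %.12f  (%.3f s, up to %d threads)\n",
               pi, t1 - t0, omp_get_max_threads());
        return 0;
    }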
