Introduction
Please refer to the Getting Started slides http://wiki.rcs.manchester.ac.uk/community/OpenMP/course which outline how to obtain code skeletons (for questions 2 onwards) and how to compile and run your codes.
Question 2 - Scheduling
The code lu_serial.f90/c performs the matrix multiplication A = L U, where L is a lower triangular matrix and U an upper triangular one; that is,

L(i,j) = 0 for i < j,    U(i,j) = 0 for i > j
do i = 1, n
   do j = 1, n
      do k = 1, i        ! only up to i
         a(i,j) = a(i,j) + l(i,k)*u(k,j)
      end do
   end do
end do
Assuming n=10, and using two threads, determine how much work each thread will perform for each of the following scheduling methods:
Add OpenMP directives to the code to parallelise the above loop (across the i loop) and use OMP timers to determine how long it takes for each of the above schedules on (say) 1, 2, 4 and 8 threads. Try other SCHEDULE directive options and consider how the efficiency depends on the scheduling involved. Try setting the SCHEDULE to "runtime" and then running the code with different values of the appropriate environment variable (OMP_SCHEDULE). What do you notice? NB you may wish to increase the value of the variable n to magnify any effects.
Question 3b:
Take your solution from Question 3a and implement the output using a CRITICAL directive or an ORDERED clause+directive to output the primes in ascending order.
Question 4b:
For this exercise we wish to use the REDUCTION technique to form directly the global sum, rather than adding partial sums together ourselves. First, copy your solution to question 4a to a new file. Using this new file, amend mySum to be a scalar, amend the PARALLEL construct accordingly and output the value of the integral outside the PARALLEL region. Once your code compiles, run it on various numbers of threads and determine the speedup and efficiency. How do these compare to Q4a?
4. Now rewrite your code so that it gives the correct results irrespective of the number of threads used, while running as fast as you can make it.
Question 6 - Granularity
Let's look at where to put OpenMP directives for matrix-matrix multiplication. Copy the file matrix_mult.f90 or matmult.c to your own file and parallelise the A=B*C loop (don't bother with parallelising the calls to initial and zero) at the outer or middle or inner loop levels (try each, but just one at a time, perhaps ending up with 3 different files). Now, choose a suitable number of threads and submit the compiled code(s) to the batch system. Which loop do you think will benefit most from the parallelisation, and why? Do your results bear this out?
Question 7 - Granularity
Stage 1
The aim of this example is to highlight the effects of fusing loops and replicating some computations to improve performance. Looking at performance.f90/c we see that there are two main loops separated by a scalar operation. Firstly, put the relevant PARALLEL DO/for constructs around each of the DO/for loops. Save this as perf_1.f90/c.
Stage 2
Now write another version (perf_2.f90/c) but with both the loops and the scalar operation in a single PARALLEL construct using the appropriate construct to ensure that the computation of alpha is only performed on one thread.
Stage 3
Finally, note that we can actually allow all threads to compute their own values of alpha and use this in the following DO/for loop. Write another version (perf_3.f90/c) of this code so that all threads compute their own value of alpha. Which of the above methods do you think will be the fastest? Time them all on 2, 4 and 8 threads and ensure you've got the correct results. What do you find?
Add some OpenMP directives to the start_dynamic.f90/c code so that it reads in the number of threads to use and then sets the number of threads to this value. Run the program with the environment variable OMP_DYNAMIC set to TRUE. Now set the environment variable OMP_DYNAMIC to FALSE and repeat the above step - what do you notice? Now add a run time library call which will ensure the requested number of threads are used.
Question 9: Determine pi
Take the code given in pi.f90/c and convert it to a simple OpenMP program and then run it on a varying number of processors. What kind of speed up and efficiency do you see when using 4 and 8 processors?