
Advanced Parallel Computing for Scientific Applications
Autumn Term 2010

Prof. I. F. Sbalzarini, ETH Zentrum, CAB G34, CH-8092 Zürich
Prof. P. Arbenz, ETH Zentrum, CAB H89, CH-8092 Zürich

Exercise 3
Release: 12 Oct. 2010
Due: 26 Oct. 2010

1 Practice in C/C++
The following two assignments illustrate the effects of caching in matrix operations. C
uses row-major memory layout for storing matrices and higher-dimensional arrays. Hence,
row-wise access of elements is more cache-efficient than column-wise access.
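
As a short illustration of this point, the following sketch (hypothetical helper functions; an n × n matrix A is assumed to be stored row-major in a 1-D array) contrasts the two access patterns, which differ only in the loop order:

// Element (i,j) lives at A[i*n + j] in row-major storage.
double sum_row_wise(const double* A, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)       // outer loop over rows
        for (int j = 0; j < n; ++j)   // consecutive addresses: cache friendly
            s += A[i*n + j];
    return s;
}

double sum_column_wise(const double* A, int n) {
    double s = 0.0;
    for (int j = 0; j < n; ++j)       // outer loop over columns
        for (int i = 0; i < n; ++i)   // stride of n doubles: frequent cache misses
            s += A[i*n + j];
    return s;
}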

Question 1: Matrix multiplication


The file matrixMult.cpp contains a program to measure the execution time of the matrix
multiplication

C = A · B

Each matrix is stored as a 1-D array. The multiplication is performed in the method void
Multiply(...). To compute each element of C, the elements of A are accessed row-wise
and those of B column-wise, which results in many cache misses, especially for large
matrices. Better cache usage can be achieved if the matrix B is transposed and the
multiplication is modified accordingly so that it gives the same result as before. Your
task is to implement the methods void InPlaceTranspose(...) and void
MultiplyEfficient(...).
Compile your code using the default GNU compiler: g++ -o mult matrixMult.cpp
Do you observe better performance in the case of large matrices?
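
One possible shape for the efficient variant is sketched below. This is only a sketch: the actual signatures and storage layout in matrixMult.cpp may differ, and square n × n matrices are assumed here. It takes a matrix Bt that already holds the transpose of B, so both operands are read row-wise in the inner loop:

void MultiplyEfficientSketch(const double* A, const double* Bt,
                             double* C, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int k = 0; k < n; ++k)
                s += A[i*n + k] * Bt[j*n + k];  // row i of A, row j of B^T
            C[i*n + j] = s;
        }
}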

Question 2: Matrix norm


You have to calculate the 1-norm and infinity norm of a matrix A of size m × n given by
\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{m} |a_{ij}|

\|A\|_\infty = \max_{1 \le i \le m} \sum_{j=1}^{n} |a_{ij}|

Implement the above equations in the appropriate methods double Norm_1() and double
Norm_Inf() in the file matrixNorm.cpp. Count the floating point operations in the
calculation and compute the Mflop/s rate for different matrix sizes.
Compile your code using: g++ -o norm matrixNorm.cpp
Which of the above norms is calculated faster and why?
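
For orientation, a free-function sketch of the two norms is given below; the actual member functions in matrixNorm.cpp will access the matrix data differently, and row-major storage of an m × n matrix in a 1-D array is assumed:

#include <cmath>      // std::fabs
#include <algorithm>  // std::max

// 1-norm: maximum absolute column sum. The inner loop strides over
// rows, i.e. it jumps n elements between accesses in row-major storage.
double norm_1_sketch(const double* A, int m, int n) {
    double best = 0.0;
    for (int j = 0; j < n; ++j) {
        double s = 0.0;
        for (int i = 0; i < m; ++i)
            s += std::fabs(A[i*n + j]);
        best = std::max(best, s);
    }
    return best;   // roughly m*n additions (plus m*n absolute values)
}

// Infinity norm: maximum absolute row sum. The inner loop walks
// contiguously through memory.
double norm_inf_sketch(const double* A, int m, int n) {
    double best = 0.0;
    for (int i = 0; i < m; ++i) {
        double s = 0.0;
        for (int j = 0; j < n; ++j)
            s += std::fabs(A[i*n + j]);
        best = std::max(best, s);
    }
    return best;
}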

In each of the above examples, time measurement is done using the method double walltime(...),
which is implemented in the file walltime.h.
Please submit the jobs to the batch queue as follows:

bsub -o <op_file> ./<executable>

2 Introduction to OpenMP
1. OpenMP is an application programming interface that provides a parallel programming
model for shared-memory and distributed-shared-memory multiprocessors.

2. OpenMP is based on the fork/join execution model: an OpenMP program starts
as a single thread (the master), and additional threads are created when the master
reaches a parallel region.

3. There is a standard include file omp.h for C/C++ OpenMP programs.

4. The number of threads is fixed a priori by the programmer using the environment
variable OMP_NUM_THREADS.

5. omp_get_num_threads() and omp_get_thread_num() can be used to get the number
of threads created and the local number assigned to each thread.

6. The directive #pragma omp parallel in the program marks the beginning of a parallel
region.

7. The keywords used for distributing work among threads are for, sections, critical,
etc. (a minimal parallel region is sketched below).
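
A minimal parallel region illustrating points 2 to 6 might look as follows (a sketch only, not the complete answer to Question 3; compile with -fopenmp):

#include <omp.h>
#include <stdio.h>

int main(void) {
    // The team size is taken from OMP_NUM_THREADS at program start.
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();   // local thread number
        int nth = omp_get_num_threads();  // threads in the current team
        printf("Hello from thread %d of %d\n", tid, nth);
    }
    return 0;
}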

Question 3: First OpenMP program


Using the above information, write a simple program that creates n = 2, 4, 6 threads;
each thread should display one of the following messages along with its own thread number.

This is Advanced Parallel Computing tutorial.


This is the first OpenMP program.
This program uses n threads
Hello World

Compile the program using the GNU compiler as follows:

gcc -lgomp -fopenmp -o omp1 omp1.c

Question 4: Work sharing among OpenMP threads


The file vectorAdd.cpp contains the code for serial and parallel execution of the SAXPY
operation, along with time measurement.
a) Compile the program and execute it using, say, 4 threads. Why is there no speedup?
Modify the code in order to achieve appreciable speedup.
b) Write code to calculate the dot product ā · b̄ in parallel (see the sketch below for a hint on both parts).
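
As a hint, a work-shared loop and a reduction could be structured as follows. The helper names and arguments are illustrative; vectorAdd.cpp may organize its data differently. Compile with -fopenmp:

// a) SAXPY y = alpha*x + y, with loop iterations shared among the team.
void saxpy_parallel(int n, double alpha, const double* x, double* y) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}

// b) Dot product: each thread accumulates a private partial sum,
//    which OpenMP combines when the loop ends.
double dot_parallel(int n, const double* a, const double* b) {
    double dot = 0.0;
    #pragma omp parallel for reduction(+:dot)
    for (int i = 0; i < n; ++i)
        dot += a[i] * b[i];
    return dot;
}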

You may submit the jobs to the batch queue as follows:

bsub -n N -o <op_file> ./<executable>


(N is the number of processors)

Do not forget to set the environment variable OMP_NUM_THREADS before execution.
