Parallel Computing I

Assignment-1

Khedar Yogesh, Jujare Kartik


4612809 CSE, TU Braunschweig, Braunschweig, Germany; email: y.khedar@tu-bs.de
4611716 CSE, TU Braunschweig, Braunschweig, Germany; email: k.jujare@tu-bs.de

Keywords
trap, task2, task3, critical, reduction
Abstract
This assignment is submitted as a contribution towards the completion of Parallel Computing I, Summer Semester 2016.

Contents
1. Task 1
   1.1. Task 1.a
2. Task 2
   2.1. Task 2.a
3. Task 3
   3.1. Task 3a
4. Task 4
   4.1. Task 4a
   4.2. Task 4b

1. Task 1
1.1. Task 1.a
The four factors that influence the runtime of a program are as follows:
1. Algorithm Efficiency (high influence): The algorithm is the major contributor to the runtime of a program. For instance, unnecessary for loops increase computing time for results that are never used. Similarly, every sorting and searching method has its own computation time, as illustrated by the sketch after this list.
2. Processor Clock Speed (normal to high influence): The clock speed determines how many instructions a processor can execute per unit time; the faster, the better. Its effect is limited by the size of the program.
3. Parallel or Serial Programming (high influence): Running a program on multiple processors can greatly reduce its runtime, although the gain is limited by communication between processors, which is very slow compared to the clock speeds of the processors.
4. Data Communication Speed & Program Size (high influence): Communication between processor and memory greatly limits the utilisation of the processor's clock speed. Only a small number of instructions and data can be stored on the processor chip itself, where they can be retrieved in the smallest possible time. As the size of the program grows, it is no longer possible to store everything on the chip, and data must be kept at a different location, e.g. RAM or even slower hard drives.
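
As a small illustration of point 1 (this example is ours and not part of the assignment code), the two routines below solve the same search problem with very different computation times: linear search needs O(n) comparisons in the worst case, while binary search needs only O(log n) on sorted input.

#include <stdio.h>

/* Linear search: O(n) comparisons in the worst case. */
int linear_search(const int *a, int n, int key) {
    for (int i = 0; i < n; i++)
        if (a[i] == key) return i;
    return -1;
}

/* Binary search: O(log n) comparisons, but requires sorted input. */
int binary_search(const int *a, int n, int key) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] == key) return mid;
        if (a[mid] < key) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}

int main(void) {
    int a[] = {1, 3, 5, 7, 9, 11};
    int n = sizeof a / sizeof a[0];
    printf("linear: %d, binary: %d\n",
           linear_search(a, n, 9), binary_search(a, n, 9));
    return 0;
}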

2. Task 2
2.1. Task 2.a
The plots were made for the following values, varied as follows:
n = 1000, 10000, 100000, 1000000, 10000000
s = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
The algorithm for the code is as follows (a sketch of it appears after this list):
1. Create the arrays n and s.
2. For every element n[i] run the following code:
(a) Create & open a specific .txt file where the results will be stored for the given
n.
(b) Dedicate memory for the array of x with space for n[i] elements.
(c) Generate using the random function all the elements and fill array x.
(d) For each element s[j] calculate sum every s[j]th element of x and record this sum
along with the time taken(dT) for this loop to the opened file.
(e) Close the file.
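
The original source is not reproduced in this report, so the following is only a minimal sketch of the benchmark under our own assumptions: the output file name, the use of clock_gettime for dT, and random values in [0, 1] are ours.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const long n[] = {1000, 10000, 100000, 1000000, 10000000};
    const int num_n = sizeof n / sizeof n[0];
    for (int i = 0; i < num_n; i++) {
        char fname[64];
        snprintf(fname, sizeof fname, "n_%ld.txt", n[i]);  /* hypothetical name */
        FILE *fp = fopen(fname, "w");                      /* step (a) */
        if (fp == NULL) { perror("fopen"); return 1; }
        double *x = malloc(n[i] * sizeof *x);              /* step (b) */
        if (x == NULL) { perror("malloc"); return 1; }
        for (long k = 0; k < n[i]; k++)                    /* step (c) */
            x[k] = (double)rand() / RAND_MAX;
        for (int s = 1; s <= 10; s++) {                    /* step (d) */
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            double sum = 0.0;
            for (long k = 0; k < n[i]; k += s)             /* every s-th element */
                sum += x[k];
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double dT = (t1.tv_sec - t0.tv_sec)
                      + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
            fprintf(fp, "%d %g %g\n", s, dT, sum);         /* columns: s, dT, sum */
        }
        free(x);
        fclose(fp);                                        /* step (e) */
    }
    return 0;
}
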
2.1.1. Description of the Data Files. Five files are created to hold these values, one per value of n. Each file has three columns: the first contains the s values, the second the time required to sum the elements of the array for that value of s, and the third the resulting sum. The file format was chosen so that it can be used directly as an input file to GNUPLOT, and all five files are plotted on the same graph.

2.1.2. Description of the Plot. The plots in Figure 1, Figure 2 and Figure 3 show the trend in the time taken as the value of s increases. Increasing the value of s artificially simulates an increase in the number of processors. As s increases (for a given n), less time is needed to reach the final sum of the elements of x.

Figure 1
Figure shows the time for increasing s, for n = 1000


Figure 2
Figure shows the time for increasing s, for n = 1000, n = 10000 and n = 100000

Figure 3
Figure shows the time for increasing s, for n = 1000, n = 10000, n = 100000, n = 1000000 and n = 10000000

3. Task 3
3.1. Task 3a

The requirement of the task was to find the round-trip latency: the time taken for a message to be passed to another process and returned.


1. The given function was modified to include a while loop for changing the size of the message (a sketch of the timing loop appears after this list).
2. The message size was varied from 10000 to 100000 with an increment of 10000.
3. The same file was run in two configurations: the first used 1 node with 2 processors on that node, while the second used 2 nodes with 1 processor per node. The goal of the task was to find out whether communication between processors takes time, and if so, how much.
4. The result of this task was that a program run on two processors lying on different nodes takes more time than one run on two processors located on the same node.
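
The modified function is not reproduced in the report, so the following is a hedged sketch of such a round-trip timing loop under our own assumptions: a for loop in place of the while loop, MPI_Wtime for timing, and rank 0 as the initiator.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int size = 10000; size <= 100000; size += 10000) {
        char *buf = malloc(size);
        MPI_Barrier(MPI_COMM_WORLD);          /* start both ranks together */
        double t0 = MPI_Wtime();
        if (rank == 0) {                      /* send, then wait for the echo */
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%d %g\n", size, MPI_Wtime() - t0);  /* size, round-trip time */
        } else if (rank == 1) {               /* echo the message back */
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
        free(buf);
    }
    MPI_Finalize();
    return 0;
}

Run with two ranks, e.g. mpirun -np 2 ./pingpong; whether those ranks land on one node or on two is decided by the job configuration (nodes and ppn), not by the code.
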
3.1.1. Description of the Graph. Figure 4 shows the general increase in the time requirement for both configurations. As seen from the graph, the time requirement grows linearly with a linear increase in the message size.
1. For the case of 1 node : 2 ppn, we find that the time taken for communication is far less than the time taken for 2 nodes : 1 ppn.
2. We quantify this property as the ratio of message size to time taken; the higher, the better. Table 1 gives this ratio for different message sizes in both configurations.

Figure 4
Size of Message vs Scaled Time

4. Task 4
This task was run on the cluster.

Table 1
Message passing time ratio, Ratio = MessageSize / Time, for message sizes 10000 to 100000 in steps of 10000.

Message size    1 node : 2 ppn     2 nodes : 1 ppn
10000           625362.187500      110144.539062
20000           718012.750000      115041.085938
30000           741130.375000      114601.304688
40000           710359.125000      114535.679688
50000           701946.937500      114337.656250
60000           700405.687500      114283.406250
70000           696832.625000      114331.765625
80000           707692.500000      114218.578125
90000           710843.937500      114281.023438
100000          708926.312500      114131.609375

4.1. Task 4a
The given code trap.c was modified to implement parallelism using a critical section and reduction. The main features of the modified code are as follows (a sketch of the two parallel variants appears after this list):

1. Trap function: the return type is changed to long double so that the function returns the integral sum. A pointer to the execution time is passed as an additional argument so that the time is available in the main function.
2. The variations of the trap function are written in the same file. New critical and reduction trap functions are defined and called in main inside a while loop, which is nested inside a for loop; the for loop changes the number of threads.
3. The number of threads is varied from 1 to 4, and the output for each thread count, for all array sizes, is printed into its own file (e.g. file 1.txt) in the same directory where the executable resides.
4. The values of a and b are set in the main function to a = 10 and b = 100.
5. The output file is laid out as follows; the number in the file name indicates the thread count. Column 1 is the size of the array; column 2 the time for the serial trap function; column 3 the time for the trap function with reduction; column 4 the time for the trap function with a critical section; column 5 the integral computed serially; column 6 the integral with reduction; column 7 the integral with the critical section.
6. These clauses are required to prevent conflicts. The variable integral is shared when the code runs in parallel, so unsynchronised updates cause a race condition that returns a value which is not the true value. This is avoided either by restricting use of the variable to one thread at a time, as in the critical section, or by using reduction, which collects each thread's contribution to the variable separately. Both approaches avoid the conflict and therefore return an answer matching the one given by the serial code.
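
Since the modified trap.c is not reproduced here, the following is a hedged sketch of the two parallel variants; the integrand f(x) = x^2 and the per-iteration critical update are our own assumptions. Compile with gcc -fopenmp.

#include <omp.h>

/* Trapezoidal rule with a reduction clause: each thread accumulates a
   private copy of integral, combined once at the end. */
long double trap_reduction(double a, double b, long n, double *dT) {
    double h = (b - a) / n;
    long double integral = (a * a + b * b) / 2.0;  /* f(x) = x^2 assumed */
    double t0 = omp_get_wtime();
#pragma omp parallel for reduction(+ : integral)
    for (long i = 1; i < n; i++) {
        double x = a + i * h;
        integral += x * x;
    }
    *dT = omp_get_wtime() - t0;
    return integral * h;
}

/* Same computation, but every update to the shared variable happens
   inside a critical section, serialising the threads. */
long double trap_critical(double a, double b, long n, double *dT) {
    double h = (b - a) / n;
    long double integral = (a * a + b * b) / 2.0;
    double t0 = omp_get_wtime();
#pragma omp parallel for
    for (long i = 1; i < n; i++) {
        double x = a + i * h;
#pragma omp critical
        integral += x * x;
    }
    *dT = omp_get_wtime() - t0;
    return integral * h;
}

With reduction, synchronisation happens once per thread when the private copies are combined; with the critical section, every single update is serialised, which is consistent with the bottleneck discussed in Task 4b.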

4.2. Task 4b
For every run the serial code takes essentially the same time, because it always runs on a single thread. We can therefore compare how increasing the number of threads affects the reduction function relative to the serial code time, and likewise compare the critical-section code to the serial code.
Figures 5, 6, 7 and 8 describe the time spent when the number of threads is set to 1.
1. In Figure 5 we notice that the critical function spends a lot of overhead on ensuring exclusive use of the variable integral, which is unnecessary here.
2. Both the reduction and the serial code take the same time, which is explained by the fact that only one thread is running.
Figures 9, 10, 11 and 12 describe the time spent when the number of threads is set to 2.
1. In Figure 9 we notice that the critical function becomes a bottleneck.
2. From Figure 11 we see that for 1000 sub-intervals the serial code runs faster than the reduction function, but as the number of sub-intervals increases, the efficiency of the reduction code increases.
3. We see that serial execution is best for jobs with a very small number of sub-intervals.
4. The reduction code improves on its own performance from the previous run on 1 thread, and we expect this improvement to continue as the number of threads increases.
Figures 13, 14 and 15 describe the time spent when the number of threads is set to 3.
1. In Figure 13 we notice that the critical function's performance deteriorates as the number of threads increases, probably due to the synchronisation overhead and the bottleneck created by allowing only one thread at a time to access the variable.
2. The reduction code improves on its own performance from the previous runs on 1 and 2 threads, and we expect this improvement to continue as the number of threads increases.
Figures 16, 17 and 18 describe the time spent when the number of threads is set to 4.
1. In Figure 16 we notice that the critical function's performance deteriorates as the number of threads increases, probably due to the synchronisation overhead and the bottleneck created by allowing only one thread at a time to access the variable.
2. The reduction code improves on its own performance from the previous runs on 1, 2 and 3 threads, and we expect this improvement to continue as the number of threads increases.


Figure 5
Figure shows the size of array versus original time for all three variations

Figure 6
Figure shows the size of array versus original time for the serial and reduction functions only


Figure 7
Figure shows the size of array versus scaled time

Figure 8
Figure shows the size of array versus scaled time


Figure 9
Figure shows the size of array versus original time for the serial and reduction functions only

Figure 10
Figure shows the size of array versus original time for the serial and reduction functions only


Figure 11
Figure shows the size of array versus scaled time

Figure 12
Figure shows Size of Array to the Scaled Time.This graph is provided only to resolve the close
points. We ask the reader to ignore the value for 10
7. The time taken for this operation does not
scale properly for the scaling factor that is applied


Figure 13
Figure shows the size of array versus original time for all three variations

Figure 14
Figure shows the size of array versus original time for the serial and reduction functions only


Figure 15
Figure shows the size of array versus scaled time

Figure 16
Figure shows the size of array versus original time for all three variations


Figure 17
Figure shows the size of array versus original time for the serial and reduction functions only

Figure 18
Figure shows the size of array versus scaled time
