M. Hernandez milhernandezcag@unal.edu.co
Engineering Faculty, Universidad Nacional de Colombia, Bogota, Colombia
D. A. Caceres daacaceressa@unal.edu.co
Engineering Faculty, Universidad Nacional de Colombia, Bogota, Colombia
Abstract
Parallelizing compilers have mainly focused on loop parallelization. This paper presents a way
to generate task parallelization based on syntactic analysis. We describe the construction of the
call, flow, and dependency graphs, the information needed to proceed with the parallelization
phase, and our proposed algorithm for finding independent sections that can run as parallel
tasks. We then provide an algorithm to perform loop parallelization of reduction operations. As
results, we have parallelized several competitive programming tasks, to show how parallelization
can be extended to a world that has been purely sequential. Finally, we compare the features
implemented in our software with those of other well-known parallelization tools.
Keywords
Automatic parallelization, task parallelization, flow graph, dependency graph.
1 Introduction
Parallel programming is the dominant paradigm of modern computer architectures; thus, programmers
must stop writing only serial code and start learning and writing parallel code. In terms of
execution time, a parallel program can range from slightly slower to up to 10 times faster than
its serial counterpart, depending on the processor type and its number of cores.
Auto Parallelizer is a source-to-source translator whose main goal is to show the efficiency
of the parallel paradigm by inserting OpenMP pragmas into serial code using syntactic analysis,
with the help of ANTLR and the visitor pattern, so that the user can compare the execution time
of the serial code with that of the auto-generated parallel version. The tool aims to bring the
parallel paradigm closer to programmers, since auto-parallelization is a difficult process that
requires exhaustive analysis and is prone to errors; this is why it is always better to write
parallel code from the beginning than to parallelize it in later stages of development.
This paper provides a brief introduction to the parallel paradigm, surveys different auto-
parallelization tools, gives a general overview of Auto Parallelizer and how it works, and shows
a comparison of execution times between serial codes written by competitive programmers and the
auto-generated parallel code produced by Auto Parallelizer.
2 Concepts
To understand the following sections, we first sketch out some important concepts related
to parallelization.
2.1 Dependencies
A dependency refers to a section of code that must be executed before another one. This is the
central concept of auto-parallelization, since whether code can be parallelized depends on
whether or not a dependency exists. For example, the following loop carries a dependency: each
iteration reads acc[i - 1], which was written by the previous iteration, so the iterations
cannot run in parallel.
acc[0] = 0;
for (int i = 1; i <= n; ++i)
    acc[i] = acc[i - 1] + i;
Data dependency: occurs when variables must be read and written in the same order, since
reading a variable and then writing it does not give the same result as writing it and then
reading it.
2.4 Agglomeration
If there are n independent sections of code, creating n threads and assigning one section to
each of them is not the most efficient approach, since creating and destroying threads costs
time, and context switching between threads also costs time; thus, there must be a balance
between the number of threads and the amount of work assigned to each of them.
2.5 Scheduling
It is the assignment of the operations that each thread is going to perform. There are two types:
Static: All iterations perform the same amount of work. Ideal for iterations that take
relatively the same amount of time. Figure 5 is an example of this: if the loop is parallelized
over a fixed number of threads, each of them executes a similar number of operations and they
finish at nearly the same time.
Dynamic: Each time a thread finishes its operations, new work is assigned to it. Ideal when
some iterations take longer than others, since waiting for threads to join can be expensive in
terms of execution time. Figure 6 is an example of this.
2.6 OpenMP
To achieve parallel computing it is necessary to use a parallel programming language or an
API that allows parallel execution. As mentioned, our parallelization process is performed by
inserting OpenMP directives.
"The OpenMP API supports multi-platform shared-memory parallel programming in C/C++
and Fortran. The OpenMP API defines a portable, scalable model with a simple and flexible
interface for developing parallel applications on platforms from the desktop to the
supercomputer." [1]
We have taken two clauses from OpenMP to perform the parallelization:
Sections clause: It begins with the pragma omp parallel sections; then each section, specified
by the pragma omp section inside the sections region, is executed by a different thread.
All threads wait until the others have finished, and then the program continues with the code
after the sections pragma.
Reduction clause: It begins with the pragma omp parallel reduction(op:var), where op is the
operator to be used (currently the supported operators are addition, subtraction,
multiplication, bitwise or, bitwise and, bitwise xor, min, and max) and var is the variable
that will be reduced.
A private copy, initialized with the identity of the operator (identity op x = x), is created
for each thread; at the end, the specified operator is applied across all copies and the final
result is written into the original variable.
and C by inserting OpenMP pragmas and using CUDA. It aims at generating parallel code with
almost no manual help required.
SUIF Compiler (Stanford University Intermediate Format): A framework for research on op-
timizing and parallelizing compilers, developed as a platform for research on compiler
techniques for high-performance machines. It has been successfully used for research on
various concepts including loop transformation, array data dependence analysis, software
pre-fetching, scalar optimizations, and instruction scheduling. The SUIF system consists of a
parallelizer that automatically searches for parallel loops and generates corresponding
parallel code [2].
CAPO (Computer-Aided Parallelizer and Optimizer): A tool that automatically inserts compiler
directives to enable parallel processing on shared-memory parallel machines. CAPO was
developed at NASA Ames Research Center.
Gupta, Mukhopadhyay, Sinha: Although this tool is not recent and does not have a name, it is
a very complete one. It uses the IBM Toronto compiler and builds a compile-time analysis
framework that applies symbolic analysis over array intervals for the parallelization of
recursive functions. The tool also provides speculative run-time analysis when the
compile-time analysis is not enough.
YUCCA (User Code Conversion Application): An automatic serial-to-parallel code conversion
tool introduced by KPIT Technologies Ltd., Pune. The tool accepts a C source code file as
input and generates a transformed, multithreaded parallel code that uses pthread functions
and OpenMP constructs. The tool performs both task parallelization and loop-level
parallelization.
Cetus: A parallelizing compiler for C that can run on any system with a Java runtime
environment. It provides an internal C parser and is written in Java. Cetus uses the basic
techniques for parallelization and currently implements reduction variable recognition,
privatization, and induction variable substitution. The most recent version of Cetus includes
a GUI and a client-server model. Cetus compiles and runs the sequential input and generates
as output the C code with OpenMP constructs. It also shows charts of speedup and
efficiency [3].
4 Code analysis
Auto Parallelizer consists of a Java parser written with the help of ANTLR; the Abstract
Syntax Tree (AST) is then traversed numerous times to gather the information required for
the parallelization. The internal program representation (IR) is implemented as a Java class
hierarchy. A high-level representation provides a syntactic view of the source program to the
pass writer, making it easy to understand, access, and transform the input program. The
specification of each class in the hierarchy is as follows:
AutoParallelizer: This class receives from the user the source code and the flags with which
the tool is going to be used, and creates an instance of the Translator class.
Translator: This class has an instance of the class Program and is in charge of running the
translation algorithm, which is stated in Figure 7.
Program: This class has the list of functions of the program, the call graph of these
functions, and the parallel code that is going to be given to the user at the end of the
execution.
Function: Each instance of this class is one function written in the source code given by the
user. Its attributes are an ordered list of blocks of code (flow graph), a set of alive and
dead variables used in the function, a reference to the head of the function in the AST, a
dependency graph, and the list of islands that it possesses.
Block: It is the most atomic of all classes, and each instance represents a section of code of
a function written in the source code given by the user. Its attributes are an ordered list of
instructions (lines of code) and a set of alive and dead variables used in this section of code.
int main() {
    add(d[i - 1], d[i - 1]);
    sub(d[i], d[c[i] - 1]);
}
1. While the instruction (line of code) is not a control structure (while, do-while, if-else,
for), add it to the current block.
4.5 Islands
After the DDG is completely built, we proceed to find the set of dependency islands. An
island is a group of blocks connected in the data dependency graph: if there is an edge
between block u and block v in the DDG, then u and v must be in the same island. Each block
is assigned to one and only one island.
#include <iostream>
using namespace std;
int main() {
    int n = 10, sum1 = 0, sum2 = 0;
    for (int i = 0; i < n; ++i)
        sum1 += i;
    for (int i = 0; i < n; ++i)
        sum2 += 2 * i;
    cout << sum1 + sum2 << "\n";
}
#include <iostream>
using namespace std;
int main() {
    int n = 10, sum1 = 0, sum2 = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            #pragma omp parallel for reduction(+:sum2)
            for (int i = 0; i < n; ++i)
                sum2 += 2 * i;
        }
        #pragma omp section
        {
            #pragma omp parallel for reduction(+:sum1)
            for (int i = 0; i < n; ++i)
                sum1 += i;
        }
    }
    cout << sum1 + sum2 << "\n";
}
Table 1: Non-parallelizable codes. T1: execution time before being processed by Auto
Parallelizer. T2: execution time after being processed by Auto Parallelizer.
None of the algorithms tested in Table 1 have significant sections of code that can be
parallelized. Auto Parallelizer only found independent sections corresponding to the
initialization of scalar variables, which does not affect the execution time. As all of them
are graph algorithms, the set of inputs was generated randomly and each code was then tested
several times: thirty runs each for both the serial and the parallelized versions.
We then tested codes that can be parallelized. First, we tried classical examples such as
independent loops, independent calls to functions and procedures, reduction variables of
minimum and maximum, and some others. Then we proceeded to test our software on more complex,
real problems; as stated at the beginning, we wanted to show how parallelization can be
included in a world that has been purely sequential: competitive programming contests.
Competitive programming contest problems have been sequential because most of them cannot be
solved in a parallel way. It was difficult to find problems that met the requirement of
possibly having independent tasks (with many computations), but fortunately we found some.
Here we show two of them.
The first problem was extracted from the Red de Programacion Competitiva, a Colombian
organization that runs programming contests around Latin America. The solution for this
problem involved an implementation of the Fast Fourier Transform (FFT), whose code contains
many independent tasks [4].
The second problem was extracted from Codeforces, a Russian platform that hosts programming
contests for the entire world. In this problem we had to implement a Dynamic Programming (DP)
approach. Dynamic programming always depends on subproblems and previous states; in this
problem, however, it was necessary to compute four solutions, each with an independent
dynamic programming approach. Hence, this is another problem that can be parallelized [5].
For these two problems, random data was generated, and as with the algorithms mentioned
above, they were tested several times: thirty runs each. The results are shown in Table 2.
The times in Table 2 for the parallelized version are much better than those of the serial
version, and both codes produced the same output.
Table 2: Parallelizable codes. T1: execution time before being processed by Auto Parallelizer.
T2: execution time after being processed by Auto Parallelizer.
5.2 Comparison
After implementing task parallelization and reduction variable recognition for loop
parallelization, we show a comparison with other tools that generate parallel code in
Figure 12.
6 Conclusions
Even the most advanced parallelizing compilers are unable to detect all the cases in which
parallelization is possible. In this paper we have presented a simple implementation of
automatic parallelization using only syntactic analysis, even though all the other tools take
advantage of semantic analysis. This work is part of our ongoing efforts toward bringing
young programmers to learn about the parallel and concurrent paradigm.
Competitive programming problems are oriented to sequential tasks. Even so, there are
problems that can be modeled in a parallel way to increase performance. Companies like Google
are already running competitions that involve parallel computing. This shows that the
parallel computing paradigm is growing and becoming more popular and useful than before.
[2] A. Barve, S. Khandelwal, S. Botre, S. Keshatiwar, N. Khan. Serial to parallel code
converter tools: A review. International Journal of Research in Advent Technology, 2016.
[3] H. Bae, D. Mustafa, J. Lee, Aurangzeb, H. Lin, C. Dave, R. Eigenmann, S. Midkiff. The
Cetus source-to-source compiler infrastructure: Overview and evaluation. International
Journal of Parallel Programming, 2013.
[7] A. Athavale, P. Ranadive, C. Rajguru, P. Pawar, V. Vaidya. Automatic sequential to
parallel code conversion. GSTF Journal on Computing, 2014.
[8] M. Gupta, S. Mukhopadhyay, N. Sinha. Automatic parallelization of recursive procedures.
International Journal of Parallel Programming, 2000.