
Auto Parallelizer: A source-to-source translator

M. Hernandez milhernandezcag@unal.edu.co
Engineering Faculty, Universidad Nacional de Colombia, Bogota, Colombia
D. A. Caceres daacaceressa@unal.edu.co
Engineering Faculty, Universidad Nacional de Colombia, Bogota, Colombia

Abstract
Parallelizing compilers have mainly focused on loop parallelization. This paper provides a way
to generate task parallelization based on syntax analysis. We present the construction of the call,
flow and dependency graphs, the information needed to proceed with the parallelization phase, and then
our proposed algorithm to find independent sections that can run as tasks in separate processes.
We then provide an algorithm to perform loop parallelization of reduction operations. As results,
we have parallelized some competitive programming tasks to show how parallelization can be
extended to a world that has been purely sequential. At the end, we compare the features imple-
mented in our software with those of other well-known parallelization tools.

Keywords
Automatic parallelization, task parallelization, flow graph, dependency graph.

1 Introduction
Parallel programming is the dominant paradigm of modern computer architectures; thus, programmers
must stop writing purely serial code and start learning and writing parallel code. In terms of execution
time, a program written in parallel can be slightly slower than, or up to ten times faster than, a serial
one, depending on the processor type and the number of cores it has.
Auto Parallelizer is a source-to-source translator whose main goal is to show the efficiency
of the parallel paradigm by inserting OpenMP pragmas into serial code using syntactic analysis,
with the help of ANTLR and the visitor pattern, so that the user can compare the execution time
of the serial code against that of the auto-generated parallel version. The tool aims to bring the
parallel paradigm closer to programmers, since automatic parallelization is a difficult process that
requires exhaustive analysis and is prone to errors; it is always better to write parallel code from
the beginning than to parallelize it in later stages of development.
This paper provides a brief introduction to the parallel paradigm, surveys different auto-
parallelization tools, gives a general overview of Auto Parallelizer and how it works, and compares
the execution times of serial codes written by competitive programmers against the parallel code
generated automatically by Auto Parallelizer.

2 Concepts
In order to understand the following sections, we will sketch out some important concepts related
to parallelization.
acc[0] = 0;
for (int i = 1; i <= n; ++i)
    acc[i] = acc[i - 1] + i;

Figure 1: Loop with data dependency between iterations

for (i = 0; i < n; ++i)
    for (j = 0; j < q; ++j)
        for (k = 0; k < m; ++k)
            c[i][j] += a[i][k] * b[k][j];

Figure 2: Loop without any dependency between iterations

2.1 Dependencies
It makes reference to sections of code that must be executed before other ones. This is the main
concept of auto parallelization since the code parallelization depends on whether or not there is a
dependency.

Control dependency: a section of code determines whether another one is going to be executed
or not; this is the case of conditionals such as if-else statements.

Data dependency: variables must be read and written in the same order, since reading and then
writing is not the same as writing and then reading.

2.2 Loop parallelization


A loop can be parallelized if its iterations are completely independent (dependency free).
In Figure 1 there is a data dependency between the i-th element of acc and the (i-1)-th; therefore,
the loop cannot be parallelized. On the other hand, in Figure 2 the elements of the matrix
c can be computed in any order since there are no dependencies, which means that the loops can be
parallelized.
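
For illustration, here is a minimal hand-written sketch (not output of the tool) of how the dependency-free loop nest of Figure 2 can be annotated with an OpenMP directive; n, q, m, the matrices and the loop variables are assumed to be declared as in Figure 2:

// i is the parallelized loop variable and is private by default; j and k are
// declared outside the loops in Figure 2, so they must be made private too.
#pragma omp parallel for private(j, k)
for (i = 0; i < n; ++i)
    for (j = 0; j < q; ++j)
        for (k = 0; k < m; ++k)
            c[i][j] += a[i][k] * b[k][j];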

2.3 Task parallelization


There are sections of code that can be executed in any order without changing the result.
In Figure 3 the matrix D must be initialized in the first loop before it can be used in the second
loop; hence, the order cannot be changed and parallelization is not possible. In contrast, in
Figure 4 the array A and the matrix B are initialized without any dependency; thus, we can
change the order of execution and, even better, run both parts in parallel.

2.4 Agglomeration
If there are n different independent sections of code, creating n threads and assigning one section to
each of them is not the most efficient approach, since creating and destroying threads takes time, and
context switching between threads also takes time; thus, there must be a balance between the number
of threads and the amount of work assigned to each of them.
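
For example, here is a minimal hand-written sketch of how OpenMP can bound the number of threads independently of the number of work units; numSections and doSection are hypothetical names used only for this illustration:

// numSections independent units of work are distributed over at most four
// threads instead of creating one thread per unit.
#pragma omp parallel num_threads(4)
{
    #pragma omp for
    for (int s = 0; s < numSections; ++s)
        doSection(s);   // hypothetical function: one independent unit of work
}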


for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
        if (G[i][j] != 0) D[i][j] = 1;
        else D[i][j] = INF;

for (int k = 0; k < n; ++k)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

Figure 3: Floyd-Warshall Algorithm

for (int i = 0; i < n; ++i)
    A[i] = i;

for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
        B[i][j] = i * j;

Figure 4: Task parallelization code example

2.5 Scheduling
Scheduling is the assignment of the operations that each thread is going to perform; there are two types:

Static: all the iterations perform the same amount of work. Ideal for iterations that take
roughly the same amount of time. Figure 5 is an example of this: if the loop is parallelized
over a fixed number of threads, each of them will execute a similar number of operations and
they will finish at roughly the same time.

Dynamic: each time a thread finishes its operations, new work is assigned to it. Ideal when
some iterations take longer than others, since waiting for threads to join can be expensive in
terms of execution time. Figure 6 is an example of this.

for (int i = 0; i < n; ++i) {
    A[i] = i + i;
    B[i] = 2 * i;
    C[i] = i * i;
    D[i] = i / 2;
}

Figure 5: Static scheduling code example



for (int i = 0; i < n; ++i) {
    if (i % 2 != 0)
        A[i] = i * i + sqrt(i);
    else {
        A[i] = 0;
        for (int j = 0; j < i; ++j)
            A[i] += B[i];
        A[i] /= i;
    }
}

Figure 6: Dynamic scheduling code example
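
Anticipating the OpenMP directives described in Section 2.6, the following is a minimal hand-written sketch of how each policy is requested with the schedule clause; n and the arrays A, B, C and D are assumed to be declared:

// Static scheduling: every iteration performs the same work (as in Figure 5),
// so the iterations are split into fixed, equally sized chunks.
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i) {
    A[i] = i + i;
    B[i] = 2 * i;
    C[i] = i * i;
    D[i] = i / 2;
}

// Dynamic scheduling: the cost of an iteration grows with i (uneven work, as
// in Figure 6), so idle threads request the next iteration as they finish.
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < n; ++i) {
    A[i] = 0;
    for (int j = 0; j < i; ++j)   // cost proportional to i
        A[i] += B[j];
}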

2.6 OpenMP
To achieve parallel computing it is necessary to use a parallel programming language or an
API that supports parallel execution. As mentioned, our parallelization process is performed
by inserting OpenMP directives.
The OpenMP API supports multi-platform shared-memory parallel programming in C/C++
and Fortran. The OpenMP API defines a portable, scalable model with a simple and flexible in-
terface for developing parallel applications on platforms from the desktop to the supercomputer [1].
We have taken two clauses from OpenMP to perform the parallelization:
Sections clause: it begins with the pragma omp parallel sections, and then each section specified
by the pragma omp section inside the sections region is executed by a different thread.
All the threads wait until the others have finished and then the program continues with what
comes after the sections pragma.
Reduction clause: it begins with the pragma omp parallel for reduction(op:var), where op is the
operator to be applied (currently the supported operators are addition, subtraction,
multiplication, bitwise or, bitwise and, bitwise xor, min and max) and var is the variable that
will be reduced.
A private copy initialized with the identity of the operator (identity op x = x) is created for
each thread; at the end the specified operator is applied between all copies and the final result
is written into the original variable.
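
As a small hand-written illustration of the reduction clause with the max operator (assuming an int array v of size n is declared and <climits> is included for INT_MIN):

int best = INT_MIN;   // matches the identity of the max operator
#pragma omp parallel for reduction(max:best)
for (int i = 0; i < n; ++i)
    best = v[i] > best ? v[i] : best;   // each thread updates its private copy
// after the loop, the maxima of all private copies are combined into best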

3 State of the art


In this section we introduce some of the current auto-parallelization tools, with a short
review of each one:
Intel C++ Compiler: although it is a very complete optimizer, it does not translate source to
source; it automatically generates a multithreaded form of applications in which most of
the computation is executed in simple loops. It supports MPI and OpenMP.
Par4All: this is an open-source project developed by the HPC Project. Par4All is a script for users
who are not versed in parallel programming but wish to produce parallel code in FORTRAN
and C by inserting OpenMP pragmas and using CUDA. It aims at generating parallel code with
almost no manual help required.

SUIF Compiler (Stanford University Intermediate Format): a framework for research on op-
timizing and parallelizing compilers, developed as a platform on which research on compiler
techniques for high-performance machines is carried out. It has been successfully used
to perform research on various concepts including loop transformations, array data depen-
dence analysis, software prefetching, scalar optimizations and instruction scheduling. The
SUIF system includes a parallelizer that automatically searches for parallel loops and gen-
erates the corresponding parallel code [2].

CAPO (Computer-Aided Parallelizer and Optimizer): a tool that automatically inserts com-
piler directives to enable parallel processing on shared-memory parallel machines. CAPO was
developed at NASA Ames Research Center.

Gupta, Mukhopadhyay, Sinha: although this tool is not recent and does not have a name,
it is a very complete one. It uses the IBM Toronto compiler and builds a compile-time analysis
framework that applies symbolic analysis over array intervals to parallelize recursive functions.
It also provides speculative run-time analysis when the compile-time analysis is not enough.

YUCCA (User Code Conversion Application): an automatic serial-to-parallel code con-
version tool introduced by KPIT Technologies Ltd., Pune. The tool accepts a C source
code file as input and generates transformed, multithreaded parallel code that uses pthread
functions and OpenMP constructs. It performs both task parallelization and loop-level
parallelization.

Cetus: a parallelizing compiler for C. Cetus can run on any system that supports the Oracle Java
runtime environment. It provides an internal C parser and is written in Java. Cetus uses
the basic techniques for parallelization and currently implements reduction variable
recognition, privatization, and induction variable substitution. The most recent version of
Cetus includes a GUI and a client-server model. Cetus compiles and runs the sequential
input and generates as output C code with OpenMP constructs. It also shows
charts of speedup and efficiency.

4 Code analysis
Auto Parallelizer consists of a Java parser written with the help of ANTLR; the Abstract Syntax
Tree (AST) is then traversed several times to gather the information required for
parallelization. The internal program representation (IR) is implemented as a Java class
hierarchy. This high-level representation provides a syntactic view of the source program to the pass
writer, making it easy to understand, access and transform the input program. The specification
of each class in the hierarchy is as follows:

AutoParallelizer: This class receives from the user the source code and the different flags with
which the tool is going to be used, and creates an instance of the Translator class.

Translator: This class has an instance of the class Program and is in charge of running the
translation algorithm, which is stated in Figure 7.



function AutoParallelizer(sourceCode)
    definedFunctions = visitFunctions()
    callGraph = buildCallGraph(definedFunctions)
    callGraph = deadCodeElimination(callGraph)
    for each function f in callGraph do
        f.buildFlowGraph()
    end for
    for each function f in callGraph do
        f.buildDependencyGraph()
    end for
    for each function f in callGraph do
        f.findIslands()
    end for
    for each function f in callGraph do
        openMPSourceCode += f.parallelize()
    end for
    return openMPSourceCode
end function

Figure 7: Translation algorithm

Program: This class has the list of functions of the program, the call graph of these functions
and the parallel code that is going to be given to the user at the end of the execution.

Function: Each instance of this class is one function written in the source code given by the
user. Its attributes are an ordered list of blocks of code (flow graph), the sets of alive and dead
variables used in this function, a reference to the head of the function in the AST, a depen-
dency graph and the list of islands that it possesses.

Block: This is the most atomic of all classes; each instance represents a section of code of a
function in the source code given by the user. Its attributes are an ordered list of
instructions (lines of code) and the sets of alive and dead variables used in this section of code.

4.1 Function call graph


The first thing to do is to find the call graph of all the functions defined by the user. The function
call graph is a directed graph in which every function in the source code is represented as a node
and there is a directed edge from node u to node v if in the body of function u there is a call
to function v.
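
Auto Parallelizer builds this graph in Java from the AST; the following C++ sketch only illustrates the idea under simplifying assumptions, with findCalls standing in for the actual AST traversal:

#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: returns the names of the functions called in a body.
std::vector<std::string> findCalls(const std::string &functionBody);

// Maps each user-defined function to the set of functions it calls.
std::map<std::string, std::set<std::string>>
buildCallGraph(const std::map<std::string, std::string> &functions) {
    std::map<std::string, std::set<std::string>> callGraph;
    for (const auto &f : functions) {
        callGraph[f.first];                      // every function is a node
        for (const auto &callee : findCalls(f.second))
            if (functions.count(callee))         // ignore calls to library functions
                callGraph[f.first].insert(callee);
    }
    return callGraph;
}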

4.2 Dead code elimination


When a C++ program is executed, its main function is called; therefore, if we traverse the
function call graph starting from the main node, we reach every function that might be
called at some point during the execution of the program. Consequently, if a function (node) is
never visited by this traversal, it is safe to assume that it is never going to be used and can be
eliminated from the source code.
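
Here is a minimal sketch of this reachability test, reusing the call-graph representation of the previous sketch: a depth-first traversal from main marks every reachable function, and any function left unmarked can be removed:

#include <map>
#include <set>
#include <string>

// Marks every function reachable from f; after calling it with "main",
// functions that are not in visited are dead code.
void markReachable(const std::string &f,
                   const std::map<std::string, std::set<std::string>> &callGraph,
                   std::set<std::string> &visited) {
    if (!visited.insert(f).second) return;   // already visited
    auto it = callGraph.find(f);
    if (it == callGraph.end()) return;
    for (const auto &callee : it->second)
        markReachable(callee, callGraph, visited);
}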


typedef long long ll;

int add(int a, int b);
int sub(int a, int b);
int mult(int a, int b);
double mod_mul(ll a, ll b, ll mod);

double mod_pow(ll b, ll e, ll mod) {
    ll r = 1;
    return mod_mul(r, b, mod);
}

int main() {
    int d[2] = {1, 2}, c[2] = {0, 1}, i = 1;
    add(d[i - 1], d[i - 1]);
    sub(d[i], d[c[i] - 1]);
}

Figure 8: Function call graph and dead code elimination example

4.3 Control flow graph


For every function in the code a control flow graph is built: a linear chain graph in
which every node is an instance of Block, representing the order in which the sections of code of
a function are executed in the serial code (the source code given by the user). The algorithm to
partition a function into blocks is the following (a C++ sketch follows the steps below):

1. While the instruction (line of code) is not a control structure (while, do while, if-else, for) add
it to the current block.

2. Add the block to the control flow graph.

3. If there are more instructions repeat from step 1.
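
Here is a rough C++ sketch of this partitioning, where Instruction and the decision of giving each control structure its own block are simplifying assumptions about the tool's Java classes:

#include <vector>

struct Instruction { bool isControlStructure = false; /* source text, etc. */ };
struct Block { std::vector<Instruction> instructions; };

// Plain instructions accumulate into the current block; a control structure
// (if, for, while, do-while) closes it and forms a block of its own.
std::vector<Block> partitionIntoBlocks(const std::vector<Instruction> &body) {
    std::vector<Block> flowGraph;
    Block current;
    for (const Instruction &ins : body) {
        if (ins.isControlStructure) {
            if (!current.instructions.empty()) flowGraph.push_back(current);
            flowGraph.push_back(Block{{ins}});
            current.instructions.clear();
        } else {
            current.instructions.push_back(ins);
        }
    }
    if (!current.instructions.empty()) flowGraph.push_back(current);
    return flowGraph;
}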

4.4 Data dependency graph (DDG)


The next step is to build a new graph that stores the information about data dependencies
throughout every function. The DDG is an undirected graph in which the same blocks from the
control flow graph are used as nodes, and there is an edge between node u and node v if block v
is further down in the control flow graph than block u and one of the following two conditions
holds (a sketch of this test follows the list):

An alive variable of v is first killed in u.

A dead variable of v is alive in u.
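
Concretely, if each block stores the set of variables it reads ("alive") and the set it writes ("dead", i.e. killed), a hedged sketch of the edge test between an earlier block u and a later block v could look like this; BlockVars is a hypothetical simplification of the tool's Block class:

#include <set>
#include <string>

struct BlockVars {
    std::set<std::string> alive;   // variables read in the block
    std::set<std::string> dead;    // variables written (killed) in the block
};

// u comes before v in the control flow graph; an edge is added if v reads
// something u writes, or v writes something u reads.
bool dependent(const BlockVars &u, const BlockVars &v) {
    for (const auto &var : v.alive)
        if (u.dead.count(var)) return true;    // alive in v, first killed in u
    for (const auto &var : v.dead)
        if (u.alive.count(var)) return true;   // dead in v, alive in u
    return false;
}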

4.5 Islands
After the DDG is completely built, we proceed to find the set of islands of dependencies. An
island is a group of blocks connected in the data dependency graph: if there is an edge between
block u and block v in the DDG, then u and v must be in the same island. Each block is assigned
to one and only one island.
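
Since an island is simply a connected component of the DDG, here is a minimal sketch of the island assignment using a depth-first search over an adjacency list; block indices refer to positions in the control flow graph:

#include <cstddef>
#include <vector>

// ddg[u] lists the blocks connected to block u; the result maps each block
// index to the id of its island (its connected component).
std::vector<int> findIslands(const std::vector<std::vector<int>> &ddg) {
    std::vector<int> island(ddg.size(), -1);
    int islands = 0;
    for (std::size_t start = 0; start < ddg.size(); ++start) {
        if (island[start] != -1) continue;
        std::vector<std::size_t> stack = {start};   // iterative DFS
        island[start] = islands;
        while (!stack.empty()) {
            std::size_t u = stack.back();
            stack.pop_back();
            for (int v : ddg[u])
                if (island[v] == -1) {
                    island[v] = islands;
                    stack.push_back(static_cast<std::size_t>(v));
                }
        }
        ++islands;
    }
    return island;
}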



int main () {
    a = 10;
    b = a + 15;
    function(a);
    c = 50;
    d = 6;
    function(a);
    for (int i = 0; i < c; ++i)
        d += i;
    function(a);
    function(a, b + 10);
    int e = 18;
    a = 32;
    b = e + 15;
}

Figure 9: Control flow graph and DDG example

4.6 Parallelization algorithm


Then, after all this information has been gathered, the software is ready to execute the parallelization
itself, which consists of two parts:
Task parallelization: this software is mainly focused on task parallelization. To achieve it,
we define an algorithm that divides the blocks into sections, assigning consecutive blocks u
in the control flow graph to a section that does not already have a block from the same island as u.
This guarantees that every subsection (block) in a section is executed in parallel with blocks
that belong to other islands; each pair of blocks in a section is dependency-free (a sketch of this
grouping follows below).
Reduction [3]: for each block that consists of a for loop, the tool checks two restrictions:
The loop has one or many expressions of the form rv = rv + expr.
rv is not used in any other part of the loop.
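
A hedged sketch of the grouping step for task parallelization, under one plausible reading of the description above: blocks are walked in control-flow order and the current group is extended while the next block belongs to an island not yet present in it; each resulting group becomes a parallel sections region and each of its blocks an omp section:

#include <set>
#include <vector>

// island[b] is the island id of block b (see Section 4.5); the result lists
// the block indices of each "parallel sections" group.
std::vector<std::vector<int>> groupIntoSections(const std::vector<int> &island) {
    std::vector<std::vector<int>> groups;
    std::vector<int> current;
    std::set<int> usedIslands;
    for (int b = 0; b < static_cast<int>(island.size()); ++b) {
        if (usedIslands.count(island[b])) {   // a block of the same island is
            groups.push_back(current);        // already here: close the group
            current.clear();
            usedIslands.clear();
        }
        current.push_back(b);
        usedIslands.insert(island[b]);
    }
    if (!current.empty()) groups.push_back(current);
    return groups;
}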


#include <iostream>

using namespace std;

int main () {
    int n = 10, sum1 = 0, sum2 = 0;
    for (int i = 0; i < n; ++i)
        sum1 += i;
    for (int i = 0; i < n; ++i)
        sum2 += 2 * i;
    cout << sum1 + sum2 << "\n";
}

Figure 10: Code example before being parallelized

#include <iostream>
using namespace std;
int main () {
    int n = 10, sum1 = 0, sum2 = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            #pragma omp parallel for reduction(+:sum2)
            for (int i = 0; i < n; ++i)
                sum2 += 2 * i;
        }
        #pragma omp section
        {
            #pragma omp parallel for reduction(+:sum1)
            for (int i = 0; i < n; ++i)
                sum1 += i;
        }
    }
    cout << sum1 + sum2 << "\n";
}

Figure 11: Code from Figure 10 after being parallelized



5 Results
5.1 Execution times of parallelized codes
To test the software, several examples were chosen to be parallelized, from simple ones such as
the code shown in Figures 10 and 11 to more complex ones extracted from competitive
programming judges. These codes were chosen to test both parallelizable and non-parallelizable
situations.
Some popular algorithms chosen to test the non-parallelizable situations are listed in Table 1.

Algorithm              T1       T2       Same output
Floyd-Warshall         4.2s     4.4s     Yes
Dijkstra               2.7s     2.7s     Yes
DFS graph traversal    0.5s     0.55s    Yes
BFS graph traversal    0.5s     0.5s     Yes

Table 1: Non-parallelizable codes. T1: execution time of the original serial code. T2: execution
time of the code generated by Auto Parallelizer.

None of the algorithms tested in Table 1 has significant sections of code that can be
parallelized. Auto Parallelizer only found independent sections corresponding to the
initialization of scalar variables, which do not affect the execution time. As all of them are
graph algorithms, the sets of inputs were generated randomly and each code was run
several times: thirty runs for both the serial and the parallelized versions.
We then moved on to codes that can actually be parallelized. First, we tested classical
examples such as independent loops, independent calls to functions and procedures, minimum and
maximum reduction variables, and some others. Then we proceeded to test our software with more
complex, real problems; as stated at the beginning, we wanted to show how parallelization could be
brought into a world that has been purely sequential: competitive programming contests.
Competitive programming problems have traditionally been sequential, because most of them
cannot be solved in a parallel way. It was difficult to find problems
that met the requirement of having potentially independent tasks (with a lot of computation), but
fortunately we found some of them. Here we show two.
The first problem was extracted from the Red de Programacion Competitiva, a Colombian
organization that organizes programming contests around Latin America. The solution to this prob-
lem involved an implementation of the Fast Fourier Transform (FFT), whose code contains many
independent tasks [4].
The second problem was extracted from Codeforces, a Russian platform that hosts program-
ming contests for the entire world. For this problem we had to implement a Dynamic Programming
(DP) approach. Dynamic programming always depends on subproblems and previous states,
but in this problem it was necessary to compute four solutions, each of them with an independent
dynamic programming formulation; hence, this is another problem that can be parallelized [5].
For these two problems random data was generated and, as with the algorithms mentioned
above, each code was run thirty times. The results are shown in Table 2: the times for the
parallelized versions are much better than those of the serial versions, and both codes produced
the same output.


Algorithm    T1        T2       Same output
FFT          23.34s    9.89s    Yes
DP           0.42s     0.26s    Yes

Table 2: Parallelizable codes. T1: execution time of the original serial code. T2: execution time
of the code generated by Auto Parallelizer.

Figure 12: Feature comparison with existing parallelization tools

5.2 Comparison
After implemented the task parallaelization and reduction variable recognition for loops paral-
lelization. We show a comparison with other tools that generate parallel code. Figure 12.

6 Conclusions
Even the most advanced parallelizing compilers are not able to detect all the cases in which
parallelization is possible. In this paper we have presented a simple implementation of automatic
parallelization that uses only syntactic analysis, even though all the other tools take advantage of
semantic analysis. This work is part of our ongoing effort to bring young programmers
to learn about the parallel and concurrent paradigm.
Competitive programming problems are oriented towards sequential tasks. Even so, there are
problems that can be modeled to benefit from parallelism, and companies like Google already run
competitions that involve parallel computing. This shows that the parallel computing paradigm
keeps growing and is becoming more popular and useful than before.



References
[1] The OpenMP API specification for parallel programming.

[2] Barve A., Khandelwal S., Khan N., Keshatiwar S., Botre S. Serial to parallel code converter tools:
A review. International Journal of Research in Advent Technology, 2016.

[3] Bae H., Mustafa D., Lee J., Aurangzeb, Lin H., Dave C., Eigenmann R., Midkiff S. The Cetus
source-to-source compiler infrastructure: Overview and evaluation. International Journal of
Parallel Programming, 2013.

[4] Red de Programacion Competitiva.

[5] Codeforces.

[6] Rus S., Pennings M., Rauchwerger L. Sensitivity analysis for automatic parallelization on multi-
cores. 2007.

[7] Athavale A., Ranadive P., Rajguru C., Pawar P., Vaidya V. Automatic sequential to parallel code
conversion. GSTF Journal on Computing, 2014.

[8] Gupta M., Mukhopadhyay S., Sinha N. Automatic parallelization of recursive procedures.
International Journal of Parallel Programming, 2000.

[9] Ahn J., Han T. An analytical method for parallelization of recursive functions.

