Gheorghe Almasi
IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA

Basilio B. Fraguela
Dept. de Electrónica e Sistemas, Universidade da Coruña, Spain

Jia Guo
Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign, USA

José Moreira
IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA

David Padua
Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign, USA
Abstract. A natural way to express parallel computations is to use objects that encapsulate parallel operations. Programs built using such objects can be highly readable and easy to maintain. In this paper, we show our first experiences with a class of objects that encapsulate parallelism, the Hierarchically Tiled Arrays (HTAs). We have implemented HTAs and their methods as a MATLAB toolbox that overloads conventional operators and array functions in such a way that HTA operations appear to the programmer as natural language extensions. HTAs allow the construction of single-threaded parallel MATLAB programs where a master distributes tasks to be executed in a collection of servers where the components (tiles) of the HTAs to be manipulated reside. The tiled and recursive nature of the HTAs also facilitates the development of algorithms with a high degree of locality, a desirable feature given the growing processor-memory speed gap.
1 Introduction
Parallel programs are difficult to develop and maintain. This is particularly true in the case of distributed memory machines, where every piece of data that needs to be accessed by two or more processors must be communicated by means of messages in the program, and where the user must make sure that every machine is working with the latest version of the data. Parallel execution also makes debugging and tuning difficult. The language and compiler community has come up with several approaches to help programmers deal with these issues.
This work has been supported in part by the Ministry of Science and Technology of Spain under contract TIC2001-3694-C02-02, by the Xunta de Galicia under contract PGIDIT03-TIC10502PR, and by the Defense Advanced Research Projects Agency under contract NBCH30390004. This work is not necessarily representative of the positions or policies of the Army or Government.
The first approach to ease the burden on the programmer when developing distributed memory programs was based on standard message passing libraries like MPI [7] or PVM [6], which improve the portability of the applications. Still, data distribution and synchronization must be completely managed by the programmer. Also, the SPMD programming model of the codes that use these libraries gives rise to unstructured codes in which communication may take place between widely separated sections of code and in which a given communication statement could interact with different statements during the execution of the program. Programming languages like Co-Array Fortran [9] and UPC [4] improve the readability of the programs by replacing explicit communications with array assignments, but they still have all the drawbacks of the SPMD approach.

Another strategy to improve the programmability of distributed memory environments consists of allowing a single thread of execution and letting the compiler take care of all the problems related to the distribution of the data and the parallel tasks. This is, for example, the approach of High Performance Fortran [8]. Unfortunately, compiler technology does not seem to have reached a level at which compilers generate competitive code for this kind of approach.

In this paper we explore the possibility of extending a single-threaded object-oriented programming language with a new class, called Hierarchically Tiled Array or HTA [2], that encapsulates the parallelism in the code. HTA operators overload the standard operators of the language. HTA storage, as well as operations on HTAs, is distributed among a collection of servers. The HTA class provides a flexible indexing scheme for its tiles that allows data movement among the servers to be expressed by means of simple array assignments. As a result, HTA-based programs look as if they had a single thread of execution, but are actually executed on a number of processors.
This improves the readability and ease of development and maintenance of HTA code. Furthermore, the compiler support required by our parallel programming approach is minimal, since the implementation of the class itself takes care of the parallelization. The tiled nature of the matrices stored in HTAs also allows us to use them to express locality, a very interesting property given the growing gap between the speed of memories and that of processors. We have found in MATLAB™ an ideal platform for the implementation and testing of HTAs, since it provides a high-level programming language with object-oriented features that is easy to extend using the toolbox approach.

The rest of this paper is structured as follows. HTA syntax and semantics are described in the next section. Section 3 provides several code examples. Section 4 describes the implementation of the HTA language extension as a MATLAB™ toolbox. Section 5 evaluates the performance of this implementation and compares its ease of use with that of a traditional MPI+C/Fortran approach. The last section is devoted to our conclusions and future work.
[Figure 1: Two-level tiling of HTAs, showing the top level and the bottom level of tiles: (a) HTA A; (b) HTA B.]

[Figure 2: (a) non-homogeneous HTA C; (b) homogeneous HTA D.]
all tiles of an HTA agree in the number and size of the partitions along any dimension with their corresponding neighboring tiles. Fig. 2(a) depicts a non-homogeneous HTA with mismatches in the sizes and the numbers of subdivisions of its first-level tiles (the ones separated by thick dashed lines). Fig. 2(b) shows a homogeneous HTA.
The simplest case of argument compatibility is that of two HTAs with compatible topologies. In the case of the + operator this means identical topologies; for the * operator it means that the second dimension of the first argument must have the same structure, down to the elements, as the first dimension of the second argument. The only other rule we use for argument compatibility is that of scalar (or HTA) expansion, or replication. For example, an operation between an HTA and a scalar is made legal by expanding the scalar to an HTA of compatible structure whose elements are replicated from the scalar. By the same token, a binary operation between an HTA and an array is legal if there is a tiling level in the HTA at which the (possibly flattened) tiles below are compatible with the array. In this case, the array is expanded into an HTA by replication, and becomes compatible with the HTA. In a binary operation between two HTAs, one of the arguments can also be expanded to match the structure of the other. This may or may not be possible, depending on the arguments' structures.
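To make the expansion rule concrete, the following Python sketch (hypothetical, not part of the toolbox; `expand_scalar` and `tile_add` are illustrative names) models tiles as nested lists and shows how a scalar operand can be replicated to match the tile structure of an HTA before an elementwise operation:

```python
# Illustrative sketch only (not the toolbox API): HTA tiles are modeled as
# nested Python lists, with scalars at the leaves.
def expand_scalar(scalar, tiles):
    """Replicate `scalar` so that it mirrors the tile structure of `tiles`."""
    if isinstance(tiles, list):
        return [expand_scalar(scalar, t) for t in tiles]
    return scalar

def tile_add(x, y):
    """Elementwise +; a scalar operand is first expanded to the other
    operand's structure, as the scalar-expansion rule above describes."""
    if not isinstance(x, list) and not isinstance(y, list):
        return x + y
    if not isinstance(x, list):
        x = expand_scalar(x, y)
    if not isinstance(y, list):
        y = expand_scalar(y, x)
    return [tile_add(a, b) for a, b in zip(x, y)]
```

For instance, `tile_add([[1, 2], [3, 4]], 10)` expands the scalar 10 into the two-tile structure before adding tile by tile.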
[Figure 4: Main loop of Cannon's algorithm using HTAs: in each iteration i, each server accumulates c = c + a * b on its local tiles, and the tiles of a and b are then circular-shifted by means of indexed assignments.]
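The circular-shift communication pattern of Cannon's algorithm can be modeled sequentially in plain Python (an illustrative sketch, not toolbox code: each matrix entry plays the role of the tile held by one server, and the initial skew that aligns the tiles, which the figure does not show, is written out explicitly as an assumption):

```python
def cannon_matmul(A, B):
    """Cannon's algorithm with scalar 'tiles': entry (i, j) stands for the
    tile that processor (i, j) of the mesh would hold."""
    n = len(A)
    # Initial alignment (assumed): skew row i of A left by i positions and
    # column j of B up by j positions.
    a = [A[i][i:] + A[i][:i] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Local multiply-accumulate on every 'server'.
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
        # Circular shifts: a moves west by one, b moves north by one.
        a = [row[1:] + row[:1] for row in a]
        b = b[1:] + b[:1]
    return c
```

After n iterations, every position holds its tile of the product, mirroring the c = a * b result described in the text.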
of the source array. Each partition vector contains the points where hyperplanes will cut the input matrix in the corresponding dimension to form tiles. This form of the constructor can also be used to generate homogeneous HTAs with several levels of tiling, like D in Fig. 2(b), by applying tiling in a bottom-up fashion. D could be created with either of the two following statements, which yield exactly the same result:

   D = hta(MX, {[1,2,6,8,9],[1,3,8,12]}, {[1,2,4],[1,3]});
   D = hta(B, {[1,2,4],[1,3]});

The value 1 at the beginning of every partition vector is included here for clarity, but is not required. It is also possible to build empty HTAs whose tiles are filled in later. For example, the following piece of code replicates the matrix MX in the four empty tiles of HTA F:

   F = hta(2, 2);
   F{:,:} = MX;

The examples discussed above generate non-distributed HTAs, which are located only in the processor executing the code. To distribute the top-level tiles of an HTA on a mesh of processors, add as the last argument of the constructor a vector describing the dimensions of the mesh. The distribution of the tiles is always block cyclic in the current implementation. So, for example, F would have been distributed on a 2 × 2 mesh of processors just by writing

   F = hta(2, 2, [2, 2]);
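The cutting performed by the partition vectors, and the block-cyclic mapping of top-level tiles onto the mesh, can be modeled in plain Python (an illustrative sketch; `tile_matrix` and `owner` are hypothetical names, and matrices are lists of rows):

```python
def tile_matrix(M, row_part, col_part):
    """Cut matrix M (a list of rows) into a grid of tiles. row_part and
    col_part are partition vectors holding the 1-based first row/column of
    each tile, with the leading 1 present, as in hta(MX, {rows, cols})."""
    rb = [r - 1 for r in row_part] + [len(M)]      # 0-based row boundaries
    cb = [c - 1 for c in col_part] + [len(M[0])]   # 0-based column boundaries
    return [[[row[cb[j]:cb[j + 1]] for row in M[rb[i]:rb[i + 1]]]
             for j in range(len(cb) - 1)]
            for i in range(len(rb) - 1)]

def owner(i, j, mesh):
    """Block-cyclic owner of top-level tile (i, j) on a mesh of
    mesh[0] x mesh[1] servers, as in the distribution described above."""
    return (i % mesh[0], j % mesh[1])
```

For a 4 × 3 matrix, `tile_matrix(M, [1, 3], [1, 2])` produces a 2 × 2 grid of tiles cut after row 2 and after column 1.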
while dif > epsilon
   v{2:n,:}(1,:)     = v{1:n-1,:}(d+1,:);
   v{1:n-1,:}(d+2,:) = v{2:n,:}(2,:);
   v{:,2:n}(:,1)     = v{:,1:n-1}(:,d+1);
   v{:,1:n-1}(:,d+2) = v{:,2:n}(:,2);
   u{:,:}(2:d+1,2:d+1) = K * (v{:,:}(2:d+1,1:d) + v{:,:}(1:d,2:d+1) + ...
                              v{:,:}(2:d+1,3:d+2) + v{:,:}(3:d+2,2:d+1));
   maxdifhta = max(abs(v - u));
   dif = max(max(maxdifhta(:,:)));
   v = u;
end
Figure 5: Parallel Jacobi relaxation

The product of the HTAs a and b in the last line of code has the effect of multiplying the corresponding tiles of both HTAs. That is, a sparse matrix-vector multiply takes place at each server containing corresponding tiles of a and b. The result is a vector r, distributed across the servers with the same mapping as a and b. The HTA resulting from the operation can be flattened back into a vector by using the r(:) notation. The code completely hides the fact that MX is sparse, because MATLAB™ provides the very same syntax for dense and sparse computations, a feature our HTA class has inherited.

While the previous example only required communication between the client and each individual server, Cannon's matrix multiplication algorithm [3], our second example, also requires communication between the servers. The algorithm has O(n) time complexity and uses O(n^2) processors (servers). In our implementation of the algorithm, the operands, denoted a and b respectively, are HTAs tiled in two dimensions. The HTAs are mapped onto a mesh of n × n processors. In each iteration of the algorithm's main loop, shown in Fig. 4, each server first executes a matrix multiplication of the tiles of a and b that currently reside on that server. The result of the multiplication is accumulated in a (local) tile of the result HTA, c. Then, the tiles of both a and b are circular-shifted along one dimension. The tiles of b are shifted north, in such a way that the top processor in each column transfers its current tile of b to the bottom processor. Similarly, the tiles of a are shifted west (to the left), with the left-most processor in each row sending its tile of a to the right-most processor of its row. After n iterations, each server holds the correct value for its associated tile in the HTA c = a * b.

Referencing arbitrary elements of HTAs results in complex communication patterns. The blocked Jacobi relaxation code in Fig. 5 requires the four neighbors of a given element to compute its new value. Each block is represented by a tile of the HTA v. In addition, the tiles also contain extra rows and columns for use as buffers for the border regions when exchanging information with the neighbors. The border exchange is executed in the first four statements of the main loop. Thus the computation step uses only local data. In this case the flattened version of the HTA v does not quite represent the desired end result,
because of the existence of the border exchange regions. However, the desired matrix can be obtained by first removing the border regions and then applying the flattening operator: (v{:,:}(2:d+1,2:d+1))(:,:).

Figure 6: HTA implementation in MATLAB™
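The border-exchange-then-compute pattern of Fig. 5 can also be modeled outside MATLAB in plain Python (an illustrative sketch under stated assumptions: tiles are (d+2) × (d+2) nested lists keyed by tile position, the exchange mutates the tiles in place, and the coefficient K is left as a parameter):

```python
def jacobi_sweep(v, n, d, K=0.25):
    """One sweep of a blocked Jacobi relaxation: exchange halo rows/columns
    between neighboring tiles, then update each tile's interior from its own
    (now up-to-date) data. v maps (I, J) -> (d+2) x (d+2) tile."""
    # Halo exchange, mirroring the four indexed assignments of the HTA code.
    for I in range(1, n):
        for J in range(n):
            for c in range(d + 2):
                v[(I, J)][0][c] = v[(I - 1, J)][d][c]      # from north neighbor
                v[(I - 1, J)][d + 1][c] = v[(I, J)][1][c]  # from south neighbor
    for I in range(n):
        for J in range(1, n):
            for r in range(d + 2):
                v[(I, J)][r][0] = v[(I, J - 1)][r][d]      # from west neighbor
                v[(I, J - 1)][r][d + 1] = v[(I, J)][r][1]  # from east neighbor
    # Local computation: new interior = K * (west + north + east + south).
    u = {}
    for (I, J), t in v.items():
        nt = [row[:] for row in t]
        for r in range(1, d + 1):
            for c in range(1, d + 1):
                nt[r][c] = K * (t[r][c - 1] + t[r - 1][c] +
                                t[r][c + 1] + t[r + 1][c])
        u[(I, J)] = nt
    return u
```

Because the halos are refreshed first, the update loop touches only data that is local to each tile, just as in the HTA version.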
for K=1:m
   for I=1:m
      for J=1:m
         c{I,J} = c{I,J} + a{I,K} * b{K,J};
      end
   end
end

Figure 7: Matrix multiplication using HTAs
5 Evaluation
Our toolbox has been implemented and tested on an IBM SP system that consists of two nodes of 8 Power3 processors running at 375 MHz and sharing 8 GB of memory each. Two configurations have been used in the experiments: one with 4 processors used as servers, and another one with 9 servers. Each node contributes half of the servers. There is an additional processor that executes the main thread of the program, thus acting as the client or master of the system. Although the main focus of our work is the achievement of parallelism, it is interesting to remember that HTAs can be used to express locality too. Experiments on our base system showed that, for large matrices (order > 2000), the product of matrices stored entirely in the client as HTAs is about 8% faster than the native MATLAB™ matrix product routine.
5.1 Benchmarks
In this subsection we present the six benchmarks we used to evaluate the HTA toolbox. smv implements the sparse matrix-vector product shown in Fig. 3; cannon is a version of Cannon's matrix multiply algorithm [3] (the main loop is depicted in Fig. 4); jacobi is the Jacobi relaxation code shown in Fig. 5; summa represents the well-known SUMMA [5] matrix multiplication algorithm; lu is an implementation of the LU factorization algorithm; finally, mg is one of the NAS benchmarks [1]. Benchmarks smv, cannon and jacobi have already been described in Sect. 3; we now briefly describe the other three benchmarks.

As for summa, consider the matrix multiplication using HTAs shown in Fig. 7. The inner two loops are parallel with respect to the tiles of c. However, each tile of a and b is used in the calculation of m tiles of c. Thus, for a parallel algorithm, a and b need to be replicated so that they become local in the servers that hold the corresponding tiles of c. This replication is done by an HTA-aware overloaded version of the MATLAB™ repmat function, which takes as arguments an HTA and parameters specifying the number of copies to make in each dimension. Since the HTAs are distributed across processors, the repmat implementation involves communication operations. Fig. 8 shows the SUMMA algorithm in HTA representation.

A blocked LU decomposition algorithm is implemented in lu. In A = LU, A is tiled into K × K blocks. We use the HTA notation A{I,J} to denote the block of A at position (I, J). In each iteration I from 1 to K, A{I,I} is factorized into L{I,I} and U{I,I}. Then row I and column I are updated using L{I,I} and U{I,I}. Finally, the remaining matrix A{I+1:K, I+1:K} is modified using the updated row I and column I.
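The replicate-then-multiply structure of SUMMA can be sketched sequentially in plain Python (an illustrative model under assumptions, not the HTA code of Fig. 8: each entry stands for one tile, and the two list comprehensions play the role of the repmat broadcasts):

```python
def summa_matmul(A, B):
    """SUMMA with scalar 'tiles': in step k, column k of A is replicated
    across each processor row and row k of B across each processor column,
    after which every position (i, j) performs a local multiply-accumulate."""
    m = len(A)
    c = [[0] * m for _ in range(m)]
    for k in range(m):
        a_col = [A[i][k] for i in range(m)]   # replicated along the rows
        b_row = [B[k][j] for j in range(m)]   # replicated along the columns
        for i in range(m):
            for j in range(m):
                c[i][j] += a_col[i] * b_row[j]
    return c
```

Each of the m steps is an outer product of the replicated column and row, so after m steps c accumulates the full product.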
Figure 9: Speedup for HTA and MPI+C/Fortran on 4 servers (a) and 9 servers (b)

The NAS benchmark mg [1] solves Poisson's equation, ∇²u = v, in 3D using a multigrid V-cycle. The algorithm carries out computation at a series of levels, each of which defines a grid at a successively coarser resolution. The code for the parallel version of mg using HTAs is a simple extension of the serial version in MATLAB™. In the MATLAB™ code each grid level is defined as a 3D array, while in the parallel version the levels are defined as 3D HTAs equally distributed along each dimension. Since the grid is distributed among processors, communication of neighboring grid points is performed at the appropriate places using simple assignment operations, as in jacobi.
of the parallel MPI program. The execution times of all the parallel MATLAB™ implementations are at most a factor of two slower than their MPI counterparts, except for mg, where the parallel MPI version is more than 30 times faster. This shows that the overhead introduced by MATLAB™ is, except for mg, not very high, and therefore the speedups obtained are mainly the result of parallelizing useful computation and not of parallelizing the overhead.

There are two main reasons for the good performance of the HTA versions of cannon and summa. First, using the HTA representation, the main bodies of cannon and summa contain only 3 lines, as shown in Figs. 4 and 8, respectively, resulting in very little MATLAB™ interpretation overhead. Second, the kernels of both cannon and summa consist of two calls to the circshift or repmat functions and a matrix multiplication in HTA form. In cannon, the circular shifts of the tiles of the matrices were implemented using an overloaded version of the MATLAB™ circshift function, which shifts the contents of an array an arbitrary number of positions along any desired dimension(s). This implementation is more efficient than the one shown in Fig. 4. The HTA repmat and circshift methods used in summa and cannon, respectively, have been implemented very efficiently in the toolbox; they are therefore comparable in performance to the similar functionality in the corresponding MPI code. As for the matrix multiplication, it is highly parallel and involves little overhead, since it is local to each server.

Benchmark mg consists of a series of local computations with communications interleaved between them. Since the problem size is large (a 256 × 256 × 256 grid), the cost of the local computations greatly dominates that of the communication, and hence the speedup is significant.

The HTA speedups are smaller than those of MPI+C/Fortran for some benchmarks. In lu, HTAs perform similarly to MPI on 4 servers; on 9 servers, performance drops.
The speedups of our HTA versions of smv and jacobi are clearly lower than those of MPI, both for 4 and for 9 servers. These lower speedups are due to extra overhead built into our implementation. One such cause of extra overhead is the time taken to broadcast HTA commands to the servers: MATLAB™ is an interpreted language, and all HTA operations need to be broadcast from the client to the servers before they are executed. If an HTA operation handles relatively small amounts of raw data, broadcast overhead dominates execution time. By comparison, in an MPI program this broadcast is unnecessary because program execution is governed locally.

Another source of overhead is caused by the limitations of the MATLAB™ extension interface. MATLAB™ user-defined functions cannot receive arguments passed by reference; they must be passed by value. As a result, our current implementation, written in MATLAB™, must make a full copy of the left-hand side of any indexed HTA assignment operation. Copying an HTA with every assignment constitutes a potentially crippling overhead. This effect is visible in the speedups of lu and jacobi. MATLAB™ itself does not suffer from this overhead, as the indexed assignment operation is a built-in function to which arguments are passed by reference.

Thus, the current sources of overhead are mostly due to implementation issues and not to inherent limitations of the HTA approach. Command broadcast overhead can be mitigated by sending pre-compiled snippets of code to the servers for execution (although this implies the existence of a compiler at the client). By re-implementing the indexed assignment method in C instead of MATLAB™, we can mitigate the overhead caused by excessive copying of HTAs.
[Table: per-benchmark figures for the HTA and MPI versions (column headings lost in extraction):]

   cannon    1   18   14   189
   jacobi   31   41   48   364
   summa     1   24   14   261
   lu        1   24   33   342
6 Conclusions
We present a novel approach to writing parallel programs in object-oriented languages using a class called Hierarchically Tiled Arrays (HTAs). The objects of this class are arrays divided into tiles which may be distributed on a mesh of processors. HTAs allow the expression of parallel computation and data movement by means of indexed assignment and computation operators that overload those of the host language. HTA code is executed in a master-slave environment where commands from the client executing HTA code are transmitted to the servers that actually hold the data. HTAs improve the ability of the programmer to reason about a parallel program, particularly when compared to code written using the SPMD programming model. HTA tiling can also be used to express memory locality in linear algebra routines. We have implemented our new data type as a MATLAB™ toolbox, and we have written a number of benchmarks in the MATLAB™ + HTA environment. The benchmarks are easy to read, understand and maintain, as the examples shown throughout the paper illustrate. While the performance of HTA codes is competitive with that of traditional SPMD codes using MPI for many benchmarks, there are situations when HTAs suffer from overhead problems and fall behind in performance. The two main reasons for the additional overhead suffered by the
HTA codes are related to details of our current implementation, not to the HTA approach itself. First, the current implementation combines the interpreted execution of MATLAB™ with the need to broadcast each command to a remote server. This could be mitigated in the future by more intelligent ahead-of-time broadcasting of commands or by the deployment of a compiler. The second cause of overhead is the need to use intermediate buffering and to replicate pieces of data, not because of algorithmic requirements, but due to the need to operate inside the MATLAB™ environment. This effect can be reduced by a more careful implementation. In summary, we consider the HTA toolbox a powerful tool for the prototyping and design of parallel algorithms, and we plan to make it publicly available soon.
References
[1] NAS Parallel Benchmarks. Website. http://www.nas.nasa.gov/Software/NPB/.
[2] G. Almasi, L. De Rose, B. B. Fraguela, J. Moreira, and D. Padua. Programming for locality and parallelism with hierarchically tiled arrays. In L. Rauchwerger, editor, Proc. of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2003), Lecture Notes in Computer Science, vol. 2958, College Station, Texas, Oct 2003. Springer-Verlag.
[3] L. E. Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. PhD thesis, Montana State University, 1969.
[4] W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, and K. Warren. Introduction to UPC and language specification. Technical Report CCS-TR-99-157, IDA Center for Computing Sciences, 1999.
[5] R. A. van de Geijn and J. Watts. SUMMA: Scalable Universal Matrix Multiplication Algorithm. Concurrency: Practice and Experience, 9(4):255-274, April 1997.
[6] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidyalingam S. Sunderam. PVM: Parallel Virtual Machine: A User's Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, MA, USA, 1994.
[7] W. Gropp, E. Lusk, and A. Skjellum. Using MPI (2nd ed.): Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1999.
[8] C. Koelbel and P. Mehrotra. An overview of High Performance Fortran. SIGPLAN Fortran Forum, 11(4):9-16, 1992.
[9] R. W. Numrich and J. Reid. Co-Array Fortran for parallel programming. SIGPLAN Fortran Forum, 17(2):1-31, 1998.