Gheorghe Almasi
IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA

Basilio B. Fraguela
Dept. de Electrónica e Sistemas, Universidade da Coruña, Spain

Jia Guo
Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign, USA

José Moreira
IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA

David Padua
Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign, USA
Abstract. A natural way to express parallel computations is to use objects that encapsulate parallel operations. Programs built using such objects can be highly readable and easy to maintain. In this paper, we show our first experiences with a class of objects that encapsulate parallelism, the Hierarchically Tiled Arrays (HTAs). We have implemented HTAs and their methods as a MATLAB toolbox that overloads conventional operators and array functions in such a way that HTA operations appear to the programmer as natural language extensions. HTAs allow the construction of single-threaded parallel MATLAB programs where a master distributes tasks to be executed in a collection of servers where the components (tiles) of the HTAs to be manipulated reside. The tiled and recursive nature of the HTAs also facilitates the development of algorithms with a high degree of locality, a desirable feature given the growing processor-memory speed gap.
1 Introduction
Parallel programs are difficult to develop and maintain. This is particularly true in the case of distributed memory machines, where every piece of data that needs to be accessed by two or more processors must be communicated by means of messages in the program, and where the user must make sure that every machine is working with the latest version of the data. Parallel execution also makes debugging and tuning difficult. The language and compiler community has come up with several approaches to help programmers deal with these issues.
This work has been supported in part by the Ministry of Science and Technology of Spain under contract TIC2001-3694-C02-02, by the Xunta de Galicia under contract PGIDIT03-TIC10502PR, and by the Defense Advanced Research Projects Agency under contract NBCH30390004. This work is not necessarily representative of the positions or policies of the Army or Government.
The first approach to ease the burden on the programmer when developing distributed memory programs was based on standard message passing libraries like MPI [7] or PVM [6], which improve the portability of the applications. Still, data distribution and synchronization must be completely managed by the programmer. Also, the SPMD programming model of the codes that use these libraries gives rise to unstructured codes in which communication may take place between widely separated sections of code and in which a given communication statement could interact with different statements during the execution of the program. Programming languages like Co-Array Fortran [9] and UPC [4] improve the readability of the programs by replacing explicit communications with array assignments, but they still have all the drawbacks of the SPMD approach.

Another strategy to improve the programmability of distributed memory environments consists of allowing a single thread of execution and letting the compiler take care of all the problems related to the distribution of the data and the parallel tasks. This is, for example, the approach of High Performance Fortran [8]. Unfortunately, compiler technology does not seem to have reached a level at which compilers generate competitive code for this kind of approach.

In this paper we explore the possibility of extending a single-threaded object-oriented programming language with a new class, called Hierarchically Tiled Array or HTA [2], that encapsulates the parallelism in the code. HTA operators overload the standard operators of the language. HTA storage, as well as operations on HTAs, is distributed among a collection of servers. The HTA class provides a flexible indexing scheme for its tiles that allows data movement among the servers to be expressed by means of simple array assignments. As a result, HTA-based programs look as if they had a single thread of execution, but are actually executed on a number of processors.
This improves the readability and ease of development and maintenance of HTA code. Furthermore, the compiler support required by our parallel programming approach is minimal, since the implementation of the class itself takes care of the parallelization. The tiled nature of the matrices stored in HTAs also allows us to use them to express locality, a very interesting property given the growing gap between the speed of memories and that of processors. We have found in MATLAB™ an ideal platform for the implementation and testing of HTAs, since it provides a high-level programming language with object-oriented features that is easy to extend using the toolbox approach.

The rest of this paper is structured as follows. HTA syntax and semantics are described in the next section. Section 3 provides several code examples. Section 4 describes the implementation of the HTA language extension as a MATLAB™ toolbox. Section 5 evaluates the performance of this implementation and compares its ease of use with that of a traditional MPI+C/Fortran approach. The last section is devoted to our conclusions and future work.
[Figure 1: Two-level tiling of HTAs, showing the top level and the bottom level of tiles: (a) HTA A; (b) HTA B.]

[Figure 2: (a) non-homogeneous HTA C; (b) homogeneous HTA D.]
all tiles of an HTA agree in the number and size of the partitions along any dimension with their corresponding neighboring tiles. Fig. 2(a) depicts a non-homogeneous HTA with mismatches in the sizes and the numbers of subdivisions of its first-level tiles (the ones separated by thick dashed lines). Fig. 2(b) shows a homogeneous HTA.
The simplest case of argument compatibility is that of two HTAs with compatible topologies. In the case of the + operator this means identical topologies; for the * operator it means that the second dimension of the first argument must have the same structure, down to the elements, as the first dimension of the second argument. The only other rule we use for argument compatibility is that of scalar (or HTA) expansion, or replication. For example, an operation between an HTA and a scalar is made legal by expanding the scalar to an HTA of compatible structure whose elements are replicated from the scalar. By the same token, a binary operation between an HTA and an array is legal if there is a tiling level in the HTA at which the (possibly flattened) tiles below are compatible with the array. In this case, the array is expanded into an HTA by replication, and becomes compatible with the HTA. In a binary operation between two HTAs, one of the arguments can also be expanded to match the structure of the other. This may or may not be possible, depending on the arguments' structures.
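To make the expansion rule concrete, the following Python sketch (hypothetical, not part of the toolbox; `expand_scalar` and `tile_add` are illustrative names) models tiles as nested lists and shows how a scalar operand can be replicated to match the tile structure of an HTA before an elementwise operation:

```python
# Illustrative sketch only (not the toolbox API): HTA tiles are modeled as
# nested Python lists, with scalars at the leaves.
def expand_scalar(scalar, tiles):
    """Replicate `scalar` so that it mirrors the tile structure of `tiles`."""
    if isinstance(tiles, list):
        return [expand_scalar(scalar, t) for t in tiles]
    return scalar

def tile_add(x, y):
    """Elementwise +; a scalar operand is first expanded to the other
    operand's structure, as the scalar-expansion rule above describes."""
    if not isinstance(x, list) and not isinstance(y, list):
        return x + y
    if not isinstance(x, list):
        x = expand_scalar(x, y)
    if not isinstance(y, list):
        y = expand_scalar(y, x)
    return [tile_add(a, b) for a, b in zip(x, y)]
```

For instance, `tile_add([[1, 2], [3, 4]], 10)` expands the scalar 10 into the two-tile structure before adding tile by tile.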
[Figure 4: Main loop of Cannon's algorithm using HTAs: in each iteration i, each server accumulates c = c + a * b on its local tiles, and the tiles of a and b are then circular-shifted by means of indexed assignments.]
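The circular-shift communication pattern of Cannon's algorithm can be modeled sequentially in plain Python (an illustrative sketch, not toolbox code: each matrix entry plays the role of the tile held by one server, and the initial skew that aligns the tiles, which the figure does not show, is written out explicitly as an assumption):

```python
def cannon_matmul(A, B):
    """Cannon's algorithm with scalar 'tiles': entry (i, j) stands for the
    tile that processor (i, j) of the mesh would hold."""
    n = len(A)
    # Initial alignment (assumed): skew row i of A left by i positions and
    # column j of B up by j positions.
    a = [A[i][i:] + A[i][:i] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Local multiply-accumulate on every 'server'.
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
        # Circular shifts: a moves west by one, b moves north by one.
        a = [row[1:] + row[:1] for row in a]
        b = b[1:] + b[:1]
    return c
```

After n iterations, every position holds its tile of the product, mirroring the c = a * b result described in the text.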
of the source array. Each partition vector contains the points where hyperplanes will cut the input matrix in the corresponding dimension to form tiles. This form of the constructor can also be used to generate homogeneous HTAs with several levels of tiling, like D in Fig. 2(b), by applying tiling in a bottom-up fashion. D could be created with either of the two following statements, which yield exactly the same result:

   D = hta(MX, {[1,2,6,8,9],[1,3,8,12]}, {[1,2,4],[1,3]});
   D = hta(B, {[1,2,4],[1,3]});

The value 1 at the beginning of every partition vector is included here for clarity, but is not required. It is also possible to build empty HTAs whose tiles are filled in later. For example, the following piece of code replicates the matrix MX in the four empty tiles of HTA F:

   F = hta(2, 2);
   F{:,:} = MX;

The examples discussed above generate non-distributed HTAs, which are located only in the processor executing the code. To distribute the top-level tiles of an HTA on a mesh of processors, add as the last argument of the constructor a vector describing the dimensions of the mesh. The distribution of the tiles is always block cyclic in the current implementation. So, for example, F would have been distributed on a 2 × 2 mesh of processors just by writing

   F = hta(2, 2, [2, 2]);
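The cutting performed by the partition vectors, and the block-cyclic mapping of top-level tiles onto the mesh, can be modeled in plain Python (an illustrative sketch; `tile_matrix` and `owner` are hypothetical names, and matrices are lists of rows):

```python
def tile_matrix(M, row_part, col_part):
    """Cut matrix M (a list of rows) into a grid of tiles. row_part and
    col_part are partition vectors holding the 1-based first row/column of
    each tile, with the leading 1 present, as in hta(MX, {rows, cols})."""
    rb = [r - 1 for r in row_part] + [len(M)]      # 0-based row boundaries
    cb = [c - 1 for c in col_part] + [len(M[0])]   # 0-based column boundaries
    return [[[row[cb[j]:cb[j + 1]] for row in M[rb[i]:rb[i + 1]]]
             for j in range(len(cb) - 1)]
            for i in range(len(rb) - 1)]

def owner(i, j, mesh):
    """Block-cyclic owner of top-level tile (i, j) on a mesh of
    mesh[0] x mesh[1] servers, as in the distribution described above."""
    return (i % mesh[0], j % mesh[1])
```

For a 4 × 3 matrix, `tile_matrix(M, [1, 3], [1, 2])` produces a 2 × 2 grid of tiles cut after row 2 and after column 1.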
while dif > epsilon
   v{2:n,:}(1,:)     = v{1:n-1,:}(d+1,:);
   v{1:n-1,:}(d+2,:) = v{2:n,:}(2,:);
   v{:,2:n}(:,1)     = v{:,1:n-1}(:,d+1);
   v{:,1:n-1}(:,d+2) = v{:,2:n}(:,2);
   u{:,:}(2:d+1,2:d+1) = K * (v{:,:}(2:d+1,1:d) + v{:,:}(1:d,2:d+1) + ...
                              v{:,:}(2:d+1,3:d+2) + v{:,:}(3:d+2,2:d+1));
   maxdifhta = max(abs(v - u));
   dif = max(max(maxdifhta(:,:)));
   v = u;
end
Figure 5: Parallel Jacobi relaxation

The product of the HTAs a and b in the last line of code has the effect of multiplying the corresponding tiles of both HTAs. That is, a sparse matrix-vector multiply takes place at each server containing corresponding tiles of a and b. The result is a vector r, distributed across the servers with the same mapping as a and b. The HTA resulting from the operation can be flattened back into a vector by using the r(:) notation. The code completely hides the fact that MX is sparse, because MATLAB™ provides the very same syntax for dense and sparse computations, a feature our HTA class has inherited.

While the previous example only required communication between the client and each individual server, Cannon's matrix multiplication algorithm [3], our second example, also requires communication between the servers. The algorithm has O(n) time complexity and uses O(n^2) processors (servers). In our implementation of the algorithm, the operands, denoted a and b respectively, are HTAs tiled in two dimensions. The HTAs are mapped onto a mesh of n × n processors. In each iteration of the algorithm's main loop, shown in Fig. 4, each server first executes a matrix multiplication of the tiles of a and b that currently reside on that server. The result of the multiplication is accumulated in a (local) tile of the result HTA, c. Then, the tiles of both a and b are circular-shifted along one dimension. The tiles of b are shifted north, in such a way that the top processor in each column transfers its current tile of b to the bottom processor. Similarly, the tiles of a are shifted west (to the left), with the left-most processor in each row sending its tile of a to the right-most processor of its row. After n iterations, each server holds the correct value for its associated tile in the HTA c = a * b.

Referencing arbitrary elements of HTAs results in complex communication patterns. The blocked Jacobi relaxation code in Fig. 5 requires the four neighbors of a given element to compute its new value. Each block is represented by a tile of the HTA v. In addition, the tiles also contain extra rows and columns for use as buffers for the border regions when exchanging information with the neighbors. The border exchange is executed in the first four statements of the main loop. Thus the computation step uses only local data. In this case the flattened version of the HTA v does not quite represent the desired end result,
because of the existence of the border exchange regions. However, the desired matrix can be obtained by first removing the border regions and then applying the flattening operator: (v{:,:}(2:d+1,2:d+1))(:,:).

Figure 6: HTA implementation in MATLAB™
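The border-exchange-then-compute pattern of Fig. 5 can also be modeled outside MATLAB in plain Python (an illustrative sketch under stated assumptions: tiles are (d+2) × (d+2) nested lists keyed by tile position, the exchange mutates the tiles in place, and the coefficient K is left as a parameter):

```python
def jacobi_sweep(v, n, d, K=0.25):
    """One sweep of a blocked Jacobi relaxation: exchange halo rows/columns
    between neighboring tiles, then update each tile's interior from its own
    (now up-to-date) data. v maps (I, J) -> (d+2) x (d+2) tile."""
    # Halo exchange, mirroring the four indexed assignments of the HTA code.
    for I in range(1, n):
        for J in range(n):
            for c in range(d + 2):
                v[(I, J)][0][c] = v[(I - 1, J)][d][c]      # from north neighbor
                v[(I - 1, J)][d + 1][c] = v[(I, J)][1][c]  # from south neighbor
    for I in range(n):
        for J in range(1, n):
            for r in range(d + 2):
                v[(I, J)][r][0] = v[(I, J - 1)][r][d]      # from west neighbor
                v[(I, J - 1)][r][d + 1] = v[(I, J)][r][1]  # from east neighbor
    # Local computation: new interior = K * (west + north + east + south).
    u = {}
    for (I, J), t in v.items():
        nt = [row[:] for row in t]
        for r in range(1, d + 1):
            for c in range(1, d + 1):
                nt[r][c] = K * (t[r][c - 1] + t[r - 1][c] +
                                t[r][c + 1] + t[r + 1][c])
        u[(I, J)] = nt
    return u
```

Because the halos are refreshed first, the update loop touches only data that is local to each tile, just as in the HTA version.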
for K=1:m
   for I=1:m
      for J=1:m
         c{I,J} = c{I,J} + a{I,K} * b{K,J};
      end
   end
end

Figure 7: Matrix multiplication using HTAs
5 Evaluation
Our toolbox has been implemented and tested on an IBM SP system that consists of two nodes of 8 Power3 processors running at 375 MHz and sharing 8 GB of memory each. Two configurations have been used in the experiments: one with 4 processors used as servers, and another one with 9 servers. Each node contributes half of the servers. There is an additional processor that executes the main thread of the program, thus acting as the client or master of the system. Although the main focus of our work is the achievement of parallelism, it is interesting to remember that HTAs can be used to express locality too. Experiments on our base system showed that, for large matrices (order > 2000), the product of matrices stored entirely in the client as HTAs is about 8% faster than the native MATLAB™ matrix product routine.
5.1 Benchmarks
In this subsection we present the six benchmarks we used to evaluate the HTA toolbox. smv implements the sparse matrix-vector product shown in Fig. 3; cannon is a version of Cannon's matrix multiply algorithm [3] (the main loop is depicted in Fig. 4); jacobi is the Jacobi relaxation code shown in Fig. 5; summa represents the well-known SUMMA [5] matrix multiplication algorithm; lu is an implementation of the LU factorization algorithm; finally, mg is one of the NAS benchmarks [1]. Benchmarks smv, cannon and jacobi have already been described in Sect. 3; we now briefly describe the other three benchmarks.

As for summa, consider the matrix multiplication using HTAs shown in Fig. 7. The inner two loops are parallel with respect to the tiles of c. However, each tile of a and b is used in the calculation of m tiles of c. Thus, for a parallel algorithm, a and b need to be replicated so that they become local in the servers that hold the corresponding tiles of c. This replication is done by an HTA-aware overloaded version of the MATLAB™ repmat function, which takes as arguments an HTA and parameters specifying the number of copies to make in each dimension. Since the HTAs are distributed across processors, the repmat implementation involves communication operations. Fig. 8 shows the SUMMA algorithm in HTA representation.

A blocked LU decomposition algorithm is implemented in lu. In A = LU, A is tiled into K × K blocks. We use the HTA notation A{I,J} to denote the block of A at position (I, J). In each iteration I from 1 to K, A{I,I} is factorized into L{I,I} and U{I,I}. Then row I and column I are updated using L{I,I} and U{I,I}. Finally, the remaining matrix A{I+1:K, I+1:K} is modified using the updated row I and column I.
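The replicate-then-multiply structure of SUMMA can be sketched sequentially in plain Python (an illustrative model under assumptions, not the HTA code of Fig. 8: each entry stands for one tile, and the two list comprehensions play the role of the repmat broadcasts):

```python
def summa_matmul(A, B):
    """SUMMA with scalar 'tiles': in step k, column k of A is replicated
    across each processor row and row k of B across each processor column,
    after which every position (i, j) performs a local multiply-accumulate."""
    m = len(A)
    c = [[0] * m for _ in range(m)]
    for k in range(m):
        a_col = [A[i][k] for i in range(m)]   # replicated along the rows
        b_row = [B[k][j] for j in range(m)]   # replicated along the columns
        for i in range(m):
            for j in range(m):
                c[i][j] += a_col[i] * b_row[j]
    return c
```

Each of the m steps is an outer product of the replicated column and row, so after m steps c accumulates the full product.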
Figure 9: Speedup for HTA and MPI+C/Fortran on 4 servers (a) and 9 servers (b)

The NAS benchmark mg [1] solves Poisson's equation, ∇²u = v, in 3D using a multigrid V-cycle. The algorithm carries out computation at a series of levels, each of which defines a grid at a successively coarser resolution. The code for the parallel version of mg using HTAs is a simple extension of the serial version in MATLAB™. In the MATLAB™ code each grid level is defined as a 3D array, while in the parallel version the levels are defined as 3D HTAs equally distributed along each dimension. Since the grid is distributed among processors, communication of neighboring grid points is performed at the appropriate places using simple assignment operations, as in jacobi.
of the parallel MPI program. The execution times of all the parallel MATLAB™ implementations are at most a factor of two slower than their MPI counterparts, except for mg, where the parallel MPI version is more than 30 times faster. This shows that the overhead introduced by MATLAB™ is, except for mg, not very high, and therefore the speedups obtained are mainly the result of parallelizing useful computation and not of parallelizing the overhead.

There are two main reasons for the good performance of the HTA versions of cannon and summa. First, using the HTA representation, the main bodies of cannon and summa contain only 3 lines, as shown in Figs. 4 and 8, respectively, resulting in very little MATLAB™ interpretation overhead. Second, the kernels of both cannon and summa consist of two calls to the circshift or repmat functions and a matrix multiplication in HTA form. In cannon, the circular shifts of the tiles of the matrices were implemented using an overloaded version of the MATLAB™ circshift function, which shifts the contents of an array an arbitrary number of positions along any desired dimension(s). This implementation is more efficient than the one shown in Fig. 4. The HTA repmat and circshift methods used in summa and cannon, respectively, have been implemented very efficiently in the toolbox; they are therefore comparable in performance to the similar functionality in the corresponding MPI code. As for the matrix multiplication, it is highly parallel and involves little overhead, since it is local to each server.

Benchmark mg consists of a series of local computations with communications interleaved between them. Since the problem size is large (a 256 × 256 × 256 grid), the cost of the local computations greatly dominates that of the communication, and hence the speedup is significant.

The HTA speedups are smaller than those of MPI+C/Fortran for some benchmarks. In lu, HTAs perform similarly to MPI on 4 servers; on 9 servers, performance drops.
The speedups of our HTA versions of smv and jacobi are clearly lower than those of MPI, both for 4 and for 9 servers. These lower speedups are due to extra overhead built into our implementation. One such cause of extra overhead is the time taken to broadcast HTA commands to the servers: MATLAB™ is an interpreted language, and all HTA operations need to be broadcast from the client to the servers before they are executed. If an HTA operation handles relatively small amounts of raw data, broadcast overhead dominates execution time. By comparison, in an MPI program this broadcast is unnecessary because program execution is governed locally.

Another source of overhead is caused by the limitations of the MATLAB™ extension interface. MATLAB™ user-defined functions cannot receive arguments passed by reference; they must be passed by value. As a result, our current implementation, written in MATLAB™, must make a full copy of the left-hand side of any indexed HTA assignment operation. Copying an HTA with every assignment constitutes a potentially crippling overhead. This effect is visible in the speedups of lu and jacobi. MATLAB™ itself does not suffer from this overhead, as the indexed assignment operation is a built-in function to which arguments are passed by reference.

Thus, the current sources of overhead are mostly due to implementation issues and not to inherent limitations of the HTA approach. Command broadcast overhead can be mitigated by sending pre-compiled snippets of code to the servers for execution (although this implies the existence of a compiler at the client). By re-implementing the indexed assignment method in C instead of MATLAB™, we can mitigate the overhead caused by excessive copying of HTAs.
[Table: per-benchmark figures for the HTA and MPI versions (column headings lost in extraction):]

   cannon    1   18   14   189
   jacobi   31   41   48   364
   summa     1   24   14   261
   lu        1   24   33   342
6 Conclusions
We present a novel approach to writing parallel programs in object-oriented languages using a class called Hierarchically Tiled Arrays (HTAs). The objects of this class are arrays divided into tiles which may be distributed on a mesh of processors. HTAs allow the expression of parallel computation and data movement by means of indexed assignment and computation operators that overload those of the host language. HTA code is executed in a master-slave environment where commands from the client executing HTA code are transmitted to the servers that actually hold the data. HTAs improve the ability of the programmer to reason about a parallel program, particularly when compared to code written using the SPMD programming model. HTA tiling can also be used to express memory locality in linear algebra routines. We have implemented our new data type as a MATLAB™ toolbox, and we have written a number of benchmarks in the MATLAB™ + HTA environment. The benchmarks are easy to read, understand and maintain, as the examples shown throughout the paper illustrate. While the performance of HTA codes is competitive with that of traditional SPMD codes using MPI for many benchmarks, there are situations when HTAs suffer from overhead problems and fall behind in performance. The two main reasons for the additional overhead suffered by the
HTA codes are related to details of our current implementation, not to the HTA approach itself. First, the current implementation combines the interpreted execution of MATLAB™ with the need to broadcast each command to a remote server. This could be mitigated in the future by more intelligent ahead-of-time broadcasting of commands or by the deployment of a compiler. The second cause of overhead is the need to use intermediate buffering and to replicate pieces of data, not because of algorithmic requirements, but due to the need to operate inside the MATLAB™ environment. This effect can be reduced by a more careful implementation. In summary, we consider the HTA toolbox a powerful tool for the prototyping and design of parallel algorithms, and we plan to make it publicly available soon.
References
[1] NAS Parallel Benchmarks. Website. http://www.nas.nasa.gov/Software/NPB/.
[2] G. Almasi, L. De Rose, B. B. Fraguela, J. Moreira, and D. Padua. Programming for locality and parallelism with hierarchically tiled arrays. In L. Rauchwerger, editor, Proc. of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2003), Lecture Notes in Computer Science, vol. 2958, College Station, Texas, Oct 2003. Springer-Verlag.
[3] L. E. Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. PhD thesis, Montana State University, 1969.
[4] W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, and K. Warren. Introduction to UPC and language specification. Technical Report CCS-TR-99-157, IDA Center for Computing Sciences, 1999.
[5] R. A. van de Geijn and J. Watts. SUMMA: Scalable Universal Matrix Multiplication Algorithm. Concurrency: Practice and Experience, 9(4):255-274, April 1997.
[6] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidyalingam S. Sunderam. PVM: Parallel Virtual Machine: A User's Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, MA, USA, 1994.
[7] W. Gropp, E. Lusk, and A. Skjellum. Using MPI (2nd ed.): Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1999.
[8] C. Koelbel and P. Mehrotra. An overview of High Performance Fortran. SIGPLAN Fortran Forum, 11(4):9-16, 1992.
[9] R. W. Numrich and J. Reid. Co-Array Fortran for parallel programming. SIGPLAN Fortran Forum, 17(2):1-31, 1998.