
2010 Ninth International Workshop on Parallel and Distributed Methods in Verification / 2010 Second International Workshop on High Performance Computational Systems Biology

Implementation of Smith-Waterman algorithm in OpenCL for GPUs


Dzmitry Razmyslovich, Guillermo Marcus, Markus Gipp, Marc Zapatka and Andreas Szillus
Institute for Computer Engineering (ZITI), University of Heidelberg, Mannheim, Germany. Email: see http://www.ziti.uni-heidelberg.de
German Cancer Research Center, Heidelberg, Germany. Email: m.zapatka, a.szillus@dkfz-heidelberg.de

Abstract: In this paper we present an implementation of the Smith-Waterman algorithm written in OpenCL and targeting high-end GPUs. The implementation computes similarity indexes between reference and query sequences and is designed to calculate sequence alignment paths. In addition, it can handle very long reference sequences (on the order of millions of nucleotides), a requirement for the target application in cancer research. Performance compares favorably against a CPU implementation, being 9 to 130 times faster, and it is 3 times faster than the CUDA-enabled CUDASW++v2.0 for medium-sized or larger sequences. It is also on par with Farrar's implementation, but with fewer constraints on sequence length.

Keywords: OpenCL, GPU, CUDA, Smith-Waterman, Bioinformatics

INTRODUCTION

Many biological questions are currently being investigated using second-generation sequencing technology, which is characterized by short read lengths (35-100 nucleotides). One possible application of this technology is cancer genomics [1]. All cancers result from changes in the DNA sequence of the genomes of cancer cells [2]. These changes (aberrations) can be described as nucleotide substitutions, short insertions and deletions, rearrangements and copy-number changes [3]. The Smith-Waterman algorithm is one of the best solutions for identifying these aberrations, because it is sensitive enough to identify most complex aberrations that alternative, faster algorithms do not recognize [4].

Our approach aims to align the short reads from second-generation sequencing technology along a long genome sequence within an acceptable time. The main problem in using the Smith-Waterman algorithm for this task is its $O(n \times m)$ time complexity, where $n$ is the length of a short read (a query sequence) and $m$ is the length of a long genome sequence (a reference sequence). Moreover, the algorithm requires a lot of memory (on the order of 16 GB), which additionally decreases performance and places excessively high requirements on the target computation system.

In this paper we present an accelerated implementation of the Smith-Waterman algorithm that uses the latest technologies for heterogeneous high-performance parallel systems such as GPUs and FPGAs. The implementation is written using the modern OpenCL standard, which makes the interface independent of the type of the target system. The code is optimized for high-end CUDA-enabled NVIDIA GPUs.

There are currently a number of GPU-accelerated implementations of the Smith-Waterman algorithm. The following can be seen as related work for our implementation: Cheng Ling's implementation [5], CUDASW++v2.0 [6], Farrar's implementation [7] and Manavski's implementation [8]. However, none of these implementations focuses on computing sequence alignment paths, and none of them, except for Cheng Ling's, processes long reference sequences. The biological task described above, in contrast, requires an implementation of the Smith-Waterman algorithm that places no limit on the reference sequence length and can calculate sequence alignment paths. These two challenges form the key characteristics of our implementation. To assess the efficiency of our implementation, it is compared with CUDASW++v2.0 and Farrar's implementation, which are the most popular and widely used.

The rest of the paper is divided into 5 sections. Section I gives a brief description of the Smith-Waterman algorithm. Section II highlights the main pros and cons of the NVIDIA OpenCL standard implementation for GPUs. Section III consists of 6 subsections, each presenting a technique or method we have used to improve the performance of the implementation. Section IV gives the results of benchmarking and comparison. Finally, Section V concludes the paper with an outlook on the most important advantages of the OpenCL implementation presented.

978-0-7695-4265-2/10 $26.00 2010 IEEE. DOI 10.1109/PDMC-HiBi.2010.16, 10.1109/HiBi.2010.20

I. THE SMITH-WATERMAN ALGORITHM

The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment, that is, for determining similar regions between two sequences of nucleotides or proteins (elements) [9]. The idea of alignment lies in filling the $n \times m$ similarity matrix $H$, where $n$ is the number of elements in a query sequence and $m$ is the number of elements in a reference sequence. The values of the matrix are computed using dynamic programming according to formula 1. Each value $H[i, j]$ is the measure of similarity of two subsequences: a query sequence up to the $i$-th element and a reference sequence up to the $j$-th element.

$$
\begin{aligned}
H[i, 0] &= 0, \quad 0 \le i \le n, \\
H[0, j] &= 0, \quad 0 \le j \le m, \\
H[i, j] &= \max \left\{ \begin{array}{l} 0 \\ H[i-1, j] + IF \\ H[i, j-1] + RF \\ H[i-1, j-1] + S(i, j) \end{array} \right\}, \quad 1 \le i \le n, \ 1 \le j \le m.
\end{aligned}
\tag{1}
$$

Figure 1. A step of the path constructing traceback procedure.

The $H[i, j]$ value is a similarity score. The insertion fee ($IF$) is a penalty for extending a reference sequence with an element from a query sequence, while the removing fee ($RF$) is a penalty for withdrawing an element from a reference sequence. The value $S(i, j)$ is calculated using formula 2, where $MF$ is the mismatching elements fee and $MS$ is the matching elements score.

$$
S(i, j) = \begin{cases} MF, & Query[i] \ne Reference[j], \\ MS, & Query[i] = Reference[j]. \end{cases}
\tag{2}
$$

Figure 2. An example of an alignment by Smith-Waterman algorithm.
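As an illustration, formulas 1 and 2 can be written as a short sequential Python sketch. The function names `similarity_matrix` and `best_score`, the parameter names and the default values (a match score of 2 and fees of -1, stored as negative numbers so that they can be added as in formula 1) are assumptions made for this example only; they are not taken from the OpenCL implementation.

```python
def similarity_matrix(query, reference, ms=2, mf=-1, insert_fee=-1, remove_fee=-1):
    """Fill the (n+1) x (m+1) similarity matrix H of formula 1."""
    n, m = len(query), len(reference)
    # Row 0 and column 0 stay at zero: H[i][0] = H[0][j] = 0.
    h = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Formula 2: matching score or mismatching fee.
            s = ms if query[i - 1] == reference[j - 1] else mf
            h[i][j] = max(0,
                          h[i - 1][j] + insert_fee,
                          h[i][j - 1] + remove_fee,
                          h[i - 1][j - 1] + s)
    return h


def best_score(h):
    """Return the largest similarity value stored in the matrix."""
    return max(max(row) for row in h)
```

With these defaults, `best_score(similarity_matrix("ACGT", "ACGT"))` evaluates to 8, since four consecutive matches each add 2 along the main diagonal.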

The definitions of the fees and scores specified above differ between implementations. In this implementation:

- IF and RF are specified separately;
- MF and MS can either be the same for all elements or differ per element pair using a substitution matrix (such as the BLOSUM [10] or PAM [11] matrices).

To obtain the actual alignment of two subsequences, the sequence alignment path in the matrix has to be found. This is done using a traceback procedure [9], which starts from the similarity value of the sequences and continues until it reaches the upper-left corner of the matrix. At each step of the traceback procedure, the new point in the matrix is chosen as the maximum of the 3 neighbors of the current point (see figure 1). Figure 2 shows an example of an alignment for the following parameters: Reference = ACACACTA, Query = AGCACACA, MS = 2, MF = IF = RF = -1.

II. OPENCL MODEL DESCRIPTION

OpenCL is an open standard for general-purpose parallel programming across different heterogeneous processing platforms: CPUs, GPUs and others [12]. An OpenCL program consists of two parts:

- the host code, designed to prepare data, load it into GPU memory, schedule kernel execution and postprocess the kernel execution results;
- the kernel code, which is executed on a GPU.

The kernel code is written in a variant of the C language and consists of at least one kernel function. A kernel function is executed concurrently by each thread (an OpenCL workitem) of a block (an OpenCL workgroup). A set of blocks forms a grid of blocks, which represents the whole execution model (see figure 3). On the GPU platform used, each block is executed by only one streaming multiprocessor (SM), while the grid of blocks is distributed by a scheduler over an array of SMs, sequentially occupying the vacant multiprocessors (see figure 4) [13]. The kernel code should therefore be designed to be independent of the order in which the blocks are executed.

Figure 3. A grid of thread blocks.

Figure 4. The blocks scheduling model.

Individual threads in a block are executed in groups of 32 threads called warps. A warp processes one common instruction for all 32 threads at a time. If the execution sequence has a divergent branch, the warp executes both paths serially, so some of the threads are blocked while each path is executed. As a result, every divergent branch increases the execution time.

This execution model provides synchronization and communication mechanisms. Threads of one block can synchronize using barriers and communicate using shared memory, while threads from different blocks cannot. There are also other layers of memory available: registers, non-cached local memory, caches for constant and texture memory, and non-cached global memory (see figure 5). Each layer has a certain size and performance. A more detailed description of the memory layers and their usage can be found in [14].

Figure 5. The memory model.

III. IMPLEMENTATION IN THE GPU

The previously described biological problem can be solved using the Smith-Waterman algorithm implemented in OpenCL as a series of successive steps. Each step contributes to an improvement of the performance.

A. Parallelization granularity

The basic task lies in processing many short query sequences against one long reference sequence. Since the original task is the computation of the paths, the similarity matrix has to be stored in order to be processed. However, the data size requirements limit the possibility of storing the matrix. For instance, for a reference sequence of 28 million nucleotides and a query sequence of 150 nucleotides, 16 GB of memory would be used. Since modern graphics cards have up to 5 GB of memory, it is necessary to compute the paths online. Here this means calculating the paths for the already calculated part of the matrix and truncating the matrix concurrently with the computation of the new piece of the matrix (see subsection B). Choosing a nucleotide from a query sequence as the parallelization grain makes online computation possible.

B. Long reference sequences processing

The main heuristic used to reduce the memory usage of the calculation is shown in figure 6. In this figure the already calculated part of the matrix is the similarity matrix for a reference sequence R1 and a query sequence. The optimal path for this matrix is marked with line P1. The previous matrix with the attached dashed piece stands for the similarity matrix for a new reference sequence R2. A new optimal path P2 for the renewed matrix crosses P1. This means that it is possible to define another optimal path P3 for the renewed matrix. P3 is assembled by merging the part of P2 from the end of the renewed matrix to the junction J and the part of P1 from J to the top row of the old matrix. It is now easy to see that the dashed part of the original matrix can be truncated just before the calculation of the new optimal path.

Figure 6. The heuristic model.

Processing a long reference sequence can then be divided into 2 parts:

- calculation of a new piece of the matrix;
- calculation of a new optimal path together with truncating the current matrix.

The former part takes more time than the latter. So, if these 2 parts are processed by different devices, the whole calculation time will be equal to the calculation time of a new piece of the matrix. Since it is generally accepted practice to pass the more time-consuming part of an algorithm to a GPU, the calculation of a new piece of the matrix is processed on this device, while the calculation of a new optimal path and the truncation of the current matrix are given to a CPU. This solution makes the two parts of long reference sequence processing run concurrently.

C. Multi-query processing

Since a query sequence is short and normally fits in one workgroup, processing one query at a time is inefficient. As mentioned in section II, one workgroup is processed by only one GPU multiprocessor. This means that several queries can be processed in one cycle in the same amount of time. In order to avoid exceeding the resources available per kernel, the following techniques have been introduced. First, several short query sequences are concatenated into one big query sequence. Then, a zero-value delimiter is placed between the original queries. The delimiter makes it possible to process different queries separately. This is done by multiplying the $H[i, j]$ value from formula 1 by the value of the sign function of the current query character:

$$
H[i, j] = H[i, j] \cdot \operatorname{sign}(q[i])
\tag{3}
$$

Placing the zero-value delimiter yields zero function values, which serve as the top-row values for the similarity matrix of the next query sequence. So formula 3 maintains the correct values in the matrices for each query sequence. Concatenating several short query sequences into a big one makes multi-query processing possible, which provides better occupancy of the GPU.

D. The calculation shape

Every iteration is a full execution of the kernel function. At any given iteration, a block of $h \times h$ values is calculated, where $h$ is the number of workitems in a workgroup. It can be seen from formula 1 that the value $H[i, j]$ depends on 3 other values in the matrix. Each of these values in turn depends on 3 others, creating a dependency chain for all the elements in the matrix. The same dependency chain can be constructed for the blocks of the matrix. If the blocks computed at any given step in the chain are numbered, a wavefront appears, as shown in figure 7.

Figure 7. The wavefront calculation model.

Using the wavefront calculation model makes it important to choose the right shape of the block. With the usually chosen rectangular shape, the calculation process limits the number of working threads at each step, because of the data dependencies between these threads. Extending the basic rectangular block to a parallelogram as shown in figure 8 reduces the number of branches in the kernel code by providing calculation by diagonals. This eliminates the data dependencies, letting all the threads work simultaneously.

Figure 8. The block calculating model.
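The calculation by diagonals can be illustrated with a small Python sketch. By formula 1, every cell on the anti-diagonal i + j = d depends only on cells of the anti-diagonals d - 1 and d - 2, so all cells of one anti-diagonal could be computed by different threads at the same time. The sketch below walks the matrix in that order sequentially; the function names `fill_by_diagonals` and `diagonal_sizes` and the scoring defaults (match score 2, fees -1) are assumptions for illustration and do not reproduce the OpenCL kernel.

```python
def fill_by_diagonals(query, reference, ms=2, mf=-1, insert_fee=-1, remove_fee=-1):
    """Fill the similarity matrix one anti-diagonal (wavefront) at a time."""
    n, m = len(query), len(reference)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    # Cells with i + j == d form one anti-diagonal; d runs from 2 to n + m.
    for d in range(2, n + m + 1):
        # The cells of one anti-diagonal are mutually independent, so this
        # inner loop is the part that a GPU could execute in parallel.
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            s = ms if query[i - 1] == reference[j - 1] else mf
            h[i][j] = max(0,
                          h[i - 1][j] + insert_fee,
                          h[i][j - 1] + remove_fee,
                          h[i - 1][j - 1] + s)
    return h


def diagonal_sizes(n, m):
    """Number of cells on each anti-diagonal of an n x m matrix."""
    return [min(n, d - 1) - max(1, d - m) + 1 for d in range(2, n + m + 1)]
```

The list returned by `diagonal_sizes` shows how much parallelism each step offers: for a 3 x 3 block it is `[1, 2, 3, 2, 1]`, so the short diagonals at the start and end of a rectangular block leave most threads idle.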

However, the calculation process is still inefficient, because the number of values calculated in the extended block is twice as many as in the original rectangular block, while the significant values are kept only in the original rectangular area (the filled rectangle in figure 8). Moreover, the presence of useless values in the shaded area brings a branch into the kernel function, causing a further performance decrease. To omit the branch, it is necessary to avoid the presence of useless values while preserving the diagonal calculation. If several diagonals are put together, their number being equal to the number of workitems in a workgroup, a new diamond shape appears that meets these requirements. In this case, no additional computation is needed and the kernel function becomes simpler and faster, at the expense of additional complexity in the calculation process (see figure 9): the kernel function was divided into 2 functions, the preprocessing kernel function and the main kernel function. The preprocessing kernel function calculates the initial $(k+1)^2$ blocks (dashed in figure 9), where $k$ is the number of OpenCL workgroups used on a GPU. The main kernel function has no additional branches and is used for the uniform computation of the rest of the matrix. The usage of the diamond shape provides an additional speed-up.

Figure 9. The modified calculating model.

E. The concurrent transfer and execution

The whole calculating process is divided into 8 parts (subprocesses):

- the host initialization;
- the transfer of the input data to device memory;
- kernel execution scheduling;
- the precalculation kernel execution;
- the main kernel execution;
- the transfer of a matrix block to host memory;
- path calculation;
- results print-out.

The most time-consuming subprocesses are the main kernel execution, the transfer of a matrix block to host memory and the path calculation, because these subprocesses are heavy and repeated several times (see table I). The main kernel execution is repeated once per iteration, the number of iterations being the number of blocks that fit in the similarity matrix for a query sequence of workgroup size. The transfer and the path calculation are repeated once per window, the number of windows being the number of iterations divided by the window size.

Table I. The GPU usage statistics according to the OpenCL profiler.

| Method | #Calls | GPU ms | %GPU time |
|---|---|---|---|
| Initial transfer | 6 | 101.7 | 0.8 |
| Precalculate kernel | 23 | 6.8 | 0.05 |
| Main kernel | 19279 | 9240 | 73.34 |
| Blocks transfer | 201 | 3248.9 | 25.78 |

Because of the data dependencies between these 3 subprocesses, it is impossible to execute them simultaneously for the same window. But calculating a window of the matrix together with transferring and processing the previous window is possible, due to the independent functionality of the GPU DMA controller and the GPU multiprocessors. Since the path calculation and the matrix block calculation are processed by different devices, these tasks can also be handled concurrently. To make it possible to overlap data transfer and kernel execution, a ring buffer is allocated in device memory. The ring buffer consists of a minimum of 3 windows, 2 of which are used for calculating matrix values while 1 contains the ready-to-transfer piece of the matrix (see figure 10).

Figure 10. The ring buffer schema.

The efficiency of the ring buffer usage is shown in figures 11 and 12, which depict the whole computation process without and with the ring buffer. The only difference between these 2 processes is the order in which the subprocesses are executed, while the execution time of each subprocess itself is the same. It is easy to see that the computation process without the ring buffer needs an additional time gap for transferring data from GPU memory to host memory, as shown in figure 11. With the ring buffer, this time gap overlaps with the main kernel execution (see figure 12), improving the GPU utilization and reducing the calculation time by approximately 25%, according to the profiling data shown in table I.

Figure 11. The calculation time diagram without transfer and execution overlapping.

Figure 12. The calculation time diagram with transfer and execution overlapping.

F. Smith-Waterman without the path calculation

In most cases, a vast amount of sequencer data must be filtered at the beginning, and this task only requires similarity values. So, to allow faster computation of similarity values alone, as well as to compare the OpenCL implementation with other ones, a version that omits the path calculation has been created. The main difference of the no path calculating version is that the matrix storage is unnecessary, since only the last column of the matrix is used for retrieving the results. Moreover, if queries fit the workgroup size, it is efficient to pass only an integer number of queries to a workgroup in order to avoid data dependencies between different workgroups. Taking all these features into account, it is possible to complete the computation in 3 steps instead of the 8 subprocesses described in subsection E (see figure 13):

- initialization: performs some calculations to make the wavefront technique usable;
- calculating: calculates the whole matrix excluding heading and ending;
- finishing: calculates the ending and saves the results.
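To illustrate why a version without path calculation needs no matrix storage, the following Python sketch computes only the best similarity score while keeping a single column of the matrix in memory. It is a simplified sequential analogue, not the three-step OpenCL kernel itself; the function name `best_score_only` and the scoring defaults (match score 2, fees -1) are assumptions for this example.

```python
def best_score_only(query, reference, ms=2, mf=-1, insert_fee=-1, remove_fee=-1):
    """Return the maximum of H while storing only one column of the matrix."""
    n = len(query)
    column = [0] * (n + 1)      # holds H[0..n][j - 1]; starts as the zero column
    best = 0
    for ref_char in reference:  # advance one reference element (one column) at a time
        diagonal = column[0]    # H[i - 1][j - 1]; for i = 1 this is H[0][j - 1] = 0
        for i in range(1, n + 1):
            left = column[i]    # H[i][j - 1], saved before it is overwritten
            s = ms if query[i - 1] == ref_char else mf
            column[i] = max(0,
                            column[i - 1] + insert_fee,  # H[i - 1][j] + IF
                            left + remove_fee,           # H[i][j - 1] + RF
                            diagonal + s)                # H[i - 1][j - 1] + S(i, j)
            best = max(best, column[i])
            diagonal = left
    return best
```

In this sketch the memory use depends only on the query length: for a 36-nucleotide read it stores 37 integers, regardless of how long the reference sequence is.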

Figure 13. The no path calculating model.

This calculating model makes it possible to eliminate the costs of data transfer, kernel launching and synchronization, providing an additional speed-up.

The effect of each step on the implementation performance is summarized in figure 14. The step letters correspond to the letters of the subsections listed above. It should be pointed out that each step brings an additional functional characteristic, a speed-up, or both.

Figure 14. The effect of the implementation steps on overall performance.

IV. RESULTS AND DISCUSSIONS

The sequence of chromosome 21 (circa 28 million nucleotides in length) from the NCBI [15] Build 36 of the human reference assembly is used as a test space for benchmarking, analyzing and comparing the OpenCL implementation. A set of reads of equal length (36 nucleotides) from an Illumina genome analyzer was used as query sequences. The computation time is used to measure the performance. The computation time includes:

- the kernel execution scheduling time;
- the kernel execution time;
- the device-to-host data transfer time (for the version with path calculating);
- the paths calculation time (for the version with path calculating);
- the device-to-host results transfer time (for the version without path calculating).

The time for program initialization and reading input files, as well as for the OpenCL library initialization, loading the input data to GPU memory and printing the results, was not included in the computation time, because these factors do not influence the comparison characteristics of the implementation.

All benchmarking and comparison tests were carried out on the following test platform: an NVIDIA GeForce GTX 260 GPU with 1.75 GB of RAM, 30 multiprocessors and 216 cores, installed in a PC with an Intel i7-920 CPU and 6 GB of RAM, running Linux with the NVIDIA GPU Computing SDK 3.0 installed.

A. Implementation benchmarking

Benchmarking tests show the computation time as a function of the reference sequence length. The number of query sequences processed at a time is fixed: 40 queries for the version with path calculating and 600 queries for the version without path calculating. These numbers were found experimentally on the given test platform and can vary depending on the platform. The parameters used in the implementation were chosen according to the following requirements:

- the GPU is to be loaded as much as possible, so all multiprocessors should be busy;
- the window size should be neither too small, which would prevent the CPU from finishing the previous window in time, nor too large, which would make the data transfer too long and the ring buffer too big;
- too many queries can either increase the risk of running out of memory (in the case of path calculating) or reduce the performance (in the case of no path calculating).

The time graph for both implementation versions is shown in figure 15.

Figure 15. The benchmark graph.

B. Comparison

The OpenCL implementations with and without path calculation have been compared with CUDASW++v2.0 [6], Farrar's implementation [7] and our CPU implementation. The comparison with the CPU implementation is shown in figure 16. According to the test results, the path OpenCL implementation is about 9x faster than the CPU implementation and the no path OpenCL implementation about 130x faster.

Figure 16. The comparison graph for the OpenCL implementations and the CPU implementation.

The main advantage of the OpenCL implementations over the other non-CPU implementations is the ability to process long reference sequences. Farrar's implementation can process reference sequences of up to 65536 nucleotides and CUDASW++v2.0 of up to 262000 nucleotides, while on the same test platform the OpenCL implementation is able to handle reference sequences up to 28 million nucleotides in length. To compare the computation time, the test space has been modified:

- the reference sequence lengths have been limited to the maximum length supported by all implementations (65536 nucleotides);
- to show the computation time with regard to both parameters used in the implementations (the reference sequence length and the number of queries), the comparison has been divided into groups of tests according to the number of queries in a query database file. Databases with 40, 200 and 600 queries were used.

Figure 17 shows the comparison graph for the 40-query database file. This graph is the basis for the following, more complex comparison cases: for larger database files, the computation time for the path calculating version is defined as the sum of the times to compute each 40-query part of the file (see subsection A). The path calculating version was also tested on the 200-query database file (figure 18).

Figure 17. The comparison graph for the 40-query database file.

Figure 18. The comparison graph for the 200-query database file.

For the 600-query database file, the computation time of the path calculating OpenCL version is much higher than that of all the no path calculating implementations. Plotting the path calculating version in figure 19 would therefore have obscured the performance differences between the no path calculating implementations. The no path OpenCL implementation is competitive with Farrar's implementation and is 3x as fast as CUDASW++v2.0 for the 600-query database file (figure 19).

Figure 19. The comparison graph for the 600-query database file.

V. CONCLUSION

In this paper an implementation of the Smith-Waterman algorithm using the modern OpenCL standard and targeting high-end CUDA-enabled GPUs was presented. The implementation is intended for aligning short reads obtained with second-generation sequencing technology along a genome sequence. Testing showed that the key advantages of this OpenCL implementation in comparison with the other top-rated implementations are:

- the implementation is able to process long reference sequences efficiently (up to 28 million nucleotides in the tests);
- the alignment paths can be calculated effectively, which is the key feature of this implementation;
- the computation performance is competitive with Farrar's implementation and 3x as fast as the CUDASW++v2.0 implementation for the 600-query database file;
- the acceleration in comparison with our CPU implementation is 9x for the path calculating version and 130x for the no path calculating version;
- the implementation is written in the modern OpenCL standard, which makes it possible to use it on different parallel systems, provided that the implementation is properly tuned.

This new implementation provides the high efficiency needed for current biological tasks as well as for future challenges posed by ever-increasing sequences. The additional flexibility of the new OpenCL language and the option of path calculation give this implementation a unique advantage that can be exploited by a wide range of biological applications.

ACKNOWLEDGMENT

We would like to thank Prof. Dr. Reinhard Männer and DAAD (German Academic Exchange Service) [16] for providing the scholarship for D. Razmyslovich.

REFERENCES
[1] K. Robison, "Application of second-generation sequencing to cancer genomics," Briefings in Bioinformatics, pp. bbq013+, April 2010. [Online]. Available: http://dx.doi.org/10.1093/bib/bbq013
[2] M. R. Stratton, P. J. Campbell, and P. A. Futreal, "The cancer genome," Nature, vol. 458, no. 7239, pp. 719-724, April 2009. [Online]. Available: http://dx.doi.org/10.1038/nature07943
[3] "International network of cancer genome projects," Nature, vol. 464, no. 7291, pp. 993-998, April 2010. [Online]. Available: http://dx.doi.org/10.1038/nature08987
[4] H. Li and N. Homer, "A survey of sequence alignment algorithms for next-generation sequencing," Brief Bioinform, pp. bbq015+, May 2010. [Online]. Available: http://dx.doi.org/10.1093/bib/bbq015
[5] C. Ling, K. Benkrid, and T. Hamada, "A parameterisable and scalable Smith-Waterman algorithm implementation on CUDA-compatible GPUs," Application Specific Processors, 2009. SASP '09. IEEE 7th Symposium on, pp. 94-100, Jul. 2009.
[6] Y. Liu, B. Schmidt, and D. L. Maskell, "CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions," BMC Res Notes, vol. 3, p. 93, 2010. [Online]. Available: http://www.biomedsearch.com/nih/CUDASW20enhanced-Smith-Waterman-protein/20370891.html
[7] M. Farrar, "Striped Smith-Waterman speeds database searches six times over other SIMD implementations," Bioinformatics, vol. 23, no. 2, pp. 156-161, 2007.
[8] S. Manavski and G. Valle, "CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment," BMC Bioinformatics, vol. 9, no. Suppl 2, p. S10, 2008. [Online]. Available: http://www.biomedcentral.com/1471-2105/9/S2/S10
[9] T. F. Smith and M. S. Waterman, "Identification of common molecular subsequences," Journal of Molecular Biology, vol. 147, pp. 195-197, 1981.
[10] S. Henikoff and J. G. Henikoff, "Amino acid substitution matrices from protein blocks," Proceedings of the National Academy of Sciences of the United States of America, vol. 89, no. 22, pp. 10915-10919, November 1992. [Online]. Available: http://dx.doi.org/10.1073/pnas.89.22.10915
[11] M. O. Dayhoff and R. M. Schwartz, "Chapter 22: A model of evolutionary change in proteins," in Atlas of Protein Sequence and Structure, 1978.
[12] Khronos Group, 2008. [Online]. Available: http://www.khronos.org
[13] NVIDIA, NVIDIA OpenCL Programming Guide for the CUDA Architecture, Version 2.3, 2010.
[14] NVIDIA, NVIDIA OpenCL Best Practices Guide, Version 2.3, 2009.
[15] National Center for Biotechnology Information (NCBI). [Online]. Available: http://www.ncbi.nlm.nih.gov
[16] DAAD - German Academic Exchange Service. [Online]. Available: http://www.daad.de
