
Accelerating Smith-Waterman Local Sequence Alignment on GPU Cluster

Thuy T. Nguyen, Duc H. Nguyen, Phong H. Pham
Department of Information Systems
Hanoi University of Science and Technology
Hanoi, Vietnam
Email: {thuynt, ducnh, phongph}@it-hut.edu.vn

Ngoc M. Ta, Tan N. Duong, Hung D. Le
High Performance Computing Center
Hanoi University of Science and Technology
Hanoi, Vietnam
Email: {ngoctm255, dn.nhattan, hungld86}@gmail.com

Abstract. The Smith-Waterman local sequence alignment algorithm is highly accurate, but it requires very large amounts of memory and computation, which makes implementations on common computing systems impractical. In this paper, we present swGPUCluster, an implementation of the Smith-Waterman algorithm on a cluster equipped with NVIDIA GPU graphics cards (called a GPU cluster). Our test was performed on a cluster of two nodes: one node is equipped with one dual-GPU NVIDIA GeForce GTX 295 card and one Tesla C1060 card, and the other node is equipped with two dual-GPU NVIDIA GeForce GTX 295 cards. Results show that performance increases significantly compared with the previous best implementations such as SWPS3 and CUDASW++. The performance of swGPUCluster increases with the length of the query sequence, from 37.328 GCUPS to 46.706 GCUPS. These results demonstrate the great computing power of graphics cards and their high applicability to bioinformatics problems.

Keywords: sequence alignment; smith-waterman; cuda; gpu cluster.

I. INTRODUCTION

In recent decades, DNA and protein sequence data have been produced at an increasing rate, and bioinformatics has focused on developing computer programs that analyse these sequences with different approaches and extract useful information from them. Sequence alignment is one of the typical sequence data analysis problems in bioinformatics. It is a procedure that aligns and compares two or more sequences in order to find the regions in which they are most similar. Figure 1 shows the result of aligning two sequences, where gaps are inserted into the first sequence to obtain the largest region of similarity between them.

    tctgcctctgccatcat---caaccc
    | ||| ||||| |||||   ||||||
    tgtgcatctgcaatcatgggcaaccc

Figure 1. Example of an alignment for two sequences

Sequence alignment is widely applied to discover useful information about function, structure and evolution from biological sequences. Sequences that share many similarities may have the same function, the same structure or a common evolutionary origin. The problem is also a foundation for many other bioinformatics problems, such as protein structure prediction and gene annotation.

There are two kinds of sequence alignment: global alignment and local alignment. Global alignment aligns and compares the sequences over their whole length; in this case the input sequences are usually similar and of approximately equal length. Local alignment focuses on the similarity of local parts, and a local alignment algorithm tries to align the sequences so as to obtain long similar regions. Local alignment is suitable for sequences of different lengths or for sequences that contain conserved gene regions. It usually has more biological meaning than global alignment, since not all elements of the sequences take part in determining the biological characteristics of the given object.

Current research on the sequence alignment problem focuses on three main branches: dot-matrix methods, dynamic programming methods and the BLAST method. For global sequence alignment, the best-established algorithm is Needleman-Wunsch [1]. It is based on dynamic programming and scores the alignment with a substitution matrix such as PAM250 or BLOSUM62 for protein sequences. The method is mathematically guaranteed to find an optimal answer, but it requires a large amount of calculation. The BLAST algorithm [2], on the other hand, searches a sequence database for subsequences that are similar to a given sequence (the query sequence). BLAST uses a heuristic approach, so it is remarkably fast when searching gene banks. This has made BLAST the most popular tool in bioinformatics.

Although it is slower than BLAST, the Smith-Waterman algorithm is more accurate and is considered one of the most popular algorithms for the local sequence alignment problem. Because biological data sets are huge and the algorithm is based on dynamic programming, executing Smith-Waterman requires so much calculation and memory that it is impractical on common computing systems. Meanwhile, next-generation computers are moving toward multi-core architectures, which help to solve many problems that require large computing power, including problems in bioinformatics. In this paper, we parallelize the Smith-Waterman algorithm on a multi-core cluster equipped with NVIDIA graphics cards (called a GPU cluster). Our experimental results show that the algorithm runs significantly faster than on other common computing environments. This demonstrates the extremely high computing power of graphics cards and their applicability in bioinformatics.

II. THE SMITH-WATERMAN ALGORITHM AND RELATED WORKS

A. The Smith-Waterman Algorithm

In the sequence alignment problem, each alignment answer is graded according to the amount of similarity between the two given sequences. There are several grading methods, of which the linear grading method is the most popular: each pair of matched characters scores 2 points, each mismatched pair scores 0 points, and each pair containing at least one gap scores -1 point. The sum of the points of all pairs is the score of the alignment answer. Answers with high scores are good ones, and the optimal answer is the one with the highest score. The score of the optimal answer is called the similarity of the two given sequences.

In the Smith-Waterman algorithm, the alignment is carried out by aligning every pair of characters in the two sequences. The score of each pair depends on whether the two characters match or mismatch, and on the penalties for inserting or removing gaps. The result of local alignment is the pair of segments with the highest similarity between the two sequences. The algorithm uses dynamic programming to compute the alignment score and thereby identifies the optimal local alignment of two biological sequences.

Suppose that two sequences Sa and Sb have lengths la and lb. The Smith-Waterman algorithm computes the similarity score H(i,j) of the two subsequences of Sa and Sb that end at positions i and j (1 ≤ i ≤ la, 1 ≤ j ≤ lb). H(i,j) is calculated by the following recursive formula:

\[
\begin{aligned}
E(i,j) &= \max\{E(i,j-1) - e,\; H(i,j-1) - g - e\} \\
F(i,j) &= \max\{F(i-1,j) - e,\; H(i-1,j) - g - e\} \\
H(i,j) &= \max\{0,\; E(i,j),\; F(i,j),\; H(i-1,j-1) + \mathrm{matScore}(S_a(i), S_b(j))\} \\
H(i,0) &= H(0,j) = E(i,0) = F(0,j) = 0, \qquad 0 \le i \le l_a,\; 0 \le j \le l_b
\end{aligned}
\]

Figure 2. Calculation of the matrix H

Here matScore is the scoring matrix, g is the penalty for opening a gap and e is the penalty for extending a gap. The score of the optimal local alignment is the maximum value in the matrix H.

Figure 3 describes the process of calculating the matrix H. Each cell in the matrix is calculated from the values of three other cells. If we number the sub-diagonals of the matrix H, each cell on the i-th sub-diagonal depends only on cells of the (i-1)-th and (i-2)-th sub-diagonals. Therefore, cells on the same sub-diagonal do not depend on each other and can be calculated in parallel.
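The recurrence for E, F and H and the independence of cells on one sub-diagonal can be sketched in Python as follows. This is only an illustrative sketch, not the GPU code of the paper: it assumes that extending a gap costs e and opening one costs g + e, and the scoring function `simple_score` (match +2, mismatch -1) and the penalty values used with it are hypothetical choices. The first function fills H row by row; the second visits the cells one sub-diagonal at a time, deliberately in reversed order within each sub-diagonal, to show that the order inside a sub-diagonal does not matter.

```python
def simple_score(a, b):
    # Hypothetical scoring function (assumption): +2 for a match, -1 otherwise.
    return 2 if a == b else -1


def _new_matrices(la, lb):
    # H, E and F with the zero boundary row and column of the recurrence.
    return [[[0] * (lb + 1) for _ in range(la + 1)] for _ in range(3)]


def _update_cell(H, E, F, sa, sb, i, j, score, g, e):
    # One step of the recurrence; reads only cells (i, j-1), (i-1, j), (i-1, j-1).
    E[i][j] = max(E[i][j - 1] - e, H[i][j - 1] - g - e)
    F[i][j] = max(F[i - 1][j] - e, H[i - 1][j] - g - e)
    H[i][j] = max(0, E[i][j], F[i][j],
                  H[i - 1][j - 1] + score(sa[i - 1], sb[j - 1]))
    return H[i][j]


def sw_rowwise(sa, sb, score, g, e):
    """Best local alignment score, filling the matrices row by row."""
    la, lb = len(sa), len(sb)
    H, E, F = _new_matrices(la, lb)
    best = 0
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            best = max(best, _update_cell(H, E, F, sa, sb, i, j, score, g, e))
    return best


def sw_antidiagonal(sa, sb, score, g, e):
    """Same score, visiting cells one sub-diagonal (i + j = d) at a time."""
    la, lb = len(sa), len(sb)
    H, E, F = _new_matrices(la, lb)
    best = 0
    for d in range(2, la + lb + 1):
        lo, hi = max(1, d - lb), min(la, d - 1)
        # Reversed order on purpose: cells of one sub-diagonal are independent,
        # so on a GPU each of them could be handled by its own thread.
        for i in range(hi, lo - 1, -1):
            best = max(best, _update_cell(H, E, F, sa, sb, i, d - i, score, g, e))
    return best
```

Both functions keep the full matrices for readability. A memory-saving version would keep only the current sub-diagonal and the two previous ones.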

Figure 3. Illustration of calculating one cell on the i-th sub-diagonal

B. Related Works

The Smith-Waterman algorithm requires three matrices of size m × n, where m and n are the lengths of the two sequences to be compared, and dynamic programming requires all values of these three matrices to be filled in. With this amount of calculation and memory, the algorithm becomes impractical in the common case of aligning a query sequence against a large database of long sequences. To achieve high performance, most implementations of the algorithm use parallel computer architectures.

In [4] and [5], the authors parallelized the Smith-Waterman algorithm on general-purpose processors using SIMD instructions; their results show a sixfold increase in speed. Manavski and Valle [6] parallelized the algorithm on a system equipped with two NVIDIA GeForce 8800 GTX graphics cards and obtained 3.5 GCUPS (billions of cells updated per second). Compared with the previous best implementations, on a single GPU and on the SIMD architecture, the speed increased by a factor of about 2 to 30. This implementation marked the importance of the GPU multi-core architecture for bioinformatics problems.

Building on Farrar's method [5] and fully exploiting the computing power of the processing cores, the authors of [7] state that SWPS3 is the fastest vectorized implementation of the Smith-Waterman algorithm on the Cell/BE and x86/SSE architectures; on a quad-core Pentium system it reaches 15.7 GCUPS. In that implementation, the algorithm targets the multi-core architecture and is parallelized with multiple threads. In some situations, the alignment speed of SWPS3 is reported to match that of BLAST, the fastest heuristic algorithm at the time.

As mentioned above, another direction that has received intensive attention is to apply the great parallel computing power of GPUs to the Smith-Waterman algorithm. The paper [8] presents CUDASW++, another implementation of the Smith-Waterman algorithm on NVIDIA graphics cards. The single-GPU version reaches about 10 GCUPS on an NVIDIA GeForce GTX 280 graphics card, and the multi-GPU version reaches 16 GCUPS. These results show much greater performance than SWPS3 or NCBI-BLAST and demonstrate the high applicability of the GPU multi-core architecture to the sequence alignment problem.

C. Our Approach

With the exponential growth of biological sequence databases, high performance computing methods are needed to solve bioinformatics problems, especially the sequence alignment problem. Recent GPU implementations of the Smith-Waterman algorithm have shown outstanding performance compared with other methods. However, they run only on a single GPU or on a single computer equipped with several GPUs; there is no implementation for a cluster whose nodes are equipped with multiple GPUs. In this paper, we implement the Smith-Waterman algorithm on such a cluster, a GPU cluster. Our results, compared with the previous effective implementations, show a remarkable improvement in performance and demonstrate the great computing power and high applicability of GPU clusters in bioinformatics problems.

III. OVERVIEW OF GPU CLUSTER

A. CUDA and GPGPU

In recent years, the computing power of GPUs has increased significantly compared with that of CPUs. By June 2008, NVIDIA's GT200 GPU generation had reached 933 GFLOPS, more than 10 times the performance of a dual-core 3.2 GHz Intel Xeon processor of the same period. The computing power of NVIDIA graphics processors has grown massively compared with that of Intel processors. This superiority in performance does not imply superiority in technology. GPUs and CPUs are developed in two different directions: CPU technology speeds up a single task, while GPU technology tries to increase the number of tasks that can be performed in parallel. Thus, while the number of cores in common CPUs has not yet reached 8, the number of cores in a single GPU has reached 240 and is expected to increase to 500 in 2010.

GPUs pay for this computing power with a loss of flexibility in their processing cores. Currently, all processing cores on a single GPU can execute only one piece of code at a time, so a GPU is suitable only for data-parallel problems, in which the same program code is executed in parallel on several different data sets. Fortunately, most problems that require large computing power can be converted to a data-parallel form.

Besides improving GPU computing power, GPU manufacturers are also interested in providing better application development environments so that ordinary developers can program GPUs easily. NVIDIA CUDA [9] is a good example of such an effort. With CUDA, programmers can exploit GPU computing power not only for graphics processing but also for general-purpose applications. This technology is one of the important factors behind the recent GPGPU (General-Purpose computation on Graphics Processing Units) era. The following are some key features of the programming language supported by CUDA (called the CUDA language):

- The CUDA language is an extension of the C language, so it is familiar to most developers.
- CUDA code is divided into two parts: one executed on the CPU and the other executed on the GPU. The part executed on the GPU, also known as a parallel kernel, can be executed in parallel by thousands of execution threads when it is called. Each thread has a unique identifier that is used to determine its task.
- CUDA allows programmers to define an arbitrary number of parallel threads. To avoid dependence on the hardware, threads are divided into blocks whose size does not exceed 768 (GT200 generation). This allows a programmer to design a parallel program effectively without considering the hardware capability.
- Memory is hierarchically organized for effective usage:
  - Main memory: the memory area for CPU code. Only this code can access and modify information here.
  - Global memory: the memory area that all GPU threads can access. Programmers can move data from main memory to global memory using functions from a basic CUDA library. This memory is often used to store the inputs and outputs of parallel threads on GPUs.
  - Shared memory: the memory area that only the threads of one block can access. This memory is integrated on-chip, so accessing data in it is much faster than in global memory. It is often used to store temporary data shared among the threads of a block in order to speed up memory access.
  - Local memory: the memory area allocated to the local variables of each thread; a GPU thread cannot access the local memory of other threads.

With the ability to perform data parallelism on so many threads, the GPU is an appropriate choice for implementing the Smith-Waterman algorithm, where each thread can calculate one cell on a sub-diagonal of the matrix H.

B. GPU Cluster

The growth of the computing power of graphics cards, together with the very rapid increase in the size of the input data for problems that require high performance computing, has led to the need to combine the computing power of multiple GPUs on different nodes. A GPU cluster addresses this need. A GPU cluster is defined as a cluster of computers in which each node is equipped with one or more GPUs. By fully exploiting the computing power of the GPUs through application development environments, very fast calculations can be performed on a GPU cluster.

Basically, such a system includes hardware components such as CPUs and GPUs, and a network connection such as Gigabit Ethernet to connect the nodes. The software required on a GPU node includes the operating system, the GPU driver and a parallel programming interface such as MPI.

In recent years, many GPU cluster systems have been deployed, such as the installations of GraphStream [10], but they are only virtual systems. Projects that have deployed GPU computing nodes include the 160-node GPU cluster "DQ" at LANL [11] and the 16-node cluster "QP" at NCSA [12], both based on NVIDIA's QuadroPlex technology. These installations mainly provide experimental results in the field of high performance production computing.

To harness the great computing capability of the GPU cluster, we embed the CUDA framework in a message passing interface (MPI) environment. Each node in the cluster is responsible for sending data and computation-intensive parallel tasks to its GPUs, which leaves the CPU free to perform network communication between nodes.
Compiling and running MPI programs is not difficult if NVIDIA's compiler, nvcc, is used to compile the mixed CUDA and MPI code together. Here, we mix MPI code into CUDA source files: since CUDA is an extension of the C language and nvcc passes the host code to the host C compiler, the MPI headers and libraries only need to be supplied on the command line. For example, to compile source code containing both CUDA and MPI code, we use nvcc and include the MPI headers and libraries:

    nvcc mpi_example.cu -I $HOME/openmpi/include -L $HOME/openmpi/lib -lmpi

The next section presents our implementation of the Smith-Waterman algorithm on a GPU cluster system and some experimental results.

IV. IMPLEMENTING THE SMITH-WATERMAN ALGORITHM ON GPU CLUSTER

A. Strategy

Based on the description of the Smith-Waterman algorithm above and an analysis of which parts of the matrix calculation can be parallelized, we employ on the GPU cluster the two levels of data parallelism proposed by Yongchao Liu, Douglas L. Maskell and Bertil Schmidt in [8]. At the first level, the algorithm aligns different input sequences in parallel. At the second level, the cells on one sub-diagonal of the matrix H are calculated simultaneously. If the alignment of two sequences is regarded as a single task, the first level is inter-task parallelism and the second level is intra-task parallelism (splitting a task into several sub-tasks).

- Inter-task parallelism: each task is assigned to one execution thread. Within a block of threads, tasks are performed simultaneously by different threads.
- Intra-task parallelism: each task is assigned to one block. Each thread in the block calculates one cell on the i-th sub-diagonal from the values of cells on the (i-1)-th and (i-2)-th sub-diagonals. After the calculation, the sub-diagonal vectors are swapped in order to calculate the next sub-diagonal.

Inter-task parallelism requires much more memory but gives better performance, so it is suitable for aligning short sequences. In contrast, intra-task parallelism requires less memory and has lower performance, so it is suitable for longer sequences. To choose between the two methods, we use a threshold value on the sequence length.

B. Parallelizing the Algorithm on GPU Cluster

Based on the two-level parallelization described above, we distribute the data over the GPU cluster so that each GPU of each node processes part of it. The cluster uses a shared data directory that is synchronized between nodes. This directory stores the biological sequence data.

Suppose that the GPU cluster includes n_client nodes and each node is equipped with n_device GPUs. First, the sequence database is divided into n_client parts based on the size of the data. To avoid conflicts when accessing the files that contain the sequence data, we use a data locker db_lock and a data status record db_stat for each access by a node. The data locker and the data status record are unique and are transferred between nodes. When a node finds that the data is not locked, it reads the data status record to determine the position at which the previous node stopped and continues loading data into memory from there. When reading is complete, the node updates the data status record and releases the data locker.

    Initialize MPI, get the number of nodes;
    db_block_size = totalDBSize / n_client;
    while (data is still locked) { wait; }
    Create data locker;
    Read the data status record to determine the data position;
    temp_size = 0;
    while (temp_size < db_block_size && dbIsNotEmpty) {
        Get (Seq, SeqName, SeqLen) in DataFile;
        temp_size += SeqLen;
    }
    Save the data status record;
    Unlock data;

Figure 4. Pseudo code of distributing data into nodes
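The size-based distribution of Figure 4, combined with the length threshold from the Strategy subsection, can be sketched in Python as follows. This is a simplified, single-process sketch under stated assumptions: it ignores MPI, the data locker and the data status record, and the function names `split_by_size` and `split_by_threshold` are hypothetical and not taken from swGPUCluster.

```python
def split_by_size(seq_lengths, n_client):
    """Split sequence indices into n_client contiguous parts of similar total length."""
    if n_client < 1:
        raise ValueError("n_client must be at least 1")
    block_size = sum(seq_lengths) / n_client  # plays the role of db_block_size
    parts, current, current_size = [], [], 0
    for idx, length in enumerate(seq_lengths):
        current.append(idx)
        current_size += length
        # Close the part once it reaches the target size; the last part stays
        # open so that it receives whatever remains.
        if current_size >= block_size and len(parts) < n_client - 1:
            parts.append(current)
            current, current_size = [], 0
    parts.append(current)
    while len(parts) < n_client:  # pad when some nodes receive no data
        parts.append([])
    return parts


def split_by_threshold(seq_lengths, indices, threshold):
    """Sort a node's sequences by length and split them at the threshold.

    Returns (inter, intra): sequences shorter than the threshold go to
    inter-task parallelism, the others to intra-task parallelism.
    """
    ordered = sorted(indices, key=lambda i: seq_lengths[i])
    inter = [i for i in ordered if seq_lengths[i] < threshold]
    intra = [i for i in ordered if seq_lengths[i] >= threshold]
    return inter, intra
```

For example, with the hypothetical lengths `[50, 4000, 120, 3072, 10]` and two nodes, `split_by_size` assigns the first two sequences to one node and the remaining three to the other, and each node can then call `split_by_threshold` on its own part.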

Before the data is divided among the memories of the GPUs, it is sorted in ascending order of sequence length. This ordering serves two purposes. The first is to separate the data into two blocks that are handled in the two ways described above. The separation is controlled by a runtime parameter threshold: if the sequence length is less than threshold, the sequence is assigned to the first block; otherwise it belongs to the second block. Each block is further divided into n_device parts corresponding to the GPUs, so that each GPU handles similar blocks of data in both ways. The second purpose is that the threads of one block align sequences of approximately equal length, so their runtimes are approximately equal, which increases the performance of the implementation.

The program consists of two kernels that perform the two levels of parallelism on the GPU cluster (one kernel for inter-task parallelism and the other for intra-task parallelism). After loading the sequence data into the memory of each GPU, the program calls the kernels to align the sequences. The results are then saved in the memory of the nodes, and each node writes its results to temporary files. At node 0, the program collects the results from all nodes. Below is pseudo code for the implementation of the kernels on the GPU cluster:

    /******************** ASSUMPTIONS ********************
     threads:    the number of threads executed in parallel;
     nSM:        the number of multi-processors per GPU;
     interSeqNo: the number of sequences per GPU whose lengths
                 are less than the value threshold;
     intraSeqNo: the number of sequences per GPU whose lengths
                 are not less than the value threshold
     *****************************************************/
    /* round up so that a final partial block is not dropped */
    n_batch = (interSeqNo + threads - 1) / threads;
    while (n_batch > 0) {
        n = min(n_batch, nSM);
        dim3 grids (n, 1, 1);
        dim3 blocks (threads, 1, 1);
        inter_kernel<<<grids, blocks>>>(interSeqs, interSeqNo, d_result);
        n_batch -= n;
    }
    if (intraSeqNo > 0) {
        maxSeqOnePass = (value illegible in the source);
        n_batch = intraSeqNo;
        while (n_batch > 0) {
            n = min(n_batch, maxSeqOnePass);
            dim3 grids (n, 1, 1);
            dim3 blocks (threads, 1, 1);
            intra_kernel<<<grids, blocks>>>(intraSeqs, intraSeqNo, d_result);
            n_batch -= n;
        }
    }
    transferToHostResult(d_result);

Figure 5. Pseudo code of implementing the kernels on GPU Cluster

V. EXPERIMENTAL RESULTS

The implementation of the Smith-Waterman algorithm on the GPU cluster has been deployed and tested at the High Performance Computing Center, Hanoi University of Science and Technology. The test environment is a GPU cluster of two nodes: one node is equipped with one dual-GPU NVIDIA GeForce GTX 295 card and one Tesla C1060 card, and the other node is equipped with two dual-GPU NVIDIA GeForce GTX 295 cards.

To remove the dependency on the query sequences and the databases used in different tests, cell updates per second (CUPS) is commonly used as a performance measure in bioinformatics. CUPS is the number of cells of the matrix H calculated per second (including the calculation of the intermediate values of the matrices E and F). Formula (1) calculates the CUPS value of one sequence alignment answer:

\[
\mathrm{cups} = \frac{qLen \times dbLen}{t} \qquad (1)
\]

where qLen is the length of the query sequence, dbLen is the length of the subject sequence and t is the runtime of the program. The value t includes the time to load data from main memory to device memory, the calculation time on the GPUs and the time to transfer the results to the CPU.

In our test, we used a set of query sequences with lengths from 100 to 5000 and the biological sequence database Uniprot release 2010_05 (Apr 20, 2010), which includes 516,080 sequences and 181,676,505 amino acids. With this database and a threshold value of 3072, 515,472 sequences are aligned by inter-task parallelism and the remaining 608 are aligned by intra-task parallelism. swGPUCluster was tested on two nodes, Node0 and Node1, using multiple GPUs (three dual cards GTX 295 with 6 GPUs in total, and one Tesla C1060 card with 1 GPU). On our GPU cluster, the maximum performance is achieved when the block size threads = 256 and the grid size blocks = 30 (the number of streaming multiprocessors of the GPU).

The performance of swGPUCluster increases with the length of the query sequence, from a minimum of 37.328 GCUPS to a maximum of 46.706 GCUPS. These results are given in Table I.

We compared the results of swGPUCluster with other implementations of the Smith-Waterman algorithm, namely cudaSW++ and swps3. cudaSW++ was tested on one GTX 295 GPU; its minimum performance is 8.387 GCUPS and its maximum is 9.232 GCUPS. Compared with cudaSW++ on a single GPU, swGPUCluster is about 4.4 to 5 times faster. swps3 was tested on an x86/sse2 platform consisting of one node equipped with a Core Quad Q8400 2.66 GHz processor (4 cores) and 8 GB RAM, using one thread or four threads. swGPUCluster is about 13.8 to 22.8 times faster than single-core swps3 x86/sse2 and approximately 3.4 to 11.4 times faster than multi-core swps3 x86/sse2, as shown in Figure 6.
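Formula (1) can be sketched as a small Python helper that reports the result in GCUPS. The function name `gcups` and the numbers in the usage line are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def gcups(q_len, db_len, seconds):
    """Cell updates per second from formula (1), scaled to billions (GCUPS)."""
    if seconds <= 0:
        raise ValueError("runtime must be positive")
    return (q_len * db_len / seconds) / 1e9


if __name__ == "__main__":
    # Hypothetical run: a 1000-residue query against 2,000,000 database
    # residues in 0.5 seconds.
    print(gcups(1000, 2_000_000, 0.5))
```

Because t in formula (1) includes data transfer as well as kernel time, the measured runtime passed to such a helper should be the end-to-end time.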

TABLE I. RESULTS OF THE IMPLEMENTATION OF THE SMITH-WATERMAN ALGORITHM ON GPU CLUSTER

    Query     Length    Time (s)     GCUPS
    P02232       144     0.779393    37.328
    P01111       189     0.944164    39.204
    P14942       222     1.078783    40.795
    P07327       375     1.721378    42.945
    P25705       553     2.532260    43.419
    P21177       729     3.224185    44.907
    P27895      1000     4.337310    45.849
    P07756      1500     6.456398    46.065
    P04775      2005     8.579664    46.355
    P19096      2504    10.681426    46.530
    P0C6B8      3564    15.174918    46.614
    P08519      4548    19.348871    46.653
    P33450      5147    21.894938    46.690
    Q9UKN1      5478    23.294546    46.706

[Plot: performance (vertical axis, 0 to 50) against query length for swGPUCluster, swps3-x86/sse2 multi-core, cudaSW++ single GPU and swps3-x86/sse2 single-core]

Figure 6. Comparison of performance of swGPUCluster with cudaSW++ and swps3-x86/sse2.

VI. CONCLUSION

In this paper, we present swGPUCluster, an implementation of the Smith-Waterman sequence alignment algorithm on a GPU cluster system consisting of two nodes equipped with multiple GPUs (three dual cards GTX 295 with 6 GPUs in total, and one Tesla C1060 card with 1 GPU). With the biological sequence database Uniprot release 2010_05 (Apr 20) as test input and the optimal configuration, the performance of swGPUCluster increases with the length of the query sequence from a minimum of 37.328 GCUPS to a maximum of 46.706 GCUPS. swGPUCluster performs significantly better than implementations previously developed for GPUs or multi-core architectures, such as swps3 and cudaSW++. Our results show that GPUs are highly applicable to speeding up algorithms in bioinformatics, provided that the characteristics of the computing hardware are exploited well. The outstanding performance also shows that the performance of GPUs is increasing much faster than that of multi-core CPUs.

REFERENCES

[1] www2.cs.uh.edu/~zhenzhao/Review/alignment.htm
[2] http://blast.ncbi.nlm.nih.gov/Blast.cgi.
[3] http://en.wikipedia.org/wiki/SmithWaterman_algorithm.
[4] Rognes T, Seeberg E: "Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors". Bioinformatics 2000, 16(8):699-706.
[5] Farrar M: "Striped Smith-Waterman speeds database searches six times over other SIMD implementations". Bioinformatics 2007, 23(2):156-161.
[6] Manavski SA, Valle G: "CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment".
[7] Szalkowski A, Ledergerber C, Krahenbuhl P and Dessimoz C: "SWPS3 - fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2". BMC Research Notes 2008, 1:107.
[8] Yongchao Liu, Douglas L Maskell and Bertil Schmidt: "CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units". BMC Research Notes 2009, 2:73.
[9] NVIDIA. http://www.nvidia.com/object/cuda_home_new.html
[10] (2009) GraphStream, Inc. website. [Online]. Available: http://www.graphstream.com/.
[11] D. Göddeke, R. Strzodka, J. Mohd-Yusof, P. McCormick, S. Buijssen, M. Grajewski, and S. Turek, "Exploring weak scalability for FEM calculations on a GPU-enhanced cluster," Parallel Computing, vol. 33, pp. 685-699, Nov 2007.
[12] M. Showerman, J. Enos, A. Pant, V. Kindratenko, C. Steffen, R. Pennington, W. Hwu, "QP: A Heterogeneous Multi-Accelerator Cluster," in Proc. 10th LCI International Conference on High-Performance Clustered Computing, 2009. [Online]. Available: http://www.ncsa.illinois.edu/~kindr/papers/lci09_paper.pdf.

TYm tZt-&[i \] ch^nh 5_c cao% thu`t to_n sZ 5a trbnh tc cudc 4ed Smith-Waterman \fi hgi m]t $ung lhing 4] nh[ #j shd t^nh to_n rkt l[n % ljm cho #ilc trimn 8hai trnn hl thong m_y t^nh theng thhpng trq

nnn ^t thcc ta. Trong 4ji 4_o njy% chrng tei trbnh 4jy s3GPUCluster - casi \tdt thu`t to_n SmithWaterman trnn cluster >m]t nhYm? \hic trang 4u car$ \v hwa N&"D"A GPU >\hic gwi lj m]t cxm GPU?. Thy nghilm cza chrng tei \hic thcc hiln trnn m]t cluster gesm hai nrt% m]t nrt \hic trang 4u m]t car$ \v hwa 8{ N&"D"A Ge6orce GTF -M.% medt car$ Tesla C7=0=% #j nrt cfn l|i \hic trang 4u - car$ \v hwa 8{ N&"D"A Ge6orce GTF -M.. Cat qu} cho thky hilu sukt \~ ttng \_ng 8m so #[i #ilc trimn 8hai tot nhkt trh[c \y nhh SWPSB hoc CUDASW. &ilc thcc hiln s3GPUCluster \~ ttng cng #[i \] $ji cza chui truy #kn% t BG%B-/ \an E0%G=0 GCUPS GCUPS. Nhng 8at qu} njy chng minh sc m|nh \iln to_n l[n cza car$ \v hwa #j ng $xng cao cza chung trong #ilc gi}i quyat #kn \ sinh hwc. 7. Gii thindu Trong m]t #ji th` 8 gn \y% trbnh tc $ lilu cua DNA hoc rotein \~ \hic tbm thky thhpng 5uynn hn% tin sinh hwc \~ \hdc chr trwng h_t trimn c_c chhng trbnh m_y t^nh \m hn t^ch nhng trbnh tc njy #[i cac hhng h_ tia c`n 8h_c nhau% sau \Y gi}i n{n theng tin hu ^ch. Trbnh tc sZ 5a lj m]t trong nhhng trbnh tc hn t^ch $ lilu \imn hbnh trong tin sinh hwc. NY lj m]t sc linn 8at #j so s_nh gia hai hay nhiu trbnh tc \m tbm ra c_c \c t^nh \vng nhkt nhkt cza nhng trbnh tc njy. Hbnh 7 lj m]t #^ $x # 8at qu} cza m]t trisnh thd st 5n hai chui mj m]t so nhng 8ho}ng trong \hic chn #jo c_c chuei trbnh tc \u tinn \m \|t \hic c_c #usng gieng nhau l[n nhkt gia chrng. &kn \ lj trisnh thd st 5n _ $xng redng rai cho #indc 8h_m h_ theng tin hu ^ch # chc ntng% cku trrc #j sc tian hYa t c_c trbnh tc sinh hwc. Trbnh tc cY nhiu \imm thng \vng cY thm cY chc ntng thng tc% c_c cku trrc hay nguvn goc giong nhau cza sc tian hYa. &kn \ njy cng lj m]t nn t}ng cho nhiu #kn \ tin sinh hwc 8h_c nhh( $c \o_n cku trrc rotein% chr th^ch gen. CY hai lo|i trisnh thd st 5n ( st 5n toasn cudc #as st 5n cudc 4ed . st 5n toasn cudc \fi hgi sZ 5a #j so s_nh # trbnh tc medt cach tojn 4]. 
Trong lo|i njy% trbnh tc \u #jo thhpng thng tc #j cY \] $ji 5k 5 4ng nhau. st 5n cudc 4ed t` trung #jo t^nh thng \vng cza c_c 4] h`n cudc 4ed. C_c thu`t to_n linn 8at cudc 4ed thhpng co gZng sZ 5a trbnh tc \m cY \hic hn gieng nhau $ji nht. Linn 8at cudc 4ed h hi #[i trbnh tc co \] $ji 8h_c nhau hoc hn $c tr gen. Theng thhsng% Sc linn 8at njy cY m]t ngha quan trwng trong sinh hwc hn sc linn 8at tojn cu% 8heng h}i tkt c} c_c yau to cza sc st 5n tham gia trong #ilc tbm 8iam cac \c \imm sinh hwc cza \oi thing \hdc \ha ra. Hiln nay% cY nhiu nghinn cu gi}i quyat c_c #kn \ linn 8at chui% chz yau t` trung #jo 4a ngjnh ch^nh( hhng h_ sy $xng ma tr`n \imm% hhng h_ l` trbnh \]ng #j hhng thc :LAST. oi #[i c_c #kn \ #ns trisnh thd st 5n tojn cudc% c_c thu`t to_n \hdc h_t trimn nhkt lj Nee$leman-Wunsch ;7<. y lj m]t hhng ha st 5n tojn cudc $ca #aso l` trbnh \]ng% \m t^nh \imm cho qu_ trbnh linn 8at% 4ng c_ch sy $xng thay tha ma tr`n PA,-.= hoc :L2SU,0- cho c_c trbnh tc rotein. Phhng h_ njy \}m 4}o% 5et #ns hhng $indn to_n hwc cza quan \imm cho rng #ilc tbm 8iam m]t cu tr} lpi toi hu #[i m]t c cha cx thm lj cY thm% nhhng nY cY m]t so lhing l[n c_c t^nh to_n. ,t 8h_c% c_c thu`t to_n :LAST ;-< cho h{ tbm 8iam trisnh thd con >su4sequences? >m]t c sq $ lilu trbnh tc? gieng nhh trbnh tc \ha ra >chui truy #kn?. :LAST sy $xng hhng h_ heuristic #is #`y toc \] \tdc 4indt nhanh 8hi thcc hiln cho c_c ngn hjng gen. iu njy \~ ljm :LAST tr thasnh ceng cx h 4ian nhkt trong tin sinh hwc. ,c $ toc \] thk hn cza thu`t to_n :LAST% nhhng #[i m]t \] ch^nh 5_c cao hn% Smith - Waterman thu`t to_n \hic 5em lj m]t trong nhng thu`t to_n h 4ian nhkt cza #ilc gi}i quyat c_c #n \ns #ns trisnh thd st 5n cudc 4ed . &indc casi \tdt thudt toan Smith - Waterman \fi hgi m]t lhing l[n c_c t^nh to_n #j 4] nh[ lhu tr% $o $ lilu sinh hodc 8hng lv cng #[i c_c thu`t to_n l` trbnh \]ng% qu_ trbnh thcc hiln lj 8heng thm chk nh`n cho hl thong m_y t^nh theng thhpng. 
One trend in the development of next-generation computers is the multi-core architecture, which helps solve many problems that require large computing power, including those of bioinformatics. In this paper, we parallelize the Smith-Waterman algorithm on a multi-core cluster equipped with NVIDIA graphics cards (called a GPU cluster). The experimental results show that the execution speed of the algorithm increases significantly compared with implementations on other common computing environments. This demonstrates the very high computing power of graphics cards and their applicability in bioinformatics.
II. SMITH-WATERMAN ALGORITHM AND RELATED WORKS

A. The Smith-Waterman algorithm

In sequence alignment problems, each alignment is scored according to the similarity between the two given sequences. There are several scoring schemes, of which the linear scheme is the most common: each pair of matching characters scores 2 points, each mismatching pair scores 0 points, and each pair containing at least one gap scores -1 point. The sum of the scores of all pairs is the score of the alignment. An alignment with a high score is a good one, and the optimal alignment is the one with the highest score. The score of the optimal alignment is called the similarity of the two given sequences.

In the Smith-Waterman algorithm, the alignment is built by arranging every character of the two sequences. The score of each pair depends on whether the two characters match, whether they mismatch, and whether a gap is inserted or removed (a penalty). The result of a local alignment is the pair of segments that are most similar between the two sequences. The algorithm uses dynamic programming to compute the score of the alignment.

The Smith-Waterman algorithm was developed to determine the optimal local alignment of two biological sequences by computing their similarity with dynamic programming. Suppose the two sequences Sa and Sb have lengths la and lb. The algorithm computes the similarity score of the two sequences using a matrix H(i, j) defined over the subsequences of Sa and Sb that end at positions i and j (0 ≤ i ≤ la, 0 ≤ j ≤ lb). H(i, j) is computed by the following recurrence:
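$$
\begin{aligned}
E(i,j) &= \max\{\,E(i,j-1) - g,\; H(i,j-1) - g - e\,\} \\
F(i,j) &= \max\{\,F(i-1,j) - g,\; H(i-1,j) - g - e\,\} \\
H(i,j) &= \max\{\,0,\; E(i,j),\; F(i,j),\; H(i-1,j-1) + matScore[S_a(i), S_b(j)]\,\} \\
H(i,0) &= H(0,j) = E(i,0) = F(0,j) = 0, \qquad 0 \le i \le l_a,\; 0 \le j \le l_b
\end{aligned}
$$

Figure 2. Calculation of the matrix H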

Here matScore is the scoring matrix, g is the penalty for opening a gap, and e is the penalty for extending a gap. The maximum score of the local alignment is the maximum value in the matrix H.

Figure 3 describes how the matrix H is computed. It shows that each cell of the matrix is computed from the values of three other cells. If the anti-diagonals of H are numbered, every cell on the i-th anti-diagonal depends only on cells of the (i-1)-th and (i-2)-th anti-diagonals. Cells on the same anti-diagonal therefore do not depend on each other and can be computed in parallel.
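As a minimal sequential sketch of the recurrence and of the anti-diagonal order described above, the following C function fills H, E and F one anti-diagonal at a time. It is an illustration under stated assumptions, not the authors' code: the names `sw_score_wavefront`, `wf_max` and `wf_sub` are hypothetical, the +2/-1 match/mismatch scoring stands in for matScore, and the penalties g and e are applied literally as in the recurrence, assuming the operators lost in the printed formula are minus signs.

```c
#include <stdlib.h>
#include <string.h>

static int wf_max(int a, int b) { return a > b ? a : b; }

/* Hypothetical scoring, for illustration only: +2 match, -1 mismatch.
   The paper uses a substitution matrix (matScore) here. */
static int wf_sub(char a, char b) { return a == b ? 2 : -1; }

/* Local alignment score of sa and sb, filled anti-diagonal by anti-diagonal.
   g and e are the two penalties of the recurrence, applied as printed.
   Returns -1 if memory allocation fails. */
int sw_score_wavefront(const char *sa, const char *sb, int g, int e)
{
    size_t la = strlen(sa), lb = strlen(sb), w = lb + 1;
    size_t cells = (la + 1) * w;
    int *H = calloc(cells, sizeof *H);   /* zero-filled: boundary conditions */
    int *E = calloc(cells, sizeof *E);
    int *F = calloc(cells, sizeof *F);
    int best = 0;

    if (!H || !E || !F) { free(H); free(E); free(F); return -1; }

    /* Anti-diagonal d holds the cells with i + j == d. */
    for (size_t d = 2; d <= la + lb; ++d) {
        size_t i_lo = d > lb ? d - lb : 1;
        size_t i_hi = d - 1 < la ? d - 1 : la;
        /* Cells of one anti-diagonal are independent of each other, so on a
           GPU each iteration of this loop could be given to its own thread. */
        for (size_t i = i_lo; i <= i_hi; ++i) {
            size_t j = d - i, c = i * w + j;
            E[c] = wf_max(E[c - 1] - g, H[c - 1] - g - e);   /* from (i, j-1) */
            F[c] = wf_max(F[c - w] - g, H[c - w] - g - e);   /* from (i-1, j) */
            H[c] = wf_max(wf_max(0, H[c - w - 1] + wf_sub(sa[i - 1], sb[j - 1])),
                          wf_max(E[c], F[c]));
            if (H[c] > best) best = H[c];
        }
    }
    free(H); free(E); free(F);
    return best;
}
```

The sketch keeps all three full matrices for clarity; only the two previous anti-diagonals are actually read, which is why an implementation can get by with a few diagonal-sized buffers instead.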

B. Related works

The Smith-Waterman algorithm requires three matrices of size M×N, where m and n are the lengths of the two sequences to be compared. The dynamic programming algorithm also requires every value of the three matrices to be filled in. With such requirements on computation and memory, the algorithm becomes impractical in the common case where a query sequence must be aligned against a large database of sequences.

To achieve high performance, most implementations of the algorithm use parallel computer architectures. In [4], [5], the authors parallelized the Smith-Waterman algorithm on general-purpose processors using the SIMD architecture. Their results show a sixfold increase in execution speed. Manavski SA and Valle G, in [6], parallelized the algorithm on a system equipped with 2 NVIDIA GeForce 8800 GTX graphics cards and obtained 3.5 GCUPS (billions of cell updates per second). Compared with the best previous implementations, on a single GPU and on the SIMD architecture, the execution speed increased by about 2 to 30 times. This clearly marked the importance of the multi-core GPU architecture for bioinformatics problems.

Building on Farrar's method [5] and fully exploiting the computing power of the processing cores, the authors of [7] claim that SWPS3 is the fastest vectorized implementation of the Smith-Waterman algorithm on the Cell/BE and x86/SSE2 architectures; on a computer with a quad-core Pentium it can reach 15.7 GCUPS. In that implementation, the algorithm targets the multi-core architecture and is parallelized with multi-threading. In some cases, the alignment speed of SWPS3 is comparable to that of BLAST, the fastest heuristic algorithm at the time.

As mentioned above, one direction to consider when implementing the Smith-Waterman algorithm is to apply the massively parallel computing power of GPUs to problems that require a large amount of computation. The paper [8] presents CUDASW++, an implementation of the Smith-Waterman algorithm on NVIDIA graphics cards. The single-GPU version reaches about 10 GCUPS on an NVIDIA GeForce GTX 280, and the multi-GPU version running on a dual-GPU card reaches 16 GCUPS. These results show much higher performance than SWPS3 or NCBI-BLAST and demonstrate the applicability of the multi-core GPU architecture to sequence alignment problems.

C. Our approach

With the exponential growth of biological sequence databases, high-performance computing is considered necessary for solving bioinformatics problems, especially the sequence alignment problem.
Recent results that use GPUs to run the Smith-Waterman algorithm show performance superior to other approaches. However, those implementations run only on a single GPU or on one computer equipped with multiple GPUs; there is no implementation on a cluster whose nodes are equipped with multiple GPUs. In this paper, we implement the Smith-Waterman algorithm on a cluster of multi-GPU nodes (GPUCluster). Compared with previous efficient implementations, the results show a significant performance improvement and demonstrate the great computing power and high applicability of GPU clusters in bioinformatics.

III. GPU CLUSTER OVERVIEW

A. CUDA and GPGPU

In recent years, the computing power of graphics processing units (GPUs) has increased significantly compared with CPUs. As of June 2008, the NVIDIA GPU of the GT200 generation had reached 933 GFLOPS, more than 10 times the figure for the dual-core Intel Xeon 3.2 GHz of the same period. Figure 2 shows the large growth in the computing power of NVIDIA graphics processors compared with Intel processors.

This difference in performance does not imply technological superiority. GPUs and CPUs have developed in two different directions: CPU technology speeds up a single task, while GPU technology tries to increase the number of tasks that can be executed in parallel. Thus, while the number of cores in common CPUs has not yet reached 8, the number of cores in a single GPU has reached 240 and is expected to rise to 500 in 2010.

As the price of this computing power, the GPU loses flexibility in its processing cores. At present, all cores of a single GPU can execute only one piece of code at a time, so GPUs suit only data-parallel problems, in which the same program code is executed in parallel on different sets of data. Fortunately, most problems that require large computing power can be converted to a data-parallel form.
Besides improving GPU computing capability, GPU manufacturers also work on providing better application development environments so that general developers can program GPUs easily. NVIDIA CUDA [9] is a good example of such an effort. With CUDA, programmers can exploit the computing power of GPUs not only for graphics applications but also for general-purpose applications. This technology is one of the key factors behind the recent opening of the GPGPU (General-Purpose computing on Graphics Processing Units) era. The main features of the programming language supported by CUDA (called the CUDA language) are as follows.

The CUDA language is an extension of the C language, so it is familiar to most developers. A CUDA program is divided into two parts: one executed on the CPU and one executed on the GPU. The part executed on the GPU, also called the parallel kernel, can be executed in parallel by thousands of threads when it is called. Each thread has a unique identifier that it uses to determine its own task. CUDA lets programmers specify an arbitrary number of parallel threads, but to avoid hardware dependence, threads are divided into blocks whose size does not exceed 768 (GT200 generation). This allows a programmer to design an efficient parallel program without being concerned with the capability of the hardware.

The memory hierarchy of the system is organized for efficient use:
- Main memory: the memory area for CPU code. Only this code can access and modify the information stored here.
- Global memory: the memory area that all GPU threads can access. Programmers can move data from main memory to global memory using functions of the basic CUDA library. This memory is usually used to store the input and output of the parallel threads on the GPU.
- Shared memory: the memory area that only the threads of one block can access. This memory is integrated on the chip, so data access is much faster than in global memory. It is usually used to store temporary data shared among the threads of a block in order to speed up memory use.
- Local memory: the memory area allocated to the local variables of each thread; other threads cannot access it.

With this capacity for massively data-parallel execution by threads, the GPU is a suitable choice for running the Smith-Waterman algorithm, in which each thread can compute one cell on an anti-diagonal of the matrix H.

B. GPUCluster

The growth in the computing power of graphics cards, together with the very rapid growth in the size of the input data of problems that require high-performance computing, has led to the need to combine the computing power of multiple GPUs on different nodes. A GPU cluster system helps solve these problems. A GPU cluster is defined as a group of computers in which each node is equipped with one or more GPUs.
By fully exploiting the computing power of GPUs through application development environments, very fast computation can be carried out on a GPU cluster. Basically, such a system consists of hardware components such as CPUs and GPUs, plus a network connection such as Gigabit Ethernet to connect the nodes. The software required on a GPU node includes the operating system, the GPU driver on each node and a parallel programming interface such as MPI.

In recent years, many GPU cluster systems have been deployed, such as the installations of GraphStream [10], but they are only virtual systems. Some projects that deploy GPU compute nodes include the 160-node "DQ" GPU cluster at LANL [11] and the 16-node "QP" cluster at NCSA [12], both based on NVIDIA QuadroPlex technology. These installations mainly serve to provide experimental results for high-performance production use.

To exploit the large computing power of the GPU cluster, we use lightweight threads of the CUDA framework through the MPI interface. Each node of the cluster has its own task of sending data and specialized parallel work to the GPU, which frees the CPU to carry out network communication between the nodes. Compiling and running an MPI program is not difficult with the NVIDIA compiler nvcc, which compiles CUDA and MPI code together. Here we mix the MPI calls into the CUDA source files, because CUDA is an extension of the C language and nvcc passes the host code on to the underlying C compiler. For example, to compile source code that contains both CUDA and MPI code, we use nvcc and add the MPI library and headers:
    nvcc mpi_example.cu -I $HOME/open-mpi/include -L $HOME/open-mpi/lib -lmpi

The following sections present our implementation of the Smith-Waterman algorithm on the GPUCluster system and some experimental results.

IV. IMPLEMENTATION OF THE SMITH-WATERMAN ALGORITHM ON GPU CLUSTER

A. Strategy

Given the description of the Smith-Waterman algorithm above and the analysis of what can be parallelized when computing the matrix, we use two approaches on the GPU with two levels of data parallelism, as proposed by Yongchao Liu, Douglas L Maskell and Bertil Schmidt in [8]. At the first level, the algorithm aligns different input sequences in parallel. At the second level, the values of the cells on one anti-diagonal of the matrix H are computed concurrently. If the alignment of two sequences is regarded as one task, the first level can be seen as inter-task parallelization and the second as intra-task parallelization (one task is split into several subtasks).

Inter-task parallelization: each task is assigned to one thread. Within a block of threads, the tasks are executed concurrently by different threads.

Intra-task parallelization: each task is assigned to one block. Each thread of the block computes one cell on the i-th anti-diagonal from the values of the cells on the (i-1)-th and (i-2)-th anti-diagonals. When the computation finishes, the anti-diagonal vectors are swapped in order to compute the values of the next anti-diagonals.

Inter-task parallelization requires more memory but gives better performance, so it suits the alignment of short sequences. Intra-task parallelization, in contrast, requires less memory and gives lower performance, so it suits longer sequences. To separate the two methods, we use a threshold on the sequence length that decides which method is used.

B. Parallelizing the algorithm on GPUCluster

Based on the two-level parallel approach described above, we distribute the data across the GPU cluster so that it is processed on every GPU of every node. The cluster uses a shared data directory that is synchronized among the nodes. This directory stores the biological sequence data.
Suppose the GPU cluster consists of n_client nodes, each equipped with n_device GPUs. First, the sequence database is divided into n_client parts based on the size of the data. To avoid conflicts when the files containing the sequence data are accessed, we use a data lock, db_lock, and a data status record, db_stat, for every access by a node. The data lock and the data status record are unique and are passed between the nodes. When a node finds that the data is not locked, it locks the data and reads the status record to determine the previous position, from which it continues loading data into memory. When reading is complete, the node updates the status record, saves it and unlocks the data.
    Initialize MPI, get the number of nodes;
    db_block_size = totalDBSize / n_client;
    while (data still is locked) {
        wait;
    }
    Create data locker;
    Read record of data status to determine data position;
    temp_size = 0;
    while (temp_size < db_block_size && dbIsNotEmpty) {
        Get (Seq, SeqName, SeqLen) in DataFile;
        temp_size += SeqLen;
    }
    Save the data status record;
    Unlock data;
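The inner loop of the partitioning pseudocode can be sketched in plain C as follows. This is a simplified illustration: the function name `take_block` and the array-of-lengths representation are assumptions, and the locking and the status record are left out, so only the "read records until the block is full" step is shown.

```c
#include <stddef.h>

/* Returns the index one past the last record taken, starting at `start`.
   Records are taken while the running size is still below block_size and
   records remain, mirroring the while loop of the pseudocode. */
size_t take_block(const size_t *seq_len, size_t n_seq,
                  size_t start, size_t block_size)
{
    size_t taken = 0, pos = start;
    while (taken < block_size && pos < n_seq) {
        taken += seq_len[pos];   /* temp_size += SeqLen */
        ++pos;
    }
    return pos;                  /* the next node resumes from here */
}
```

The returned position plays the role of the data status record: with four sequences of lengths 10, 20, 30 and 40 and a block size of 50, the first call takes three records (60 residues) and the next call, started at position 3, takes the remaining one.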

Before the data is distributed to GPU memory, it is sorted in ascending order of sequence length. The sorting serves two purposes. The first is to split the data into two blocks that are processed by the two methods described above. This split is controlled by a threshold that is a run-time parameter: a sequence whose length is at most the threshold is assigned to the first block, otherwise it belongs to the second block. Each block is then divided into n_device parts, one per GPU, so that every GPU processes only its own share of the data with both methods. The second purpose is that the threads of one block align sequences of approximately equal length, so their run times are approximately equal, which increases performance.

The program contains two kernels that implement the two levels of parallelism on the GPU cluster (one kernel for inter-task parallelization and one for intra-task parallelization). After the data is loaded into the memory of each GPU, the program calls the kernels to align the sequences. The results are then stored in the memory of the nodes, and each node writes its results to a temporary file. At node 0, the program collects the results from all nodes. The pseudocode for launching the kernels on the GPU cluster is as follows:
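The sort-and-split step can be sketched on the host in plain C as below. The names `split_by_threshold` and `device_share` are hypothetical, the sequences are represented only by their lengths, and treating a length equal to the threshold as "short" is an assumption, since the text and the pseudocode comments do not settle the boundary case.

```c
#include <stdlib.h>

static int cmp_len(const void *a, const void *b)
{
    size_t x = *(const size_t *)a, y = *(const size_t *)b;
    return (x > y) - (x < y);
}

/* Sorts the lengths in ascending order and returns how many sequences fall
   into the first (inter-task) block, i.e. have length <= threshold. */
size_t split_by_threshold(size_t *len, size_t n, size_t threshold)
{
    size_t k = 0;
    qsort(len, n, sizeof *len, cmp_len);
    while (k < n && len[k] <= threshold) ++k;
    return k;
}

/* Half-open range [*begin, *end) of a block of `count` items that is
   assigned to device `dev` out of `n_device` devices (n_device > 0). */
void device_share(size_t count, size_t n_device, size_t dev,
                  size_t *begin, size_t *end)
{
    *begin = count * dev / n_device;
    *end = count * (dev + 1) / n_device;
}
```

Because the data is sorted before it is cut into contiguous shares, each share, and each thread block inside it, ends up with sequences of similar length, which is the second purpose of the sorting mentioned above.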
    /******************** ASSUMPTIONS ********************
      threads:    the number of threads executed in parallel;
      nSM:        the number of multiprocessors per GPU;
      interSeqNo: the number of sequences whose lengths are
                  less than the threshold value, per GPU;
      intraSeqNo: the number of sequences whose lengths are
                  more than the threshold value, per GPU
     *****************************************************/
    n_batch = interSeqNo / threads;
    while (n_batch > 0) {
        n = min(n_batch, nSM);
        dim3 grids(n, 1, 1);
        dim3 blocks(threads, 1, 1);
        inter_kernel<<<grids, blocks, 1>>>(interSeqs, interSeqNo, d_result);
        n_batch -= n;
    }
    if (intraSeqNo > 0) {
        maxSeqOnePass = 256;
        n_batch = intraSeqNo;
        while (n_batch > 0) {
            n = min(n_batch, maxSeqOnePass);
            dim3 grids(n, 1, 1);
            dim3 blocks(threads, 1, 1);
            intra_kernel<<<grids, blocks, 1>>>(intraSeqs, intraSeqNo, d_result);
            n_batch -= n;
        }
    }
    transferToHostResult(d_result);
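The batching loops of the pseudocode follow one pattern: launch at most a fixed number of blocks per pass until the work is used up. A small host-side sketch of that arithmetic in plain C is given below; `plan_launches` is a hypothetical helper that only records the grid sizes and launches no kernel.

```c
#include <stddef.h>

/* Splits n_batch blocks of work into passes of at most max_per_launch blocks.
   The grid size of each pass is written to grid[] (up to cap entries).
   Returns the number of passes; returns 0 if max_per_launch is 0. */
size_t plan_launches(size_t n_batch, size_t max_per_launch,
                     size_t *grid, size_t cap)
{
    size_t launches = 0;
    if (max_per_launch == 0) return 0;   /* guard against an endless loop */
    while (n_batch > 0) {
        size_t n = n_batch < max_per_launch ? n_batch : max_per_launch;
        if (launches < cap) grid[launches] = n;   /* dim3 grids(n, 1, 1) */
        ++launches;
        n_batch -= n;
    }
    return launches;
}
```

With 70 blocks of work and 30 multiprocessors, for example, the loop produces three passes with grid sizes 30, 30 and 10. The decrement `n_batch -= n` is what guarantees termination.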

V. EXPERIMENTAL RESULTS

The implementation of the Smith-Waterman algorithm on the GPU cluster was deployed and tested at the High Performance Computing Center, Hanoi University of Science and Technology. The test environment is a GPU cluster of two nodes: one node is equipped with a dual-GPU NVIDIA GeForce GTX 295 card and a Tesla C1060 card, and the other node is equipped with 2 dual-GPU NVIDIA GeForce GTX 295 cards.

To remove the dependence on the query sequences and the databases used in different tests, cell updates per second (CUPS) is commonly used as a performance measure in bioinformatics. CUPS is the number of cells of the matrix H computed in one second, including the computation of the intermediate values of the matrices E and F. Formula (1) gives the CUPS value of a sequence alignment:
$$cups = \frac{qLen \times dbLen}{t} \qquad (1)$$

Here qLen is the length of the query sequence, dbLen is the length of the database (subject) sequences and t is the run time of the program. The value t includes the time to load data from main memory to device memory, the computation time on the GPU and the time to transfer the results back to the CPU.

In our experiments, we use a set of query sequences with lengths from 100 to 5000 and the UniProt biological sequence database, release 2010_05 of 20 April 2010, which contains 516,080 sequences and 181,676,505 amino acids. With this database and a threshold value of 3072, 515,472 sequences are aligned by inter-task parallelization and the other 608 by intra-task parallelization.

swGPUCluster was tested on two nodes, Node0 and Node1, using multiple GPUs (three dual GTX 295 cards, i.e. 6 GPUs, and one Tesla C1060 card, i.e. 1 GPU). On our GPU cluster, maximum performance is reached with a thread block size of 256 and a grid size of 30 blocks (the number of streaming multiprocessors of the GPU). The performance of swGPUCluster grows with the length of the query sequence, from a minimum of 37.328 GCUPS to a maximum of 46.706 GCUPS. The results are given in Table I.

We compared the results of swGPUCluster with other implementations of the Smith-Waterman algorithm, namely CUDASW++ and SWPS3. CUDASW++ was tested on a single GTX 295 GPU. Its performance ranges from a minimum of 8.387 GCUPS to a maximum of 9.232 GCUPS, so swGPUCluster is about 4.4 to 5 times faster than CUDASW++ on a single GPU.

A performance comparison was also made with SWPS3. SWPS3 was tested on an x86/SSE2 platform consisting of one node equipped with a Core 2 Quad Q8400 2.66 GHz processor (4 cores) and 8 GB of RAM, running one or four threads. swGPUCluster is about 13.8 to 22.8 times faster than single-core SWPS3 x86/SSE2 and about 3.4 to 11.4 times faster than multi-core SWPS3 x86/SSE2, as shown in Figure 6.
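Formula (1) and the speedup figures quoted above are simple ratios, which the following C sketch makes explicit. The function names `gcups` and `speedup` are hypothetical; the values used in the usage note are the GCUPS figures reported in the text.

```c
/* Formula (1), scaled to giga cell updates per second (GCUPS).
   q_len and db_len are in residues, seconds is the total run time t. */
double gcups(double q_len, double db_len, double seconds)
{
    return q_len * db_len / seconds / 1e9;
}

/* Ratio of two throughput figures expressed in the same unit. */
double speedup(double ours, double baseline)
{
    return ours / baseline;
}
```

With the reported minima and maxima, `speedup(37.328, 8.387)` is about 4.45 and `speedup(46.706, 9.232)` is about 5.06, which agrees with the "about 4.4 to 5 times" stated for CUDASW++ on a single GPU.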

TABLE I. RESULTS OF THE IMPLEMENTATION OF THE SMITH-WATERMAN ALGORITHM ON GPU CLUSTER

Query     Length   Time (s)     GCUPS
P02232    144      0.779393     37.328
P01111    189      0.944164     39.204
P14942    222      1.078783     40.795
P07327    375      1.721378     42.945
P25705    553      2.532260     43.419
P21177    729      3.224185     44.907
P27895    1000     4.337310     45.849
P07756    1500     6.456398     46.065
P04775    2005     8.579664     46.355
P19096    2504     10.681426    46.530
P0C6B8    3564     15.174918    46.614
P08519    4548     19.348871    46.653
P33450    5147     21.894938    46.690
Q9UKN1    5478     23.294546    46.706

VI. CONCLUSION

In this paper, we presented swGPUCluster, an implementation of Smith-Waterman sequence alignment on a GPU cluster of two nodes equipped with multiple GPUs (3 dual GTX 295 cards, i.e. 6 GPUs, and one Tesla C1060 card, i.e. 1 GPU). With the UniProt biological sequence database, release 2010_05 of 20 April, as test input and with the optimal configuration settings, the performance of swGPUCluster grows with the length of the query sequence from a minimum of 37.328 GCUPS to a maximum of 46.706 GCUPS. swGPUCluster performs significantly better than previous implementations on GPUs or on multi-core architectures, such as SWPS3 and CUDASW++.

Our results show that GPUs are highly applicable to accelerating algorithms in bioinformatics, provided that the features of the hardware are exploited well. Recent developments also show that GPU performance is growing much faster than the performance of multi-core CPUs.

REFERENCES
[1] www2.cs.uh.edu/~zhenzhao/Review/alignment.htm
[2] http://blast.ncbi.nlm.nih.gov/Blast.cgi.
[3] http://en.wikipedia.org/wiki/SmithWaterman_algorithm.
[4] Rognes T, Seeberg E: "Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors". Bioinformatics 2000, 16(8):699-706.
[5] Farrar M: "Striped Smith-Waterman speeds database searches six times over other SIMD implementations". Bioinformatics 2007, 23(2):156-161.
[6] Manavski SA, Valle G: "CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment".
[7] Szalkowski A, Ledergerber C, Krahenbuhl P and Dessimoz C: "SWPS3 - fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2". BMC Research Notes 2008, 1:107.
[8] Yongchao Liu, Douglas L Maskell and Bertil Schmidt: "CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units". BMC Research Notes 2009, 2:73.
[9] NVIDIA. http://www.nvidia.com/object/cuda_home_new.html
[10] (2009) GraphStream, Inc. website. [Online]. Available: http://www.graphstream.com/.
[11] D. Göddeke, R. Strzodka, J. Mohd-Yusof, P. McCormick, S. Buijssen, M. Grajewski, and S. Turek, "Exploring weak scalability for FEM calculations on a GPU-enhanced cluster," Parallel Computing, vol. 33, pp. 685-699, Nov 2007.
[12] M. Showerman, J. Enos, A. Pant, V. Kindratenko, C. Steffen, R. Pennington, W. Hwu, "QP: A Heterogeneous Multi-Accelerator Cluster," in Proc. 10th LCI International Conference on High-Performance Clustered Computing, 2009. [Online]. Available: http://www.ncsa.illinois.edu/~kindr/papers/lci09_paper.pdf.
