Thinking Parallel, Part III: Tree Construction on the GPU | NVIDIA Developer Zone

By Tero Karras, posted Dec 19 2012 at 10:49 PM

Tags: Algorithms, Occupancy, Parallel Programming

In part II of this series, we looked at hierarchical tree traversal as a means of quickly identifying pairs of potentially colliding 3D objects, and we demonstrated how optimizing for low divergence can result in substantial performance gains on massively parallel processors. Having a fast traversal algorithm is not very useful, though, unless we also have a tree to go with it. In this part, we will close the circle by looking at tree building; specifically, parallel bounding volume hierarchy (BVH) construction. We will also see an example of an algorithmic optimization that would be completely pointless on a single-core processor, but leads to substantial gains in a parallel setting.

There are many use cases for BVHs, and also many ways of constructing them. In our case, construction speed is of the essence. In a physics simulation, objects keep moving from one time step to the next, so we will need a different BVH for each step. Furthermore, we know that we are going to spend only about 0.25 milliseconds in traversing the BVH, so it makes little sense to spend much more on constructing it.

One well-known approach for handling dynamic scenes is to essentially recycle the same BVH over and over. The basic idea is to only recalculate the bounding boxes of the nodes according to the new object locations, while keeping the hierarchical structure of nodes the same. It is also possible to make small incremental modifications to improve the node structure around objects that have moved the most. However, the main problem plaguing these algorithms is that the tree can deteriorate in unpredictable ways over time, which can result in arbitrarily bad traversal performance in the worst case. To ensure predictable worst-case behavior, we instead choose to build a new tree from scratch every time step. Let's look at how.

EXPLOITING THE Z-ORDER CURVE


The most promising current parallel BVH construction approach is to use a so-called linear BVH (LBVH). The idea is to simplify the problem by first choosing the order in which the leaf nodes (each corresponding to one object) appear in the tree, and then generating the internal nodes in a way that respects this order. We generally want objects that are located close to each other in 3D space to also reside nearby in the hierarchy, so a reasonable choice is to sort them along a space-filling curve. We will use the Z-order curve for simplicity.

The Z-order curve is defined in terms of Morton codes. To calculate a Morton code for a given 3D point, we start by looking at the binary fixed-point representation of its coordinates, as shown in the top left part of the figure. First, we take the fractional part of each coordinate and expand it by inserting two gaps after each bit. Second, we interleave the bits of all three coordinates together to form a single binary number. If we step through the Morton codes obtained this way in increasing order, we are effectively stepping along the Z-order curve in 3D (a 2D representation is shown on the right-hand side of the figure).

In practice, we can determine the order of the leaf nodes by assigning a Morton code for each object and then sorting the objects accordingly. As mentioned in the context of sort and sweep in part I, parallel radix sort is just the right tool for this job. A good way to assign the Morton code for a given object is to use the centroid point of its bounding box, and express it relative to the bounding box of the scene. The expansion and interleaving of bits can then be performed efficiently by exploiting the arcane bit-swizzling properties of integer multiplication, as shown in the following code.

// Expands a 10-bit integer into 30 bits
// by inserting 2 zeros after each bit.
unsigned int expandBits(unsigned int v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// Calculates a 30-bit Morton code for the
// given 3D point located within the unit cube [0,1].
unsigned int morton3D(float x, float y, float z)
{
    x = min(max(x * 1024.0f, 0.0f), 1023.0f);
    y = min(max(y * 1024.0f, 0.0f), 1023.0f);
    z = min(max(z * 1024.0f, 0.0f), 1023.0f);
    unsigned int xx = expandBits((unsigned int)x);
    unsigned int yy = expandBits((unsigned int)y);
    unsigned int zz = expandBits((unsigned int)z);
    return xx * 4 + yy * 2 + zz;
}

In our example dataset with 12,486 objects, assigning the Morton codes this way takes 0.02 milliseconds on GeForce GTX 690, whereas sorting the objects takes 0.18 ms. So far so good, but we still have a tree to build.

TOP-DOWN HIERARCHY GENERATION


One of the great things about LBVH is that once we have fixed the order of the leaf nodes, we can think of each internal node as just a linear range over them. To illustrate this, suppose that we have N leaf nodes in total. The root node contains all of them, i.e. it covers the range [0, N-1]. The left child of the root must then cover the range [0, γ], for some appropriate choice of γ, and the right child covers the range [γ+1, N-1]. We can continue this all the way down to obtain the following recursive algorithm.

Node* generateHierarchy( unsigned int* sortedMortonCodes,
                         int*          sortedObjectIDs,
                         int           first,
                         int           last)
{
    // Single object => create a leaf node.
    if (first == last)
        return new LeafNode(&sortedObjectIDs[first]);

    // Determine where to split the range.
    int split = findSplit(sortedMortonCodes, first, last);

    // Process the resulting sub-ranges recursively.
    Node* childA = generateHierarchy(sortedMortonCodes, sortedObjectIDs,
                                     first, split);
    Node* childB = generateHierarchy(sortedMortonCodes, sortedObjectIDs,
                                     split + 1, last);
    return new InternalNode(childA, childB);
}

We start with a range that covers all objects (first = 0, last = N-1), and determine an appropriate position to split the range in two (split = γ). We then repeat the same thing for the resulting sub-ranges, and generate a hierarchy where each such split corresponds to one internal node. The recursion terminates when we encounter a range that contains only one item, in which case we create a leaf node.
The only remaining question is how to choose γ. LBVH determines γ according to the highest bit that differs between the Morton codes within the given range. In other words, we aim to partition the objects so that the highest differing bit will be zero for all objects in childA, and one for all objects in childB. The intuitive reason that this is a good idea is that partitioning objects by the highest differing bit in their Morton codes corresponds to classifying them on either side of an axis-aligned plane in 3D.

In practice, the most efficient way to find out where the highest bit changes is to use binary search. The idea is to maintain a current best guess for the position, and try to advance it in exponentially decreasing steps. On each step, we check whether the proposed new position would violate the requirements for childA, and either accept or reject it accordingly. This is illustrated by the following code, which uses the __clz() intrinsic function available on NVIDIA Fermi and Kepler GPUs to count the number of leading zero bits in a 32-bit integer.

int findSplit( unsigned int* sortedMortonCodes,
               int           first,
               int           last)
{
    // Identical Morton codes => split the range in the middle.
    unsigned int firstCode = sortedMortonCodes[first];
    unsigned int lastCode = sortedMortonCodes[last];

    if (firstCode == lastCode)
        return (first + last) >> 1;

    // Calculate the number of highest bits that are the same
    // for all objects, using the count-leading-zeros intrinsic.
    int commonPrefix = __clz(firstCode ^ lastCode);

    // Use binary search to find where the next bit differs.
    // Specifically, we are looking for the highest object that
    // shares more than commonPrefix bits with the first one.
    int split = first; // initial guess
    int step = last - first;

    do
    {
        step = (step + 1) >> 1;      // exponential decrease
        int newSplit = split + step; // proposed new position

        if (newSplit < last)
        {
            unsigned int splitCode = sortedMortonCodes[newSplit];
            int splitPrefix = __clz(firstCode ^ splitCode);
            if (splitPrefix > commonPrefix)
                split = newSplit; // accept proposal
        }
    }
    while (step > 1);

    return split;
}

How should we go about parallelizing this kind of recursive algorithm? One way is to use the approach presented by Garanzha et al., which processes the levels of nodes sequentially, starting from the root. The idea is to maintain a growing array of nodes in breadth-first order, so that every level in the hierarchy corresponds to a linear range of nodes. On a given level, we launch one thread for each node that falls into this range. The thread starts by reading first and last from the node array and calling findSplit(). It then appends the resulting child nodes to the same node array using an atomic counter and writes out their corresponding sub-ranges. This process iterates so that each level outputs the nodes contained on the next level, which then get processed in the next round.

OCCUPANCY
The algorithm just described (Garanzha et al.) is surprisingly fast when there are millions of objects. The algorithm spends most of the execution time at the bottom levels of the tree, which contain more than enough work to fully employ the GPU. There is some amount of data divergence on the higher levels, as the threads are accessing distinct parts of the Morton code array. But those levels are also less significant considering the overall execution time, since they do not contain as many nodes to begin with.

In our case, however, there are only 12K objects (recall the example from the last post). Note that this is actually less than the number of threads we would need to fill our GTX 690, even if we were able to parallelize everything perfectly. GTX 690 is a dual-GPU card, where each of the two GPUs can run up to 16K threads in parallel. Even if we restrict ourselves to only one of the GPUs (the other one can e.g. handle rendering while we do physics), we are still in danger of running low on parallelism. The top-down algorithm takes 1.04 ms to process our workload, which is more than twice the total time taken by all the other processing steps.

To explain this, we need to consider another metric in addition to divergence: occupancy. Occupancy is a measure of how many threads are executing on average at any given time, relative to the maximum number of threads that the processor can theoretically support. When occupancy is low, it translates directly to performance: dropping occupancy in half will reduce performance by 2x. This dependence gets gradually weaker as the number of active threads increases. The reason is that when occupancy is high enough, the overall performance starts to become limited by other factors, such as instruction throughput and memory bandwidth.

To illustrate, consider the case with 12K objects and 16K threads. If we launch one thread per object, our occupancy is at most 75%. A bit low, but not by any means catastrophic. How does the top-down hierarchy generation algorithm compare to this? There is only one node on the first level, so we launch only one thread. This means that the first level will run at 0.006% occupancy! The second level has two nodes, so it runs at 0.013% occupancy. Assuming a balanced hierarchy, the third level runs at 0.025% and the fourth at 0.05%. Only when we get to the 13th level can we even hope to reach a reasonable occupancy of 25%. But right after that, we will already run out of work. These numbers are somewhat discouraging: due to the low occupancy, the first level will cost roughly as much as the 13th level, even though there are 4096 times fewer nodes to process.

FULLY PARALLEL HIERARCHY GENERATION


There is no way to avoid this problem without somehow changing the algorithm in a fundamental way. Even if our GPU supports dynamic parallelism (as NVIDIA Tesla K20 does), we cannot avoid the fact that every node is dependent on the results of its parent. We have to finish processing the root before we know which ranges its children cover, and we cannot even hope to start processing them until we do. In other words, regardless of how we implement top-down hierarchy generation, the first level is doomed to run at 0.006% occupancy.

Is there a way to break the dependency between nodes? In fact there is, and I recently presented the solution at High Performance Graphics 2012 (paper, slides). The idea is to number the internal nodes in a very specific way that allows us to find out which range of objects any given node corresponds to, without having to know anything about the rest of the tree. Utilizing the fact that any binary tree with N leaf nodes always has exactly N-1 internal nodes, we can then generate the entire hierarchy as illustrated by the following pseudocode.

Node* generateHierarchy( unsigned int* sortedMortonCodes,
                         int*          sortedObjectIDs,
                         int           numObjects)
{
    LeafNode* leafNodes = new LeafNode[numObjects];
    InternalNode* internalNodes = new InternalNode[numObjects - 1];

    // Construct leaf nodes.
    // Note: This step can be avoided by storing
    // the tree in a slightly different way.
    for (int idx = 0; idx < numObjects; idx++) // in parallel
        leafNodes[idx].objectID = sortedObjectIDs[idx];

    // Construct internal nodes.
    for (int idx = 0; idx < numObjects - 1; idx++) // in parallel
    {
        // Find out which range of objects the node corresponds to.
        // (This is where the magic happens!)
        int2 range = determineRange(sortedMortonCodes, numObjects, idx);
        int first = range.x;
        int last = range.y;

        // Determine where to split the range.
        int split = findSplit(sortedMortonCodes, first, last);

        // Select childA.
        Node* childA;
        if (split == first)
            childA = &leafNodes[split];
        else
            childA = &internalNodes[split];

        // Select childB.
        Node* childB;
        if (split + 1 == last)
            childB = &leafNodes[split + 1];
        else
            childB = &internalNodes[split + 1];

        // Record parent-child relationships.
        internalNodes[idx].childA = childA;
        internalNodes[idx].childB = childB;
        childA->parent = &internalNodes[idx];
        childB->parent = &internalNodes[idx];
    }

    // Node 0 is the root.
    return &internalNodes[0];
}

The algorithm simply allocates an array of N-1 internal nodes, and then processes all of them in parallel. Each thread starts by determining which range of objects its node corresponds to, with a bit of magic, and then proceeds to split the range as usual. Finally, it selects children for the node according to their respective sub-ranges. If a sub-range has only one object, the child must be a leaf, so we reference the corresponding leaf node directly. Otherwise, we reference another internal node from the array.

The way the internal nodes are numbered is already evident from the pseudocode. The root has index 0, and the children of every node are located on either side of its split position. Due to some nice properties of the sorted Morton codes, this numbering scheme never results in any duplicates or gaps. Furthermore, it turns out that we can implement determineRange() in much the same way as findSplit(), using two similar binary searches over the nearby Morton codes. For further details on how and why this works, please see the paper.

How does this algorithm compare to the recursive top-down approach? It clearly performs more work: we now need three binary searches per node, whereas the top-down approach only needs one. But it does all of the work completely in parallel, reaching the full 75% occupancy in our example, and this makes a huge difference. The execution time of parallel hierarchy generation is merely 0.02 ms, a 50x improvement over the top-down algorithm! You might think that the top-down algorithm should start to win when the number of objects is sufficiently high, since the lack of occupancy is no longer a problem. However, this is not the case in practice: the parallel algorithm consistently performs better on all problem sizes. The explanation for this is, as always, divergence. In the parallel algorithm, nearby threads are always accessing nearby Morton codes, whereas the top-down algorithm spreads out the accesses over a wider area.

BOUNDING BOX CALCULATION


Now that we have a hierarchy of nodes in place, the only thing left to do is to assign a conservative bounding box for each of them. The approach I adopt in my paper is to do a parallel bottom-up reduction, where each thread starts from a single leaf node and walks toward the root. To find the bounding box of a given node, the thread simply looks up the bounding boxes of its children and calculates their union. To avoid duplicate work, the idea is to use an atomic flag per node to terminate the first thread that enters it, while letting the second one through. This ensures that every node gets processed only once, and not before both of its children are processed.

The bounding box calculation has high execution divergence: only half of the threads remain active after processing one node, one quarter after processing two nodes, one eighth after processing three nodes, and so on. However, this is not really a problem in practice, for two reasons. First, bounding box calculation takes only 0.06 ms, which is still reasonably low compared to e.g. sorting the objects. Second, the processing mainly consists of reading and writing bounding boxes, and the amount of computation is minimal. This means that the execution time is almost entirely dictated by the available memory bandwidth, and reducing execution divergence would not really help that much.

SUMMARY
Our algorithm for finding potential collisions among a set of 3D objects consists of the following 5 steps (times are for the 12K-object scene used in the previous post).

1. 0.02 ms, one thread per object: Calculate bounding box and assign Morton code.
2. 0.18 ms, parallel radix sort: Sort the objects according to their Morton codes.
3. 0.02 ms, one thread per internal node: Generate BVH node hierarchy.
4. 0.06 ms, one thread per object: Calculate node bounding boxes by walking the hierarchy toward the root.
5. 0.25 ms, one thread per object: Find potential collisions by traversing the BVH.

The complete algorithm takes 0.53 ms, out of which 53% goes to tree construction and 47% to tree traversal.

DISCUSSION
We have presented a number of algorithms of varying complexity in the context of broad-phase collision detection, and identified some of the most important considerations when designing and implementing them on a massively parallel processor. The comparison between independent traversal and simultaneous traversal illustrates the importance of divergence in algorithm design: the best single-core algorithm may easily turn out to be the worst one in a parallel setting. Relying on time complexity as the main indicator of a good algorithm can sometimes be misleading or even harmful: it may actually be beneficial to do more work if it helps in reducing divergence.

The parallel algorithm for BVH hierarchy generation brings up another interesting point. In the traditional sense, the algorithm is completely pointless: on a single-core processor, the dependencies between nodes were not a problem to begin with, and doing more work per node only makes the algorithm run slower. This shows that parallel programming is indeed fundamentally different from traditional single-core programming: it is not so much about porting existing algorithms to run on a parallel processor; it is about re-thinking some of the things that we usually take for granted, and coming up with new algorithms specifically designed with massive parallelism in mind. And there is still a lot to be accomplished on this front.

About the author: Tero Karras is a graphics research scientist at NVIDIA Research.

Parallel Forall is the NVIDIA Parallel Programming blog. If you enjoyed this post, subscribe to the Parallel Forall RSS feed! You may contact us via the contact form.