Introduction
Tridiagonal solvers are a very popular technique in both compute and graphics applications
Application in Alternating Direction Implicit (ADI) methods
Outline
Tridiagonal Solvers Overview
Introduction
Gauss Elimination
Cyclic Reduction
Tridiagonal Solvers
Cyclic Reduction
Eliminate unknowns using linear combination of equations:
After one reduction step a system is decomposed into 2:
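The combination step can be written explicitly. For equation $i$ of a system $a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i$, the standard cyclic reduction update (a reconstruction; the slide's original formula did not survive extraction) is:

$$k_1 = \frac{a_i}{b_{i-1}}, \qquad k_2 = \frac{c_i}{b_{i+1}}$$

$$a_i' = -a_{i-1} k_1, \qquad b_i' = b_i - c_{i-1} k_1 - a_{i+1} k_2, \qquad c_i' = -c_{i+1} k_2, \qquad d_i' = d_i - d_{i-1} k_1 - d_{i+1} k_2$$

After this step the even-indexed unknowns form one half-size tridiagonal system and the odd-indexed unknowns the other.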
Cyclic Reduction
Parallel Cyclic Reduction (PCR)
Apply the reduction to the new systems and repeat: the full reduction takes O(log N) steps. For some special cases fewer steps can be used.
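As a sketch of how PCR reduces every equation simultaneously, here is a minimal CPU version (the function name and boundary handling are illustrative; a GPU kernel would run the same update with one thread per equation):

```c
#include <stdlib.h>
#include <string.h>

/* Parallel cyclic reduction on one tridiagonal system (CPU sketch).
   a, b, c: sub-, main-, super-diagonal; d: right-hand side; x: solution.
   Each pass combines every equation i with equations i-stride and
   i+stride, so after O(log n) passes every equation is decoupled. */
void pcr_solve(double *a, double *b, double *c, double *d, double *x, int n)
{
    double *a2 = malloc(n * sizeof *a2), *b2 = malloc(n * sizeof *b2);
    double *c2 = malloc(n * sizeof *c2), *d2 = malloc(n * sizeof *d2);
    for (int stride = 1; stride < n; stride *= 2) {
        for (int i = 0; i < n; i++) {
            int lo = i - stride, hi = i + stride;
            double k1 = (lo >= 0) ? a[i] / b[lo] : 0.0; /* eliminates x[lo] */
            double k2 = (hi <  n) ? c[i] / b[hi] : 0.0; /* eliminates x[hi] */
            a2[i] = (lo >= 0) ? -a[lo] * k1 : 0.0;
            c2[i] = (hi <  n) ? -c[hi] * k2 : 0.0;
            b2[i] = b[i] - ((lo >= 0) ? c[lo] * k1 : 0.0)
                         - ((hi <  n) ? a[hi] * k2 : 0.0);
            d2[i] = d[i] - ((lo >= 0) ? d[lo] * k1 : 0.0)
                         - ((hi <  n) ? d[hi] * k2 : 0.0);
        }
        memcpy(a, a2, n * sizeof *a); memcpy(b, b2, n * sizeof *b);
        memcpy(c, c2, n * sizeof *c); memcpy(d, d2, n * sizeof *d);
    }
    for (int i = 0; i < n; i++)
        x[i] = d[i] / b[i]; /* each equation now has a single unknown */
    free(a2); free(b2); free(c2); free(d2);
}
```

Note the double buffering: every pass must read the previous pass's coefficients for all equations, which on the GPU maps naturally to a synchronization between steps.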
Outline
Tridiagonal Solvers Overview
Fluid Simulation in 3D domain
Problem statement, applications
ADI numerical method
GPU implementation details, optimizations
Performance analysis and results comparisons
Problem Statement
Viscous incompressible fluid in a 3D domain
New challenge: complex dynamic boundaries
no-slip
free injection
Applications
Blood flow simulation in heart
Capture boundary movement from MR or US imaging
Applications
Sea and ocean simulations
Static boundaries
Additional simulation parameters: salinity, etc.
Definitions
Density
Velocity
Temperature
Pressure
Equation of state
Describes the relation between pressure, density, and temperature
Example: the ideal gas law p = ρRT, where R ≈ 287 J/(kg·K) is the gas constant for air
Governing equations
Continuity equation
For incompressible fluids:
Navier-Stokes equations:
Dimensionless form, use equation of state
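The slide's formulas did not survive extraction; reconstructed from context, the incompressible continuity equation and a standard dimensionless Navier-Stokes form read:

$$\nabla \cdot \mathbf{u} = \frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} + \frac{\partial w}{\partial z} = 0$$

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\,\mathbf{u} = -\nabla p + \frac{1}{\mathrm{Re}}\,\nabla^2 \mathbf{u}$$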
Governing equations
Energy equation:
Dimensionless form, use equation of state
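A standard dimensionless temperature equation consistent with the momentum equation above (again a reconstruction, with Pr the Prandtl number):

$$\frac{\partial T}{\partial t} + (\mathbf{u} \cdot \nabla)\,T = \frac{1}{\mathrm{Re}\,\mathrm{Pr}}\,\nabla^2 T$$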
ADI sweeps: solve along X with Y, Z fixed, then along Y with X, Z fixed, then along Z with X, Y fixed; the three sweeps are repeated in global iterations
Discretization
Tridiagonal systems
Need to solve lots of tridiagonal systems
Sizes of systems may vary across the grid
[Figure: tridiagonal systems 1-3 of different sizes across the grid]
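The per-system serial solver (the "sweep" compared in the benchmarks below) is the Thomas algorithm. A minimal CPU sketch, assuming a well-conditioned (e.g. diagonally dominant) system:

```c
#include <stdlib.h>

/* Serial tridiagonal solve by forward elimination + back substitution
   (the Thomas algorithm / "sweep"). a: sub-diagonal (a[0] unused),
   b: main diagonal, c: super-diagonal (c[n-1] unused), d: rhs. */
void sweep_solve(const double *a, const double *b, const double *c,
                 const double *d, double *x, int n)
{
    double *cp = malloc(n * sizeof *cp), *dp = malloc(n * sizeof *dp);
    cp[0] = c[0] / b[0];
    dp[0] = d[0] / b[0];
    for (int i = 1; i < n; i++) {
        double m = b[i] - a[i] * cp[i - 1];
        cp[i] = c[i] / m;
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m;
    }
    x[n - 1] = dp[n - 1];
    for (int i = n - 2; i >= 0; i--)
        x[i] = dp[i] - cp[i] * x[i + 1];
    free(cp); free(dp);
}
```

The sweep is O(n) and inherently sequential per system, so on the GPU it pays off only when many systems are solved in parallel, one per thread.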
Implementation details
for each direction X, Y, Z {
    for each local iteration {
        for each equation u, v, w, T {
            build tridiagonal matrices and rhs
            solve tridiagonal systems
        }
        update non-linear terms
    }
}
GPU implementation
Store all data arrays entirely on GPU in linear memory
Reduce amount of memory transfers to a minimum
Map 3D arrays to linear memory:
(X, Y, Z) -> Z + Y * dimZ + X * dimY * dimZ, with Z the fastest-changing dimension
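The mapping can be captured in a small helper (the function name is illustrative, not from the source):

```c
#include <stddef.h>

/* Linear index of a 3D array stored with Z as the fastest-changing
   dimension, matching (X, Y, Z) -> Z + Y * dimZ + X * dimY * dimZ. */
static inline size_t idx3d(size_t x, size_t y, size_t z,
                           size_t dimY, size_t dimZ)
{
    return z + y * dimZ + x * dimY * dimZ;
}
```

With this layout, stepping along Z moves through consecutive addresses, which is what makes Z-direction loads contiguous.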
Main tasks
Build matrices
Solve systems
Building matrices
Input data:
Previous/non-linear 3D layers
Output: tridiagonal coefficients a, b, c and right-hand side d
[Chart: kernel time for "Build" vs. "Build + Solve" in the X, Y, Z directions]
[Table: build-kernel profiler metrics (requests per load, L1 global load hit %, IPC); the Z-direction kernels reach an IPC of only 0.2, vs. about 1.4 for X and Y]
Y direction: transpose the input arrays from XYZ to XZY layout, run the Y local iterations on the transposed XZY arrays, then transpose the output arrays back to XYZ
Build kernels
[Table and chart: build-kernel metrics for the original vs. transposed layout; IPC improves from 0.2 to 1.3. Chart compares X dir, Y dir, Z dir, and optimized Z dir times]
Problem analysis
Large number of tridiagonal systems
Grid size squared on each step: 16K systems for a 128^3 grid
Interleaved layout
Element i of every system is stored contiguously (a0 a0 a0 a0 a1 a1 a1 a1 ...), one thread per system, so threads reading the same element index of their systems access consecutive addresses
Sweep friendly
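The two layouts can be sketched as index helpers (hypothetical names): with the interleaved layout, adjacent threads processing element i of their respective systems read adjacent addresses, which is what makes the sweep coalesce on the GPU.

```c
#include <stddef.h>

/* N tridiagonal systems of size M stored in one array.
   Sequential: each system is one contiguous block of M elements.
   Interleaved: element i of all N systems is stored contiguously,
   so thread `sys` reading element i hits address i*N + sys --
   adjacent threads touch adjacent addresses (coalesced access). */
static inline size_t seq_idx(size_t sys, size_t i, size_t M)
{
    return sys * M + i;
}

static inline size_t ilv_idx(size_t sys, size_t i, size_t N)
{
    return i * N + sys;
}
```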
Matrix layout
[Charts: solver time (ms) vs. system size (64-256) on Tesla C2050, for the ADI Z direction and the ADI X, Y directions, comparing PCR, CR, SWEEP, and SWEEP_T; CR and SWEEP_T use the sequential layout, SWEEP the interleaved layout]
PCR/CR implementations from: Y. Zhang, J. Cohen, J. D. Owens, "Fast Tridiagonal Solvers on the GPU", 2010
Performance benchmark
CPU configuration:
Intel Core i7, 4 cores @ 2.8 GHz
All 4 cores used through OpenMP (3-4x speed-up vs. 1 core)
GPU configurations:
NVIDIA Tesla C1060
NVIDIA Tesla C2050
Test cases
            Cube               Heart Volume       Sea Area
Grid size   128 x 128 x 128    96 x 160 x 128     160 x 128 x 128
Z dir       10K                8K                 6K
[Charts: throughput (systems/ms) in the X, Y, Z, and all directions for Core i7, Tesla C1060, and Tesla C2050, in single and double precision, one chart per test case. GPU speed-ups over the 4-core OpenMP CPU baseline range from about 2.7x up to 11.4x]
Future work
Current main limitation is memory capacity
Lots of 3D arrays: grid data, time layers, tridiagonal matrices
Split grid along current direction (X, Y or Z)
Assign each partition to one GPU
Solve systems and update common 3D layers
[Diagram: grid partitioned across GPU 1 through GPU 4]
References
http://code.google.com/p/cmc-fluid-solver/
Nikolai Sakharnykh. Tridiagonal Solvers on the GPU and Applications to Fluid Simulation. GPU Technology Conference, 2009.
V. M. Paskonov, S. B. Berezin, E. S. Korukhova. A dynamic visualization system for multiprocessor computers with common memory and its application for numerical modeling of the turbulent flows of viscous fluids. Moscow University Computational Mathematics and Cybernetics, 2007.
Outline
Tridiagonal Solvers Overview
Fluid Simulation in 3D domain
Depth-of-Field Effect in 2D domain
Problem statement
ADI numerical method
Problem statement
ADI method application for 2D problems
Real-time Depth-Of-Field simulation
Using diffusion equation to blur the image
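One way to sketch the blur: each scanline does an implicit diffusion step, which is exactly one tridiagonal solve per row. The CPU sketch below uses a constant blur strength beta (an illustrative simplification; in the real effect the diffusion coefficient varies per pixel with the circle of confusion):

```c
#include <stdlib.h>

/* One implicit diffusion step on a scanline of n pixels:
   (1 + 2*beta) out[i] - beta*out[i-1] - beta*out[i+1] = in[i],
   with zero-flux (Neumann) boundaries, solved by a Thomas sweep.
   Larger beta means a stronger blur. */
void diffuse_row(const double *in, double *out, int n, double beta)
{
    double *cp = malloc(n * sizeof *cp), *dp = malloc(n * sizeof *dp);
    double b0 = 1.0 + beta;              /* boundary rows lose one neighbor */
    cp[0] = -beta / b0;
    dp[0] = in[0] / b0;
    for (int i = 1; i < n; i++) {
        double bi = (i == n - 1) ? 1.0 + beta : 1.0 + 2.0 * beta;
        double m  = bi + beta * cp[i - 1];   /* sub-diagonal a_i = -beta */
        cp[i] = -beta / m;
        dp[i] = (in[i] + beta * dp[i - 1]) / m;
    }
    out[n - 1] = dp[n - 1];
    for (int i = n - 2; i >= 0; i--)
        out[i] = dp[i] - cp[i] * out[i + 1];
    free(cp); free(dp);
}
```

Because the step is implicit, it is unconditionally stable, so a single solve gives an arbitrarily large blur radius; the zero-flux boundaries also conserve total brightness.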
Numerical method
Alternating Direction Implicit method
Problem analysis
Small number of systems
Max 2K for high resolutions
GPU implementation
Matrix layout
Sequential for X direction
Interleaved for Y direction
PCR step
Each PCR step doubles the number of systems and halves their size
[Diagram: PCR steps on an H x W image; after each step the sub-systems are renumbered 0, 1, 2, 3 across the width]

Systems    Size
H          W       (step 0)
H*2        W/2     (step 1)
H*4        W/4     (step 2)
Hybrid approach
Benefits of each additional PCR step
Improves memory layout for Sweep in X direction
Subdivides systems for more efficient Sweep
[Chart: solver performance vs. number of PCR steps; after enough steps the reduced sub-systems (size 16-32) fit in shared memory]
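A CPU model of the hybrid idea (hypothetical helper names): run k PCR passes, after which the system splits into 2^k independent strided sub-systems, then finish each sub-system with a serial sweep.

```c
#include <stdlib.h>
#include <string.h>

/* Thomas sweep over elements start, start+stride, ... of one sub-system. */
static void sweep_strided(const double *a, const double *b, const double *c,
                          const double *d, double *x, int n,
                          int start, int stride)
{
    int m = 0;
    for (int i = start; i < n; i += stride) m++;
    double *cp = malloc(m * sizeof *cp), *dp = malloc(m * sizeof *dp);
    cp[0] = c[start] / b[start];
    dp[0] = d[start] / b[start];
    for (int j = 1; j < m; j++) {
        int i = start + j * stride;
        double den = b[i] - a[i] * cp[j - 1];
        cp[j] = c[i] / den;
        dp[j] = (d[i] - a[i] * dp[j - 1]) / den;
    }
    x[start + (m - 1) * stride] = dp[m - 1];
    for (int j = m - 2; j >= 0; j--)
        x[start + j * stride] = dp[j] - cp[j] * x[start + (j + 1) * stride];
    free(cp); free(dp);
}

/* Hybrid solver sketch: k PCR passes split the system into 2^k
   independent sub-systems coupled at distance 2^k, each then solved
   with a serial sweep. */
void hybrid_solve(double *a, double *b, double *c, double *d, double *x,
                  int n, int k)
{
    double *a2 = malloc(n * sizeof *a2), *b2 = malloc(n * sizeof *b2);
    double *c2 = malloc(n * sizeof *c2), *d2 = malloc(n * sizeof *d2);
    int stride = 1;
    for (int step = 0; step < k; step++, stride *= 2) {
        for (int i = 0; i < n; i++) {
            int lo = i - stride, hi = i + stride;
            double k1 = (lo >= 0) ? a[i] / b[lo] : 0.0;
            double k2 = (hi <  n) ? c[i] / b[hi] : 0.0;
            a2[i] = (lo >= 0) ? -a[lo] * k1 : 0.0;
            c2[i] = (hi <  n) ? -c[hi] * k2 : 0.0;
            b2[i] = b[i] - ((lo >= 0) ? c[lo] * k1 : 0.0)
                         - ((hi <  n) ? a[hi] * k2 : 0.0);
            d2[i] = d[i] - ((lo >= 0) ? d[lo] * k1 : 0.0)
                         - ((hi <  n) ? d[hi] * k2 : 0.0);
        }
        memcpy(a, a2, n * sizeof *a); memcpy(b, b2, n * sizeof *b);
        memcpy(c, c2, n * sizeof *c); memcpy(d, d2, n * sizeof *d);
    }
    for (int s = 0; s < stride; s++)   /* 2^k independent sub-systems */
        sweep_strided(a, b, c, d, x, n, s, stride);
    free(a2); free(b2); free(c2); free(d2);
}
```

On the GPU, the small sub-systems left after a few PCR steps are what can be staged in shared memory for the sweeps.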
Depth-of-Field in games
[Screenshots: depth-of-field in games; note the sharp boundary preserved between in-focus and blurred regions]
References
Pixar article about Diffusion DOF:
http://graphics.pixar.com/library/DepthOfField/paper.pdf
Summary
10x speed-ups for 3D ADI methods in complex areas
Great Fermi performance in double precision
Questions?