GPU Computing
CUDA
Framework to Program NVIDIA GPUs
ACCELERATING INSIGHTS
GOOGLE DATACENTER | STANFORD AI LAB
Deep learning with COTS HPC systems, A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro, ICML 2013
Recent improvements: Image Recognition, Pedestrian Detection, Object Detection
NVIDIA cuDNN
[Chart: deep learning training speedup, from roughly 6.0x with the M40 (cuDNN v3) to 12.0x with Pascal (cuDNN v5)]
developer.nvidia.com/cudnn
Accelerating linear algebra: cuBLAS
developer.nvidia.com/cublas
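A minimal host-side sketch of a cuBLAS call (here single-precision SAXPY); error checking is omitted, and d_x, d_y are assumed to already be allocated and filled in device memory:

```cuda
#include <cublas_v2.h>

// y = alpha*x + y on the GPU via cuBLAS.
// d_x and d_y must already live in device memory.
void gpu_saxpy(int n, float alpha, const float *d_x, float *d_y)
{
    cublasHandle_t handle;
    cublasCreate(&handle);                            // set up cuBLAS context
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // strides of 1
    cublasDestroy(handle);
}
```

In a real application the handle would be created once and reused across many calls, since handle creation is comparatively expensive.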
Accelerating sparse operations: cuSPARSE
The (dense matrix) x (sparse vector) example:

y = op(A)x + y

[Figure: rows of dense A (entries A21...A25, A31...A35) multiplied by a sparse vector x, accumulated into dense entries y2, y3]

A = dense matrix
x = sparse vector
y = dense vector

The sparse vector could be, for example, the frequencies of words in a text sample.
cuSPARSE provides a full suite of accelerated sparse matrix functions
developer.nvidia.com/cusparse
Multi-GPU communication: NCCL
Collective communications library
github.com/NVIDIA/nccl
NCCL Example
All-reduce
#include <nccl.h>
#include <cuda_runtime.h>

ncclComm_t comm[4];
int devs[4] = {0, 1, 2, 3};
ncclCommInitAll(comm, 4, devs);
for (int g = 0; g < 4; ++g) { // or one host thread per GPU
    cudaSetDevice(g);
    double *d_send, *d_recv;
    // allocate d_send, d_recv; fill d_send with data
    ncclAllReduce(d_send, d_recv, N, ncclDouble, ncclSum, comm[g], stream[g]);
    // consume d_recv
}
Platform
Developer workstation
Titan X Pascal
11 TFLOPS FP32
INT8
3,584 CUDA cores
12 GB GDDR5X
DGX-1
World's First Deep Learning Supercomputer
GIE (GPU Inference Engine)
[Workflow: MANAGE / AUGMENT data, TRAIN, TEST, then DEPLOY across EMBEDDED, DATA CENTER, and AUTOMOTIVE platforms]
developer.nvidia.com/gpu-inference-engine
EXAMPLE: DL EMBEDDED DEPLOYMENT
Jetson TX1 devkit
Inference at 258 img/s
No need to change code
GPU INFERENCE ENGINE
Optimizations
NVIDIA DIGITS
Interactive Deep Learning GPU Training System
Process Data | Configure DNN | Monitor Progress | Visualize Layers
Test Image
developer.nvidia.com/digits
github.com/NVIDIA/DIGITS
CUDA
GPU architecture
GPU ARCHITECTURE
Two Main Components

Global memory
Analogous to RAM in a CPU server
Accessible by both GPU and CPU
Currently up to 24 GB
ECC on/off options for Quadro and Tesla products

Streaming Multiprocessors (SMs)
[Diagram: SM with two schedulers and dispatch units, cores, special-function units, 64K configurable cache/shared memory, uniform cache]
GPU MEMORY HIERARCHY REVIEW
[Diagram: memory hierarchy with L2 cache backed by global memory]
CUDA Programming model
ANATOMY OF A CUDA C/C++ APPLICATION
Serial code executes in a Host (CPU) thread
Parallel code executes in many Device (GPU) threads across multiple processing elements
[Diagram: execution alternates between serial code on the Host (CPU) and parallel code on the Device (GPU)]
CUDA C : C WITH A FEW KEYWORDS
void saxpy_serial(int n, float a, float *x, float *y)
{
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
Standard C Code
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);
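The parallel CUDA version of the same routine can be sketched as follows; the __global__ qualifier and the <<< >>> launch syntax are the "few keywords", and 256 threads per block is a typical choice:

```cuda
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n)                  // guard: n need not be a multiple of 256
        y[i] = a * x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads per block
int nblocks = (n + 255) / 256;  // round up so all n elements are covered
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
```

The loop over i disappears: each GPU thread computes one element, and the grid of blocks covers the whole array.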
CUDA KERNELS
CUDA threads:
Lightweight
Fast switching
1000s execute simultaneously
[Diagram: GPU SM with two schedulers and dispatch units, register file, 16 load/store units, 4 special-function units, interconnect network, 64K configurable cache/shared memory, uniform cache]
THREAD BLOCKS ALLOW SCALABILITY
Blocks can execute in any order, concurrently or sequentially
This independence between blocks gives scalability:
A kernel scales across any number of SMs
[Diagram: a kernel grid of 8 blocks launched on a 2-SM device (blocks scheduled two at a time) and on a 4-SM device (blocks scheduled four at a time)]
Memory System Hierarchy
MEMORY HIERARCHY

Thread:
Registers
Local memory

Block of threads:
Shared memory

All blocks:
Global memory
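Shared memory is the level of this hierarchy the programmer manages explicitly. As an illustrative sketch (kernel name and sizes are examples, not from the deck), a block-level reduction stages data in shared memory so threads in the block can cooperate:

```cuda
// Each 256-thread block reduces its tile of d_in to one partial
// sum in d_out. Launch with blockDim.x == 256.
__global__ void block_sum(const float *d_in, float *d_out, int n)
{
    __shared__ float tile[256];           // visible to all threads in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? d_in[i] : 0.0f;
    __syncthreads();                      // wait until the tile is loaded

    // Tree reduction within the block: 256 -> 128 -> ... -> 1
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)                 // one result per block
        d_out[blockIdx.x] = tile[0];
}
```

Registers and local memory are per-thread and need no synchronization; shared memory requires __syncthreads() because all threads in the block read and write it.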
CUDA memory management
MEMORY SPACES
CPU and GPU have separate memory spaces
Data is moved across the PCIe bus

float *d_a;
cudaMalloc((void **)&d_a, nbytes);
// ... use d_a in kernels ...
cudaFree(d_a);
DATA COPIES
cudaMemcpy(void *dst, void *src, size_t nbytes,
           enum cudaMemcpyKind direction);

Returns after the copy is complete
Blocks the CPU thread until all bytes have been copied
Doesn't start copying until previous CUDA calls complete

enum cudaMemcpyKind:
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
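Putting allocation and copies together, a typical host-to-device round trip looks like this (a sketch of code inside a host function; h_a/d_a are illustrative names):

```cuda
int n = 1 << 20;
size_t nbytes = n * sizeof(float);

float *h_a = (float *)malloc(nbytes);   // host (CPU) memory
float *d_a = NULL;
cudaMalloc((void **)&d_a, nbytes);      // device (GPU) memory

// ... fill h_a on the host ...
cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);
// ... launch kernels that read and write d_a ...
cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);

cudaFree(d_a);
free(h_a);
```

Note the pairing: every cudaMalloc has a matching cudaFree, and the direction flag must match the pointer roles or the copy is undefined.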
CUDA PROGRAMMING MODEL REVISITED
THREAD HIERARCHY
Threads launched for a parallel section are partitioned into thread blocks
IDS AND DIMENSIONS

Threads: 3D IDs, unique within a block
Blocks: 2D IDs, unique within a grid

Built-in variables:
threadIdx, blockIdx
blockDim, gridDim

[Diagram: Device running Grid 1 containing Block (0, 0), Block (1, 0), Block (2, 0)]
LAUNCHING KERNELS ON GPU
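As a sketch of a 2D launch (the kernel name and data layout are illustrative), the execution configuration between <<< >>> sets the grid and block dimensions, and the built-in variables above recover each thread's global position:

```cuda
__global__ void my_kernel(float *d_data, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row
    d_data[y * width + x] = 1.0f;                   // each thread writes one element
}

dim3 grid(3, 2);                          // 3 x 2 = 6 blocks
dim3 block(16, 16);                       // 16 x 16 = 256 threads per block
my_kernel<<<grid, block>>>(d_data, 48);   // covers a 48 x 32 element array
```

grid.x * block.x threads span the x dimension (3 * 16 = 48) and grid.y * block.y span y (2 * 16 = 32), so the launch shape is chosen to tile the data.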
Prepare and Start AWS Instance
Select the correct lab (Montreal GPU Programming Workshop) and, once it is enabled, press Start Lab
Software
GPU Driver
CUDA toolkit
Includes all the software necessary for developers to write applications
Compiler (nvcc), libraries, profiler, debugger, documentation
CUDA Samples
Samples illustrating GPU functionality and performance
COME DO YOUR LIFE'S WORK
JOIN NVIDIA
We are looking for great people at all levels to help us accelerate the next wave of AI-driven computing in Research, Engineering, and Sales and Marketing.
Our work opens up new universes to explore, enables amazing creativity and discovery, and powers inventions that were once science fiction, like artificial intelligence and autonomous cars.