
Seminar ‘11 CUDA

Contents

1 WHAT IS CUDA?

2 EXECUTION MODEL

3 IMPLEMENTATION

4 APPLICATIONS

6/3/2019 2

What is CUDA?

 CUDA – Compute Unified Device Architecture

 A hardware and software architecture for computing on the GPU

 Developed by Nvidia, first released in 2007

 GPU
 Performs massive numbers of tasks simultaneously and quickly by using several ALUs
 Before CUDA, these ALUs were programmable only through graphics APIs



What is CUDA?

 With CUDA there is no need to map computation onto graphics APIs

 CUDA makes number crunching very fast

 CUDA is well suited to highly parallel algorithms and large datasets

 Consists of a heterogeneous programming model and software environment
 Hardware and software models
 An extension of the C programming language
 Designed to enable heterogeneous computation
 Computation with both the CPU and the GPU

CUDA kernels & threads

 Device = GPU
 Executes the parallel portions of an application as kernels

 Host = CPU
 Executes the serial portions of an application

 Kernel = a function that runs on the device
 One kernel executes at a time
 Many threads execute each kernel
 Host and device each have their own memory
 Host and device are connected by PCI Express x16

Arrays of parallel threads

 A CUDA kernel is executed by an array of threads
 All threads run the same code

 Each thread has an ID that it uses to compute memory addresses


Thread batching
 Thread cooperation is valuable
 Share results to avoid redundant computation
 Share memory accesses

 Thread block = group of threads

 Threads cooperate using shared memory and synchronization

 For a 2-dimensional block, the thread ID is calculated as

 x + y·Dx

(x,y) – thread index

(Dx,Dy) – block size



Thread Batching (Contd…)

 For a 3-dimensional block, the thread ID is x + y·Dx + z·Dx·Dy

(x,y,z) – thread index

(Dx,Dy,Dz) – block size

 Grid = group of thread blocks


Thread Batching (Contd…)

 Each block also has a block ID
 • Calculated in the same way as the thread ID

 Threads in different blocks cannot cooperate


Transparent Scalability

 Hardware is free to schedule thread blocks on any processor
 A kernel therefore scales across any number of parallel multiprocessors


CUDA architectures

Architecture codename             G80          GT200        Fermi
Release year                      2006         2008         2010
Number of transistors             681 million  1.4 billion  3.0 billion
Streaming multiprocessors (SMs)   16           30           16
Streaming processors (per SM)     8            8            32
Streaming processors (total)      128          240          512
Shared memory (per SM)            16 KB        16 KB        Configurable 48 KB or 16 KB
L1 cache (per SM)                 None         None         Configurable 16 KB or 48 KB


8 & 10 Series Architecture

G80

GT200


Kernel memory access

 Per thread – registers and local memory

 Per block – shared memory

 Per device – global memory


Physical Memory Layout

 “Local” memory resides in device DRAM
 Use registers and shared memory to minimize local memory use

 The host can read and write global memory, but not shared memory


Execution Model
 Threads are executed by thread processors

 Thread blocks are executed by multiprocessors

 A kernel is launched as a grid of thread blocks


CUDA software development


Compiling CUDA code

 The nvcc compiler compiles .cu files, dividing the code into Nvidia device assembly and host C++ code.


Example
#include <stdlib.h>
#include <assert.h>
#include <cuda_runtime.h>

int main(void){
    float *a_h, *b_h; // host data
    float *a_d, *b_d; // device data
    int N = 15, nBytes, i;
    nBytes = N*sizeof(float);
    a_h = (float*)malloc(nBytes);
    b_h = (float*)malloc(nBytes);
    cudaMalloc((void**)&a_d, nBytes);
    cudaMalloc((void**)&b_d, nBytes);
    for(i=0; i<N; i++) a_h[i] = 100.f + i;
    cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
    cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
    for(i=0; i<N; i++) assert(a_h[i] == b_h[i]);
    free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d);
    return 0;
}

Applications

Finance Numeric Medical

Oil & Gas Biophysics

Audio Video Imaging



Advantages

 Provides shared memory

 Cost effective

 The gaming industry’s demand for graphics cards has driven a great deal of research and investment into improving GPUs

 Transparent scalability


Drawbacks

 Despite having hundreds of “cores”, CUDA devices are not as flexible as CPUs

 Not as effective for personal computers


Future Scope

 Implementation of CUDA in GPUs from several other companies

 More and more streaming processors can be included

 CUDA support in a wide variety of programming languages


Conclusion

 CUDA has brought significant innovations to the high-performance computing world.

 CUDA has simplified the development of general-purpose parallel applications.

 These applications now have enough computational power to produce proper results in a short time.



Questions?

