
GPU in DNN

By Abdul Karim

Principal Supervisor: Professor Abdul Sattar


Associate Supervisor: M. A. Hakim Newton
Why do we need specialized hardware for training Deep Neural Networks?

When you train a deep learning model, two main operations are performed:

• Forward pass
• Backward pass

Both involve matrix multiplication. This seems to be a very simple task, but real-world data normally has hundreds or thousands of dimensions/parameters.
Motivation

What does it take to reach human-level performance with a machine-learning algorithm?

• A huge data set
• An appropriate ML algorithm

Problem: my camera should identify each and every scene it sees, like the human eye. To get there, I should train my deep Convolutional Neural Network with millions of images.

Some real-world example: Places
http://places.csail.mit.edu/
A new scene-centric database called Places

• A repository of 10 million scene photographs, labeled with scene semantic categories.

• Each image is of shape (64, 64, 3), where 3 is for the 3 channels (RGB), i.e. 64 × 64 × 3 = 12288 values per image.

Size of our training data set: 10 million × 12288
Forward pass: A Reasonable Deep Neural Network
(Places: http://places.csail.mit.edu/)

[Figure: a fully connected network applied to the whole training set at once. The input is 12288 × 10 million; the weight matrices are 10 × 12288, 10 × 10, 10 × 10 and 1 × 10; the layer activations are 10 × 10 million, 10 × 10 million, 10 × 10 million and 1 × 10 million, ending at the error.]
Backward pass and weight update: A Reasonable Deep Neural Network
(Places: http://places.csail.mit.edu/)

[Figure: the same network traversed in reverse for the weight update. The matrix shapes are as in the forward pass: weights 10 × 12288, 10 × 10, 10 × 10, 1 × 10; activations 10 × 10 million down to 1 × 10 million, ending at the error.]
Repeat until convergence
{
 Forward propagation to calculate the error (cost / objective function).
 Update the weights through backward propagation.
}

Questions:
 Is there an optimum error which is related to the number of iterations? What does convergence mean?
 Does the weight update keep any information (any imprint) of the previous weights?
 How is the optimization in the first iteration related to the immediately following iteration?

(A minimal runnable sketch of this loop is given below.)
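The following is a minimal sketch of the "repeat until convergence" loop, using NumPy with the layer sizes from the slides (12288-10-10-10-1). The batch size, learning rate, and iteration count are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [12288, 10, 10, 10, 1]
batch, lr, iters = 100, 0.1, 50            # illustrative values only

X = rng.standard_normal((sizes[0], batch))        # input matrix, 12288 x batch
Y = rng.integers(0, 2, (1, batch)).astype(float)  # binary labels, 1 x batch
W = [rng.standard_normal((sizes[i + 1], sizes[i])) * 0.01 for i in range(4)]

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for it in range(iters):
    # Forward propagation: compute activations and the squared error.
    A = [X]
    for Wl in W:
        A.append(sigmoid(Wl @ A[-1]))
    error = 0.5 * np.sum((A[-1] - Y) ** 2)

    # Backward propagation: chain rule, layer by layer, then the weight update.
    delta = (A[-1] - Y) * A[-1] * (1 - A[-1])      # gradient at the output layer
    for l in reversed(range(4)):
        grad = delta @ A[l].T                      # dE/dW[l]
        if l > 0:
            delta = (W[l].T @ delta) * A[l] * (1 - A[l])
        W[l] -= lr * grad                          # small step from the previous weights

print("final squared error:", error)
```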

Example of human-level error for the same task:

[Figure: error versus number of iterations for the DL model, with reference lines for human-level error and the Bayes error rate.]

<<Deep Learning: Ian Goodfellow, Yoshua Bengio and Aaron Courville>>

 The Bayes error rate is the lowest possible error rate for any classifier of a random outcome.

 Does the weight update keep any information (any imprint) of the previous weights?
OR
 How is the optimization in the first iteration related to the immediately following iteration?

Weight update: yes, each weight w is incremented or decremented by a small factor from its previous value.

 In each subsequent iteration, the weights step up or step down a little baby step from the previous weights.
 Not all the weights change (a weight whose gradient is zero stays where it is), but the gradient is calculated for every weight in every iteration.
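Written out, the "baby step" above is the standard gradient-descent update (a sketch; the symbols w, E and the learning rate η are ours rather than from the slides):

```latex
w^{(t+1)} = w^{(t)} - \eta \, \frac{\partial E}{\partial w}\bigg|_{w = w^{(t)}}
```

When the gradient is zero the weight stays put, which is why not every weight moves in every iteration even though the gradient is computed for all of them.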

Our idea
 Somehow we won't need to calculate this. Instead, we will compute another term that reflects the variation in the input data.
 In each layer of the DNN, there is a multiplication of matrices whose sizes run into the millions.

[Figure: the same network as before. Input 12288 × 10 million; weights 10 × 12288, 10 × 10, 10 × 10, 1 × 10; activations 10 × 10 million, 10 × 10 million, 10 × 10 million, 1 × 10 million, ending at the error.]
Numbers: a × b × c, with a, b, c, … scalars
Matrices: A × B × C, with A, B, C, … matrices

 The CPU is latency optimized .......... the Ferrari.
 The GPU is bandwidth optimized .......... the truck.

The best CPUs have about 50 GB/s of memory bandwidth, while the best GPUs have about 750 GB/s. GPUs hide their memory latency under thread parallelism, so for large chunks of memory they provide the best memory bandwidth while having almost no drawback due to latency.
Why do we need a GPU? How specifically does it help in matrix multiplication and CPU cycle usage?

• What if we do the same task using a CPU?
• Is there anything specific to back propagation which, if we eliminate it, can reduce the training time? Give an example on a sample dataset.
• An algorithm to do the above task.
• Improvement on a sample data set.
• Performance on a real-world data set, in terms of CPU-cycle or GPU-cycle improvement.

The only deep learning library which currently implements efficient algorithms across GPUs and across computers is CNTK, which uses Microsoft's special parallelization algorithms of 1-bit quantization (efficient) and block momentum (very efficient).

A simple way to understand the difference between a GPU and a CPU is to compare how they
process tasks. A CPU consists of a few cores optimized for sequential serial processing while a
GPU has a massively parallel architecture consisting of thousands of smaller, more efficient cores
designed for handling multiple tasks simultaneously.

• We need to write an interface program to deploy our deep learning NN on the GPU.

• Deep learning frameworks like TensorFlow can do this for us, as sketched below.
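As a hedged sketch of what "the framework does it for us" means: device names and availability depend on the local TensorFlow 2.x installation, and the shapes below are small stand-ins for the huge matrices above.

```python
import tensorflow as tf

# Ask TensorFlow which GPUs it can see, then place a matrix multiplication on one.
gpus = tf.config.list_physical_devices('GPU')
print("GPUs visible to TensorFlow:", gpus)

device = '/GPU:0' if gpus else '/CPU:0'
with tf.device(device):
    a = tf.random.normal((10, 12288))
    b = tf.random.normal((12288, 1000))   # small stand-in for the 10-million-column batch
    c = tf.matmul(a, b)                   # runs on the GPU when one is available
print("result shape:", c.shape, "computed on", device)
```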
Select a suitable framework

Select a GPU
NVIDIA GPUs for deep learning are available in desktops, notebooks, servers, and supercomputers around the world, as well as in cloud services from Amazon, IBM, Microsoft, and Google.

• NVIDIA® DGX™ systems
• NVIDIA TITAN Xp: 3840 NVIDIA® CUDA® cores running at 1.6 GHz, 12 TFLOPS of brute force
• NVIDIA Quadro® GP100


Definitions of FLOP

1) The simple plural of "FLOP" (e.g. "operation X takes 50 FLOPs").

2) The rate of FLOPs in the first sense, i.e. floating-point math operations per second (FLOPS).

 Let us assume that one FLOP is required to perform an addition, a multiplication, a division or an exponential (the minimum possible, to set a minimum criterion).

In a real-world scenario, multiplication is more expensive than addition, and division and exponential are more expensive still.

<http://www.latkin.org/blog/2014/11/09/a-simple-benchmark-of-various-math-operations/>
FLOPS required for Matrix operations

[1] G. H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 1991.
[2] Kh.D. Ikramov and N.V. Savel’eva, “Conditionally Definite Matrices,” Journal of Mathematical
Sciences, vol. 98, no. 1, pp. 1–50, 2000.
Forward pass: A Reasonable Deep Neural Network

Number of FLOPs per operation:
• Addition = 1 FLOP
• Multiplication = 1 FLOP
• Division = 1 FLOP
• 3 FLOPs per sigmoid (one exponential, one addition and one division)

[Figure: the same network as before. Input 12288 × 10 million; weights 10 × 12288, 10 × 10, 10 × 10, 1 × 10; activations 10 × 10 million, 10 × 10 million, 10 × 10 million, 1 × 10 million, ending at the error.]
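Under the one-FLOP-per-operation convention, an (M × N) times (N × L) product costs M · L · (2N − 1) FLOPs: each of the M · L output entries needs N multiplications and N − 1 additions. A small helper (names are ours, a sketch rather than anything from the slides) reproduces the per-layer figures on the next slides.

```python
def matmul_flops(M, N, L):
    """FLOPs for an (M x N) times (N x L) matrix product: N multiplications
    and N - 1 additions per output entry, at 1 FLOP per operation."""
    return M * L * (2 * N - 1)

def sigmoid_flops(rows, cols):
    """Element-wise sigmoid at 3 FLOPs per entry (exponential, addition, division)."""
    return 3 * rows * cols

m = 10_000_000                       # 10 million training examples
print(matmul_flops(10, 12288, m))    # layer 1:        2.4575 x 10^12
print(matmul_flops(10, 10, m))       # layers 2 and 3: 1.9 x 10^9
print(matmul_flops(1, 10, m))        # output layer:   190 x 10^6
print(sigmoid_flops(10, m))          # hidden sigmoid: 300 x 10^6
print(sigmoid_flops(1, m))           # output sigmoid: 30 x 10^6
```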
Forward pass, layer 1: a 10 × 12288 weight matrix (M × N) times the 12288 × 10 million input (N × L) gives a 10 × 10 million result (M × L).

• FLOPs for the matrix product: M · L · (2N − 1) = 2.4575 × 10^12
• Sigmoid on the 10 × 10 million entries (3 operations per sigmoid, 1 FLOP per operation): 300 × 10^6 FLOPs
Forward pass, layer 2: a 10 × 10 weight matrix (M × N) times the 10 × 10 million activations from layer 1 (N × L) gives a 10 × 10 million result (M × L).

• FLOPs for the matrix product: M · L · (2N − 1) = 1.9 × 10^9
• Sigmoid on the 10 × 10 million entries (3 FLOPs each): 300 × 10^6 FLOPs
Forward pass, layer 3: again a 10 × 10 weight matrix times a 10 × 10 million activation matrix.

• FLOPs for the matrix product: 1.9 × 10^9
• Sigmoid on the 10 × 10 million entries (3 FLOPs each): 300 × 10^6 FLOPs
Forward pass, output layer: a 1 × 10 weight matrix (M × N) times the 10 × 10 million activations (N × L) gives the 1 × 10 million output (M × L).

• FLOPs for the matrix product: M · L · (2N − 1) = 190 × 10^6
• Sigmoid on the 1 × 10 million entries (3 FLOPs each): 30 × 10^6 FLOPs
Predictions (1 × 10 million): 0.3 0.6 0.9 0.9 0.1 0.01 0.3 0.4 0.99 0.001 …
Labels (1 × 10 million):      1   0   1   1   1   1    0   1   0    0 …

Simplest form of error (square loss), reducing the 1 × 10 million predictions to a 1 × 1 error:
• Number of subtractions = 10 million
• Number of squares = 10 million
• Number of additions = 10 million

Total FLOPs in calculating the error = 30 × 10^6

 We have selected square loss as our cost function and the sigmoid as our unit activation function.
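For the square loss above, each of the 10 million predictions costs one subtraction, one squaring and one addition into the running sum; a two-line check (a sketch, not from the slides) confirms the total.

```python
m = 10_000_000
error_flops = 3 * m      # subtract, square, add per example
print(error_flops)       # 30000000, i.e. 30 x 10^6 FLOPs
```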

Total FLOPs in calculating the gradient term, by the chain rule: 300.00020999999 × 10^12

The chain-rule factors cost approximately 100.00001 × 10^12 FLOPs, 10 million FLOPs, 199.99999 × 10^12 FLOPs and 199.99999 × 10^6 FLOPs respectively.
General procedure for obtaining the gradients in layers (L−1) down to 1

[Figure: the same network as before; weights 10 × 12288, 10 × 10, 10 × 10, 1 × 10; activations 10 × 10 million down to 1 × 10 million.]

Matrix products involved (each result of size M × L comes from an (M × N) times (N × L) product) and their FLOP counts:
• (10m × 10) times (10 × 10m), result 10m × 10m: 1.9 × 10^15 FLOPs
• (10 × 1) times (1 × 10m), result 10 × 10m
• (10 × 10m) times (10m × 10m), result 10 × 10m: 1.9999999 × 10^15 FLOPs
• (10 × 10m) times (10m × 10), result 10 × 10: 1.9999999 × 10^9 FLOPs

An intermediate figure of 1.9000001 × 10^15 FLOPs also appears.

FLOPs for this gradient = 3.9000020999999 × 10^15 FLOPs


The same procedure for the next layer down:

[Figure: the same network again; weights 10 × 12288, 10 × 10, 10 × 10, 1 × 10; activations 10 × 10 million down to 1 × 10 million.]

Matrix products involved and their FLOP counts:
• (10m × 10) times (10 × 10m), result 10m × 10m: 1.9 × 10^15 FLOPs
• (10 × 10) times (10 × 10m), result 10 × 10m: 1.9 × 10^9 FLOPs
• (10 × 10m) times (10m × 10m), result 10 × 10m: 1.9999999 × 10^15 FLOPs
• (10 × 10m) times (10m × 10), result 10 × 10: 1.9999999 × 10^9 FLOPs

An intermediate figure of 1.9000001 × 10^15 FLOPs also appears.

FLOPs for this gradient = 3.9000038999999 × 10^15 FLOPs


And for the first layer, whose weight matrix is 10 × 12288:

[Figure: the same network again; weights 10 × 12288, 10 × 10, 10 × 10, 1 × 10; activations 10 × 10 million down to 1 × 10 million.]

Matrix products involved and their FLOP counts:
• (10m × 10) times (10 × 10m), result 10m × 10m: 1.9 × 10^15 FLOPs
• (10 × 10) times (10 × 10m), result 10 × 10m: 1.9 × 10^9 FLOPs
• (10 × 10m) times (10m × 10m), result 10 × 10m: 1.9999999 × 10^15 FLOPs
• (10 × 10m) times (10m × 12288), result 10 × 12288: 2.45759987712 × 10^12 FLOPs

An intermediate figure of 1.9000001 × 10^15 FLOPs also appears.

FLOPs for this gradient = 3.90245949987710 × 10^15 FLOPs
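Most of the figures on the three gradient slides follow the same M · L · (2N − 1) counting rule used for the forward pass; a small self-contained check (the helper name is ours):

```python
def matmul_flops(M, N, L):
    # FLOPs for an (M x N) times (N x L) product at 1 FLOP per add/multiply.
    return M * L * (2 * N - 1)

m = 10_000_000
print(matmul_flops(m, 10, m))        # (10m x 10)(10 x 10m):    1.9 x 10^15
print(matmul_flops(10, m, m))        # (10 x 10m)(10m x 10m):   1.9999999 x 10^15
print(matmul_flops(10, m, 10))       # (10 x 10m)(10m x 10):    1.9999999 x 10^9
print(matmul_flops(10, 10, m))       # (10 x 10)(10 x 10m):     1.9 x 10^9
print(matmul_flops(10, m, 12288))    # (10 x 10m)(10m x 12288): 2.45759987712 x 10^12
```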


Number of FLOPs required for the weight update (one multiplication and one subtraction per weight):

• 10 × 12288 weights: additions = 122880, multiplications = 122880, total FLOPs for update = 245760
• 10 × 10 weights: additions = 100, multiplications = 100, total FLOPs for update = 200
• 10 × 10 weights: additions = 100, multiplications = 100, total FLOPs for update = 200
• 1 × 10 weights: additions = 10, multiplications = 10, total FLOPs for update = 20

(A small check of these numbers follows.)

Total FLOPs required for one iteration =
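A quick check of the update cost under the 2-FLOPs-per-weight accounting (one multiplication by the learning rate and one subtraction); the variable names are ours:

```python
shapes = [(10, 12288), (10, 10), (10, 10), (1, 10)]     # the four weight matrices
update_flops = [2 * rows * cols for rows, cols in shapes]
print(update_flops)        # [245760, 200, 200, 20]
print(sum(update_flops))   # 246180 FLOPs for the weight updates in one iteration
```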


DIGITS: Deep Learning GPU Training System
Open Source:

https://devblogs.nvidia.com/parallelforall/digits-deep-learning-gpu-training-system/
• In a recent talk at the Center for Brains, Minds and Machines at MIT, I heard Professor Josh Tenenbaum mention (quoting someone else): "Deep Learning works very well in problems where there is a repetitive structure in space or time."
For i = 1 to 10 million:

[Figure: the same network applied to one example at a time. The input is a 12288 × 1 column vector; the weight matrices are 10 × 12288, 10 × 10, 10 × 10 and 1 × 10; the layer activations are 10 × 1, 10 × 1, 10 × 1 and finally 1 × 1.]
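A minimal NumPy sketch of this per-example loop (the weights are random placeholders and the example count is reduced; only the shapes are from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
W = [rng.standard_normal(s) * 0.01
     for s in [(10, 12288), (10, 10), (10, 10), (1, 10)]]
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for i in range(1000):                      # "for i = 1 .. 10 million" on the slide
    x = rng.standard_normal((12288, 1))    # one example, 12288 x 1
    a = x
    for Wl in W:
        a = sigmoid(Wl @ a)                # 10x1, 10x1, 10x1, then 1x1
    # a is now the 1 x 1 prediction for example i
```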
Steps for DNN Experiments

https://www.nvidia.com/en-us/deep-learning-ai/developer/
