By Abdul Karim
When you train a deep learning model, two main operations are performed:
• Forward pass
• Backward pass
Both involve matrix multiplication. But real-world data normally has hundreds or thousands of dimensions/parameters.
Problem: My camera should identify each and every scene it sees, just like the human eye does.
Forward pass: A Reasonable Deep Neural Network
Data: a new scene-centric database called Places (http://places.csail.mit.edu/), 10 million images x 12288 features each.
[Figure: forward pass through the network. The 12288 x 10 million input is multiplied by layer weight matrices of sizes 10x12288, 10x10, 10x10, and 1x10, producing activations of sizes 10x10 million, 10x10 million, 10x10 million, and 1x10 million, which feed into the error.]
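A minimal NumPy sketch of this forward pass, assuming sigmoid activations (the activation whose cost is counted later) and a small batch standing in for the 10 million images; the layer sizes are the ones shown in the figure:

import numpy as np

def sigmoid(z):
    # Elementwise sigmoid: one exponential, one addition, one division per entry.
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
batch = 1000                                   # stand-in for the 10 million images
X = rng.standard_normal((12288, batch))        # input: 12288 x batch

W = [rng.standard_normal((10, 12288)),         # layer 1 weights (10x12288)
     rng.standard_normal((10, 10)),            # layer 2 weights (10x10)
     rng.standard_normal((10, 10)),            # layer 3 weights (10x10)
     rng.standard_normal((1, 10))]             # output layer weights (1x10)

A = X
for W_l in W:
    A = sigmoid(W_l @ A)                       # (out x in) @ (in x batch)
print(A.shape)                                 # (1, batch): one prediction per image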
Backward pass and weight update: A Reasonable Deep Neural Network (same Places data, http://places.csail.mit.edu/).
[Figure: the same network traversed in reverse, from the error back through the 1x10, 10x10, 10x10, and 10x12288 weight matrices towards the 12288 x 10 million input, in order to update the weights.]
Repeat until convergence {
Forward propagation to calculate the error (the cost, or objective function).
Update the weights through backward propagation.
}
Is there an optimum error, related to the number of iterations? What does convergence mean?
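A minimal, self-contained sketch of this loop, using a single sigmoid unit (logistic regression) on random data as a stand-in for the full network; convergence here simply means the cost stops changing between iterations:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 500))              # 20 features x 500 examples
y = (rng.random(500) > 0.5).astype(float)       # binary labels
w = np.zeros(20)
lr, prev_cost = 0.1, np.inf

while True:                                     # repeat until convergence
    p = 1.0 / (1.0 + np.exp(-w @ X))            # forward propagation: predictions
    cost = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X @ (p - y) / y.size                 # backward propagation: dCost/dw
    w -= lr * grad                              # weight update
    if abs(prev_cost - cost) < 1e-6:            # convergence: cost stops improving
        break
    prev_cost = cost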
Does the weight update carry the info, or any imprint, of the previous weights?
[Figure: error versus number of iterations, approaching the human-level / Bayes error rate.]
Weight update
Yes: each weight w is incremented or decremented by a factor. In each subsequent iteration, the weights step up or step down by a little baby step from the previous weights. Not all the weights are updated, only those whose gradient is non-zero, but the gradient is calculated for every weight in every iteration.
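A tiny sketch of the update rule implied here, assuming plain gradient descent w_new = w_old - lr * dE/dw: each new weight is the previous weight nudged by a small step (so it carries an imprint of the previous value), and weights with a zero gradient stay put even though the gradient is computed for every weight:

import numpy as np

lr = 0.01
w_prev = np.array([0.5, -1.2, 0.0, 2.0])
grad = np.array([0.3, 0.0, -0.8, 0.1])   # dE/dw from the backward pass
w_new = w_prev - lr * grad               # a baby step away from the previous weights
print(w_new)                             # [0.497, -1.2, 0.008, 1.999]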
Our idea
Somehow we will not need to calculate this. Instead, we will compute another term that reflects the variation in the input data.
In each layer of the DNN, there is a multiplication of matrices whose sizes are on the order of millions.
[Figure: the same network again, showing the 12288 x 10 million input and the 10x12288, 10x10, 10x10, and 1x10 weight matrices involved in each layer's multiplication.]
This effectively hides latency, so that GPUs offer high bandwidth while hiding their latency under thread parallelism.
Is there anything specific to backpropagation which, if eliminated, can reduce the training time? Give an example on a sample dataset, and show the performance on a real-world dataset in terms of CPU-cycle or GPU-cycle improvement.
The only deep learning library which currently implements efficient algorithms across GPUs and across
computers is CNTK which uses Microsoft’s special parallelization algorithms of 1-bit quantization
(efficient) and block momentum (very efficient).
A simple way to understand the difference between a GPU and a CPU is to compare how they
process tasks. A CPU consists of a few cores optimized for sequential serial processing while a
GPU has a massively parallel architecture consisting of thousands of smaller, more efficient cores
designed for handling multiple tasks simultaneously.
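A hedged sketch (assuming PyTorch and a CUDA GPU are available) that runs the same large matrix multiplication on the few CPU cores and then on the GPU's thousands of cores:

import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)
c_cpu = a @ b                            # executed on the CPU's few cores

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()    # copy the operands to GPU memory
    c_gpu = a_gpu @ b_gpu                # executed across thousands of CUDA cores
    torch.cuda.synchronize()             # wait for the asynchronous GPU kernel
    print(torch.allclose(c_cpu, c_gpu.cpu(), atol=1e-2))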
Select GPU
NVIDIA GPUs for deep learning are available in
desktops, notebooks, servers, and supercomputers
around the world, as well as in cloud services from
Amazon, IBM, Microsoft, and Google.
NVIDIA® DGX™ SYSTEMS
NVIDIA TITAN Xp: 3840 NVIDIA® CUDA® cores running at 1.6 GHz, 12 TFLOPS of brute force.
In a real-world scenario, multiplication is more expensive than addition, and division and the exponential are more expensive still, in that order.
<http://www.latkin.org/blog/2014/11/09/a-simple-benchmark-of-various-math-operations/>
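A rough Python micro-benchmark in the spirit of the linked post (interpreter overhead dominates, so only the relative ordering is meaningful):

import timeit

setup = "import math; x = 1.2345; y = 6.789"
for name, stmt in [("add", "x + y"), ("mul", "x * y"),
                   ("div", "x / y"), ("exp", "math.exp(x)")]:
    t = timeit.timeit(stmt, setup=setup, number=1_000_000)
    print(f"{name}: {t:.3f} s per million operations")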
FLOPS required for Matrix operations
[1] G. H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 1991.
[2] Kh.D. Ikramov and N.V. Savel’eva, “Conditionally Definite Matrices,” Journal of Mathematical
Sciences, vol. 98, no. 1, pp. 1–50, 2000.
Forward pass: A Reasonable Deep Neural Network
Number of FLOPs:
• Addition = 1 FLOP
• Multiplication = 1 FLOP
• Division = 1 FLOP
[Figure: the forward pass again, annotated layer by layer with its FLOP count. Input 12288 x 10 million; weight matrices 10x12288, 10x10, 10x10, 1x10; activations 10x10 million, 10x10 million, 10x10 million, 1x10 million.]
Layer 1: the sigmoid over the 10 x 10 million activations takes 3 operations per entry at 1 FLOP per operation, i.e. about 300 × 10^6 FLOPs.
Layer 2: the (MxN)(NxL) product with MxN = 10x10 and NxL = 10x10 million gives an MxL = 10x10 million result and costs about 1.9 × 10^9 FLOPs; the sigmoid adds another 300 × 10^6 FLOPs.
Layer 3: identical to layer 2: about 1.9 × 10^9 FLOPs for the product plus 300 × 10^6 FLOPs for the sigmoid.
Layer 4 (output): the product with MxN = 1x10 and NxL = 10x10 million gives a 1x10 million result and costs about 190 × 10^6 FLOPs; the sigmoid over it adds about 30 × 10^6 FLOPs.
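A hypothetical FLOP counter for one dense layer, using the convention above: an (MxN)(NxL) product costs M*L*(2N - 1) FLOPs (N multiplications and N - 1 additions per output entry), and the sigmoid costs 3 FLOPs per entry; this reproduces the 1.9 × 10^9, 190 × 10^6, 300 × 10^6 and 30 × 10^6 figures above (the first-layer product count depends on the exact convention used):

def layer_flops(m, n, batch):
    # (m x n) weights times (n x batch) activations, then an elementwise sigmoid.
    matmul = m * batch * (2 * n - 1)     # n multiplications and n-1 additions per entry
    sigmoid = 3 * m * batch              # exponential, addition, division per entry
    return matmul, sigmoid

batch = 10_000_000                       # the 10 million examples
for m, n in [(10, 12288), (10, 10), (10, 10), (1, 10)]:
    mm, sg = layer_flops(m, n, batch)
    print(f"{m}x{n} layer: matmul {mm:.4g} FLOPs, sigmoid {sg:.4g} FLOPs")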
[Figure: the 1 x 10 million vector of predicted outputs (e.g. 0.3, 0.6, 0.9, 0.9, 0.1, 0.01, 0.3, 0.4, 0.99, 0.001, …) is compared against the 1 x 10 million vector of true labels (1, 0, 1, 1, 1, 1, 0, 1, 0, 0, …) to obtain the error.]
By the chain rule, the FLOP counts work out to:
• 100.00001 × 10^12 FLOPs
• 10 million FLOPs
• 199.99999 × 10^12 FLOPs
• 199.99999 × 10^6 FLOPs
General procedure for obtaining the gradients from layer (L-1) down to layer 1.
[Figure: the chain-rule expressions for each layer's gradient, built from the 12288 x 10 million input and intermediate matrices of sizes 10x10 million, 10x10 million, and 10 million x 10 million; the counts shown are about 2.45759987712 × 10^12 FLOPs and 1.9000001 × 10^15 FLOPs.]
Total FLOPs for the weight update in each layer: 245760, 200, 200, and 20 FLOPs respectively.
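Assuming the plain update rule w <- w - lr * dE/dw (one multiplication and one subtraction, i.e. 2 FLOPs per weight), these per-layer update totals can be reproduced directly from the weight-matrix sizes:

weight_shapes = [(10, 12288), (10, 10), (10, 10), (1, 10)]
for rows, cols in weight_shapes:
    # 2 FLOPs per weight: multiply the gradient by the learning rate, then subtract.
    print(f"{rows}x{cols} weights -> update = {2 * rows * cols} FLOPs")
# prints 245760, 200, 200 and 20 FLOPs, matching the totals above.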
https://devblogs.nvidia.com/parallelforall/digits-deep-learning-gpu-training-system/
• In a recent talk at the Center for Brains, Minds and Machines at MIT, I heard Professor Josh Tenenbaum mention (quoting someone else): "Deep Learning works very well in problems where there is a repetitive structure in space or time."
For i = 1 to 10 million:
[Figure: the same network applied to a single example i: a 12288 x 1 input is pushed through the 10x12288, 10x10, 10x10, and 1x10 weight matrices, giving 10x1, 10x1, 10x1, and 1x1 activations that end in the error.]
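A small NumPy sketch contrasting this one-example-at-a-time loop with the batched matrix multiplication used earlier; both produce the same first-layer activations, but the batched form does it in a single large product:

import numpy as np

rng = np.random.default_rng(0)
n_examples, n_features = 1000, 12288            # stand-in for the 10 million images
X = rng.standard_normal((n_features, n_examples))
W1 = rng.standard_normal((10, n_features))

# Per-example processing, as in the "for i = 1 to 10 million" loop above.
outs = [W1 @ X[:, i:i + 1] for i in range(n_examples)]   # list of 10x1 results

# Batched processing: one 10x12288 by 12288x1000 multiplication.
out_batched = W1 @ X                                      # 10x1000

assert np.allclose(np.hstack(outs), out_batched)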
Steps for DNN Experiments
https://www.nvidia.com/en-us/deep-learning-ai/developer/