
Administrative

- How is the assignment going?
- By the way, the notes get updated all the time based on your feedback.
- No lecture on Monday.

Lecture 4:
Optimization

Fei-Fei Li & Andrej Karpathy, 7 Jan 2015


Image Classification
assume a given set of discrete labels
{dog, cat, truck, plane, ...}

(example image labeled "cat")


Data-driven approach


1. Score function



1. Score function

2. Two loss functions



Three key components to training Neural Nets:


1. Score function
2. Loss function
3. Optimization


Brief aside: Image Features

- In practice, it is very rare to see Computer Vision applications
  that train linear classifiers directly on pixel values



Example: Color (Hue) Histogram

(figure: each pixel votes +1 into one of the hue bins, giving a fixed-size histogram feature)
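A minimal sketch of the idea, assuming an RGB image as a NumPy array with values in [0, 1]; the function name and bin count are illustrative:

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv

def hue_histogram(rgb_image, num_bins=16):
    """Each pixel votes +1 into the hue bin it falls into."""
    # rgb_image: H x W x 3 array with values in [0, 1]
    hsv = rgb_to_hsv(rgb_image)
    hue = hsv[..., 0]                                # hue channel, values in [0, 1]
    hist, _ = np.histogram(hue, bins=num_bins, range=(0.0, 1.0))
    return hist.astype(np.float32)                   # fixed-size feature vector
```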


Example: HOG features


8x8 pixel region,
quantize the edge
orientation into 9 bins

(images from vlfeat.org)
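A rough sketch of just the binning step (not the full HOG pipeline with block normalization), assuming a grayscale image as a float NumPy array:

```python
import numpy as np

def orientation_histograms(gray, cell=8, num_bins=9):
    """Quantize edge orientations into num_bins per cell x cell pixel region."""
    gy, gx = np.gradient(gray)                       # image gradients along rows / columns
    mag = np.hypot(gx, gy)                           # edge strength
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0     # unsigned orientation in [0, 180)
    H, W = gray.shape
    hists = []
    for i in range(0, H - cell + 1, cell):
        for j in range(0, W - cell + 1, cell):
            a = ang[i:i + cell, j:j + cell].ravel()
            m = mag[i:i + cell, j:j + cell].ravel()
            hist, _ = np.histogram(a, bins=num_bins, range=(0.0, 180.0), weights=m)
            hists.append(hist)
    return np.array(hists)                           # one 9-bin histogram per 8x8 cell
```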


Example: Bag of Words

1. Resize each patch to a fixed size (e.g. 32x32 pixels)
2. Extract HOG on the patch (get 144 numbers)

Repeat for each detected feature; this gives a matrix of size
[number_of_features x 144]

Problem: different images will have different numbers of
features. We need fixed-sized vectors for linear classification.


Solution: learn k-means centroids over the 144-d HOG descriptors to build a
vocabulary of visual words (e.g. 1000 centroids). Each image is then encoded
as a histogram of visual words: a fixed 1000-d vector.
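A minimal sketch of the encoding step, assuming a [number_of_features x 144] matrix of HOG descriptors per image; scikit-learn's KMeans and the name descriptors_per_image are illustrative choices, not prescribed by the lecture:

```python
import numpy as np
from sklearn.cluster import KMeans

# 1) learn a vocabulary of visual words from descriptors pooled over many images
#    (descriptors_per_image is an assumed list of (n_i, 144) arrays)
all_descriptors = np.vstack(descriptors_per_image)
kmeans = KMeans(n_clusters=1000, n_init=10).fit(all_descriptors)

# 2) encode each image as a fixed-size histogram of visual words
def bag_of_words(descriptors, kmeans):
    words = kmeans.predict(descriptors)                    # nearest centroid per descriptor
    hist = np.bincount(words, minlength=kmeans.n_clusters)
    return hist / max(hist.sum(), 1)                       # normalized 1000-d vector
```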




Most recognition systems are built on the same architecture

CNNs: end-to-end models
(slide from Yann LeCun)



Visualizing the (SVM) loss function



The full data loss:
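A standard way to write the full multiclass SVM data loss (assuming margin 1, N training examples, and classifier rows $w_j$):

```latex
L = \frac{1}{N} \sum_{i=1}^{N} \sum_{j \neq y_i} \max\!\left(0,\; w_j^T x_i - w_{y_i}^T x_i + 1\right)
```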


Suppose there are 3 examples with 3 classes (classes 0, 1, 2 in sequence);
then this becomes:
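Expanding the loss above for this case (assuming examples $x_0, x_1, x_2$ with labels 0, 1, 2):

```latex
L = \tfrac{1}{3}\big[ \max(0,\, w_1^T x_0 - w_0^T x_0 + 1) + \max(0,\, w_2^T x_0 - w_0^T x_0 + 1) \\
\qquad\;\; + \max(0,\, w_0^T x_1 - w_1^T x_1 + 1) + \max(0,\, w_2^T x_1 - w_1^T x_1 + 1) \\
\qquad\;\; + \max(0,\, w_0^T x_2 - w_2^T x_2 + 1) + \max(0,\, w_1^T x_2 - w_2^T x_2 + 1) \big]
```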



Question: CIFAR-10 has 50,000 training images (5,000 per class)
and 10 labels. How many times does one classifier row appear in
the full data loss?


Optimization


Strategy #1: A first, very bad idea: random search
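A minimal sketch of the idea (not the lecture's exact code), assuming CIFAR-10 with the bias trick (so W is 10 x 3073), training data X_train, Y_train, and a loss function L(X, y, W); all of these names are assumptions:

```python
import numpy as np

bestloss = float("inf")                       # track the best loss seen so far
bestW = None
for num in range(1000):
    W = np.random.randn(10, 3073) * 0.0001    # generate small random parameters
    loss = L(X_train, Y_train, W)             # assumed loss function over the training set
    if loss < bestloss:
        bestloss = loss
        bestW = W
    print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))
```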



What's up with the 0.0001?


Let's see how well this works on the test set...


Fun aside:
When W = 0, what is the CIFAR-10 loss for SVM and Softmax?
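A quick worked check (assuming 10 classes, margin 1, and no regularization): with W = 0 every score is 0, so

```latex
L_{\text{SVM}} = \frac{1}{N}\sum_i \sum_{j \neq y_i} \max(0,\, 0 - 0 + 1) = 9,
\qquad
L_{\text{Softmax}} = \frac{1}{N}\sum_i -\log\frac{e^{0}}{\sum_{j=1}^{10} e^{0}} = \log 10 \approx 2.3
```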


Strategy #2: A better but still very bad idea: random local search
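A sketch of the idea (same assumed L, X_train, Y_train as in the random-search sketch): start from a random W and only accept random perturbations that lower the loss.

```python
import numpy as np

W = np.random.randn(10, 3073) * 0.001                    # random starting point
bestloss = L(X_train, Y_train, W)
step_size = 0.0001
for i in range(1000):
    W_try = W + np.random.randn(10, 3073) * step_size    # small random perturbation
    loss_try = L(X_train, Y_train, W_try)
    if loss_try < bestloss:                              # keep the step only if it helps
        W = W_try
        bestloss = loss_try
```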


gives 21.4%!


Strategy #3: Following the gradient

In 1 dimension, the derivative of a function:

df(x)/dx = lim_{h -> 0} [f(x + h) - f(x)] / h

In multiple dimensions, the gradient is the vector of partial derivatives.


Evaluating the gradient numerically


finite difference approximation:
[f(x + h) - f(x)] / h, with a small h
Fei-Fei Li & Andrej Karpathy

Lecture 4 - 41

7 Jan 2015

in practice: the centered difference formula,
[f(x + h) - f(x - h)] / 2h, tends to be more accurate
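A sketch of a numerical gradient evaluator using the centered difference formula; the function name and the choice h = 1e-5 are illustrative:

```python
import numpy as np

def eval_numerical_gradient(f, x, h=1e-5):
    """Numerically estimate the gradient of f at x (x is a float NumPy array)."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        fxph = f(x)                           # f(x + h)
        x[ix] = old - h
        fxmh = f(x)                           # f(x - h)
        x[ix] = old                           # restore the original value
        grad[ix] = (fxph - fxmh) / (2 * h)    # centered difference along this dimension
        it.iternext()
    return grad
```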



Performing a parameter update


(figure: starting from the original W and stepping in the negative gradient direction)


The problems with the numerical gradient:
- approximate
- very slow to evaluate



We need something better...

Calculus



In summary:
- Numerical gradient: approximate, slow, easy to write
- Analytic gradient: exact, fast, error-prone

=> In practice: always use the analytic gradient, but check your
implementation with the numerical gradient. This is called a
gradient check.

Gradient check: Words of caution
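One common way to run such a check is to compare the analytic and numerical gradients with a relative error; a sketch, where eval_analytic_gradient is an assumed placeholder for your own gradient code:

```python
import numpy as np

analytic_grad = eval_analytic_gradient(f, W)   # assumed: gradient from your own derivation/backprop
numeric_grad = eval_numerical_gradient(f, W)   # slow centered-difference estimate (sketch above)

# elementwise relative error, then take the worst case over all parameters
rel_error = np.abs(analytic_grad - numeric_grad) / \
            np.maximum(1e-8, np.abs(analytic_grad) + np.abs(numeric_grad))
print('max relative error: %e' % rel_error.max())   # very roughly, ~1e-4 or less usually passes
```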


Gradient Descent


Mini-batch Gradient Descent

- only use a small portion of the training set to compute the gradient
- common mini-batch sizes are ~100 examples
  (e.g. Krizhevsky's ILSVRC ConvNet used 256 examples)
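A sketch of the training loop; the helper names sample_training_data and evaluate_gradient are illustrative, not a specific API:

```python
# Vanilla mini-batch gradient descent (sketch)
while True:
    data_batch = sample_training_data(data, batch_size=256)          # sample a mini-batch
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)  # gradient on the batch only
    weights += -step_size * weights_grad                             # parameter update
```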


Stochastic Gradient Descent (SGD)

- use a single example at a time
  (also sometimes called on-line Gradient Descent)


Summary
- Always use mini-batch gradient descent
- Incorrectly refer to it as "doing SGD", like everyone else
  (or call it batch gradient descent)
- The mini-batch size is a hyperparameter, but it is not very
  common to cross-validate over it (it is usually set based on
  practical concerns, e.g. space/time efficiency)


Fun question: Suppose you were training with a mini-batch size of 100,
and now you switch to a mini-batch of size 1. Your learning rate (step
size) should:
- increase
- decrease
- stay the same
- become zero



The dynamics of Gradient Descent

- the data loss pulls some weights up and some down
- the regularization term always pulls the weights down


Momentum Update

(figure: the update combines the previous momentum vector with the current gradient)
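A sketch of the momentum update, where mu is the momentum hyperparameter (e.g. 0.9) and v is a velocity vector, initialized to zeros, with the same shape as the weights:

```python
v = mu * v - step_size * weights_grad   # accumulate a velocity that smooths the gradient direction
weights += v                            # move along the velocity instead of the raw gradient
```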


Many other ways to perform optimization

- Second-order methods that use the Hessian (or its
  approximation): BFGS, L-BFGS, etc.
- Currently, the lesson from the trenches is that well-tuned
  SGD+Momentum is very hard to beat for CNNs.


Summary
- We looked at image features, and saw that CNNs can be thought of
  as learning the features in an end-to-end manner
- We explored intuition about what the loss surfaces of linear
  classifiers look like
- We introduced gradient descent as a way of optimizing loss
  functions, as well as batch gradient descent and SGD
- Numerical gradient: slow :(, approximate :(, easy to write :)
- Analytic gradient: fast :), exact :), error-prone :(
- In practice: gradient check (but be careful)

Next class:
Becoming a
backprop ninja
