
Administrative

- How is the assignment going?
- By the way, the notes get updated all the time based on your feedback.
- No lecture on Monday.

Lecture 4:
Optimization

Fei-Fei Li & Andrej Karpathy, 7 Jan 2015


Image Classification
assume a given set of discrete labels
{dog, cat, truck, plane, ...}

(example image labeled "cat")


Data-driven approach


1. Score function



1. Score function

2. Two loss functions



Three key components to training Neural Nets:


1. Score function
2. Loss function
3. Optimization


Brief aside: Image Features

- In practice, it is very rare to see Computer Vision applications
  that train linear classifiers directly on pixel values



Example: Color (Hue) Histogram

(figure: each pixel votes +1 into one of the hue bins, giving a fixed-size histogram feature)
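A minimal sketch of the idea, assuming an RGB image as a NumPy array with values in [0, 1]; the function name and bin count are illustrative:

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv

def hue_histogram(rgb_image, num_bins=16):
    """Each pixel votes +1 into the hue bin it falls into."""
    # rgb_image: H x W x 3 array with values in [0, 1]
    hsv = rgb_to_hsv(rgb_image)
    hue = hsv[..., 0]                                # hue channel, values in [0, 1]
    hist, _ = np.histogram(hue, bins=num_bins, range=(0.0, 1.0))
    return hist.astype(np.float32)                   # fixed-size feature vector
```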


Example: HOG features


8x8 pixel region,
quantize the edge
orientation into 9 bins

(images from vlfeat.org)
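A rough sketch of just the binning step (not the full HOG pipeline with block normalization), assuming a grayscale image as a float NumPy array:

```python
import numpy as np

def orientation_histograms(gray, cell=8, num_bins=9):
    """Quantize edge orientations into num_bins per cell x cell pixel region."""
    gy, gx = np.gradient(gray)                       # image gradients along rows / columns
    mag = np.hypot(gx, gy)                           # edge strength
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0     # unsigned orientation in [0, 180)
    H, W = gray.shape
    hists = []
    for i in range(0, H - cell + 1, cell):
        for j in range(0, W - cell + 1, cell):
            a = ang[i:i + cell, j:j + cell].ravel()
            m = mag[i:i + cell, j:j + cell].ravel()
            hist, _ = np.histogram(a, bins=num_bins, range=(0.0, 180.0), weights=m)
            hists.append(hist)
    return np.array(hists)                           # one 9-bin histogram per 8x8 cell
```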


Example: Bag of Words

1. Resize each patch to a fixed size (e.g. 32x32 pixels)
2. Extract HOG on the patch (get 144 numbers)

Repeat for each detected feature; this gives a matrix of size
[number_of_features x 144]

Problem: different images will have different numbers of
features. We need fixed-sized vectors for linear classification.


Solution: learn k-means centroids over the 144-d HOG descriptors to build a
vocabulary of visual words (e.g. 1000 centroids). Each image is then encoded
as a histogram of visual words: a fixed 1000-d vector.
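A minimal sketch of the encoding step, assuming a [number_of_features x 144] matrix of HOG descriptors per image; scikit-learn's KMeans and the name descriptors_per_image are illustrative choices, not prescribed by the lecture:

```python
import numpy as np
from sklearn.cluster import KMeans

# 1) learn a vocabulary of visual words from descriptors pooled over many images
#    (descriptors_per_image is an assumed list of (n_i, 144) arrays)
all_descriptors = np.vstack(descriptors_per_image)
kmeans = KMeans(n_clusters=1000, n_init=10).fit(all_descriptors)

# 2) encode each image as a fixed-size histogram of visual words
def bag_of_words(descriptors, kmeans):
    words = kmeans.predict(descriptors)                    # nearest centroid per descriptor
    hist = np.bincount(words, minlength=kmeans.n_clusters)
    return hist / max(hist.sum(), 1)                       # normalized 1000-d vector
```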




Most recognition systems are built on the same architecture

CNNs: end-to-end models
(slide from Yann LeCun)



Visualizing the (SVM) loss function



The full data loss:
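A standard way to write the full multiclass SVM data loss (assuming margin 1, N training examples, and classifier rows $w_j$):

```latex
L = \frac{1}{N} \sum_{i=1}^{N} \sum_{j \neq y_i} \max\!\left(0,\; w_j^T x_i - w_{y_i}^T x_i + 1\right)
```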


Suppose there are 3 examples with 3 classes (classes 0, 1, 2 in sequence);
then this becomes:
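Expanding the loss above for this case (assuming examples $x_0, x_1, x_2$ with labels 0, 1, 2):

```latex
L = \tfrac{1}{3}\big[ \max(0,\, w_1^T x_0 - w_0^T x_0 + 1) + \max(0,\, w_2^T x_0 - w_0^T x_0 + 1) \\
\qquad\;\; + \max(0,\, w_0^T x_1 - w_1^T x_1 + 1) + \max(0,\, w_2^T x_1 - w_1^T x_1 + 1) \\
\qquad\;\; + \max(0,\, w_0^T x_2 - w_2^T x_2 + 1) + \max(0,\, w_1^T x_2 - w_2^T x_2 + 1) \big]
```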



Question: CIFAR-10 has 50,000 training images (5,000 per class)
and 10 labels. How many times does one classifier row appear in
the full data loss?


Optimization


Strategy #1: A first, very bad idea: random search
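A minimal sketch of the idea (not the lecture's exact code), assuming CIFAR-10 with the bias trick (so W is 10 x 3073), training data X_train, Y_train, and a loss function L(X, y, W); all of these names are assumptions:

```python
import numpy as np

bestloss = float("inf")                       # track the best loss seen so far
bestW = None
for num in range(1000):
    W = np.random.randn(10, 3073) * 0.0001    # generate small random parameters
    loss = L(X_train, Y_train, W)             # assumed loss function over the training set
    if loss < bestloss:
        bestloss = loss
        bestW = W
    print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))
```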



What's up with the 0.0001?


Let's see how well this works on the test set...


Fun aside:
When W = 0, what is the CIFAR-10 loss for SVM and Softmax?
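A quick worked check (assuming 10 classes, margin 1, and no regularization): with W = 0 every score is 0, so

```latex
L_{\text{SVM}} = \frac{1}{N}\sum_i \sum_{j \neq y_i} \max(0,\, 0 - 0 + 1) = 9,
\qquad
L_{\text{Softmax}} = \frac{1}{N}\sum_i -\log\frac{e^{0}}{\sum_{j=1}^{10} e^{0}} = \log 10 \approx 2.3
```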


Strategy #2: A better but still very bad idea: random local search
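A sketch of the idea (same assumed L, X_train, Y_train as in the random-search sketch): start from a random W and only accept random perturbations that lower the loss.

```python
import numpy as np

W = np.random.randn(10, 3073) * 0.001                    # random starting point
bestloss = L(X_train, Y_train, W)
step_size = 0.0001
for i in range(1000):
    W_try = W + np.random.randn(10, 3073) * step_size    # small random perturbation
    loss_try = L(X_train, Y_train, W_try)
    if loss_try < bestloss:                              # keep the step only if it helps
        W = W_try
        bestloss = loss_try
```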


gives 21.4%!


Strategy #3: Following the gradient

In 1 dimension, the derivative of a function:

df(x)/dx = lim_{h -> 0} [f(x + h) - f(x)] / h

In multiple dimensions, the gradient is the vector of partial derivatives.


Evaluating the gradient numerically


finite difference approximation:
[f(x + h) - f(x)] / h, with a small h
Fei-Fei Li & Andrej Karpathy

Lecture 4 - 41

7 Jan 2015

in practice: the centered difference formula,
[f(x + h) - f(x - h)] / 2h, tends to be more accurate
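A sketch of a numerical gradient evaluator using the centered difference formula; the function name and the choice h = 1e-5 are illustrative:

```python
import numpy as np

def eval_numerical_gradient(f, x, h=1e-5):
    """Numerically estimate the gradient of f at x (x is a float NumPy array)."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        fxph = f(x)                           # f(x + h)
        x[ix] = old - h
        fxmh = f(x)                           # f(x - h)
        x[ix] = old                           # restore the original value
        grad[ix] = (fxph - fxmh) / (2 * h)    # centered difference along this dimension
        it.iternext()
    return grad
```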



Performing a parameter update


(figure: starting from the original W and stepping in the negative gradient direction)


The problems with the numerical gradient:
- approximate
- very slow to evaluate



We need something better...

Calculus



In summary:
- Numerical gradient: approximate, slow, easy to write
- Analytic gradient: exact, fast, error-prone

=> In practice: always use the analytic gradient, but check your
implementation with the numerical gradient. This is called a
gradient check.

Gradient check: Words of caution
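One common way to run such a check is to compare the analytic and numerical gradients with a relative error; a sketch, where eval_analytic_gradient is an assumed placeholder for your own gradient code:

```python
import numpy as np

analytic_grad = eval_analytic_gradient(f, W)   # assumed: gradient from your own derivation/backprop
numeric_grad = eval_numerical_gradient(f, W)   # slow centered-difference estimate (sketch above)

# elementwise relative error, then take the worst case over all parameters
rel_error = np.abs(analytic_grad - numeric_grad) / \
            np.maximum(1e-8, np.abs(analytic_grad) + np.abs(numeric_grad))
print('max relative error: %e' % rel_error.max())   # very roughly, ~1e-4 or less usually passes
```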


Gradient Descent


Mini-batch Gradient Descent

- only use a small portion of the training set to compute the gradient
- common mini-batch sizes are ~100 examples
  (e.g. Krizhevsky's ILSVRC ConvNet used 256 examples)
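A sketch of the training loop; the helper names sample_training_data and evaluate_gradient are illustrative, not a specific API:

```python
# Vanilla mini-batch gradient descent (sketch)
while True:
    data_batch = sample_training_data(data, batch_size=256)          # sample a mini-batch
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)  # gradient on the batch only
    weights += -step_size * weights_grad                             # parameter update
```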


Stochastic Gradient Descent (SGD)

- use a single example at a time
  (also sometimes called on-line Gradient Descent)


Summary
- Always use mini-batch gradient descent
- Incorrectly refer to it as "doing SGD", like everyone else
  (or call it batch gradient descent)
- The mini-batch size is a hyperparameter, but it is not very
  common to cross-validate over it (it is usually set based on
  practical concerns, e.g. space/time efficiency)


Fun question: Suppose you were training with a mini-batch size of 100,
and now you switch to a mini-batch of size 1. Your learning rate (step
size) should:
- increase
- decrease
- stay the same
- become zero



The dynamics of Gradient Descent

- the data loss pulls some weights up and some down
- the regularization term always pulls the weights down


Momentum Update

(figure: the update combines the previous momentum vector with the current gradient)
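A sketch of the momentum update, where mu is the momentum hyperparameter (e.g. 0.9) and v is a velocity vector, initialized to zeros, with the same shape as the weights:

```python
v = mu * v - step_size * weights_grad   # accumulate a velocity that smooths the gradient direction
weights += v                            # move along the velocity instead of the raw gradient
```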


Many other ways to perform optimization

- Second-order methods that use the Hessian (or its
  approximation): BFGS, L-BFGS, etc.
- Currently, the lesson from the trenches is that well-tuned
  SGD+Momentum is very hard to beat for CNNs.


Summary
- We looked at image features, and saw that CNNs can be thought of
  as learning the features in an end-to-end manner
- We explored intuition about what the loss surfaces of linear
  classifiers look like
- We introduced gradient descent as a way of optimizing loss
  functions, as well as batch gradient descent and SGD
- Numerical gradient: slow :(, approximate :(, easy to write :)
- Analytic gradient: fast :), exact :), error-prone :(
- In practice: gradient check (but be careful)

Next class:
Becoming a
backprop ninja
