
Deep Learning
NIPS 2015 Tutorial
Geoff Hinton, Yoshua Bengio & Yann LeCun

Breakthrough
Deep Learning: machine learning algorithms based on learning multiple levels of representation / abstraction.
Amazing improvements in error rate in object recognition, object detection, speech recognition, and more recently, in natural language processing / understanding

Machine Learning,
AI & No Free Lunch
Four key ingredients for ML towards AI
1. Lots & lots of data
2. Very flexible models
3. Enough computing power
4. Powerful priors that can defeat the curse of
dimensionality
3

Bypassing the curse of dimensionality
We need to build compositionality into our ML models
Just as human languages exploit compositionality to give representations and meanings to complex ideas
Exploiting compositionality gives an exponential gain in representational power
(1) Distributed representations / embeddings: feature learning
(2) Deep architecture: multiple levels of feature learning
Additional prior: compositionality is useful to describe the world around us efficiently
4

Classical Symbolic AI vs
Learning Distributed Representations
Two symbols are equally far from each other
Concepts are not represented by symbols in our brain, but by patterns of activation
(Connectionism, 1980s)
Geoffrey Hinton, David Rumelhart
[Figure: network with input units, hidden units, and output units for person, cat, dog]
5

Exponential advantage of distributed representations
Learning a set of parametric features that are not mutually exclusive can be exponentially more statistically efficient than having nearest-neighbor-like or clustering-like models

Hidden Units Discover Semantically Meaningful Concepts
Zhou et al & Torralba, arXiv 1412.6856, submitted to ICLR 2015
Network trained to recognize places, not objects
[Figures from the paper: interpretation of a picture by different layers of the Places-CNN as labeled by AMT workers (units responding to people, lighting, tables, animals, seating, buildings); segmentations from pool5 units, where many object classes are encoded by several units (e.g. fireplace, bed, wardrobe, billiard table, mountain, sofa, building, washing machine), each with a Jaccard segmentation index J and average precision AP; histograms of object counts in SUN, of the most informative objects for scene recognition, and of CNN units per discovered object class. 115 units in pool5 of Places-CNN do not detect objects, suggesting incomplete learning or a complementary texture-based or part-based representation.]

Each feature can be discovered without the need for seeing the exponentially large number of configurations of the other features
Consider a network whose hidden units discover the following features:
Person wears glasses
Person is female
Person is a child
Etc.
If each of n features requires O(k) parameters, need O(nk) examples
Non-parametric methods would require O(n^d) examples
8

Exponential advantage of distributed representations
Bengio 2009 (Learning Deep Architectures for AI, F & T in ML)
Montufar & Morton 2014 (When does a mixture of products contain a product of mixtures? SIAM J. Discr. Math)
Longer discussion and relations to the notion of priors: Deep Learning, to appear, MIT Press.
Prop. 2 of Pascanu, Montufar & Bengio ICLR 2014: the number of pieces distinguished by a 1-hidden-layer rectifier net with n units and d inputs (i.e. O(nd) parameters) is Σ_{j=0..d} C(n, j)

Deep Learning:
Automating Feature Discovery
10
[Figure (I. Goodfellow): how features are obtained in different approaches
Rule-based systems: Input → Hand-designed program → Output
Classic machine learning: Input → Hand-designed features → Mapping from features → Output
Representation learning: Input → Features → Mapping from features → Output
Deep learning: Input → Simplest features → Most complex features → Mapping from features → Output]

Exponential advantage of depth
Theoretical arguments:
2 layers of [logic gates, formal neurons, RBF units] = universal approximator
RBMs & auto-encoders = universal approximator
Theorems on advantage of depth:
(Håstad et al 86 & 91, Bengio et al 2007, Bengio & Delalleau 2011, Martens et al 2013, Pascanu et al 2014, Montufar et al NIPS 2014)
Some functions compactly represented with k layers may require exponential size with 2 layers
[Figure: a deep circuit over inputs 1..n vs. an equivalent 2-layer circuit of size 2^n]

Why does it work? No Free Lunch
It only works because we are making some assumptions about the data generating distribution
Worst-case distributions still require exponential data
But the world has structure and we can get an exponential gain by exploiting some of it

12

Exponential advantage of depth
Expressiveness of deep networks with piecewise linear activation functions: exponential advantage for depth (Montufar et al, NIPS 2014)
The number of pieces distinguished by a network with depth L and n_i units per layer is at least
  (Π_{i=1..L-1} ⌊n_i/n_0⌋^{n_0}) · Σ_{j=0..n_0} C(n_L, j)
or, if hidden layers have width n and the input has size n_0, it grows as Ω((n/n_0)^{(L-1)n_0} · n^{n_0})
13

Y LeCun

Backprop
(modular approach)

Typical Multilayer Neural Net Architecture
Y LeCun
Complex learning machines can be built by assembling modules into networks
Linear Module: Out = W·In + B
ReLU Module (Rectified Linear Unit): Out_i = 0 if In_i < 0, Out_i = In_i otherwise
Cost Module: Squared Distance: C = ||In1 − In2||²
Objective Function: L(Θ) = 1/p Σ_k C(X_k, Y_k, Θ), with Θ = (W1, B1, W2, B2, W3, B3)
[Figure: X (input) → W1,B1 Linear → ReLU → W2,B2 Linear → ReLU → W3,B3 Linear → Squared Distance ← Y (desired output), producing C(X,Y,Θ)]
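A minimal NumPy sketch of the modules listed above (Linear, ReLU, squared-distance cost) and of the objective L(Θ); the class and argument names mirror the slide and are otherwise assumptions, not the tutorial's own code.

```python
import numpy as np

class Linear:
    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_out, n_in) * 0.01
        self.B = np.zeros(n_out)
    def forward(self, x):            # Out = W.In + B
        return self.W @ x + self.B

class ReLU:
    def forward(self, x):            # Out_i = 0 if In_i < 0, In_i otherwise
        return np.maximum(0.0, x)

def squared_distance(in1, in2):      # C = ||In1 - In2||^2
    return np.sum((in1 - in2) ** 2)

def objective(layers, X, Y):         # L(theta) = 1/p sum_k C(X_k, Y_k, theta)
    total = 0.0
    for x, y in zip(X, Y):
        h = x
        for layer in layers:
            h = layer.forward(h)
        total += squared_distance(h, y)
    return total / len(X)
```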

Building a Network by Assembling Modules
Y LeCun
All major deep learning frameworks use modules (inspired by SN/Lush, 1991)
Torch7, Theano, TensorFlow, ...
[Figure: X (input) → W1,B1 Linear → ReLU → W2,B2 Linear → LogSoftMax → NegativeLogLikelihood ← Y (label), producing C(X,Y,Θ)]

Computing Gradients by Back-Propagation
Y LeCun
A practical application of the chain rule
Backprop for the state gradients:
  dC/dX_{i-1} = dC/dX_i · dX_i/dX_{i-1}
  dC/dX_{i-1} = dC/dX_i · dF_i(X_{i-1}, W_i)/dX_{i-1}
Backprop for the weight gradients:
  dC/dW_i = dC/dX_i · dX_i/dW_i
  dC/dW_i = dC/dX_i · dF_i(X_{i-1}, W_i)/dW_i
[Figure: stack of modules F_1(X_0, W_1) ... F_i(X_{i-1}, W_i) ... F_n(X_{n-1}, W_n) → Cost, with gradients dC/dX_i and dC/dW_i flowing backwards from the cost; inputs X (input) and Y (desired output)]
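A hedged sketch of the chain rule above as a loop over modules. It assumes each module exposes forward() and a backward() that returns both dC/dX_{i-1} and dC/dW_i; this interface is an illustration, not any specific framework's API.

```python
def backprop(modules, x, y, cost):
    # Forward pass: store the intermediate states X_0 ... X_n
    states = [x]
    for m in modules:
        states.append(m.forward(states[-1]))
    C = cost.forward(states[-1], y)

    # Backward pass: dC/dX_{i-1} = dC/dX_i . dF_i(X_{i-1}, W_i)/dX_{i-1}
    #                dC/dW_i     = dC/dX_i . dF_i(X_{i-1}, W_i)/dW_i
    dC_dX = cost.backward(states[-1], y)
    grads = []
    for m, x_prev in zip(reversed(modules), reversed(states[:-1])):
        dC_dX, dC_dW = m.backward(x_prev, dC_dX)
        grads.append(dC_dW)
    return C, list(reversed(grads))
```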

Running Backprop
Y LeCun
Torch7 example
gradTheta contains the gradient
[Figure: the same X (input) → W1,B1 Linear → ReLU → W2,B2 Linear → LogSoftMax → NegativeLogLikelihood ← Y (label) network, with the corresponding Torch7 code]

Module Classes
Y LeCun
Linear: Y = W·X ; dC/dX = Wᵀ·dC/dY ; dC/dW = dC/dY·Xᵀ
ReLU: y = ReLU(x) ; if (x<0) dC/dx = 0 else dC/dx = dC/dy
Duplicate: Y1 = X, Y2 = X ; dC/dX = dC/dY1 + dC/dY2
Add: Y = X1 + X2 ; dC/dX1 = dC/dY ; dC/dX2 = dC/dY
Max: y = max(x1, x2) ; if (x1 > x2) dC/dx1 = dC/dy else dC/dx1 = 0
LogSoftMax: Yi = Xi − log[Σj exp(Xj)] ; ...

Module Classes
Y LeCun
Many more basic module classes
Cost functions:
  Squared error
  Hinge loss
  Ranking loss
Non-linearities and operators
  ReLU, leaky ReLU, abs, ...
  Tanh, logistic
  Just about any simple function (log, exp, add, mul, ...)
Specialized modules
  Multiple convolutions (1D, 2D, 3D)
  Pooling/subsampling: max, average, Lp, log(sum(exp())), maxout
  Long Short-Term Memory, attention, 3-way multiplicative interactions
  Switches
  Normalizations: batch norm, contrast norm, feature norm...
  Inception

Any Architecture works
Y LeCun
Any connection graph is permissible
  Directed acyclic graphs (DAG)
  Networks with loops must be unfolded in time.
Any module is permissible
  As long as it is continuous and differentiable almost everywhere with respect to the parameters, and with respect to non-terminal inputs.
Most frameworks provide automatic differentiation
  Theano, Torch7+autograd, ...
  Programs are turned into computation DAGs and automatically differentiated.

Backprop in Practice
Y LeCun
Use ReLU non-linearities
Use cross-entropy loss for classification
Use Stochastic Gradient Descent on minibatches
Shuffle the training samples (very important)
Normalize the input variables (zero mean, unit variance)
Schedule to decrease the learning rate
Use a bit of L1 or L2 regularization on the weights (or a combination)
  But it's best to turn it on after a couple of epochs
Use dropout for regularization
Lots more in [LeCun et al. "Efficient Backprop" 1998]
Lots, lots more in "Neural Networks, Tricks of the Trade" (2012 edition), edited by G. Montavon, G. B. Orr, and K.-R. Müller (Springer)
More recent: Deep Learning (MIT Press book in preparation)
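A minimal sketch of the recipe above as a training loop (input normalization, shuffling, minibatch SGD, learning-rate schedule, L2 turned on after a couple of epochs). The `model` object with forward_backward/params/grads is an assumed stand-in, not a real API.

```python
import numpy as np

def train(model, X, Y, epochs=20, batch=128, lr0=0.1, l2=1e-4):
    # Normalize the input variables (zero mean, unit variance)
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    n = len(X)
    for epoch in range(epochs):
        lr = lr0 / (1.0 + 0.1 * epoch)              # schedule to decrease the learning rate
        idx = np.random.permutation(n)              # shuffle the training samples
        for start in range(0, n, batch):
            b = idx[start:start + batch]
            model.forward_backward(X[b], Y[b])      # cross-entropy loss + backprop
            for p, g in zip(model.params, model.grads):
                if epoch >= 2:                      # turn L2 on after a couple of epochs
                    g = g + l2 * p
                p -= lr * g                         # SGD update on the minibatch
    return model
```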

Y LeCun

Convolutional
Networks

Deep Learning = Training Multistage Machines
Y LeCun
Traditional Pattern Recognition: fixed/handcrafted feature extractor
  Feature Extractor → Trainable Classifier
Mainstream Pattern Recognition (until recently)
  Feature Extractor → Mid-Level Features → Trainable Classifier
Deep Learning: multiple stages/layers trained end to end
  Low-Level Features → Mid-Level Features → High-Level Features → Trainable Classifier

Overall Architecture: multiple stages of
Normalization → Filter Bank → Non-Linearity → Pooling
Y LeCun
[Figure: Norm → Filter Bank → Non-Linear → Feature Pooling → Norm → Filter Bank → Non-Linear → Feature Pooling → Classifier]
Normalization: variation on whitening (optional)
  Subtractive: average removal, high pass filtering
  Divisive: local contrast normalization, variance normalization
Filter Bank: dimension expansion, projection on overcomplete basis
Non-Linearity: sparsification, saturation, lateral inhibition, ...
  Rectification (ReLU), component-wise shrinkage, tanh, ...
Pooling: aggregation over space or feature type
  Max, Lp norm, log prob.

ConvNet Architecture

Y LeCun

Filter Bank +non-linearity


Pooling
Filter Bank +non-linearity
Pooling
Filter Bank +non-linearity

"

LeNet1 [LeCun et al. NIPS 1989]

Multiple Convolutions

Y LeCun

Animation: Andrej Karpathy http://cs231n.github.io/convolutional-networks/

Convolutional Networks (vintage 1990)

Y LeCun

" filters tanh average-tanh filters tanh average-tanh filters tanh

Example: 1D (Temporal) convolutional net


" 1D (Temporal) ConvNet, aka Time-Delay Neural Nets
" Groups of units are replicated at each time step.
" Replicas have identical (shared) weights.

Y LeCun

LeNet5

Y LeCun

" Simple ConvNet


" for MNIST
" [LeCun 1998]

input
1@32x32

Layer 1
6@28x28

Layer 2
6@14x14

Layer 3
12@10x10

Layer 4
12@5x5

Layer 5
100@1x1
Layer 6: 10
10

5x5
convolution

2x2
pooling/
subsampling

5x5
convolution

5x5
convolution

2x2
pooling/
subsampling

Applying a ConvNet with a Sliding Window


" Every layer is a convolution
" Sometimes called fully convolutional nets
" There is no such thing as a fully connected layer

Y LeCun

Sliding Window ConvNet + Weighted FSM (Fixed Post-Proc)

Y LeCun

[Matan, Burges, LeCun, Denker NIPS 1991] [LeCun, Bottou, Bengio, Haffner, Proc IEEE 1998]

Sliding Window ConvNet + Weighted FSM

Y LeCun

Why Multiple Layers? The World is Compositional
Y LeCun
Hierarchy of representations with increasing level of abstraction
Each stage is a kind of trainable feature transform
Image recognition: Pixel → edge → texton → motif → part → object
Text: Character → word → word group → clause → sentence → story
Speech: Sample → spectral band → sound → ... → phone → phoneme → word
[Figure: Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier]

Yes, ConvNets are somewhat inspired by the Visual Cortex

Y LeCun

" The ventral (recognition) pathway in the visual cortex has multiple stages
" Retina - LGN - V1 - V2 - V4 - PIT - AIT ....

[picture from Simon Thorpe]

[Gallant & Van Essen]

What are ConvNets Good For
Y LeCun
Signals that come to you in the form of (multidimensional) arrays
Signals that have strong local correlations
Signals where features can appear anywhere
Signals in which objects are invariant to translations and distortions
1D ConvNets: sequential signals, text
  Text classification
  Musical genre recognition
  Acoustic modeling for speech recognition
  Time-series prediction
2D ConvNets: images, time-frequency representations (speech and audio)
  Object detection, localization, recognition
3D ConvNets: video, volumetric images, tomography images
  Video recognition / understanding
  Biomedical image analysis
  Hyperspectral image analysis

Y LeCun

Recurrent Neural Networks

37

Recurrent Neural Networks
Selectively summarize an input sequence in a fixed-size state vector via a recursive update
[Figure: recurrent function F applied to (x, s), unfolded in time: s_t → s_{t+1} driven by x_t, x_{t+1}]
38
Recurrent Neural Networks
Can produce an output at each time step: unfolding the graph tells us how to back-prop through time.
[Figure: recurrent graph with input x, state s, output o and weight matrices U (input-to-state), W (state-to-state), V (state-to-output), unfolded over time steps t, t+1, ...]
39

Generative RNNs
An RNN can represent a fully-connected directed generative model: every variable predicted from all previous ones.
[Figure: unfolded RNN with a loss L_t at each step; the output o_t parameterizes the prediction of the next input x_{t+1}]
40

Maximum Likelihood = Teacher Forcing
During training, the past y fed as input comes from the training data
At generation time, the past y fed as input is generated by the model
Mismatch can cause compounding error
[Figure: state h_t computed from x_t; y_t sampled from P(y_t | h_t); (x_t, y_t) is the next input/output training pair]
41

Increasing the Expressive Power of RNNs with more Depth
(ICLR 2014: How to Construct Deep Recurrent Neural Networks)
Ordinary RNNs
+ deep in-to-hid, + deep hid-to-hid, + deep hid-to-out
+ stacking
+ skip connections for creating shorter paths
42

Long-Term Dependencies
The RNN gradient is a product of Jacobian matrices, each associated with a step in the forward computation. To store information robustly in a finite-dimensional state, the dynamics must be contractive [Bengio et al 1994].
Storing bits robustly requires singular values < 1
Problems:
  singular values of Jacobians > 1 → gradients explode (handled by gradient clipping)
  or singular values < 1 → gradients shrink & vanish (Hochreiter 1991)
  or random → variance grows exponentially
43

Gradient Norm Clipping


(Mikolov thesis 2012;
Pascanu, Mikolov, Bengio, ICML 2013)
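A short sketch of gradient norm clipping as in the citation above: if the global gradient norm exceeds a threshold, rescale the gradient to that norm. The list-of-arrays representation is an assumption.

```python
import numpy as np

def clip_gradient_norm(grads, threshold=5.0):
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))   # global norm over all parameters
    if norm > threshold:
        scale = threshold / norm
        grads = [g * scale for g in grads]                # rescale, keeping the direction
    return grads
```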

44

RNN Tricks
(Pascanu, Mikolov, Bengio, ICML 2013; Bengio, Boulanger & Pascanu, ICASSP 2013)
Clipping gradients (avoid exploding gradients)
Leaky integration (propagate long-term dependencies)
Momentum (cheap 2nd order)
Initialization (start in the right ballpark, avoids exploding/vanishing)
Sparse gradients (symmetry breaking)
Gradient propagation regularizer (avoid vanishing gradient)
LSTM self-loops (avoid vanishing gradient)
45

Gated Recurrent Units & LSTM
Create a path where gradients can flow for longer, with a self-loop
Corresponds to an eigenvalue of the Jacobian slightly less than 1
LSTM is heavily used (Hochreiter & Schmidhuber 1997)
GRU is a lighter-weight version (Cho et al 2014)
46
[Figure: LSTM cell with a self-loop on the state; input, input gate, forget gate, output gate, output]
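A hedged sketch of a gated unit (here a GRU cell in the spirit of Cho et al 2014): the update gate controls the self-loop that lets gradients flow for longer. Parameter names and shapes are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, p):
    """One step of a GRU; p holds the weight matrices Wz, Uz, Wr, Ur, W, U."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)             # update gate (self-loop strength)
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)             # reset gate
    h_cand = np.tanh(p["W"] @ x + p["U"] @ (r * h))    # candidate state
    return (1.0 - z) * h + z * h_cand                  # leaky self-loop on the state
```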

RNN Tricks
Delays and multiple time scales (El Hihi & Bengio NIPS 1996)
[Figure: RNN with recurrent connections at different delays (W1 over one step, W3 over three steps), unfolded in time]
47

Backprop in Practice
Other tricks: see the Deep Learning book (in preparation, online)

48

The Convergence of Gradient Descent
Y LeCun
Batch Gradient
There is an optimal learning rate
Equal to the inverse of the 2nd derivative

Let's Look at a Single Linear Unit
Y LeCun
Single unit, 2 inputs
Quadratic loss: E(W) = 1/p Σ_p (Y^p − W·X^p)²
Dataset, classification: Y = −1 for blue, +1 for red
Hessian is the covariance matrix of the input vectors: H = 1/p Σ_p X^p X^pᵀ
To avoid ill conditioning: normalize the inputs
  Zero mean
  Unit variance for all variables
[Figure: single unit with weights W0, W1, W2 and inputs X1, X2]

Convergence is Slow When Hessian has Different Eigenvalues
Y LeCun
Batch Gradient, small learning rate
Batch Gradient, large learning rate

Convergence is Slow When Hessian has Different Eigenvalues
Y LeCun
Batch Gradient, small learning rate
Stochastic Gradient: much faster
  But fluctuates near the minimum

Multilayer Nets Have Non-Convex Objective Functions
Y LeCun
1-1-1 network
  Y = W1*W2*X
  trained to compute the identity function with quadratic loss
  Single sample X=1, Y=1: L(W) = (1 − W1*W2)^2
  Solutions: W2 = 1/W1, a hyperbola.
[Figure: network X → W1 → Z → W2 → Y; loss surface with two solution branches separated by a saddle point]

Deep Nets with ReLUs and Max Pooling
Y LeCun
Stack of linear transforms interspersed with Max operators
Point-wise ReLUs
Max Pooling: "switches" from one layer to the next
Input-output function:
  Sum over active paths
  Product of all weights along the path
  Solutions are hyperbolas
Objective function is full of saddle points
[Figure: an active path through units 3 → 14 → 22 → 31 with weights W14,3, W22,14, W31,22 acting on Z3]

A Myth Has Been Debunked: Local Minima in Neural Nets
Convexity is not needed
(Pascanu, Dauphin, Ganguli, Bengio, arXiv May 2014): On the saddle point problem for non-convex optimization
(Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio, NIPS 2014): Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
(Choromanska, Henaff, Mathieu, Ben Arous & LeCun, AISTATS 2015): The Loss Surface of Multilayer Nets

55

Saddle Points
Local minima dominate in low-D, but saddle points dominate in high-D
Most local minima are close to the bottom (global minimum error)

56

Saddle Points During Training


Oscillating between two behaviors:
Slowly approaching a saddle point
Escaping it

57

Low Index Critical Points


Choromanska et al & LeCun 2014, The Loss Surface of Mul'layer Nets
Shows that deep rec4er nets are analogous to spherical spin-glass models
The low-index cri4cal points of large models concentrate in a band just
above the global minimum

58

Piecewise Linear Nonlinearity
Jarrett, Kavukcuoglu, Ranzato & LeCun ICCV 2009: absolute value rectification works better than tanh in the lower layers of a convnet
Nair & Hinton ICML 2010: duplicating sigmoid units with the same weights but different biases in an RBM approximates a rectified linear unit (ReLU)
Glorot, Bordes and Bengio AISTATS 2011: using a rectifier nonlinearity (ReLU) instead of tanh or softplus allows for the first time to train very deep supervised networks without the need for unsupervised pre-training; was biologically motivated
Krizhevsky, Sutskever & Hinton NIPS 2012: rectifiers one of the crucial ingredients in the ImageNet breakthrough
[Figure: softplus f(x) = log(1 + exp(x)) vs. rectifier f(x) = max(0, x); neuroscience motivation: leaky integrate-and-fire model]

Stochastic Neurons as Regularizer:
Improving neural networks by preventing co-adaptation of feature detectors (Hinton et al 2012, arXiv)
Dropout trick: during training multiply neuron output by a random bit (p=0.5), during test by 0.5
Used in deep supervised networks
Similar to denoising auto-encoder, but corrupting every layer
Works better with some non-linearities (rectifiers, maxout) (Goodfellow et al. ICML 2013)
Equivalent to averaging over exponentially many architectures
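A minimal sketch of the dropout trick described above: multiply activations by a random bit (p=0.5) during training, and by 0.5 at test time; not tied to any particular framework.

```python
import numpy as np

def dropout(h, p=0.5, train=True):
    if train:
        mask = (np.random.rand(*h.shape) < p)   # keep each unit with probability p
        return h * mask
    return h * p                                # test time: scale by the keep probability
```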

60

Used by Krizhevsky et al to break through ImageNet SOTA
Also improves SOTA on CIFAR-10 (18% → 16% err)
Knowledge-free MNIST with DBMs (0.95% → 0.79% err)
TIMIT phoneme classification (22.7% → 19.7% err)

Dropout Regularizer: Super-Efficient Bagging
61

Batch Normalization
(Ioffe & Szegedy ICML 2015)
Standardize activations (before the nonlinearity) across the minibatch:
  x̄_k = (1/m) Σ_{i=1..m} x_{i,k}
  σ_k² = (1/m) Σ_{i=1..m} (x_{i,k} − x̄_k)²
  BN(x_k) = γ_k (x_k − x̄_k) / sqrt(σ_k² + ε) + β_k
where m is the size of the mini-batch and ε is a small positive constant to improve numerical stability. Standardizing the intermediate activations on its own reduces the representational power of the layer, so batch normalization introduces additional learnable parameters γ_k and β_k, which respectively scale and shift the data; by setting γ_k to σ_k and β_k to x̄_k, the network can recover the original representation.
Backprop through this operation
Regularizes & helps to train
62
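A minimal sketch of the BN transform above, using only the training-time minibatch statistics (the running averages needed at test time are omitted); γ, β, ε are as in the text.

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """X: minibatch of pre-activations, shape (m, features)."""
    mean = X.mean(axis=0)                        # x_bar_k over the minibatch
    var = X.var(axis=0)                          # sigma_k^2 over the minibatch
    X_hat = (X - mean) / np.sqrt(var + eps)      # standardize each feature
    return gamma * X_hat + beta                  # learnable scale and shift
```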

Early Stopping
Beautiful FREE LUNCH (no need to launch many different training runs for each value of the number-of-iterations hyper-parameter)
Monitor validation error during training (after visiting a number of training examples equal to a multiple of the validation set size)
Keep track of the parameters with the best validation error and report them at the end
If the error does not improve enough (with some patience), stop.
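A sketch of the early-stopping loop above; `train_some`, `validation_error` and `copy_params` are assumed stand-ins for one round of training, an evaluation pass, and a parameter snapshot.

```python
def early_stopping(model, patience=10):
    best_err, best_params, waited = float("inf"), None, 0
    while waited < patience:
        train_some(model)                     # visit ~ a multiple of the validation set size
        err = validation_error(model)         # monitor validation error during training
        if err < best_err:
            best_err = err                    # keep track of the best parameters
            best_params = model.copy_params()
            waited = 0
        else:
            waited += 1                       # no improvement: lose some patience
    return best_params, best_err              # report the best, not the last, parameters
```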

63

Random Sampling of Hyperparameters
(Bergstra & Bengio 2012)
Common approach: manual + grid search
Grid search over hyperparameters: simple & wasteful
Random search: simple & efficient
  Independently sample each HP, e.g. l.rate ~ exp(U[log(.1), log(.0001)])
  Each training trial is iid
  If an HP is irrelevant, grid search is wasteful
  More convenient: ok to early-stop, continue further, etc.
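A small sketch of random search: sample each hyper-parameter independently (the learning rate log-uniformly in [1e-4, 1e-1] as on the slide); `run_trial` and the other hyper-parameter ranges are illustrative assumptions.

```python
import numpy as np

def random_search(run_trial, n_trials=50):
    results = []
    for _ in range(n_trials):
        hp = {
            "lr": float(np.exp(np.random.uniform(np.log(1e-4), np.log(1e-1)))),
            "n_hidden": int(np.random.choice([256, 512, 1024])),
            "dropout": float(np.random.uniform(0.0, 0.5)),
        }
        results.append((run_trial(hp), hp))          # each trial is iid; can be early-stopped
    return min(results, key=lambda r: r[0])          # best validation error and its config
```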

64

Sequential Model-Based Optimization of Hyper-Parameters
(Hutter et al JAIR 2009; Bergstra et al NIPS 2011; Thornton et al arXiv 2012; Snoek et al NIPS 2012)
Iterate:
  Estimate P(valid. err | hyper-params config x, D)
  Choose an optimistic x, e.g. max_x P(valid. err < current min. err | x)
  Train with config x, observe valid. err. v, D ← D ∪ {(x, v)}
65

Distributed Training
Minibatches
Large minibatches + 2nd order & natural gradient methods
Asynchronous SGD (Bengio et al 2003, Le et al ICML 2012, Dean et al NIPS 2012)
Data parallelism vs model parallelism
Bottleneck: sharing weights/updates among nodes, to keep node models from moving too far from each other
EASGD (Zhang et al NIPS 2015) works well in practice
Efficiently exploiting more than a few GPUs remains a challenge

66

Vision
(switch laptops)

67

Speech Recognition

68

The dramatic impact of Deep Learning on Speech Recognition
(according to Microsoft)
[Figure: word error rate on Switchboard vs. year (1990-2010), log scale from 1% to 100%, with the recent sharp drop annotated "Using DL"]
69

Speech Recognition with Convolutional Nets (NYU/IBM)
Y LeCun
Multilingual recognizer
Multiscale input
Large context window

Speech Recognition with Convolutional Nets (NYU/IBM)
Y LeCun
Acoustic Model: ConvNet with 7 layers. 54.4 million parameters.
Classifies the acoustic signal into 3000 context-dependent subphone categories
ReLU units + dropout for last layers
Trained on GPU. 4 days of training

Speech Recognition with Convolutional Nets (NYU/IBM)
Y LeCun
Training samples:
  40 MEL-frequency Cepstral Coefficients
  Window: 40 frames, 10ms each

Speech Recognition with Convolutional Nets (NYU/IBM)
Y LeCun
Convolution kernels at Layer 1:
  64 kernels of size 9x9

End-to-End Training with Search
74
Hybrid systems, neural nets + HMMs (Bengio 1991, Bottou 1991)
Neural net outputs scores for each arc, recognized output = labels along the best path; trained discriminatively (LeCun et al 1998)
Connectionist Temporal Classification (Graves 2006)
DeepSpeech and attention-based end-to-end RNNs (Hannun et al 2014; Graves & Jaitly 2014; Chorowski et al NIPS 2015)
[Figure: graph composition ("match & add") of a recognition graph (arcs scored per character, e.g. "c" 0.4, "a" 0.2, "u" 0.8, "p" 0.2, "t" 0.8) with a grammar graph, yielding an interpretation graph with interpretations: cut (2.0), cap (0.8), cat (1.4)]

Natural Language
Representations

75

Neural Language Models: fighting one exponential by another one!
(Bengio et al NIPS 2000)
Exponentially large set of generalizations: semantically close sequences
Exponentially large set of possible contexts
[Figure: i-th output = P(w(t) = i | context); a softmax output layer (most computation here) on top of a tanh hidden layer; each context word w(t−n+1), ..., w(t−2), w(t−1) is mapped to its embedding C(w) by table lookup in a matrix C whose parameters are shared across words]
76
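A minimal sketch of the neural language model above: a shared embedding matrix C, a tanh hidden layer, and a softmax over the vocabulary; the sizes and argument names are illustrative assumptions.

```python
import numpy as np

def nlm_predict(context_ids, C, W_h, b_h, W_out, b_out):
    """P(w(t) | w(t-n+1), ..., w(t-1)) for one context of word indices."""
    x = np.concatenate([C[i] for i in context_ids])   # table lookup in C (shared across words)
    h = np.tanh(W_h @ x + b_h)                        # hidden layer (most computation here)
    scores = W_out @ h + b_out
    scores -= scores.max()                            # numerically stable softmax
    p = np.exp(scores)
    return p / p.sum()                                # i-th output = P(w(t) = i | context)
```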

Neural word embeddings: visualization


directions = Learned Attributes

77

Analogical Representations for Free
(Mikolov et al, ICLR 2013)
Semantic relations appear as linear relationships in the space of learned representations
King − Queen ≈ Man − Woman
Paris − France + Italy ≈ Rome
[Figure: the embedding offsets France→Paris and Italy→Rome are nearly parallel]
78

Handling Large Output Spaces
Sampling negative examples: increase the score of the correct word and stochastically decrease all the others
  Uniform sampling (Collobert & Weston, ICML 2008)
  Importance sampling (Bengio & Senecal AISTATS 2003; Dauphin et al ICML 2011); GPU-friendly implementation (Jean et al ACL 2015)
Decompose output probabilities hierarchically (Morin & Bengio 2005; Blitzer et al 2005; Mnih & Hinton 2007, 2009; Mikolov et al 2011)
[Figure: a tree over the vocabulary: first predict the category, then the word within each category]
79

Encoder-Decoder Framework
Intermediate representation of meaning = "universal representation"
Encoder: from word sequence to sentence representation
Decoder: from representation to word sequence distribution
[Figure: for bitext data, a French encoder feeds an English decoder (French sentence → English sentence); for unilingual data, an English encoder feeds an English decoder (English sentence → English sentence)]
(Cho et al EMNLP 2014; Sutskever et al NIPS 2014)
80

Attention Mechanism for Deep Learning
Consider an input (or intermediate) sequence or image
Consider an upper-level representation, which can choose "where to look", by assigning a weight or probability to each input position, as produced by an MLP applied at each position
Soft attention (backprop) vs stochastic hard attention (RL)
[Figure: a higher-level layer attends over lower-level positions; softmax over lower locations, conditioned on the context at lower and higher locations]
81
(Bahdanau, Cho & Bengio, arXiv Sept. 2014), following up on (Graves 2013) and (Larochelle & Hinton NIPS 2010)

End-to-End Machine Translation with Recurrent Nets and Attention Mechanism
(Bahdanau et al 2014, Jean et al 2014, Gulcehre et al 2015, Jean et al 2015)
How far can we go with a very large target vocabulary?
Reached the state-of-the-art in one year, from scratch
[Tables: BLEU scores on (a) English→French (WMT-14), (b) English→German (WMT-15), (c) English→Czech (WMT-15); neural MT with large-vocabulary tricks (+candidate list, +UNK replacement, +ensemble) reaches 36.71 BLEU on WMT-14 En→Fr, competitive with the best phrase-based and syntax-based SMT systems and with Google's system]
82

IWSLT 2015 Luong & Manning (2015)


TED talk MT, English-German
BLEU (CASED)

HTER (HE SET)

35

30

28.18

30.85

30
26.18

25

26.02

25

-26%

21.84

24.96
22.51
20.08

20

20

22.67

23.42

16.16

15
15
10

10

5
0
83

Stanford

Karlsruhe Edinburgh Heidelberg

PJAIT

Baseline

Stanford

Edinburgh Karlsruhe Heidelberg

PJAIT

Image-to-Text: Caption Generation with Attention
(Xu et al, ICML 2015)
Following many papers on caption generation, including (Kiros et al 2014; Mao et al 2014; Vinyals et al 2014; Donahue et al 2014; Karpathy & Li 2014; Fang et al 2014)
[Figure: a convolutional neural network produces annotation vectors h_j; an attention mechanism computes weights a_j (with Σ_j a_j = 1) over them, conditioned on the recurrent state, to form the context z_i used to sample each word, e.g. f = (a, man, is, jumping, into, a, lake, .)]
84
(Xu et al, 2015), (Yao et al, 2015)

Paying Attention to Selected Parts of the Image While Uttering Words
85

The Good

86

And the Bad

87

But How can Neural Nets Remember Things?
Y LeCun
Recurrent networks cannot remember things for very long
  The cortex only remembers things for 20 seconds
We need a "hippocampus" (a separate memory module)
  LSTM [Hochreiter 1997], registers
  Memory networks [Weston et al. 2014] (FAIR), associative memory
  NTM [Graves et al. 2014], "tape".
[Figure: recurrent net with an attention mechanism reading from and writing to a memory]

Memory Networks Enable REASONING


" Add a short-term memory to a network

Y
LeCun

http://arxiv.org/abs/1410.3916

Results on
Question Answering
Task

(Weston, Chopra,
Bordes 2014)

End-to-End Memory Network


" [Sukhbaatar, Szlam, Weston, Fergus NIPS 2015, ArXiv:1503.08895]
" Weakly-supervised MemNN: no need to tell which memory location to use.

Y
LeCun

Stack-Augmented RNN: learning algorithmic sequences


" [Joulin & Mikolov, ArXiv:1503.01007]

Y
LeCun

Sparse Access Memory for Long-Term Dependencies
A mental state stored in an external memory can stay for arbitrarily long durations, until evoked for read or write
Forgetting = vanishing gradient
Memory = larger state, reducing the need for forgetting/vanishing
[Figure: passive copy / access]
92

How do humans generalize from very few examples?
They transfer knowledge from previous learning:
  Representations
  Explanatory factors
Previous learning from: unlabeled data + labels for other tasks
Prior: shared underlying explanatory factors, in particular between P(x) and P(Y|x)
93

Unsupervised and Transfer Learning Challenge + Transfer Learning Challenge: Won by Unsupervised Deep Learning
ICML 2011 workshop on Unsup. & Transfer Learning
NIPS 2011 Transfer Learning Challenge; paper: ICML 2012
[Figure: results with raw data vs. 1, 2, 3, and 4 layers of unsupervised feature learning]

Multi-Task Learning
Generalizing better to new tasks (tens of thousands!) is crucial to approach AI
Example: speech recognition, sharing across multiple languages
Deep architectures learn good intermediate representations that can be shared across tasks (Collobert & Weston ICML 2008, Bengio et al AISTATS 2011)
Good representations that disentangle underlying factors of variation make sense for many tasks because each task concerns a subset of the factors
E.g. a dictionary, with intermediate concepts re-used across many definitions
Prior: shared underlying explanatory factors between tasks
[Figure: tasks A, B, C with outputs y1, y2, y3 computed from shared intermediate factors learned from the raw input x]
95

Google Image Search
Joint Embedding: different object types represented in the same space
Google: S. Bengio, J. Weston & N. Usunier
(IJCAI 2011, NIPS 2010, JMLR 2010, ML J 2010)
WSABIE objective function

Combining Multiple Sources of Evidence with Shared Representations
Traditional ML: data = matrix
Relational learning: multiple sources, different tuples of variables
Share representations of the same types across data sources
Shared learned representations help propagate information among data sources: e.g., WordNet, XWN, Wikipedia, FreeBase, ImageNet (Bordes et al AISTATS 2012, ML J. 2013)
FACTS = DATA
Deduction = Generalization
[Figure: relations such as P(person, url, event) and P(url, words, history) share the representations of their common variable types (person, url, event, words, history)]
97

Multi-Task / Multimodal Learning with Different Inputs for Different Tasks
E.g. speaker adaptation, multimodal input
Unsupervised multimodal case: (Srivastava & Salakhutdinov NIPS 2012)
[Figure: inputs X1, X2, X3 with input-specific encoders h1, h2, h3 feeding a shared representation that predicts Y, via a selection switch]
98

Maps Between Representations
x and y represent different modalities, e.g., image, text, sound
Encoders map each into its representation space: hx = fx(x), hy = fy(y)
Can provide 0-shot generalization to new categories (values of y)
[Figure: x-space and y-space with (x, y) pairs in the training set, the x-representation (encoder) function fx, the y-representation (encoder) function fy, relationships between embedded points within one of the domains, and maps between representation spaces; a test pair (xtest, ytest) is related through these maps]
99

Unsupervised Representation
Learning

100

Why Unsupervised Learning?
Recent progress mostly in supervised DL
Real challenges for unsupervised DL
Potential benefits:
  Exploit tons of unlabeled data
  Answer new questions about the variables observed
  Regularizer, transfer learning, domain adaptation
  Easier optimization (divide and conquer)
  Joint (structured) outputs
101

Why Latent Factors & Unsupervised


Representation Learning? Because of
Causality.

On causal and an'causal learning, (Janzing et al ICML 2012)

If Ys of interest are among the causal factors of X, then

P (X|Y )P (Y )
P (Y |X) =
P (X)
is 4ed to P(X) and P(X|Y), and P(X) is dened in terms of P(X|Y), i.e.
The best possible model of X (unsupervised learning) MUST
involve Y as a latent factor, implicitly or explicitly.
Representa4on learning SEEKS the latent variables H that explain
the varia4ons of X, making it likely to also uncover Y.

102

If Y is a Cause of X, Semi-Supervised Learning Works
Just observing the x-density reveals the causes y (cluster ID)
After learning p(x) as a mixture, a single labeled example per class suffices to learn p(y|x)
[Figure: mixture model p(x) over x with three well-separated components labeled y=1, y=2, y=3]
103

Invariance & Disentangling


Underlying Factors

Invariant features
Which invariances?
Alterna4ve: learning to disentangle factors, i.e.
keep all the explanatory factors in the
representa4on
Good disentangling
avoid the curse of dimensionality
Emerges from representa4on learning
(Goodfellow et al. 2009, Glorot et al. 2011)

104

Boltzmann Machines / Undirected Graphical Models
Boltzmann machines: (Hinton 84)
Iterative sampling scheme = stochastic relaxation, Monte-Carlo Markov chain
Training requires sampling: might take a lot of time to converge if there are well-separated modes

Restricted Boltzmann Machine (RBM)
(Smolensky 1986, Hinton et al 2006)
A building block (single-layer) for deep architectures
Bipartite undirected graphical model
[Figure: block Gibbs sampling alternates h ~ P(h|x) over the hidden units and x ~ P(x|h) over the observed units]

Capturing the Shape of the Distribution: Positive & Negative Samples
Boltzmann machines, undirected graphical models, RBMs, energy-based models:
  Pr(x) = e^{-Energy(x)} / Z
Observed (+) examples push the energy down
Generated / dream / fantasy (-) samples / particles push the energy up
[Figure: energy function over x, pushed down at positive examples X+ and up at negative samples X-]

Eight Strategies to Shape the Energy Function

Yann LeCun

" 1. build the machine so that the volume of low energy stuff is constant
" PCA, K-means, GMM, square ICA
" 2. push down of the energy of data points, push up everywhere else
" Max likelihood (needs tractable partition function)
" 3. push down of the energy of data points, push up on chosen locations
" contrastive divergence, Ratio Matching, Noise Contrastive Estimation,
Minimum Probability Flow
" 4. minimize the gradient and maximize the curvature around data points
" score matching
" 5. train a dynamical system so that the dynamics goes to the manifold
" denoising auto-encoder, diffusion inversion (nonequilibrium dynamics)
" 6. use a regularizer that limits the volume of space that has low energy
" Sparse coding, sparse auto-encoder, PSD
" 7. if E(Y) = ||Y - G(Y)||^2, make G(Y) as "constant" as possible.
" Contracting auto-encoder, saturating auto-encoder
" 8. Adversarial training: generator tries to fool real/synthetic classifier.

Auto-Encoders
Ancestral sampling / directed models: Helmholtz machine, VAE, etc. (Hinton et al 1995)
Iterative sampling / undirected models: RBM, denoising auto-encoder
Probabilistic reconstruction criterion:
  Reconstruction log-likelihood = − log P(x | h)
[Figure: input x → encoder f → code h (with Q(h|x)) → decoder g → reconstruction r (with P(x|h)); prior P(h) on the code]
109
Denoising auto-encoder: during training, the input is corrupted stochastically, and the auto-encoder must learn to guess the distribution of the missing information.
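A minimal sketch of a denoising auto-encoder objective for binary inputs: corrupt the input with masking noise, encode and decode, and reconstruct the clean input under a cross-entropy criterion; tied weights and sigmoid units are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_loss(x, W, b_h, b_r, noise=0.3):
    x_tilde = x * (np.random.rand(*x.shape) > noise)   # stochastic corruption (masking noise)
    h = sigmoid(W @ x_tilde + b_h)                     # encoder f
    r = sigmoid(W.T @ h + b_r)                         # decoder g (tied weights)
    # Reconstruction negative log-likelihood of the *clean* input: -log P(x | h)
    return -np.sum(x * np.log(r + 1e-8) + (1 - x) * np.log(1 - r + 1e-8))
```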

Predictive Sparse Decomposition (PSD)
Yann LeCun
[Kavukcuoglu, Ranzato, LeCun, rejected by every conference, 2008-2009]
Train a simple feed-forward function to predict the result of a complex optimization on the data points of interest
[Figure: generative model: Decoder (Factor A) reconstructs the INPUT from the LATENT VARIABLE, compared by a Distance; fast feed-forward model: Encoder (Factor A') predicts the latent variable, compared by a Distance]
1. Find the optimal Zi for all Yi; 2. Train the Encoder to predict Zi from Yi
Energy = reconstruction_error + code_prediction_error + code_sparsity

Probabilistic interpretation of auto-encoders
Manifold & probabilistic interpretations of auto-encoders
Denoising Score Matching as inductive principle (Vincent 2011)
Estimating the gradient of the energy function (Alain & Bengio ICLR 2013)
Sampling via Markov chain (Bengio et al NIPS 2013; Sohl-Dickstein et al ICML 2015)
Variational auto-encoders (Kingma & Welling ICLR 2014) (Gregor et al arXiv 2015)
111

Denoising Auto-Encoder
Learns a vector field pointing towards higher probability directions (Alain & Bengio 2013):
  reconstruction(x) − x ∝ ∂ log p(x) / ∂x
Prior: examples concentrate near a lower-dimensional manifold
Some DAEs correspond to a kind of Gaussian RBM with regularized Score Matching (Vincent 2011) [equivalent when noise → 0]
[Figure: corrupted inputs are mapped back towards the data manifold]

Regularized Auto-Encoders Learn a


Vector Field that Estimates a
Gradient Field (Alain & Bengio ICLR 2013)

113

Denoising Auto-Encoder Markov Chain
[Figure: X_t → corrupt → X̃_t → denoise → X_{t+1} → corrupt → X̃_{t+1} → ...]
The corrupt-encode-decode-sample Markov chain associated with a DAE samples from a consistent estimator of the data generating distribution
114

Preference for Locally Constant Features
Denoising or contractive auto-encoder on 1-D input:
  r(x) − x ∝ ∂E(x)/∂x
  E[||r(x + σz) − x||²] ≈ E[||r(x) − x||²] + σ² E[||∂r(x)/∂x||²_F]
[Figure: energy E(x) with minima at training examples x1, x2, x3, where the reconstruction is locally constant]
115

Helmholtz Machines (Hinton et al 1995) and


Variational Auto-Encoders (VAEs)

Encoder = inference
Parametric approximate inference: Q(h1|x), Q(h2|h1), Q(h3|h2)
Decoder = generator: P(h3), P(h2|h3), P(h1|h2), P(x|h1)
Successors of the Helmholtz machine (Hinton et al 95)
(Kingma & Welling 2013, ICLR 2014) (Gregor et al ICML 2014; Rezende et al ICML 2014) (Mnih & Gregor ICML 2014; Kingma et al, NIPS 2014)
Maximize the variational lower bound on the log-likelihood:
  min KL(Q(x, h) || P(x, h)), where Q(x) = data distribution
or equivalently
  max Σ_x Σ_h Q(h|x) log [P(x, h) / Q(h|x)] = max Σ_x [ Σ_h Q(h|x) log P(x|h) − KL(Q(h|x) || P(h)) ]
116
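A hedged sketch of the variational bound above for a diagonal-Gaussian Q(h|x) and prior P(h) = N(0, I), using the reparameterization trick; the Gaussian reconstruction term and the `encode`/`decode` callables are assumptions.

```python
import numpy as np

def vae_loss(x, encode, decode):
    mu, log_var = encode(x)                                   # Q(h|x) = N(mu, diag(exp(log_var)))
    h = mu + np.exp(0.5 * log_var) * np.random.randn(*mu.shape)  # reparameterized sample
    x_rec = decode(h)                                         # P(x|h)
    rec = np.sum((x - x_rec) ** 2)                            # -log P(x|h) up to constants (Gaussian)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)  # KL(Q(h|x) || N(0, I))
    return rec + kl                                           # negative of the variational bound
```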

Geometric Interpretation
Encoder: map input to a new space where the data has a simpler distribution
Add noise between encoder output and decoder input: train the decoder to be robust to the mismatch between encoder output and prior output.
[Figure: contractive encoder f(x) maps the data to Q(h|x), matched to the prior P(h); decoder g maps back]
117

DRAW: Sequential Variational Auto-Encoder with Attention
(Gregor et al of Google DeepMind, arXiv 1502.04623, 2015: "DRAW: A Recurrent Neural Network For Image Generation")
Even for a static input, the encoder and decoder are now recurrent nets, which gradually add elements to the answer, and use an attention mechanism to choose where to do so.
[Figure: a conventional variational auto-encoder (encoder FNN producing Q(z|x), decoder FNN producing P(x|z)) vs. the DRAW architecture, where recurrent encoder and decoder interact through read and write operations on a canvas c_t over time, with latent samples z_t ~ Q(z_t | x, z_{1:t-1}) and generation P(x | z_{1:T}); the paper also shows successive stages of a trained DRAW network generating MNIST digits]
118

DRAW Samples of SVHN Images: generated samples vs. training nearest neighbor
[Figure: grid of generated SVHN digit images; the nearest training example is shown for the last column of samples]
119

GAN: Generative Adversarial Networks
Goodfellow et al NIPS 2014
Adversarial nets framework
[Figure: a random vector feeds the Generator Network to produce a fake image; a random index selects a real image from the training set; the Discriminator Network is trained to tell real from fake]
120
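A sketch of the adversarial game above (Goodfellow et al 2014): the discriminator learns to separate real from generated samples, and the generator learns to fool it. `G`, `D` and their `forward`/`update` methods are assumed stand-ins, not a real library API.

```python
import numpy as np

def gan_step(G, D, real_batch, z_dim=100):
    z = np.random.randn(len(real_batch), z_dim)
    fake_batch = G.forward(z)                                  # generator maps noise to images

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    d_loss = -(np.log(D.forward(real_batch) + 1e-8).mean()
               + np.log(1.0 - D.forward(fake_batch) + 1e-8).mean())
    D.update(d_loss)

    # Generator step: minimize log(1 - D(G(z))) (in practice: maximize log D(G(z)))
    g_loss = -np.log(D.forward(G.forward(z)) + 1e-8).mean()
    G.update(g_loss)
    return d_loss, g_loss
```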

LAPGAN: Laplacian Pyramid of Generative Adversarial Networks
(Denton + Chintala, et al 2015)
http://soumith.ch/eyescream/
[Figure: a Laplacian pyramid of GANs generates an image coarse-to-fine]
121

LAPGAN: Visual Turing Test
(Denton + Chintala, et al 2015)
LAPGAN results: 40% of samples mistaken by humans for real photos
Sharper images than maximum-likelihood proxies (which minimize KL(data || model)):
  GAN objective = compromise between KL(data || model) and KL(model || data)
122

Convolutional GANs
(Radford et al, arXiv 1511.06434)
Strided convolutions, batch normalization, only convolutional layers, ReLU and leaky ReLU
[Figure 2 of the paper: generated bedrooms after one training pass through the dataset. Theoretically, the model could learn to memorize training examples, but this is experimentally unlikely with a small learning rate and minibatch SGD; the authors are aware of no prior empirical evidence of memorization with SGD and a small learning rate in only one epoch.]
123

Space-Filling in Representation-Space
Deeper representations → abstractions → disentangling (Bengio et al ICML 2013)
Manifolds are expanded and flattened
[Figure: X-space vs. H-space; the manifolds of 9s and 3s; linear interpolation in pixel space vs. at layer 1 vs. at layer 2]

GAN: Interpolating in Latent Space
If the model is good (unfolds the manifold), interpolating between latent values yields plausible images.
[Figure 4 of the DCGAN paper: interpolation between a series of 9 random points in Z yields smoothly varying bedroom images]
125

Supervised and Unsupervised in One Learning Rule?
Boltzmann Machines have all the right properties [Hinton 1831] [OK, OK 1983 ;-]
  Sup & unsup, generative & discriminative in one simple/local learning rule
  Feedback circuit reconstructs and propagates virtual hidden targets
  But they don't really work (or at least they don't scale).
Problem: the feedforward path eliminates information
  If the feedforward path is invariant, then the reconstruction path is a one-to-many mapping
  Usual solution: sampling. But I'm allergic.
[Figure: many-to-one feedforward paths extract "what"; one-to-many reconstruction paths use the predicted "what", with costs comparing input and reconstruction]

Deep Semi-Supervised Learning
Unlike unsupervised pre-training, modern approaches optimize jointly the supervised and unsupervised objective
Discriminative RBMs (Larochelle & Bengio, ICML 2008)
Semi-Supervised VAE (Kingma et al, NIPS 2014)
Ladder Network (Rasmus et al, NIPS 2015)
127

Semi-supervised Learning with Ladder Network
(Rasmus et al, NIPS 2015)
Jointly trained stack of denoising auto-encoders with gated lateral connections and a semi-supervised objective
Semi-supervised objective: C = − log P(ỹ = t(n) | x) + Σ_{l=1..L} λ_l || z(l) − ẑ(l)_BN ||²
They also use Batch Normalization
1% error on PI-MNIST with 100 labeled examples (Pezeshki et al arXiv 1511.06430)
[Figure 2 of the paper: the Ladder network for L = 2; a corrupted encoder path (x̃ → z̃(1) → z̃(2) → ỹ) shares the mappings f(l) with the clean encoder used to produce the denoising targets, and a decoder of denoising functions g(l) with lateral connections produces the reconstructions ẑ(l); the paper's algorithm listing (batch-normalized encoder and decoder passes and the cost C) also appears on the slide]
128

Stacked What-Where Auto-Encoder (SWWAE)
[Zhao, Mathieu, LeCun arXiv:1506.02351]
Yann LeCun
A bit like a ConvNet paired with a DeConvNet
[Figure: the input goes through a ConvNet (many-to-one "what" paths with pooling "where" switches) to a predicted output compared to the desired output by a loss; a DeConvNet (one-to-many) uses the "where" information to produce a reconstruction of the input]

Conclusions & Challenges

130

Learning How the World Ticks
So long as our machine learning models "cheat" by relying only on surface statistical regularities, they remain vulnerable to out-of-distribution examples
Humans generalize better than other animals by implicitly having a more accurate internal model of the underlying causal relationships
This allows one to predict future situations (e.g., the effect of planned actions) that are far from anything seen before, an essential component of reasoning, intelligence and science
131

Learning Multiple Levels of Abstraction
The big payoff of deep learning is to allow learning higher levels of abstraction
Higher-level abstractions disentangle the factors of variation, which allows much easier generalization and transfer
132

Challenges & Open Problems


A More ScienHc Approach is Needed, not Just Building Beber Systems

Unsupervised learning
How to evaluate?
Long-term dependencies
Natural language understanding & reasoning
More robust op4miza4on (or easier to train architectures)
Distributed training (that scales) & specialized hardware
Bridging the gap to biology
Deep reinforcement learning

133
