The Unreasonable Effectiveness of Deep Learning
Yann LeCun
Facebook AI Research &
Center for Data Science, NYU
yann@cs.nyu.edu
http://yann.lecun.com
55 years of hand-crafted features
Perceptron
Architecture of “Classical” Recognition Systems
It's deep if it has more than one stage of non-linear feature transformation
Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]
Trainable Feature Hierarchy
Deep Learning addresses the problem of learning hierarchical representations with a single algorithm, or perhaps with a few algorithms.
[Diagram: a hierarchy of trainable feature transforms]
The Mammalian Visual Cortex is Hierarchical
The ventral (recognition) pathway in the visual cortex has multiple stages
Retina - LGN - V1 - V2 - V4 - PIT - AIT ....
Lots of intermediate representations
“shallow & wide” vs “deep and narrow” == “more memory” vs “more time”
Look-up table vs algorithm
Few functions can be computed in two steps without an
exponentially large lookup table
Using more than 2 steps can reduce the “memory” by an
exponential factor.
What Are Good Features?
Basic Idea for Invariant Feature Learning
[Diagram: Input → non-linear function → high-dimensional, unstable/non-smooth features → pooling or aggregation → stable/invariant features]
Sparse Non-Linear Expansion → Pooling
[Diagram: sparse non-linear expansion (clustering, quantization, sparse coding) → pooling/aggregation]
Overall Architecture: multiple stages of
Normalization → Filter Bank → Non-Linearity → Pooling
ReLU(x) = max(x, 0)
Pooling: aggregation over space or feature type
– Max, Lp norm, log prob.
MAX: $\max_i(X_i)$;  $L_p$: $\sqrt[p]{\textstyle\sum_i X_i^p}$;  PROB: $\frac{1}{b}\log\big(\textstyle\sum_i e^{b X_i}\big)$
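A quick Lua/Torch sketch (made-up activation values X and constants p, b) of the three pooling operators:

require 'torch'
local X = torch.Tensor({0.2, 1.5, 0.7, 3.0})             -- activations within one pool
local p, b = 2, 4
local maxPool  = X:max()                                  -- MAX: max_i(X_i)
local lpPool   = torch.pow(X, p):sum() ^ (1/p)            -- Lp: p-th root of the sum of X_i^p
local probPool = (1/b) * math.log(torch.exp(X*b):sum())   -- PROB: (1/b) log sum_i e^{b X_i}
print(maxPool, lpPool, probPool)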
Deep Nets with ReLUs and Max Pooling
Max pooling acts as a set of "switches" from one layer to the next.
[Diagram: a stack of ReLU + max-pooling layers on input Z3 with weight matrices W14,3, W22,14, ...]
Supervised Training: Stochastic (Sub)Gradient Optimization
Loss Function for a simple network
1-1-1 network: Y = W1*W2*X
Trained to compute the identity function with quadratic loss
Single sample X=1, Y=1:  L(W) = (1 - W1*W2)^2
[Network diagram: X → W1 → W2 → Y]
[Plot of L(W1, W2): two valleys of solutions (W1*W2 = 1) and a saddle point at the origin]
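A tiny Lua sketch (values chosen by hand) evaluating this loss and its gradient, which shows the critical point at the origin:

-- L(W) = (1 - W1*W2)^2 for the 1-1-1 network
local function loss(w1, w2) return (1 - w1*w2)^2 end
local function grad(w1, w2)
  local r = 1 - w1*w2
  return -2*r*w2, -2*r*w1          -- dL/dW1, dL/dW2
end
print(loss(0, 0), grad(0, 0))      -- L = 1, gradient = (0, 0): the saddle point
print(loss(2, 0.5))                -- L = 0: any W1*W2 = 1 is a global minimum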
Deep Nets with ReLUs
Single output:
$\hat{Y} = \sum_P \delta_P(W, X)\Big(\prod_{(ij)\in P} W_{ij}\Big) X_{P_{\mathrm{start}}}$
$\delta_P(W, X)$: 1 if path P is "active", 0 if inactive
Input-output function is piecewise linear
Polynomial in W with random coefficients
Deep Convolutional Nets (and other deep neural nets)
$L(W) = \sum_k \sum_P (X_k^{P_{\mathrm{start}}} Y_k)\, \delta_P(W, X_k) \Big(\prod_{(ij)\in P} W_{ij}\Big)$
$L(W) = \sum_P \Big[\sum_k (X_k^{P_{\mathrm{start}}} Y_k)\, \delta_P(W, X_k)\Big] \Big(\prod_{(ij)\in P} W_{ij}\Big)$
$L(W) = \sum_P C_P(X, Y, W) \Big(\prod_{(ij)\in P} W_{ij}\Big)$
Piecewise polynomial in W with random coefficients
A lot is known about the distribution of critical points of polynomials on the sphere with random (Gaussian) coefficients [Ben Arous et al.]
High-order spherical spin glasses
Random matrix theory
[Plot: histogram of minima as a function of L(W)]
Deep Nets with ReLUs: Objective Function is Piecewise Polynomial
[Plots: critical points of the objective; zoomed view]
Convolutional Networks
Convolutional Network
[Diagram: alternating stages of multiple convolutions ("simple cells") and pooling/subsampling ("complex cells"); retinotopic feature maps]
Training is supervised, with stochastic gradient descent
[LeCun et al. 89], [LeCun et al. 98]
Convolutional Network (ConvNet)
[Architecture: input 83x83 → 9x9 convolution (64 kernels) → Layer 1: 64@75x75 → 10x10 pooling, 5x5 subsampling → Layer 2: 64@14x14 → 9x9 convolution (4096 kernels) → Layer 3: 256@6x6 → 6x6 pooling, 4x4 subsampling → Layer 4: 256@1x1 → Output: 101]
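A Torch7 sketch of the pictured architecture, with sizes read off the diagram; the tanh non-linearities and max pooling are assumptions, and full connectivity replaces the partial connection table implied by "(4096 kernels)":

require 'nn'
local net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 64, 9, 9))    -- input 3@83x83 -> 64@75x75
net:add(nn.Tanh())
net:add(nn.SpatialMaxPooling(10, 10, 5, 5))    -- 10x10 pooling, 5x5 subsampling -> 64@14x14
net:add(nn.SpatialConvolution(64, 256, 9, 9))  -- 9x9 convolution -> 256@6x6
net:add(nn.Tanh())
net:add(nn.SpatialMaxPooling(6, 6, 4, 4))      -- 6x6 pooling, 4x4 subsampling -> 256@1x1
net:add(nn.Reshape(256))
net:add(nn.Linear(256, 101))                   -- output: 101 categories
net:add(nn.LogSoftMax())
print(net:forward(torch.randn(3, 83, 83)):size())  -- 101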
[Illustration: successive layers map a curved manifold to a flatter manifold]
LeNet1 Demo from 1993
Energy-based learning: making every single module in the system trainable.
When the cardinality or dimension of Y is large, exhaustive search is impractical.
We need to use "smart" inference procedures: min-sum, Viterbi, min-cut, belief propagation, gradient descent...
Converting Energies to Probabilities
Energies are uncalibrated
– The energies of two separately-trained systems cannot be
combined
– The energies are uncalibrated (measured in arbitrary units)
How do we calibrate energies?
– We turn them into probabilities (positive numbers that sum to 1).
– Simplest way: Gibbs distribution
– Other ways can be reduced to Gibbs by a suitable redefinition of
the energy.
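The Gibbs formula itself did not survive extraction; the standard form, with β a positive inverse-temperature constant, is:
$P(Y \mid X, W) = \dfrac{e^{-\beta E(W, Y, X)}}{\int_y e^{-\beta E(W, y, X)}}$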
[Diagram: X (observed), Y (observed on the training set)]
Integrating Deep Learning and Structured Prediction
[Diagram: energy module E(Z, Y, X) with input X, output Y, latent variable Z]
Marginalization over latent variables:
Estimating this integral may require some approximations (sampling, variational methods, ...)
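The marginalization formula did not survive extraction; the standard free-energy form is:
$E(Y, X) = -\frac{1}{\beta}\log \int_z e^{-\beta E(z, Y, X)}$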
Example 1: Integrated Training with Sequence Alignment
[Diagram: input sequence (acoustic vectors) → trainable feature extractors → sequence of feature vectors → matched against object models (elastic templates); the warping is a latent variable, the category is the output; the resulting energies feed a switch and an LVQ2 loss]
Example: 1-D Constellation Model (a.k.a. Dynamic Time Warping)
[Diagram: a sequence of feature vectors aligned to object models (elastic templates) over a trellis; the warping/path is a latent variable; the energy is the alignment cost]
The Oldest Example of Structured Prediction
Variables that would make the task easier if they were known:
– Face recognition: the gender of the person, the orientation
of the face.
– Object recognition: the pose parameters of the object
(location, orientation, scale), the lighting conditions.
– Parts of Speech Tagging: the segmentation of the
sentence into syntactic units, the parse tree.
– Speech Recognition: the segmentation of the sentence into
phonemes or phones.
– Handwriting Recognition: the segmentation of the line into
characters.
– Object Recognition/Scene Parsing: the segmentation of
the image into components (objects, parts,...)
In general, we will search for the value of the latent variable that allows
us to get an answer (Y) of smallest energy.
Probabilistic Latent Variable Models
Training an EBM consists in shaping the energy function so that the energy of the correct answer is lower than the energies of all other answers.
– Training sample: X = image of an animal, Y = “animal”
Form of the loss functional:
– invariant under permutations and repetitions of the samples
[Annotated formula: average over the training set of a per-sample loss (desired answer; energy surface for a given Xi as Y varies), plus a regularizer]
Designing a Loss Functional
Energy Loss
– Simply pushes down on the energy of the correct answer
[Annotations on the examples: "WORKS!!" for some architectures, "COLLAPSES!!!" for others]
Negative Log-Likelihood Loss
Gibbs distribution:
Energy:
Inference:
Loss:
Learning Rule:
If Gw(X) is linear in W:
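The formulas on this slide did not survive extraction; for an energy E(W, Y, X) with the Gibbs distribution above, the standard forms are:
Loss (NLL): $\mathcal{L}(W) = \frac{1}{P}\sum_{i=1}^{P}\Big[E(W, Y^i, X^i) + \frac{1}{\beta}\log \int_y e^{-\beta E(W, y, X^i)}\Big]$
Learning rule (gradient): $\frac{\partial \mathcal{L}}{\partial W} = \frac{1}{P}\sum_{i}\Big[\frac{\partial E(W, Y^i, X^i)}{\partial W} - \int_y P(y \mid X^i, W)\,\frac{\partial E(W, y, X^i)}{\partial W}\Big]$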
A Better Loss Function: Generalized Margin Losses
Hinge Loss
– [Altun et al. 2003], [Taskar et al. 2003]
– With the linearly-parameterized binary classifier
architecture, we get linear SVMs
Log Loss
– “soft hinge” loss
– With the linearly-parameterized binary
classifier architecture, we get linear
Logistic Regression
Examples of Margin Losses: Square-Square Loss
Square-Square Loss
– [LeCun-Huang 2005]
– Appropriate for positive energy
functions
Learning Y = X^2
[Annotation: "NO COLLAPSE!!"]
Other Margin-Like Losses
Recognizing Words with Deep Learning and Structured Prediction
End-to-End Learning – Word-Level Discriminative Training
Outputs: Y1 Y2 Y3 Y4
Latent Vars: Z1 Z2 Z3
Input: X
Using Graphs instead of Vectors or Arrays.
Variables:
– X: input image
– Z: path in the
interpretation
graph/segmentation
– Y: sequence of labels on
a path
Loss function: computing the
energy of the desired answer:
Graph Transformer Networks
Variables:
– X: input image
– Z: path in the
interpretation
graph/segmentation
– Y: sequence of labels on
a path
Loss function: computing the contrastive term:
Graph Transformer Networks
Convolutional Networks in Visual Object Recognition
We knew ConvNets worked well with characters and small images
[Table row: false positives per image → 4.42, 26.9, 0.47, 3.36, 0.5, 1.28]
[Vaillant et al. IEE 1994][Osadchy et al. 2004] [Osadchy et al, JMLR 2007]
Simultaneous face detection and pose estimation
[Vaillant et al. IEE 1994][Osadchy et al. 2004] [Osadchy et al, JMLR 2007]
Simultaneous face detection and pose estimation
Visual Object Recognition with Convolutional Nets
[Caltech-101 examples: lotus, cellphone, ant, dollar, minaret, w. chair, joshua tree, metronome, cougar body, background]
Late 2000s: we could get decent results on object recognition
But we couldn't beat the state of the art because the datasets were too small
Caltech101: 101 categories, 30 samples per category.
But we learned that rectification and max pooling are useful! [Jarrett et al. ICCV 2009]
Fast Graphics Processing Units (GPUs), capable of 1 trillion operations/second
[ImageNet example classes: flute, strawberry, bathing cap, backpack, racket]
ImageNet Large-Scale Visual Recognition Challenge
Won the 2012 ImageNet LSVRC. 60 Million parameters, 832M MAC ops
[Partial layer listing: parameters / layer / feature maps / MAC ops]
4M     FULL CONNECT                        4M flop
       MAX POOLING
442K   CONV 3x3/ReLU   256 feature maps   74M
1.3M   CONV 3x3/ReLU   384 feature maps   224M
884K   CONV 3x3/ReLU   384 feature maps   149M
NYU ConvNet Trained on ImageNet: OverFeat
Apply convnet with a sliding window over the image at multiple scales
Important note: it's very cheap to slide a convnet over an image
Just compute the convolutions over the whole image and replicate the
fully-connected layers
Applying a ConvNet on Sliding Windows is Very Cheap!
[Diagram: a network trained on 96x96 windows, applied convolutionally to a 120x120 input, produces a 3x3 grid of outputs]
Apply convnet with a sliding window over the image at multiple scales
For each window, predict a class and bounding box parameters
Even if the object is not completely contained in the viewing window, the convnet can predict where it thinks the object is.
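A minimal Torch7 sketch (made-up layer sizes) of why this is cheap: with the fully-connected layers written as convolutions, the same network produces a single output on its training-size crop and a whole grid of outputs on a larger image:

require 'nn'
local net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 16, 9, 9))   -- 96 -> 88
net:add(nn.ReLU())
net:add(nn.SpatialMaxPooling(4, 4, 4, 4))     -- 88 -> 22
net:add(nn.SpatialConvolution(16, 32, 7, 7))  -- 22 -> 16
net:add(nn.ReLU())
net:add(nn.SpatialMaxPooling(4, 4, 4, 4))     -- 16 -> 4
net:add(nn.SpatialConvolution(32, 64, 4, 4))  -- "FC" layer as a 4x4 convolution: 4 -> 1
net:add(nn.ReLU())
net:add(nn.SpatialConvolution(64, 101, 1, 1)) -- "FC" classifier as a 1x1 convolution
print(net:forward(torch.randn(3,  96,  96)):size()) -- 101 x 1 x 1
print(net:forward(torch.randn(3, 120, 120)):size()) -- 101 x 2 x 2: a grid of class-score maps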
Classification + Localization:
sliding window + bounding box regression + bbox voting
Apply convnet with a sliding window over the image at multiple scales
For each window, predict a class and bounding box parameters
Compute an “average” bounding box, weighted by scores
Localization: Sliding Window + bbox vote + multiscale
Detection / Localization
Give a bounding box and a category for all objects in the image
MAP = mean average precision
Red: ConvNet, blue: no ConvNet
[Results table (partial): 2013 teams (mAP) — NUS 37.2, MSRA 35.1; off-cycle results (mAP); 2014 teams (mAP)]
Groundtruth is
sometimes ambiguous
or incomplete
Detection: Difficult Examples
Snake → Corkscrew
Detection: Bad Groundtruth
ConvNets as Generic Feature Extractors
Cats vs Dogs
● Kaggle competition: Dog vs Cats
Cats vs Dogs
● Won by Pierre Sermanet (NYU):
● ImageNet network (OverFeat) last layers retrained on cats and dogs
OverFeat Features ->Trained Classifier on other datasets
Siamese Architecture
joint work with Sumit Chopra: Hadsell et al. CVPR 06; Chopra et al. CVPR 05
[Diagram: two copies of G_W share weights W and are applied to X1 and X2; the energy E_W is the distance between G_W(X1) and G_W(X2)]
[Stereo matching figure: original image; winner-take-all output (14.55% error > 3 pix); comparison of SAD, Census, and ConvNet matching costs]
KITTI Stereo Leaderboard (Sept 2014)
Presentation at ECCV
workshop 2014/9/6
The list of perceptual tasks for which ConvNets hold the record is growing.
Most of these tasks (but not all) use purely supervised convnets.
Training samples.
40 MEL-frequency Cepstral Coefficients
Window: 40 frames, 10ms each
Speech Recognition with Convolutional Nets (NYU/IBM)
Convolutional Networks in Image Segmentation & Scene Labeling
ConvNets for Image Segmentation
[Pipeline: input image → multi-scale ConvNet → convolutional classifier → categories; superpixel boundaries → categories aligned with region boundaries]
Barcelona dataset [Tighe 2010]: 170 categories. No post-processing.
Frame-by-frame
ConvNet runs at 50ms/frame on Virtex-6 FPGA hardware
But communicating the features over ethernet limits system
performance
Temporal Consistency
[C. Cadena, J. Kosecka “Semantic Parsing for Priming Object Detection in RGB-D Scenes”
Semantic Perception Mapping and Exploration (SPME), Karlsruhe 2013]
Scene Parsing/Labeling on RGB+Depth Images
Temporal consistency
Self-Learning
ConvNet for Scene Labeling
Vision-Based Navigation
for Off-Road Robots
LAGR project: vision-based navigation for off-road robot
[Architecture diagram: YUV input 3@36x484 (image band 20-36 pixels tall, 36-500 pixels wide) → CONVOLUTIONS (7x6) → 20@30x484 → 20@30x125 → CONVOLUTIONS (6x5) → 100@25x121; classifier input window 3x12x25]
Scene Labeling with ConvNet + online learning
[Panels: input image, stereo labels, classifier output]
Software Tools and Hardware Acceleration for Convolutional Networks
Software Platform for Deep Learning: Torch7
Torch7
based on the LuaJIT language
Simple and lightweight dynamic language (widely used for games)
Multidimensional array library with CUDA and OpenMP backends
FAST: Has a native just-in-time compiler
Has an unbelievably nice foreign function interface to call C/C++
functions from Lua
Torch7 is an extension of Lua with
Multidimensional array engine
A machine learning library that implements multilayer nets,
convolutional nets, unsupervised pre-training, etc
Various libraries for data/image manipulation and computer vision
Used at Facebook AI Research, Google (DeepMind, Brain), Intel, and many academic groups and startups
Single-line installation on Ubuntu and Mac OS X:
http://torch.ch
Torch7 Cheat sheet (with links to libraries and tutorials):
– https://github.com/torch/torch7/wiki/Cheatsheet
Example: building a Neural Net in Torch7
Simple 2-layer neural network:
-- creating a 2-layer net
model = nn.Sequential()                    -- make a cascade module
model:add(nn.Reshape(ninputs))             -- reshape input to a vector
model:add(nn.Linear(ninputs, nhiddens))    -- add Linear module
model:add(nn.Tanh())                       -- add tanh module
model:add(nn.Linear(nhiddens, noutputs))   -- add Linear module
model:add(nn.LogSoftMax())                 -- add log-softmax layer
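A minimal usage sketch (made-up sizes and random data) completing the snippet with a loss criterion and one stochastic-gradient step:

require 'nn'
local ninputs, nhiddens, noutputs = 32*32, 500, 10
local model = nn.Sequential()
model:add(nn.Reshape(ninputs))
model:add(nn.Linear(ninputs, nhiddens))
model:add(nn.Tanh())
model:add(nn.Linear(nhiddens, noutputs))
model:add(nn.LogSoftMax())

local criterion = nn.ClassNLLCriterion()      -- negative log-likelihood on log-probabilities
local x, y = torch.randn(32, 32), 3           -- one fake sample (input, class index)
local out  = model:forward(x)
local loss = criterion:forward(out, y)
model:zeroGradParameters()
model:backward(x, criterion:backward(out, y))
model:updateParameters(0.01)                  -- SGD step with learning rate 0.01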
[Dataflow architecture: multiport memory + DMA controller [x12 on a Virtex-6 LX240T]; RISC CPU to reconfigure tiles and data paths at runtime; global network-on-chip to allow fast reconfiguration; grid of passive processing tiles (PTs) [x20 on a Virtex-6 LX240T]]
NewFlow: Processing Tile Architecture
Term-by-term streaming operators (MUL, DIV, ADD, SUB, MAX) [x8, 2 per tile]
Configurable piecewise linear or quadratic mapper [x4]
Configurable router, to stream data in and out of the tile, to neighbors or DMA ports [x20]
Configurable bank of FIFOs, for stream buffering, up to 10kB per PT [x8]
Full 1/2D parallel convolver with 100 MAC units [x4] [Virtex-6 LX240T]
NewFlow ASIC: 2.5x5 mm, 45nm, 0.6Watts, >300GOPS
[Comparison table: six platforms (column labels not recovered)]
Peak GOP/sec:       40     40     160    182    320    1350
Actual GOP/sec:     12     37     147    54     300    294
FPS:                14     46     182    67     364    374
Power (W):          50     10     10     30     0.6    220
Embed? (GOP/s/W):   0.24   3.7    14.7   1.8    490    1.34
Unsupervised Learning
Energy-Based Unsupervised Learning
Capturing Dependencies Between Variables
with an Energy Function
The energy surface is a “contrast function” that takes low values on the data
manifold, and higher values everywhere else
Special case: energy = negative log density
Example: the samples live on the manifold Y2 = (Y1)²
[Plot: energy surface over the (Y1, Y2) plane]
Transforming Energies into Probabilities (if necessary)
[Plots: the energy E(Y, W) as a function of Y, and the corresponding probability P(Y|W)]
Learning the Energy Function
1. build the machine so that the volume of low energy stuff is constant
PCA, K-means, GMM, square ICA
2. push down of the energy of data points, push up everywhere else
Max likelihood (needs tractable partition function)
3. push down of the energy of data points, push up on chosen locations
contrastive divergence, Ratio Matching, Noise Contrastive Estimation,
Minimum Probability Flow
4. minimize the gradient and maximize the curvature around data points
score matching
5. train a dynamical system so that the dynamics goes to the manifold
denoising auto-encoder
6. use a regularizer that limits the volume of space that has low energy
Sparse coding, sparse auto-encoder, PSD
7. if E(Y) = ||Y - G(Y)||^2, make G(Y) as "constant" as possible.
Contracting auto-encoder, saturating auto-encoder
#1: constant volume of low energy
Energy surface for PCA and K-means
1. build the machine so that the volume of low energy stuff is constant
PCA, K-means, GMM, square ICA...
PCA: $E(Y) = \|W^T W Y - Y\|^2$
K-Means (Z constrained to a 1-of-K code): $E(Y) = \min_Z \sum_i \|Y - W_i Z_i\|^2$
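A small Lua/Torch sketch (random W and Y; W neither trained nor orthonormalized) evaluating the two energies:

require 'torch'
local d, k = 10, 3
local W = torch.randn(k, d)                  -- k directions / prototypes (random here)
local Y = torch.randn(d)

-- PCA energy: E(Y) = ||W'W Y - Y||^2 (reconstruction error after projection)
local Epca = (W:t() * (W * Y) - Y):norm()^2

-- K-Means energy: E(Y) = min over the K prototypes (1-of-K code) of ||Y - W_z||^2
local Ekm = math.huge
for z = 1, k do Ekm = math.min(Ekm, (Y - W[z]):norm()^2) end
print(Epca, Ekm)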
#2: push down of the energy of data points,
push up everywhere else
Minimizing −log P(Y, W) on training samples
[Plots: the energy E(Y) as a function of Y is pushed down at the samples and pushed up elsewhere by gradient descent]
$E(Y, Z) = -Z^T W Y$
$E(Y) = -\log \sum_z e^{Z^T W Y}$
#6. use a regularizer that limits
the volume of space that has low energy
[Factor graph of a generative model: Factor A measures the distance between INPUT Y and the decoder's reconstruction from the LATENT VARIABLE Z; Factor B is a prior on Z]
Sparse Modeling: Sparse Coding + Dictionary Learning
$E(Y^i, Z) = \|Y^i - W_d Z\|^2 + \lambda \sum_j |z_j|$
[Factor graph: INPUT Y → reconstruction error with the deterministic function W_d Z; sparsity Σ_j |z_j| on the FEATURES (latent variable) Z]
Inference: $Y \rightarrow \hat{Z} = \mathrm{argmin}_Z\, E(Y, Z)$
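A minimal ISTA-style sketch in Lua/Torch (random dictionary, made-up step size and λ) of the inference step Ẑ = argmin_Z E(Y, Z):

require 'torch'
local d, k, lambda, eta = 16, 64, 0.1, 0.05
local Wd = torch.randn(d, k)                      -- dictionary (decoder matrix)
local Y  = torch.randn(d)
local Z  = torch.zeros(k)

local function shrink(v, t)                       -- soft-thresholding (shrinkage) operator
  return v:clone():abs():add(-t):clamp(0, math.huge):cmul(torch.sign(v))
end

for iter = 1, 100 do
  local grad = Wd:t() * (Wd * Z - Y)              -- gradient of the quadratic term
  Z = shrink(Z - grad * eta, lambda * eta)        -- gradient step followed by shrinkage
end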
#6. use a regularizer that limits
the volume of space that has low energy
[Factor graph with a fast feed-forward model added: Factor A' measures the distance between an encoder's prediction and the LATENT VARIABLE Z; Factor B is a prior on Z; INPUT Y]
Encoder-Decoder Architecture
[Kavukcuoglu, Ranzato, LeCun, rejected by every conference, 2008-2009]
Train a “simple” feed-forward function to predict the result of a complex
optimization on the data points of interest
[Factor graph: generative model (Factor A: decoder + distance; Factor B) and fast feed-forward model (Factor A': encoder + distance) linking INPUT Y and LATENT VARIABLE Z]
[Plot: energy surface with training samples]
Learning to Perform
Approximate Inference:
Predictive Sparse Decomposition
Sparse Auto-Encoders
Sparse auto-encoder: Predictive Sparse Decomposition (PSD)
[Kavukcuoglu, Ranzato, LeCun, 2008 → arXiv:1010.3467],
Predict the optimal code with a trained encoder
Energy = reconstruction_error + code_prediction_error + code_sparsity
$E(Y^i, Z) = \|Y^i - W_d Z\|^2 + \|Z - g_e(W_e, Y^i)\|^2 + \sum_j |z_j|$
$g_e(W_e, Y^i) = \mathrm{shrinkage}(W_e Y^i)$
[Factor graph: INPUT Y → reconstruction error ‖Y^i − Ỹ‖² with decoder W_d Z; sparsity Σ_j |z_j| on FEATURES Z; encoder g_e(W_e, Y^i) with code-prediction error ‖Z − Z̃‖²]
Regularized Encoder-Decoder Model (Auto-Encoder) for Unsupervised Feature Learning
[Diagram: encoder g_e(W_e, Y^i), code-prediction error ‖Z − Z̃‖², L2 norm within each pool, FEATURES]
PSD: Basis Functions on MNIST
Learning to Perform
Approximate Inference
LISTA
Better Idea: Give the “right” structure to the encoder
[Diagram: INPUT Y → W_e → + → sh() → Z, with a lateral-inhibition matrix S feeding back into the sum]
ISTA/FISTA reparameterized:
LISTA (Learned ISTA): learn the We and S matrices to get fast solutions
[Gregor & LeCun, ICML 2010], [Bronstein et al. ICML 2012], [Rolfe & LeCun ICLR 2013]
LISTA: Train We and S matrices
to give a good approximation quickly
Think of the FISTA flow graph as a recurrent neural net where We and S are
trainable parameters
[Diagram: INPUT Y → W_e → + → sh() → Z, with feedback matrix S]
Time-Unfold the flow graph for K iterations
Learn the We and S matrices with “backprop-through-time”
Get the best approximate solution within K iterations
[Unrolled flow graph: Y → W_e, then K repetitions of (+ → sh() → S), producing Z]
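A minimal Lua/Torch sketch (random We and S, fixed threshold) of the unrolled recurrence Z ← sh(We·Y + S·Z); in LISTA proper, We, S and the threshold are learned with backprop-through-time:

require 'torch'
local d, k, K, theta = 16, 64, 3, 0.1
local We = torch.randn(k, d) * 0.1
local S  = torch.randn(k, k) * 0.1
local Y  = torch.randn(d)

local function sh(v, t)                       -- soft-shrinkage non-linearity
  return v:clone():abs():add(-t):clamp(0, math.huge):cmul(torch.sign(v))
end

local B = We * Y                              -- computed once
local Z = sh(B, theta)
for i = 1, K do
  Z = sh(B + S * Z, theta)                    -- one unrolled "time step"
end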
Learning ISTA (LISTA) vs ISTA/FISTA
[Plot: reconstruction error as the smallest code elements are removed]
[Architecture diagram: encoding filters W_e with lateral inhibition S, decoding filters W_d reconstructing X̄ with an L1 penalty on the code Z̄, and classification filters W_c producing Ȳ; the recurrent block can be repeated]
Image = prototype + sparse sum of “parts” (to move around the manifold)
Convolutional Sparse Coding
Convolutional sparse coding: $Y = \sum_k W_k * Z_k$
"deconvolutional networks" [Zeiler, Taylor, Fergus CVPR 2010]
Convolutional PSD: Encoder with a soft sh() Function
Convolutional Formulation
Extend sparse coding from PATCH to IMAGE
Filters and Basis Functions obtained with 1, 2, 4, 8, 16, 32, and 64 filters.
Using PSD to Train a Hierarchy of Features
[Diagram, stage 1 trained with PSD: reconstruction ‖Y^i − Ỹ‖² with decoder W_d Z, sparsity λΣ|z_j|, encoder g_e(W_e, Y^i), code-prediction error ‖Z − Z̃‖² → FEATURES]
Using PSD to Train a Hierarchy of Features
[Diagram: the trained stage-1 encoder g_e(W_e, Y^i) with sparsity |z_j| → FEATURES]
Using PSD to Train a Hierarchy of Features
[Diagram: stage-1 encoder followed by a second PSD stage — reconstruction ‖Y^i − Ỹ‖² with W_d Z, sparsity λΣ|z_j|, encoder g_e(W_e, Y^i), code-prediction error ‖Z − Z̃‖² → FEATURES]
Using PSD to Train a Hierarchy of Features
[Diagram: two stacked encoders g_e(W_e, Y^i) with sparsity |z_j| → FEATURES]
Using PSD to Train a Hierarchy of Features
[Diagram: two stacked encoders g_e(W_e, Y^i) with sparsity |z_j|, followed by a classifier → FEATURES]
Unsupervised + Supervised for Pedestrian Detection
Pedestrian Detection, Face Detection
[Osadchy,Miller LeCun JMLR 2007],[Kavukcuoglu et al. NIPS 2010] [Sermanet et al. CVPR 2013]
ConvNet Architecture with Multi-Stage Features
Feature maps from all stages are pooled/subsampled and sent to the final
classification layers
Pooled low-level features: good for textures and local motifs
High-level features: good for “gestalt” and global shape
[Architecture: Input 78x126xYUV → 9x9 filters + tanh → 68 feature maps → 2x2 average pooling → 9x9 filters + tanh (2040 filters) → ...]
[Plot legend: ConvNet B&W Supervised; ConvNet Color+Skip Supervised; ConvNet B&W Unsup+Sup; ConvNet Color+Skip Unsup+Sup]
Stage 2 filters.
Unsupervised training with convolutional predictive sparse decomposition
VIDEOS
Pedestrian Detection: INRIA Dataset. Miss rate vs false
positives
[Plot legend: ConvNet B&W Supervised; ConvNet Color+Skip Supervised; ConvNet B&W Unsup+Sup; ConvNet Color+Skip Unsup+Sup]
Unsupervised Learning:
Invariant Features
Learning Invariant Features with L2 Group Sparsity
$E(Y^i, Z) = \|Y^i - W_d Z\|^2 + \|Z - g_e(W_e, Y^i)\|^2 + \lambda \sum_{\mathrm{pools}} \sqrt{\textstyle\sum_{k \in \mathrm{pool}} Z_k^2}$
(L2 norm within each pool)
[Diagram: INPUT Y, decoder W_d Z, encoder g_e(W_e, Y^i), FEATURES Z]
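A small Lua/Torch sketch (random code vector, made-up pool size) of the group-sparsity penalty, summing the L2 norms of the pools:

require 'torch'
local Z, poolSize = torch.randn(64), 4
local penalty = 0
for p = 1, Z:size(1), poolSize do
  penalty = penalty + Z:narrow(1, p, poolSize):norm()   -- sqrt of the sum of squares in the pool
end
print(penalty)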
Learning Invariant Features with L2 Group Sparsity
Idea: features are pooled in groups.
Sparsity: sum over groups of L2 norm of activity in group.
[Hyvärinen Hoyer 2001]: “subspace ICA”
decoder only, square
[Welling, Hinton, Osindero NIPS 2002]: pooled product of experts
encoder only, overcomplete, log student-T penalty on L2 pooling
[Kavukcuoglu, Ranzato, Fergus LeCun, CVPR 2010]: Invariant PSD
encoder-decoder (like PSD), overcomplete, L2 pooling
[Le et al. NIPS 2011]: Reconstruction ICA
Same as [Kavukcuoglu 2010] with linear encoder and tied decoder
[Gregor & LeCun arXiv:1006:0448, 2010] [Le et al. ICML 2012]
Locally-connected, non-shared (tiled) encoder-decoder
[Diagram: INPUT Y → encoder only (PoE, ICA), decoder only, or encoder-decoder (iPSD, RICA) → SIMPLE FEATURES Z → L2 norm within each pool, λ Σ √(Σ Z_k²) → INVARIANT FEATURES]
Groups are local in a 2D Topographic Map
Image-level training, local filters but no weight sharing
Training on 115x115 images. Kernels are 15x15 (not shared across space!)
Topographic Maps: K. Obermayer and G.G. Blasdel, Journal of Neuroscience, Vol 13, 4114-4129 (Monkey)
Each edge in the tree indicates a zero in the S matrix (no mutual inhibition)
Sij is larger if two neurons are far away in the tree
Invariant Features via Lateral Inhibition: Topographic Maps
[Diagram: an encoder applied to inputs S_t, S_{t-1}, S_{t-2} with weights W1 and W̃2; a decoder with weights W1 and W2 produces a predicted input]
Low-Level Filters Connected to Each Complex Cell
[Diagram: complex cells C1 ("where") and C2 ("what")]
Generating Images
[Figure: input and generated images]
Memory Networks
Memory Network [Weston, Chopra, Bordes 2014]
Results on Question Answering Task
Future Challenges