Program Organizer
!"#$%&'(&%))*+'
-'
Program Instructor
- Professor of Data Science at the Institute of Data Science, NTU, Singapore
- Publications in NIPS, ICML, JMLR
- Teaches Master's and PhD courses in Data Science at EPFL
- Trained at EPFL and UCLA
- Consulting
!"#$%&'(&%))*+'
.'
Teaching Assistants
Kirell Benzi
kirell.benzi@epfl.ch
Data Scientist / Artist

Michaël Defferrard
michael.defferrard@epfl.ch
Data Scientist
In the News
!"#$%&'(&%))*+'
1'
Data Science
Q: What is Data Science?
!"#$%&'(&%))*+'
2'
!"#$%&'(&%))*+'
3'
[Venn diagram] Data Science sits at the intersection of three fields:
- Computer Science: personalized services, intelligent systems
- Mathematical Modeling: data, knowledge discovery (e.g. physics, genomics, social sciences)
- Domain Expertise: sciences, government, industry (e.g. healthcare, defense, education, transportation)
with issues of privacy, security, and ownership.
Q: Is AI new?
A: Not new!

[Timeline figure] A brief history of AI and Data Science:
- 1958: Perceptron (Rosenblatt)
- 1959: Primary visual cortex studies (Hubel-Wiesel)
- 1962: Birth of Data Science; split from Statistics (Tukey); AI hope
- 1975: Backprop (Werbos)
- Neocognitron (Fukushima)
- 1987: First NIPS
- 1989: CNN (LeCun); first KDD
- 1995: SVM/kernel techniques (Vapnik)
- 1997: RNN (Schmidhuber)
- 1998-1999: First NVIDIA GPU; Big Data (volume doubles every 1.5 years); hardware (GPU speed doubles every year)
- 2006: Auto-encoder (LeCun, Hinton, Bengio)
- 2010: Kaggle platform; first Amazon cloud center
- 2012: AI resurgence; "data scientist" becomes the 1st job in the US
- 2014: Facebook AI center
- 2015: Google AI TensorFlow; Facebook AI Torch; OpenAI center
- AI Winter [1966-2012]: kernel techniques, handcrafted features, graphical models
Networks/Graphs
Graphs encode complex data structures. They are everywhere: WWW, Facebook, Amazon, etc.
[Figures: MNIST image network; social network; graph of a Google query (California); GTZAN music network]
[Course map] Graph Science: data structure → pattern extraction
- Python: the language for data science
- Unsupervised learning: clustering (k-means, graph cuts)
- Supervised learning: classification (SVM); deep learning (NNs, CNNs, RNNs)
- Data Science applications: recommender systems (PageRank, collaborative and content filtering) — 3rd day
- Data visualization (manifolds, t-SNE); feature extraction — 2nd day
Questions?
Data Science
Sept 12-14, 2016
Python
Q: Why Python for Data Science?
Computational Needs
- Fast numerical mathematics: BLAS & LAPACK libraries
- Easy bridging to data: data files, databases, scraping
- Easy bridging to legacy code: C, MATLAB, Fortran
- Easy presentation of results: HTML/web & PDF reports
- Rapid prototyping
- Ideally the same framework for R&D and production
- Cluster computing: multi-threading, MPI, OpenMP, IPython Parallel
- GPU computing: OpenCL, CUDA
Python Cons
- Python 2 vs 3
- Slow execution
  - Specialized libraries: numpy, scipy
  - Compilation: pypy, numba, jython
- Need to run the code to catch errors
Xavier Bresson
Scientific Python
Libraries for everything!
- Numerical analysis
  - numpy: multidimensional arrays, data types, linear algebra
  - scipy: higher-level algorithms, e.g. optimization, interpolation, signal processing, sparse matrices, decompositions
- SciKits
  - scikit-learn: machine learning
  - scikit-image: image processing
- Deep learning: tensorflow, theano, keras
- Statistics: pandas
- Symbolic algebra: sympy
- Visualization
  - matplotlib: similar to MATLAB plots
  - bokeh: interactive visualization
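A minimal taste of the numpy/scipy pair; a sketch, with made-up sizes and data:

```python
import numpy as np
from scipy import optimize

# numpy: vectorized linear algebra
A = np.random.randn(100, 3)              # 100 samples, 3 features
x = np.array([1.0, -2.0, 0.5])           # ground-truth coefficients
b = A @ x + 0.01 * np.random.randn(100)  # noisy observations

# scipy: least-squares fit of the coefficients
result = optimize.lsq_linear(A, b)
print(result.x)                          # close to [1.0, -2.0, 0.5]
```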
Data Storage
- Flat files
  - CSV: numpy / pandas
  - MATLAB: scipy
  - JSON: std lib
  - HDF5: h5py
- Connectors for relational databases
  - SQLite: std lib
  - PostgreSQL: psycopg (DB API)
  - MySQL: mysqlclient
  - Oracle: cx_Oracle (DB API)
  - Microsoft SQL Server: pypyodbc (DB API)
- NoSQL data stores
  - Redis: redis-py
  - MongoDB: PyMongo (MongoEngine)
  - HBase: HappyBase
  - Cassandra: DataStax
- Object-Relational Mapping (ORM)
  - SQLAlchemy, Peewee, Pony
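For instance, the standard library alone covers SQLite; a small sketch (table and file names are illustrative):

```python
import sqlite3
import pandas as pd

# SQLite via the standard library
conn = sqlite3.connect("example.db")
conn.execute("CREATE TABLE IF NOT EXISTS ratings (user INT, item INT, score REAL)")
conn.execute("INSERT INTO ratings VALUES (1, 42, 4.5)")
conn.commit()

# pandas reads straight from the connection (and from CSV, JSON, HDF5, ...)
df = pd.read_sql_query("SELECT * FROM ratings", conn)
print(df)
conn.close()
```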
Jupyter
- HTML-based notebook environment
- Multiple kernels/languages: Python, MATLAB, R, Julia
- Platform agnostic: Windows, Mac, Linux, Cloud
- All-in-one reports: text, LaTeX math, code, figures, results
- Best suited for prototyping / data exploration
  - Convert to Python modules when mature for production
- Cloud: GitHub, nbviewer
- Alternatively, scientific IDEs: Spyder, Rodeo
  - Jupyter is itself becoming an HTML-based IDE!
- Other IDEs: IDLE, PyCharm
- Text editors: vim, emacs, atom, sublime text
Install It Yourself
- Windows: anaconda, python(x,y), or Enthought Canopy
- Mac: anaconda, or homebrew / macports / fink
- Linux: package manager (apt-get, yum, pacman)
- Use pip to install packages from PyPI or GitHub
- Use pyvenv to work with virtual environments
Live Session
1) Cloud IDE: nitrous.io
2) Notebook: Jupyter / IPython
3) Basics of Scientific Python: numpy, scipy, scikit-learn, matplotlib
4) Demo: data visualization by Kirell Benzi
Questions?
Data Science
Sept 12-14, 2016
Outline
- Graph Science and Graph Theory
- Class of Networks
- Basic Definitions
- Curse of Dimensionality and Structure
- Manifolds and Graphs
- Spectral Graph Theory
- Construct Graphs from Data
- Conclusion
Graph/Network Science
Definition of graph/network: mathematical models representing pairwise relations between objects/data.
[Figure: data1, data2, data3 with all pairwise relationships]
[Figure: the city of Königsberg, its simplification, and its graph representation. Source: Wikipedia.]
Graph theory offers many analysis tools to use networks for all kinds of applications: from clustering to classification, visualization, recommendation, deep learning, etc.
Outline
- Graph Science and Graph Theory
- Class of Networks
- Basic Definitions
- Curse of Dimensionality and Structure
- Manifolds and Graphs
- Spectral Graph Theory
- Construct Graphs from Data
- Conclusion
Class of Graphs/Networks
Natural graphs:
(1) Social networks: Facebook, LinkedIn, Twitter
(2) Biological networks: brain connectivity and functionality, gene regulatory networks
(3) Communication networks: Internet, networking devices
(4) Transportation networks: trains, cars, airplanes, pedestrians
(5) Power networks: electricity, water
[Figures: Facebook graph; brain connectivity; US electrical network; telecommunication network]
Class of Graphs/Networks
Constructed graphs (from data). Examples: 3D mesh points with n = 1K, 100K, 1M.
Approximate construction technique: kd-tree.
Class of Graphs/Networks
Mathematical/simulated graphs:
(1) Erdős-Rényi graphs (1959)
(2) Stochastic blockmodels [Faust-Wasserman 92]
(3) Lancichinetti-Fortunato-Radicchi (LFR) graphs (2008)
[Figure: Erdős-Rényi network. Source: Wikipedia.]
Advantages: precise control of your data analysis model (best performance, explicit data assumptions). No need to perform extensive experiments (a big issue with deep learning)!
Limitations: most data assumptions are too restrictive, and it may be hard to check whether your data follow the model assumptions.
Outline
- Graph Science and Graph Theory
- Class of Networks
- Basic Definitions
- Curse of Dimensionality and Structure
- Manifolds and Graphs
- Spectral Graph Theory
- Construct Graphs from Data
- Conclusion
Basic Definitions
Graphs are fully defined by G = (V, E, W):
- V: set of vertices, with |V| = n
- E: set of edges, $e_{ij} \in E$
- W: similarity matrix, e.g. $W_{ij} = 0.9$ between vertices $i, j \in V$
Graphs can be directed or undirected.
Basic Definitions
Vertex degree:
(1) For binary graphs ($W_{ij} \in \{0,1\}$): $d_i = \sum_{j \in V} W_{ij}$
Q: Why do we want sparse networks?
A full graph has $|E| = \frac{n(n-1)}{2} = O(n^2)$ edges; a sparse graph has $|E| = O(n)$.
A: Sparse networks are highly desirable for memory and computational efficiency.
Example: the Internet, with n = 4.73 billion pages (August 2016):
- $|E| = n^2 \approx 10^{18}$ if it were full;
- $|E| = k \cdot n \approx 10^{11}$ as it is (very) sparse.
Good news: most natural/real-world networks (Facebook, brain, communication) are sparse. Besides, sparsity reveals structure.
[Figure: full graph vs. sparse graph]
Adjacency/Similarity Matrix W
Definition: the matrix W in G = (V, E, W) actually contains all the information about your network. There are two choices of W:
(1) Binary W: $W_{ij} \in \{0,1\}$, with $W_{ij} = 1$ if $(i,j) \in E$ and $0$ otherwise
(2) Weighted W: $W_{ij} \in [0,1]$ (commonly normalized to 1)
Outline
- Graph Science and Graph Theory
- Class of Networks
- Basic Definitions
- Curse of Dimensionality and Structure
- Manifolds and Graphs
- Spectral Graph Theory
- Construct Graphs from Data
- Conclusion
Curse of Dimensionality
Q: What is the curse of dimensionality?
A: In high dimensions, (Euclidean) distances between data points are meaningless: all data points are close to each other!
Result [Beyer98]: suppose the data are uniformly distributed in $\mathbb{R}^d$, and pick any data point $x_i$; then
$$\lim_{d \to \infty} E\left[\frac{d_{\ell_2}^{\max}(x_i, V \setminus x_i) - d_{\ell_2}^{\min}(x_i, V \setminus x_i)}{d_{\ell_2}^{\min}(x_i, V \setminus x_i)}\right] \to 0$$
i.e. the farthest and nearest neighbors of $x_i$ become indistinguishable.
[Figure: pairwise-distance histograms for a 1-D Gaussian vs. a 1,000,000-D Gaussian]
Blessing of Structure
Q: What is the blessing of structure?
Good news: the assumption that data are uniformly distributed is not true for real-world data. Data always have some structure, in the sense that they belong to a low-dimensional space called a manifold, and distances on this surface are meaningful!
[Figure: uniform distribution of data (no structure, randomness) vs. non-uniform distribution of data (structure)]
Outline
- Graph Science and Graph Theory
- Class of Networks
- Basic Definitions
- Curse of Dimensionality and Structure
- Manifolds and Graphs
- Spectral Graph Theory
- Construct Graphs from Data
- Conclusion
Manifold Learning
Big challenge: it is difficult to discover the structures hidden in the data because of
(1) high-dimensional data,
(2) large-scale data.
A class of algorithms called manifold learning techniques exists (discussed later).
[Figure: MNIST image graph]
Sampling
[Figure (Belkin): smooth manifold → data points → graph G = (V, E, W)]
Neighborhood graphs: k-NN graphs (the most popular choice):
$$W_{ij} = \begin{cases} e^{-\mathrm{dist}(x_i,x_j)^2/\sigma^2} & \text{if } j \in N_i^k \\ 0 & \text{otherwise} \end{cases}$$
where $N_i^k$ is the set of the k nearest neighbors of $x_i$.
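A small sketch of this construction with numpy and scipy (σ and k are arbitrary toy choices):

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_graph(X, k=5, sigma=1.0):
    """Build a k-NN graph with Gaussian edge weights W_ij = exp(-d_ij^2 / sigma^2)."""
    n = X.shape[0]
    D = cdist(X, X)                        # pairwise Euclidean distances
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(D[i])[1:k+1]      # k nearest neighbors (skip self)
        W[i, idx] = np.exp(-D[i, idx]**2 / sigma**2)
    return np.maximum(W, W.T)              # symmetrize -> undirected graph

X = np.random.rand(100, 3)                 # toy data: 100 points in R^3
W = knn_graph(X)
```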
Outline
- Graph Science and Graph Theory
- Class of Networks
- Basic Definitions
- Curse of Dimensionality and Structure
- Manifolds and Graphs
- Spectral Graph Theory
- Construct Graphs from Data
- Conclusion
!"#$%&'(&%))*+'
-.'
Lun = D
nn
di =
n = |V |
Wij
L=D
1/2
Lun D
1/2
= In
1/2
WD
1/2
L=D
Lun = In
Note: All Laplacian are diusion operators, but dierent diusion properties.!
Xavier Bresson
24
Graph Spectrum
Motivation: study the modes of variation of the graph system.
Q: How? A: eigenvalue decomposition (EVD) of the Laplacian L:
$$L = U \Lambda U^T, \quad L u_k = \lambda_k u_k, \quad U = [u_1, ..., u_n], \quad \Lambda = \mathrm{diag}(\lambda_1, ..., \lambda_n)$$
$$\langle u_k, u_{k'} \rangle = \begin{cases} 1 & \text{if } k = k' \\ 0 & \text{otherwise} \end{cases}, \qquad 0 = \lambda_{\min} = \lambda_1 \le ... \le \lambda_{\max}$$
Interpretation:
(1) $u_k$: Fourier modes, i.e. vibration vectors of the graph.
(2) $\lambda_k$: frequencies of the Fourier modes $u_k$, i.e. how much they vibrate.
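A minimal numpy sketch of this decomposition, reusing the W built in the k-NN sketch above (assuming no isolated vertices):

```python
import numpy as np

def graph_spectrum(W):
    """EVD of the normalized Laplacian L = I - D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    lam, U = np.linalg.eigh(L)     # eigenvalues sorted ascending
    return lam, U

lam, U = graph_spectrum(W)
print(lam[:5])                     # smallest frequencies; lam[0] is ~0
```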
Neuroscience
Goal: find meaningful activation patterns in the brain using structural MRI and functional MRI.
[Figure: dynamic activity of the brain; time series at a given location; connectivity of the brain (fibers connecting regions)]
Methodology: build the connectivity graph G and analyze brain activity with its spectral modes $u_k$.
Outline
- Graph Science and Graph Theory
- Class of Networks
- Basic Definitions
- Curse of Dimensionality and Structure
- Manifolds and Graphs
- Spectral Graph Theory
- Construct Graphs from Data
- Conclusion
Adaptive kernels [Zelnik-Manor-Perona04]: scale the Gaussian locally,
$$W_{ij} = \begin{cases} e^{-\mathrm{dist}(x_i,x_j)^2/(\sigma_i \sigma_j)} & \text{if } j \in N_i^k \\ 0 & \text{otherwise} \end{cases}$$
where $\sigma_i$ is a local scale around $x_i$.
What Distances?
Q: What distances do you know?
(1) Euclidean distance:
$$\|x_i - x_j\|_2 = \sqrt{\sum_{m=1}^{d} |x_{i,m} - x_{j,m}|^2}$$
Good for low-dimensional data (d < 10), and for high-dimensional data with clear structures (MNIST).
The kernel can be computed on the raw data or on extracted features z:
$$W_{ij} = e^{-\mathrm{dist}(x_i,x_j)^2/\sigma^2} \quad \text{or} \quad W_{ij} = e^{-\mathrm{dist}(z_i,z_j)^2/\sigma^2}$$
Data Pre-Processing
- Center the data (along each dimension), for the zero-mean property (very common): $x_i \leftarrow x_i - \mathrm{mean}(\{x_i\})$
- Standardize (with zero-mean data): $x_i \leftarrow x_i / \mathrm{std}(\{x_i\})$, where $\mathrm{std}(\{x_i\}) = \sqrt{\sum_j |x_j - \mathrm{mean}(\{x_i\})|^2}$
- Normalize: $x_i \leftarrow x_i / \|x_i\|_2$
- Or rescale to $x_i \in [0,1]$
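These transforms are one-liners in numpy; a sketch on a toy matrix (rows are data points):

```python
import numpy as np

X = np.random.rand(50, 4) * 10                         # toy data: 50 points in R^4

X_centered = X - X.mean(axis=0)                        # zero mean per dimension
X_standard = X_centered / X_centered.std(axis=0)       # unit variance per dimension
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # each row on the unit sphere
X_01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # rescale to [0, 1]
```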
!"#$%&'(&%))*+'
./'
!"#$%&'(&%))*+'
.0'
Outline
- Graph Science and Graph Theory
- Class of Networks
- Basic Definitions
- Curse of Dimensionality and Structure
- Manifolds and Graphs
- Spectral Graph Theory
- Construct Graphs from Data
- Conclusion
Summary
A graph is a superior representation of data: Data → Graph G = (V, E, W).
1st fundamental tool: the adjacency matrix W.
(1) It reveals structures hidden in the data.
(2) It allows us to visualize graphs.
(3) It is used for analysis tasks (discussed later).
2nd fundamental tool: the graph Laplacian matrix L.
(1) It represents the modes of variation of the graph.
(2) It is used for image compression (JPEG), neuroscience, etc.
[Pipeline figure] Good practices for graph-based data science:
- Step 1: Feature extraction (with domain expertise): high-dimensional raw data → data features.
- Step 2: Graph construction: data features → graph.
- Step 3: Graph analysis (spectral graph theory): identify patterns via unsupervised learning, supervised learning, recommendation, visualization.
Questions?
Data Science
Sept 12-14, 2016
Outline
- Definition
- Linear K-Means
- Kernel K-Means
- Balanced Cuts
- NCut
- PCut
- Louvain Algorithm
- Nibble Algorithm
- Conclusion
Unsupervised Learning
Q: What does unsupervised mean?
Unsupervised learning aims at designing algorithms that can find patterns in datasets without the use of labels, i.e. without prior information.
There exist several unsupervised learning techniques:
(1) Unsupervised data clustering (this lecture)
(2) Graph partitioning (this lecture)
(3) Data representation / feature extraction (Lecture 7)
Outline
- Definition
- Linear K-Means
- Kernel K-Means
- Balanced Cuts
- NCut
- PCut
- Louvain Algorithm
- Nibble Algorithm
- Conclusion
K-Means Algorithm
The most popular clustering algorithm (among the top 10 algorithms in data science).
Three types of K-Means techniques:
(1) Standard/linear K-Means
(2) Kernel K-Means, Expectation-Maximization (EM) approach
(3) Kernel K-Means, spectral approach
Standard/Linear K-Means
Description: given n data points $x_i \in \mathbb{R}^d$, K-Means partitions the data into K clusters $S_1, ..., S_K$ that minimize the least-squares objective
$$E(M, S) = \sum_{k=1}^{K} \sum_{x_i \in S_k} \|x_i - m_k\|_2^2$$
with means $M = \{m_1, ..., m_K\}$ and clusters $S = \{S_1, ..., S_K\}$; $\|x_i - m_k\|_2^2$ is the distance between $x_i$ and its mean $m_k$ (the k-th mean of the k-th cluster).
The EM iterations alternate two steps:
- Mean update: $m_k^{l+1} = \frac{1}{|S_k^{l+1}|} \sum_{x_i \in S_k^{l+1}} x_i$
- Cluster update (Voronoi cells): $S_k^{l+1} = \{x_i : \|x_i - m_k^l\|_2^2 \le \|x_i - m_{k'}^l\|_2^2, \; \forall k' \ne k\}$
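In practice one rarely codes this by hand; a scikit-learn sketch on toy data (K chosen arbitrarily):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)                       # toy data: 300 points in R^2
km = KMeans(n_clusters=3, n_init=10).fit(X)      # n_init restarts to dodge bad local minima
print(km.cluster_centers_)                       # the K means m_k
print(km.labels_[:10])                           # cluster assignment of each point
```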
Properties of the EM Algorithm
Advantages:
(1) Monotonic: $E^{l+1} \le E^l$.
Limitations:
(1) Non-convex energy (NP-hard).
(2) Existence of local minimizers: some are good, some are bad. Good initialization is critical; otherwise restart many times and pick the solution with the lowest energy value.
Main Limitation
Assumption: standard K-Means supposes the data follow a Gaussian Mixture Model (GMM), meaning that clusters are linearly separable and spherical.
Outline
- Definition
- Linear K-Means
- Kernel K-Means
- Balanced Cuts
- NCut
- PCut
- Louvain Algorithm
- Nibble Algorithm
- Conclusion
Kernel K-Means [Scholkopf-Smola-Muller98]
Map the data non-linearly, $x_i \to \phi(x_i)$, and minimize
$$E(M, S) = \sum_{k=1}^{K} \sum_{x_i \in S_k} \gamma_i \|\phi(x_i) - m_k\|_2^2$$
where $\gamma_i$ is the weight contribution of data point $x_i$.
Mean update: $\frac{\partial E}{\partial m_k} = 0 \Rightarrow m_k = \frac{\sum_{x_i \in S_k} \gamma_i \phi(x_i)}{\sum_{x_i \in S_k} \gamma_i}$
Cluster Update
Value of the k-th cluster $S_k$:
$$S_k^{l+1} = \{x_i : d(x_i, m_k^l) \le d(x_i, m_{k'}^l), \; \forall k' \ne k\}$$
with $d(x_i, m_k) = \|\phi(x_i) - m_k\|_2^2 = \langle \phi(x_i) - m_k, \phi(x_i) - m_k \rangle$.
Linear algebra: introduce the cluster indicator $F_{ik} = 1$ if $x_i \in S_k$, 0 otherwise, and the kernel matrix $K(x, y) = \langle \phi(x), \phi(y) \rangle$.
Collect the distances $D_{ik}^l = d(x_i, m_k^l)$ in a matrix; they can be written purely in terms of the kernel matrix, schematically $D = \mathrm{diag}(K) - 2KF + \mathrm{diag}(F^T K F)$ (up to cluster-size normalizations of F).
Update clusters:
$$F_{ik}^{l+1} = \begin{cases} 1 & \text{if } D_{ik}^l = \mathrm{argmin}_{k'} D_{ik'}^l \\ 0 & \text{otherwise} \end{cases}$$
$F_{\cdot k}$ is an implicit representation of cluster $S_k$: $S_k^{l+1} = \{x_i : F_{ik}^{l+1} = 1\}$.
Kernel Trick
Q: Do we need to compute the kernel mapping φ?
A: No, we never use the non-linear function φ explicitly! Its exact expression is actually irrelevant; only the kernel matrix K matters.
Why is this good? Mapping the data and computing $\langle \phi(x_i), \phi(x_j) \rangle$ directly would be time consuming; evaluating the kernel on $\langle x_i, x_j \rangle$ is cheap.
Popular kernels:
(1) Gaussian kernels: $K_{ij} = e^{-\|x_i - x_j\|_2^2/\sigma}$
Algorithm Properties
Advantage: all computations are basically matrix computations (linear algebra). Good news, because most processors have an architecture and libraries to perform very fast linear algebra:
(1) Intel Math Kernel Library (MKL), which includes the Linear Algebra Package (LAPACK) and the Basic Linear Algebra Subprograms (BLAS).
(2) AMD Core Math Library (ACML), which also includes LAPACK and BLAS.
Spectral Approach
Rewrite the kernel K-Means energy with a (weighted) indicator of the clusters,
$$Y_{ik} = \begin{cases} \left(\sum_{j \in S_k} \gamma_j\right)^{-1/2} & \text{if } i \in S_k \\ 0 & \text{otherwise} \end{cases}$$
so that minimizing E becomes an equivalent maximization problem over Y:
$$\min_Y E \;\Leftrightarrow\; \max_Y \tilde{E}$$
Spectral Relaxation
Q: What does NP-hard mean?
Minimizing the objective exactly is an NP-hard problem (i.e. it would take forever)!
Relaxed problem: maximize $\mathrm{tr}(Y^T A Y)$ subject to $Y^T Y = I_K$. The solution is given by the top K eigenvalues/eigenvectors of A:
$$A y_k = \lambda_k y_k, \quad \langle y_k, y_{k'} \rangle = \begin{cases} 1 & \text{if } k = k' \\ 0 & \text{otherwise} \end{cases}$$
Taking the K largest $\lambda_k$:
$$\max_{Y^T Y = I_K} \mathrm{tr}(Y^T A Y) = \sum_{k=1}^{K} y_k^T A y_k = \sum_{k=1}^{K} \lambda_k \langle y_k, y_k \rangle = \sum_{k=1}^{K} \lambda_k$$
For kernel K-Means the relaxed matrix is $A = \Gamma^{1/2} K \Gamma^{1/2}$: solve $A y_k = \lambda_k y_k$ for $k = 1, ..., K$ (we drop the binary constraint on Y).
Outline
- Definition
- Linear K-Means
- Kernel K-Means
- Balanced Cuts
- NCut
- PCut
- Louvain Algorithm
- Nibble Algorithm
- Conclusion
Setting: data $V = \{x_1, ..., x_n\} \subset \mathbb{R}^d$ with a similarity graph $G = (V, E, W)$.
Min Cut [Wu-Leahy93]
Cut operator: given a graph G, a cut partitions G into two sets S and $S^c$ with value
$$\mathrm{Cut}(S, S^c) = \sum_{i \in S, j \in S^c} W_{ij}$$
[Figure: example cuts on a small weighted graph]
- Value of cut 1: $\mathrm{Cut}(S, S^c) = 0.3 + 0.2 + 0.3 = 0.8$
- Value of cut 2: $\mathrm{Cut}(S, S^c) = 0.5 + 0.5 + 0.5 + 0.5 = 2.0$
- Value of cut 3: $\mathrm{Cut}(S, S^c) = 0.5$
Balanced Cuts
- Cheeger cut: $\min_S \frac{\mathrm{Cut}(S, S^c)}{\min(\mathrm{Vol}(S), \mathrm{Vol}(S^c))}$
- Normalized cut: $\min_S \frac{\mathrm{Cut}(S, S^c)}{\mathrm{Vol}(S)} + \frac{\mathrm{Cut}(S, S^c)}{\mathrm{Vol}(S^c)}$
- Normalized association: $\max_S \frac{\mathrm{Assoc}(S, S)}{\mathrm{Vol}(S)} + \frac{\mathrm{Assoc}(S^c, S^c)}{\mathrm{Vol}(S^c)}$
(Partitioning by max vertex matching over the parts $C_k$, $C_k^c$.)
with
$$\mathrm{Cut}(S, S^c) = \sum_{i \in S, j \in S^c} W_{ij}, \quad \mathrm{Vol}(S) = \sum_{i \in S} d_i, \; d_i = \sum_{j \in V} W_{ij}, \quad \mathrm{Assoc}(S, S) = \sum_{i \in S, j \in S} W_{ij}$$
Spectral Relaxation
Issue: solving balanced cut problems directly is intractable, as they are NP-hard combinatorial problems. We need to find the best possible approximation (close to the original solution). The best approximate techniques are based on spectral relaxation.
Normalized association:
$$\max_{S_k} \sum_{k=1}^{K} \frac{\mathrm{Assoc}(S_k, S_k)}{\mathrm{Vol}(S_k)} \quad (1) \;\Leftrightarrow\; \max_{F} \sum_{k=1}^{K} \frac{F_k^T W F_k}{F_k^T D F_k} \quad (2)$$
Substituting $Y_{\cdot k} = \frac{D^{1/2} F_k}{\|D^{1/2} F_k\|_2}$, with $F_{ik} = \mathrm{Vol}(S_k)^{-1/2}$ if $i \in S_k$ and 0 otherwise, turns (2) into a trace maximization with $A = D^{-1/2} W D^{-1/2}$.
Spectral Relaxation
The binary constraint $Y \in S_{ind}$ makes the problem NP-hard, so we drop it:
$$\max_Y \mathrm{tr}(Y^T A Y) \;\; \text{s.t.} \;\; Y^T Y = I_K$$
Equivalence:
- Balanced cuts: the same problem with $A = D^{-1/2} W D^{-1/2}$ and $Y \in S_{ind}$.
- Kernel K-Means: the same problem with $K = W$.
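A compact sketch of the relaxed pipeline (build A, take the top eigenvectors, discretize with K-means), reusing the knn_graph helper sketched in the graphs lecture; this mirrors generic spectral clustering, not any one lecture demo:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_partition(W, K=2):
    """Relaxed balanced cut: top-K eigenvectors of A = D^{-1/2} W D^{-1/2}, then K-means."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    A = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    lam, U = np.linalg.eigh(A)                        # ascending eigenvalues
    Y = U[:, -K:]                                     # K largest eigenvectors
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)  # row-normalize before discretizing
    return KMeans(n_clusters=K, n_init=10).fit_predict(Y)

labels = spectral_partition(W, K=3)
```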
Demo
Run lecture04_code03.ipynb
Outline
- Definition
- Linear K-Means
- Kernel K-Means
- Balanced Cuts
- NCut
- PCut
- Louvain Algorithm
- Nibble Algorithm
- Conclusion
NCut Algorithm [Yu-Shi04]

Demo: NCut
Run lecture04_code04.ipynb
Technical Details
Step 1: solve the relaxed problem with $A = D^{-1/2} W D^{-1/2}$ by EVD, giving $Y^\star$.
Step 2: discretize: find the closest binary solution, $\min_{Z,R} \|Z - Y^\star R\|$ over rotations R.
Reminder on the two decompositions:
$$\text{EVD: } A y_k = \lambda_k y_k; \qquad \text{SVD: } A = U \Sigma V^T, \; U^T U = I_n, \; V^T V = I_m$$
EVD and SVD are matrix factorization techniques: very common tools in (linear) data science; many techniques boil down to EVD and SVD.
Example
NCut on the noisy real-world networks WEBK4 and CITESEER: [figure]
Outline
- Definition
- Linear K-Means
- Kernel K-Means
- Balanced Cuts
- NCut
- PCut
- Louvain Algorithm
- Nibble Algorithm
- Conclusion
PCut [B-et.al.16]: state of the art.
Results: [figure]

Demo: PCut
Run lecture04_code05.ipynb
Outline
- Definition
- Linear K-Means
- Kernel K-Means
- Balanced Cuts
- NCut
- PCut
- Louvain Algorithm
- Nibble Algorithm
- Conclusion
Clustering/Partitioning with an Unknown Number of Clusters
Recall: the previous techniques assume the number K of clusters is known.
If K is unknown, there exist two approaches:
(1) Define a quality measure of clustering (domain expertise), run the previous techniques with different K values, and pick the K with the best measure.
(2) Make K a variable of the clustering problem: the Louvain algorithm.
Louvain technique [Blondel-et.al.08]: very popular in the social sciences. It is a greedy algorithm that optimizes the modularity objective
$$\max_f Q(f) = \frac{1}{2m} \sum_{ij} \left(W_{ij} - \frac{d_i d_j}{2m}\right) \delta(f_i, f_j), \quad 2m = \sum_{ij} W_{ij} = \mathrm{Vol}(V), \quad \delta(f_i, f_j) = \begin{cases} 1 & \text{if } f_i = f_j \\ 0 & \text{otherwise} \end{cases}$$
Modularity maximization is related to balanced cut objectives of the form $\min_{S_k} \sum_{k=1}^{K} \mathrm{Cut}(S_k, S_k^c) / (\mathrm{Vol}(S_k)\mathrm{Vol}(S_k^c))$.
Greedy Algorithm
Q: What is a greedy algorithm?
Step 1: Energy optimization step. Find communities by locally optimizing the modularity: each node is first assigned to its own community; then, for each node i, we move i to the community of the neighbor that best improves the modularity. The process is repeated until no changes occur.
Step 2: Aggregate each community into a super-vertex with weights $W_{kk'} = \sum_{i \in S_k, j \in S_{k'}} W_{ij}$, and repeat.
Note on greedy algorithms:
(1) (Relatively) fast.
(2) No theoretical guarantee on the solution (local optimizer).
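A sketch of running Louvain from Python; this assumes the third-party python-louvain package (imported as `community`), which implements [Blondel-et.al.08] on top of networkx:

```python
import networkx as nx
import community as community_louvain   # pip install python-louvain

G = nx.karate_club_graph()                        # classic toy social network
partition = community_louvain.best_partition(G)   # node -> community id; K is found automatically
print(set(partition.values()))                    # discovered communities
print(community_louvain.modularity(partition, G)) # value of the modularity Q
```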
Demo: Louvain
Run lecture04_code06.ipynb
Outline
- Definition
- Linear K-Means
- Kernel K-Means
- Balanced Cuts
- NCut
- PCut
- Louvain Algorithm
- Nibble Algorithm
- Conclusion
Clustering/Partitioning Small-Scale Communities
Motivation: how does Facebook target small communities of users for advertisement?
Goal: identify small-scale clusters in networks.
Nibble Algorithm
Core principle: a greedy algorithm that locally optimizes the Cheeger cut on graphs:
$$\min_S E_{Cheeger}(S) = \frac{\mathrm{Cut}(S, S^c)}{\min(\mathrm{Vol}(S), \mathrm{Vol}(S^c))}$$
Iterate until K clusters are found:
- Step 1: Pick a vertex s randomly on the graph.
- Step 2: Diffuse the Dirac function of the vertex s: $f^{l+1} = f^l - \tau L f^l$ with $L = I_n - D^{-1/2} W D^{-1/2}$.
Demo: Nibble
Run lecture04_code07.ipynb
Outline
- Definition
- Linear K-Means
- Kernel K-Means
- Balanced Cuts
- NCut
- PCut
- Louvain Algorithm
- Nibble Algorithm
- Conclusion
Summary
[Decision map] Unsupervised clustering:
- K known, data without a graph (full matrix): K-Means*; via kernel construction → Kernel K-Means: (1) EM (Graclus*), (2) spectral, with equivalence of solutions.
- K known, graph given (sparse matrix, though graph construction may be needed): balanced cuts (Cheeger, normalized cuts/associations) — NP-hard, handled by spectral relaxation:
  - Linear relaxation: NCut* (loose relaxation of balanced cuts; medium-size clusters).
  - Non-linear relaxation: PCut (tight relaxation); Nibble (greedy algorithm; small-scale clusters).
- K unknown: Louvain algorithm (greedy technique).
Transductive Clustering
The previous techniques are fully unsupervised: no prior information about class labels is given.
Transductive clustering: when some class labels are available, they usually boost the clustering results significantly, by something like 5-20%. However, collecting labeled data can be time consuming (a trade-off).
Note: transductive clustering is different from semi-supervised classification. Classification aims at learning a decision function for new data; the clustering objective is to classify the given data (no new data are considered).
Conclusion
Unsupervised clustering is one of the most generic data analysis tasks.
(1) It is applied when basically nothing is known about the data.
(2) It is a Lego block that can be used for all kinds of data analysis tasks.
Questions?
Data Science
Sept 12-14, 2016
Outline
- Learning Techniques
- Linear SVM
- Soft-Margin SVM
- Kernel Techniques
- Non-Linear/Kernel SVM
- Graph SVM
- Conclusion
Supervised learning with labels: given labeled data $(x_i, \ell_i)$ with $\ell_i = +1$ for class $C_1$ and $\ell_i = -1$ for class $C_2$, learn a function f such that $f(x) = +1 \; \forall x \in C_1$ and $f(x) = -1 \; \forall x \in C_2$.
[Overview figure] The lecture's road map:
- Linear SVM: supervised learning [Vapnik-Chervonenkis63]
- Non-Linear/Kernel SVM: supervised learning [Boser-Guyon-Vapnik92]
- Laplacian SVM: semi-supervised learning [Belkin-Niyogi-Sindhwani06]
Outline
- Learning Techniques
- Linear SVM
- Soft-Margin SVM
- Kernel Techniques
- Non-Linear/Kernel SVM
- Graph SVM
- Conclusion
Linear SVM
Goal: learn a classifier $f: x \in \mathbb{R}^d \to \{-1, +1\}$.
Class of (linear) solutions: given the linear separability assumption, determine the hyperplane that best separates the two classes. Any hyperplane is parameterized by two variables (w, b), where w is the normal vector of the hyperplane and b is the offset value.
Hyperplane equation: $\langle w, x \rangle + b = 0$
The classifier is
$$f(x) = \mathrm{sign}(\langle w, x \rangle + b) = \begin{cases} +1 & \text{if } x \in C_1, \text{ i.e. } \langle w, x \rangle + b > 0 \\ -1 & \text{if } x \in C_2, \text{ i.e. } \langle w, x \rangle + b < 0 \end{cases}$$
Margin
The margin hyperplanes are $\langle w, x \rangle + b = \pm 1$; for points $x_+$, $x_-$ on them, $\langle w, x_+ - x_- \rangle = 2$, so
$$\vec{d} = x_+ - x_- = \frac{2w}{\|w\|_2^2}, \qquad d = \frac{2}{\|w\|_2}$$
Maximizing the margin is therefore equivalent to minimizing the norm of w:
$$\max \frac{2}{\|w\|_2} \;\Leftrightarrow\; \min \|w\|_2^2$$
Primal Optimization
The separation constraints are
$$\langle w, x_i \rangle + b \ge +1 \text{ if } x_i \in C_1 \; (\ell_i = +1), \qquad \langle w, x_i \rangle + b \le -1 \text{ if } x_i \in C_2 \; (\ell_i = -1)$$
which combine, with $f_i = \langle w, x_i \rangle + b$, into
$$\ell_i \cdot f_i \ge 1 \quad \forall i \in V$$
a convex set (polytope). Together with margin maximization this defines the SVM classifier.
Support Vectors
Definition: the data points exactly located on the margin hyperplanes:
$$\ell_i \cdot (\langle w, x_i^{SP} \rangle + b) - 1 = 0, \quad \forall x_i^{SP}$$
They determine the offset: $b_i = \ell_i - \langle w, x_i^{SP} \rangle$ and $b = E(\{b_i\})$ (average over the support vectors).
Dual Problem
Primal problem: $\min_{w,b} \|w\|_2^2$ s.t. $\ell_i \cdot f_i \ge 1 \; \forall i \in V$.
Dual problem (a QP problem):
$$\min_{\alpha \ge 0} \frac{1}{2} \alpha^T Q \alpha - \langle \alpha, 1 \rangle \;\; \text{s.t.} \;\; \langle \alpha, \ell \rangle = 0$$
with $Q = LKL$, $L = \mathrm{diag}(\ell_1, ..., \ell_n)$, $K_{ij} = \langle x_i, x_j \rangle$ (linear kernel).
Optimization Algorithm
Classification function:
$$f(x) = \mathrm{sign}(\langle w^\star, x \rangle + b^\star) = \mathrm{sign}(\alpha^{\star T} L K(x) + b^\star)$$
Optimization scheme: the solution $\alpha^\star$ is given by an iterative projected scheme, schematically
$$\alpha^{l=0} = y^{l=0} = 0, \quad \tau = \frac{1}{\|Q\|}, \; \sigma = \frac{1}{\|L\|}$$
iterate until convergence, l = 0, 1, 2, ...:
$$\alpha^{l+1} = P_{\ge 0}\left[(\tau Q + I_n)^{-1}(\alpha^l + \tau(1 - \sigma L y^l))\right], \qquad y^{l+1} = y^l + \sigma L \alpha^{l+1}$$
with $P_{\ge 0}$ the projection onto the non-negativity constraint, and $\alpha^\star$ the limit.
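In practice the QP is solved by a library; a scikit-learn sketch of a linear SVM on toy data (a large C approximates the hard margin):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable blobs in R^2
X = np.vstack([np.random.randn(50, 2) + [3, 3], np.random.randn(50, 2) - [3, 3]])
y = np.array([+1] * 50 + [-1] * 50)

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C ~ hard-margin SVM
print(clf.support_vectors_)                    # points on the margin hyperplanes
print(clf.predict([[2.5, 2.0], [-1.0, -4.0]]))
```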
!"#$%&'(&%))*+'
,/'
Outline
- Learning Techniques
- Linear SVM
- Soft-Margin SVM
- Kernel Techniques
- Non-Linear/Kernel SVM
- Graph SVM
- Conclusion
Soft-Margin SVM
Soft SVM: find a hyperplane that best separates the data (by maximizing the margin) while allowing as few outliers as possible.
New optimization, with slack variables $e_i$:
$$\min_{w,b} \|w\|_2^2 + \lambda \sum_{i=1}^{n} e_i \;\; \text{s.t.} \;\; \ell_i \cdot f_i \ge 1 - e_i, \; e_i \ge 0 \; \forall i \in V$$
Dual Problem
Primal problem:
$$\min_{w,b} \|w\|_2^2 + \lambda \sum_{i=1}^{n} e_i \;\; \text{s.t.} \;\; \ell_i \cdot f_i \ge 1 - e_i, \; e_i \ge 0 \; \forall i \in V$$
Dual problem (after some computations):
$$\min_{0 \le \alpha \le \lambda} \frac{1}{2} \alpha^T Q \alpha - \langle \alpha, 1 \rangle \;\; \text{s.t.} \;\; \langle \alpha, \ell \rangle = 0$$
with $Q = LKL$, $L = \mathrm{diag}(\ell_1, ..., \ell_n)$, $K_{ij} = \langle x_i, x_j \rangle$.
Loss-Function View
The primal problem
$$\min_{w,b} \|w\|_2^2 + \lambda \sum_{i=1}^{n} e_i \;\; \text{s.t.} \;\; \ell_i \cdot f_i \ge 1 - e_i, \; e_i \ge 0$$
is equivalent to an unconstrained problem with the hinge loss:
$$\min_{w,b} \|w\|_2^2 + \lambda \sum_{i=1}^{n} V_{hinge}(f_i, \ell_i), \qquad V_{hinge}(f_i, \ell_i) = \max(0, 1 - f_i \cdot \ell_i)$$
Other losses can be used:
- Quadratic/L2 loss: $V_{\ell_2}(f_i, \ell_i) = \begin{cases} (1 - f_i \ell_i)^2 & \text{if } f_i \ell_i \le 1 \\ 0 & \text{otherwise} \end{cases}$
- Huber loss: $V_{Huber}(f_i, \ell_i) = \begin{cases} \frac{1}{2} - f_i \ell_i & \text{if } f_i \ell_i \le 0 \\ \frac{1}{2}(1 - f_i \ell_i)^2 & \text{if } 0 < f_i \ell_i \le 1 \\ 0 & \text{otherwise} \end{cases}$
Outline
- Learning Techniques
- Linear SVM
- Soft-Margin SVM
- Kernel Techniques
- Non-Linear/Kernel SVM
- Graph SVM
- Conclusion
Kernel Techniques
Very popular techniques (until deep learning). The classifier takes the form
$$f(x) = \sum_{i=1}^{n} a_i K(x, x_i) + b$$
Kernels
$$f(x) = \sum_{i=1}^{n} a_i K(x, x_i) + b$$
Popular kernels:
(1) Linear kernel: $K(x, y) = \langle x, y \rangle$
(2) Gaussian kernel: $K(x, y) = e^{-\|x - y\|_2^2/\sigma}$, so $K(x, x_i) = e^{-\|x - x_i\|_2^2/\sigma}$
A kernel defines a feature map φ, and inversely: $K(x, y) \stackrel{def}{=} \langle \phi(x), \phi(y) \rangle$.
Summary:
- Representer theorem: $f(x) = \sum_i a_i K_{x_i}(x)$
- Reproducing kernel K ↔ bounded continuous function f ↔ feature map φ
- Kernel trick: $f(x) = \sum_i a_i \langle \phi(x_i), \phi(x) \rangle$
Outline
- Learning Techniques
- Linear SVM
- Soft-Margin SVM
- Kernel Techniques
- Non-Linear/Kernel SVM
- Graph SVM
- Conclusion
Non-Linear/Kernel SVM [Boser-Guyon-Vapnik92]
Motivation: linear/soft SVMs assume the data are linearly separable (up to a few outliers). For several real-world datasets, the hyperplane assumption is not satisfied. A better separator is a non-linear "hyperplane", that is, a hypersurface.
Kernel trick: project the data into a higher-dimensional space with a feature map φ where the data are linearly separable.
[Figure: linear separator vs. non-linear separator]
Linear SVM: $f(x) = \langle w, x \rangle + b$ with $w = \sum_i \alpha_i \ell_i x_i$.
Kernel SVM: $f(x) = \langle w, \phi(x) \rangle + b$ with $w = \sum_i \alpha_i \ell_i \phi(x_i)$, giving
$$f(x) = \sum_i \alpha_i \ell_i \langle \phi(x), \phi(x_i) \rangle + b = \sum_i \alpha_i \ell_i K(x_i, x) + b$$
Optimization
Dual problem:
$$\min_{0 \le \alpha \le \lambda} \frac{1}{2} \alpha^T Q \alpha - \langle \alpha, 1 \rangle \;\; \text{s.t.} \;\; \langle \alpha, \ell \rangle = 0$$
with $Q = LKL$, $L = \mathrm{diag}(\ell_1, ..., \ell_n)$, and now a non-linear kernel, e.g. polynomial $K(x, y) = (a\langle x, y \rangle + b)^c$ or Gaussian $K(x, y) = e^{-\|x - y\|_2^2/\sigma}$.
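The only change from the linear case is the kernel matrix; a sketch computing a Gaussian K explicitly and feeding it to a precomputed-kernel SVM (σ and C are arbitrary toy choices):

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel(A, B, sigma=1.0):
    """K_ij = exp(-||a_i - b_j||^2 / sigma), computed without any explicit phi."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / sigma)

X = np.random.randn(100, 2)
y = np.where(X[:, 0]**2 + X[:, 1]**2 > 1.0, 1, -1)   # non-linearly separable: inside vs. outside a circle

K = gaussian_kernel(X, X)
clf = SVC(kernel="precomputed", C=10.0).fit(K, y)
X_new = np.random.randn(5, 2)
print(clf.predict(gaussian_kernel(X_new, X)))        # kernel between new and training points
```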
!"#$%&'(&%))*+'
-4'
min kwk22 +
w
n
X
Vloss (fi , `i )
i=1
min kf k2HK +
f 2HK
n
X
i=1
Regularity!
of f!
f (x) =
n
X
Trade-o"!
ai K(x, xi )
i=1
Norm of in RKHS:!
!"#$%&'(&%))*+'
kf k2HK
Vloss (fi , `i )
= hf, f iHK =
fi fj Kij = f T Kf
ij
.5'
Outline
- Learning Techniques
- Linear SVM
- Soft-Margin SVM
- Kernel Techniques
- Non-Linear/Kernel SVM
- Graph SVM
- Conclusion
Manifold Assumption
Observation: the geometry of the data is independent of the labels!
(Labeled and unlabeled) data are assumed to lie on a manifold, where the classification will be carried out.
How to introduce the manifold geometry in SVM?
- First, approximate the manifold M with a neighborhood graph, i.e. a k-NN graph.
- Second, add a regularization term that forces the classification function f to be smooth on the manifold (/graph).
Optimization
Optimization problem:
$$\min_{f \in H_K} \|f\|_{H_K}^2 + \lambda \sum_{i=1}^{n} V_{loss}(f_i, \ell_i) + \lambda_G \|\nabla f\|_2^2$$
Dirichlet energy:
(1) It forces f to be smooth on M.
(2) Its derivative is $\Delta f = 0$ (heat diffusion).
$$\|\nabla f\|_2^2 = \sum_{ij} W_{ij} |f(x_i) - f(x_j)|^2 = f^T L f \quad \text{(L: graph Laplacian operator)}$$
Algorithm
Semi-supervised SVM, or Laplacian SVM [Belkin-Niyogi-Sindhwani06]:
$$\min_{f \in H_K} \|f\|_{H_K}^2 + \lambda \sum_{i=1}^{n} V_{hinge}(f_i, \ell_i) + \lambda_G f^T L f$$
The classifier is $f(x) = \mathrm{sign}\left(\sum_{i=1}^{n} a_i^\star K(x, x_i)\right)$ with $a^\star = (I + \lambda_G L K)^{-1} H L \alpha^\star$, where $\alpha^\star$ solves the QP
$$\alpha^\star = \arg\min_{0 \le \alpha \le \lambda} \frac{1}{2} \alpha^T Q \alpha - \langle \alpha, 1 \rangle \;\; \text{s.t.} \;\; \langle \alpha, \ell \rangle = 0, \qquad Q = LHK(I + \lambda_G L K)^{-1} H L$$
Outline
- Learning Techniques
- Linear SVM
- Soft-Margin SVM
- Kernel Techniques
- Non-Linear/Kernel SVM
- Graph SVM
- Conclusion
Summary
[Overview figure, repeated from the introduction]
- Linear SVM: supervised learning [Vapnik-Chervonenkis63]
- Non-Linear/Kernel SVM: supervised learning [Boser-Guyon-Vapnik92]
- Laplacian SVM: semi-supervised learning [Belkin-Niyogi-Sindhwani06]
Summary
General supervised and semi-supervised optimization technique:
$$\min_{f \in H_K} \|f\|_{H_K}^2 + \lambda \sum_{i=1}^{n} V_{loss}(f_i, \ell_i) + \lambda_G R_{graph}(f)$$
where $\|f\|_{H_K}^2$ controls the regularity of f,
$$V_{loss} = \begin{cases} \text{Hinge} \\ \text{L2} \\ \text{L1} \\ \text{Huber} \\ \text{Logistic} \end{cases} \qquad R_{graph}(f) = \begin{cases} \text{Dirichlet: } \|\nabla_G f\|_2^2 \\ \text{Total Variation: } \|\nabla_G f\|_1 \\ \text{Wavelets: } \|D_{wavelets} f\|_2^2 \end{cases}$$
and the graph regularization exploits the unlabeled data.
Questions?
Data Science
Sept 12-14, 2016
Outline
- Recommender Systems
- Google PageRank
- Collaborative Recommendation
- Convex Optimization
- Content Recommendation
- Hybrid Systems
- Conclusion
Introduction
Recommendation has become a central part of intelligent systems.
Q: Where do you find recommender systems?
Outline
- Recommender Systems
- Google PageRank
- Collaborative Recommendation
- Convex Optimization
- Content Recommendation
- Hybrid Systems
- Conclusion
Google PageRank
A billion-dollar algorithm!
PageRank is an algorithm that ranks websites on the Internet. It is at the core of the Google search engine, which introduced a revolution in 1998, as ranking was previously done manually by humans.
Q: Do you know how many webpages there were in 1998, and how many in 2016?
In 1998, the size of the WWW was 2.4M webpages. Today, in August 2016, the size of the WWW is 4.6B!
PageRank Technique
It is a sound technique as it is
(1) mathematically well defined,
(2) computationally efficient.
Core idea: PageRank sorts the vertices of a directed graph G using the stationary state of G.
Definition: the stationarity and modes of vibration of graphs/networks can be studied by EVD (Lecture 3), such that $A x_l = \lambda_l x_l$.
Perron-Frobenius Theorem
Given a graph G = (V, E, W) defined by a stochastic and irreducible matrix W, the PF theorem establishes that the largest left eigenvector (with eigenvalue 1) is the stationary state, i.e. the PageRank solution:
$$x_{\max}^T W = \lambda_{\max} x_{\max}^T = x_{\max}^T \quad (\lambda_{\max} = 1)$$
Stochastic Matrix
Definition: a matrix W whose rows are normalized as probability density functions:
$$\sum_j W_{ij} = 1, \qquad W \mathbf{1} = \mathbf{1}$$
Make W stochastic: $W \leftarrow D^{-1} W$, with $D_{ii} = \sum_j W_{ij}$ if the i-th row is non-zero, and $D_{ii} = 0$ otherwise.
Irreducible Matrix
Definition: a matrix W that represents a strongly connected graph, i.e. W has, for any pair of vertices (i, j):
(1) a directed path from i to j,
(2) a directed path from j to i.
Make W irreducible: mix the (sparse) original matrix with a uniform jump term,
$$W_{si} = \alpha W + (1 - \alpha) \frac{\mathbf{1}_n \mathbf{1}_n^T}{n}$$
Interpretation
Q: What is a random surfer?
The term $(1-\alpha)\mathbf{1}_n\mathbf{1}_n^T/n$ is equivalent to a random surfer/user who can jump to any webpage.
The whole model $\alpha D^{-1}W + (1-\alpha)\mathbf{1}_n\mathbf{1}_n^T/n$ represents a surfer/user who follows the internet structure a fraction α of the time and who, in the remaining (1-α) of the time, suddenly clicks through to a random webpage that has no connection to the previous page.
Naive Algorithm
PageRank simple algorithm: solve the EVD problem directly:
$$\text{left eigenproblem } x^T W_{si} = x^T \;\Leftrightarrow\; \text{right eigenproblem } W_{si}^T x = x$$
Power Method
Algorithm:
$$x^{k=0} = \frac{\mathbf{1}_n}{n}, \qquad x^{k+1} = \alpha W^T D^{-1} x^k + (1 - \alpha) \frac{\mathbf{1}_n}{n}$$
At convergence, $x^{k \to \infty} = x_{pagerank}$, because the limit solves $x = W_{si}^T x$.
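A numpy sketch of this power iteration on a tiny directed graph (α = 0.85; the adjacency matrix is made up):

```python
import numpy as np

def pagerank(W, alpha=0.85, tol=1e-6):
    """Power method for the PageRank vector of a directed adjacency matrix W."""
    n = W.shape[0]
    d = W.sum(axis=1)
    d[d == 0] = 1                       # guard dangling nodes against division by zero
    P = W / d[:, None]                  # row-stochastic transition matrix D^{-1} W
    x = np.ones(n) / n
    while True:
        x_new = alpha * P.T @ x + (1 - alpha) / n
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new

W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)
print(pagerank(W))                      # stationary distribution over the 4 pages
```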
Properties
- Direct EVD on the full matrix: $O(n^2)$.
- Power method on a sparse matrix: $O(|E|)$ per iteration.
The number of iterations needed to reach a precision ε, measured by $\|x^{k+1} - x^k\|_1 \le \epsilon$, is controlled:
$$K = \frac{\log_{10} \epsilon}{\log_{10} \alpha} \approx \begin{cases} 85 & \text{for } \alpha = 0.85 \\ 1833 & \text{for } \alpha = 0.99 \end{cases}$$

Demo: PageRank
Run lecture06_code01.ipynb
Outline
- Recommender Systems
- Google PageRank
- Collaborative Recommendation
- Convex Optimization
- Content Recommendation
- Hybrid Systems
- Conclusion
!"#$%&'(&%))*+'
,1'
Collaborative Filtering!
!! Formulation: Given a few ratings/observations Mij of movie j and user i, find
a low-rank matrix X that best fits the ratings. !
Recommendation !
= !
Matrix completion!
M'
!"#$%&'(&%))*+'
X'
,2'
Low-Rank Recommendation
Definition: a low-rank matrix has many rows and columns that are linearly dependent. The rank of a matrix is the number of linearly independent rows (equivalently, columns); for a low-rank X this number is much smaller than the matrix dimensions.
The same assumptions hold for Amazon (users, products), LinkedIn (users, jobs), Facebook (users, ads), etc.
Formalization
Modeling:
$$\min_X \mathrm{rank}(X) \;\; \text{s.t.} \;\; \begin{cases} X_{ij} = M_{ij} \; \forall ij \in \mathrm{obs} & \text{(noiseless case: observations are clean)} \\ X_{ij} = M_{ij} + n_{ij} \; \forall ij \in \mathrm{obs} & \text{(noisy case: observations may be corrupted)} \end{cases}$$
This is a combinatorial NP-hard problem: a relaxation is needed, either a convex relaxation or a non-convex relaxation.
Outline
- Recommender Systems
- Google PageRank
- Collaborative Recommendation
- Convex Optimization
- Content Recommendation
- Hybrid Systems
- Conclusion
Convex Optimization
Convex optimization has become a very powerful tool in data science over the last decade (the 2nd most popular topic at the NIPS conference, behind deep learning).
Several state-of-the-art techniques are based on convex optimization, such as (sparse) data representation, recommender systems, and unsupervised clustering.
Classes of optimization problems in data science:
(1) Linear programming (LP)
(2) Quadratic programming (QP)
(3) Smooth convex optimization
(4) Non-smooth convex optimization
(5) Non-convex optimization
Linear Programming
Linear programming (LP), very common:
$$\min_x \langle c, x \rangle \;\; \text{s.t.} \;\; Ax \le b$$
(a linear objective over a convex set, a polytope).
Quadratic Programming
Quadratic programming (QP):
$$\min_x \frac{1}{2} x^T Q x \;\; \text{s.t.} \;\; Ax \le b, \; A'x = b'$$
Example (SVM-type problems): $\min_x \|Ax - b\|_2^2 + \lambda \|Rx\|_2^2$
Smooth Convex Optimization
$$\min_x F_s(x) \;\; \text{s.t.} \;\; Ax \le b$$
Newton's algorithm:
$$x^{k+1} = x^k - \gamma \, [H_{F_s}(x^k)]^{-1} \nabla F_s(x^k)$$
with the Hessian matrix $H_{F_s}$, the gradient vector $\nabla F_s$, and an optimal time step γ.
Advantages: fast convergence, with rates $F(x^k) - F(x^\star) = O(1/k^2)$ or $O(e^{-k})$.
Non-Smooth Convex Optimization
$$\min_x F(x) \;\; \text{s.t.} \;\; Ax \le b, \qquad F(x^k) - F(x^\star) = O(1/k^2) \; \text{(optimal [Nesterov])}$$
Example (lasso): $\min_x \|Ax - b\|_2^2 + \lambda \|x\|_1$, where the L1 term encourages sparsity (feature selection).
Non-Convex Optimization
- No general theory for non-convex problems.
- Case-by-case mathematical analysis.
- What always works: the standard gradient descent algorithm,
$$x^{k+1} = x^k - \gamma \frac{\partial F}{\partial x}(x^k)$$
with time step γ.
Convex Relaxation for Matrix Completion
$$\min_X \mathrm{rank}(X) + \lambda \|I_{obs} \circ (X - M)\|_F^2, \qquad (I_{obs})_{ij} = \begin{cases} 1 & \text{if } ij \in \mathrm{obs} \\ 0 & \text{otherwise} \end{cases}$$
is relaxed by replacing the rank with the nuclear norm,
$$\|X\|_\star = \sum_{k=1}^{p = \min(m,n)} |\sigma_k(X)|$$
the sum of singular values given by the SVD $X = U \Sigma V^T$, $\Sigma = \mathrm{diag}(\sigma_1, ..., \sigma_p)$.
Primal-Dual Optimization
Algorithm (sketch):
Initialization: $X^{k=0} = M$, $Y^{k=0} = 0$. Iterate: a singular-value shrinkage step on the dual variable, $Y^{k+1} = U h_{1/\sigma}(\Sigma) V^T$ with $U \Sigma V^T = Y^k + \sigma X^k$ (SVD) and $h_\mu$ a thresholding function, followed by the data-fit update
$$X^{k+1} = \frac{X^k - \tau Y^{k+1} + \tau \lambda M}{1 + \tau \lambda I_{obs}}$$
Properties
Advantages:
(1) Unique solution (whatever the initialization).
(2) Well-posed optimization algorithms.
Limitations:
(1) Complexity is dominated by the SVD: $O(n^3)$.
(2) Memory requirement is $O(n^2)$.
Convex algorithms therefore do not scale up to big data.
Non-Convex Techniques
Combinatorial problem for robust recommendation:
$$\min_X \mathrm{rank}(X) + \lambda \|I_{obs} \circ (X - M)\|_F^2 \quad \text{(combinatorial, NP-hard)}$$
Non-convex relaxation by explicit factorization $X = LR$, with $L \in \mathbb{R}^{n \times r}$, $R \in \mathbb{R}^{r \times m}$, $r \ll n, m$:
$$\min_{L,R} \frac{1}{2}\|L\|_F^2 + \frac{1}{2}\|R\|_F^2 + \frac{\lambda}{2}\|I_{obs} \circ (LR - M)\|_F^2$$
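A gradient-descent sketch of this factorized objective in numpy (r, λ, step size, and data are arbitrary toy choices):

```python
import numpy as np

def complete(M, mask, r=5, lam=10.0, step=0.001, iters=2000):
    """Minimize 0.5||L||^2 + 0.5||R||^2 + 0.5*lam*||mask*(LR - M)||^2 by gradient descent."""
    n, m = M.shape
    L = 0.1 * np.random.randn(n, r)
    R = 0.1 * np.random.randn(r, m)
    for _ in range(iters):
        E = mask * (L @ R - M)              # residual on observed entries only
        L -= step * (L + lam * E @ R.T)     # gradient w.r.t. L
        R -= step * (R + lam * L.T @ E)     # gradient w.r.t. R
    return L @ R

M = np.outer(np.arange(1, 11), np.arange(1, 9)).astype(float)  # rank-1 toy ratings
mask = (np.random.rand(*M.shape) < 0.5).astype(float)          # observe ~50% of entries
X = complete(M, mask)                                          # filled-in rating matrix
```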
Properties
Advantages:
(1) The optimization problem is non-convex, but smooth and quadratic: standard solvers apply (conjugate gradient, Newton, etc.).
(2) Big-data optimization: as the objective is differentiable, stochastic gradient techniques can be used.
Outline
- Recommender Systems
- Google PageRank
- Collaborative Recommendation
- Convex Optimization
- Content Recommendation
- Hybrid Systems
- Conclusion
!"#$%&'(&%))*+'
./'
Gr = (Vr , Er , Wr )
Network of users'
Cols/products graph:!
Gc = (Vc , Ec , Wc )
Network of products'
!"#$%&'(&%))*+'
.0'
Content Recommendation !
[Huang-Chung-Ong-Chen02]!
Cols/products graph:'
Gc = (Vc , Ec , Wc )
Rows/users graph:'
Gr = (Vr , Er , Wr )
Recommendation !
=!
Matrix completion'
M'
X'
.1'
Formalization
Simple idea: diffuse the ratings on the networks of users and products.
Optimization formulation:
$$\min_X \|X\|_{\mathrm{diff},G} + \lambda \|I_{obs} \circ (X - M)\|_F^2, \qquad \|X\|_{\mathrm{diff},G} = \mathrm{tr}(X^T L X)$$
where L is the graph Laplacian.
Content Recommendation
Optimization problem:
$$\min_X \|X\|_{\mathrm{diff},G_{rows}} + \|X\|_{\mathrm{diff},G_{cols}} + \lambda \|I_{obs} \circ (X - M)\|_F^2$$
The optimality condition is a linear system $Ax = b$ in the vectorized X:
$$(I_m \otimes L_r + L_c \otimes I_n + \lambda I_{mn})\, \mathrm{vec}(X) = \lambda\, \mathrm{vec}(M)$$
Outline
- Recommender Systems
- Google PageRank
- Collaborative Recommendation
- Convex Optimization
- Content Recommendation
- Hybrid Systems
- Conclusion
Hybrid Systems
Combine collaborative (low-rank) and content (graph) recommendation.
Formalization:
$$\min_X \|X\|_\star + \gamma_r \, \mathrm{tr}(X^T L_r X) + \gamma_c \, \mathrm{tr}(X L_c X^T) + \lambda \|I_{obs} \circ (X - M)\|_F^2$$
State of the Art
Limitation: the graph Dirichlet regularization/smoothness forces two rows/columns of X to be similar if they are close on the graphs, which can over-smooth. A sharper alternative is graph TV regularization/smoothness.
Outline
- Recommender Systems
- Google PageRank
- Collaborative Recommendation
- Convex Optimization
- Content Recommendation
- Hybrid Systems
- Conclusion
Fundamental Property of Recommender Systems
[Figure: prediction error (the lower, the better) vs. number of available observations/ratings]
- Small number of ratings: content filtering / graph regularization wins; hybrid ≈ content recommender system.
- Large number of ratings: collaborative filtering wins; hybrid ≈ collaborative recommender system.
Conclusion:
(1) If there are not enough ratings, focus on collecting data features.
(2) When there are enough ratings, give less importance to features.
Summary
PageRank ranks data according to pairwise relationships.

Questions?
Data Science
Sept 12-14, 2016
Outline
- The Feature Extraction Problem
- Standard Principal Component Analysis (PCA)
- Sparse PCA
- Robust PCA
- PCA on Networks
- Non-Negative Matrix Factorization (NMF)
- Sparse Coding (SC)
- Conclusion
The Feature Extraction Problem
Goal: find the best possible representation of the data, one that reveals special structures useful for further applications (classification, recognition, etc.).
[Pipeline: raw data → apply filters → meaningful features]
Handcrafted Features
Domain expertise: handcrafted features are domain-dependent, i.e. designed by experts in specific fields with years of experience (and usually not generalizable to other fields).
Popular example: SIFT, the best image features in Computer Vision, used for many applications such as image recognition. It took some 30 years of experience (1966-1999) to design good image features!
[Pipeline: image → SIFT filters → SIFT features → features for image/object recognition]
Linear Representation
Formulation: $z = Dx$, where x is the high-dimensional datum, D is a dictionary of filters (or basis functions), and z holds the features, or coefficients, of x in the dictionary D:
$$z = Dx = \begin{bmatrix} \langle D_{1,\cdot}, x \rangle \\ \vdots \\ \langle D_{K,\cdot}, x \rangle \end{bmatrix}, \qquad z_i = \langle D_{i,\cdot}, x \rangle \; \text{(i-th coefficient, i-th filter)}$$
Linear Representation
How to learn D and z?
Techniques available: PCA, ICA, NMF, Sparse Coding, etc.
Which technique to choose? Each technique makes different assumptions about the data. Pick the one that matches your data properties (discussed later).
Matrix factorization: the linear representation of data can also be seen as a matrix factorization problem, $X = ZD$, with $X \in \mathbb{R}^{n \times d}$ (data), $Z \in \mathbb{R}^{n \times K}$ (features), $D \in \mathbb{R}^{K \times d}$ (dictionary).
Non-Linear Representation
Non-linear mapping φ:
- Linear representation: $x \to z = Dx$
- Non-linear representation: $x \to z = \varphi(x)$ (with $z \ne Dx$)
Examples:
(1) Non-linear PCA, Locally Linear Embedding (LLE), Laplacian Eigenmaps, t-Distributed Stochastic Neighbor Embedding (t-SNE) (Lecture 8).
(2) Deep Learning (Lectures 9-12).
Outline
- The Feature Extraction Problem
- Standard Principal Component Analysis (PCA)
- Sparse PCA
- Robust PCA
- PCA on Networks
- Non-Negative Matrix Factorization (NMF)
- Sparse Coding (SC)
- Conclusion
Formalization
PCA defines an orthogonal transformation that maps the data to a new coordinate system $(v_1, v_2, ..., v_K)$, called principal directions, such that the $v_k$ capture the largest possible data variances.
Notes:
(1) PCA requires the data to be centered.
(2) PCA does not say anything about data normalization, but its analysis may change (PCA is not invariant w.r.t. data normalization).
Covariance Matrix
Definition: with X the $n \times d$ data matrix (n = number of data points, d = number of dimensions), the covariance matrix is the $d \times d$ matrix
$$C = X^T X$$
Its diagonal entries $C_{\ell\ell} = \sum_{i=1}^{n} x_{i,\ell}^2$ are the variances of the data along the $\ell$-th dimension, and the off-diagonal entries $C_{\ell\zeta} = \sum_{i=1}^{n} x_{i,\ell} \, x_{i,\zeta}$ are the covariances between the $\ell$-th and $\zeta$-th dimensions, assuming zero-mean data: $\sum_{i=1}^{n} x_{i,\ell} = 0 \; \forall \ell$.
Principal Directions
The first principal direction (PD) $v_1$ captures the largest variance of the data:
$$v_1 = \arg\max_{\|v\|_2=1} \sum_{i=1}^{n} |\langle x_i, v \rangle|^2 = \arg\max_{\|v\|_2=1} v^T C v \;\Rightarrow\; C v_1 = \lambda_1 v_1, \quad v_1^T C v_1 = \lambda_1 \|v_1\|_2^2 = \lambda_1$$
The next directions maximize the variance orthogonally to the previous ones:
$$v_2 = \arg\max_{\|v\|_2=1} v^T C v \;\; \text{s.t.} \;\; \langle v, v_1 \rangle = 0 \;\Rightarrow\; C v_2 = \lambda_2 v_2, \; \text{etc.}$$
so that, altogether, $C = V \Lambda V^T$ with $V = [v_1, ..., v_d]$, $V^T V = I_d$, $\Lambda = \mathrm{diag}(\lambda_1, ..., \lambda_d)$.
Principal Components
Definition: the PCs are the coordinates of the original data projected onto the basis of principal directions (PDs):
$$X_{pca} = XV$$
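A numpy sketch of the whole recipe (center, decompose the covariance, project); the toy data shape is arbitrary:

```python
import numpy as np

X = np.random.randn(200, 6) @ np.random.randn(6, 6)   # toy correlated data: n=200, d=6
X = X - X.mean(axis=0)                                 # PCA requires centered data

C = X.T @ X                                            # d x d covariance matrix
lam, V = np.linalg.eigh(C)                             # EVD: eigenvalues ascending
lam, V = lam[::-1], V[:, ::-1]                         # reorder: largest variance first

X_pca = X @ V                                          # principal components
print(lam / lam.sum())                                 # fraction of variance per direction
```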
PCA with EVD or SVD
The SVD of the data matrix, $X = U \Sigma V^T$ with $U^T U = I_n$, $V^T V = I_d$, relates to the EVD of the covariance:
$$C = X^T X = V_{evd} \Lambda V_{evd}^T, \quad V_{svd} = V_{evd}, \quad \Sigma^2 = \Lambda \Rightarrow \lambda_k = \sigma_k^2, \quad X_{pca} = X V_{evd} = U_{svd} \Sigma$$
Q: PCA with EVD or SVD? It depends on the size of the data matrix X:
(1) For d > n: use SVD.
(2) For d < n: use EVD.
Truncation
Each datum decomposes on the PDs, $x_i = \langle x_i, v_1 \rangle v_1 + \langle x_i, v_2 \rangle v_2 + ...$, often dominated by $\langle x_i, v_1 \rangle v_1$.
The first PDs are enough to provide a good data representation, i.e. $\|X - X_K\|$ is small for $X_K = U_K \Sigma_K V_K^T$ (X truncated to the first K PDs), with K chosen such that
$$\frac{\sum_{k=1}^{K} \lambda_k}{\sum_{k=1}^{d} \lambda_k} \ge 0.9$$
[Figure: YaleB Faces dataset; leading PDs capture structure, trailing ones noise]
Outline
- The Feature Extraction Problem
- Standard Principal Component Analysis (PCA)
- Sparse PCA
- Robust PCA
- PCA on Networks
- Non-Negative Matrix Factorization (NMF)
- Sparse Coding (SC)
- Conclusion
Sparse PCA
Q: Is PCA interpretable?
Motivation: standard PCA is able to
(1) capture most of the variability information contained in the data,
(2) identify uncorrelated information (because the principal directions are orthogonal).
However, PCA is limited in feature interpretation: it is hard to identify the most relevant features for each principal direction.
Example: analysis of genes with standard PCA mixes all genes (Gene1, Gene2, Gene3, ...) in every direction.
Elastic PCA
Elastic PCA solves an elastic net regression problem:
$$\min_{A,B} \|X - XBA^T\|_F^2 + \lambda_2 \|B\|_F^2 + \lambda_1 \|B\|_1 \;\; \text{s.t.} \;\; A^T A = I_K$$
with a data fidelity term, an L2 term, and an L1 term that forces a sparse solution (elastic net regression). The sparse principal directions are the normalized columns of the solution $B^\star$:
$$sPD_j = V_{\cdot j} = \frac{B^\star_{\cdot j}}{\|B^\star_{\cdot j}\|_2}, \qquad X_{spca} = XV$$
Algorithm
The optimization problem is non-smooth but convex in each variable separately; alternate:
Initialization, then iterate until convergence:
- Step 1: $B^{m+1} = \arg\min_B \|X - XB(A^m)^T\|_F^2 + \lambda_2 \|B\|_F^2 + \lambda_1 \|B\|_1$ (elastic net step)
- Step 2: $A^{m+1} = \arg\min_{A^T A = I_K} \|X - XB^{m+1}A^T\|_F^2$ (orthogonality-constrained step)
Outline
- The Feature Extraction Problem
- Standard Principal Component Analysis (PCA)
- Sparse PCA
- Robust PCA
- PCA on Networks
- Non-Negative Matrix Factorization (NMF)
- Sparse Coding (SC)
- Conclusion
Robust PCA
Q: Is PCA robust to outliers?
Formalization
Robust PCA decomposes the data into structure plus outliers:
$$\min_{L,S} \mathrm{rank}(L) + \lambda \, \mathrm{card}(S) \;\; \text{s.t.} \;\; X = L + S \quad (1)$$
where L is a low-rank matrix (structure, as in standard PCA) and S is a sparse matrix capturing outliers (no structure). Problem (1) is combinatorial; its convex relaxation is
$$\min_{L,S} \|L\|_\star + \lambda \|S\|_1 \;\; \text{s.t.} \;\; X = L + S \quad (2)$$
Algorithm
ADMM technique: fast, robust, and accurate solutions. A sketch:
Initialization: $L^{m=0} = X$, $S^{m=0} = Z^{m=0} = 0$. Iterate:
$$L^{m+1} = U h_{1/r}(\Sigma) V^T \;\; \text{with SVD} \;\; U \Sigma V^T = X - S^m + Z^m/r$$
$$S^{m+1} = h_{\lambda/r}(X - L^{m+1} + Z^m/r)$$
$$Z^{m+1} = Z^m + r(X - L^{m+1} - S^{m+1})$$
where $h_\mu(x)$ is a (soft) thresholding function.
Outline
- The Feature Extraction Problem
- Standard Principal Component Analysis (PCA)
- Sparse PCA
- Robust PCA
- PCA on Networks
- Non-Negative Matrix Factorization (NMF)
- Sparse Coding (SC)
- Conclusion
PCA on Graphs
Q: Can we do PCA on networks like Facebook?
Motivation: when data similarities are available or can be computed, they enhance PCA.
Formalization:
$$\min_{L,S} \mathrm{rank}(L) + \lambda\,\mathrm{card}(S) + \gamma_G \|L\|_{G,\mathrm{smooth}} \;\; \text{s.t.} \;\; X = L + S$$
(forcing smoothness on graphs), with the continuous convex relaxation
$$\min_{L,S} \|L\|_\star + \lambda \|S\|_1 + \gamma_G \|L\|_{G,\mathrm{Dir}} \;\; \text{s.t.} \;\; X = L + S$$
Outline
- The Feature Extraction Problem
- Standard Principal Component Analysis (PCA)
- Sparse PCA
- Robust PCA
- PCA on Networks
- Non-Negative Matrix Factorization (NMF)
- Sparse Coding (SC)
- Conclusion
Non-Negative Matrix Factorization (NMF) [Lee-Seung99]
[Figure: PCA basis images vs. NMF parts-based basis images]
Matrix Factorization
PCA and NMF are both factorized models:
$$\text{PCA: } X \stackrel{svd}{=} U \Sigma V^T, \qquad \text{NMF: } X = LR \; \text{with} \; L, R \ge 0$$
with $X \in \mathbb{R}^{n \times m}$, $L \in \mathbb{R}^{n \times r}$, $R \in \mathbb{R}^{r \times m}$. The non-negativity constraints are essential to identify parts of the data.
[Example: users × movies matrix; L holds compressed user features, R holds compressed movie features.]
Linear Representation
Text document representation: X stores n = 20,000 text documents over m = 40,000 words. With X = LR, each document is represented by a linear combination of compressed word features, $x_i = L r_i$; similarly for each word, $x_j = R^T \ell_j$.
NMF Losses
$$X = LR \; \text{with} \; L, R \ge 0$$
(1) Frobenius loss: $\min_{L,R \ge 0} \|X - LR\|_F^2$
(2) Kullback-Leibler (relative entropy) loss, suited to histogram distances:
$$\min_{L,R \ge 0} \sum_{ij} X_{ij} \log \frac{X_{ij}}{(LR)_{ij}}$$
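A scikit-learn sketch of the Frobenius variant (shapes and r are toy choices):

```python
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.randn(100, 40))          # NMF needs non-negative data
model = NMF(n_components=5, init="random", max_iter=500)
L = model.fit_transform(X)                    # n x r: compressed row features
R = model.components_                         # r x m: compressed column features
print(np.linalg.norm(X - L @ R) / np.linalg.norm(X))  # relative reconstruction error
```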
Algorithms
Several techniques exist:
(1) Multiplicative update techniques.
Advantage: monotonic.
Limitation: slow to converge.
Outline
- The Feature Extraction Problem
- Standard Principal Component Analysis (PCA)
- Sparse PCA
- Robust PCA
- PCA on Networks
- Non-Negative Matrix Factorization (NMF)
- Sparse Coding (SC)
- Conclusion
Sparse Coding
Motivation: PCA and NMF make strong assumptions about the dictionary D used for the linear representation $z = Dx$:
- PCA: D captures the main directions of data variation.
- NMF: D captures the main common parts of the data.
Sparse coding instead learns D so that each datum is represented by only a few atoms.
Formalization
Optimization problem:
$$\min_{D, z_j} \sum_{j=1}^{n} \|x_j - D z_j\|_2^2 + \lambda \|z_j\|_1$$
with a constraint that controls the filter energies, $En_i = \|D_{i,\cdot}\|_2 = 1$ (which prevents the degenerate solution $\|D\| \to \infty$, $z_j \to 0$).
Algorithm
Non-smooth and convex (in each variable) optimization:
$$\min_{Z,D} \|X - DZ\|_F^2 + \lambda \|Z\|_1$$
Initialization: $D^{m=0} = \mathrm{randn}$; then alternate sparse-coding steps in Z and dictionary updates in D until convergence.
[Figure] Learned dictionary: filters resembling the human visual filters (V1 cells) of the primary visual cortex.
Outline
- The Feature Extraction Problem
- Standard Principal Component Analysis (PCA)
- Sparse PCA
- Robust PCA
- PCA on Networks
- Non-Negative Matrix Factorization (NMF)
- Sparse Coding (SC)
- Conclusion
Summary
The feature extraction problem:
(1) Handcrafted filters/features: less popular.
(2) Learned filters/features: more and more common.
Outline
- The Feature Extraction Problem
- Standard Principal Component Analysis (PCA)
- Sparse PCA
- Robust PCA
- PCA on Networks
- Non-Negative Matrix Factorization (NMF)
- Sparse Coding (SC)
- Conclusion
Questions?
Data Science
Sept 12-14, 2016
Outline
- Visualization Problem
- Kernel PCA
- Locally-Linear Embedding (LLE)
- Laplacian Eigenmaps
- t-SNE
- Conclusion
Visualization Problem
Data visualization is the same problem as
(1) data representation,
(2) feature extraction.
Data representation looks for the best filters or dictionary D in which the data x can be represented; the projected data z on D are used as coordinates for 2D or 3D visualization:
$$x_i \in \mathbb{R}^d, \; d \gg 1 \quad \to \quad z_i \in \mathbb{R}^2 \; \text{(2D)} \;\; \text{or} \;\; z_i \in \mathbb{R}^3 \; \text{(3D)}$$
Visualization Techniques
Visualization techniques are also dimensionality reduction techniques, because they aim at mapping data into a much lower-dimensional space: 2D or 3D Euclidean spaces.
Linear dimensionality reduction (LDR) techniques.
Assumption: the data can be represented on a low-dimensional hyperplane $\mathbb{R}^m$, $m \ll d$. LDR finds a linear mapping A such that
$$A: x_i \to z_i = A x_i \in \mathbb{R}^m$$
Non-linear dimensionality reduction (NLDR) techniques.
Assumption: the data lie on a manifold M with $\dim(M) = m \ll d$. NLDR finds a non-linear mapping φ such that
$$\varphi: x_i \to z_i = \varphi(x_i) \in \mathbb{R}^m$$
Outline
- Visualization Problem
- Kernel PCA
- Locally-Linear Embedding (LLE)
- Laplacian Eigenmaps
- t-SNE
- Conclusion
Kernel PCA [Scholkopf-Smola-Muller97]
Standard PCA can be computed from the Gram matrix:
$$G = XX^T = \begin{bmatrix} \langle x_1, x_1 \rangle & \langle x_1, x_2 \rangle & \cdots \\ \langle x_2, x_1 \rangle & \langle x_2, x_2 \rangle & \\ \vdots & & \ddots \end{bmatrix} \stackrel{EVD}{=} U D U^T \;\Rightarrow\; X_{pca} = U D^{1/2}$$
Kernel PCA replaces inner products by kernel evaluations:
$$G_{ij} = \langle \phi(x_i), \phi(x_j) \rangle = K(x_i, x_j), \; \text{e.g. } e^{-\|x-y\|_2^2/\sigma} \;\Rightarrow\; G \stackrel{EVD}{=} U D U^T, \;\; X_{kpca} = U D^{1/2}$$
The feature map φ is never computed!
Outline
- Visualization Problem
- Kernel PCA
- Locally-Linear Embedding (LLE)
- Laplacian Eigenmaps
- t-SNE
- Conclusion
Locally-Linear Embedding (LLE) [Roweis-Saul00]
Motivation: design a mapping from the high-dimensional space to the low-dimensional space such that the geometric distances between neighboring data points are preserved.
[Figure: neighborhoods in $\mathbb{R}^d$, $d \gg 3$, mapped to $\mathbb{R}^3$ with neighbor distances preserved]
Algorithm
Step 1: For each data point $x_i$, compute the k nearest neighbors.
Step 2: Compute linear patches: find the weights $W_{ij}$ which best linearly reconstruct $x_i$ from its neighbors:
$$\min_W \sum_{i=1}^{n} \left\| x_i - \sum_j W_{ij} x_j \right\|_2^2 \;\; \text{s.t.} \;\; \sum_j W_{ij} = 1 \; \forall i \quad \text{(solution: a linear system } Ax = b\text{)}$$
Step 3: Compute the embedding $Z = [z_1, ..., z_m]$ that preserves those local reconstructions:
$$\min_Z \sum_{i=1}^{n} \left\| z_i - \sum_j W_{ij} z_j \right\|_2^2 \;\; \text{s.t.} \;\; \sum_i z_i = 0, \; Z^T Z = I_m \quad \text{(solution: EVD)}$$
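A scikit-learn sketch on the classic swiss-roll toy set (k and the target dimension are arbitrary):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=1000)        # 3-D manifold data
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
Z = lle.fit_transform(X)                          # 2-D embedding for visualization
print(Z.shape, lle.reconstruction_error_)
```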
Demo: LLE
Run lecture08_code03.ipynb
Outline
- Visualization Problem
- Kernel PCA
- Locally-Linear Embedding (LLE)
- Laplacian Eigenmaps
- t-SNE
- Conclusion
Laplacian Eigenmaps [Belkin-Niyogi03]
Differential geometry: the eigenfunctions $v_k$ of the continuous Laplace-Beltrami operator $\Delta_M$ serve as embedding coordinates of the manifold M.
Formalization
1D visualization: map a graph G = (V, E, W) to a line such that neighboring data on G stay as close as possible on the line:
$$\min_y \sum_{ij} W_{ij} (y_i - y_j)^2 \quad (1)$$
In K dimensions this becomes
$$\min_Y \sum_k Y_{\cdot k}^T L Y_{\cdot k} = \min_Y \mathrm{tr}(Y^T L Y) \;\; \text{s.t.} \;\; Y^T Y = I_K$$
solved by the graph Laplacian spectrum: $L = U \Lambda U^T \to Y = U_K$.
Advantages:
(1) Global solutions (independent of initialization).
(2) Fast algorithms.
[Figure: MNIST embedded by PCA vs. Laplacian Eigenmaps; USPS by Laplacian Eigenmaps]
Outline
- Visualization Problem
- Kernel PCA
- Locally-Linear Embedding (LLE)
- Laplacian Eigenmaps
- t-SNE
- Conclusion
t-SNE
High-dimensional similarities (Gaussian) and low-dimensional similarities (Student-t):
$$p_{ij} = \frac{e^{-\|x_i - x_j\|_2^2/\sigma_i^2}}{\sum_{k \ne i} e^{-\|x_i - x_k\|_2^2/\sigma_i^2}}, \qquad q_{ij}(y) = \frac{(1 + \|y_i - y_j\|_2^2)^{-1}}{\sum_{k \ne l} (1 + \|y_k - y_l\|_2^2)^{-1}}$$
Optimizing the Kullback-Leibler Divergence
Problem: $\min_y \sum_{ij} p_{ij} \log \frac{p_{ij}}{q_{ij}(y)}$, optimized by gradient descent:
$$y_i^{m+1} = y_i^m - \gamma \sum_j (p_{ij} - q_{ij}) \, (1 + \|y_i^m - y_j^m\|_2^2)^{-1} (y_i^m - y_j^m)$$
Advantages:
(1) Local distance preservation (as with Laplacian Eigenmaps, LLE): minimizing KL forces $q_{ij}$ to be close to $p_{ij}$, the distribution of the high-dimensional data.
(2) t-SNE does not assume the existence of a manifold: more flexibility to visualize more complex hidden structures.
Limitations:
(1) Non-convex energy: existence of bad local solutions, problem of initialization (PCA is used as initialization).
(2) Slow optimization (gradient descent).
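In practice one calls a library; a scikit-learn sketch on digit images (parameters left at sensible defaults):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)               # 1797 images of digits, 64-D each
Z = TSNE(n_components=2, init="pca").fit_transform(X)
print(Z.shape)                                    # (1797, 2): coordinates for a scatter plot
```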
Demo: t-SNE
Run lecture08_code05.ipynb
Outline
- Visualization Problem
- Kernel PCA
- Locally-Linear Embedding (LLE)
- Laplacian Eigenmaps
- t-SNE
- Conclusion
Summary
[Decision map] High-dimensional data → low-dimensional data:
- Linear structure: $z_i = A x_i$
  - Variability structure: PCA (1901) — most popular; mathematically sound; unique solution.
  - Sparsity structure: Sparse Coding (1997).
- Non-linear structure: $z_i = \varphi(x_i)$ (non-linear mapping/embedding)
  - Kernel PCA (1998) — popular.
  - LLE (2000), Laplacian Eigenmaps (2000) — popular; manifold assumption sometimes too strong.
  - t-SNE (2008) — most popular for visualization.
Gephi
Q: What visualization software is the most used (for graphs)? A: Gephi.
Questions?
Data Science
Sept 12-14, 2016
Outline
- The Classification Problem
- Nearest Neighbor Classifier
- Linear Classifier
- Loss Function
- Softmax Classifier
- Neural Network Classifier
- Brain Analogy
- Conclusion
Classification Problem
Q: What is the classification problem?
Classification is a core problem in many applications:
(1) Computer Vision: image classification; image → class (original deep learning [Hinton-et.al.12])
(2) Speech: sound recognition; sound → class (original deep learning [Dahl-et.al.12])
(3) Text documents: text categorization; text → class (Wikipedia analysis)
(4) Neuroscience: brain functionality; activation pattern → vision, hearing, body control
Image Classification
We will consider the image classification problem in Computer Vision as a generic classification problem (generalization will be discussed in Lecture 11).
Main Challenge
Bridge the semantic gap between raw data (an N-D array of numbers) and cognitive/human understanding.
Challenges include illumination changes, object deformation, occlusion, background clutter, and intra-class variation.
Note: collecting data is easy (big data era), but labeling is time consuming.
Outline
- The Classification Problem
- Nearest Neighbor Classifier
- Linear Classifier
- Loss Function
- Softmax Classifier
- Neural Network Classifier
- Brain Analogy
- Conclusion
Nearest Neighbor Classifier
[Figure: training set; a test datum; its nearest datum in the training set]
Test Time!
!! Q: What is the test time? And how does the classification speed depend on the size n of the training data? A: O(n), i.e. it grows linearly with n. This is a (major) limitation: fast test time is preferred in practice.!
Note: Neural Networks have fast test time, but expensive training time. !
!! Partial solution: use approximate nearest neighbor techniques, which find approximate nearest neighbors quickly. !
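A minimal NumPy sketch of the (exact) 1-NN classifier, to make the O(n) test time concrete (a toy helper, not the course's code):

    import numpy as np

    def nn_predict(X_train, y_train, X_test):
        # Squared Euclidean distance between every test and training point:
        # cost grows linearly with the number n of training points.
        d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
        return y_train[d.argmin(axis=1)]   # copy the label of the nearest point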
!"#$%&'(&%))*+'
,-'
k-NN classifier (here k=5): for each test data point, find the k nearest data points in the training set.!
Illustration!
Data (with an outlier), NN/1-NN classifier, 5-NN classifier.!
Hyperparameters!
Q: What is the difference between a parameter and a hyperparameter?!
!! There exist two types of parameters:!
(1) Parameters: variables that can be estimated by optimization.!
(2) Hyperparameters: variables that cannot be estimated by optimization; they are estimated by cross-validation.!
!! Examples of hyperparameters: the distance metric (L2, L1, cosine, Kullback-Leibler?) and the k value (k=1,2,5,10,15?).!
Q: What is cross-validation?!
!! Cross-validation:!
Q: Why not try out which hyperparameters work best on the test set? Bad idea: the test set is reserved for measuring generalization performance. Use it only after training is done. !
Cross-Validation!
!! Split the training data into a training set and a validation set:!
Validation data: used to test hyperparameters.!
Training data: used to learn the classifier.!
Cross-Validation Result!
!! Example of 5-fold cross-validation for finding the value of k:!
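A hedged sketch of such a 5-fold cross-validation, assuming scikit-learn (the estimator and helper below are sklearn's, not the lecture's code):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)
    for k in [1, 2, 5, 10, 15]:
        acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
        print(k, acc.mean())   # keep the k with the best validation accuracy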
!"#$%&'(&%))*+'
,2'
!"#$%&'(&%))*+'
,3'
Xavier Bresson
19
Outline!
The Classification Problem!
Nearest Neighbor Classifier!
Linear Classifier !
Loss Function!
Softmax Classifier!
Neural Network Classifier!
Brain Analogy!
Conclusion !
Linear Classifier!
!! Image classification task: image → class of images (e.g. CAT). Input: array 32x32x3.!
!! Linear classifier: !
input'
input
f (x, W, b) = W x + b
vectorize'
3D array !
32x32x3'
O#set/!
Bias'
10 numbers!
indicating
class scores!
(highest is
the choice)!
s!
1D array !
10x1!
x!
1D array
3072x1!
Linear classifier/!
Score function:'
f = Wx + b
10x1!
!"#$%&'(&%))*+'
Parameters/!
Weights'
Weights
10x3072! 3072x1!
10x1!
Each row W_{i,:} (1x3072) can be un-vectorized back into a 32x32x3 template for class i.!
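A minimal NumPy sketch of the score function with the shapes above (random weights stand in for learned ones):

    import numpy as np

    W = np.random.randn(10, 3072) * 0.01        # weights: 10x3072
    b = np.zeros(10)                            # offset/bias: 10x1
    x = np.random.rand(32, 32, 3).reshape(-1)   # vectorize: 3072x1
    s = W.dot(x) + b                            # 10 class scores
    print(s.argmax())                           # predicted class = highest score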
Outline!
The Classification Problem!
Nearest Neighbor Classifier!
Linear Classifier !
Loss Function!
Softmax Classifier!
Neural Network Classifier!
Brain Analogy!
Conclusion !
SVM Loss!
L_i = Σ_{j≠i} max(0, s_j − s_i + 1), where the "+1" comes from the margin.!
The SVM loss measures how well the weights are chosen so that the correct class gets the highest possible score: L_i is 0 when x_i is well classified, that is when s_i is the highest score for its own class y_i, and L_i is large when x_i is misclassified. !
Example when x_i is well classified vs. when x_i is misclassified:!
Total SVM loss: L = (1/n) Σ_{i=1}^n L_i!
Q: What is the min value of L? A: 0.!
Q: What is the max value of L? A: +∞. !
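A minimal sketch of the per-example SVM loss defined above (NumPy; the numbers are illustrative):

    import numpy as np

    def svm_loss(s, yi):
        # s: vector of class scores, yi: index of the correct class
        margins = np.maximum(0, s - s[yi] + 1)
        margins[yi] = 0                  # the sum runs over j != yi
        return margins.sum()

    print(svm_loss(np.array([3.2, 5.1, -1.7]), 0))   # misclassified: loss 2.9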
Loss Functions!
Q: Will we get the same classification for this loss function? A: Probably not.!
L_i = Σ_{j≠i} max(0, s_j − s_i + 1)²!
Non-Uniqueness of Solutions!
Optimization problem:!
min_W (1/n) Σ_{i=1}^n L_i(W)   (1)!
= min_W (1/n) Σ_i Σ_{j≠i} max(0, s_j − s_i + 1) = min_W (1/n) Σ_i Σ_{j≠i} max(0, (Wx)_j − (Wx)_i + 1)!
Example: if W is a solution, then 2W is also a solution (the margins are only enlarged).!
Regularization!
!! Remember Lecture 5: !
min_W (1/n) Σ_{i=1}^n L_i(W) + λ||W||_F²!
!! Regularization terms:!
(1) L2 regularization: smooth and differentiable.!
(2) L1 regularization: non-smooth and non-differentiable, but promotes sparsity (a few non-zero elements).!
(3) Elastic net regularization: mixture of L1 and L2.!
!"#$%&'(&%))*+'
.,'
Xavier Bresson
32
Outline!
The Classification Problem!
Nearest Neighbor Classifier!
Linear Classifier !
Loss Function!
Softmax Classifier!
Neural Network Classifier!
Brain Analogy!
Conclusion !
Softmax Classifier!
!! Softmax classifier = multinomial logistic regression. !
!! Motivation (from statistics): maximize the log-likelihood of the score probabilities of the classes:!
Scores = unnormalized log-probabilities of the classes: s_i = f(x_i, W)!
Softmax function: P(Y = y_i | X = x_i) = e^{s_i} / Σ_j e^{s_j}!
Loss: L_i = −log P(Y = y_i | X = x_i) = −log( e^{s_i} / Σ_j e^{s_j} )!
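A minimal sketch of the softmax loss, with the standard max-shift for numerical stability (exponentials of large scores would overflow):

    import numpy as np

    def softmax_loss(s, yi):
        s = s - s.max()                     # stability trick, same probabilities
        p = np.exp(s) / np.exp(s).sum()     # class probabilities
        return -np.log(p[yi])               # L_i = -log P(Y=y_i | X=x_i)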
Outline!
The Classification Problem!
Nearest Neighbor Classifier!
Linear Classifier !
Loss Function!
Softmax Classifier!
Neural Network Classifier!
Brain Analogy!
Conclusion !
!! Image classification task: image → class of images (e.g. CAT). Input: array 32x32x3.!
!! Linear classifier: !
f = W x!
The 3D array 32x32x3 is vectorized into x, a 1D array of size 3072x1; W is 10x3072; the scores s form a 1D array of size 10x1.!
!! 2-layer classifier: !
f = W_2 max(W_1 x, 0)!
Vectorize the 3D array 32x32x3 into x (1D array, 3072x1).!
Weights: W_1 is 100x3072, W_2 is 10x100.!
Hidden layer with non-linear activation: h = max(W_1 x, 0), a 1D array of size 100x1.!
Output: f = W_2 max(W_1 x, 0), a 1D array of size 10x1.!
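A minimal NumPy sketch of this forward pass, with the shapes above (random weights stand in for learned ones):

    import numpy as np

    W1 = np.random.randn(100, 3072) * 0.01   # 100x3072
    W2 = np.random.randn(10, 100) * 0.01     # 10x100
    x  = np.random.rand(3072)                # vectorized image
    h  = np.maximum(W1.dot(x), 0)            # hidden layer + ReLU: 100x1
    f  = W2.dot(h)                           # 10 class scores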
The need for more structure: FC networks are very generic but also highly computationally expensive to learn (huge number of parameters). They cannot be deep! !
However, using special structures of the data (like local stationarity in convolutional neural networks, and recurrence in recurrent neural networks) allows us to construct deep networks that can be learned (discussed later).!
Test Time!
!! Once training is done, it is fast to classify new data (simple linear
algebra operations):!
!"#$%&'(&%))*+'
/,'
!"#$%&'(&%))*+'
/-'
!"#$%&'(&%))*+'
/.'
Online Demo!
!! ConvNetJS:
http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html!
More neurons = more capacity. Regularization handles outliers.!
!"#$%&'(&%))*+'
//'
Outline!
The Classification Problem!
Nearest Neighbor Classifier!
Linear Classifier !
Loss Function!
Softmax Classifier!
Neural Network Classifier!
Brain Analogy!
Conclusion !
Brain Analogy!
Wx + b
!"#$%&'(&%))*+'
/1'
!"#$%&'(&%))*+'
/2'
Outline!
The Classification Problem!
Nearest Neighbor Classifier!
Linear Classifier !
Loss Function!
Softmax Classifier!
Neural Network Classifier!
Brain Analogy!
Conclusion !
Summary!
Image/data classification: given a training set, design a classifier and predict labels for the test set. !
Linear/softmax classifier:!
Predicts labels with a linear function.!
Has been used for a long time (kernel techniques) but has been overtaken by deep learning.!
Score function: f = W x + b!
SVM loss function: L_i = Σ_{j≠i} max(0, s_j − s_i + 1)!
Softmax loss function: L_i = −log( e^{s_i} / Σ_j e^{s_j} )!
Summary!
!! Standard Neural Networks (NNs):!
Neurons arranged as fully connected layers. !
Series of linear functions and non-linear activations.!
Fast test time (matrix multiplications).!
Performance: bigger = better, but expensive training time (thanks, GPUs!).!
Bigger = (layer) width and depth (deep).!
!"#$%&'(&%))*+'
05'
Questions?
Data Science!
Sept 12-14, 2016!
Outline!
Generic Gradient Descent Techniques!
Backpropagation!
Activation!
Weight Initialization!
Neural Network Optimization!
Dropout!
Conclusion!
s = f(W, x) = W x!
Weights W: they are found by minimizing a loss function which quantifies how well the training data have been classified:!
(1) SVM loss: L_i(W) = Σ_{j≠i} max(0, s_j − s_i + 1)!
(2) Softmax loss: L_i(W) = −log( e^{s_i} / Σ_j e^{s_j} )!
(3) Regularization: E(W) = Σ_i L_i(W) + λ R(W)!
Gradient Operator!
!! Two types: !
(1) Analytic gradient: ∇_W E = ∂E/∂W = explicit formula.!
(2) Numerical gradient: ΔE/ΔW = ( E(W + ΔW) − E(W) ) / ΔW!
Analytic Gradient!
Properties:!
(1) Exact value (use Calculus).!
(2) Fast to evaluate.!
Example: E(W) = ||W||_F², ∇_W E = ∂E/∂W = 2W!
Update Rule!
!! Update: !
W^{m+1} = W^m + ΔW, with ΔW = −τ ∇_W E(W^m)!
τ is the time step / learning rate / step size; it controls the speed of gradient descent techniques.!
The update moves W in the negative gradient direction.!
!! Code: !
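A minimal sketch of this update rule on the example above, E(W) = ||W||_F² with analytic gradient 2W (τ is the learning rate):

    import numpy as np

    W, tau = np.random.randn(10, 10), 0.1
    for m in range(100):
        grad = 2 * W              # analytic gradient of ||W||_F^2
        W = W - tau * grad        # W^{m+1} = W^m - tau * grad E(W^m)
    print(np.linalg.norm(W))      # close to 0, the minimizer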
!"#$%&'(&%))*+'
2'
Monotonicity!
!! The loss/energy value decreases monotonically at each iteration m:!
E(W) = Σ_{i=1}^n L_i(W) + R(W)!
The analytic gradient uses all the data at the same time, but it is not always possible to load all the data in memory!!
!"#$%&'(&%))*+'
3'
min_W L(W) = (1/n) Σ_{i=1}^n L_i(W)!
Split the data into q batches of sizes n_1 + n_2 + ... + n_q = n, and decompose the loss and its gradient batch by batch:!
∇L(W) combines the batch gradients (1/n_1) Σ_{i=1}^{n_1} ∇L_i(W), (1/n_2) Σ ∇L_i(W), ..., (1/n_q) Σ ∇L_i(W).!
Full (batch) gradient descent uses all the data: W^{m+1} = W^m − τ ∇L(W^m)!
Stochastic gradient descent (SGD) uses one batch j per update: W^{m+1} = W^m − τ (1/n_j) Σ_{i=1}^{n_j} ∇L_i(W^m)!
With stochastic updates, the loss E(W^m) no longer decreases monotonically, but it decreases on average as m → ∞.!
Stochastic Monotonicity!
!! Code:!
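A minimal mini-batch SGD sketch (loss_grad is a hypothetical function returning the average gradient over a batch, not a course API):

    import numpy as np

    def sgd(W, X, y, loss_grad, tau=0.01, batch=256, steps=1000):
        n = X.shape[0]
        for m in range(steps):
            idx = np.random.choice(n, batch, replace=False)  # sample a batch
            W = W - tau * loss_grad(W, X[idx], y[idx])       # one batch step
        return W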
Large τ, small τ, optimal τ: the learning rate controls the loss decrease.!
Outline!
Generic Gradient Descent Techniques!
Backpropagation!
Activation!
Weight Initialization!
Neural Network Optimization!
Dropout!
Conclusion!
Computational Graph!
!! Neural networks (NNs) are represented by computational graphs (CGs).!
Definition: A series of operators applied to inputs. Easy to combine (lego
strategy), can be huge.!
Usefulness: clear visualization of NN operations (great for debugging).!
CGs are essential to derive gradients by backpropagation.!
Computational graph example: Google TensorFlow.!
Backpropagation!
Definition: A recursive application of chain rule along a
computational graph (CG) provides the gradients of all inputs,
weights, intermediate variables.
Chain rule (Calculus):!
∂L(F(x))/∂x = (∂L/∂F) · (∂F/∂x)!
Local Rule!
!! Any computational graph is a series of elementary neurons (also called nodes, gates). The gradient of the loss w.r.t. the inputs x, y of a local neuron can be computed with the local rule (chain rule):!
Gradient of L w.r.t. x, y = recursive gradient * local gradient w.r.t. x, y!
Backpropagation Techniques!
!! Backpropagation consists of two steps:!
(1) Forward pass/flow: compute the final loss value and all intermediate output values of the neurons/nodes. Save them in memory for the gradient computations (in the backward step).!
(2) Backward pass/flow: compute the gradient of the loss function w.r.t. all variables on the network using the local gradient rule.!
Forward flow: compute loss values. Backward flow: compute gradient values.!
An Example!
!! Step 1: start the backward pass at the output with ∂f/∂f = 1.!
!! Steps 2, 3, ...: propagate the gradient backwards through the graph, node by node, multiplying by each local gradient.!
Another Example!
Backpropagation Implementation!
Forward and Backward Functions!
!! Code:!
!"#$%&'(&%))*+'
-,'
Backpropagation Implementation!
Forward and Backward Functions!
!! Pseudo-code:!
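A minimal sketch of such a forward/backward pair for one elementary gate (a multiplication node); the forward pass caches its inputs, the backward pass applies the local rule:

    class MultiplyGate:
        def forward(self, x, y):
            self.x, self.y = x, y          # save values for the backward pass
            return x * y

        def backward(self, dz):
            # dz = recursive gradient of the loss w.r.t. the gate output;
            # local gradients of x*y are y (w.r.t. x) and x (w.r.t. y).
            return dz * self.y, dz * self.x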
!"#$%&'(&%))*+'
--'
Jacobian matrix: [∂f/∂x]_ij = ∂f_i/∂x_j!
Chain rule with the Jacobian: ∂L/∂x = (∂f/∂x)ᵀ ∂L/∂f!
Example: for a 4096x1 input and a 4096x1 output, the Jacobian is 4096x4096.!
Example!
!! Activation gate:!
!"#$%&'(&%))*+'
-/'
Backpropagation Cost!
Cost of the backward pass ≈ cost of the forward pass (slightly higher). !
The backward pass requires storing the forward values!!
Outline!
Generic Gradient Descent Techniques!
Backpropagation!
Activation!
Weight Initialization!
Neural Network Optimization!
Dropout!
Conclusion!
Activation Functions!
!! Reminder: neural network classifiers are a succession of linear classifications and non-linear activations.!
Examples: 2-layer classifier: f = W_2 max(W_1 x, 0)!
3-layer classifier: f = W_3 max(W_2 max(W_1 x, 0), 0)!
The max is the activation function.!
Sigmoid Activation!
!! Historically popular by analogy with neurobiology.!
Sigmoid: σ(x) = 1/(1 + e^{−x})!
!! Issues:!
(1) Saturated neurons kill gradients: ∇σ = σ(1 − σ) ≈ 0 when σ saturates, and with z = Σ_i w_i x_i + b:!
∂f/∂w_i = (∂f/∂z)(∂z/∂w_i) = (∂f/∂z) x_i!
so the gradients w.r.t. the weights vanish too.!
Tanh!
σ(x) = tanh(x)!
!! Issue: saturated neurons still kill gradients → vanishing gradient problem (discussed later).!
ReLU!
σ(x) = max(x, 0)!
!! Advantages: !
(1) Converges ~6x faster than sigmoid/tanh.!
(2) Does not saturate in the positive region.!
(3) Max is computationally efficient. !
!! Limitations:!
(1) Not a zero-centered function.!
(2) It kills the gradient when the input is negative. !
Standard trick: initialize the neurons with a small positive bias like 0.01. !
Leaky ReLU!
!! In practice: use ReLU, try out Leaky ReLU, do not expect much from tanh, never use sigmoid. !
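The four activations above as one-line NumPy sketches:

    import numpy as np

    sigmoid    = lambda x: 1.0 / (1.0 + np.exp(-x))      # saturates: kills gradients
    tanh       = np.tanh                                 # zero-centered, still saturates
    relu       = lambda x: np.maximum(x, 0)              # default choice
    leaky_relu = lambda x: np.where(x > 0, x, 0.01 * x)  # small slope for x < 0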
!"#$%&'(&%))*+'
.-'
!"#$%&'(&%))*+'
..'
Outline!
Generic Gradient Descent Techniques!
Backpropagation!
Activation!
Weight Initialization!
Neural Network Optimization!
Dropout!
Conclusion!
Weight Initialization!
Q: What happens when the initialization W=0 is used? !
A: All neurons compute the same outputs and the weight updates are the same. !
⇒ Need to break the symmetry!!
Small random numbers work well for small networks, but not for deep networks.!
⇒ It is tricky to set a good value for the standard deviation of the normal distribution: σ ∈ [0.01, 1].!
Batch Normalization!
!! Node/gate in NN: normalize each feature x^k over the batch:!
x̂^k = (x^k − E(x^k)) / √Var(x^k)!
then rescale with learnable parameters γ^k, β^k:!
y^k = γ^k x̂^k + β^k!
Note: the network can learn the identity mapping if it wants to, by setting γ^k = √Var(x^k) and β^k = E(x^k).!
Properties!
Pseudo-code:!
!! Properties:!
(1) Reduces the strong dependence on initialization.!
(2) Improves the gradient flow through the network.!
(3) Allows higher learning rates → the network learns faster.!
(4) Acts as regularization.!
!! Price: ~30% more computational time.!
!! At test time: the mean and variance are estimated during training and the averaged values are used.!
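A minimal sketch of the batch-normalization forward pass at training time (γ, β are the learnable parameters; eps avoids division by zero):

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        mu, var = x.mean(axis=0), x.var(axis=0)   # per-feature batch statistics
        xhat = (x - mu) / np.sqrt(var + eps)      # normalize
        return gamma * xhat + beta                # learnable rescaling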
!"#$%&'(&%))*+'
/-'
!"#$%&'(&%))*+'
/.'
Outline!
Generic Gradient Descent Techniques!
Backpropagation!
Activation!
Weight Initialization!
Neural Network Optimization!
Dropout!
Conclusion!
Example: the gradient of the loss is steep vertically but flat horizontally, so plain SGD zig-zags.!
Momentum [Hinton-et.al86] !
!! New update rule (velocity v, friction μ):!
v^{m+1} = μ v^m − τ ∇f(x^m), x^{m+1} = x^m + v^{m+1}!
!! Advantages: !
(1) Velocity builds up along flat directions.!
(2) Velocity decreases in steep directions.!
Limitation of Momentum!
Momentum can overshoot the minimum (too much velocity) but overall converges faster than SGD.!
In practice: !
(1) μ = 0.5 or 0.9.!
(2) Initialization: v = 0.!
Nesterov Momentum!
!! Nesterov accelerated gradient (NAG) technique used for momentum
update:!
v^{m+1} = μ v^m − τ ∇f(x^m + μ v^m) (the only change: the gradient is evaluated at the look-ahead point)!
x^{m+1} = x^m + v^{m+1}!
AdaGrad [Duchi-et.al11]!
!! Origin: Convex optimization.!
!! Update rule: accumulate the squared gradients and divide the step by their square root:!
g^{m+1} = g^m + (∇f(x^m))², x^{m+1} = x^m − τ ∇f(x^m) / (√g^{m+1} + ε)!
The small ε prevents division by 0.!
Limitation: the accumulated g keeps growing, so the step size decays and learning eventually stops.!
RMSProp [Hinton12]!
!! RMSProp update rule: replace the accumulated sum by a decaying average, so it does not stop the learning process:!
g^{m+1} = γ g^m + (1 − γ)(∇f(x^m))²!
Adam [Kingma-Ba14]!
!! Adam = Momentum + Adagrad/RMSProp!
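Hedged sketches of the three update rules (dx is the current gradient; v, cache, m start at 0; the hyperparameter values are common defaults, not from the slides):

    import numpy as np
    eps = 1e-8

    def momentum_step(x, v, dx, tau=0.01, mu=0.9):
        v = mu * v - tau * dx                    # velocity with friction mu
        return x + v, v

    def rmsprop_step(x, cache, dx, tau=0.01, decay=0.99):
        cache = decay * cache + (1 - decay) * dx**2
        return x - tau * dx / (np.sqrt(cache) + eps), cache

    def adam_step(x, m, v, dx, t, tau=0.01, b1=0.9, b2=0.999):
        m = b1 * m + (1 - b1) * dx                 # momentum part
        v = b2 * v + (1 - b2) * dx**2              # RMSProp part
        mh, vh = m / (1 - b1**t), v / (1 - b2**t)  # bias correction, t >= 1
        return x - tau * mh / (np.sqrt(vh) + eps), m, v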
!"#$%&'(&%))*+'
0.'
Learning Rate Decay!
(1) Exponential decay: τ = τ_0 e^{−κ m}!
(2) 1/t decay: τ = τ_0 / (1 + κ m)!
!! Common good practice: babysit the loss value and the learning rate.!
Outline!
Generic Gradient Descent Techniques!
Backpropagation!
Activation!
Weight Initialization!
Neural Network Optimization!
Dropout!
Conclusion!
Why Does It Work?!
It prevents overfitting: it reduces the number of parameters to learn for the NN.!
Code!
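A minimal sketch of "inverted" dropout: randomly zero neurons at training time and rescale, so the test-time pass is unchanged (not the course's exact code):

    import numpy as np

    def dropout_forward(h, p=0.5, train=True):
        if not train:
            return h                               # test time: identity
        mask = (np.random.rand(*h.shape) < p) / p  # keep each neuron w.p. p
        return h * mask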
Demo: Dropout!
!! Run lecture10_code04.ipynb!
!"#$%&'(&%))*+'
15'
Outline!
Generic Gradient Descent Techniques!
Backpropagation!
Activation!
Weight Initialization!
Neural Network Optimization!
Dropout!
Conclusion!
Summary!
!! Training Neural Networks:!
(1) Sample a batch of data.!
(2) Forward-prop it through the graph, get the loss value.!
(3) Backprop to calculate the gradients.!
(4) Update the parameters using the gradients.!
!"#$%&'(&%))*+'
1-'
Summary!
Weight initializations:!
(1) Xavier's initialization (default)!
(2) Batch Normalization (~30% additional cost)!
Parameter updates/optimization:!
(1) SGD!
(2) Momentum!
(3) Nesterov momentum!
(4) Adagrad/RMSProp!
(5) Adam (default)!
Dropout regularization!
Questions?
Data Science!
Sept 12-14, 2016!
Outline!
Step 1: Pre-process data!
Step 2: Choose NN Architecture!
Step 3: Monitor Loss Decrease!
Step 4: Hyperparameter Optimization!
Step 5: Monitor Test Accuracy!
Outline!
Step 1: Pre-process data!
Step 2: Choose NN Architecture!
Step 3: Monitor Loss Decrease!
Step 4: Hyperparameter Optimization!
Step 5: Monitor Test Accuracy!
Example!
CIFAR:!
!! Initialization:!
(1) Small networks: normal distribution with 0.01 standard deviation.!
(2) Large networks: Xavier's initialization. !
Outline!
Step 1: Pre-process data!
Step 2: Choose NN Architecture!
Step 3: Monitor Loss Decrease!
Step 4: Hyperparameter Optimization!
Step 5: Monitor Test Accuracy!
Outline!
Step 1: Pre-process data!
Step 2: Choose NN Architecture!
Step 3: Monitor Loss Decrease!
Step 4: Hyperparameter Optimization!
Step 5: Monitor Test Accuracy!
Hyperparameters!
!! List:!
(1) Network architecture!
(2) Learning rate, decay schedule!
(3) Regularization: L2 and dropout!
!"#$%&'(&%))*+'
,.'
Outline!
Step 1: Pre-process data!
Step 2: Choose NN Architecture!
Step 3: Monitor Loss Decrease!
Step 4: Hyperparameter Optimization!
Step 5: Monitor Test Accuracy!
Questions?
Data Science!
Sept 12-14, 2016!
Outline!
History of CNNs!
Standard CNNs!
CNNs for Graph-Structured Data!
Conclusion!
A Brief History!
!! Hubel and Wiesel: Nobel Prize in Medicine (1981) for understanding the primary visual cortex system (experiments starting in 1959).!
!! Visual system is composed of receptive fields called V1 cells that are
composed of neurons that activate depending on the orientation.!
!"#$%&'(&%))*+'
.'
!"#$%&'(&%))*+'
/'
Perceptron [Rosenblatt57]!
!! Application: character recognition.!
The Perceptron was hardware only (circuits, electronics), no code/simulations.!
The Perceptron was connected to a camera that produced 400-pixel images.!
Update rule: W^{t+1} = W^t + τ (D − Y^t) X!
Activation: σ(x) = 1 if ⟨w, x⟩ + b > 0, and 0 otherwise.!
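A minimal sketch of this update rule in code (the Perceptron itself was hardware; the names and the learning rate τ below are illustrative):

    import numpy as np

    def perceptron_train(X, D, tau=0.1, epochs=10):
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            for x, d in zip(X, D):                       # targets d in {0, 1}
                y = 1.0 if w.dot(x) + b > 0 else 0.0     # sigma(<w,x> + b)
                w, b = w + tau * (d - y) * x, b + tau * (d - y)
        return w, b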
!"#$%&'(&%))*+'
0'
Neocognitron [Fukushima80]!
Application: handwritten character recognition.!
Direct implementation of Hubel-Wiesel simple and complex cells (V1 and V2 cells) with a hierarchical organization.!
Introduction of the concept of local features (receptive fields).!
No concept of loss function, no gradient, no backpropagation → learning was poor.!
Inspired convolutional neural networks (CNNs).!
Backpropagation [Rumelhart-et.al86] !
Introduction of backpropagation: concepts of loss function, gradient, gradient descent.!
Issue: backprop did not work for large-scale/deep NNs (vanishing gradient problem).!
DeepArts !
Outline!
History of CNNs!
Standard CNNs!
CNNs for Graph-Structured Data!
Conclusion!
Key idea: Learn local stationary structures and compose them to form
multiscale hierarchical patterns.!
Why are CNNs good? It is an open (mathematical) question to prove the efficiency of CNNs.!
Note: despite the lack of theory, the entire ML and CV communities have shifted to deep learning techniques! E.g. NIPS16: 2326 submissions, 328 on DL (14%), 90 on convex optimization (3.8%). !
Local Stationarity!
!! Assumption: Data are locally stationary
across the data domain:!
"
!"#$"%&'"#(%)
*+%,-./)
F1
F2
F3
!"#$%&'(&%))*+'
x F1
x F2
x F3
,0'
Each neuron sees a local filter/receptive field of the input volume (height x width x depth) and computes w · x + b.!
Layer 2, Layer 3, Layer 4: deep/hierarchical features (from simple to abstract), with 2x2 max pooling between layers.!
Classification Function!
Classifier: after extracting multiscale locally stationary features, use them to design a classification function with the training labels.!
How to design a (linear) classifier? Fully connected neural networks:!
x_out = W x_layer!
Features → output signal → class labels (Class 1, Class 2, ..., Class K).!
Standard CNN architecture:!
Input signal x^{l=0} = x (e.g. an image).!
Convolutional layers (convolutional filters F1, F2, F3: x ⊛ F1, x ⊛ F2, x ⊛ F3; ReLU activation + grid downsampling + pooling): extract local stationary features and compose them via downsampling and pooling, x^{l=0} → x^{l=1} → ... → x^l.!
Fully connected layers (classification function): output signal y ∈ R^{n_c} (class labels).!
Example!
Case Studies!
!! LeNet5 [LeCun-Bengio-et.al98]: !
Input is 32x32.!
Architecture is CL-PL-CL-PL-FC-FC.!
Accuracy on MNIST is 99.6%.!
!! AlexNet [Krizhevsky-et.al12]: !
Input is 227x227x3.!
Architecture is 7CL-3PL-2FC.!
Top-5 error on ImageNet is 15.4%.!
Note: CL1 with 96 filters 11x11: 227x227x3 → 55x55x96 (stride=4), #parameters = (11x11x3)x96 = 35K.!
PL1 2x2: 55x55x96 → 27x27x96, #parameters = 0!!
Case Studies!
!! GoogleNet [Szegedy-et.al14]: !
Input is 227x227x3.!
Architecture is 22 layers.!
Top-5 error on ImageNet is 6.7%.!
!! ResNet [He-et.al15] (Microsoft Asia): !
Input is 227x227x3.!
Architecture is 152 layers!!
Top-5 error on ImageNet is 3.6%.!
Demo: LeNet5!
!! Run lecture11_code01.ipynb!
TensorBoard!
!"#$%&'(&%))*+'
-2'
Sound (1D)!
Outline!
History of CNNs!
Standard CNNs!
CNNs for Graph-Structured Data!
Conclusion!
Non-Euclidean Data!
!! Examples of irregularly/graph-structured data: !
(i) Social networks (Facebook, Twitter)!
(ii) Biological networks (genes, brain connectivity)!
(iii) Communication networks (Internet, wireless, traffic)!
Social networks, brain structure, telecommunication networks: graph/network-structured data.!
!! Main challenges: !
(1) How to define convolution, downsampling and pooling on graphs?!
(2) And how to make them numerically fast?!
!! Current solution: Map graph-structured data to regular/Euclidean grids with
e.g. kernel methods and apply standard CNNs. !
Limitation: handcrafting the mapping is against the CNN principle! !
!"#$%&'(&%))*+'
.5'
Xavier Bresson
31
Related Works!
!! Categories of graph CNNs: !
(1) Spatial approach!
(2) Spectral (Fourier) approach !
!! Spatial approach: !
! Local reception fields [Coates-Ng11, Gregor-LeCun10]:!
Find compact groups of similar features, but no defined convolution.!
! Locally Connected Networks [Bruna-Zaremba-Szlam-LeCun13]:!
Exploit multiresolution structure of graphs, but no defined convolution.!
! ShapeNet [Bronstein-et.al.15-16]:!
Generalization of CNNs to 3D-meshes. Convolution well-defined in these
smooth low-dimensional non-Euclidean spaces. Handle multiple graphs. !
Obtained state-of-the-art results for 3D shape recognition.!
!! Spectral approach: !
! Deep Spectral Networks [Henaff-Bruna-LeCun15]:!
Computational complexity is O(n²), while ours is O(n). !
Graph: vertices i, j ∈ V with edge weights W_ij (e.g. W_ij = 0.9).!
Graph Laplacian [1]:!
L = D − W (unnormalized), L = I_n − D^{−1/2} W D^{−1/2} (normalized)!
[1] Chung, 1997!
Graph Fourier transform: F_G f = f̂ = Uᵀf ∈ Rⁿ, with f̂_l = Σ_{i=0}^{n−1} f(i) u_l(i),!
and inverse transform f(i) = Σ_{l=0}^{n−1} f̂_l u_l(i), where the u_l are the eigenvectors of L.!
Graph convolution: for f, g ∈ Rⁿ,!
(f ⊛_G g)(i) = Σ_{l=0}^{n−1} f̂_l ĝ_l u_l(i), that is!
f ⊛_G g = U((Uᵀf) ⊙ (Uᵀg)) = U diag(ĝ(λ_0), ..., ĝ(λ_{n−1})) Uᵀ f = ĝ(L) f!
Graph translation: (T_i g)(j) = (g ⊛_G δ_i)(j) = Σ_{l=0}^{n−1} ĝ_l u_l(i) u_l(j),!
where, in the continuum, f̂(λ) = ⟨f, e^{2πiλx}⟩ and the e^{2πiλx} are the eigenfunctions of the continuum Laplace-Beltrami operator Δ, i.e. the continuum version of the graph Fourier modes u_l.!
Figure 1: Translated signals T_s f, T_{s'} f, T_{s''} f in the continuous R² domain (a-c), and T_i f, T_{i'} f, T_{i''} f in the graph domain (d-f). The component of the translated signal at the center vertex is highlighted in green. [Shuman-Ricaud-Vandergheynst16]!
Localization by polynomial filters [2]: take ĝ(λ) = Σ_{k=0}^K a_k λ^k. (1)!
Then (T_i g)(j) = 0 if d_G(i, j) > K, (2)!
where d_G(i, j) is the discrete geodesic distance on graphs, that is the shortest path between vertex i and vertex j.!
[2] Hammond, Vandergheynst, Gribonval, 2011!
Then T_i p_K(j) = (p_K ⊛_G δ_i)(j) = (p_K(L) δ_i)(j) with p_K(L) = Σ_{k=0}^K a_k L^k,!
and (p_K(L) δ_i)(j) = 0 if d_G(i, j) > K.!
B_i^K = support of the polynomial filter at vertex i (a ball of radius K around vertex i).!
!"#$"%&'"#(%)
*+%,-./)
F1
x F2
F2
F3
x F3
The monomial basis {1, x, x2 , x3 , ..., xK } provides localized spatial filters, but
R1
2 1
does not form an orthogonal basis (e.g. h1, xi = 0 1xdx = x2 0 = 12 ), which
limits its ability to learn good spectral filters.
polynomials: Let Tk (x) the Chebyshev polynomial of order k gen!! Chebyshev
C!
erated by the fundamental recurrence property Tk (x) = 2xTk 1 (x) Tk 2 (x)
with T0 = 1 and T1 = x. The Chebyshev basis {T0 , T1 , ..., TK } forms an orthogonal basis in [ 1, 1].
x F1
/,'
Filters: ĝ(λ) = Σ_{k=0}^K θ_k T_k(λ).!
!! Fast filtering: denote X_k := T_k(L)x and rewrite y = Σ_{k=0}^K θ_k X_k. Then all {X_k} are generated with the recurrence X_k = 2L X_{k−1} − X_{k−2}. As L is sparse, all matrix multiplications are between a sparse matrix and a vector. The computational complexity is O(|E| K), and reduces to linear complexity O(n) for k-NN graphs.!
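A minimal sketch of this recurrence (L is assumed sparse and rescaled so its spectrum lies in [−1, 1]; scipy is assumed):

    import numpy as np
    import scipy.sparse as sp

    def cheb_filter(L, x, theta):
        # y = sum_k theta_k T_k(L) x, assuming len(theta) >= 2
        Xk_2, Xk_1 = x, L.dot(x)               # T_0(L)x = x, T_1(L)x = Lx
        y = theta[0] * Xk_2 + theta[1] * Xk_1
        for k in range(2, len(theta)):
            Xk = 2 * L.dot(Xk_1) - Xk_2        # X_k = 2L X_{k-1} - X_{k-2}
            y += theta[k] * Xk
            Xk_2, Xk_1 = Xk_1, Xk
        return y

    L = sp.identity(5, format='csr') * 0.5     # toy sparse operator
    y = cheb_filter(L, np.ones(5), np.array([0.5, 0.3, 0.2]))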
Graph Coarsening!
!! Graph coarsening: as in standard CNNs, we must define a grid coarsening process for graphs. It will be essential for pooling similar features together.!
G^{l=0} = G → (graph coarsening/clustering) → G^{l=1} → (graph coarsening/clustering) → G^{l=2}!
!"#$%&'(&%))*+'
/.'
Graph Partitioning!
!! Balanced Cuts [4]: Two powerful measures of graph clustering are the
Normalized Cut and Normalized Association defined as:!
Normalized Cut: min over C_1, ..., C_K of Σ_{k=1}^K Cut(C_k, C_k^c) / Vol(C_k)!
Normalized Association: max over C_1, ..., C_K of Σ_{k=1}^K Assoc(C_k) / Vol(C_k)!
The two problems are equivalent by complementarity. Partitioning is done by max vertex matching.!
where Cut(A, B) := Σ_{i∈A, j∈B} W_ij, Assoc(A) := Σ_{i∈A, j∈A} W_ij, Vol(A) := Σ_{i∈A} d_i, and d_i := Σ_{j∈V} W_ij is the degree of vertex i.!
Graclus proceeds greedily at each coarsening level l:!
(P1) Vertex matching: match an unmarked vertex i with the unmarked neighbor j maximizing (W_ii^l + 2 W_ij^l + W_jj^l) / (d_i^l + d_j^l).!
(P2) Graph coarsening: build G^{l+1} with W_ij^{l+1} = Cut(C_i^l, C_j^l) and W_ii^{l+1} = Assoc(C_i^l).!
This gives a local solution to the Normalized Association problem max Σ_k Assoc(C_k^l) / Vol(C_k^l) at each level: G^l → G^{l+1} → G^{l+2}.!
Figure 6: Graph coarsening with Graclus. Graclus proceeds by two successive steps: (P1) vertex matching, and (P2) graph coarsening. These two steps provide a local solution to the Normalized Association clustering problem at each coarsening level l.!
[5] Dhillon, Guan, Kulis, 2007!
Graph coarsening: matched vertices at level G^{l=0} = G are merged into single vertices at G^{l=1}, then G^{l=2}.!
Graph pooling: reindexing the vertices w.r.t. the coarsening structure gives a binary-tree arrangement of the vertices.!
Figure 7: Fast graph pooling using the graph coarsening structure. The binary tree arrangement of vertices allows a very efficient pooling on graphs, as fast as a regular 1D Euclidean grid pooling.!
Graph CNN architecture:!
Input signal on graphs: x ∈ Rⁿ, x^{l=0} ∈ R^{n_{l=0}}, on G = G^{l=0} (e.g. social, biological, telecommunication graphs).!
Graph convolutional layers (filters g_1^{K_1}, g_2^{K_1}, g_3^{K_1}, ...; ReLU activation + graph coarsening, with pre-computed coarsenings G^{l=0} → G^{l=1} → G^{l=2} and pooling): extract multiscale local stationary features on graphs.!
Feature maps: x^{l=0} ∈ R^{n_0 × F_1}, x^{l=1} ∈ R^{n_1 × F_1} with n_1 = n_0 / 2^{p_1}, ..., x^{l=5} ∈ R^{n_5 × F_5}, with parameters θ^{l=1} ∈ R^{K_1 F_1}, ..., θ^{l=5} ∈ R^{K_5 F_1...F_5}.!
Fully connected layers (classification): output signal y ∈ R^{n_c} (class labels), with θ^{l=6} ∈ R^{n_5 n_c}.!
Optimization!
!! Backpropagation [6] = chain rule applied to the neurons at each layer.!
Layer outputs: y_j = Σ_{i=1}^{F_in} g_{θ_ij}(L) x_i!
Loss function: E = −Σ_{s∈S} l_s log y_s!
Gradient descents: θ_ij^{n+1} = θ_ij^n − τ ∂E/∂θ_ij, x_i^{n+1} = x_i^n − τ ∂E/∂x_i!
Local gradients (accumulated by backpropagation):!
∂E/∂θ_ij = Σ_{s∈S} [X_{0,s}, ..., X_{K,s}]ᵀ ∂E/∂y_{j,s}, ∂E/∂x_i = Σ_{j=1}^{F_out} g_{θ_ij}(L) ∂E/∂y_j!
Graph construction: edge weights W_ij = e^{−||x_i − x_j||² / σ²}.!
MNIST classification results:!
Algorithm | Accuracy!
Linear SVM | 91.76!
Softmax | 92.36!
CNNs [LeNet5] | 99.33!
Graph CNNs: CN32-P4-CN64-P4-FC512-softmax | 99.18!
Non-Euclidean CNNs!
!! Text categorization with 20NEWS: a benchmark dataset introduced at CMU [9]. It has 20,000 text documents across 20 topics (data dimension = 33,000, the number of words in the dictionary).!
Non-Euclidean CNNs!
Accuracies with word2vec features: 65.90, 68.51, 66.28, 64.64, 65.76, 68.26.!
Outline!
History of CNNs!
Standard CNNs!
CNNs for Graph-Structured Data!
Conclusion!
Summary!
CNNs are a game changer:!
(1) Breakthrough for all Computer Vision-related problems.!
(2) Revive the dream of Artificial Intelligence.!
(3) Deep learning = Big Data + GPUs/Cloud + Neural Networks.!
(4) Big question: why does it work so well?!
Questions?
Data Science!
Sept 12-14, 2016!
Outline!
Motivation!
Vanilla Recurrent Neural Networks (RNNs)!
Long Short-Term Memory (LSTM)!
Conclusion!
Motivation!
!! Recurrent Neural Networks (RNNs) operate on ordered sequences of
inputs and outputs. Examples: Text, financial series, videos, robot motion, etc.!
Vanilla NNs: one input vector (e.g. an image) maps through hidden layers to one output vector (e.g. a class). Ex: CNNs (image to class).!
RNNs: one input vector (e.g. an image) maps to multiple output vectors (e.g. a sequence of words). Ex: image captioning (image to caption sentence).!
Motivation!
RNNs: multiple input vectors (e.g. a sequence of images) map to one output vector (e.g. a class). Ex: video classification (sequence of images to class).!
RNNs: multiple input vectors map to multiple output vectors (e.g. a sequence of words to a sequence of words). Ex: machine translation (sentence to sentence).!
Outline!
Motivation!
Vanilla Recurrent Neural Networks (RNNs)!
Long Short-Term Memory (LSTM)!
Conclusion!
General Description!
!! RNNs are recurrent learning machines: a block RNN, with state h_t and parameters W, maps the input x to the output y at each step.!
Recurrence Formula!
!! The update of the RNN state is done with a recurrence formula at each time step:!
h_t = f_W(h_{t−1}, x_t)!
where h_t is the new state of the RNN, h_{t−1} the previous state, x_t the input vector at the current time step, f the recurrence function, and W the weights/parameters of the recurrence function.!
!! Notes:!
(1) The recurrence function is independent of the time t! The same function f is used at every time step.!
(2) Changing W will change the behavior of the RNN.!
(3) The weights W are learned by backpropagation on the training data.!
Vanilla RNNs!
!! Simplest RNNs:!
h_t = f_W(h_{t−1}, x_t)!
h_t = tanh(W_hh h_{t−1} + W_xh x_t)!
y_t = W_hy h_t!
3'
Recurrence formula: h_t = tanh(W_hh h_{t−1} + W_xh x_t)!
Linear/softmax classifier for the next character (unnormalized probabilities): y_t = W_hy h_t!
Each character is encoded as a vocabulary vector; the weights are learned by backpropagation.!
Note: in text analysis, we never work with characters directly, but with numbers (via a 1-to-1 mapping between characters and numbers).!
https://gist.github.com/karpathy/d4dee566867f8291f086!
Example: Mathematics!
!! Training data: open source textbooks on algebraic geometry!
!"#$%&'(&%))*+'
,/'
Example: Code!
!! Training data: Linux code!
!"#$%&'(&%))*+'
,0'
Image Captioning !
!! It is possible to merge CNNs and RNNs!!
Example: Image captioning !
!"#$%&'(&%))*+'
,1'
Design!
!! Step 1: Remove the last FC layer and softmax classifier of the CNN (classification is not needed, only the visual feature extractors).!
Design!
!! Step 2: Connect CNN output to RNN.!
New!!
!"#$%&'(&%))*+'
,3'
Design!
!! Step 3: Construct the whole RNN.!
!"#$%&'(&%))*+'
,4'
Results!
!"#$%&'(&%))*+'
-5'
!"#$%&'(&%))*+'
-,'
Deep RNNs!
!! Multilayer RNNs: rewrite the one-layer recurrence h_t = tanh(W_hh h_{t−1} + W_xh x_t) as!
h_t = tanh( W [x_t; h_{t−1}] ) with W = [W_xh  W_hh],!
and stack such layers.!
Outline!
Motivation!
Vanilla Recurrent Neural Networks (RNNs)!
Long Short-Term Memory (LSTM)!
Conclusion!
Understanding LSTM!
!! From paper: !
!"#$%&'(&%))*+'
-0'
Understanding LSTM!
The LSTM has two state vectors: !
h: the hidden state vector.!
c: the cell state vector. !
Besides, three gate vectors:!
f: called the forget vector.!
i: called the input vector.!
o: called the output vector.!
Understanding LSTM!
Time step t!
Understanding LSTM!
The cell state c flows to the hidden state.!
Understanding LSTM!
Stack up to get a multilayer LSTM: !
LSTM Variants!
At the end of the day, the LSTM gives the best performance over many possible experimental conditions. !
Outline!
Motivation!
Vanilla Recurrent Neural Networks (RNNs)!
Long Short-Term Memory (LSTM)!
Conclusion!
Summary!
RNNs offer lots of flexibility in NN architecture.!
Hot research:!
(1) Architecture design.!
(2) Better understanding.!
(3) Why is the performance so good? Open theoretical question.!
Questions?
Data Science!
Sept 12-14, 2016!
Data Science !
Science of transforming raw data into meaningful
knowledge to provide smart decisions to real-world
problems.!
!"#$%&'(&%))*+'
-'
Data Science!
Deep Learning!
Data Science = Big Data + Computational Infrastructure + Artificial Intelligence!
3rd industrial revolution!
!"#$%&'(&%))*+'
Cloud computing
computing!
GPU!
Math parts!
.'
!"#$%&'(&%))*+'
1'
Rapid Development!!
!"#$%&'(&%))*+'
2'
!"#$%&'(&%))*+'
3'
!"#$%&'(&%))*+'
,4'
Thank you!