Introduction To Data Science!

Data Science!
Sept 12-14, 2016!
EPFL-UNIL Continuing Education !

Lecture 1: Introduction to Data Science!
Xavier Bresson!
!
Swiss Federal Institute of Technology (EPFL) !
!"#$%&'(&%))*+'
,'
Program Organizer!
Professor, Vice-Provost EPFL!

Head of LTS4 at EPFL!
Apple ARTS award, Boelpaepe prize !
Co-founded two start ups!
Prof. Pierre Vandergheynst!

pierre.vandergheynst@epfl.ch !
!"#$%&'(&%))*+'
-'
Program Instructor!
- Prof of Data Science at the Institute
of Data Science at NTU, Singapore!
- Publications in NIPS, ICML, JMLR!
- Teach Master and PhD courses in
Data Science at EPFL !
- Trained at EPFL, UCLA!
- Consulting !
Dr. Xavier Bresson !

xavier.bresson@epfl.ch!
!"#$%&'(&%))*+'
.'
Teaching Assistants!
M. Kirell Benzi!
kirell.benzi@epfl.ch!
Data Scientist/Artist!
!"#$%&'(&%))*+'
M. Michal De"errard!
michael.de"errard@epfl.ch!
Data Scientist!
/'
Data Scientist!
Source: Drew Conway '
Best job in the U.S in 2015 [Forbes].!

Salary has jumped from $125,000 to $200,000+ [Glassdoor].!
McKinsey projects that by 2018, the U.S. alone may face a 50 percent to 60
percent gap between supply and requisite demand of deep analytic talent. !
!"#$%&'(&%))*+'
0'
In the News!
!"#$%&'(&%))*+'
1'
Data Science!
Q: What is Data Science? !
!"#$%&'(&%))*+'
2'
What is Data Science? - Short Answer !

Science of transforming raw data into meaningful
knowledge to provide smart decisions to real-world
problems.!
!"#$%&'(&%))*+'
3'
What is Data Science? - Long Answer

Q: What are the fields !
of Data Science? !
Q: What are the applications? !
Computer Science
Science!
Scalable databases for storing, accessing data. !

E.g. Cloud computing, Amazon EC2, Hadoop.!
Distributed and parallel frameworks !

for data processing. !
E.g. MapReduce, GraphLab.
GraphLab.!
Personalized !
Services!
Services
E.g. Healthcare (enhanced diagnostics)

diagnostics)!
(products)!
Commerce (products)
Mathematical!
Mathematical
Modeling!
Modeling
Design algorithms that transform

transform!
data into knowledge.
knowledge.!
Use Linear algebra, optimization, !
statistics.!
graph theory, statistics.
Data Science
Science!
Multidisciplinary field: 1+1=3

1+1=3!
Data!
Knowledge !
Discovery !
E.g. Physics, genomics, !
social sciences.
sciences.!
Collection of massive amounts of !

data at increasing rate.!
E.g. Social networks, sensor networks, !
mobile devices, biological networks,!
administrative, economics data!
Issues of privacy, !
ownership!
security, ownership
Domain
Domain!
Expertise
Expertise!
Sciences!
Sciences
E.g. Economy, Biology, Physics, Neuroscience, sociology.

sociology.!
Government!
Government
E.g. Healthcare, Defense, Education, Transportation..!
Q: What are the main !

challenges?!
Industry!
Industry
Intelligent !
Systems!
Systems
E.g. Autonomous cars, security, !

interactive tools for data organization
organization!
and exploration. !
E.g. E-commerce, Telecommunications, !

Finance.
Finance.!
Major challenges: Multidisciplinary integration, large-scale databases, scalable

computational infrastructures, design math algorithms for massive datasets, trade-o"
speed and accuracy for real-time decisions, interactive visualization tools. !
What is Data Science? - Medium Answer !

Q: What is big data?!
Q: Is AI new?!
Data Science = Big Data + Computational Infrastructure + Artificial Intelligence

Intelligence!
3rd industrial !
revolution!
Cloud computing
computing!
GPU!
Not new!!
!"#$%&'(&%))*+'
,4'
A Brief History of Data Science!
Q: Did you hear about !

the 4th industrial revolution?!
RNN!
Schmidhuber!
CNN!
LeCun!
First !
NIPS!
Visual primary cortex!
Hubel-Wiesel!
1959'
1962 1975'
1962'
1958'
Backprop !
Perceptron
Perceptron!
Werbos!
Rosenblatt
Rosenblatt!
First !
KDD
KDD!
1989'
1989
1987'
Neocognitron!
Fukushima!
Birth of!
Data Science!
Split from Statistics!
Tukey!
AI Hope!
!"#$%&'(&%))*+'
Big Data!
Volume doubles/1.5 year!
1998
1997 1998'
1997'
1995'
1999
1999'
Hardware!
GPU speed doubles/ year!
First Amazon!
Cloud Center!
Google AI !
TensorFlow!
Facebook AI!
Torch!
Kaggle!
Platform!
2010'
2006'
Auto-encoder!
LeCun, Hinton, Bengio!
First NVIDIA !
GPU!
SVM/Kernel techniques!
Vapnik!
AI Winter [1966-2012]!
Kernel techniques!
Handcrafted features!
Graphical models!
2012
2012'
2014' 2015'
Data scientist!
Facebook Center!
1st Job in US!
OpenAI Center!
4th industrial revolution?!

Digital Intelligence !
Deep Learning!
Revolution!
Breakthough !
or new AI bubble?!
Hinton, Ng!
AI Resurgence!
,,'
Data Science and Graph Science!
and Graph Science !!
!"#$%&'(&%))*+'
,-'
Networks/Graphs!
!! Graphs encode complex data structures.
They are everywhere: WWW, Facebook,
Amazon, etc !
MNIST Image !
Network!
Social Network'
Graph of Google Query!
California'
!! Graphs are the most important discrete

models in the world! - G. Strang (MIT)!
GTZAN Music !
Network!
Network of Text Documents!

20newsgroups'
!"#$%&'(&%))*+'
,.'
Why Networks are important?!

!! Networks improve all data science tasks, for a small computational price!!
!! Essential data lie on networks:!
(1) Social networks (Facebook, Twitter)!
(2) Biological networks (genes, brain connectivity)!
(3) Communication networks (Internet, wireless, tra"c)!
7'
3+14#5.'()*+",/!
6"#4'./)"01)0"(!
8(5(1+990'41#:+'.
'()*+",/!
!"#$%&'()*+",-.
./)"01)0"(2.2#)#..
!"#$%&'(&%))*+'
,/'
567899:";"<)=$%+=%<=%<=*>&)%?;@'
!"#$%&'(&%))*+'
,0'
Outline of the Course!

1st day!
Graph Science
Science!
Data structure
structure!
Pattern extraction!
extraction
Unsupervised
Clustering
Clustering!
k-means, graph cuts
cuts!
Python!
Python
Language for !
data science!
science
Supervised
Classification!
Classification
SVM!
SVM
Deep Learning
Learning!
NNs, CNNs, RNNs,
RNNs,!
Data Science
Science!
Pagerank, collaborative
Pagerank
filtering
content filtering!
3rd day!
Data
Visualization!
Visualization
Manifold, t-SNE
t-SNE!
!"#$%&'(&%))*+'
Recommender
Systems
Systems!
Feature
Extraction!
Extraction
PCA, NMF, Sparse

coding
coding!
2nd day!
,1'
Structure of the Course!

Introduction of main concepts/ideas, technical details.!
!
Coding illustration on real-world data.!

!
Please, ask questions! !

!
Please, share your own data science problem for discussion.!
Xavier Bresson
17
Goal of the Course!
!"#$%&'(&%))*+'
,3'
QuesCons?
Xavier Bresson
19
Data Science!
Sept 12-14, 2016!

Lecture 2: Introduction to Python!
Kirell Benzi and Michael De"errard!
!
!"#$%&'(&%))*+'
,'
Python!
!! Why Python?!
!"#$%&'(&%))*+'
-'
Python!
!! Why Python for Data Science?!
!"#$%&'(&%))*+'
.'
Computational Needs
Fast numerical mathematics: BLAS & LAPACK libraries
Easy bridging to data: data files, databases, scraping
Easy bridging to legacy code: C, matlab, Fortran
Easy results presentation: html / web & pdf reports
Rapid prototyping
Ideally the same framework for R&D and production
Cluster computing: multi-threads, MPI, OpenMP, Ipython Parallel
GPU computing: OpenCL, CUDA
Xavier Bresson
Python Pros, for Prototyping

Easy-to-learn while powerful
Elegant syntax (quick to write, easy to read)
High-level data structures: list, tuple, set, dict & containers
Multi-paradigm: procedural, object-oriented, functional
Dynamically typed
Automatic memory management (garbage collector)
Interpreted (JIT is coming)
Runs everywhere: Windows, Mac, Linux, Cloud
Large community
Extensive ecosystem of libraries
l
easy to share & install packages via pip
Xavier Bresson
Python Pros, for Production

General purpose (unlike matlab / R / julia)
Encourage code reuse: modules and packages
Integrated documentation
Open-source
Many tools: unit & integration testing, documentation generation,
debugging, performance optimization
Xavier Bresson
Python Cons
Python 2 vs 3
Slow execution
l Specialized libraries: numpy, scipy
l Compilation: pypy, numba, jython
Need to run to catch errors
Xavier Bresson
Scientific Python
Libraries for everything !
Numerical analysis
l numpy: multidimensional arrays, data types, linear algebra
l scipy:
higher-level algorithms, e.g. optimization, interpolation, signal
processing, sparse matrices, decompositions
SciKits
l scikit-learn: machine learning
l scikit-image: image processing
Deep Learning: tensorflow, theano, keras
Statistics: pandas
Symbolic algebra: sympy
Visualization
l matplotlib: similar to MATLAB plots
l bokeh: interactive visualization
Xavier Bresson
Data Storage
Flat files
l CSV: numpy / pandas
l Matlab: scipy
l JSON: std lib
l HDF5: h5pyBasic relational database storage
Connectors for relational databases
l SQLite: std lib
l PostgreSQL: psycopg (DB API)
l MySQL: mysqlclient
l Oracle: cx_Oracle (DB API)
l Microsoft SQL Server: pypyodbc (DB API)
NoSQL data stores
l Redis: Redis-py
l MongoDB: PyMongo (MongoEngine)
l Hbase: HappyBase
l Cassandra: Datastax
Object-Relational Mapping (ORM)
l SQLAlchemy, Peewee, Pony
Xavier Bresson
Jupyter
HTML-based notebook environment
Multiple kernels / languages: Python, matlab, R, Julia
Platform agnostic: Windows, Mac, Linux, Cloud
All-in-one reports: text, latex math, code, figures, results
Most adapted for prototyping / data exploration
l Convert to Python modules when mature for production
Cloud: github, nbviewer
Alternatively, scientific IDEs: Spyder, Rodeo
l Jupyter is itself becoming an HTML-based IDE !
Other IDEs: IDLE, PyCharm
Text editors: vim, emacs, atom, sublime text
Xavier Bresson
10
Install It Yourself
Windows: anaconda or python(x,y) or Enthought Canopy
Mac: anaconda or homebrew / macports / fink
Linux: package manager (apt-get, yum, pacman)
Use pip to install packages from PyPI or GitHub
Use pyvenv to work with virtual environments
Xavier Bresson
11
Live Session
1) Cloud IDE: nitrous.io
2) Notebook: Jupyter / IPython
3) Basics of Scientific Python: numpy, scipy, scikit-learn, matplotlib
4) Demo: data visualization by Kirell Benzi
Xavier Bresson
12
Ques8ons?
Xavier Bresson
13
Data Science!
Sept 12-14, 2016!

Lecture 3: Graph Science!
Xavier Bresson!
!
!"#$%&'(&%))*+'
,'
Outline!
Graph Science and Graph Theory!
!
Class of Networks!
Basic Definitions!
Curse of Dimensionality and Structure!
Manifolds and Graphs!
Spectral Graph Theory!
Construct Graphs from Data!
Conclusion!
Xavier Bresson
Graph/Network Science!
!! Definition of graph/network: mathematical models representing pairwise
relations between objects/data:!
All pairwise !
relationships'
data1'
data2'
data3'
data1'
data2'
Q: Why are they useful? !

Graphs o"ers a global view of data structure!
Extract global meaningful patterns, insights about data!
Increase performances, ex: classification from 5-20% (later)!
Easy to use, slight increase of computational time!
Some tasks are only designed on networks, e.g. Google pagerank!
!"#$%&'(&%))*+'
.'
Graph Science = Graph Theory!

Q: When did it start? !
History of graph theory: Graphs have been studied since 1736, starting with
Leonhard Euler and the famous problem of Seven Bridges of Knigsberg:!
Q: Can we find a path through the city that cross each bridge once and only
once? Yes if the graph has even degree.!
Simplification
Knigsberg!
City
Graph !
representation
Source: Wikipedia.
A: Not possible. Needs cycles

All vertices must have even degree.
Graph theory oers many analysis tools to use networks for all kind of
applications: from clustering to classification, visualization, recommendation,
deep learning, etc.!
Xavier Bresson
Outline!
!
Class of Networks!
Basic Definitions!
Conclusion!
Xavier Bresson
Class of Graphs/Networks!
!! Natural Graphs:!
(1) Social networks: Facebook, LinkedIn, Twitter!
Q: Cite a few networks? !
(2) Biological networks: Brain connectivity and functionality, Gene regulatory networks!
(3) Communication networks: Internet, Networking Devices!
(4) Transportation networks: Trains, Cars, Airplanes, Pedestrians !
(5) Power networks: Electricity, Water !
Facebook'
Brain !
Connectivity'
!'
Minnesota Road Network'
Graphs/networks!
US Electrical Network'
Telecommunication!
Network'
!! Essential data $$ lie on network structures, like medical, social,

communication data.!
Terminology: Natural graphs means no graph construction.!
!"#$%&'(&%))*+'
1'
!! Constructed Graphs (from Data). !
Examples:!
MNIST Image Graph!
(1) MNIST image network!

(2) GTZAN music network!
(3) 20NEWS text document network!
GTZAN Music Graph !

Graph of Text Documents!
20newsgroups'
(4) 3D mesh points!
3D mesh points!
!! Graph Construction: No universal recipe but good common practices

(later discussed) and domain expertise knowledge.!
Q: How much time to construct a network from data? !
!! Computational time: May be time consuming O(n2), n=#data.!
Exs: d=1K,
n=1K
n=100K
time < 1sec!

time ~ 1 min!
n=1M
time > 1 hour!
!! Approximate technique:
FLANN is a library for performing fast !
approximate nearest neighbor searches in high dimensional spaces.!

FLANN: http://www.cs.ubc.ca/research/flann!
!"#$%&'(&%))*+'
kd-tree'
2'
!! Mathematical/Simulated Graphs:!
(1) Erdos-Renyi graphs (1959)!
(2) Stochastic blockmodels [Faust-Wasserman 92]!
(3) Lancichinetti-Fortunato-Radicchi (LFR) graphs (2008)
!
Erdos-Renyi Network'
Source: Wikipedia.'
Q: Why using artificial math networks?!

!! Mathematical Models:!
!
Advantages: Precise control of your data analysis model (best performances, data
assumptions). No need to perform extensive experiments! (big issue with deep learning)!
! Limitations: Most data assumptions are too restrictive, and it may be hard to check if
your data follow the model assumptions.!
!"#$%&'(&%))*+'
3'
Outline!
!
Class of Networks!
Basic Definitions!
Conclusion!
Xavier Bresson
Basic Definitions!
!! Graphs: Fully defined by G=(V,E,W):!
!! V set of vertices,!
!! E set of edges, and |V|=n,!
!! W similarity matrix.!
eij 2 E
i2V
!! Directed/undirected graphs: !
j2V
Wij = 0.9
Note: In this workshop, we will mostly talk about undirected networks.!

!"#$%&'(&%))*+'
,5'
Basic Definitions!
!! Vertex Degree: !
(1) Binary graphs (Wij={0,1}):
degree= #edge connected to a vertex!
(2) Generic graphs (Wij in [0,1]):
degree= ! di =
Wij
j2V
!! Shortest path: A path on a graph with the smallest possible distance.!

Fast Algorithm: Dijkstra's algorithm (1956).!
!
Q: Do you know a popular application? !

A: Road navigator, e.g. Lausanne to Venezia.!
!"#$%&'(&%))*+'
,,'
Full vs. Sparse Graphs!

!! Full/Complete Graphs: Each vertex is connected to all other vertices.!
|E| =
n(n
1)
2
= O(n2 )
!! Sparse Graphs: Each vertex is connected to a few other vertices (k=10-50).!

Q: Full or sparse? !
|E| = O(n)
!! A: Sparse networks are highly desirable for memory and computational e#ciency.!
Ex: Internet,, n= 4.73 billion pages (August 2016)
2016)!
|E| = n2 = 1018 if it was full.
full.!
|E| = k.n = 1011 as it is (very) sparse.
sparse.!
!! Good news: most natural/real-world networks (Facebook, Brain,
Communication) are sparse. Besides, sparsity
structure:!
!"#$%&'(&%))*+'
Full graph'
Sparse graph'
,-'
Adjacency/Similarity Matrix W!
Definition: Matrix W in G=(V,E,W) actually contains all information
about your network. There are two choices of W:!
(1) Binary W: Wij in {0,1}!
(2) Weighted W: Wij in [0,1] (commonly normalized to 1)!
(1) Binary W:!

Wij =
1 if (i, j) 2 E
0 otherwise
(2) Weighted W:!
wij 2 [0, 1] if (i, j) 2 E

Wij =
0 otherwise
Xavier Bresson
13
Demo: Synthetize Social Networks!

!! Run lecture03_code01.ipynb!
Synthesize LFR social networks: Play with the mixing
parameter $:!
(1) $ small: communities well separated.!
(2) $ large: communities are mixed up.!
prin
prout
!"#$%&'(&%))*+'
,/'
Outline!
!
Class of Networks!
Basic Definitions!
Conclusion!
Xavier Bresson
15
Curse of Dimensionality!
Q: What is the curse of dimensionality?!
A: In high dimensions, (Euclidean) distances between data is meaningless
all data are close to each other!!
Result [Beyer98]: Suppose data are uniformly distributed in Rd,!
Pick any data xi then we have:
limd!1
d`2 (x , V x ) d`2 (x , V
i
i
min i
E max
2
d`min
(xi , V xi )
xi )
!0
Loss of intuition: Gaussian distribution of data.!

In low-dim, most data are concentrated at the center.!
In high-dim, most data are concentrated at the surface.
Xavier Bresson
1-D Gaussian
1000,000-D Gaussian
16
Blessing of Structure!
Q: What is the blessing of structure?!
Good news: Assumption data are uniformly distributed is not true for realworld data. Data have always some structures in the sense that they belong to
a low-dimensional space called manifold
distances on this surface are
meaningful!
Uniform distribution !
of data!
No structure!
Randomness
Xavier Bresson
Non-Uniform distribution !
of data!
!
Structureness!
17
Outline!
!
Class of Networks!
Basic Definitions!
Conclusion!
Xavier Bresson
18
Manifold Learning!
!! Big challenge: It is di#cult to discover the structures hidden in the data
because:!
(1) High-dimensional data!
(2) Large-scale data!
A class of algorithms exists and is called manifold learning techniques (later
discussed).'
!! Some data have clear structures (some others less):'
MNIST Image !
Graph!
!"#$%&'(&%))*+'
Graph of Text Documents!

20newsgroups'
,4'
From Manifolds to Graphs!

Manifold assumption: High-dim data are
sampled from a low-dim manifold.!
Ex: Let x be a movie, each movie is defined
by d features/attributes like genre, actors,
release year, origin country, etc such that x
in Rd. Then we can make the assumption
that all movies form a manifold in Rd.!
Assumption validity: It is a good working hypothesis for:!

(1) Several types of data (images, text documents, music, etc)!
(2) Most data science tasks (classification, visualization, recommendation, etc)!
Xavier Bresson
20
From Manifolds to Graphs!

!! Graphs=Manifold sampling: The manifold information is encoded by
neighborhood graphs (we never form the manifold):!
Graph !
construction!
Sampling!
Smooth!
manifold !
Data points!
Graph!
[Belkin]'
G=(V,E,W)!
(
!! Neighborhood Graphs: !
dist(xi ,xj )2
2
if j 2 Nik
e
k-NN graphs (most popular)! Wij =
0 otherwise
where dist(xi,xj) is the distance between

xi and xj (to be defined), % is the scale
parameter (value depends on data), Nik is
the neighborhood of data xi:'
!"#$%&'(&%))*+'
-,'
Outline!
!
Class of Networks!
Basic Definitions!
Conclusion!
Xavier Bresson
22
Graph Analysis=Spectral Graph Theory!

Q: How to use graphs? !
!! A: Given a data graph G=(V,E,W), use spectral graph theory (SGT) to:!
(1) Find meaningful patterns (multi-scale data structures)!
(2) Analyze data on top of the network (Facebook users with messages, videos, etc)!
(3) Design data science tasks (clustering, classification, recommendation, etc) !
!! Graph Laplacian Operator L: The most powerful tool in SGT to

analyze and process networks! What is it?!
(1) Heat di"usion operator (on graphs).!
(2) Basis functions of this operator are the well-known Fourier modes (on graphs)
(later discussed).!
!"#$%&'(&%))*+'
-.'
Graph Laplacian Definitions!

Unnormalized/combinatorial graph Laplacian: (historical)!
Lun = D
with D is the degree matrix: D = diag(d1 , ..., dn ),
nn
di =
n = |V |
Wij
Normalized Laplacian: (most popular)!
L=D
1/2
Lun D
1/2
= In
1/2
WD
1/2
Nice math properties s.a. robustness w.r.t.

unbalanced sampling:
Random Walk Laplacian: (for Google PageRank)!
L=D
Lun = In
Note: All Laplacian are diusion operators, but dierent diusion properties.!
Xavier Bresson
24
Graph Spectrum!
Motivation: Study the modes of variation of the graph system.!
Q: How? A: Eigenvalue Decomposition (EVD) of Laplacian L:!
L = U U
Luk =
k uk
U = [u1 , ..., un ]
= diag( 1 , ...,
huk , uk0 i =
0=
min
n)
1 if k = k 0
0 otherwise
1
...
max
Interpretation:!
(1) uk: Fourier modes, i.e. vibration vectors of the graph.!
(2) k: Frequencies of the Fourier modes uk, i.e. how much they vibrate.!
Xavier Bresson
25
Demo: Modes of Variations of the Graph System!

Graph = Meshes in graphics

(Google 3D shape recognition)!
Graph = Regular grids (JPEG image

compression, most used worldwide)!
Q: What is the main property of the first and last of eigenvectors?!
!! A: First eigenvectors = smoothest modes of vibration of the graph !

Last eigenvectors = highest frequencies of the graph !
!"#$%&'(&%))*+'
-1'
Neuroscience!
!! Goal: Find meaningful activation patterns in brain using Structural MRI
and Functional MRI. !
Time series !
at this location!
Dynamic activity !
of the brain"
!! Methodology: G
uk
dynamic activation patterns:'
Connectivity !
of the brain!
(fibers connecting!
regions)!
Graph connectivity'
Q: How to construct the dynamic network?!
!! Results: (Re)discover dynamic

patterns related to basic functional
tasks: vision, body motor, language, etc!
!"#$%&'(&%))*+'
-2'
Outline!
!
Class of Networks!
Basic Definitions!
Conclusion!
Xavier Bresson
28
How to Construct Graphs from Data?!

Three fundamental questions:!
(1) What type of graphs?!
(2) What distances between data?!
(3) What data features?!
Answers: Optimal graph construction is an open problem depends on
data and analysis tasks. However, they exist good practices and domain
expertise are useful.!
Xavier Bresson
29
What Type of Graphs?!

Neighborhood graphs: k-NN graphs (there exist -graphs but they are dense)!
Parameters:
(1) k = #nearest neighbors !
(2) = scale parameter!
k-value: 10-50!
-value: Two strategies:!
(1) Global scale: = mean distance of all kth neighbors!
(
dist(xi ,xj )2
2
if j 2 Nik
e
Wij =
0 otherwise
(2) Local scale
[Zelnik-Peron04]:
Wij =
Xavier Bresson
i = distance of the kth neighbor of vertex i.!

dist(xi ,xj )2
i j
if j 2 Nik
e
0 otherwise
30
What Distances?!
Q: What Distances do you know?!
(1) Euclidean distance: !
Good for low-dim data d<10!
Good for high-dim data with clear structures (MNIST)!
d`2 (xi , xj ) = kxi
(2) Cosine distance: !
v
u d
uX
xj k2 = t
|xi,m
m=1
xj,m |2
Good for high-dim and sparse data (text documents)!

hx , x i
i
j
1
dcos (xi , xj ) = cos
= |ij |
kxi k2 kxj k2
Note:
(3) Other distances: Kullback-Leibler (information theory), Wasserstein (earths

mover) (PDEs theory), etc.!
Xavier Bresson
31
What Data Features?!

!! Three types of data features:! xi 2 Rd
(1) Natural features (e.g. movie features s.a. genre, actors, year, etc)!
(2) Hand-crafted features (e.g. SIFT in computer vision)!
(3) Learned features (PCA, NMF, sparse coding, deep learning)!
!! Bad approach: It is usually a bad idea to use directly raw data as features. !
!! Good approach: Transform raw data into meaningful data representation by
extracting features (later discussed) and use them for graph construction (trick
known as bag of words):!
Wij = e
dist(xi ,xj )2
2
'
Wij = e
!"#$%&'(&%))*+'
dist(zi ,zj )2
2
.-'
Data Pre-Processing!
Center data (along each dimension): zero-mean property (very common)!
xi
xi
mean({xi })
Normalize data variance (along each dimension): z-scoring property
xi
xi /std({x
})
s i
std({xi }) =
X
j
|xj
(w/ zero-mean )!
mean({xi }|2
Projection on l2-sphere (along each data): !
xi
xi /kxi k2
Normalize max and min value: !
xi 2 [0, 1]
Xavier Bresson
xi
33
Demo: Graph Construction and Pre-Processing!

Let us test: Pre-processing, Construct k-NN graphs, Visualize distances,
Visualize W, Test graph quality with clustering accuracy.!
!"#$%&'(&%))*+'
./'
Demo: Construct Network of Text Documents !

!"#$%&'(&%))*+'
.0'
Outline!
!
Class of Networks!
Basic Definitions!
Conclusion!
Xavier Bresson
36
Summary!
Graph is a superior representation of data: !
Data
Graph G=(V,E,W)
1st fundamental tool: Adjacency Matrix W!
(1) It reveals structures hidden on data.!
(2) It allows to visualize graphs.!
(3) It is used for analysis tasks (later discussed).!
2nd fundamental tool: graph Laplacian Matrix L!
(1) It represents the modes of variations of the graph. !
(2) Used for image compression (jpeg), neuroscience, etc.!
Xavier Bresson
37
Pipeline of Graph Science!

!! Step 1: Data
Graph!
Feature!
Extraction'
High-dim !
Raw data'
!! Step 2: Graph
Graph!
Construction'
Data
Features'
Good !
Practices'
Analysis!
Spectral!
Graph
Theory'
Graph'
Graph!
Analysis'
Identify!
Patterns
Patterns'
Unsupervised
Learning'
Supervised
Learning'
Graph'
Make use of!

Graphs'
Recommend
ation'
Graph!
Analysis'
Visualization'
!"#$%&'(&%))*+'
Domain!
Expertise '
Feature!
Extraction'
'
.3'
Ques8ons?
Xavier Bresson
39
Data Science!
Sept 12-14, 2016!

Lecture 4: Unsupervised Learning!
Xavier Bresson!
!
!"#$%&'(&%))*+'
,'
Outline!
Definition!
!
Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!
Xavier Bresson
Unsupervised Learning!
Q: What unsupervised means? !
Unsupervised learning aims at designing algorithms that can find
patterns in datasets without the use of labels, i.e. prior information.!
There exists several unsupervised learning techniques:!
(1) Unsupervised data clustering (this lecture)!
(2) Graph partitioning (this lecture)!
(3) Data representation/feature extraction (Lecture 7)!
Q: What is the most popular unsupervised

clustering algorithm? !
Xavier Bresson
Outline!
Definition!
!
Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!
Xavier Bresson
K-Means Algorithm!
Most popular clustering algorithm (among top 10 algorithms in data
science).!
Three types of K-Means techniques:!
(1) Standard/linear K-Means!
(2) Kernel K-Means Expectation-Maximization (EM) Approach!
(3) Kernel K-Means Spectral Approach!
Xavier Bresson
Standard/Linear K-Means!
!! Description: Given n data xi in Rd, K-Means partitions the data into K
clusters S1,,SK that minimize the least-squares objective: !
E(M,
(M, S
S) =
K X
X
k=1 xi 2Sk
Means:! M = {m1 , ..., mK }
Clusters:' S = {S1 , ..., SK }
Distance between xi !
and its mean mk!
mk k22
kkxi
kth mean'
kth cluster'
Sk
!"#$%&'(&%))*+'
Sk 0
1'
Algorithm [Lloyd57, Forgy65]!

!! Expectation-Maximization (EM) approach:!
Initialization: Random initial means !
Iterate until convergence: l=0,1,!
(1) Cluster update/expectation step: !
Skl+1 = {xi : kxi
mlk k22 kxi
(2) Mean update/maximization step: !
ml+1
k
!"#$%&'(&%))*+'
xi 2Skl+1
|Skl+1 |
mlk0 k22 , 8k 0 6= k}
Voronoi cell'
xi
2'
Demo: Standard K-Means!

!"#$%&'(&%))*+'
3'
Properties of EM Algorithm!
Advantages:!
(1) Monotonic: El+1
El for all iterations.!

(2) Convergence is guaranteed!
(3) Speed/complexity: O(n.d.K.#iter), where
#iter=nb iterations to converge!
(4) Easy to implement!

(5) Several extensions: K-Medians, !
K-hyperplans, other distances:!
(6) K-Means has relationships with !
other popular algorithms (PCA, NMF) !
Limitations:!
(1)Non-convex energy (NP-hard)!
Existence of local minimizers: some are good, some are bad.!
Good initialization is critical, or restart many times and pick the solution
with the lowest energy value.!
Xavier Bresson
Main Limitation!
!! Assumption: Standard K-Means
suppose data follow a Gaussian Mixture
Model (GMM), meaning that clusters
are linearly separable and spherical. !
!! Consequence: Standard K-Means does

not work for non-linear separable data.!
Solution: Kernel K-Means. !
!"#$%&'(&%))*+'
,5'
Outline!
Definition!
!
Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!
Xavier Bresson
11
Kernel K-Means
[Scholkopf-Smola-Muller98] !
Q: Do you know the kernel trick?!

!! Kernel trick: Map data to a higherdimensional space where data can be
linearly separated:!
!! New objective: Weighted Kernel K-Means energy:!
E(M, S) =
K X
X
k=1 xi 2Sk
i k ((xi )
xi ! (xi )
Weight contribution!
of data xi!
!! Mean update:!
!"#$%&'(&%))*+'
@E
=0
@mk
mk =
mk k22
i (xi )
xi 2Sk i
xi 2Sk
,-'
Cluster Update!
!! Value of kth cluster Sk:!
Skl+1 = {xi : k (xi )
mlk k22 k (xi )
d(xi , mk0 )
d(xi , mk )
With:'
d(xi , mk ) = k (xi )
mlk0 k22 , 8k 0 6= k}
mk k22 = h (xi )
mk , (xi )
mk i
= h (xi ), (xi )i 2h (xi ), mk i + hmk , mk i

(II)kk
K(xi , xi ) = Kii
(I)ik
Kernel matrix!
Linear algebra:'
(I)ik = (KF )ik

(II)kk = (F T KF )kk
= diag(1 , ..., n )
X
X
= diag(1/
i , ..., 1/
i )
i2S1
i2SK
1 if xi 2 Sk
Fik =
0 otherwise
K(x, y) = h (x), (y)i
!"#$%&'(&%))*+'
Kernel matrix'
,.'
Kernel K-Means EM Algorithm!

Initialization: Random initial means !
Iterate until convergence: l=0,1,!
(1) Cluster update/expectation step: !
Compute all distances: !
l
Dik
= d(xi , mlk )
D = diag(K)
Update clusters: !
l+1
Fik
2KF + diag(F T KF )
l
l
1 if Dik
= argmink0 Dik
0
0 otherwise
Fk is an implicit representation!
of cluster Sk!
l+1
Skl+1 = {xi : Fik
= 1}
(2) Mean update/maximization step: !
No update here! It is implicitly done in D.!
!"#$%&'(&%))*+'
,/'
Demo: Kernel K-Means!

!"#$%&'(&%))*+'
,0'
Kernel Trick!
Q: Do we need to compute the kernel mapping "?!
!! A: No, we never use explicitly the non-linear function "! The exact
expression is actually irrelevant, only the kernel matrix K is important. !
!! Why is this good?!
K(xi , xj ) = (ahxi , xj i + b)c = h (xi ), (xj )i
hxi , xj i
Low-dim products, e.g. xi in R100'
h (xi ), (xj )i
High-dim products, e.g. "(xi) in R1,000'
consuming!
Time consuming!!
!! Popular kernels:!
(1)! Gaussian kernels:!
Kij = e
kxi xj k22 /
(2)! Polynomial kernels: ! Kij = (ahxi , xj i + b)c

!"#$%&'(&%))*+'
,1'
Algorithm Properties!
Advantage: All computations are basically matrix computations
(linear algebra) Good news because most processors have an
architecture and libraries to perform very fast linear algebra calculus:!
(1) Intel Math Kernel Library (MKL) that includes Linear Algebra
Package (LAPACK) and Basic Linear Algebra Subprograms
(BLAS).!
(2) AMD Core Math Library (ACML) also includes LAPACK and
BLAS.!
Limitation: Still local minimizers, next approach will decrease the

number of bad local minimizers.!
Xavier Bresson
17
Kernel K-Means Spectral Approach!

!! Let us start again from the weighted kernel k-means objective:!
K X
X
min E(M, S) =
kxi mk k22
M,S
k=1 xi 2Sk
!! Let us rewrite it as a trace optimization problem under some constraints:!

after some linear algebra'
min
Y
tr(Y T 1/2 K1/2 Y ) s.t. Y T Y = IK and Y 2 SInd

Yik =
( P i
j2Sk j
1/2
if i 2 Sk
(weighted) indicator !
of clusters'
otherwise
max tr(Y T 1/2 K1/2 Y ) s.t. Y T Y = IK and Y 2 SInd

Y
as:'
!"#$%&'(&%))*+'
min
Y
E , max E
Y
,3'
Spectral Relaxation!
Q: What NP-hard means?!
!! Minimizing the objective is a NP-hard problem (i.e. would take forever)! !
max tr(Y T 1/2 K1/2 Y ) s.t. Y T Y = IK and Y 2 SInd

Y
This binary constraint makes!

the problem NP-hard'
!! Spectral relaxation: We drop the indicator constraint Y in Sind:!
max tr(Y T AY ) s.t. Y T Y = IK with A = 1/2 K1/2

Y
!! Solution [Spectral Theorem]: top K eigenvectors of matrix A given by EVD.!
!"#$%&'(&%))*+'
,4'
Eigen Value Decomposition (EVD)!

!! Suppose A is symmetric and positive semi-definite (SDP) (all kernel matrices
are sym. and SDP by construction), then EVD gives:!
Ayk =
k yk
max
with'
hyk , yk0 i =
!! Top K eigenvalues/eigenvectors: K largest #k: !
tr(Y T AY ) =
K
X
ykT Ayk =
k=1
K
X
ykT
k yk
k=1
k hyk , yk i
k=1
K
X
...
1 if k = k 0
0 otherwise
K
X
Y T Y = IK
k=1
K largest values
values!
max tr(Y T AY ) s.t. Y T Y = IK

Y
!"#$%&'(&%))*+'
-5'
Kernel K-Means Spectral Algorithm!

Three steps:!
(1) Compute: !
A = 1/2 K1/2
(2) Solve EVD for the top K eigenvectors:
k=1,..,K
Ayk =
k yk
(3) Binarization step: Use the solution Y as embedding coordinates of X and

apply standard K-Means on Y.!
Note 1: Remember original K-Means problem is NP-hard
approximate solution.!
indicator constraint Y in Sind
we drop the
Note 2: Spectral techniques (EVD) are vastly used in data analysis

because theory well understood. However, they do not scale well to big
data as EVD complexity is O(n3) (it exists however stochastic EVD).!
Xavier Bresson
21
Outline!
Definition!
!
Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!
Xavier Bresson
22
Balanced Cuts for Graph Partitioning!

Motivations:!
(1) Kernel techniques are sound, but they do not scale well to large datasets
because K is full: memory requirement is O(n2) and EVD is O(n3).!
Ex: n=1M
n2=1Tera, n3=1Yotta! !
Solution: Sparse matrices, but how? Data graphs!!
(2) Natural graphs/networks: Partitioning graphs is a fundamental
problem for (i) Identifying connected groups of data (users on Facebook),!
(ii) Balanced graph cutting for distributed big graphs computing. !
Solution: Balanced graph cuts. They play a central role in:!

(1) Graph theory (define families of networks and their properties).!
(2) Applications (state-of-the-art for unsupervised clustering).!
Xavier Bresson
23
Data Clustering = Balanced Cuts!

!! Graph representation: Given a set of data V={x1,,xn} in Rd, construct a
k-NN graph G=(V,E,W):!
k-NN graph!
construction !
V = {x1 , ..., xn } 2 Rd
G = (V, E, W )
!! Data clustering: Observe that data close on

graphs are similar
finding clusters of data
can be done by cutting the graph at some
specific locations.!
!"#$%&'(&%))*+'
-/'
Min Cut
[Wu-Leahy93]!
!! Cut operator: Given a graph G, a cut partitions G into two sets S and Sc
with value: !
X
c
Cut(S, S ) =
Wij
i2S,j2S c
!!
"#$&
"#$&'
"#$%'
!"!
"#$'''
Value of cut1:!
cut: Cut(S,Sc) = 0.3 + 0.2 + 0.3 = 0.8!
Value of cut2: Cut(S,Sc) = 0.5 + 0.5 + 0.5 + 0.5 = 2.0 !
Value of cut3: Cut(S,Sc) = 0.5 !
!! Min cut is biased: It favors small sets of isolated data.!

Solution: Find clusters of similar sizes/volumes while minimizing the cut
operator:!
Cut
Balanced cuts!!
min Cut and max V ol , min
V ol
!"#$%&'(&%))*+'
-0'
Most Popular Balanced Cuts!

!! Cheeger Cut [Cheeger69]: (most popular in
graph theory)!
c
min
S
Cut(S, S )
min(V ol(S), V ol(S c ))
Ckc
Ck
!! Normalized Cut [Shi-Malik00]: (most

popular in applications)!
Partitioning by min edge cuts.
Cut(S, S c ) Cut(S, S c )
+
min
S
V ol(S)
V ol(S c )
!! Normalized Association:!
Assoc(S, S) Assoc(S c , S c )
+
min
S
V ol(S)
V ol(S c )
Ck
Partitioning by max vertex matching.
with'
Cut(S, S ) =
Wij
#connections between S and Sc'
i2S,j2S c
V ol(S) =
di , with di =
i2S
Assoc(S, S) =
!"#$%&'(&%))*+'
Wij
j2V
i2S,j2S
Wij
#connections between all vertices in S'
-1'
!! Issue: Solving balanced cut problems is directly intractable as NP-hard
combinatorial problems
We need to find the best possible approximation
(close to original solution)
Best approximate techniques are based on
spectral relaxation.!
!! Normalized Association:!
min
Sk
K
X
Assoc(Sk , Sk )
V ol(Sk )
k=1
(1)
!! Reformulation: Rewrite combinatorial problem as a continuous

optimization problem:!
(1)
max
F
K
X
F T W Fk
k
k=1
FkT DF
(2)
with the change !

of variable:' Yk =
(2)
max tr(Y T AY ) s.t. Y T Y = IK , Y 2 SInd and A = D

Y
Yik =
!"#$%&'(&%))*+'
D1/2 Fk
kD1/2 Fk k2
1/2
WD
Dii
1/2
( V ol(S
)
)
k
0
1/2
if i 2 Sk
otherwise
-2'
Binary constraint Y in Sind makes the problem NP-hard - we drop it:!
max tr(Y T AY ) s.t. Y T Y = IK
Y
Solution: Top K eigenvectors of A given by EVD.!
Q: Have we seen this before?!
Xavier Bresson
28
Relationships between Kernel K-Means and

Balanced Cuts [Bach-Jordan04, Dhillon-Guan-Kulis04] !
Relationships:!
Kernel K-Means:!
max tr(Y T AY ) s.t. Y T Y = IK with A = 1/2 K1/2 , Y 2 SInd
Y
(1)
Balanced Cuts:!
max tr(Y T AY ) s.t. Y T Y = IK with A = D
Y
Equivalence:!
(1)
(2)
for
=D
1/2
WD
1/2
, Y 2 SInd
(2)
, K=W
Consequence: Balanced cut problems can also be solved by EM approach! !

Graclus algorithm [Dhillon-Guan-Kulis04]: It does not require EVD
Scale up to large datasets (one of the best graph partitioning techniques)!
Either EM or Spectral approaches still compute approximate solutions
we improve the quality of clustering/partitioning solutions?!
Xavier Bresson
Can
29
Demo!
!"#$%&'(&%))*+'
.5'
Outline!
Definition!
!
Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!
Xavier Bresson
31
NCut Algorithm
[Yu-Shi04] !
Motivation: We must drop the binary constraint Y in Sind or Y in {0,1}

to compute a solution of the NP-hard problem. Spectral solutions do not
naturally satisfy this constraint
let us try to enforce it.!
NCut: It is one of the best graph clustering algorithm!!

It has two steps:!
(1) Compute spectral solutions (as before).!
(2) Find best solution that satisfies the binary constraint (new step). !
Xavier Bresson
32
Demo: NCut!
!"#$%&'(&%))*+'
..'
Technical Details!
Step 1: !
Y ? = argmax tr(Y T AY ) s.t. Y T Y = IK with A = D
1/2
WD
1/2
Solved by EVD.
Step 2: !
min kZ
Z,R
Y ? Rk2F s.t. RT R = IK , and Z 2 {0, 1}

Solved by SVD.
Step 1: Y* !
Xavier Bresson
Step 2: Z (rotate Y*) !
34
EVD and SVD!

Q: Do you know SVD?!
EVD (Eigen Value Decomposition): A symmetric and PSD: !
A = Y Y T
Ayk =
k yk
SVD (Singular Value Decomposition): Generalization of EVD to

non-square and non-PSD matrix: !
T
U: Left singular vectors!
A
=
U
V
n x m
m x m
n x n
V: Right singular vectors!
n x m
: Singular values
With:
uk A = uk k
Avk = k vk
U T U = In
V T V = Im
EVD and SVD are matrix factorization techniques: very common tools in
(linear) data science: many techniques boil down to EVD and SVD. !
Xavier Bresson
35
Example!
Ncut on noisy real-world networks WEBK4 and CITESEER: !
Xavier Bresson
36
Outline!
Definition!
!
Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!
Xavier Bresson
37
PCut
[B-et.al.16]
State-of-The-Art !
!! Balanced Cuts are biased towards cluster outliers:!
!! Results:!
!"#$%&'(&%))*+'
.3'
Demo: PCut!
!"#$%&'(&%))*+'
.4'
Outline!
Definition!
!
Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!
Xavier Bresson
40
Clustering/Partitioning with !
Unknown Number of Clusters!
Recall: Previous techniques assume to know the number K of clusters.!
If K is unknown, there exist two approaches:!
(1) Define a quality measure of clustering (domain expertise), and use
previous techniques with dierent K values, pick the K with best
measure.!
(2) K is a variable of the clustering problem: Louvain Algorithm.!
Louvain Technique [Blondel-et.al.08]: Very popular in social sciences.
It is greedy algorithm that optimizes the modularity objective:!
max Q(f ) =
f
(Wij
min
Sk
K
X
k=1
2m
Cut(Sk , Skc )
2m =
ij
(fi , fj ) =
Wij = V ol(V )
1
0
if i = j
otherwise
V ol(Sk ).V ol(Skc )
Same eect as balanced cut, !

but extra parameter to select.
Xavier Bresson
with
ij
di dj ) (fi , fj )
min
Sk
K
X
k=1
Cut(Sk , Skc )
V ol(Sk )V ol(Skc )
41
Greedy Algorithm!
Q: What is a greedy algorithm?!
Step 1: Energy minimization step!
Find communities by minimizing locally the modularity.!
Each node is first associated to its own community then for each
node i, we assign i to the community of its neighbor that best
decreases the modularity. The process is repeated until no changes
occur.!
Step 2: Graph coarsening step!

Compute a new graph by merging the communities
to a super-vertex.!
From Step 1, a new graph is constructed by forming a new
X
adjacency matrix:!
new
Wkk0 =
Wij
i2Sk ,j2S
k
Note: Greedy algorithm:!
(1) (Relatively) fast algorithm!
(2) No theoretical guarantee on the solution (local minimizer)!
Xavier Bresson
42
Demo: Louvain!
!"#$%&'(&%))*+'
/.'
Outline!
Definition!
!
Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!
Xavier Bresson
44
Clustering/Partitioning !
Small-Scale Communities !
Motivation: How Facebook target small communities of users for
advertisement? !
Goal: Identify small-scale clusters on networks.!
Nibble Algorithm: Leskovec, Lang, Dasgupta, Mahoney08, Spielman,

Teng08, Reid, Chung, Lang06.!
Xavier Bresson
45
Nibble Algorithm!
Core principle: It is a greedy algorithm that optimizes locally the
Cheeger Cut on graphs: !
Iterate until K clusters are found:
Step 1: Pick a vertex randomly on graph.!
Step 2: Diuse the Dirac function of the vertex s:!
l+1
l
= Lf
with
L = In
1/2
WD
1/2
Step 3: Threshold f at the value that optimizes the Cheeger cut:!
Cut(S , Sc )
min ECheeger (S ) =
S
min(V ol(S ), V ol(Sc ))
Xavier Bresson
46
Demo: Nibble!
!"#$%&'(&%))*+'
/2'
Outline!
Definition!
!
Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!
Xavier Bresson
48
Summary!
Unsupervised Clustering'
unknown
K is unknown'
K is known'
Full Matrix'
Sparse Matrix!
But graph construction !
may be needed'
Data!
(no graph)'
Graph'
K-Means*'
Balanced Cuts!
Cheeger, Normalized Cuts/!
Associations'
Louvain Algorithm!
Greedy technique'
Kernel!
Construction'
NP-hard!
Spectral relaxation'
Clustering
Spectral Graph Clustering'
Kernel K-Means!
(1)! EM (Graclus*)!
(2)! Spectral'
Equivalence !
of solutions'
!"#$%&'(&%))*+'
Linear !
Relaxation '
Ncut*!
Loose relaxation!
of balanced cuts!
Medium size clusters'
Non-linear !
Relaxation '
Nibble
Nibble!
Pcut!
Small-scale!
Tight relaxation'
Clusters!
Greedy algorithm'
/4'
Transductive Clustering!
Previous techniques are fully unsupervised, no prior information about
class labels is given. !
Transductive clustering: when class labels are available, it usually boosts
the clustering results significantly, like 5-20%. However, it can be time
consuming to collect labeled data (trade-o).!
Note: Transductive clustering is dierent from semi-supervised classification.
Classification aims at learning a decision function for new data, clustering
objective is to classify given data (no new data are considered). !
Xavier Bresson
50
Conclusion!
Unsupervised clustering is one of the most generic data analysis tasks.!
(1) It is applied when basically nothing is new about the data.!
(2) It is a Lego block that can be used for all kind of data analysis tasks.!
Unsupervised data representation or feature extraction will be

studied in Lecture 7: PCA, NMF, Sparse Coding.!
Xavier Bresson
51
Ques8ons?
Xavier Bresson
52
Data Science!
Sept 12-14, 2016!

Lecture 5: Supervised Learning!
Xavier Bresson!
!
!"#$%&'(&%))*+'
,'
Outline!
Learning techniques!
Linear SVM!
Soft-Margin SVM!
Kernel Techniques!
Non-Linear/Kernel SVM!
Graph SVM!
Conclusion!
Xavier Bresson
Classes of Learning Techniques!

Q: Do you know the dierence between unsupervised, !
supervised, and semi-supervised learning?!
Three classes of learning algorithms (for clustering, classification,
recommendation, visualization, feature extraction, etc):!
(1) Unsupervised learning: Algorithms that do not use any prior
information.!
(2) Supervised learning: Algorithms that only use labeled data.!
(3) Semi-Supervised learning: Algorithms that use both labeled and
unlabeled data. !
Labeled data are gold data but they are generally expensive to produce,
while unlabeled data are usually cheap to get. Ex: Image recognition in
Computer Vision, millions of images are available with Google, but they are
unlabeled. How much time to label 1,000 images? !
Previous lecture focused on unsupervised data clustering.!
This lecture focuses on supervised and semi-supervised data classification,
particularly SVM techniques. !
Xavier Bresson
Support Vector Machine (SVM)!

!! SVM is a very popular classification technique (among top 10 algorithms in
data science). SVM is used in deep learning as loss function (later).!
!! We will cover the supervised and semi-supervised binary SVM
classification techniques: Given a set of labeled data that belongs to two
classes, construct a classification function that outputs the class of any new
data. !
V = {{xi , ì }ni=1 , xi 2 Rd , ì 2 { 1, +1}
data
data'
labels'
label : xi , ì = +1
f (x) = +1
8x 2 C1
label : xi , ì =
C1! C2!
!"#$%&'(&%))*+'
f (x) = 1
8x 2 C2
/'
History of SVM Techniques!
f (x) =
f (x) = +1
C1!
C2!
Linear SVM!
Supervised Learning!
[Vapnik-Chervonenkis63]!
!"#$%&'(&%))*+'
Linear SVM!
Laplacian SVM!
Supervised Learning! Semi-Supervised Learning!
[Boser,Guyon,Vapnik-Chervonenkis92]! [Belkin,Niyogi,Sindhwari06]!
0'
Outline!
Linear SVM!
Soft-Margin SVM!
Kernel Techniques!
Graph SVM!
Conclusion!
Xavier Bresson
Linear SVM [Vapnik, Chervonenkis 63]!

!! Assumption: Training (and test) data are linearly separable, i.e. data can
be perfectly separated by a hyperplane: !
C1!
C2!
hyperplane!
!! Linear SVM: Given a training dataset V={xi,li}, design a classification

function that assigns any new data x to the class that is best consistent with V. !
f : x 2 Rd ! { 1, +1}
!! Class of (linear) solutions: Given the assumption, determine the hyperplane
that best separates the two classes. Any hyperplane is parameterized by two
variables (w,b), where w is the normal vector of hyperplane, and b is the o"set value. !
Hyperplane equation:!
!"#$%&'(&%))*+'
hw, xi + b = 0
2'
Linear SVM Classifier!

!! Classification function: !
f (x) = sign(hw, xi + b) =
if x 2 C1
if x 2 C2
+1
1
hw, xi + b > 0
w
C1!
C2!
hw, xi + b = 0
hhw,
w, x
xi + b < 0
!"#$%&'(&%))*+'
3'
How to Find (w,b)? !

!! SVM idea: Define the best hyperplane by maximizing the margin d
between the 2 classes:!
margin '
d'
d
x+
C1!
C2!
!! Relationship between w and d: !

hw, x+ i + b = +1
hw, x i + b = 1
hw, x+
x i+b=2
d~ = x+
x = w
2
kwk22
d=
2
kwk2
'Maximize the margin d: ! max d , max

'Minimize the weight w !
!"#$%&'(&%))*+'
2
, min kwk22
kwk2
4'
Primal Optimization!
hw, xi i + b
ì = 1
!! Variable w is called primal variable. !

kwk22
Optimization problem w.r.t. w is! min
w
with constraints: !
fi = hw, xi i + b =
ì =
+1
1
+1
1
if x 2 C1
if x 2 C2
if x 2 C1
if x 2 C2
ì .fi
1 8i 2 V
hw, xi i + b
ì =
1
1
!! Summary: SVM classifier f is given by the solution of the Quadratic

Programming (QP) problem:!
min kwk22 s.t. ì .fi

w,b
Quadratic!
function!
!"#$%&'(&%))*+'
1 8i 2 V
Convex set!
(polytope)!
fi = hw, xi i + b
SVM !
classifier!
,5'
Support Vectors!
!! Definition: Data that are exactly localized on the margin hyperplanes.!
!! O"set value b: b is defined by:

by:!
ì .(hw, xSP
i i + b)
!"#$%&'(&%))*+'
1 = 0, 8xSP
!
i
bi = ì
hw, xSP
i i
b = E({bi })
,,'
Dual Problem!
Primal problem: ! min kwk22 s.t. ì .fi
w,b
1 8i 2 V
Dual problem: The dual variable is .!

Motivation: Nice to work with data products xi,xj (for the kernel trick). !
After some computations
1 T
min Q
0 2
With:
h, 1i s.t. h, 1i = 0
QP problem
Q = LKL
L = diag(`1 , ..., `n )
Kij = hxi , xj i
Linear kernel
Xavier Bresson
12
Optimization Algorithm!
Classification function: !
f (x) = sign(hw? , xi + b? )
= sign(?T LK(x) + b? )
Optimization scheme: Solution * given by iterative scheme: !
1
1
l=0
l=0
,
=
=
y
=
0
Initalization:
kQk
kLk
Iterate until convergence: l=0,1,2,...
l+1 = P
0 [( Q
+ In )
(k +
Ly k )]
y l+1 = y k + Ll+1
Xavier Bresson
? = 1
13
Demo: Standard/Linear SVM!

!"#$%&'(&%))*+'
,/'
Outline!
Linear SVM!
Soft-Margin SVM!
Kernel Techniques!
Graph SVM!
Conclusion!
Xavier Bresson
15
Soft-Margin SVM [Cortes-Vapnik 95]!

!! Motivation: Linear SVM suppose data are linearly separable, i.e. there exist
an hyperplane separating perfectly the data. !
Q: What happens in the presence of outliers? !
!! Soft SVM: Find an hyperplane that best separate the data (by maximizing
the margin) while allowing as few outliers as possible.!
!"#$%&'(&%))*+'
,1'
Modeling Errors with Slack Variables!

!! Slack variables ei: Measure the error of each data xi to be an outlier.!
!! New optimization:!
min kwk22 +
w,b
n
X
s.t. ì .fi
i=1
ei , e i
0 8i 2 V
Trade-o" between large margin!

and small errors!
!"#$%&'(&%))*+'
,2'
Dual Problem!
min kwk22 +
w,b
s.t. ì .fi
ei , e i
i=1
0 8i 2 V
!! Primal problem:!
n
X
!! Dual problem:!
After some computations'
min
0
0
1 T
Q
2
Only this trivial modification!!
h, 1i s.t. h, 1i = 0
With:'
Q = LKL
L = diag(`1 , ..., `n )
Kij = hxi , xj i
!"#$%&'(&%))*+'
,3'
Demo: Soft-Margin SVM !

!"#$%&'(&%))*+'
,4'
Hinge Loss Function!

min kwk22 +
w,b
min
w,b
kwk22
s.t. ì .fi
i=1
!! Primal problem:!
n
X
n
X
ei , e i
0 8i 2 V
Q: Do you know the

hinge function?!
Vhinge (fi , ì )
i=1
Vhinge (fi , ì ) = max(0, 1
fi .ì )
Popular SVM loss function'
!"#$%&'(&%))*+'
-5'
Several Other Loss Functions!
(1 fi .ì )2 if fi .ì 1
V`2 (fi , ì ) =
Quadratic/L2 loss:!
0
otherwise
(1 fi .ì )2 + |1 fi .ì | if fi .ì 1

VElasticN et (fi , ì ) =
Elastic Net loss:!
0
otherwise
Huber loss:!
VHuber (fi , ì ) =
8
<
:
1
2
1
2 (1
fi .ì
fi .ì )2
if fi .ì 0
if 0 < fi .ì 1
otherwise
Logistic loss (Facebook, Amazon):! VLogistic (fi , ì ) = e1
Xavier Bresson
fi .ì
21
Outline!
Linear SVM!
Soft-Margin SVM!
Kernel Techniques!
Graph SVM!
Conclusion!
Xavier Bresson
22
Kernel Techniques!
Very popular techniques (until deep learning).!
Reproducing Kernel Hilbert Space (RKHS): A space associated to

bounded, symmetric, PSD operators called kernels K that can reproduce any
smooth function f. !
Representer Theorem [Scholkopf, Herbrich, Smola, 01]: Any continuous and

smooth function in a RKHS can be represented as a linear combination of the
kernel function K evaluated at the data points: !
f (x) =
n
X
ai K(x, xi ) + b
i=1
Xavier Bresson
23
Interpretation of Representer Theorem!

Interpretation: The Representer Theorem is a powerful tool for function
interpolation in high-dim spaces: !
f (x) =
n
X
ai K(x, xi ) + b
i=1
with
Popular kernels:!
(1) Linear kernel:!
!
(2) Gaussian kernel:!
K(x, y) = hx, yi
K(x, y) = e
K(x, xi ) = e
kx yk22 /
kx xi k22 /
(3) Polynomial kernel:!

Xavier Bresson
K(x, y) = (ahx, yi + b)c

24
Feature Maps and Kernel Trick!

!! Definition: Any feature map $ defines a reproducing kernel K: !
def'
h (x), (y)i = K(x, y)
and inversely:!
def'
K (x, y) = h 0 (x),
(y)i
!! Summary:!
Representer !
Theorem:!
X
f (x) =
ai Kxi (x)
i
Reproducing!
Kernel K '
K(x, y) = h (x), (y)i

Norm of f:!
kf kHK = aT Ka
Bounded
Continuous!
Function f'
Feature Map!
$'
f (x) =
X
i
!"#$%&'(&%))*+'
Kernel trick:!
ai h (xi ), (xj )i
-0'
Outline!
Linear SVM!
Soft-Margin SVM!
Kernel Techniques!
Graph SVM!
Conclusion!
Xavier Bresson
26
Non-Linear/Kernel SVM !
[Bosser, Guyon, Vapnik 92]!
!! Motivation: Linear/soft SVM assume data are linearly separable (up to a few
outliers). For several real-world data, the hyperplane assumption is not
satisfied. A better separation is a non-linear hyperplane, that is a
hypersurface.!
!! Kernel Trick: Project data
into a higher-dim space with
feature map $ where the data
are linearly separable.!
Linear separator'
Non-linear separator'
!! Decision function in high-dim:!
f (x) = X
hw, xi + b
w=
i ` i xi
i
f (x) =
!"#$%&'(&%))*+'
X
i
!
!
f (x) X
= hw, (x)i + b
w=
i ì (xi )
i
i ì h (x), (xi )i + b =
X
i
i ì K(xi , x) + b
Kernelx!
Kernelx
-2'
Optimization!
Dual problem:!
min
1 T
Q
2
h, 1i s.t. h, 1i = 0
With: Q
= LKL
L = diag(`1 , ..., `n )
Kij = hxi , xj i
Kij = h (xi ), (xj )i
Same optimization algorithm J

Recall: We never compute explicitly and the products (xi),(xj), only
the Kernel matrix:
K(x, y) =
Xavier Bresson
(ahx, yi + b)c
2
2
e kx yk2 /
28
Demo: Kernel/Non-Linear SVM !

!"#$%&'(&%))*+'
-4'
General Supervised Learning!

!! Generalization:!
min kwk22 +
w
n
X
Vloss (fi , ì )
i=1
min kf k2HK +
f 2HK
n
X
i=1
Error for inaccurate !

predictions!
Regularity!
of f!
kf k2HK = kwk22 for f (x) = hw, xi

Representer theorem:!
f (x) =
n
X
Trade-o"!
ai K(x, xi )
i=1
Norm of in RKHS:!
!"#$%&'(&%))*+'
kf k2HK
Vloss (fi , ì )
= hf, f iHK =
fi fj Kij = f T Kf
ij
.5'
Outline!
Linear SVM!
Soft-Margin SVM!
Kernel Techniques!
Graph SVM!
Conclusion!
Xavier Bresson
31
Semi-Supervised Learning Graph SVM!

!! Motivation: It is time consuming to label data, but it is cheap to collect
unlabeled data! Besides, unlabeled data are without information: they hold
the geometric structure of data Best design of classification function
is with simultaneously labeled and unlabeled data.!
f (x) = +1
f (x) =
Labeled data +,-!

!"#$%&'(&%))*+'
Labeled data +,-!

Labeled data +,-!

Unlabeled data o!
Semi-Supervised Learning!
.-'
Semi-Supervised Learning Graph SVM!

Case of few labels: Semi-supervised learning is particularly wanted in
the case of few labels (extreme case is one label/class). Let us call!
n: the number of labeled data!
m: the number of unlabeled data!
Then, semi-supervised plays an important role in the case of n m.!
Q: How to design SVM to deal simultaneously

with labeled and unlabeled data? !
SVM on graphs!!
Xavier Bresson
33
Manifold Assumption!
Observation: Geometry of data is independent of labels!!
(Labeled and unlabeled) data are assumed to lie on a manifold, where the
classification will be carried out. !
Xavier Bresson
34
Manifold Assumption!
!! How to introduce the manifold geometry in SVM?!
! First, we approximate the manifold M with a neighborhood graph,
i.e. a k-NN graph.!
! Second, we add a regularization term that forces the classification
function f to be smooth on the manifold (/graph).!
!"#$%&'(&%))*+'
.0'
Optimization!
!! Optimization problem:!
min kf k2HK +
f 2HK
n
X
Vloss (fi , ì ) +
i=1
|rf |2
Dirichlet energy:!
(1)!It forces f to be smooth on M!
(2)!Derivative is "f=0 (heat di#usion)!
!! Discretization of Dirichlet on graphs:!
|rf |
X
ij
Wij |f (xi )
f (xj )|2 = f T L
Lf
Laplacian operator!
!"#$%&'(&%))*+'
.1'
Algorithm !
Semi-supervised SVM or Laplacian SVM [Belkin, Niyogi, Sindhwani06]:!
min kf k2HK +
f 2HK
n
X
Vhinge (fi , ì ) +
Gf
i=1
Lf
)
f (x) = sign
n
X
a?i K(x, xi )
i=1
a? = (I +
G LK)
HL?
1 T
Q
= arg min
0 l 2
?
with
Xavier Bresson
Q = LHK(I +
G LK)
h, 1i s.t. h, 1i = 0
1
HL
37
Demo: Graph SVM !

!"#$%&'(&%))*+'
.3'
Outline!
Linear SVM!
Soft-Margin SVM!
Kernel Techniques!
Graph SVM!
Conclusion!
Xavier Bresson
39
Summary!
f (x) =
f (x) = +1
C1!
C2!
Linear SVM!
!"#$%&'(&%))*+'
Linear SVM!
Laplacian SVM!
Supervised Learning! Semi-Supervised Learning!
[Boser,Guyon,Vapnik-Chervonenkis92]! [Belkin,Niyogi,Sindhwari06]!
/5'
Summary!
!! General supervised and semi-supervised optimization techniques:!
min kf k2HK +
f 2HK
Vloss
!"#$%&'(&%))*+'
n
X
Vloss (fi , ì ) +
G Rgraph (f )
i=1
Regularity!
of f!
Error for inaccurate !

predictions!
8
Hinge
>
>
>
>
< L2
L1
=
>
>
Huber
>
>
:
Logistic
8
< Dirichlet: krG k22
Total Variation: krG k1
Rgraph () =
:
Wavelets: kDwavelets k22
Graph regularization!
for unlabeled data!
/,'
Ques8ons?
Xavier Bresson
42
Data Science!
Sept 12-14, 2016!

Lecture 6: Recommender Systems!
Xavier Bresson!
!
!"#$%&'(&%))*+'
,'
Outline!
Recommender Systems!
Google PageRank!
Collaborative Recommendation!
Convex Optimization!
Content Recommendation!
Hybrid Systems!
Conclusion!
Xavier Bresson
Introduction!
Recommendation have become a central part of intelligent systems. !
Q: Where do you find recommender systems? !
Examples: Google search engine recommends webpages on internet,!

recommending movies on Netflix, friends on Facebook, products on Amazon,
jobs on LinkedIn, articles on NY Times website:!
Xavier Bresson
Outline!
Google PageRank!
Hybrid Systems!
Conclusion!
Xavier Bresson
Google PageRank!
!! A B$ Algorithm! " !
!! PageRank is an algorithm that ranks websites on Internet. It is at
the core of Google Search Engine, which introduced a revolution in 1998 as
ranking was previously done manually by humans.!
Q: Do you know how many webpages in 1998 and 2016? !
In 1998, the size of WWW was 2.4M of webpages.!
Today, on Aug 2016, the size of the WWW is 4.6B!!
!"#$%&'(&%))*+'
0'
PageRank Technique!
It is a sound technique as!
(1) Mathematically well defined.!
(2) Computationally ecient.!
Core idea: PageRank sorts the vertices of a directed graph G using the
stationary state of G.!
Definition: The stationarity and modes of vibration of graphs/networks
can be studied by EVD (Lecture 3) such that:!
Axl =
l xl
What is the operator A in the case of directed graphs? Or how to design a

meaningful stationary state of G?!
Perron-Frobenius Theorem.
Xavier Bresson
Perron-Frobenius Theorem!
!! Given a graph G=(V,E,W) defined by a stochastic and irreducible matrix
W, the PF theorem establishes that the largest left eigenvector (with
eigenvalue 1) is the stationary state (PageRank) solution:!
xTmax W = xTmax
Max left eigenvector '
max
= xTmax
Eigenvalue=1 '
'Consequence: xpagerank = solution of eigenproblem:!
x T W = xT
!"#$%&'(&%))*+'
2'
Stochastic Matrix!
!! Definition: A matrix W with the rows normalized as probability density
function:!
X
W ='
Wij = 1
W1 = 1
!! Interpretation: Wij = probability to move from vertex i to vertex j!
!! Make W stationary:!
W
!"#$%&'(&%))*+'
for
Dii
P
( j Wij )
=
0
if ith row6= 0
otherwise
3'
Irreducible Matrix !
!! Definition: A matrix W that represents a strongly
connected graph, that is W has, for any pair of
vertices (i,j):!
(1) A directed path from i to j.!
(2) A directed path from j to i. !
!! Make W irreducible:!
Wsi
Stochastic and irreducible matrix!

= Full matrix'
W + (1
In
)
n
Identity matrix'
Original matrix!
= Sparse matrix'
Common choice: "=0.85 '

!"#$%&'(&%))*+'
4'
Interpretation!
Q: What is a random surfer? !
Term (1-)In/n: It is equivalent to a random surfer/user who can jump to
any webpage. !
Whole model D-1Wsi + (1-)In/n: It represents a surfer/user who follows
the internet structure % of the time and suddenly, in (1-)% of the time, the
surfer clicks to a random webpage that has no connection to the provious page:!
Xavier Bresson
10
Naive Algorithm!
PageRank simple algorithm: Solve the EVD problem:!
Left eigenproblem
xT Wsi = xT
(xT Wsi )T = (xT )T

T
Wsi
x=x
Right eigenproblem
Limitations: Computing eigensolution is:!

(1) Slow O(n2)!
(2) Memory consuming O(n2)!
(3) Not easily parallelizable, eigenvectors are solutions of global linear systems!
EVD does not scale to big networks like Internet.
Xavier Bresson
11
Power Method !
[Mises, PollaczekGeiringer 29, Page, Brin, Motwani, Winograd98] !
Solution of WsiTx=x can be found by a fixed point iterative scheme: !

T k
xk+1 = Wsi
x
At convergence:
T 1
x1 = Wsi
x
T
x1 = xpagerank because solution of x = Wsi
x
Algorithm:!
1n
n
Initialization:
xk=0 =
Iterate until convergence:
xk+1 = D
Xavier Bresson
W T xk +
1n
(1
n
12
Properties!
Full matrix!
O(n2)!
!! (1) Memory e#ciency:
T
EVD: ! Wsi x = x
Power Method: ! xk+1 = D

Ex: For WWW, n=4.6B, E100.nn2!
!! (2) Speed e#ciency:
W T xk +
1n
(1
n
Sparse matrix!
O(E)!
EVD: O(n2) !
Power Method: The number of iterations to converge to a precision #
is controled: !
log
85
for = 0.85
10
kxk+1 xk k1
=
, and = 10
K=
for'
1833 for = 0.99
log10
!! (3) Highly parallelizable: Basic linear algebra

operations Ax.!
!! (4) Very popular: Power method is the best
technique to compute largest or smallest eigenvector.!
!"#$%&'(&%))*+'
,.'
Demo: PageRank!
!"#$%&'(&%))*+'
,/'
Outline!
Google PageRank!
Hybrid Systems!
Conclusion!
Xavier Bresson
15
Recommender Systems Collaborative Filtering!

Q: Do you know the Netflix Prize? !
!! Famous example: Netflix Prize 1M$!
Netflix is the biggest online movie company in the U.S. !
(launched in Europe in 2015) #Movies> 100,000 and
#Users> 50M.!
!
Netflix prize was a competition for the best algorithm that

can predict user ratings for movies based on previous ratings.!
ratings.
(Big) data: 480,189 users!
17,770 movies!
100,480,507 ratings only 0.011%'
!! Companies using collaborative filtering technique: Facebook, LinkedIn,

MySapce, LastFM.!
!"#$%&'(&%))*+'
,1'
Collaborative Filtering!
!! Formulation: Given a few ratings/observations Mij of movie j and user i, find
a low-rank matrix X that best fits the ratings. !
Recommendation !
= !
Matrix completion!
M'
!"#$%&'(&%))*+'
X'
,2'
Low-Rank Recommendation!
!! Definition: A low-rank matrix has many columns and rows that linearly
dependent. The rank of a matrix gives the number of independent rows and
columns. !
Nb of independent rows = 13!
Nb of independent cols = 15 !
rank(X) = max(13,15) = 15 '
X='
!! Low-rank assumption: This hypothesis is valid for many real-world

recommender systems. For Netflix movie recommendation:!
(1) There exist communities of users who rate movies the same way.!
(2) There are groups of movies that receive same ratings. !
!
Same assumptions for Amazon (users, products), LinkedIn (users, jobs), Facebook
(users, ads), etc.!
!"#$%&'(&%))*+'
,3'
Formalization!
!! Modeling:!
min rank(X)
s.t.
Noiseless case!
(Observations are clean)'
Xij = Mij
Xij = Mij + nij
Combinatorial !
NP-hard problem'
8ij 2 obs
8ij 2 obs
Noisy case!
(Observations may be corrupted)'
Relaxation needed!
Convex !
Relaxation!
!"#$%&'(&%))*+'
Non-Convex !
Relaxation!
,4'
Outline!
Google PageRank!
Hybrid Systems!
Conclusion!
Xavier Bresson
20
Convex optimization has become a very powerful tool in the last
decade in data science (2nd topic at NIPS conference, behind deep learning). !
Several state-of-the-art techniques are based on convex opt s.a. (sparse)
data representation, recommender systems, unsupervised clustering, etc. !
Classes of optimization problems in data science:!
(1) Linear programming (LP)!
(2) Quadratic programming (QP)!
(3) Smooth convex optimization !
(4) Non-smooth convex optimization !
(5) Non-Convex optimization!
Xavier Bresson
21
Linear Programming!
!! Linear programming (LP): very common!
min hc, xi
x
s.t.
Linear objective!
Ax b
Convex set!
(polytope)!
Convex set
vs. Non-convex set!
Ex: Bipartite graph matching!
!"#$%&'(&%))*+'
--'
Quadratic Programming!
!! Quadratic programming (QP):!
1 T
min x Qx
x
2
s.t.
Ax b
Ex: SVM!
!! Tikhonov/least-squares/ridge regression problem:!
min kAx
x
bk22 + kRxk22
A0 x = b0
Trick: Never solve exactly a linear system of equations Ax=b!!

Use fast approximate solution like a few (10-50) steps of conjugate
gradient echnique [Hestenes-Stiefel52]. '
!"#$%&'(&%))*+'
-.'
Smooth Convex Optimization!

Smooth convex optimization: !
min Fs (x)
x
s.t.
Ax b
Newtons algorithm: !
xk+1 = xk
[HFs (xk )]
Hessian !
Matrix
=!
Optimal!
time step
Advantages:!
rFs (xk )
Gradient!
vector
(1) (Very) fast convergence:!

F (xk )
F (x? ) =
O( k12 )
O(e k )
for convex functions

for strongly convex functions
(2) Also work for non-convex functions (but

captures local minimizers)
Xavier Bresson
24
Non-Smooth Convex Optimization!

Non-smooth convex optimization:!
min Fns (x)

x
s.t.
Ax b
Three classes of ecient algorithms:!

(1) Primal-dual techniques!
(2) ADMM techniques!
(3) Iterative shrinkage techniques (FISTA)!
Convergence rate: F (xk )
F (x? ) = O(
1
) (optimal J [Nesterov])
k2
Lasso (least absolute shrinkage) regression problem:!

min kAx
x
bk22 + kxk1
Encourage sparsity !
(feature selection)
Interpretation
Compressed sensing problem:!

Revolution in digital signal processing (beyond Shannons sampling theory).!
Xavier Bresson
25
Non-Convex Optimization!
!! No general theory for non-convex problems.!
!! Case-by-case math analysis.!
!! What always work: Standard gradient descent algorithm:!
min Fnc (x)

x
k+1
=x
@F k
(x )
@x
Time step:!
(1) Manual choice or !

(2) Automatic line search technique!
Be aware: Gradient descent techniques are slow! !

'Big issue/bottleneck in deep learning.!
!! Feel free to use convex optimization techniques, but use them with
safety.!
!"#$%&'(&%))*+'
-1'
Convex Optimization for !

!! Combinatorial problem for robust recommendation:!
min rank(X)
s.t.
min rank(X)) +
X
Xij = Mij + nij 8

8ij 2 obs
kIobs (X
M )k2F
with' (Iobs )ij =
1
0
if ij 2 obs
otherwise
promotes low-rank! promotes data fidelity !

and robustness!
Combinatorial !
NP-hard problem
problem'
Continuous (non-smooth) convex relaxation:!

min kXk? +
X
kIobs (X
M )k2F
p=min(m,n)
m,n)
k=1
SVD'
X = U V T
!"#$%&'(&%))*+'
k (X)|
Singular values!
= diag(
1 , ...,
p)
-2'
Primal-Dual Optimization!
!! Algorithm:!
X k=0 = M
Initialization:!
Y k=0 = 0
Iterate until convergence between primal variable X and dual

variable Y:!
Y k+1 = Y k
X
!"#$%&'(&%))*+'
k+1
Xk
U h1/ ()V T
k Y k+1 + k M
1 + k Iobs
SVD'
with'
U V
=Yk +
Xk
h (x)
-3'
Demo: Collaborative Filtering!

!"#$%&'(&%))*+'
-4'
Properties!
Advantages:!
(1) Unique solution (whatever the initialization)!
(2) Well-posed optimization algorithms!
Limitations:!
(1) Complexity is dominated by SVD O(n3)!
(2) Memory requirement is O(n2)!
Convex algorithms do not scale up to big data.!
Xavier Bresson
30
Non-Convex Techniques!
!! Combinatorial problem for robust recommendation:!
min rank(X) +
X
kIobs (X
promotes low-rank!
M )k2F
with' (Iobs )ij =
1
0
if ij 2 obs
otherwise
promotes data fidelity !

and robustness!
Combinatorial !
NP-hard problem
problem'
Continuous non-smooth and non-convex relaxation:!

min
L,R
1
1
kLk2F + kRk2F + kIobs (LR
2
2
2
M )k2F
rm
R'
L'
!"#$%&'(&%))*+'
nr
X=LR'
nm
r n, m
.,'
Properties!
Advantages:!
(1)Optimization problem is non-convex, but smooth and quadratic
conjuguate gardient, Newton, etc. !
(2) Big data optimization: As the objective is dierentiable, stochastic
!
!
gradient descent techniques can be applied large-scale opt and !

recommender systems.!
(3) Monotonicity property: There exists a class of factorization algorithms !
called non-negative matrix factorization (NMF) that is monotonic
!
(later discussed).!
E k+1 E k , 8k
Limitations:!
(1) Non-convex local minimizers (good initialization is essential)!
(2) Extra parameter: The rank value r needs to be fixed a priori.!
Xavier Bresson
32
Outline!
Google PageRank!
Hybrid Systems!
Conclusion!
Xavier Bresson
33
Recommender Systems Content Filtering!

!! Collaborative recommendation focuses on low-rank approximation
of ratings. !
Content recommendation looks for similarities between users and
between products.!
!! Companies using this class of recommendation: Amazon, IMDB.!
!! Formulation: Given (1) a few ratings Mij of product j and user i,!
(2) a set of user features and product features, !
find a matrix X that best fits the ratings and satisfies the similarities between
users and products.!
!! User features and product features:!
(1) User features/attributes: Genre, age, job, hobbies, etc!
(2) Product features/attributes: Field, price, age, new, etc !
!"#$%&'(&%))*+'
./'
How to Encode Similarity Information?!

!! Similarities between users/products are encoded by graphs, as graphs
naturally store the proximities between any pair of data!
Rows/users graph:!
Gr = (Vr , Er , Wr )
Network of users'
Cols/products graph:!
Gc = (Vc , Ec , Wc )
Network of products'
!"#$%&'(&%))*+'
.0'
Content Recommendation !
[Huang-Chung-Ong-Chen02]!
Cols/products graph:'
Gc = (Vc , Ec , Wc )
Rows/users graph:'
Gr = (Vr , Er , Wr )
Recommendation !
=!
Matrix completion'
M'
X'
'How to fill M using the graphs of users and products?!

!"#$%&'(&%))*+'
.1'
Formalization!
Simple idea: Diuse the ratings on the networks of users and products.!
Optimization formulation:!
min kXkdiff,G rows + kXkdiff,G cols +

X
kIobs (X
M )k2F
What is the diusion objective on graphs?!
kXkdiff,G
Xavier Bresson
37
How to Design the Di%usion Term?!

!! Observation: When user i is close to user i on
the graph Gr=Gusers, it means that the two users
i,i are similar, and so there is a high chance
they will rate all movies almost the same way.
Hence !
'rowi and rowi on X are expected to be
close!!
'Di$usion term should be designed to force the ratings to be

smooth on graphs. There exists a popular choice of graph smoothness
called Dirichlet objective:!
kXkdiff,G = kXkDir
!"#$%&'(&%))*+'
= tr(X T LX)
where L is the graph Laplacian.'
.3'
!! Optimization problem:!
min kXkdiff,G rows + kXkdiff,G cols +
X
min tr(X T Lr X) + tr(XLc X T ) +

X
kIobs (X
kIobs (X
M )k2F
M )k2F
'Problem is smooth and quadratic:!

(1)! It reduces to Ax=b: Conjugate gradient technique.!
!
(Im Lr + Lc In + Imn )X = M
Ax = b
(2) Stochastic gradient descent for large-scale dataset.!
!"#$%&'(&%))*+'
.4'
Demo: Content Filtering!

!"#$%&'(&%))*+'
/5'
Outline!
Google PageRank!
Hybrid Systems!
Conclusion!
Xavier Bresson
41
Hybrid Systems!
Q: Can we combine collaborative and content recommendation? !

!! Combine collaborative and content techniques for enhanced recommender
systems: [Ma-et.al11, Bresson-Vandergheynst14]!
!! Formulation: Given a few ratings Mij of product j and user i, user features
and product features, design a recommender systems that best benefits from:!
(1) Collaborative filtering by constraining X to be low-rank,!
(2) Content filtering by enforcing graph regularizing on X.!
!"#$%&'(&%))*+'
/-'
Formalization!
Optimization:!
min kXk? +
X
Xavier Bresson
tr(X T Lr X) +
tr(XLc X T ) +
kIobs (X
M )k2F
43
Demo: Hybrid Filtering!

!"#$%&'(&%))*+'
//'
State-of-the-Art!
!! Limitation: Graph Dirichlet regularization/smoothness forces:!
(1)!Two rows/columns of X to be similar if they are close on graphs ".!
!
(2) X to be smooth on graphs !This assumes the ratings to vary !

smoothly on the graphs #. This assumption is actually limited and a !
better one is to force X to be piecewise constant on graphs.!
Graph Dirichlet regularization/smoothness!
Graph TV regularization/smoothness!
!! State-of-the-art technique available on GitHub:!

678)9::;$<6=>?@*A:B$B*6):&%@*;'
!"#$%&'(&%))*+'
/0'
Outline!
Google PageRank!
Hybrid Systems!
Conclusion!
Xavier Bresson
46
Fundamental Property of !
Prediction !
Error!
(the lower !
the better)'
Collaborative Filtering/ low-rank approximation'
regularization'
Content Filtering/ graph regularization
Hybrid=!
Content'
Hybrid recommender!
recommender
System'
System
number
Small number!
of ratings'
Hybrid=!
Collaborative'
Large number!
number
of ratings'
#Available!
Observations/!
ratings'
!! Conclusion:!
(1) If not enough ratings, then focus on collecting data features!
(2) When enough ratings, then give less importance of features!
!"#$%&'(&%))*+'
/2'
Summary!
PageRank for data ranking according to pairwise relationships.!
Recommender systems are based on:!

(1) Collaborative filtering!
(2) Content filtering!
(3) Hybrid formulation !
Optimization:!
(1) For small-scale recommendation (n<10K), use convex techniques. !
(2) For large-scale recommendation, prefer non-convex techniques. !
Xavier Bresson
48
QuesDons?
Xavier Bresson
49
Data Science
Sept 12-14, 2016!

Lecture 7: Feature Extraction!
Data Representation!
Xavier Bresson!
!
!"#$%&'(&%))*+'
,'
Outline!
The Feature Extraction Problem!
Standard Principal Component Analysis (PCA)!
Sparse PCA!
Robust PCA!
PCA on Networks!
Non-Negative Matrix Factorization (NMF)!
Sparse Coding (SC)!
Conclusion!
Xavier Bresson
Q: What are features? !
!! Goal: Find the best possible representation of data that reveal special
structures useful for further applications (like classification, recognition, etc).!
Apply!
Filters!
Meaningful!
Features!
Raw data!
!! There are two types of filters for feature extraction:!

(1) Handcrafted features from domain expertise. !
(2) Learned features from data linear or non-linear representations. This
approach has become dominant. !
!"#$%&'(&%))*+'
.'
Handcrafted Features!
!! Domain expertise: Handcrafted features are domain-dependent, i.e. designed
from experts in specific fields with years of experience (usually not
generalizable to other fields).!
!! Popular example: SIFT - Best image features in Computer Vision, used for
many applications such as image recognition. It needed 30 years of experience
(1966-1999) to design good image features! !
Image!
SIFT!
Filters!
SIFT!
Features!
Features for!
image/object !
recognition!
!"#$%&'(&%))*+'
/'
Learned Features From Data!

Paradigm shift: Handcrafted filters/features have been successful for
decades but the emergence of big data has made available enough data to
learn the best features from data directly, without handcrafting anything. !
Besides, deep learning has showed us how to extract highly meaningful
features from data.!
Learned filters: There exist two classes of learned filters depending on

the mathematical data representations:!
(1) Linear representation!
(2) Non-linear representation!
Xavier Bresson
Linear Representation!
!! Formulation:!
z = Dx
Features!
or coe"cients of x!
in the dictionary D'
Dictionary !
of filters!
(or basis functions)'
3
hD1, , xi
7
6
..
z = Dx = 4
5
.
hDK, , xi
2
High-dim!
data'
zi = hDi,
i, , xi
ith coe"cient'
!"#$%&'(&%))*+'
ith filter'
1'
!! How to learn A and z?!
'Techniques available: PCA, ICA, NMF, Sparse Coding, etc. !
!! Which technique to choose?!
Each technique assumes di"erent assumptions on data. Pick the one that
follows your data properties (later discussed).!
!! Matrix factorization: Linear representation of data can also be seen as a
matrix factorization problem:!
X = DZ
d'
Kxd
nxd
d'
n x K'
!"#$%&'(&%))*+'
2'
Non-Linear Representation!
Non-linear mapping : !
Linear representation:
Non-linear representation:
x
x
!
!
z = Dx
z = '(x)
(z 6= Dx)
These techniques are called non-linear dimensionality reduction

techniques, and they are used for feature extraction, classification,
visualization, etc. !
Examples: !
(1) Non-linear PCA, Locally Linear Embedding (LLE), Laplacian
Eigenmaps, t-Distributed Stochastic Neighbor Embedding (t-SNE)
(Lecture 8).!
(2) Deep Learning (Lectures 9-12).!
Xavier Bresson
Linear vs. Non-Linear Representations!

!! Which representation to use? The answer depends on the type of data
distributions:!
If data follow a Gaussian Mixture Model (GMM) like human faces, it is
then enough to use linear data representation (and useless to use non-linear
techniques).!
However, if data follow complex distributions like text documents then they
require non-linear techniques. !
Q: What is the shape of Gaussian Model? !
!"#$%&'(&%))*+'
4'
Outline!
Sparse PCA!
Robust PCA!
PCA on Networks!
Sparse Coding (SC)!
Conclusion!
Xavier Bresson
10
Principal Component Analysis!

Q: Who does *not* know PCA? !
History: PCA introduced by Pearson in 1901 in physics.!
PCA is the most popular technique for linear representation (also
one of the top 10 techniques in data science). !
Formulation: Given a set of data, PCA aims at projecting data onto an
orthogonal basis that best captures the data variance. !
Consequence: The first basis function or principal direction v1 captures the
largest possible variance of data.!
The second basis function or principal direction v2 captures the largest possible
variance of data while satisfying orthogonality constraint v1,v2=0. etc !
Xavier Bresson
11
Formalization!
PCA defines an orthogonal transformation that maps the data to a new
coordinate system (v1,v2,,vK) called principal directions such that the vks
capture the largest possible data variances. !
Notes: !
(1) PCA requires the data to be centered.!
(2) PCA does not say anything about data normalization, but its
analysis may change (PCA is not invariant w.r.t. data normalization).!
Xavier Bresson
12
Covariance Matrix!
!! Definition: The Covariance Matrix C is defined as!
C = XT X
d x d'
X= n x d data matrix!
n = number of data!
d = number of dimensions'
d x n'
!! Covariance matrix C encodes all data variances along each dimension:!

C = hX, , X, i = kX, k22 =
C = hX, , X, i =
n
X
n
X
i=1
Xi Xi =
i=1
|Xi |2 =
n
X
n
X
x2i,
i=1
xi, xj,
i=1
Variance of data !
along #th dimension '
Covariance of data !
along #th and 6th dimensions '
Note: xi are zero-centered along each dimension #:'

E({xi }) =
!"#$%&'(&%))*+'
n
X
i=1
xi, =
n
X
i=1
Xi = 0 8
,.'
PCA = EVD of Covariance Matrix!

Cv1 =
!! Proof: Let us show that !
Principal direction !
(PD) v1!
1 v1
Largest variance!
of data along PD v1!
Vector v1 = 1st principal direction!

= Direction of largest data variance!
= argmax
i=1
|hxi , vi|
i|2
Square distance of!

data projected on v'
kvk2 =1
n
X
Matrix notation'
= argmax
kvk2 =1
kXvk22 = (Xv)T (Xv) = v T X T Xv = v T Cv
Spectral solution: Solution v1 is the largest eigenvector of C'

Cv1 =
!"#$%&'(&%))*+'
1 v1
v1T Cv1
v1T
1 v1
2
1 kv1 k2
= argmax
kvk2 =1
n
X
i=1
|hxi , vi|2
,/'
PCA = EVD of Covariance Matrix!

Vector v2 = 2nd principal direction!
= Direction of second largest data variance!
= argmax
kvk2 =1
v T Cv s.t. hv, v1 i = 0
)
Cv2 =
2 v2 ,
with
Vector v3 = 3rd principal direction !

Matrix factorization: Full EVD of C!
C = V V T
with
Xavier Bresson
V = v1 , ..., vd , V T V = Id , = diag(
1 , ...,
d)
15
Principal Components!
Definition: PCs are the coordinates of original data projected into the basis
of principal directions (PDs): !
Xpca = XV
Xavier Bresson
16
PCA = SVD of Data Matrix!

!! Matrix factorization: !
X = U V
V T with U U T = In , V V T = Id
n x d'
nxn
n'
n x d'
d x d'
!! Relationship between EVD and SVD:!

T
C = X T X = (U V T )T (U V T ) = V (U T U )V T = Vsvd Vsvd
T
= Vevd Vevd
8
< Vsvd = Vevd
2 = ! k = k2
:
Xpca = XVevd = Usvd
!! Q: PCA with EVD or SVD? It depends on the size of data matrix X:!
(1) For d>n: use SVD.!
(2) For d<n: use EVD.!
!"#$%&'(&%))*+'
,2'
PCA as Dimensionality Reduction!

!! Essential observation: Most (linear) data are concentrated along the
first principal directions:!
xi = hxi , v1 i + hxi , v2 i
hxi , v1 i
The first PDs are enough to provide a good data representation, i.e.!
kX
XK k22 small for a small K

Approximation of X !
with first K PDs:'
XK = U K VKT
Truncated 7 with first K !

data variances'
!"#$%&'(&%))*+'
Truncated 8 with !
first K PDs'
,3'
How to select K?!

!! Rule: Data variance is captured by each principal direction, then if one wants
to retain 90% of total variance then K must be selected as follows: !
PK
k=1
Pd
k=1
k
k
0.9
90% of total variance'
K
YaleBFaces dataset!
!"#$%&'(&%))*+'
structure'
noise'
,4'
PCA as Visualization Tool!

Use first principal components for visualization: If high-dimensional
data are linear, follow a GMM distribution then they can be visualized in
2D, 3D spaces.!
Xavier Bresson
20
Demo: Standard/Linear PCA!

!"#$%&'(&%))*+'
-,'
Outline!
Sparse PCA!
Robust PCA!
PCA on Networks!
Sparse Coding (SC)!
Conclusion!
Xavier Bresson
22
Sparse PCA!
Q: Is PCA interpretable? !
!! Motivation: Standard PCA is able to !
(1) Capture most variability information contained in data.!
(2) Identify uncorrelated information (because principal directions are
orthogonal).!
!! However, PCA is limited in feature interpretation: It is hard to to identify
the most relevant features for each principal direction.!
!! Example: Analysis of genes with standard PCA gives: !
Gene1!
Gene2!
Gene3!
'
Q: What genes are the !

most meaningful?!
'Solution: Sparse PCA!

!"#$%&'(&%))*+'
-.'
Sparse PCA Techniques!

Mainly, three classes exist:!
(1) Lasso PCA!
(2) Elastic PCA!
(3) Power PCA (based on power method) !
Lasso PCA: !
Advantage: Great feature selection technique as it finds sparse, accurate
and robust solutions.!
Limitation: The number of features that can be selected by Lasso is at
most n, the number of data. It may be an issue in some applications like in
bioinformatics: !
n = # microarray data = 100!
d = # gene predictors = 10,000!
Lasso PCA can select at most 100 genes. Solution: Elastic PCA. !
Xavier Bresson
24
Elastic PCA!
!! Elastic PCA solves an elastic net regression problem:!
min kX
A,B
XBAT k2F +
2
2 kBkF
Data fidelity!
term!
Bk1
1 kBk
L1 term forces!
sparse solution!
Elastic net!
regression!
?
B,j
=
? k
kB,j
2
Sparse principal directions: !
sPDj = V,j
Sparse principal components: !
Xspca = XV
!"#$%&'(&%))*+'
s.t. AT A = IK
-0'
Algorithm!
Optimization problem is non-smooth but convex (separately):!
Am=0 = VKpca (standard PCA)
Initialization:
Iterate until convergence:
Step 1:
B m+1 = argmin kX
XBAm k2F +
1 kBk1
(1)
Step 2:
Am+1 = argmin kX
XB m+1 Ak2F s.t. AT A = IK
(2)
2
2 kBkF
Problem (1) can be solved eciently by FISTA.!

!
Problem (2) can be solved by SVD.
Xavier Bresson
26
Demo: Sparse PCA!

!"#$%&'(&%))*+'
-2'
Outline!
Sparse PCA!
Robust PCA!
PCA on Networks!
Sparse Coding (SC)!
Conclusion!
Xavier Bresson
28
Robust PCA!
Q: Is PCA robust to outliers? !
!! Motivation: Standard/sparse PCA are

sensitive to outliers, i.e. a single outlier may
change significantly the true PCA solution. !
!! Solution: Robust PCA is a technique that

separates the outliers from the clean data
where PCA is performed. !
!"#$%&'(&%))*+'
-4'
Formalization!
!! Standard PCA:!
!! Robust PCA:!
min kX
L
Lk2F s.t. rank(X) = K
min rank(X) +
L,S
card(S)) s.t. X = L + S
Data'
NP-hard combinatorial problem

problem!
'Continuous relaxation needed.'
)
min kXk? +
L,S
(1)
Low-rank matrix!
Standard PCA!
(structure)'
Sparse matrix!
captures outliers!
(no structure)'
Convex !
relaxation'
kSk1 s.t. X = L + S
(2)
Strong result: Solution of (2) is almost the solution of (1)!!
!"#$%&'(&%))*+'
.5'
Algorithm!
!! ADMM technique: Fast, robust and accurate solutions.!
Initialization:!
Lm=0 = X
S m=0 = Z m=0 = 0
Iterate until convergence:!

svd
Lm+1 = U h1/r ()V T with U V T = X
S m+1 = h /r X Lm+1 + Z m /r
Z m+1 = Z m + r(X
!"#$%&'(&%))*+'
Lm+1
S m+1 )
S m + Z m /r
h (x)
.,'
Demo: Robust PCA!

!"#$%&'(&%))*+'
.-'
Outline!
Sparse PCA!
Robust PCA!
PCA on Networks!
Sparse Coding (SC)!
Conclusion!
Xavier Bresson
33
PCA on Graphs!
Q: Can do PCA on networks like Facebook? !
!! Motivation: When data similarities are available or can be computed,
it enhances PCA.!
!! Formalization:!
min rank(X) +
L,S
card(S) +
G kLkG smooth
s.t. X = L + S
Force smoothness !
on graphs!
)
min kXk? +
L,S
!"#$%&'(&%))*+'
kSk1 +
Continuous!
convex !
relaxation'
G kLkG Dir
s.t. X = L + S
./'
Demo Video Surveillance!
!! Separate background from moving objects: State-of-the-art [ICCV15]!
!"#$%&'(&%))*+'
.0'
Outline!
Sparse PCA!
Robust PCA!
PCA on Networks!
Sparse Coding (SC)!
Conclusion!
Xavier Bresson
36
Non-Negative Matrix Factorization!

Q: Is PCA the best linear data representation? !
!! Motivation: PCA vs. NMF!
PCA learns the main variations of data.!
Data representation is based on main directions of data variations.!
NMF learns the most common parts of data.!
Data representation is based on main common data parts.!
[Lee-Seung99]'
PCA'
!"#$%&'(&%))*+'
NMF'
.2'
Matrix Factorization!
!! PCA and NMF are both factorized models:!
svd
PCA : X = U V T
m!
m!
NMF : X = LR with L, R
r! r!
R!
L!
LR!
0
n!
X!
=!
n!
Essential constraints to !
Identify data parts!
!! Dimensionality reduction technique: L,R are small low-rank matrices

(compared to X). They can be interpreted as compressed features! !
Movies!
Users!
X!
10'
Users!
L!
Compressed !
User features'
!"#$%&'(&%))*+'
10'
Movies!
R!
Compressed !
Movie features'
.3'
!! Text document representation: !
ri
40,000 words'
m!
20,000
text
n!
documents'
xi
=!
Lri
xi = Lri
'Each document is represented by a linear
combination of compressed word features.!
'Same for word:'
!"#$%&'(&%))*+'
xj = R T `j
.4'
How to Compute L,R?!

Factorization of the form:!
X = LR with L, R
can be solved by optimization these loss functions:!

!
(1) Least-squares loss:!
min kX
L,R 0
LRk2F
!
(2) Kullback-Leibler (relative entropy) loss: (histogram distances)!
min KL(X, LR) =
L,R 0
Xavier Bresson
X
ij
Xij log
(LR)ij
Xij
40
Algorithms!
Several techniques exist:!
(11) Multiplicative update techniques!
Advantage: Monotonic.!
Limitation: Slow to converge.!
!
(2) ADMM, Primal-Dual techniques!

Advantage: Fast.!
Limitation: No theoretical guarantee.!
!
(3) Power Methods (most recent)!

Advantage: Fast.!
Limitation: No theoretical guarantee.!
Xavier Bresson
41
Outline!
Sparse PCA!
Robust PCA!
PCA on Networks!
Sparse Coding (SC)!
Conclusion!
Xavier Bresson
42
Sparse Coding!
Q: Can do better than PCA and NMF? !
!! Motivation: PCA and NMF make strong assumptions about the dictionary
D used for linear representation: !
z = Dx
dictionary '
PCA!
D captures main directions!
of data variations!
NMF!
D captures main !
common parts of data!
Q:'How to relax these assumptions to learn more generic filters?!

!! Sparse coding (recently a.k.a. dictionary learning): !
New data assumption: Represent data as a sparse linear
combination of a few filters.!
!! Note: This is the best linear representation and feature extraction
technique, and for any kind of data! A class of deep learning
techniques use sparse coding at core feature extraction deconvolutional
neural networks. !
!"#$%&'(&%))*+'
/.'
Formalization!
!! Optimization problem: !
n
X
min
kxj
D,zj
j=1
Dzj k22 + kzj k1 s.t. kDi, k2 1 8i

Forces !
sparsity'
Controls!
filter energies'
Eni = kDi, k2 =
1
0
Algorithm can learn!

the best number of filters'
zj
Di,
=0
!"#$%&'(&%))*+'
//'
Algorithm !
!! Non-smooth and convex optimization:!
n
X
Dzj k22 + kzj k1 s.t. kDi, k2 1 8i
min kX
DZk2F + kZk1 s.t. kDi, k2 1 8i
Z,D
j=1
Z,D
Initialization:!
kxj
min
Dm=0 = randn
Iterate until convergence:!
Z m+1 = arg min kX
Dm Zk2F + kZk1
Dm+1 = arg min kX
DZ m+1 k2F s.t. kDi, k2 1 8i
'Each sub-optimization problems can be solved !

e"ciently by FISTA.!
!"#$%&'(&%))*+'
/0'
Demo: Sparse Coding!

Learned Dictionary= !
Human visual filters (V1 cells) !
in the primary visual cortex!
!"#$%&'(&%))*+'
/1'
Outline!
Sparse PCA!
Robust PCA!
PCA on Networks!
Sparse Coding (SC)!
Conclusion!
Xavier Bresson
47
Summary!
Feature Extraction Problem: !
(1) Handcrafted filters/features: less popular.!
(2) Learned filters/features: more and more common.!
Learned filters = Data representation problem:!

(1) Linear Representations!
(2) Non-linear representations (next lecture, and deep learning lectures) !
Linear Representations:!
(1) PCA: based on data variances.!
(2) NMF: based on positive common parts of data!
(3) Sparse Coding: based on sparse representation (highly adaptable !
technique)!
Xavier Bresson
48
Outline!
Sparse PCA!
Robust PCA!
PCA on Networks!
Sparse Coding (SC)!
Conclusion!
Xavier Bresson
49
Ques;ons?
Xavier Bresson
50
Data Science!
Sept 12-14, 2016!

Lecture 8: Data Visualization!
Xavier Bresson!
!
!"#$%&'(&%))*+'
,'
Outline!
Visualization Problem!
Kernel PCA!
Locally-Linear Embedding (LLE)!
Laplacian Eigenmaps!
T-SNE!
Conclusion!
Xavier Bresson
!! Data visualization is the same problem as!
(1) Data representation!
(2) Feature extraction!
!! Data representation looks for the best filters or dictionary D where
the data x can be represented, and the projected data z on D are used
as coordinates for 2D or 3D visualization:!
zi 2 R 2
2D visualization'
xi 2 R d , d
1
3D visualization'
!"#$%&'(&%))*+'
zi 2 R 3
.'
Visualization Techniques!
!! Visualization techniques are also dimensionality reduction techniques
because they aim at mapping data into a much lower-dimensional space,
2D or 3D Euclidean spaces.!
!! Linear dimensionality reduction (LDR) techniques.!
Assumption: Data that can be represented on a low-dimensional
hyperplane.!
d
R , d
xi
+'
+'
+'+'
+'+'
+'
xi 2 Rd
!"#$%&'(&%))*+'
Hyperplane Rm , m d
LDR!
Find a linear!
mapping A s.t.!
A : xi ! zi
Am,
+'
+' z
i
+'
+' +'
A1,
zi 2 R m = A x i
/'
Non-Linear Dimensionality Reduction!

!! Assumption: Data that can be represented on low-dimensional curved
spaces, i.e. manifolds:!
Manifold M Rd ,
Rd , d
dim(M) = m d
xi
+'+' +'
+' +'+'
+'
NLDR!
zi
Find a non-linear!
mapping " s.t.!
xi 2 R
' : xi ! z i
zi 2 Rm = '(xi )
!! Non-linear dimensionality reduction (NLDR) techniques are also called

manifold learning techniques.!
!"#$%&'(&%))*+'
0'
Demo: LDR vs. NLDR!

!"#$%&'(&%))*+'
1'
Outline!
Kernel PCA!
T-SNE!
Conclusion!
Xavier Bresson
Kernel PCA
[Scholkopf-Smola-Muller97]!
!! Standard PCA:!
2
6
Gram matrix:' G = XX = 6
6
4
T
hx1 , x1 i hx1 , x2 i . . .
hx2 , x1 i hx2 , x2 i
..
..
.
.
hxn , xn i
7
7 EV D
7 = U DU T ) Xpca = U D1/2
5
!! Kernel PCA: Gram matrix in higher-dim space:!

2
h (x1 ), (x1 )i
6 h (x2 ), (x1 )i
6
G=6
..
4
.
h (x1 ), (x2 )i
h (x2 ), (x2 )i
...
..
.
h (xn ), (xn )i
7
7 EV D
7 = U DU T ) Xkpca = U D1/2
5
Apply the kernel trick to the Gram matrix:!
K(x, y) = h (x), (y)i = e
kx yk22 /
Never computed!!
!"#$%&'(&%))*+'
3'
Demo: Kernel PCA!

!"#$%&'(&%))*+'
4'
Outline!
Kernel PCA!
T-SNE!
Conclusion!
Xavier Bresson
10
Locally-Linear Embedding (LLE)
[Roweis, Saul00]!
!! Motivation: Design a mapping from high-dimensional space to lowdimensional space such that the geometric distances between
neighbor data are preserved.!
xi
+'
+
Rd , d
+ xj
+'
'
j
xi x+
+'
+'
+
R3
!! Description: LLE aims at computing a manifold M by locally linear fits,

that is each data on M and its neighbors lie on a locally linear patch of
M. !
!"#$%&'(&%))*+'
,,'
Algorithm!
Step 1: For each data xi, compute the k nearest neighbors.!
Step 2: Compute linear patches: find the weights Wij which best linearly
reconstruct xi from its neighbors:!
min
W
n
X
xi
i=1
Wij xj
2
2
s.t.
X
j
Wij = 1 8i
Solution: Ax=b
Step 3: Compute the low-dimensional embedding data zi, which is best

reconstructed by the weights Wij: !
min
Z=[z1 ,...,zm ]
n
X
i=1
zi
X
j
Wij zj
2
2
s.t.
zi = 0, Z T Z = Im
Solution: EVD
Xavier Bresson
12
Demo: LLE!
!"#$%&'(&%))*+'
,.'
Outline!
Kernel PCA!
T-SNE!
Conclusion!
Xavier Bresson
14
Laplacian Eigenmaps
[Belkin, Niyogi03]!
Very popular visualization technique.!

Motivation: Same as LLE but stronger mathematical analysis and
understanding. !
Manifold assumption: Data are sampled from a manifold M
represented by a k - nearest neighbors graph.!
Xavier Bresson
15
Dierential Geometry!
Eigenfunctions vk of continuous Laplace-Beltrami M serves as
embedding coordinates of M:!
Note: Discretization of M = graph Laplacian L !
Xavier Bresson
16
Formalization!
1D Visualization: Map a graph G=(V,E,W) to a line such that neighbor
data on G stay as close as possible on the line.!
Note: We look for the mapping but we never compute it explictely.!

We look for the coordinates zi of xi on the low-dimensional manifold M
such that yi=(xi), that is:!
min
y
Wij (yi
yj ) 2
(1)
ij
Interpretation: As Wij=1 if xi close to xj, then miny Wij(yi-yj)2 implies

that yi to be close to yj.!
Xavier Bresson
17
Generalization to 2D and 3D!

!! K-D Visualization: Generalizing (1) to K dimensions is straightforward:!
min
Y
X
k
T
Y,k
LY,k = tr(Y T LY ) s.t. Y T Y = IK
Graph Laplacian!
!! Spectral Solution: Top K eigenvectors of graph Laplacian L.!

EV D
L = U U T ! Y = UK
!! Advantages:!
(1) Global solutions (independent of initialization)!
(2) Fast algorithms!
!"#$%&'(&%))*+'
,3'
Demo: Laplacian Eigenmaps !

MNIST !
PCA!
!"#$%&'(&%))*+'
MNIST !
Lap Eigenmpas!
USPS!
Lap Eigenmpas!
,4'
Outline!
Kernel PCA!
T-SNE!
Conclusion!
Xavier Bresson
20
T-SNE
[van der Maaten, Hinton08]!
Q: What visualization technique is the most used? !

!! t-Distribution Stochastic Neighbor Embedding (t-SNE) is the most popular
visualization technique, among the top 10 algorithms in data science.!
!! Model description: t-SNE learns the mapping/embedding # function, such
that yi=#(xi), by minimizing the Kullback-Leibler distance between the
distribution of high-dim data and the distribution of the computed low-dim
data: !
e
pij = P
kxi xj k22 /
ke
2
i
kxi xk k22 /
i
2
i
= k th nearest neighbor distance from xi
(1 + kyi yj k22 ) 1
qij (y)
y)) = P
yk k22 )
k (1 + kyi
Embedding coordinates of high-dim data!
!"#$%&'(&%))*+'
-,'
Optimizing Kullback-Leibler !
Problem:!
min KL(P, Q(y)) =

y
ij
yim+1
yim
X
j
(pij
pij log
pij
qij (y)
Gradient descent technique
qij )(1 + kxi
xj k2 )
(yim
yj )
Advantages:!
(1) Local distance preservation (as LapEig, LLE): minimizing KL !
forces qij to be close to pij, the distribution of high-dim data.!
(2) t-SNE does not assume the existence of a manifold: More !
flexibility to visualize more complex hidden structures.!
Limitations:!
(1) Non-convex energy Existence of bad local solutions, problem of
initialization (PCA is used as initialization).!
(2) Slow optimization (gradient descent).!
Xavier Bresson
22
Demo: T-SNE!
!"#$%&'(&%))*+'
-.'
Outline!
Kernel PCA!
T-SNE!
Conclusion!
Xavier Bresson
24
Summary!
Data'
Non-Linear
Non-Linear!
Structure
Structure'
Linear!
Structure
Structure'
zi = '(xi )
zi = Axi
Low-dim!
data'
High-dim!
High-dim
data'
Low-dim!
data'
High-dim!
data'
Non-Linear Mapping/Embedding'
Dictionary'
Sparsity
Sparsity!
Structure
Structure'
Variability
Variability!
Structure
Structure'
Sparse Coding!
PCA!
(1997)'
(1901)'
Kernel PCA
PCA!
(1998)'
LLE!
LLE
(2000)'
T-SNE!
LapEig! T-SNE
LapEig
(2008)'
Maps!
(2000)'
Main property of these techniques:!

" preserves local distances in high-dim!
spaces:!
And in low-dim spaces:
xi
'+ xj
'
+
!"#$%&'(&%))*+'
Rd , d
xi xj
'
'++
'
R3
Most popular!
Popular!
Popular
Math sound!
Unique solution!
Manifold assumption !
Too strong!
But slow convergence!

As non-convex opt!
Local minimizers !
-0'
Gephi!
Q: What visualization software is the most used? !
!! Awesome visualization tool!!

!! Run lecture08_code06.ipynb to convert graphs to Gephi format.!
!"#$%&'(&%))*+'
-1'
Ques:ons?
Xavier Bresson
27
Data Science
Sept 12-14, 2016!

Lecture 9: Deep Learning 1!
Classification Techniques!
Xavier Bresson!
!
Note: Some slides are taken from F.F. Li,

A. Karpathy, J. Johnsons course on Deep Learning !
!"#$%&'(&%))*+'
,'
Ouline!
The Classification Problem!
Nearest Neighbor Classifier!
Linear Classifier !
Loss Function!
Softmax Classifier!
Neural Network Classifier!
Brain Analogy!
Conclusion !
Xavier Bresson
Classification Problem!
Q: What it the classification problem? !
!! Classification is a core problem in many applications:!
(1) Computer Vision: Image classification !
Image " Class (original deep learning [Hinton-et.al12])!
(2) Speech: Sound recognition!
Sound " Class (original deep learning [Dahl-et.al12])!
(3) Text document: Text categorization!
Text " Class (Wikipedia analysis)!
(4) Neuroscience: Brain functionality!
Activation pattern " Vision, hearing, body control !
!! Pipeline of classification models:!

Raw data " Feature extraction " Classifier function !
!"#$%&'(&%))*+'
.'
Image Classification !
!! We will consider the image classification problem in Computer Vision as a
generic classification problem (generalization will be discussed in Lecture 11).!
!! Problem:!
Image !
Classification '
!"#$%&'(&%))*+'
/'
Main Challenge!
!! Bridge the semantic gap between raw data (N-D array of numbers) and
cognitive/human understanding.!
Images are represented as 3D

arrays of numbers, with
integers between [0, 255].'
What humans see [cat]'
!"#$%&'(&%))*+'
Matrices of e.g. size

300x100x3.'
0'
Semantic Information is Invariant !

to Many Deformations!
Deformations in Computer Vision:!
(1) Spatial variations!
(translation, rotation,!
Scaling, shearing)
(2) Illumination!
changes
(5) Background!
clutter
Xavier Bresson
(3) Object !
deformation
(4) Occlusion
(6) Intra-class!
variation
How to Solve the Classification Problem?!

!! Highly challenging problem ( sorting problem):!
In Computer Vision (CV), early works from 1950s (history later), only recently
(2012) algorithms have achieved super-human performances.!
!! Before 2012:!
Many works exist, mostly based on two separate steps:!
(1) Handcrafting best possible filters/features (e.g. SIFT features)!
(2) Linear SVM classification on extracted features!
!! After 2012:!
Deep learning revolution: Learn simultaneously !
(1) Filters/features from raw data (do not handcraft anything)!
(2) Linear SVM classification on extracted features!
" State-of-the-art in CV, speech recognition, etc!
!"#$%&'(&%))*+'
2'
Classification is a Data-Driven Approach!

!! Generic approach:!
(1) Collect a training dataset of images and labels (training set).!
(2) Train an image classifier.!
(3) Evaluate classifier on test images (test set). !
!! Note: Collecting data is easy (big data era) but labeling is time consuming.!
!"#$%&'(&%))*+'
3'
Ouline!
Linear Classifier !
Loss Function!
Softmax Classifier!
Brain Analogy!
Conclusion !
Xavier Bresson

Q: What it the nearest neighbor classifier? !
!! Naive Classifier:!
Training set!
!"#$%&'(&%))*+'
Test data!
Nearest data!
in training set!
,5'
How to find Nearest Neighbor?!

!! Distance metrics: L2, L1, cosine, KullbackLeibler, your favorite..!
X
this example
example!
d`1 (I1 , I2 ) =
|I1 ij I2 ij |
!! Python code:!
!"#$%&'(&%))*+'
ij
,,'
Test Time!
!! Q: What is the test time? And how the classification speed depends on
the size n of training data? A: O(n), linearly ". This is a (major)
limitation. Fast test time is preferred in practice.!
Note: Neural Networks have fast test time, but expensive training time. !
!! Partial solution: Use approximate nearest neighbor techniques, which finds
approximate nearest neighbors quickly. !
!"#$%&'(&%))*+'
,-'
k-Nearest Neighbor Classifier!

!! Limitation: Nearest Neighbor is sensitive to outliers/noise.!
Solution: Find the k nearest images, and pick the label with majority voting
(sort of regularization process).!
Training set!
!"#$%&'(&%))*+'
data!
Test data
k=5!
k-Nearest data!
in training set!
,.'
Illustration!
outlier'
Data!
NN/1-NN classifier!
5NN voting is tied'
5-NN classifier!
Q1: What it the accuracy of 1-NN on training data?! 100% !

Q2: What it the accuracy of 5-NN on training data?! 100% !
Q3: What it the accuracy of 5-NN on test data?!
!"#$%&'(&%))*+'
,/'
Hyperparameters!
Q: What is the di#erence between parameter and hyperparameter?!
!! There exist two types of parameters:!
(1) Parameters: Variables that can be estimated by optimization #.!
(2) Hyperparameters: Variables that can be estimated by cross- !
validation (cannot be estimated by optimization) ".!
!! Examples of hyperparameters: distance metric, k value.!
L2, L1, cosine, KullbackLeibler
KullbackLeibler?'
k=1,2,5,10,15?
k=1,2,5,10,15?'
Q: What is cross-validation?!
!! Cross-validation:!
Q: Try out what hyperparameters work best on test set? Bad idea.
!
Test set used for the generalization performance! Use it only after training is
done. !
!"#$%&'(&%))*+'
,0'
Cross-Validation!
!! Split training data into training set and validation:!
Validation data!
Use to test hyperparameters'
Training data!
use to learn classifier'
!! Cross-validate: Cycle through the 5 folds, and record results: !
Training data
data!
!"#$%&'(&%))*+'
Validation data!
Training data!
,1'
Cross-Validation Result!
!! Example of 5-fold cross-validation for finding the value of k:!
Each point: single outcome.!

The line goes
through the mean, !
bars indicated
standard deviation!
deviation
'6alue k = 7 works best for this data.'
!"#$%&'(&%))*+'
,2'
Demo: K-Nearest Neighbor !

!"#$%&'(&%))*+'
,3'
k-Nearest Neighbor Performances!

Best accuracy (for k=7) is 29% on validation sets, may be even lower on
test set.!
Conclusion:!
(1) Never use k-NN (at least for image classification).!
(2) Not robust to perturbations (spatial variations, illumination changes, object
deformation, occlusion, background clutter, intra-class variation).!
(3) Bad test time.!
Xavier Bresson
19
Ouline!
Linear Classifier !
Loss Function!
Softmax Classifier!
Brain Analogy!
Conclusion !
Xavier Bresson
20
Linear Classifier!
!! Image classification: !
Image !
classification !
task'
Class of Images: !
CAT!
Array 32x32x3'
!! Linear classifier: !
input'
input
f (x, W, b) = W x + b
vectorize'
3D array !
32x32x3'
O#set/!
Bias'
10 numbers!
indicating
class scores!
(highest is
the choice)!
s!
1D array !
10x1!
x!
1D array
3072x1!
Linear classifier/!
Score function:'
f = Wx + b
10x1!
!"#$%&'(&%))*+'
Parameters/!
Weights'
Weights
10x3072! 3072x1!
10x1!
-,'
Interpreting Linear Classifier!

Q: What does a linear classifier do?!
A: Template matching technique: It scores the image (data) by matching
it with 10 templates.!
Template/filters !
2
3
3
2
matching!
hW
W11,,
W11,, , xi
7
7
6
6
..
W = 4 ... 5
Wx = 4
5
.
W10,
hW10, , xi
10x3072!
10x3072
10x1!
Highest score!
decides!
the class!
un-vectorize'
Wi, =
1x3072!
1x3072
32x32x3!
32x32x3
Trained weights W* (later discussed)'

!"#$%&'(&%))*+'
--'
Interpreting Linear Classifier!

Q: What does a linear classifier do?!
A: Linear mapping: It maps the high-dim image (data) R3072 to a low-dim
linear space R10.!
{x : f (x) = Wcar x + bcar }
has maximum score'
{x : f (x) = Wcar x + bcar = 0}
!"#$%&'(&%))*+'
-.'
Ouline!
Linear Classifier !
Loss Function!
Softmax Classifier!
Brain Analogy!
Conclusion !
Xavier Bresson
24
How to Compute Weights?!

(1) Define a loss function(/objective/energy): A loss function L
quantifies how well the parameters (weights, oset) are chosen to get the
highest possible score across all training data.!
Exs of loss functions (Lecture 5): SVM, L2, Logistic, Huber, etc. !
(2) Optimization: The process of changing the parameters (weights, oset) to
minimize the loss function in order to get the highest possible score across all
training data.!
Xavier Bresson
25
SVM Loss Function !

!! Multiclass SVM loss: Given (1) training data (xi,yi) and!
(2) score function si = f(xi,W),!
the multiclass SVM loss function is: !
X
Li =
max(0, sj
j6=i
si + 1)
comes form the margin
margin'
SVM loss measures how well the weights s are chosen to get the highest
possible score. Then, Li is 0 when si is well classified, that is when it has the
highest score for its own class yi, and Li is large when si is misclassified. !
Example !
when si is !
well classified:!
!"#$%&'(&%))*+'
-1'
SVM Loss Function !
Example !
when si is !
misclassified:
1X
Total SVM loss:! L =
Li
n i=1
Q: What is the min value of L?!
A: 0.!
Q: What is the max value of L?!
A: + . !
Xavier Bresson
27
Loss Functions!
Q: Will we get the same classification for this loss function?!
X
A: Probably not.!
Li =
max(0, sj si + 1)2
j6=i
Reminder (Lecture 5): There are multiple available loss functions:!

(1) Hinge/SVM loss!
(2) L2 loss!
(3) Logistic regression loss (later)!
(4) Huber loss!
Xavier Bresson
28
Non-Uniqueness of Solutions!
n
Optimization problem:!
min
W
1X
Li (W )
n i=1
(1)
1 XX
max(0, sj
n i=1
si + 1)
j6=i
1 XX
max(0, W xj
n i=1
W xi + 1)
j6=i
This opt problem is ill-posed.

Call W* the solution of (1) then W*, >1 is also solution of (1).
Example:
Xavier Bresson
Q: How to fix this issue? A: Regularization.!
29
Regularization!
n
!! Remember Lecture 5: !
min
W
1X
Li (W ) + kW k2F
n i=1
Equivalent to maximize margins

between training data!
!! Other math interpretation: Strongly convex term

unique!!
make the solution
!! Regularization terms:!
(1) L2 regularization: smooth and di#erentiable.!
(2) L1 regularization: non-smooth and non-di#erentiable, but promotes
sparsity (a few non-zero elements).!
(3) Elastic net regularization: mixture of L1 and L2.!
(4) Dropout for Neural Nets (later discussed).!
!"#$%&'(&%))*+'
.5'
Demo: Linear Classifier and SVM Loss!

!"#$%&'(&%))*+'
.,'
Limitations of Linear Classifier!

Conclusion:!
(1) Never use linear classifier directly on raw data (at least for
image classification).!
(2) Not robust to perturbations (spatial variations, illumination
changes, object deformation, occlusion, background clutter, intra-class

variation).!
(3) Excellent test time.!
Xavier Bresson
32
Ouline!
Linear Classifier !
Loss Function!
Softmax Classifier!
Brain Analogy!
Conclusion !
Xavier Bresson
33
Softmax Classifier!
!! Softmax classifier = Multinomial logistic regression !
!! Motivation (from statistics): Maximize the log likelihood of the score
probabilities of the classes:!
Scores = unnormalized
log probabilities !
of the classes '
si = f (xi , W )
e si
P (Y
Y = yi |X = xi ) = P sj
je
Li =
log P (Y = yi |X = xi )
Li =
!"#$%&'(&%))*+'
Softmax function'
esi
log P sj
je
./'
Demo: Softmax Classifier!

!"#$%&'(&%))*+'
.0'
Ouline!
Linear Classifier !
Loss Function!
Softmax Classifier!
Brain Analogy!
Conclusion !
Xavier Bresson
36

!! Image classification: !
Image !
classification !
task'
Class of Images: !
CAT!
Array 32x32x3'
!! Linear classifier: !
vectorize'
f = Wx
10x1!
10x1
3D array !
32x32x3'
!"#$%&'(&%))*+'
x!
1D array
3072x1!
10x3072! 3072x1!
s!
1D array !
10x1!
.2'

!! 2-layer classifier: !
non-linear!
activation'
weights'
10x100!
10x100
100x3072!
100x3072
W1
vectorize'
3D array !
32x32x3'
!! 3-layer classifier: !
x!
1D array
3072x1!
h!
h
max(W1x,0)!
1D array
100x1!
W2
f = W2 max(W1 x, 0)
W2max(W1x,0)!
1D array
10x1!
f = W3 max(W2 max(W1 x, 0), 0)
!! Conclusion: Neural Networks (NN) are simply series of linear classification

and non-linear activations (max).!
!"#$%&'(&%))*+'
.3'
Code for 2-Layer NN Classifier!

!! Full implementation of training a 2-layer Neural Network needs 11 lines:!
789:;;$"<=&")>?@$=7AB?$*;-5,0;52;,-;B")$CD9E=7*+D+%=F*&>'
!"#$%&'(&%))*+'
.4'
Neural Network Architecture!

Fully connected (FC) layers: Each neuron is connected to all neurons in
the next layer.!
2-layer Neural Net!

or 1-hidden-layer Neural Net
3-layer Neural Net, !

or 2-hidden-layer Neural Net
The need for more structure: FC networks are very generic but also
highly computationally expensive to learn (huge number of parameters).
They cannot be deep! !
However, using special structures of data (like local stationarity in convolutional
neural networks, and recurrence in recurrent neural networks) allow to construct
deep networks that can be learned (later discussed).!
Xavier Bresson
40
Test Time!
!! Once training is done, it is fast to classify new data (simple linear
algebra operations):!
!! This operation is called forward pass (later discussed).!
!"#$%&'(&%))*+'
/,'
Demo: Neural Network Classifier!

!"#$%&'(&%))*+'
/-'
Demo: Linear vs. Neural Network Classifiers!

!"#$%&'(&%))*+'
/.'
Online Demo!
!! ConvNetJS:
http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html!
more neurons = !
more capacity'
Regularization !
handles outliers'
!"#$%&'(&%))*+'
//'
Ouline!
Linear Classifier !
Loss Function!
Softmax Classifier!
Brain Analogy!
Conclusion !
Xavier Bresson
45
Brain Analogy!
Wx + b
!"#$%&'(&%))*+'
/1'
Very Limited Analogy!

!! Biological Neurons:!
- Many di#erent types!
- Dendrites can perform complex
non-linear computations!
- Synapses are not a single weight
but a complex non-linear
dynamical system!
- Rate code may not be adequate!
!"#$%&'(&%))*+'
/2'
Ouline!
Linear Classifier !
Loss Function!
Softmax Classifier!
Brain Analogy!
Conclusion !
Xavier Bresson
48
Summary!
Image/data classification: Given a training set, design a classifier and
predict labels for test set. !
k-Nearest Neighbor classifier:!

Predict labels from k nearest images in the training set.!
(Almost) never used as bad accuracy, and bad test time. !
Linear/softmax classifier:!
Predict labels with a linear function.!
Has been used for a long time (kernel techniques) but overcome by deep learning.!
Score function:!
f = Wx + b
SVM loss function:!
max(0, sj si + 1)
j6=i
!
esi
Softmax loss function: ! Li = log P s
j
je
Xavier Bresson
Li =
49
Summary!
!! Standard Neural Networks (NNs):!
Neurons arranged as fully connected layers. !
Series of linear functions and non-linear activations.!
Fast test time (matrix multiplications)!
Performances: bigger = better, but expensive training time (thanks GPUs)!
Bigger = (layer) width and depth (deep)!
width!
depth !
Q: How to train Neural Networks? !
!"#$%&'(&%))*+'
05'
QuesHons?
Xavier Bresson
51
Data Science
Sept 12-14, 2016!

Training Neural Networks!
Xavier Bresson!
!

!"#$%&'(&%))*+'
,'
Outline!
Generic Gradient Descent Techniques!
Backpropagation!
Activation!
Weight Initialization!
Neural Network Optimization!
Dropout!
Conclusion!
Xavier Bresson
Loss Function for Classification!

Classification: Use training data (xi,yi) to design a score function s for
classification: !
s = f (W, x) = W x
Weight W: They are found by minimizing a loss function which quantifies
how well the training data have been classified:!
X
(1) SVM loss: !
Li (W ) =
max(0, sj si + 1)
!
j6=i
(2) Softmax loss:!

!
(3) Regularization: !
e si
Li (W ) = log P sj
je
X
E(W ) =
Li (W ) + R(W )
i
Q: How to minimize loss functions?!

A: Steepest gradient descent (follow the slope!)!
Xavier Bresson
Gradient Descent Techniques !

!! Gradient descent: Most standard optimization technique!!
Note: This class of techniques are weak in optimization, but they are the
most generic when the energy landscape is di"cult, that is non-convex.
Training neural networks is (very) slow because of the gradient descent
bottleneck new research is on-going to speed up the optimization.'
!"#$%&'(&%))*+'
/'
Gradient Operator!
!! Two types: !
(1) Analytic gradient: !
!
(2) Numerical gradient: !
rW E =
@E
= explicit formula
@W
E
E(W +
=
W
W)
W
E(W )
!! Properties of numerical gradient:!

(i) Approximation of analytic gradient.!
(ii) Slow to evaluate (compared to analytic gradient).!
Evaluation the gradient numerically:'
!"#$%&'(&%))*+'
!"#$%&'(&%))*+
0'
Analytic Gradient!
Properties:!
(1) Exact value (use Calculus)!
(2) Fast to evaluate.!
E(W ) =
kW k2F
@E
= 2W
rW E =
@W
Common practice: Gradient Check!

Always use analytical gradient but check its implementation with
numerical gradient.!
Xavier Bresson
Update Rule!
!! Update: !
W m+1 W m
W
=
=
t
Speed of !
Gradient descent!
techniques'
Time step/!
Learning rate/!
Step size'
rW E(W m )
W m+1 = W m
rW E(W m )
negative
negative'
!! Code: !
!"#$%&'(&%))*+'
2'
Monotonicity!
!! Loss/energy value decreases monotonically at each iteration m:!
Q: What happens with big data? !

Large value, e.g. n=billions'
E(W ) =
n
X
Li (W ) + R(W )
i=1
Analytic gradient uses all data at the same time it is not possible
to load all data in memory!!
!"#$%&'(&%))*+'
3'
Mini-batch/Stochastic Gradient Descent!

!! Special property of loss functions: Additively separable functions, i.e.
functions that are the sum of a single data function Li (independent of all
other data):!
n
1X
min L(W ) =
Li (W )
W
n i=1
nq
n1
n2
1 X
1 X
1 X
Li (W ) +
Li (W ) + ... +
Li (W )
=
n1 i=1
n2 i=1
nq i=1
n = n1 + n2 + ... + nq
)
nq
n1
n2
1 X
1 X
1 X
rL(W ) =
rLi (W ) +
rLi (W ) + ... +
rLi (W )
n1 i=1
n2 i=1
nq i=1
only use a small portion of the training set

to compute the gradient!
!! Stochastic gradient descent: !
!! Deterministic gradient descent: !
W m+1 = W m
m+1
=W
nj
1 X
rLj (W m )
nj i=1
rL(W
W m)
All data
data'
!"#$%&'(&%))*+'
4'
Mini-batch/Stochastic Gradient Descent!

More details:!
Iterate ne epochs: !
For each epoch, iterate over all mini-batches j=1,,nq:!
W m+1 = W m
nj
1 X
rLj (W m )
nj i=1
Note1: An epoch is a complete pass of all training data.!

Note2: At each new epoch, randomly shue training data (improve
significantly results).!
Note3: Stochastic consistency:!
nj
1 X
m
E
rLi (W )
nj i=1
E(W m+1 )
Xavier Bresson
m!1
m!1
1X
rLi (W m )
n i=1
W m+1
10
Stochastic Monotonicity!
!! Code:!
!! Loss vs epochs (iteration m):!
Note1: Mini-batch size: 32, 64, 128, limited by GPU memory.!

Note2: Several works to speed up optimization: Momentum, Adagrad, Adam,
etc (later discussed), but still major bottleneck of NN training.!
Note3: Stochastic gradient descent technique is not only used for large-scale neural
networks, but also for most big data problems: k-means clustering, SVM
classification, Lasso regression, recommendation, etc.!
!"#$%&'(&%))*+'
,,'
Influence of Learning Rate 6 !

!! Challenging problem to find the optimal step size 67!
Large "!
Small "!
Optimal "!
!"#$%&'(&%))*+'
,-'
Outline!
Backpropagation!
Activation!
Dropout!
Conclusion!
Xavier Bresson
13
Computational Graph!
!! Neural networks (NNs) are represented by computational graphs (CGs).!
Definition: A series of operators applied to inputs. Easy to combine (lego
strategy), can be huge.!
Usefulness: Clear visualization of NN operations (great for debugging).!
CG are essential to derive gradients by backpropagation.!
Computational !
Graph: '
Google Tensorflow'
!"#$%&'(&%))*+'
,/'
Backpropagation!
Definition: A recursive application of chain rule along a
computational graph (CG) provides the gradients of all inputs,
weights, intermediate variables.
Chain rule: Calculus:
@L
@L(F (x))
@L @F (x)
=
=
.
@x
@x
@F @x
Essential property of backpropagation: It can compute the gradient of

any variable in the CG by a simple local rule, independently of the size of
the CG (including very deep NNs).
Xavier Bresson
15
Local Rule!
!! Any computational graph is a series of elementary neurons (also called
nodes, gates) 'The gradient of the loss w.r.t. the inputs x,y of the local
neurons can be computed with the local rule:!
Gradient L w.r.t. x,y = !
Recursive gradient * Local gradient w.r.t. x,y!
Local gradients '
Chain rule
rule'
L'
Recursive gradient'
!"#$%&'(&%))*+'
,1'
Backpropagation Techniques!
!! Backpropagation consists of two steps:!
(1) Forward pass/flow: Compute final loss value and all intermediate
output values of neurons/nodes. Save them in memory for gradient
computations (in backward step).!
(2) Backward pass/flow: Compute the gradient of the loss functions w.r.t.
all variables on the network using the local gradient rule.!
Backward low: !
Compute gradient values'
!"#$%&'(&%))*+'
Forward low: !
Compute loss values'
,2'
An Example!
!! Step 1:!
@f
=1
@f
!! Step 2:!
!"#$%&'(&%))*+'
!"#$%&'(&%))*+
,3'
An Example!
!! Step 3:!
!! Step :
:!
!"#$%&'(&%))*+'
,4'
Another Example!
!"#$%&'(&%))*+'
-5'
Backpropagation Implementation!
Forward and Backward Functions!
!! Code:!
!"#$%&'(&%))*+'
-,'
Backpropagation Implementation!
Forward and Backward Functions!
!! Pseudo-code:!
!"#$%&'(&%))*+'
--'
Gradient with Vectorized Code!

@L
!! When variables x,y,z are row vectors: ! @L = @f . @
@x
@x @ff
Jacobian !
Matrix'
Q: What is the size of a

Jacobian matrix? !
A: If input dimensions is 4096 then J is
4096 x 4096 = 16M variables ! !
h @f i
@f
i
=
@x
@xj
@f @L
@L
=
.
@x
@x @f
4096 x 1 '
4096 x 4096 '
4096 x 1 '
!! Good news: Most computations are element-wise in computational graphs,

i.e. apply to each element x independently of the other elements. In this case,
Jacobian = diagonal matrix 'Very fast computations in parallel with
vectorized operations on CPUs.!
!"#$%&'(&%))*+'
-.'
Example!
!! Activation gate:!
!"#$%&'(&%))*+'
-/'
Backpropagation Cost!
Cost forward cost backward (slightly higher) !
Backward requires to store forward values!!
Mini-batch optimization with backpropagation:!

(1) Sample randomly a batch of data!
(2) Forward propagate to get loss values!
(3) Backward propagate to get gradients!
(4) Update neural network weights and other intermediate variables!
!
Works on huge computational graphs.!
Xavier Bresson
25
Outline!
Backpropagation!
Activation!
Dropout!
Conclusion!
Xavier Bresson
26
Activation Functions!
!! Reminder: Neural network classifiers are succession of linear classification and
non-linear activations.!
Exs: 2-layer classifier:! f = W2 max(W1 x, 0)
W1 x, 0), 0)
3-layer classifier:! f = W3 max(W2 max(W
Activation !
function!
!"#$%&'(&%))*+'
-2'
Class of Activation Functions !
Xavier Bresson
28
Sigmoid Activation!
!! Historically popular by analogy
with neurobiology.!
Sigmoid !
(x) = 1/(1 + e
!! Three issues:!
(1) Saturated neurons kill gradients!
Vanishing gradient problem (later discussed)!
r = (1
(2) Exp is a bit computationally expensive.!

!
(3) Sigmoid are not zero-centered:!

Suppose input neurons are always positive, then
gradients are either all positive, or all negative: '
f (z = w1 x1 + w2 x2 ) !
@f
@f @z
@f
xi ,
=
=
@wi
@z @wi
@z
xi
Always zero-center your data!!

!"#$%&'(&%))*+'
-4'
Tanh Activation [LeCun-et.al91]!

!! Advantage: !
Zero-centered function " !
Tanh!
(x) = tanh(x)
!! Issue:!
Kill gradients Vanishing
gradient problem (later discussed)!
!"#$%&'(&%))*+'
.5'
ReLu Activation [Hinton-et-al12]!

!! ReLU (Rectified Linear Unit):!
ReLu!
(x) = max(x, 0)
!! Advantages: !
(1) Converges 6x faster than sigmoid/tanh!
(2) Does not saturate in positive region!
(3) Max is computationally e"cient !
!! Limitations:!
(1) Not zero-centered function!
(2) It kills gradient when input is negative !
Standard trick: Initialize neurons with a small positive biases like 0.01. !
!"#$%&'(&%))*+'
.,'
Leaky ReLu [Mass-et.al13]!

!! Advantages: ReLU (Rectified Linear
Unit)!
(1) Converges 6x faster than sigmoid/tanh!
(2) Does not saturate in positive region!
(3) Max is computationally e"cient!
(4) It does not kill the gradient!
(5) Parameter 8'can be learned by
backpropagation.!
Leaky !
ReLu!
(x) = max(x, x), = 0.01
!! In practice: Use Relu, try out Leaky ReLu, do not expect much of
tanh, never use sigmoid. !
!"#$%&'(&%))*+'
.-'
Demo: Train Fully Connected Neural

Networks with Backpropagation!
!"#$%&'(&%))*+'
..'
Outline!
Backpropagation!
Activation!
Dropout!
Conclusion!
Xavier Bresson
34
Q: What happens when initial W=0 is used? !
A: All neurons compute the same outputs and the weights are the same. !
'Need to break symmetry!!
!! Natural idea: Use small random numbers for initialization.!

W = Gaussian/normal distribution with zero mean and 1e-2 standard deviation!
'Works well for small networks, but not for deep networks.'
!"#$%&'(&%))*+'
.0'
Vanishing Gradient Problem!

!! Example: 10-layer net, 500 neurons on each layer, tanh activation, and
initialization:!
!! Collect output statistics at each layer: means, variances, histograms!
All activations become zero!!

Q: Why?'
!"#$%&'(&%))*+'
.1'
Vanishing Gradient Problem!

Vanishing gradient: At initialization, all weights
W are small. This has two consequences:!
(1) At each layer, output backpropagated gradient
is small!
(2) We chain all recursive gradients by
backpropagation:!
W x W x W x x W exponential decay at each
output (e.g. W=0.110)!
The deeper the network the lower the gradient!!
This is the vanishing gradient problem (was a big issue for long time).
Xavier Bresson
37
Exploding Gradient Problem!

!! Let us try the opposite: !
' All neurons get saturated to -1 and +1 (tanh
activation), and then the gradients = 0.!
)
' Tricky to set a good value for the mean of
the normal distribution:!
2 [0.01, 1]
!"#$%&'(&%))*+'
.3'
Xaviers Initialization [Glorot-et-al10] !

!! Idea: Pick an initial W that keep all layers with variance 1:!
' It scales the gradient:!

(1) Large #inputs small weights!
(2) Small #inputs large weights'
!! Theory: Reasonable initialization but mathematical derivation assumes linear
activations It works well for tanh, but not for ReLu. It needs a small change:!
!"#$%&'(&%))*+'
.4'
Batch Normalization [Io#e-Szegedy15] !

!! Motivation: Unit Gaussian activations are desirable all over the network !
'let us enforce this property!!
!! Formula:!
BN along each dimension:'

layer!
xk
x
= p
k
!! Node/gate in NN:!
E(xk )
Var(xk )
Smooth and di#erentiable function!

'Backpropagation can be carried out! '
!"#$%&'(&%))*+'
/5'
Where to Insert BN in NN?!

!! Usually inserted after Fully Connected layers (or
convolutional layers, next lecture) and before
nonlinearity:!
!! Q: But do we necessarily want a unit Gaussian
input to a tanh layer?!
A: Not necessarily 'Let the network decide by
backpropagation.!
Normalize: '
xk
x
= p
E(xk )
Var(xk )
Then allow the network to to change the range if it wants to:'
yk =
k k
x
+
k
k
Note the network can learn the identity mapping if it wants to:'
k
!"#$%&'(&%))*+'
q
= Var(xk )
= E(xk )
/,'
Properties!
Pseudo-code:'
!! Properties:!
(1) Reduces strong dependence on
initialization!
(2) Improves the gradient flow through
the network!
(3) Allows higher learning rates !
Learn faster the network!
(4) Acts as regularization!
!! Price: 30% more computational time.!
!! At test time: Mean and variance are estimated during training and
average values are selected.!
!"#$%&'(&%))*+'
/-'
Demo: Batch Normalization !

!"#$%&'(&%))*+'
/.'
Outline!
Backpropagation!
Activation!
Dropout!
Conclusion!
Xavier Bresson
44
Optimization for NNs!

!! Training Neural Networks:!
(1) Sample a batch of data.!
(2) Forward prop it through the graph, get loss values!
(3) Backprop to calculate the gradients!
(4) Update the parameters using gradients!
!! Code (main loop):!
Q: Can we do better than

simple stochastic gradient
descent (SGD) update? !
sgd is the slowest!!
A: Yes! Extensive works. Again

major bottleneck.'
!"#$%&'(&%))*+'
/0'
Why SGD is slow?!

!! Illustration:!
Level set of loss'
Gradient loss
is steep
vertically'
Gradient loss
is flat
horizontally'
Q: What is the trajectory along which we converge to the

minimum with SGD? !
A: Trajectory is jittering (up and down) because very slow progress along
flat directions.!
'Solution: Momentum update.!
!"#$%&'(&%))*+'
/1'
Momentum [Hinton-et.al86] !
!! New update rule:!
Friction'
Acceleration'
Velocity'
Physical interpretation: Momentum update is like rolling a ball along the

landscape of the loss function. !
!! Advantages: !
(1) Velocity builds up along flat directions.!
(2) Decrease velocity in steep directions.!
!"#$%&'(&%))*+'
/2'
Limitation of Momentum!
Momentum overshoots the minimum but overall gets faster than SGD
(too much velocity).!
In practice: !
(1) = 0.5, 0.9!
(2) Initialization: v=0!
Xavier Bresson
48
Nesterov Momentum!
!! Nesterov accelerated gradient (NAG) technique used for momentum
update:!
only change'
Nesterov update:'
v t = v t rf (xt + v t )
xt+1 = xt + v t
'It can correct its trajectory

faster than momentum'
!"#$%&'(&%))*+'
/4'
Energy Landscape of NNs!

Energy landscapes of NNs are non-convex
Existence of local minimizers, which
usually is a big issue as most local solutions
are bad. !
Surprisingly, most local minimizers in large-scale networks are

satisfactory! Why? Nobody knows.!
So, initialize with random function, then it will always end up to a
good solution, no bad local minimizers!!
The problem of local minimizers is for small-scale networks.!
Xavier Bresson
50
AdaGrad [Duchi-et.al11]!
!! Origin: Convex optimization.!
!! Update rule:!
dx
Prevents division by 0.
0.'
|v n+1 |2
sign(dx) = 1
The dynamics is controlled because the

speed is 1 'It equalizes the step size
in steep and flat directions.'
!! Issue: When vm gets large then xm will stop moving! !

Solution: RMSProp!
!"#$%&'(&%))*+'
0,'
RMSProp [Hinton12]!
!! RMSProp update rule: It does not stop the learning process.!
!"#$%&'(&%))*+'
0-'
Adam [Kingma-Ba14]!
!! Adam = Momentum + Adagrad/RMSProp!
!! Adam = Default current optimization technique for NNs. !

!! Hyperparamaters: :,;'0.9<':-;'0.9=!
!"#$%&'(&%))*+'
0.'
Global Learning Rate 6 !

!! All optimization algorithms have
a learning rate " as
hyperparameter.!
!! Q: Is a constant learning rate is
good? !
A: No. A learning rate should decay
over time.!
Decay learning rate at each epoch:!
!
!
(1)!Exponential decay!
!
!
(2) Polynomial decay!
= 0 e 0 m
= 0 /(1 + 0 m)
!! Common good practice: Babysit the loss value and the learning
rate.!
!"#$%&'(&%))*+'
0/'
Demo: Update Rules/!

!"#$%&'(&%))*+'
00'
Outline!
Backpropagation!
Activation!
Dropout!
Conclusion!
Xavier Bresson
56
Dropout Regularization [Hinton14]!

!! Dropout: A regularization process
tailored to NNs!
Idea: Randomly set some neurons to
zero in the forward pass.!
!! How to deactivate neurons?!

Use simple binary masks U.!
!"#$%&'(&%))*+'
02'
Why It Works?!
It prevents overfitting: It reduces the number of parameters to learn
for the NN.!
It increases robustness of learning process: For each dropout unit,

dropout sub-samples the NN and we learn the best weights for this
sub-network. And we do it for many dierent sub-networks. Besides
all these sub-networks share weights Global consistency of
weight values and better robustness because we learn on smaller
networks.!
Xavier Bresson
58
Code!
Xavier Bresson
59
Demo: Dropout!
!"#$%&'(&%))*+'
15'
Outline!
Backpropagation!
Activation!
Dropout!
Conclusion!
Xavier Bresson
61
Summary!
!! Training Neural Networks:!
(1) Sample a batch of data.!
(2) Forward prop it through the graph, get loss values!
(3) Backprop to calculate the gradients!
(4) Update the parameters using gradients!
!! Neural Networks = Computational Graphs!

Lego approach of building large-scale NNs!
!! Activation functions:!
(1) Sigmoid (never)!
(2) Tanh (try)!
(3) ReLu (default)!
(4) Leaky ReLu (try)!
!"#$%&'(&%))*+'
1-'
Summary!
Weight initializations:!
(1) Xaviers initialization (default)!
(2) Batch Normalization (30% additional cost)!
Parameter updates/optimization:!
(1) SGD!
(2) Momentum!
(3) Nesterov momentum!
(4) Adagrad/RMSProp!
(5) Adam (default)!
Dropout regularization!
Xavier Bresson
63
Ques@ons?
Xavier Bresson
64
Data Science
Sept 12-14, 2016!

Lecture 10: Deep Learning 2 (Supplementary)!
Common Good Practices for NN Learning!
Xavier Bresson!
!

!"#$%&'(&%))*+'
,'
Outline!
Step 1: Pre-process data!
Step 2: Choose NN Architecture!
Step 3: Monitor Loss Decrease!
Step 4: Hyperparameter Optimization!
Step 5: Monitor Test Accuracy!
Xavier Bresson
Step 1: Pre-Process Data!

!! Assume a n x d data matrix X: !
!"#$%&'(&%))*+'
.'
Outline!
Xavier Bresson

!! Start small: 1 hidden layer then increase number of layers.!
Example!
CIFAR:!
!! Initialization:!
(1) Small networks: Normal distribution with 0.01 standard deviation!
(2) Large networks: Xaviers initialization !
!"#$%&'(&%))*+'
0'
Outline!
Xavier Bresson

!! Initialization: Loss value = - log(1/#classes)!
!! Add regularization: Loss value increases 'Good sanity check!
!"#$%&'(&%))*+'
2'

Let us train: !
(1) First, overfit small portion of data with SGD, reg=0!
Easy way to find a good value for the global learning rate .!
Xavier Bresson

!! Let us train: !
(2) Second, reg=small then find 3'
!
!
!
!
!
!
!
!
!
!
!
!
!
(3) Last, increase reg value!
!"#$%&'(&%))*+'
5'
Outline!
Xavier Bresson
10

!! Cross-validation strategy:!
(1) First step: Only use a few epochs to get an idea of what values work!
!
!
!
!
!
!
!
!
!
!
!
(2) Second step: Try out many values !

'Very long computational times !
!"#$%&'(&%))*+'
,,'
Grid vs. Random Search!
!"#$%&'(&%))*+'
,-'
Hyperparameters!
!! List:!
(1) Network architecture!
(2) Learning rate, decay schedule!
(3) Regularization: L2 and dropout!
!"#$%&'(&%))*+'
,.'
Outline!
Xavier Bresson
14
!"#$%&'(&%))*+'
,0'
Ques9ons?
Xavier Bresson
16
Data Science
Sept 12-14, 2016!

Convolutional Neural Networks!
Xavier Bresson!
!

!"#$%&'(&%))*+'
,'
Outline!
History of CNNs!
Standard CNNs!
CNNs for Graph-Structured Data!
Conclusion!
Xavier Bresson
A Brief History!
!! Hubel and Wiesel: Nobel Prize in Medicine in 1959 for understanding
the primary visual cortex system.!
!! Visual system is composed of receptive fields called V1 cells that are
composed of neurons that activate depending on the orientation.!
!"#$%&'(&%))*+'
.'
Hierarchical Organization of Visual Neurons!

!! The second layer of the visual cortex is composed of V2 cells that takes as
inputs the outputs of V1 neurons. This forms a hierarchical organization.!
!"#$%&'(&%))*+'
/'
Perceptron [Rosenblatt57]!
!! Application: Character recognition!
Perceptron is only hardware (circuits, electronics), no code/simulations.!
Perceptron was connected to a camera that produced 400-pixel images.!
!
Update rule was empirical:!
W t+1 = W t + (D
Activation was binary:!

!
(x) =
1
0
Y t )X
if hw, xi + b > 0
otherwise
No concepts of loss function, no gradient, no backpropagation 'Learning was bad.!

Multilayer perceptron in 1960: stack perceptron, still hardware.!
!"#$%&'(&%))*+'
0'
Neurocognitron [Fukushima80]!
Application: Handwritten character recognition!
Direct implementation of Huber-Wiesel simple and complex cells (V1 and V2
cells) with hierarchical organization.!
Introduction of concepts of local features (reception fields).!
No concepts of loss function, no gradient, no backpropagation Learning was bad.!
Inspiration of convolutional neural networks (CNNs)!
Complex cells: perform pooling
Xavier Bresson
Backpropagation [Rumelhart-et.al86] !
Introduction of backpropagation: Concepts of loss function, gradient,
gradient descent.!
Issue: Backprop did not work for large-scale/deep NNs (vanishing gradient
problem).!
Xavier Bresson
Convolutional Neural Networks (CNNs)!

[LeCun-Bengio-et.al98]!
!! Like Google PageRank, a B$ algorithm, among top 10 algorithms in data
science.!
!! Computational issue in 1998: Very long to train for large-scale/deep

networks.!
!"#$%&'(&%))*+'
3'
2012 Breakthrough [Hinton-et.al12] !

!! AlexNet: CNN with 7 layers (5CL+2FC)!
'Error on ImageNet dataset is 15.3%, second is 26.3% (w/
handcrafted features)!
!! AlexNet uses:!
(1) ReLu activation!
(2) Dropout regularization!
(3) More layers!
(3) Graphics processing units (GPUs): A breakthrough in NN learning
as they allow to learn large-scale networks. !
!! Also breakthrough in speech
recognition [Dahl-et.al12]!
'Error decreased from 23.2% to 16.0%!!
!"#$%&'(&%))*+'
4'
CNNs/Deep Learning is Everywhere..!

in Computer Vision!!
!! Classification, Retrieval!
!! Detection, Segmentation!
!! Self-driving car!
!! Face detection!
!! Medical!
!! Go game!
!! Arts/deep dreams!
!"#$%&'(&%))*+'
,5'
DeepArts !
678)9::;%%8"&<=$*'
!"#$%&'(&%))*+'
,,'
CNNs are Used by All big IT Companies!

!! Facebook (Torch software)!
!! Google (TensorFlow software, Google Brain, Deepmind)!
!! Microsoft!
!
!! Tesla (AI Open)
Open)!
!! Amazon (DSSTNE software)!
!! Apple!
!! IBM!
!"#$%&'(&%))*+'
,-'
Outline!
History of CNNs!
Standard CNNs!
Conclusion!
Xavier Bresson
13
Convolutional Neural Networks!

CNNs are extremely ecient at extracting meaningful statistic patterns in
large-scale and high-dimensional datasets.!
Key idea: Learn local stationary structures and compose them to form
multiscale hierarchical patterns.!
Why CNNs are good? It is open (math) question to prove the eciency of
CNNs.!
Note: Despite the lack of theory, the entire ML and CV communities have
shifted to Deep Learning techniques! Ex: NIPS16: 2326 submissions, 328 DL
(14%), Convex Optimization 90 (3.8%). !
Xavier Bresson
14
Local Stationarity!
!! Assumption: Data are locally stationary
across the data domain:!
"
similar local patches are shared
!"#$"%&'"#(%)
*+%,-./)
!! How to extract local stationary patterns? !

Convolutional filters (filters with !
compact support kernels)!
F1
F2
F3
!"#$%&'(&%))*+'
x F1
x F2
x F3
,0'
Convolutional Layer (CL)!

!! Step 1: Convolve the filter with the image: Slide filter
over the image spatially, and compute dot products.!
height!
width!
width
depth!
Filter/!
Reception field!
for each neuron'
!"#$%&'(&%))*+'
1 number: The result of taking the

dot product between the filter and a
small 5x5x3 chunk of the image!
(i.e. 5*5*3 = 75-dimensional dot
product + bias)'
T
w x+b
,1'
Convolutional Layer (CL)!

!! Step 2: Produce a stack of activation maps. For example, using 6 5x5 filters,
we get 6 separate activation maps that we stack up to get a new image of
size 28x28x6.!
!! Step 3: Apply a non-linear activation function: For instance, ReLU.!
!"#$%&'(&%))*+'
,2'
Multiscale Hierarchical Features!

!! Assumption: Local stationary patterns can be composed to form more abstract
complex patterns:!
'
Layer 11'
Layer 22'
Layer 33'
Layer 44'
Deep/hierarchical !
Features (simple to abstract)'
!! How to extract multiscale hierarchical patterns? !

Downsampling of data domain (s.a. image grid) with Pooling (s.a. max, average).!
2x2 max!
pooling'
2x2 max !
pooling'
!! Other advantage: Keep same computational complexity while increasing

#filters.!
!"#$%&'(&%))*+'
,3'
Illustration of CLs in CNNs!
!"#$%&'(&%))*+'
,4'
Illustration of Activation Maps!
Xavier Bresson
20
Classification Function!
Classifier: After extracting multiscale locally stationary features, use them to
design a classification function with the training labels.!
How to design a (linear) classifier? !
Fully connected neural networks.!
Class 1
Class 2
xout = W xlayer
Class K
Features
Xavier Bresson
Output signal
Class labels
21
Full Architecture of CNNs!

!"#$"%&'"#(%)
*+%,-./)
x F1
F1
F2
x
7-8&)(9'$('"#)
:)
;.+<)<"=#/(>1%+#2)
:)
?""%+#2))
@'
x F2
IJ2J)+>(2-)
F3
0&,1&,)/+2#(%)
x F3
!%(//)%(3-%/)
xl=0 = x
xl=0
conv
A#1&,)/+2#(%)
!"#$"%&'"#(%)%(B-./)
C-D,.(9,)%"9(%)/,('"#(.B)E-(,&.-/)(#<)9">1"/-),F->))
$+()<"=#/(>1%+#2)(#<)1""%+#2G)
!"#$%&'(&%))*+'
xl=1
xl=1
conv
xl
y 2 R nc
*&%%B)9"##-9,-<)%(B-./)
C!%(//+H9('"#)E&#9'"#G)
--'
Example!
Xavier Bresson
23
Case Studies!
!! LeNet5 [LeCun-Bengio-et.al98]: !
Input is 32x32!
Architecture is CL-PL-CL-PL-FC-FC!
Accuracy on MNIST is 99.6%!
!! AlexNet [Krizhevsky-et.al12]: !
Input is 227x227x3!
Architecture is 7CL-3PL-2FC!
Accuracy on ImageNet is 15.4%!
Note: CL1 with 96 filters 11x11: 227x227x3 >'55x55x96 (stride=4), #parameters=(11x11x3)x96=35K!
PL1 2x2: 55x55x96 >'27x27x96, #parameters=0!!
!"#$%&'(&%))*+'
-/'
Case Studies!
!! GoogleNet [Szegedy-et.al14]: !
Input is 227x227x3!
Architecture is 22 layers!
Architecture'
!! ResNet [He-et.al15]:
Microsoft Asia !
Input is 227x227x3!
Architecture is 152 layers!!
!"#$%&'(&%))*+'
-0'
The Deeper The Better!
!"#$%&'(&%))*+'
-1'
Demo: LeNet5!
TensorBoard!
!"#$%&'(&%))*+'
-2'
CNNs Only Process Euclidean-Structured Data!

!! CNNs are designed for Data lying on Euclidean spaces:!
(1) Convolution on Euclidean grids (FFT)!
(2) Downsampling on Euclidean grids!
(3) Pooling on Euclidean grids!
Everything mathematically well defined and computationally fast!!
Q: What type of data can be processed with CNNs?!
Images (2D, 3D) !

videos (2+1D)'
Sound (1D)'
!! But not all data lie on Euclidean grids!!

!"#$%&'(&%))*+'
-3'
Outline!
History of CNNs!
Standard CNNs!
Conclusion!
Xavier Bresson
29
Non-Euclidean Data!
!! Examples of irregularly/graph-structured data: !
(i) Social networks (Facebook, Twitter)!
(ii) Biological networks (genes, brain connectivity)!
(iii) Communication networks (Internet, wireless, tra"c)!
P'
N"9+(%)#-,=".L/!
O.(+#)/,.&9,&.-!
Q-%-9">>&#+9('"#)
#-,=".L/!
;.(1FK#-,=".LM)
)/,.&9,&.-<)<(,())
!! Main challenges: !
(1) How to define convolution, downsampling and pooling on graphs?!
(2) And how to make them numerically fast?!
!! Current solution: Map graph-structured data to regular/Euclidean grids with
e.g. kernel methods and apply standard CNNs. !
Limitation: Handcrafting the mapping is against CNN principle! !
!"#$%&'(&%))*+'
.5'

Our contribution [Deerrard-B-Vandergheynst16]: Generalizing CNNs
to any graph-structured data with same computational complexity as
standard CNNs!!
What tools for this generalization?!
(1) Graph spectral theory for convolution on graphs,!
(2) Balanced cut model for graph coarsening,!
(3) Graph pooling with binary tree structure of coarsened graphs.!
Xavier Bresson
31
Related Works!
!! Categories of graph CNNs: !
(1) Spatial approach!
(2) Spectral (Fourier) approach !
!! Spatial approach: !
! Local reception fields [Coates-Ng11, Gregor-LeCun10]:!
Find compact groups of similar features, but no defined convolution.!
! Locally Connected Networks [Bruna-Zaremba-Szlam-LeCun13]:!
Exploit multiresolution structure of graphs, but no defined convolution.!
! ShapeNet [Bronstein-et.al.1516]:!
Generalization of CNNs to 3D-meshes. Convolution well-defined in these
smooth low-dimensional non-Euclidean spaces. Handle multiple graphs. !
Obtained state-of-the-art results for 3D shape recognition.!
!! Spectral approach: !
! Deep Spectral Networks [Hena#-Bruna-LeCun15]:!
Computational complexity is O(n2), while ours is O(n). !
!"#$%&'(&%))*+'
.-'
Convolution on Graphs 1/3!

!! Graphs: G=(V,E,W), with V set of vertices, E set of edges, !
W similarity matrix, and |V|=n.!
eij 2 E
i2V
j2V
Wij = 0.9
!! Graph Laplacian (core operator to spectral graph theory [1]): !

2nd order derivative operator on graphs!
L = D W
normalized
L = In D 1/2 W D 1/2 unnormalized
[1] Chung, 1997'
!"#$%&'(&%))*+'
..'

Fourier transform on graphs [2]: L symmetric positive semidefinite matrix !
It has a set of orthonormal eigenvectors {ul}l known as graph Fourier modes,
associated to nonnegative eigenvalues {l}l known as the graph frequencies.!
The Graph Fourier Transform of f 2 Rn is
FG f = f = U T f 2 Rn ,
which value at frequency l is:

f( l ) = fl := hf, ul i =
n
X1
f (i)ul (i)
i=0
The inverse GFT is defined as:

FG 1 f = U f = U U T f = f,
which value at vertex i is:

(U f)(i) =
n
X1
fl ul (i).
l=0
[2] Hammond, Vandergheynst, Gribonval, 2011
Xavier Bresson
34

Convolution on graphs (in the Fourier domain) [2]:!
f G g = FG 1 FG f
which value at vertex i is:

(f G g)(i) =
FG g 2 R n ,
n
X1
fl gl ul (i)
l=0
It is also convenient to see that:

f G g = g(L)f,
as
f G g = U (U T f )
6
(U T g) = U 4

Xavier Bresson
g(
0)
..
.
g(
n 1)
7 T
5U f
= U g()U T f = g(L)f
35
Translation on Graphs 1/2!

Translation on graphs: A signal f defined on the graph can be translated to
any vertex i using the graph convolution operator [3]:!
Ti f := f G
i,
where Tif is the graph translation operator with vertex shift i. !

Function f, translated at vertex i, has the following value at vertex j:
(Ti f )(j) = f (j
i) = (f G
i )(j)
n
X1
fl ul (i)ul (j),
l=0
This formula is the graph counterpart of the continuous translation operator:

Z
(Ts f )(x) = f (x s) = (f s )(x) =
f()e 2is e2ix d,
R
where f() = hf, e2ix i, and e2ix are the eigenfunctions of the continuum
Laplace-Beltrami operator , i.e. the continuum version of the graph Fourier
modes ul .
[3] Shuman, Ricaud, Vandergheynst, 2016

Xavier Bresson
36
Translation on Graphs 2/2!

!! Note: Translation on graphs are easier to carry out with the spectral approach,
than directly in the spatial/graph domain.!
(a) Ts f
(b) Ts0 f
(c) Ts00 f
(d) Ti f
(e) Ti0 f
(f) Ti00 f
[Shuman-Ricaud-!
Vandergheynst16]'
Figure 1: Translated signals in the continuous R2 domain (a-c), and in the graph
domain (d-f). The component of the translated signal at the center vertex is
highlighted in green.
!"#$%&'(&%))*+'
.2'
Localized Filters on Graphs 1/3!

!! Localized convolutional kernels: As standard CNNs, we must !
define localized filters on graphs.!
!! Laplacian polynomial kernels [2]: We consider a family of spectral
kernels defined as:!
K
X
g( l ) = pK ( l ) :=
ak kl ,
(1)
k=0
where pK is a K th order polynomial function of the Laplacian eigenvalues

This class of kernels defines spatially localized filters as proved below:
l.
Theorem 1 Laplacian-based polynomial kernels (1) are K-localized in the sense

that
(Ti pK )(j) = 0
if
dG (i, j) > K,
(2)
where dG (i, j) is the discrete geodesic distance on graphs, that is the shortest
path between vertex i and vertex j.
[2] Hammond, Vandergheynst, Gribonval, 2011'
!"#$%&'(&%))*+'
.3'

Corollary 1. Consider the function
ij
Then
= Ti pK (j) = pK G
ij
=0
ij
such that
(j) = pK (L)
(j) =
K
X
ak Lk
(j).
k=0
if
dG (i, j) > K.
Spatial profile of polynomial filter given by
ij
BiK = Support of
polynomial filter at vertex i
Vertex i
!"#$%&'(&%))*+'
Figure 2. Illustration of localized filters on graphs. Laplacian-based polynomial

kernels are exactly localized in a K-ball BiK centered at vertex i.
.4'

Corollary 2. Localized filters on graphs are defined according to the principle:
Frequency smoothness
Spatial graph localization
This is obviously the Heisenbergs uncertainty principle extended to the graph

setting. Recent papers have studied the uncertainty principle on graphs.
BiK = Support of
polynomial filter at vertex i
Vertex i
!"#$%&'(&%))*+'
/5'
Fast Chebyshev Polynomial Kernels 1/2!

!! Graph filtering: Let y be a signal x filtered by a Laplacian-based!
polynomial kernel:!
K
X
y = x G pK = pK (L)x =
ak Lk x
k=0
!"#$"%&'"#(%)
*+%,-./)
F1
x F2
F2
F3
x F3
The monomial basis {1, x, x2 , x3 , ..., xK } provides localized spatial filters, but
R1
2 1
does not form an orthogonal basis (e.g. h1, xi = 0 1xdx = x2 0 = 12 ), which
limits its ability to learn good spectral filters.
polynomials: Let Tk (x) the Chebyshev polynomial of order k gen!! Chebyshev
C!
erated by the fundamental recurrence property Tk (x) = 2xTk 1 (x) Tk 2 (x)
with T0 = 1 and T1 = x. The Chebyshev basis {T0 , T1 , ..., TK } forms an orthogonal basis in [ 1, 1].
Figure 3. First six Chebyshev polynomials.

!"#$%&'(&%))*+'
x F1
/,'
Fast Chebyshev Polynomial Kernels 2/2!

Graph filtering with Chebyshev [2]: The filtered signal y is defined with the
Chebyshev polynomials is:!
K
X
y = x G qK =
k Tk (L)x,
k=0
with the Chebyshev spectral kernel:

qK ( ) =
K
X
k Tk ( ).
k=0
PK
Fast filtering: Let denote Xk := Tk (L)x and rewrite y = k=0 k Xk . Then
F!
all {Xk } are generated with the recurrence equation Xk = 2LXk 1 Xk 2 .
As L is sparse, all matrix multiplications are done between a sparse matrix
and a vector. The computational complexity is O(EK), and reduces to linear
complexity O(n) for k-NN graphs.
GPU parallel implementation: Linear algebra operations can be done in

parallel, allowing a fast GPU implementation of Chebyshev filtering. !
Xavier Bresson
42
Graph Coarsening!
!! Graph coarsening: As standard CNNs, we must define a grid coarsening
process for graphs. It will be essential for pooling similar features together.!
Graph coarsening/
clustering
Gl=0 = G
Graph coarsening/
clustering
Gl=1
Gl=2
Figure 4: Illustration of graph coarsening.'
!! Graph partitioning: Graph coarsening is equivalent to graph clustering, which

is a NP-hard combinatorial problem.!
!"#$%&'(&%))*+'
/.'
Graph Partitioning!
!! Balanced Cuts [4]: Two powerful measures of graph clustering are the
Normalized Cut and Normalized Association defined as:!
Normalized Cut:
K
X
Cut(Ck , Ckc )
min
C1 ,...,CK
Vol(Ck )
k=1
Partitioning by min edge cuts.
Ckc
Ck
Equivalence by
complementarity
Normalized Association:
K
X
Assoc(Ck )
max
C1 ,...,CK
Vol(Ck )
k=1
Ck
Partitioning by max vertex matching.
Figure 5: Equivalence between NCut and NAssoc partitioning.'
P
P
where Cut(A,
B)
:=
W
,
Assoc(A)
:=
ijP
i2A,j2B
i2A,i2B Wij ,
P
Vol(A) := i2A,j2B di , and di := j2V Wij is the degree of vertex i.
[4] Shi, Malik, 2000'

!"#$%&'(&%))*+'
//'
Graclus Graph Partitioning!

!! Graclus [5]: It is a greedy (fast) technique that computes clusters that locally
maximize the Normalized Association.!
(P1) Vertex matching:
@'
@
i, j = argmax
j
l+1
(P2): G
dli + dlj
?
?'
?
?'
Wijl+1 = Cut(C
Cil , Cjl )
ll+1
+1
Wii = Assoc(C
Cil )
Graph coarsening/
clustering
?'
?
?'
?
?'
?
Gl
l o
Wiil + 2W
Wijl + Wjj
Matched vertices { i , j } are

merged into a super-vertex
at the next coarsening level.
Gl
+1
Gll+1
@
@'
Gl+2
Partition energy at level l :

X
matched{i,j}
l
Wiil + 2Wijl + Wjj
dli + dlj
K
X
Assoc(C l )
k
k=1
Vol(Ckl )
where Ckl is a super-vertex computed by (P1), i.e. Ckl := matched{i, j}.

It is also a cluster with at most 2k+1 original vertices.
Figure 6: Graph coarsening with Graclus. Graclus proceeds by two successive steps: (P1)
Vertex matching, and (P2) Graph coarsening. These two steps provide a local solution to
the Normalized Association clustering problem at each coarsening level l.'
[5] Dhillon, Guan, Kulis, 2007'
!"#$%&'(&%))*+'
/0'
Fast Graph Pooling 1/2!

Graph pooling: As standard CNNs, we must define a pooling process such
as max pooling or average pooling. This operation will be done many times
during the optimization task. !
Unstructured pooling is inecient: The graph and its coarsened versions

indexed by Graclus require to store a table with all matched vertices.
!
Memory consuming and not (easily) parallelizable.
!
Structured pooling is as ecient as a 1D grid pooling: Start from the

coarsest level, then propagate the ordering to the next finer level such that node
k has nodes 2k and 2k+1 as children binary tree arrangement of the nodes
such that adjacent nodes are hierarchically merged at the next coarser level. !
Xavier Bresson
46
Fast Graph Pooling 2/2!

Graph coarsening:
Graph coarsening:
Matched vertices
0
Gl=0 = G
Gl=0 = G
55,, 7
2 0, 2
33,, 6
55,, 7
3, 6
0, 2
1, 4
Graph pooling:
0
2
0
Reindexing w.r.t.
coarsening structure
66,, 7
0 0, 1
Gl=1
Gl=2
44,, 5
1 22,, 3
0, 1
2, 3
44,, 5
66,, 7
Graph pooling:
3
Gl=1
3 11,, 4
Gl=2
7
1
Unstructured arrangement of vertices
3
1
7
3
Binary tree arrangement of vertices
Figure 7: Fast graph pooling using graph coarsening structure. The binary tree arrangement of
vertices allows a very e"cient pooling on graphs, as fast as a regular 1D Euclidean grid pooling.'
!"#$%&'(&%))*+'
/2'
Full Architecture of Graph CNNs!

!"#$"%&'"#(%)*+%,-./)
0C6G)1(.(>-,-./)
0CIJ6G)"1-.('"#/)C;?R/G)
G = Gl=0
g1K1
7-8&)(9'$('"#)
:)
;.(1F)9"(./-#+#2)
Gl=2
Gl=1
*(9,".)51)
?.-M9">1&,-<)
:)
?""%+#2)C;?R/G)
@'
g2K1
g3K1
0&,1&,)/+2#(%)
;.(1F)
!%(//)%(3-%/)
IDS)/"9+(%T)3+"%"2+9(%T))
,-%-9">>&#+9('"#)2.(1F/)
x 2 Rn
x
l=0
2R
nl=0
A#1&,)/+2#(%)
"#)2.(1F/)
y 2 R nc
l=6 2 Rn5 nc
xl=0
2 R n0 F 1
g
l=1 2 RK1 F1
xl=1 2 Rn1 F1
n1 = n0 /2p1
xl=5 2 Rn5 F5
l=5 2 RK5 F1 ...F5
;.(1F)9"#$"%&'"#(%)%(B-./)
C-D,.(9,)>&%'/9(%-)%"9(%)/,('"#(.B)E-(,&.-/)"#)2.(1F/G)
*&%%B)9"##-9,-<)%(B-./)
Figure 8. Architecture of the proposed CNNs on graphs.

Notation: l is the coarsening level, xl are the down sampled signals at layer
l, Gl is the coarser graph, gKl are the spectral filters at layer l, xlg are the
filtered signals, pl is the coarsening exponent, nc is the number of classes, y is
the output signal, and l is the number of parameters to learn at l.
!"#$%&'(&%))*+'
/3'
Optimization!
!! Backpropagation [6] = Chain rule applied to the neurons at each layer.!
yj =
Fin
X
gij (L)xi
i=1
'
yj
ij
'
xi
@E
@xi
Loss function: E =
ls log ys
Local !
computations!
s2S
Gradient descents:
n+1
n
= ij
ij
xn+1
= xni
i
@E
. @
ij
@E
. @xi
[6] Werbos 1982 and Rumelhart, Hinton, Williams, 1985'

!"#$%&'(&%))*+'
@yj
@ij
@E
E
@yyj
Backpropagation '
Accumulative
Accumulative!
computations!
Local gradients:
Xh
@E @yj
@E
X0,s , ..., XK
=
=
@ij
@yj @ij
s2S
F
out
X
@E @yj
@E
@E
=
=
gij (L)
@xi
@yj @xi
@yj
j=1
iT @E
1,s
@yj,s
/4'
Revisiting Euclidean CNNs!

!! Sanity check: MNIST is the most popular dataset in Deep Learning [8].!
It is a dataset of 70,000 images represented on a 2D grid of size 28x28
(dim data = 282 = 784) of handwritten digits, from 0 to 9.!
!! Graph: A k-NN graph (k=8) of the Euclidean grid: !
Wij = e
kxi
kxi xj k22 /
xj k2
[8] LeCun, Bottou, Bengio, 1998'

!"#$%&'(&%))*+'
05'
Revisiting Euclidean CNNs!
!! Results: Classification rates!
Algorithm
Linear SVM
Softmax
CNNs [LeNet5]
graph CNNs: CN32-P4-CN64-P4-FC512-softmax
!"#$%&'(&%))*+'
Accuracy
91.76
92.36
99.33
99.18
0,'
Non-Euclidean CNNs!
!! Text categorization with 20NEWS: It is a benchmark dataset introduced at
CMU [9]. It has 20,000 text documents (dim data = 33,000, #words in
dictionary) across 20 topics.!
Table 1. 20 Topics of 20NEWS!
Instance of document in topic:!

Auto!
Instance of document in topic:!

Medicine!
[9] Leng, 1995'

!"#$%&'(&%))*+'
0-'
Non-Euclidean CNNs!
Results: Classification rates!

Algorithm
Linear SVM
Multinomial NB
Softmax
FC + softmax + dropout
FC + FC + softmax + dropout
graph CNNs: CN32-softmax
Xavier Bresson
Word2vec features
65.90
68.51
66.28
64.64
65.76
68.26
53
Demo: Convolutional Neural Networks for

Graph-Structured Data!
!"#$%&'(&%))*+'
0/'
Outline!
History of CNNs!
Standard CNNs!
Conclusion!
Xavier Bresson
55
Summary!
CNNs is game changer:!
(1) Breakthrough for all Computer Vision-related problems!
(2) Revive dream of Artificial Intelligence!
(3) Deep learning = Big Data + GPUs/Cloud + Neural Networks!
(4) Big question why it works so well?!
CNNs for unstructured data: Beyond Computer Vision!

(1) Generalization of CNNs to non-Euclidean domains/graph-structured data!
(2) Localized filters on graphs!
(3) Same learning complexity as CNNs while being universal to graphs!
(4) GPU implementation !
Xavier Bresson
56
QuesBons?
Xavier Bresson
57
Data Science
Sept 12-14, 2016!

Recurrent Neural Networks!
Xavier Bresson!
!

!"#$%&'(&%))*+'
,'
Outline!
Motivation!
Vanilla Recurrent Neural Networks (RNNs)!
Long Short-Term Memory (LSTM)!
Conclusion!
Xavier Bresson
Motivation!
!! Recurrent Neural Networks (RNNs) operate on ordered sequences of
inputs and outputs. Examples: Text, financial series, videos, robot motion, etc.!
Output data!
Ex: class!
Hidden !
Layers!
Input data!
Ex: image'
Vanilla NNs!
One input vector maps to
one output vector.!
Ex: CNNs (image to
class)'
!"#$%&'(&%))*+'
Output data!
Ex: sequence of
words!
Hidden !
Layers!
Input data!
Ex: image'
RNNs!
One input vector maps to
multiple output vectors.!
Ex: Image captioning
(image to caption sentence)'
.'
Motivation!
Output data!
Ex: sequence of
words!
Output data!
Ex: class!
Hidden !
Layers!
Hidden !
Layers!
Input data!
Ex: sequence of
images!
Input data!
Ex: sequence of
words!
RNNs!
Multiple input vectors map
to an output vector.!
Ex: Video classification
(sequence of images to class)'
RNNs!
Multiple input vectors map
to multiple output vectors.!
Ex: Machine translation
(sentence to sentence)'
!! RNNs learn dynamics/temporal properties of data.!

!"#$%&'(&%))*+'
/'
Outline!
Motivation!
Conclusion!
Xavier Bresson
General Description!
!! RNNs are recurrent learning machine:!
!
(1) RNNs have a state ht. This state ht can be

modified by changing the RNN parameters
W. '
!
(2) RNNs receive at each time t an input vector

x, and learn to predict the next input vector
x at time t+1 with the output vector y.!
Example: hello'
hello
Block RNN !
State ht,!
Parameters W'
x'y'
!"#$%&'(&%))*+'
1'
Recurrence Formula!
!! Update of the RNN state is done with a
recurrence formula at each time step:!
Recurrence! Previous!
function'
state'
ht = fW ((ht
New state of
RNN'
Weights/!
Parameters of
recurrence
function'
1 , xt )
ht
Recurrence !
Formula '
Input vector at
current time step'
!! Notes:!
(1) Recurrence function is independent of time t!!
Same function f is used at every time step.!
(2) Changing W will change the behavior of RNNs.!
(3) Weights W are learned by backpropagation on training data.!
!"#$%&'(&%))*+'
2'
Vanilla RNNs!
!! Simplest RNNs:!
ht = fW (ht
1 , xt )
ht
Recurrence !
Formula '
RNN state at !
step t'
ht = tanh(Whh ht
yt = Why ht
+ Wxh xt )
tanh!
!"#$%&'(&%))*+'
3'
Example: Character-Level Language Model!

!! Ask RNN to predict the next character in a sequence.!
Simple example: Training sequence is hello, Vocabulary={h,e,l,o} input
vector is 4-dimensional.!
Unormalized !
probability!
Recurrence formula:!
Not!
good!
good!!
ht = tanh(Whh ht
+ Wxh xt )
Linear/softmax classifier!
for next character:
character:!
yt = Why ht
Note: In text
analysis, we never
work with characters
directly, but with
numbers (via 1-to-1
mapping between
characters and
numbers)!
Learn by !
backpropagation!
Vocabulary!
Vector!
!"#$%&'(&%))*+'
4'
VRNN = 100 Python lines!
https://gist.github.com/karpathy/d4dee566867f8291f086!
Xavier Bresson
10
VRNN = 100 Python lines!
!"#$%&'(&%))*+'
,,'
Example: Shakespeare-like Sequences !

!! Generate sequences during training: Seed with a few characters and
look at outputs. !
!"#$%&'(&%))*+'
,-'
Demo: Vanilla Recurrent Neural Networks!

!"#$%&'(&%))*+'
,.'
Example: Mathematics!
!! Training data: Open source textbooks on algebraic geometric!
!"#$%&'(&%))*+'
,/'
Example: Code!
!! Training data: Linux code!
!"#$%&'(&%))*+'
,0'
Image Captioning !
!! It is possible to merge CNNs and RNNs!!
Example: Image captioning !
!"#$%&'(&%))*+'
,1'
Design!
!! Step 1: Remove the last FC layer and softmax classifier in CNNs
(classification is not needed, only visual feature extractors).!
CNNs !
!"#$%&'(&%))*+'
,2'
Design!
!! Step 2: Connect CNN output to RNN.!
New!!
!"#$%&'(&%))*+'
,3'
Design!
!! Step 3: Construct the whole RNN.!
!"#$%&'(&%))*+'
,4'
Results!
!"#$%&'(&%))*+'
-5'
Demo: Image Captioning with RNNs!

!"#$%&'(&%))*+'
-,'
Deep RNNs!
!! Multilayer RNNs:!
+ Wxh xt )
ht = tanh(Whh ht
rewriting'
ht = tanh W
!"#$%&'(&%))*+'
xt
ht 1
layer'
with W =
Wxh
0
0
Whh
State for layer l!
--'
Outline!
Motivation!
Conclusion!
Xavier Bresson
23
Long Short-Term Memory (LSTM) !

[Hochreiter-Schmidhuber97]!
!! With CNNs, another B$ algorithm, among top 10 algorithms in data science.!

Use by all big IT companies: Facebook, Google, Microsoft, Apple, IBM, etc!
!! Standard RNNs su"er from vanishing gradient problem, so they
cannot scale up to deep networks. LSTM does not have this issue.!
!! What is LSTM?!
LSTM is also a RNN but with a more complex
recurrence formula:!
(1) The state of RNN has more variables, and
more weights.!
(2) The update of the state variables is more
complex.!
Recurrence!
formula:'
!"#$%&'(&%))*+'
-/'
Understanding LSTM!
!! From paper: !
!"#$%&'(&%))*+'
-0'
Understanding LSTM!
LSTM has two state vectors: !
h: hidden state vector!
c: cell state vector. !
Besides:!
f: called forget vector!
i: called input vector!
o: called output vector!
Let suppose the variables to be binary, for easier analysis.!

Xavier Bresson
26
Understanding LSTM!
Time step t'
c=next cell state!

f=forget vector, f={0,1}!
This gate can reset the flow, or!
It can let flow the previous cell value'
!"#$%&'(&%))*+'
i=input vector, i={0,1}!

g = {-1,1}!
The gate i*g can add nothing, or!
It can increment the flow by 1, or -1'
-2'
Understanding LSTM!
Cell state c flows!
to hidden state'
tanh for activation

activation'
h=next hidden state!

c=cell state!
o={0,1}!
This gate can reset the flow, or!
It can let flow the previous hidden state'
Hidden state h flows!

to cell state'
!"#$%&'(&%))*+'
-3'
Understanding LSTM!
Stack up to get multilayer LSTM: !
Xavier Bresson
29
RNNs vs. LSTM!

The (+) gates distributes the gradient equally during
backpropagation. It allows to avoid the vanishing gradient problem
(otherwise the gradient vanishes quickly).!
Xavier Bresson
30
LSTM Variants!
At the end of the day, LSTM gives the best performances over many
possible experimental conditions. !
Xavier Bresson
31
Demo: Image Captioning with LSTMs!

!"#$%&'(&%))*+'
.-'
Outline!
Motivation!
Conclusion!
Xavier Bresson
33
Summary!
RNNs oer lots of flexibility in NN architecture.!
!
Vanilla RNNs do not work well (vanishing gradient).!

!
Use LSTM (no vanishing gradient).!

!
Hot research:!
(1) Architecture design.!
(2) Better understanding.!
(3) Why performances are so good? Open theoretical question.!
Xavier Bresson
34
Ques8ons?
Xavier Bresson
35
Data Science!
Sept 12-14, 2016!

Lecture 13: Conclusion!
Xavier Bresson!
!
!"#$%&'(&%))*+'
,'
Data Science !
Science of transforming raw data into meaningful
knowledge to provide smart decisions to real-world
problems.!
!"#$%&'(&%))*+'
-'
Data Science!
Computer Science
Science!
Scalable databases for storing, accessing data. !

E.g. Cloud computing, Amazon EC2, Hadoop.!
Distributed and parallel frameworks !

for data processing. !
E.g. MapReduce, GraphLab.
GraphLab.!
Personalized !
Services!
Services
E.g. Healthcare (enhanced diagnostics)

diagnostics)!
(products)!
Commerce (products)
Mathematical!
Mathematical
Modeling!
Modeling
Design algorithms that transform

transform!
data into knowledge.
knowledge.!
Use Linear algebra, optimization, !
graph theory, statistics.!
statistics.
Data Science
Science!
Multidisciplinary field: 1+1=3

1+1=3!
Data!
Knowledge !
Discovery !
E.g. Physics, genomics, !
social sciences.
sciences.!
Collection of massive amounts of !

data at increasing rate.!
E.g. Social networks, sensor networks, !
mobile devices, biological networks,!
administrative, economics data!
Issues of privacy, !
ownership!
security, ownership
Domain
Domain!
Expertise
Expertise!
Sciences!
Sciences
E.g. Economy, Biology, Physics, Neuroscience, sociology.

sociology.!
Government!
Government
E.g. Healthcare, Defense, Education, Transportation..!
Industry!
Industry
Intelligent !
Systems!
Systems
E.g. Autonomous cars, security, !

interactive tools for data organization
organization!
and exploration. !
E.g. E-commerce, Telecommunications, !

Finance.
Finance.!
Major challenges: Multidisciplinary integration, large-scale databases, scalable

computational infrastructures, design math algorithms for massive datasets, trade-o"
speed and accuracy for real-time decisions, interactive visualization tools. !
Deep Learning!
Data Science = Big Data + Computational Infrastructure + Artificial Intelligence!
3rd industrial !
revolution!
!"#$%&'(&%))*+'
Cloud computing
computing!
GPU!
Math parts!
.'
A Brief History of Data Science/Deep Learning!

RNN!
Schmidhuber!
CNN!
LeCun!
First !
NIPS!
Visual primary cortex!
Hubel-Wiesel!
1959'
1962 1975'
1962'
1958'
Backprop !
Perceptron
Perceptron!
Werbos!
Rosenblatt
Rosenblatt!
First !
KDD
KDD!
1989'
1989
1987'
Neocognitron!
Fukushima!
Birth of!
Data Science!
Split from Statistics!
Tukey!
AI Hope!
!"#$%&'(&%))*+'
Big Data!
Volume doubles/1.5 year!
1998
1997 1998'
1997'
1995'
1999
1999'
Hardware!
GPU speed doubles/ year!
First Amazon!
Cloud Center!
Google AI !
TensorFlow!
Facebook AI!
Torch!
Kaggle!
Platform!
2010'
2006'
Auto-encoder!
LeCun, Hinton, Bengio!
First NVIDIA !
GPU!
SVM/Kernel techniques!
Vapnik!
AI Winter [1966-2012]!
Kernel techniques!
Handcrafted features!
Graphical models!
2012'
2012
2014' 2015'
Data scientist!
Facebook Center!
1st Job in US!
OpenAI Center!
4th industrial revolution?!

Digital Intelligence !
Deep Learning!
Revolution!
Breakthough !
or new AI bubble?!
Hinton, Ng!
AI Resurgence!
/'
Outline of the Course!

1st day!
Graph Science
Science!
Data structure
structure!
Pattern extraction!
extraction
Unsupervised
Clustering
Clustering!
k-means, graph cuts
cuts!
Python!
Python
Language for !
data science!
science
Supervised
Classification!
Classification
SVM!
SVM
Deep Learning
Learning!
NNs, CNNs, RNNs,
RNNs,!
Data Science
Science!
Pagerank, collaborative
Pagerank
filtering
content filtering!
3rd day!
Data
Visualization!
Visualization
Manifold, t-SNE
t-SNE!
!"#$%&'(&%))*+'
Recommender
Systems
Systems!
Feature
Extraction!
Extraction
PCA, NMF, Sparse

coding
coding!
2nd day!
0'
Current Deep Learning!
!"#$%&'(&%))*+'
1'
Rapid Development!!
!"#$%&'(&%))*+'
2'
Future of Deep Learning!

!! Deep Learning is a new revolutionary
paradigm in AI.!
It has the capability to find highly meaningful
patterns in big data.!
!! Deep Learning is a breakthrough in

Computer Vision and Voice Recognition. !
However, it does not have yet the same
breakthrough in other fields.!
We are far away from a true AI.!
!"#$%&'(&%))*+'
3'
Future of Deep Learning!

!! Unsupervised learning: Google
brain. Self-taught learning with
unlabelled youtube videos and 16,000
computers!
!! Better hardware with bigger
machine cluster "!
!! Bigger data "!
!! Better understanding - why it works? # !
!"#$%&'(&%))*+'
,4'
Thank you!
Xavier Bresson
11

Introduction To Data Science!

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Introduction To Data Science!

Uploaded by

Copyright:

Available Formats

Data Science!

Sept 12-14, 2016!

EPFL-UNIL Continuing Education !

Swiss Federal Institute of Technology (EPFL) !

Professor, Vice-Provost EPFL!

Prof. Pierre Vandergheynst!

Dr. Xavier Bresson !

Source: Drew Conway '

Best job in the U.S in 2015 [Forbes].!

What is Data Science? - Short Answer !

What is Data Science? - Long Answer

Q: What are the applications? !

Scalable databases for storing, accessing data. !

Distributed and parallel frameworks !

E.g. Healthcare (enhanced diagnostics)

Design algorithms that transform

Multidisciplinary field: 1+1=3

Collection of massive amounts of !

E.g. Economy, Biology, Physics, Neuroscience, sociology.

Q: What are the main !

E.g. Autonomous cars, security, !

E.g. E-commerce, Telecommunications, !

Major challenges: Multidisciplinary integration, large-scale databases, scalable

What is Data Science? - Medium Answer !

Data Science = Big Data + Computational Infrastructure + Artificial Intelligence

A Brief History of Data Science!

Q: Did you hear about !

4th industrial revolution?!

Data Science and Graph Science!

and Graph Science !!

!! Graphs are the most important discrete

Network of Text Documents!

Why Networks are important?!

Outline of the Course!

PCA, NMF, Sparse

Structure of the Course!

Coding illustration on real-world data.!

Please, ask questions! !

Please, share your own data science problem for discussion.!

Goal of the Course!

EPFL-UNIL Continuing Education !

Swiss Federal Institute of Technology (EPFL) !

Python Pros, for Prototyping

easy to share & install packages via pip

Python Pros, for Production

EPFL-UNIL Continuing Education !

Swiss Federal Institute of Technology (EPFL) !

Q: Why are they useful? !

Graph Science = Graph Theory!

A: Not possible. Needs cycles

Q: Cite a few networks? !

Minnesota Road Network'

!! Essential data $$ lie on network structures, like medical, social,

MNIST Image Graph!

(1) MNIST image network!

GTZAN Music Graph !

(4) 3D mesh points!

!! Graph Construction: No universal recipe but good common practices

time < 1sec!

time > 1 hour!

FLANN is a library for performing fast !

approximate nearest neighbor searches in high dimensional spaces.!

Q: Why using artificial math networks?!