
Neighbourhood Component Analysis

T.S. Yo
References
Outline

● Introduction
● Learn the distance metric from data
● The size of K
● Procedure of NCA
● Experiments
● Discussions
Introduction (1/2)

● KNN
– Simple and effective
– Nonlinear decision surface
– Non-parametric
– Quality improves with more data
– Only one parameter, K, so it is easy to tune
Introduction (2/2)
● Drawbacks of KNN
– Computationally expensive: searches through the whole training data at test time
– How to define the “distance” properly?

● Learn the distance metric from data, and force it to be low rank.
Learn the Distance from Data (1/5)
● What is a good distance metric?
– The one that minimizes (optimizes) the cost!

● Then, what is the cost?


– The expected test error
– Best estimated by the leave-one-out (LOO) cross-validation error on the training data
Kohavi, Ron (1995). "A study of cross-validation and bootstrap for accuracy estimation and model selection." Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 2(12): 1137–1143. Morgan Kaufmann, San Mateo.
Learn the Distance from Data (2/5)
● Modeling the LOO error:
– Let pij be the probability that point xj is selected as point xi's neighbour.
– The probability that point xi is correctly classified when it is used as the reference is pi (see the formula below).

● To maximize pi for all xi is to minimize the LOO error.
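A sketch of the quantity referred to above, following the NCA formulation of Goldberger et al. (the notation on the original slide may differ): with C_i denoting the set of points in the same class as x_i,

p_i = \sum_{j \in C_i} p_{ij}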
Learn the Distance from Data (3/5)
● Then, how do we define pij?
– According to the softmax of the distance dij

– Relatively smoother than dij

[Figure: the softmax function, exp(-x), plotted against x]
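For reference, the softmax weighting used in the NCA paper (with p_{ii} = 0) is

p_{ij} = \frac{\exp(-d_{ij})}{\sum_{k \neq i} \exp(-d_{ik})}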
Learn the Distance from Data (4/5)
● How do we define dij?
● Restrict the distance measure to the Mahalanobis (quadratic) distance.

● That is to say, we project the original feature vectors x into another vector space with a transformation matrix, A.
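A sketch of the distance this presumably denotes, matching the NCA paper: with Q = A^T A restricted to be positive semi-definite,

d_{ij} = (x_i - x_j)^T Q (x_i - x_j) = (A x_i - A x_j)^T (A x_i - A x_j)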
Learn the Distance from Data (5/5)
● Substitute the dij into pij
● Now we have the objective function, f(A) (both shown below)

● Maximize f(A) w.r.t. A → minimize the overall LOO error
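For reference, the substituted probabilities and the resulting objective, as given in the NCA paper (the original slide's notation may differ slightly):

p_{ij} = \frac{\exp(-\|A x_i - A x_j\|^2)}{\sum_{k \neq i} \exp(-\|A x_i - A x_k\|^2)}, \qquad p_{ii} = 0

f(A) = \sum_i \sum_{j \in C_i} p_{ij} = \sum_i p_i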
The Size of k
● For the probability distribution pij:

● The perplexity can be used as an estimate of the number of neighbours to consider, k
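A sketch of the standard perplexity definition this heuristic presumably relies on (the exact form on the original slide is not preserved):

k_i \approx \operatorname{perplexity}(p_{i\cdot}) = \exp\Big(-\sum_{j \neq i} p_{ij} \log p_{ij}\Big)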
Procedure of NCA (1/2)
● Use the objective function and its gradient to learn the transformation matrix A and K from the training data, Dtrain (with or without dimension reduction).
● Project the test data, Dtest, into the transformed space.
● Perform traditional KNN (with K and ADtrain) on the transformed test data, ADtest.
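A minimal runnable sketch of this procedure using scikit-learn's NeighborhoodComponentsAnalysis (note that scikit-learn does not implement the perplexity-based choice of K from the previous slide, so K is fixed by hand here; wine is one of the UCI datasets used later):

# Learn A with NCA on D_train, then run KNN in the transformed space.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NeighborhoodComponentsAnalysis, KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

nca = NeighborhoodComponentsAnalysis(n_components=2, random_state=0)  # rank-2 A
knn = KNeighborsClassifier(n_neighbors=5)                             # K fixed by hand
model = make_pipeline(StandardScaler(), nca, knn)
model.fit(X_tr, y_tr)                                    # learns A, fits KNN on A*D_train
print("test error rate:", 1 - model.score(X_te, y_te))   # KNN on A*D_test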
Procedure of NCA (2/2)
● Functions used for optimization
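The formulas on this slide were not preserved in the export; for reference, the gradient of f(A) given in Goldberger et al.'s NCA paper is, with x_{ij} = x_i - x_j,

\frac{\partial f}{\partial A} = 2A \sum_i \Big( p_i \sum_k p_{ik}\, x_{ik} x_{ik}^T - \sum_{j \in C_i} p_{ij}\, x_{ij} x_{ij}^T \Big)

which can be supplied, together with f(A), to a gradient-based optimizer (e.g. conjugate gradient).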
Experiments – Datasets (1/2)
● 4 from the UCI ML Repository, 2 synthetic (self-made)
Experiments – Datasets (2/2)

n2d is a mixture of two bivariate normal distributions with different means and
covariance matrices. ring consists of 2-d concentric rings and 8 dimensions of
uniform random noise.
Experiments – Results (1/4)

Error rates of KNN and NCA with the same K.


The results show that NCA generally improves the performance of KNN.
Experiments – Results (2/4)
Experiments – Results (3/4)
● Comparison with other classifiers
Experiments – Results (4/4)
● Rank 2 dimension reduction
Discussions (1/8)
● Rank 2 transformation for wine
Discussions (2/8)
● Rank 1 transformation for n2d
Discussions (3/8)
● Results of Goldberger et al. (40 realizations of 30%/70% splits)
Discussions (4/8)

● Results of Goldberger et al. (rank 2 transformation)
Discussions (5/8)
● The experimental results suggest that KNN classification can be improved with the distance metric learned by the NCA algorithm.

● NCA also outperforms traditional dimension-reduction methods on several datasets.
Discussions (6/8)
● Compared with other classification methods (e.g. LDA and QDA), NCA usually does not give the best accuracy.

● Some odd results for dimension reduction suggest that further investigation of the optimization algorithm is necessary.
Discussions (7/8)
● Optimize a matrix

Can we Optimize these Functions? (Michael L. Overton)
– Globally, no. Related problems are NP-hard (Blondel-Tsitsiklis, Nemirovski).
– Locally, yes.
● But not by standard methods for nonconvex, smooth optimization.
● Steepest descent, BFGS, or nonlinear conjugate gradient will typically jam because of nonsmoothness.
Discussions (8/8)
● Other methods that learn a distance metric from data
– Discriminant Common Vectors (DCV)
● Like NCA, DCV focuses on optimizing the distance metric with respect to a certain objective function
– Laplacianfaces (LAP)
● Puts more emphasis on dimension reduction

J. Liu and S. Chen, "Discriminant Common Vectors versus Neighbourhood Components Analysis and Laplacianfaces: A comparative study in small sample size problem," Image and Vision Computing.
Questions?
Thank you!
Derive the Objective Function (1/5)
● From the assumptions, we have:
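The equations on these backup slides were not preserved in the export; restating the assumptions from the main slides (in Goldberger et al.'s notation), the derivation starts from

p_{ij} = \frac{\exp(-\|A x_i - A x_j\|^2)}{\sum_{k \neq i} \exp(-\|A x_i - A x_k\|^2)}, \qquad p_{ii} = 0, \qquad f(A) = \sum_i \sum_{j \in C_i} p_{ij}

and differentiating f(A) with respect to A (using the softmax form of pij) yields the gradient shown on the optimization slide.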
Derive the Objective Function (2/5)
Derive the Objective Function (3/5)
Derive the Objective Function (4/5)
Derive the Objective Function (5/5)
