
Neighbourhood Component Analysis

T.S. Yo
References
Outline

● Introduction
● Learn the distance metric from data
● The size of K
● Procedure of NCA
● Experiments
● Discussions
Introduction (1/2)

● KNN
– Simple and effective
– Nonlinear decision surface
– Non-parametric
– Quality improves with more data
– Only one parameter, K, so it is easy to tune
Introduction (2/2)
● Drawbacks of KNN
– Computationally expensive: searches through the whole training data at test time
– How to define the “distance” properly?

● Learn the distance metric from data, and force it to be low rank.
Learn the Distance from Data (1/5)
● What is a good distance metric?
– The one that minimizes (optimizes) the cost!

● Then, what is the cost?


– The expected test error
– Best estimated by the leave-one-out (LOO) cross-validation error on the training data
Kohavi, Ron (1995). "A study of cross-validation and bootstrap for accuracy estimation and model selection." Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 2(12): 1137–1143. Morgan Kaufmann, San Mateo.
Learn the Distance from Data (2/5)
● Modeling the LOO error:
– Let pij be the probability that point xj is selected as point xi's neighbour.
– The probability that point xi is correctly classified when it is used as the reference is pi (see the formula below).

● To maximize pi for all xi is to minimize the LOO error.
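A sketch of the quantity referred to above, following the NCA formulation of Goldberger et al. (the notation on the original slide may differ): with C_i denoting the set of points in the same class as x_i,

p_i = \sum_{j \in C_i} p_{ij}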
Learn the Distance from Data (3/5)
● Then, how do we define pij?
– According to the softmax of the distance dij

– Relatively smoother than dij

[Figure: the softmax function, exp(-x), plotted against x]
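For reference, the softmax weighting used in the NCA paper (with p_{ii} = 0) is

p_{ij} = \frac{\exp(-d_{ij})}{\sum_{k \neq i} \exp(-d_{ik})}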
Learn the Distance from Data (4/5)
● How do we define dij?
● Restrict the distance measure to the Mahalanobis (quadratic) distance.

● That is to say, we project the original feature vectors x into another vector space with a transformation matrix, A.
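A sketch of the distance this presumably denotes, matching the NCA paper: with Q = A^T A restricted to be positive semi-definite,

d_{ij} = (x_i - x_j)^T Q (x_i - x_j) = (A x_i - A x_j)^T (A x_i - A x_j)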
Learn the Distance from Data (5/5)
● Substitute the dij into pij
● Now we have the objective function, f(A) (both shown below)

● Maximize f(A) w.r.t. A → minimize the overall LOO error
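For reference, the substituted probabilities and the resulting objective, as given in the NCA paper (the original slide's notation may differ slightly):

p_{ij} = \frac{\exp(-\|A x_i - A x_j\|^2)}{\sum_{k \neq i} \exp(-\|A x_i - A x_k\|^2)}, \qquad p_{ii} = 0

f(A) = \sum_i \sum_{j \in C_i} p_{ij} = \sum_i p_i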
The Size of k
● For the probability distribution pij:

● The perplexity can be used as an estimate of the number of neighbours to consider, k
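A sketch of the standard perplexity definition this heuristic presumably relies on (the exact form on the original slide is not preserved):

k_i \approx \operatorname{perplexity}(p_{i\cdot}) = \exp\Big(-\sum_{j \neq i} p_{ij} \log p_{ij}\Big)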
Procedure of NCA (1/2)
● Use the objective function and its gradient to learn the transformation matrix A and K from the training data, Dtrain (with or without dimension reduction).
● Project the test data, Dtest, into the transformed space.
● Perform traditional KNN (with K and ADtrain) on the transformed test data, ADtest.
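A minimal runnable sketch of this procedure using scikit-learn's NeighborhoodComponentsAnalysis (note that scikit-learn does not implement the perplexity-based choice of K from the previous slide, so K is fixed by hand here; wine is one of the UCI datasets used later):

# Learn A with NCA on D_train, then run KNN in the transformed space.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NeighborhoodComponentsAnalysis, KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

nca = NeighborhoodComponentsAnalysis(n_components=2, random_state=0)  # rank-2 A
knn = KNeighborsClassifier(n_neighbors=5)                             # K fixed by hand
model = make_pipeline(StandardScaler(), nca, knn)
model.fit(X_tr, y_tr)                                    # learns A, fits KNN on A*D_train
print("test error rate:", 1 - model.score(X_te, y_te))   # KNN on A*D_test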
Procedure of NCA (2/2)
● Functions used for optimization
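The formulas on this slide were not preserved in the export; for reference, the gradient of f(A) given in Goldberger et al.'s NCA paper is, with x_{ij} = x_i - x_j,

\frac{\partial f}{\partial A} = 2A \sum_i \Big( p_i \sum_k p_{ik}\, x_{ik} x_{ik}^T - \sum_{j \in C_i} p_{ij}\, x_{ij} x_{ij}^T \Big)

which can be supplied, together with f(A), to a gradient-based optimizer (e.g. conjugate gradient).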
Experiments – Datasets (1/2)
● 4 from the UCI ML Repository, 2 synthetic (self-made)
Experiments – Datasets (2/2)

n2d is a mixture of two bivariate normal distributions with different means and
covariance matrices. ring consists of 2-d concentric rings and 8 dimensions of
uniform random noise.
Experiments – Results (1/4)

Error rates of KNN and NCA with the same K.


The results show that NCA generally improves the performance of KNN.
Experiments – Results (2/4)
Experiments – Results (3/4)
● Comparison with other classifiers
Experiments – Results (4/4)
● Rank 2 dimension reduction
Discussions (1/8)
● Rank 2 transformation for wine
Discussions (2/8)
● Rank 1 transformation for n2d
Discussions (3/8)
● Results of Goldberger et al. (40 realizations of 30%/70% splits)
Discussions (4/8)

● Results of Goldberger et al. (rank 2 transformation)
Discussions (5/8)
● The experimental results suggest that KNN classification can be improved with the distance metric learned by the NCA algorithm.

● NCA also outperforms traditional dimension-reduction methods on several datasets.
Discussions (6/8)
● Compared with other classification methods (e.g. LDA and QDA), NCA usually does not give the best accuracy.

● Some odd results for dimension reduction suggest that further investigation of the optimization algorithm is necessary.
Discussions (7/8)
● Optimize a matrix

Can we Optimize these Functions? (Michael L. Overton)
– Globally, no. Related problems are NP-hard (Blondel-Tsitsiklis, Nemirovski).
– Locally, yes.
● But not by standard methods for nonconvex, smooth optimization.
● Steepest descent, BFGS, or nonlinear conjugate gradient will typically jam because of nonsmoothness.
Discussions (8/8)
● Other methods that learn a distance metric from data
– Discriminant Common Vectors (DCV)
● Like NCA, DCV focuses on optimizing the distance metric with respect to a certain objective function
– Laplacianfaces (LAP)
● Puts more emphasis on dimension reduction

J. Liu and S. Chen, "Discriminant Common Vectors versus Neighbourhood Components Analysis and Laplacianfaces: A comparative study in small sample size problem," Image and Vision Computing.
Questions?
Thank you!
Derive the Objective Function (1/5)
● From the assumptions, we have:
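The equations on these backup slides were not preserved in the export; restating the assumptions from the main slides (in Goldberger et al.'s notation), the derivation starts from

p_{ij} = \frac{\exp(-\|A x_i - A x_j\|^2)}{\sum_{k \neq i} \exp(-\|A x_i - A x_k\|^2)}, \qquad p_{ii} = 0, \qquad f(A) = \sum_i \sum_{j \in C_i} p_{ij}

and differentiating f(A) with respect to A (using the softmax form of pij) yields the gradient shown on the optimization slide.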
Derive the Objective Function (2/5)
Derive the Objective Function (3/5)
Derive the Objective Function (4/5)
Derive the Objective Function (5/5)
