
Metric and Kernel Learning

Inderjit S. Dhillon University of Texas at Austin

Cornell University, Oct 30, 2007

Joint work with Jason Davis, Prateek Jain, Brian Kulis and Suvrit Sra


Metric Learning
Goal: Learn a distance metric between data points
Important problem in data mining & machine learning
Can govern the success or failure of a data mining algorithm


Metric Learning: Example I

Similarity by Person (identity) or by Pose



Metric Learning: Example II


Consider a set of text documents
Each document is a review of a classical music piece
Might be clusterable in one of two ways:
  By Composer (Beethoven, Mozart, Mendelssohn)
  By Form (Symphony, Sonata, Concerto)

Similarity by Composer or by Form


Mahalanobis Distances
We restrict ourselves to learning Mahalanobis distances:
Distance parameterized by a positive definite matrix Σ:
  d_Σ(x1, x2) = (x1 − x2)^T Σ (x1 − x2)
Often Σ is the inverse of the covariance matrix
Generalizes squared Euclidean distance (Σ = I)
Rotates and scales input data

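As a quick illustration (my own sketch, not from the talk; the example Σ choices below are illustrative), the definition translates directly into code:

```python
import numpy as np

def mahalanobis_sq(x1, x2, Sigma):
    """Squared Mahalanobis distance d_Sigma(x1, x2) = (x1 - x2)^T Sigma (x1 - x2)."""
    diff = x1 - x2
    return float(diff @ Sigma @ diff)

x1 = np.array([1.0, 2.0])
x2 = np.array([0.0, 0.0])

# Sigma = I recovers the squared Euclidean distance
print(mahalanobis_sq(x1, x2, np.eye(2)))          # 5.0

# A common data-driven choice: Sigma = inverse of the sample covariance matrix
X = np.random.randn(100, 2)
Sigma = np.linalg.inv(np.cov(X, rowvar=False))
print(mahalanobis_sq(x1, x2, Sigma))
```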

Example: Four Blobs


[Figure: scatter plot of four point clusters (blobs) in the plane]



Example: Four Blobs


[Figure: the four blobs, with one of the two possible groupings highlighted]

Want to learn: Σ = [[1, 0], [0, 0]]  (distance measured along the first coordinate only)


Example: Four Blobs


[Figure: the four blobs, with the other grouping highlighted]

Want to learn: Σ = [[0, 0], [0, 1]]  (distance measured along the second coordinate only)

Problem Formulation
Metric Learning Goal:
  min_Σ  loss(Σ, Σ0)
  s.t.  (x_i − x_j)^T Σ (x_i − x_j) ≤ u   if (i, j) ∈ S   [similarity constraints]
        (x_i − x_j)^T Σ (x_i − x_j) ≥ ℓ   if (i, j) ∈ D   [dissimilarity constraints]
Learn an spd matrix Σ that is close to the baseline spd matrix Σ0
Other linear constraints on Σ are possible
Constraints can arise from various scenarios:
  Unsupervised: click-through feedback
  Semi-supervised: must-link and cannot-link constraints
  Supervised: points in the same class have small distance, etc.


QUESTION: What should loss be?



LogDet Divergence
We use loss(Σ, Σ0) to be the Log-Determinant divergence:
  D_ld(Σ, Σ0) = trace(Σ Σ0⁻¹) − log det(Σ Σ0⁻¹) − d
Our Goal:
  min_Σ  D_ld(Σ, Σ0)
  s.t.  (x_i − x_j)^T Σ (x_i − x_j) ≤ u   if (i, j) ∈ S   [similarity constraints]
        (x_i − x_j)^T Σ (x_i − x_j) ≥ ℓ   if (i, j) ∈ D   [dissimilarity constraints]

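For concreteness, a small sketch (mine; the function name is arbitrary) that evaluates the LogDet divergence exactly as defined above:

```python
import numpy as np

def logdet_div(Sigma, Sigma0):
    """D_ld(Sigma, Sigma0) = tr(Sigma Sigma0^{-1}) - log det(Sigma Sigma0^{-1}) - d."""
    d = Sigma.shape[0]
    M = Sigma @ np.linalg.inv(Sigma0)
    sign, logdet = np.linalg.slogdet(M)
    assert sign > 0, "both arguments must be positive definite"
    return np.trace(M) - logdet - d

Sigma0 = np.eye(3)
Sigma = np.diag([1.0, 2.0, 0.5])
print(logdet_div(Sigma, Sigma0))   # > 0, and 0 only when Sigma == Sigma0
print(logdet_div(Sigma0, Sigma0))  # 0.0
```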

Preview
Salient points of our Approach:
Metric learning is equivalent to kernel learning
Generalizes to unseen data points
Can improve upon an input metric or kernel
No expensive eigenvector computation or semi-definite programming

Most existing methods fail to satisfy one or more of the above


Brief Digression


Bregman Divergences
Let φ : S → ℝ be a differentiable, strictly convex function of Legendre type (S ⊆ ℝ^d)
The Bregman divergence D_φ : S × relint(S) → ℝ is defined as
  D_φ(x, y) = φ(x) − φ(y) − (x − y)^T ∇φ(y)

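A minimal sketch (not from the slides) that evaluates D_φ given φ and its gradient, checked against the squared-Euclidean special case that follows:

```python
import numpy as np

def bregman_div(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - (x - y)^T grad_phi(y)."""
    return phi(x) - phi(y) - (x - y) @ grad_phi(y)

# phi(z) = 1/2 z^T z  =>  D_phi(x, y) = 1/2 ||x - y||^2
phi = lambda z: 0.5 * z @ z
grad_phi = lambda z: z

x = np.array([1.0, 3.0])
y = np.array([0.0, 1.0])
print(bregman_div(phi, grad_phi, x, y))   # 2.5
print(0.5 * np.sum((x - y) ** 2))         # 2.5, same value
```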

Bregman Divergences
Example: φ(z) = ½ z^T z gives D_φ(x, y) = ½ ‖x − y‖²

Squared Euclidean distance is a Bregman divergence

Bregman Divergences
Example: φ(z) = z log z gives D_φ(x, y) = x log(x/y) − x + y

Relative entropy (or KL-divergence) is another Bregman divergence

Bregman Divergences
Example: φ(z) = −log z gives D_φ(x, y) = x/y − log(x/y) − 1

Itakura-Saito distance (used in signal processing) is also a Bregman divergence

Examples of Bregman Divergences


Function name    | φ(x)                       | D_φ(x, y)
Squared norm     | ½ x²                       | ½ (x − y)²
Shannon entropy  | x log x − x                | x log(x/y) − x + y
Bit entropy      | x log x + (1−x) log(1−x)   | x log(x/y) + (1−x) log((1−x)/(1−y))
Burg entropy     | −log x                     | x/y − log(x/y) − 1
Hellinger        | −(1 − x²)^{1/2}            | (1 − xy)(1 − y²)^{−1/2} − (1 − x²)^{1/2}
ℓ_p quasi-norm   | −x^p  (0 < p < 1)          | −x^p + p x y^{p−1} − (p−1) y^p
ℓ_p norm         | |x|^p  (1 < p < ∞)         | |x|^p − p x sgn(y) |y|^{p−1} + (p−1) |y|^p
Exponential      | e^x                        | e^x − (x − y + 1) e^y
Inverse          | 1/x                        | 1/x + x/y² − 2/y

Bregman Matrix Divergences


Define D_φ(X, Y) = φ(X) − φ(Y) − trace((∇φ(Y))^T (X − Y))


Bregman Matrix Divergences


Squared Frobenius norm: φ(X) = ½ ‖X‖²_F. Then D_φ(X, Y) = ½ ‖X − Y‖²_F

Bregman Matrix Divergences


von Neumann divergence: for X ⪰ 0, φ(X) = trace(X log X). Then
  D_φ(X, Y) = trace(X log X − X log Y − X + Y)
also called the quantum relative entropy

Bregman Matrix Divergences


LogDet divergence: for X ≻ 0, φ(X) = −log det X. Then
  D_φ(X, Y) = trace(XY⁻¹) − log det(XY⁻¹) − d
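A hedged sketch (my own; it assumes small, strictly positive definite matrices so that scipy.linalg.logm and matrix inversion are safe) evaluating the von Neumann and LogDet matrix divergences defined above:

```python
import numpy as np
from scipy.linalg import logm

def von_neumann_div(X, Y):
    """trace(X log X - X log Y - X + Y) for symmetric positive definite X, Y."""
    return np.trace(X @ logm(X) - X @ logm(Y) - X + Y).real

def logdet_div(X, Y):
    """trace(X Y^{-1}) - log det(X Y^{-1}) - d for positive definite X, Y."""
    d = X.shape[0]
    M = X @ np.linalg.inv(Y)
    return np.trace(M) - np.linalg.slogdet(M)[1] - d

X = np.diag([1.0, 2.0])
Y = np.diag([2.0, 2.0])
print(von_neumann_div(X, Y))   # 1 - log 2  ~ 0.307
print(logdet_div(X, Y))        # 0.5 - log(0.5) - 1  ~ 0.193
```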

LogDet Divergence: Properties I


D_ld(X, Y) = trace(XY⁻¹) − log det(XY⁻¹) − d
Properties:
  Not symmetric
  Triangle inequality does not hold
  Can be unbounded
  Convex in the first argument (not in the second)
  Pythagorean property holds: D_ld(X, Y) ≥ D_ld(X, P(Y)) + D_ld(P(Y), Y), where P(Y) is the Bregman projection of Y onto a convex set containing X
  Divergence between inverses: D_ld(X, Y) = D_ld(Y⁻¹, X⁻¹)


LogDet Divergence: Properties II


D_ld(X, Y) = trace(XY⁻¹) − log det(XY⁻¹) − d
           = Σ_{i=1}^{d} Σ_{j=1}^{d} (v_i^T u_j)² (λ_i/θ_j − log(λ_i/θ_j) − 1)
where X = VΛV^T and Y = UΘU^T are eigendecompositions, with eigenvalues λ_i, θ_j and eigenvectors v_i, u_j

Properties:
  Scale-invariance: D_ld(αX, αY) = D_ld(X, Y), α > 0
  In fact, for any invertible M: D_ld(X, Y) = D_ld(M^T X M, M^T Y M)


LogDet Divergence: Properties II


Definition can be extended to rank-deficient matrices (the sums above then run over the rank r)
Finiteness: D_ld(X, Y) is finite iff X and Y have the same range space

Information-Theoretic Interpretation
Differential relative entropy between two multivariate Gaussians:
  ∫ N(x|μ, Σ0) log ( N(x|μ, Σ0) / N(x|μ, Σ) ) dx = ½ D_ld(Σ, Σ0)

Thus, the following two problems are equivalent:

Relative Entropy Formulation:
  min_Σ  ∫ N(x|μ, Σ0) log ( N(x|μ, Σ0) / N(x|μ, Σ) ) dx
  s.t.  tr(Σ (x_i − x_j)(x_i − x_j)^T) ≤ u
        tr(Σ (x_i − x_j)(x_i − x_j)^T) ≥ ℓ
        Σ ⪰ 0

LogDet Formulation:
  min_Σ  D_ld(Σ, Σ0)
  s.t.  tr(Σ (x_i − x_j)(x_i − x_j)^T) ≤ u
        tr(Σ (x_i − x_j)(x_i − x_j)^T) ≥ ℓ
        Σ ⪰ 0

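A numerical sanity check of this identity (my own sketch; here N(x|μ, Σ) is taken to be parameterized by the precision matrix Σ, i.e. covariance Σ⁻¹, which is the Mahalanobis interpretation used above):

```python
import numpy as np

def kl_gaussians_same_mean(P0, P1):
    """KL( N(mu, P0^{-1}) || N(mu, P1^{-1}) ) for precision matrices P0, P1 (same mean)."""
    d = P0.shape[0]
    C0, C1 = np.linalg.inv(P0), np.linalg.inv(P1)   # covariances
    return 0.5 * (np.trace(np.linalg.inv(C1) @ C0) - d
                  + np.linalg.slogdet(C1)[1] - np.linalg.slogdet(C0)[1])

def logdet_div(X, Y):
    d = X.shape[0]
    M = X @ np.linalg.inv(Y)
    return np.trace(M) - np.linalg.slogdet(M)[1] - d

Sigma0 = np.eye(2)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(kl_gaussians_same_mean(Sigma0, Sigma))   # equals ...
print(0.5 * logdet_div(Sigma, Sigma0))         # ... half the LogDet divergence
```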

Stein's Loss
The LogDet divergence is known as Stein's loss in the statistics community
Stein's loss is the unique scale-invariant loss function for which the uniform minimum variance unbiased estimator is also a minimum risk equivariant estimator


Quasi-Newton Optimization
The LogDet divergence arises in the BFGS and DFP updates
Quasi-Newton methods approximate the Hessian of the function to be minimized
[Fletcher, 1991] The BFGS update can be written as:
  min_B  D_ld(B, B_t)
  subject to  B s_t = y_t   (secant equation)
where s_t = x_{t+1} − x_t and y_t = ∇f_{t+1} − ∇f_t
Closed-form solution:
  B_{t+1} = B_t − (B_t s_t s_t^T B_t)/(s_t^T B_t s_t) + (y_t y_t^T)/(s_t^T y_t)
Similar form for the DFP update
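A minimal sketch of this closed-form update (the standard BFGS Hessian-approximation formula; names are mine, not code from the talk):

```python
import numpy as np

def bfgs_update(B, s, y):
    """B_{t+1} = B - (B s s^T B) / (s^T B s) + (y y^T) / (s^T y)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

B = np.eye(2)
s = np.array([1.0, 0.5])     # step: x_{t+1} - x_t
y = np.array([2.0, 1.5])     # gradient change: grad f_{t+1} - grad f_t
B_next = bfgs_update(B, s, y)
print(B_next @ s, y)          # secant equation holds: B_{t+1} s_t = y_t
```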

Algorithm: Bregman Projections for LogDet


Algorithm: cyclic Bregman projections (successively project onto each linear constraint); converges to the globally optimal solution
Use Bregman projections to update the Mahalanobis matrix:
  min_Σ  D_ld(Σ, Σ_t)   s.t.   (x_i − x_j)^T Σ (x_i − x_j) ≤ u
Can be solved by a rank-one update:
  Σ_{t+1} = Σ_t + β_t Σ_t (x_i − x_j)(x_i − x_j)^T Σ_t
Advantages:
  Automatic enforcement of positive semidefiniteness
  Simple, closed-form projections
  No eigenvector calculation
  Easy to incorporate slack for each constraint
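A sketch of a single projection step (my own simplification: it handles one violated similarity constraint and ignores the slack variables and dual bookkeeping of the full cyclic algorithm):

```python
import numpy as np

def project_similarity(Sigma, xi, xj, u):
    """One Bregman (LogDet) projection onto (xi - xj)^T Sigma (xi - xj) <= u."""
    v = xi - xj
    p = float(v @ Sigma @ v)
    if p <= u:                       # constraint already satisfied: nothing to do
        return Sigma
    beta = (u - p) / (p * p)         # chosen so the constraint holds with equality
    Sv = Sigma @ v
    return Sigma + beta * np.outer(Sv, Sv)   # rank-one update, stays positive definite

Sigma = np.eye(3)
xi, xj = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 0.0])
Sigma = project_similarity(Sigma, xi, xj, u=1.0)
print((xi - xj) @ Sigma @ (xi - xj))   # 1.0: the constraint is now tight
```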

Example: Noisy XOR



No linear transformation can achieve the XOR grouping



Kernel Methods
Map input data to a higher-dimensional feature space: x → φ(x)
Idea: run the machine learning algorithm in feature space
Noisy XOR example: φ(x) = (x1², √2 x1 x2, x2²)


Kernel function: K(x, y) = ⟨φ(x), φ(y)⟩
Kernel trick: no need to explicitly form the high-dimensional features
Noisy XOR example: ⟨φ(x), φ(y)⟩ = (x^T y)²
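A small sketch (mine) showing that the quadratic kernel (x^T y)² reproduces the inner product of the explicit feature map above, so the feature space never has to be formed:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-d inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, y):
    """Kernel trick: the same inner product without forming phi explicitly."""
    return float(x @ y) ** 2

x, y = np.array([1.0, -1.0]), np.array([1.0, 1.0])
print(phi(x) @ phi(y), k(x, y))   # both 0.0: points from opposite XOR groups become orthogonal
```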

Connection to Kernel Learning


LogDet Formulation (1):
  min_Σ  D_ld(Σ, Σ0)
  s.t.  tr(Σ (x_i − x_j)(x_i − x_j)^T) ≤ u
        tr(Σ (x_i − x_j)(x_i − x_j)^T) ≥ ℓ
        Σ ⪰ 0

Kernel Formulation (2):
  min_K  D_ld(K, K0)
  s.t.  tr(K (e_i − e_j)(e_i − e_j)^T) ≤ u
        tr(K (e_i − e_j)(e_i − e_j)^T) ≥ ℓ
        K ⪰ 0

(1) optimizes w.r.t. the d × d Mahalanobis matrix Σ; (2) optimizes w.r.t. the N × N kernel matrix K
Let K0 = X^T Σ0 X, where X is the input data
Theorem: Let Σ* be the optimal solution to (1) and K* the optimal solution to (2). Then K* = X^T Σ* X


Kernelization
Metric learning in kernel space
Assume an input kernel function κ(x, y) = φ(x)^T φ(y)
Want to learn d_Σ(φ(x), φ(y)) = (φ(x) − φ(y))^T Σ (φ(x) − φ(y))
Equivalently: learn a new kernel function of the form κ̃(x, y) = φ(x)^T Σ φ(y)
How to learn this only using κ(x, y)?

The learned kernel can be shown to be of the form
  κ̃(x, y) = κ(x, y) + Σ_i Σ_j σ_ij κ(x, x_i) κ(y, x_j)

Can update the σ_ij parameters while optimizing the kernel formulation

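A sketch of evaluating a kernel of this form on new points (names such as kappa, X_train, and sigma are illustrative; learning the σ_ij coefficients is what the kernel formulation above actually does):

```python
import numpy as np

def learned_kernel(x, y, kappa, X_train, sigma):
    """kappa_tilde(x, y) = kappa(x, y) + sum_ij sigma[i, j] * kappa(x, x_i) * kappa(y, x_j)."""
    kx = np.array([kappa(x, xi) for xi in X_train])
    ky = np.array([kappa(y, xj) for xj in X_train])
    return kappa(x, y) + kx @ sigma @ ky

kappa = lambda a, b: float(a @ b) ** 2        # any base kernel; quadratic kernel as a stand-in
X_train = np.random.randn(5, 2)
sigma = np.zeros((5, 5))                      # sigma = 0 recovers the input kernel exactly
x, y = np.random.randn(2), np.random.randn(2)
print(learned_kernel(x, y, kappa, X_train, sigma), kappa(x, y))
```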


Related Work
Distance Metric Learning [Xing, Ng, Jordan & Russell, 2002]
Large margin nearest neighbor (LMNN) [Weinberger, Blitzer & Saul, 2005]
Collapsing Classes (MCML) [Globerson & Roweis, 2005]
Online Metric Learning (POLA) [Shalev-Shwartz, Singer & Ng, 2004]
Many others!


Experimental Results
Framework:
  k-nearest neighbor (k = 4)
  ℓ and u determined by the 5th and 95th percentiles of the distance distribution
  20c² constraints (c = number of classes), chosen randomly
  2-fold cross validation
Algorithms:
  Information-Theoretic Metric Learning (ITML; offline and online)
  Large-Margin Nearest Neighbors (LMNN) [Weinberger et al.]
  Metric Learning by Collapsing Classes (MCML) [Globerson and Roweis]
  Baseline metrics: Euclidean and inverse covariance


Results: UCI Data Sets

Ran ITML with Σ0 = I (ITML-MaxEnt) and with Σ0 = inverse covariance (ITML-InverseCovariance)
Ran the online algorithm for 10⁵ iterations
[Bar chart: k-NN error on UCI data sets (Wine, Ionosphere, Scale, Iris, Soybean) for ITML-MaxEnt, ITML-Online, Euclidean, Inverse Covariance, MCML, and LMNN]


Application 1: Clarify
Clarify improves error reporting for software that uses black box components
  Motivation: black box components complicate error messaging
  Solution: error diagnosis via machine learning
  Representation: the system collects program features during run-time
    Function counts
    Call-site counts
    Counts of program paths
    Program execution represented as a vector of counts
  Class labels: program execution errors
Nearest-neighbor software support
  Match program executions with others
  The underlying distance measure should reflect this similarity


Results: Clarify

Very high dimensionality; feature selection reduces the number of features to 20


[Bar chart: k-NN error on Clarify data sets (Latex, Mpg321, Foxpro, Iptables) for ITML-MaxEnt, ITML-Inverse Covariance, Euclidean, Inverse Covariance, MCML, and LMNN]


Application 2: Learning Image Similarity


Goal: Learn a metric to compare images
Start with a baseline measure
  Use the pyramid match kernel (PMK) [Grauman and Darrell]
  Compares sets of image features
  Efficient and effective measure of similarity between images

Application of metric learning in kernel space


Other metric learning methods (LMNN, etc.) cannot be applied
Does metric learning work in kernel space?


Caltech 101 Results


Data Set: Caltech 101
  Standard benchmark for multi-category image recognition
  101 classes of images
  Wide variance in pose, etc.; a challenging data set
Experimental Setup
  15 images per class in the training set; rest in the test set (2454 images)
  Performed 1-NN using the original PMK and the learned PMK


Caltech 101 Results


[Line plot: 1-NN accuracy (roughly 0.35–0.55) vs. number of constraint points (700–1500), learned PMK vs. original PMK]

When constraints are drawn from all training data, kNN accuracy is 52%, versus 32% for the original PMK [Jain, Kulis & Grauman, 2007]
The data set is well studied; the best reported performance with 15 training images per class is 60%, but it uses different features (geometric blur)

Metric Learning in High Dimensions


Text analysis & software analysis: feature sets larger than 1,000
Learning a full distance matrix requires over 1 million parameters: overfitting problems, intractable
Solution: learn low-rank Mahalanobis matrices
  The LogDet divergence can be generalized to low-rank matrices
  D_ld(X, Y) is finite ⟺ X and Y have the same range space
Extending ITML to the low-rank case
  If Σ0 is low-rank, then Σ is low-rank
Clustered Mahalanobis matrices
  If Σ0 is a block matrix, then Σ will also be a block matrix


Low-rank Metric Learning: Preliminary Results


Classification accuracy for low-rank ITML
Classic3: text data set, PCA basis
Mpg321, Gcc: software analysis, clustered basis

[Line plots: classification accuracy vs. rank for Classic3, Mpg321, and Gcc, comparing low-rank ITML against LMNN and MCML]


Conclusions
Metric Learning Formulation
  Uses the LogDet divergence
  Information-theoretic interpretation
  Equivalent to a kernel learning problem
  Can improve upon an input metric or kernel
  Generalizes to unseen data points
Algorithm
  Bregman projections result in simple rank-one updates
  Can be kernelized
  Online variant has provable regret bounds
Empirical Evaluation
  Method is competitive with existing techniques
  Scalable to large data sets
  Applied to nearest-neighbor software support & image recognition

References
J. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-Theoretic Metric Learning. International Conference on Machine Learning (ICML), pages 209-216, June 2007.
B. Kulis, M. Sustik, and I. S. Dhillon. Learning Low-Rank Kernel Matrices. International Conference on Machine Learning (ICML), pages 505-512, July 2006 (longer version submitted to JMLR).

