Metric and Kernel Learning
Inderjit S. Dhillon
University of Texas at Austin
Joint work with Jason Davis, Prateek Jain, Brian Kulis and Suvrit Sra
Metric Learning
Goal: learn a distance metric between data points
- Important problem in data mining & machine learning
- Can govern success or failure of a data mining algorithm
Mahalanobis Distances
We restrict ourselves to learning Mahalanobis distances:
- Distance parameterized by a positive definite matrix A:  d_A(x_1, x_2) = (x_1 − x_2)^T A (x_1 − x_2)
- Often A is the inverse of the covariance matrix
- Generalizes the squared Euclidean distance (A = I)
- Rotates and scales the input data
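As a concrete illustration, a minimal NumPy sketch of this distance (function name and data are hypothetical); A = I recovers the squared Euclidean distance, and the inverse sample covariance is a common baseline choice:

```python
import numpy as np

def mahalanobis_sq(x1, x2, A):
    """Squared Mahalanobis distance d_A(x1, x2) = (x1 - x2)^T A (x1 - x2)."""
    diff = x1 - x2
    return float(diff @ A @ diff)

x1 = np.array([1.0, 2.0])
x2 = np.array([0.0, 0.5])

# A = I gives the squared Euclidean distance
print(mahalanobis_sq(x1, x2, np.eye(2)))               # 3.25

# A = inverse sample covariance of some data (a common baseline)
X = np.random.randn(200, 2) * np.array([1.0, 3.0])
A_cov = np.linalg.inv(np.cov(X, rowvar=False))
print(mahalanobis_sq(x1, x2, A_cov))
```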
[Figures: two 2-D example data sets, each annotated with the Mahalanobis matrix A to be learned]
Problem Formulation
Metric Learning Goal:
min_A  loss(A, A_0)
s.t.  (x_i − x_j)^T A (x_i − x_j) ≤ u   if (i, j) ∈ S   [similarity constraints]
      (x_i − x_j)^T A (x_i − x_j) ≥ ℓ   if (i, j) ∈ D   [dissimilarity constraints]
- Learn an SPD matrix A that is close to the baseline SPD matrix A_0
- Other linear constraints on A are possible
- Constraints can arise from various scenarios:
- Unsupervised: click-through feedback
- Semi-supervised: must-link and cannot-link constraints
- Supervised: points in the same class have small distance, etc.
LogDet Divergence
We take loss(A, A_0) to be the Log-Determinant (LogDet) divergence:
D_ld(A, A_0) = trace(A A_0^{-1}) − log det(A A_0^{-1}) − d
Our Goal:
min_A  D_ld(A, A_0)
s.t.  (x_i − x_j)^T A (x_i − x_j) ≤ u   if (i, j) ∈ S   [similarity constraints]
      (x_i − x_j)^T A (x_i − x_j) ≥ ℓ   if (i, j) ∈ D   [dissimilarity constraints]
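A small sketch of the divergence just defined (hypothetical helper name), useful for checking that D_ld(A, A_0) = 0 exactly when A = A_0:

```python
import numpy as np

def logdet_div(A, A0):
    """LogDet divergence: trace(A A0^{-1}) - log det(A A0^{-1}) - d, for SPD A, A0."""
    d = A.shape[0]
    M = A @ np.linalg.inv(A0)
    sign, logabsdet = np.linalg.slogdet(M)
    return float(np.trace(M) - logabsdet - d)

A0 = np.eye(3)
A = np.diag([1.0, 2.0, 0.5])
print(logdet_div(A0, A0))   # 0.0
print(logdet_div(A, A0))    # 0.5 = (1 + 2 + 0.5) - log(1 * 2 * 0.5) - 3
```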
Preview
Salient points of our Approach:
- Metric learning is equivalent to kernel learning
- Generalizes to unseen data points
- Can improve upon an input metric or kernel
- No expensive eigenvector computation or semi-definite programming
Brief Digression
Bregman Divergences
Let φ : S → R be a differentiable, strictly convex function of Legendre type (S ⊆ R^d).
The Bregman divergence D_φ : S × relint(S) → R is defined as
D_φ(x, y) = φ(x) − φ(y) − (x − y)^T ∇φ(y)
Examples:
- φ(z) = ½ z^T z   ⇒   D_φ(x, y) = ½ ‖x − y‖²   (squared Euclidean distance)
- φ(z) = z log z   ⇒   D_φ(x, y) = x log(x/y) − x + y   (relative entropy)
- φ(z) = −log z   ⇒   D_φ(x, y) = x/y − log(x/y) − 1   (Itakura–Saito distance)
Some scalar Bregman divergences:

φ(x)                         D_φ(x, y)
½ x²                         ½ (x − y)²
−(1 − x²)^{1/2}              (1 − xy)(1 − y²)^{−1/2} − (1 − x²)^{1/2}
−x^p  (0 < p < 1)            −x^p + p x y^{p−1} − (p − 1) y^p
|x|^p  (p-norm, p > 1)       |x|^p − p x sgn(y) |y|^{p−1} + (p − 1) |y|^p
e^x  (exponential)           e^x − (x − y + 1) e^y
1/x  (inverse, x > 0)        1/x + x/y² − 2/y
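The table can be checked numerically; a sketch (hypothetical helper names) that evaluates D_φ(x, y) = φ(x) − φ(y) − (x − y) φ'(y) for a few of the generating functions above:

```python
import numpy as np

def bregman(phi, dphi, x, y):
    """Scalar Bregman divergence D_phi(x, y) = phi(x) - phi(y) - (x - y) * phi'(y)."""
    return phi(x) - phi(y) - (x - y) * dphi(y)

x, y = 3.0, 2.0

# Squared norm: phi(z) = z^2 / 2       ->  D = (x - y)^2 / 2
print(bregman(lambda z: z * z / 2, lambda z: z, x, y))                    # 0.5

# Relative entropy: phi(z) = z log z   ->  D = x log(x/y) - x + y
print(bregman(lambda z: z * np.log(z), lambda z: np.log(z) + 1, x, y))

# Itakura-Saito: phi(z) = -log z       ->  D = x/y - log(x/y) - 1
print(bregman(lambda z: -np.log(z), lambda z: -1.0 / z, x, y))
```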
Bregman divergences extend to matrices. For example, if φ(X) = ½ ‖X‖_F², then D_φ(X, Y) = ½ ‖X − Y‖_F².
The LogDet divergence is the matrix divergence generated by φ(X) = −log det X. Writing the eigendecompositions X = V Λ V^T and Y = U Θ U^T:
D_ld(X, Y) = Σ_{i=1}^d Σ_{j=1}^d (v_i^T u_j)² ( λ_i/θ_j − log(λ_i/θ_j) − 1 )
Properties:
- Scale invariance: D_ld(X, Y) = D_ld(αX, αY) for α > 0; in fact, for any invertible M, D_ld(X, Y) = D_ld(M^T X M, M^T Y M)
- Definition can be extended to rank-deficient matrices
- Finiteness: D_ld(X, Y) is finite iff X and Y have the same range space
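A quick numerical check of the invariance properties (a sketch, reusing the same hypothetical logdet_div helper as before):

```python
import numpy as np

def logdet_div(X, Y):
    d = X.shape[0]
    M = X @ np.linalg.inv(Y)
    return float(np.trace(M) - np.linalg.slogdet(M)[1] - d)

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4)); X = B @ B.T + 4 * np.eye(4)   # random SPD matrix
C = rng.standard_normal((4, 4)); Y = C @ C.T + 4 * np.eye(4)   # another SPD matrix
M = rng.standard_normal((4, 4))                                # invertible (almost surely)

print(logdet_div(X, Y))
print(logdet_div(3.0 * X, 3.0 * Y))          # scale invariance: same value
print(logdet_div(M.T @ X @ M, M.T @ Y @ M))  # invariance under congruence: same value
```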
Information-Theoretic Interpretation
Differential relative entropy between two multivariate Gaussians (common mean μ, Mahalanobis matrices A_0 and A):
∫ N(x | μ, A_0) log [ N(x | μ, A_0) / N(x | μ, A) ] dx  =  ½ D_ld(A, A_0)
Thus, the following two problems are equivalent.
Relative Entropy Formulation:
min_A  ∫ N(x | μ, A_0) log [ N(x | μ, A_0) / N(x | μ, A) ] dx
s.t.  tr(A (x_i − x_j)(x_i − x_j)^T) ≤ u   if (i, j) ∈ S
      tr(A (x_i − x_j)(x_i − x_j)^T) ≥ ℓ   if (i, j) ∈ D
LogDet Formulation:
min_A  D_ld(A, A_0)
s.t.  tr(A (x_i − x_j)(x_i − x_j)^T) ≤ u   if (i, j) ∈ S
      tr(A (x_i − x_j)(x_i − x_j)^T) ≥ ℓ   if (i, j) ∈ D
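A numerical check of this identity (a sketch with hypothetical helpers), using the closed-form KL divergence between zero-mean Gaussians whose covariances are the inverses of A_0 and A:

```python
import numpy as np

def logdet_div(A, A0):
    d = A.shape[0]
    M = A @ np.linalg.inv(A0)
    return float(np.trace(M) - np.linalg.slogdet(M)[1] - d)

def gauss_kl(S0, S1):
    """KL( N(0, S0) || N(0, S1) ) for covariance matrices S0, S1."""
    d = S0.shape[0]
    return 0.5 * (np.trace(np.linalg.inv(S1) @ S0) - d
                  + np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S0)[1])

rng = np.random.default_rng(1)
B = rng.standard_normal((3, 3)); A0 = B @ B.T + 3 * np.eye(3)
C = rng.standard_normal((3, 3)); A = C @ C.T + 3 * np.eye(3)

# Gaussians parameterized by Mahalanobis matrices: covariance = A^{-1}
print(gauss_kl(np.linalg.inv(A0), np.linalg.inv(A)))   # relative entropy
print(0.5 * logdet_div(A, A0))                          # equals 0.5 * D_ld(A, A0)
```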
Stein's Loss
- The LogDet divergence is known as Stein's loss in the statistics community
- Stein's loss is the unique scale-invariant loss function for which the uniform minimum variance unbiased estimator is also a minimum risk equivariant estimator
Quasi-Newton Optimization
LogDet Divergence arises in the BFGS and DFP updates
Quasi-Newton methods approximate the Hessian of the function to be minimized
BFGS update:
B_{t+1} = argmin_B  D_ld(B, B_t)
subject to the secant equation  B s_t = y_t,  where  s_t = x_{t+1} − x_t,  y_t = ∇f_{t+1} − ∇f_t
Closed-form solution:
B_{t+1} = B_t − (B_t s_t s_t^T B_t) / (s_t^T B_t s_t) + (y_t y_t^T) / (s_t^T y_t)
Similar form for the DFP update
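A minimal sketch of the closed-form BFGS update above (toy quadratic, hypothetical names); the updated approximation satisfies the secant equation B_{t+1} s_t = y_t by construction:

```python
import numpy as np

def bfgs_update(B, s, y):
    """B_{t+1} = B - (B s s^T B) / (s^T B s) + (y y^T) / (s^T y)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

# Toy quadratic f(x) = 0.5 x^T H x, so the gradient difference is y = H s
H = np.diag([1.0, 2.0, 3.0])
B = np.eye(3)                          # initial Hessian approximation
s = np.array([1.0, -0.5, 0.25])        # step s_t = x_{t+1} - x_t
y = H @ s                              # y_t = grad f(x_{t+1}) - grad f(x_t)

B_next = bfgs_update(B, s, y)
print(np.allclose(B_next @ s, y))      # True: secant equation holds
```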
Returning to the metric learning problem: the algorithm uses Bregman projections, at each step projecting the current solution A_t onto a single constraint, e.g.
min_A  D_ld(A, A_t)
s.t.  (x_i − x_j)^T A (x_i − x_j) ≤ u
Each projection can be computed in closed form as a rank-one update of A_t.
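A sketch of one such projection onto a single violated similarity constraint (hypothetical helper; the full algorithm also handles dissimilarity constraints, slack, and dual variables). The projected matrix has the rank-one-corrected form A_{t+1} = A_t + β A_t z z^T A_t, with β chosen so the constraint holds with equality:

```python
import numpy as np

def project_similarity(A, xi, xj, u):
    """LogDet Bregman projection of SPD A onto {A : (xi - xj)^T A (xi - xj) <= u}."""
    z = xi - xj
    p = float(z @ A @ z)                  # current constraint value
    if p <= u:
        return A                          # already feasible
    beta = (u - p) / (p * p)              # makes the new constraint value equal u
    Az = A @ z
    return A + beta * np.outer(Az, Az)    # rank-one update of A

A = np.eye(2)
xi, xj = np.array([2.0, 0.0]), np.array([0.0, 0.0])
A_new = project_similarity(A, xi, xj, u=1.0)

z = xi - xj
print(z @ A_new @ z)                      # 1.0: constraint now holds with equality
print(np.linalg.eigvalsh(A_new))          # remains positive definite (for u > 0)
```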
Kernel Methods
- Map input data to a higher-dimensional feature space: x → φ(x)
- Idea: run the machine learning algorithm in feature space
- Noisy XOR example: (x_1, x_2) → (x_1², √2 x_1 x_2, x_2²)
- Kernel function: K(x, y) = ⟨φ(x), φ(y)⟩
- Kernel trick: no need to explicitly form high-dimensional features
- Noisy XOR example: ⟨φ(x), φ(y)⟩ = (x^T y)²
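A quick check (hypothetical names) that the explicit XOR feature map reproduces the quadratic kernel, so the features never need to be formed:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the noisy XOR example: [x1^2, sqrt(2) x1 x2, x2^2]."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, -2.0])
y = np.array([0.5, 3.0])

print(phi(x) @ phi(y))    # inner product in feature space: 30.25
print((x @ y) ** 2)       # kernel evaluation (x^T y)^2:     30.25
```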
Kernelization
Metric learning in kernel space
- Assume an input kernel function κ(x, y) = φ(x)^T φ(y)
- Want to learn d_A(φ(x), φ(y)) = (φ(x) − φ(y))^T A (φ(x) − φ(y))
- Equivalently: learn a new kernel function of the form κ̃(x, y) = φ(x)^T A φ(y)
- How to learn this using only κ(x, y)?
Answer: the learned kernel can be written in terms of the input kernel and the constrained points x_1, ..., x_n as
κ̃(x, y) = κ(x, y) + Σ_{i,j} σ_ij κ(x, x_i) κ(y, x_j)
for learned coefficients σ_ij.
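A sketch of evaluating a kernel of this form on arbitrary points using only base-kernel evaluations against the training points (the coefficient matrix sigma here is hypothetical, standing in for whatever the learning procedure produces):

```python
import numpy as np

def learned_kernel(x, y, kappa, X_train, sigma):
    """kappa_new(x, y) = kappa(x, y) + sum_ij sigma[i, j] * kappa(x, x_i) * kappa(y, x_j)."""
    kx = np.array([kappa(x, xi) for xi in X_train])
    ky = np.array([kappa(y, xj) for xj in X_train])
    return kappa(x, y) + kx @ sigma @ ky

kappa = lambda a, b: (a @ b) ** 2        # base kernel (e.g. quadratic)
X_train = np.random.randn(5, 3)          # training points x_1, ..., x_5
sigma = 0.1 * np.eye(5)                  # hypothetical learned coefficients

x, y = np.random.randn(3), np.random.randn(3)
print(learned_kernel(x, y, kappa, X_train, sigma))
```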
Related Work
- Distance Metric Learning [Xing, Ng, Jordan & Russell, 2002]
- Large-margin nearest neighbor (LMNN) [Weinberger, Blitzer & Saul, 2005]
- Collapsing Classes (MCML) [Globerson & Roweis, 2005]
- Online Metric Learning (POLA) [Shalev-Shwartz, Singer & Ng, 2004]
- Many others!
Experimental Results
Framework:
- k-nearest neighbor (k = 4)
- ℓ and u determined by the 5th and 95th percentiles of the distance distribution
- 20c² constraints, chosen randomly (c = number of classes)
- 2-fold cross validation
Algorithms:
- Information-Theoretic Metric Learning (offline and online)
- Large-Margin Nearest Neighbors (LMNN) [Weinberger et al.]
- Metric Learning by Collapsing Classes (MCML) [Globerson and Roweis]
- Baseline metrics: Euclidean and inverse covariance
- Ran ITML with A_0 = I (ITML-MaxEnt) and with A_0 equal to the inverse covariance (Inverse Covariance)
- Ran the online algorithm for 10^5 iterations
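A sketch of how such constraints might be drawn from labeled data (hypothetical helper names; thresholds here are estimated from the sampled pairs rather than all pairs):

```python
import numpy as np

def sample_constraints(X, labels, num_constraints, rng):
    """Sample random (i, j, 'sim'/'dis') constraints and pick thresholds u, ell
    as the 5th / 95th percentiles of the observed squared distances."""
    n = X.shape[0]
    pairs = [tuple(rng.choice(n, size=2, replace=False)) for _ in range(num_constraints)]
    dists = [float(np.sum((X[i] - X[j]) ** 2)) for i, j in pairs]
    u, ell = np.percentile(dists, 5), np.percentile(dists, 95)
    cons = [(i, j, 'sim' if labels[i] == labels[j] else 'dis') for i, j in pairs]
    return cons, u, ell

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 4))
labels = rng.integers(0, 3, size=60)                   # c = 3 classes
cons, u, ell = sample_constraints(X, labels, 20 * 3**2, rng)
print(len(cons), u, ell)
```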
[Figure: kNN error of the compared methods on the UCI data sets Wine, Ionosphere, Scale, Iris, and Soybean]
Application 1: Clarify
- Clarify improves error reporting for software that uses black-box components
- Motivation: black-box components complicate error messaging
- Solution: error diagnosis via machine learning
- Representation: the system collects program features at run time
- Features: function counts, call-site counts, counts of program paths
- A program execution is represented as a vector of counts
Results: Clarify
[Figure: error of the compared methods on the Clarify data sets Latex, Mpg321, Foxpro, and Iptables]
[Figure: kNN accuracy (roughly 0.35–0.55) as a function of a quantity ranging from 700 to 1500]
Caltech-101 image recognition with the pyramid match kernel (PMK):
- When constraints are drawn from all training data, kNN accuracy is 52%, versus 32% for the original PMK ([Jain, Kulis & Grauman, 2007])
- The data set is well studied; the best reported performance with 15 training images per class is about 60%, using different features (geometric blur)
[Figure: accuracy versus rank of the learned low-rank kernel on the Classic3, Mpg321, and Gcc data sets, comparing LowRank, LMNN, MCML, and ITML]
Conclusions
Metric learning formulation:
- Uses the LogDet divergence
- Information-theoretic interpretation
- Equivalent to a kernel learning problem
- Can improve upon an input metric or kernel
- Generalizes to unseen data points
Algorithm:
- Bregman projections result in simple rank-one updates
- Can be kernelized
- Online variant has provable regret bounds
Empirical evaluation:
- Method is competitive with existing techniques
- Scalable to large data sets
- Applied to nearest-neighbor software support & image recognition
References
J. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-Theoretic Metric Learning. International Conference on Machine Learning (ICML), pages 209–216, June 2007.
B. Kulis, M. Sustik, and I. S. Dhillon. Learning Low-Rank Kernel Matrices. International Conference on Machine Learning (ICML), pages 505–512, July 2006 (longer version submitted to JMLR).