You are on page 1of 57

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 37, NO.

4, AUGUST 2007 847

Choosing Parameters of Kernel Subspace LDA for


Recognition of Face Images Under Pose and
Illumination Variations
Jian Huang, Pong C. Yuen, Member, IEEE, Wen-Sheng Chen, and Jian Huang Lai

Abstract—This paper addresses the problem of automatically Index Terms—Gaussian radial basis function (RBF) kernel, gen-
tuning multiple kernel parameters for the kernel-based linear eralization capability, kernel Fisher discriminant (KFD), kernel
discriminant analysis (LDA) method. The kernel approach has parameter, model selection.
been proposed to solve face recognition problems under complex
distribution by mapping the input space to a high-dimensional
feature space. Some recognition algorithms such as the kernel I. I NTRODUCTION
principal components analysis, kernel Fisher discriminant, gen-
ACE recognition research started in the late 1970s and has
eralized discriminant analysis, and kernel direct LDA have been
developed in the last five years. The experimental results show that
the kernel-based method is a good and feasible approach to tackle
F become one of the most active and exciting research areas
in computer science and information technology since 1990
the pose and illumination variations. One of the crucial factors [1], [2]. Many face recognition algorithms/systems have been
in the kernel approach is the selection of kernel parameters, developed during the last decade. Among various approaches,
which highly affects the generalization capability and stability of the appearance-based method, in general, gives a promis-
the kernel-based learning methods. In view of this, we propose
an eigenvalue-stability-bounded margin maximization (ESBMM)
ing result. Principal components analysis (PCA) and linear
algorithm to automatically tune the multiple parameters of the discriminant analysis (LDA) are the two most popular methods
Gaussian radial basis function kernel for the kernel subspace LDA in appearance-based approaches for face recognition. Their su-
(KSLDA) method, which is developed based on our previously de- perior performances have been reported in the literature during
veloped subspace LDA method. The ESBMM algorithm improves the last decade [3]–[6]. Moreover, in general, LDA-based meth-
the generalization capability of the kernel-based LDA method ods perform better than PCA-based methods. This is because
by maximizing the margin maximization criterion while main-
LDA-based methods aim to find projections with most discrim-
taining the eigenvalue stability of the kernel-based LDA method.
An in-depth investigation on the generalization performance on inant information, whereas PCA-based methods find projec-
pose and illumination dimensions is performed using the YaleB tions with minimal reconstruction errors. The first well-known
and CMU PIE databases. The FERET database is also used for LDA-based face recognition algorithm, called FisherFace, was
benchmark evaluation. Compared with the existing PCA-based developed in 1997. After that, a number of LDA-based face
and LDA-based methods, our proposed KSLDA method, with the recognition algorithms/systems have been developed. Gener-
ESBMM kernel parameter estimation algorithm, gives superior ally speaking, the LDA-based face recognition algorithm gives
performance.
a satisfactory result under controlled lighting conditions and
small face image variations such as facial expressions and small
occlusion. However, the performance is not satisfactory under
Manuscript received May 22, 2006; revised October 17, 2006. This project
was supported in part by Earmarked Research Grant HKBU-2113/06E of the large variations such as pose and illumination variations. To
Research Grants Council, by the Science Faculty Research grant of Hong solve such complicated image variations, a kernel trick has been
Kong Baptist University, by the Sun Yat-Sen University Science Foundation, by employed. The basic idea is to nonlinearly map the input data
the National Science Foundation of Guangdong under Contract 06105776, by
the National Science Foundation of China (NSFC) under Contract 60373082,
from the input space to a higher dimensional feature space
and by the 973 Program under Contract 2006CB303104. This paper was and then perform LDA in the feature space. By performing
recommended by Associate Editor J. Su. this nonlinear mapping, we hope that the complex distribution
J. Huang is with the Department of Computer Science, Hong Kong Baptist becomes linearly separable in the feature space.
University, Kowloon, Hong Kong, and also with the Department of Com-
puter Science, Guangdong Province Key Laboratory of Information Security, Following the success of applying the kernel trick in support
School of Information Science and Technology, Sun Yat-Sen (Zhongshan) vector machines (SVMs), many kernel-based PCA and LDA
University, Guangzhou 510275, China (e-mail: jhuang@comp.hkbu.edu.hk; methods have been developed and applied in pattern recogni-
hjian@mail.sysu.edu.cn). tion tasks. In 1996, a nonlinear form of PCA, namely, kernel
P. C. Yuen is with the Department of Computer Science, Hong Kong Baptist
University, Kowloon, Hong Kong (e-mail: pcyuen@comp.hkbu.edu.hk). PCA (KPCA) is proposed by Schölkopf et al. [7]. In 1999,
W.-S. Chen is with Institute of Intelligent Computing Science, College Mika et al. proposed the kernel Fisher discriminant (KFD) [8]
of Mathematics and Computational Science, Shenzhen University, Shenzhen method by introducing the kernel trick on the Fisher discrim-
518060, China (e-mail: chenws@szu.edu.cn).
J. H. Lai is with the Department of Electronics and Communication
inant. It provides a classification framework for a two-class
Engineering, School of Information Science and Technology, Sun Yat-Sen problem. In 2000, Baudat and Anouar [9] proposed the
(Zhongshan) University, Guangzhou 510275, China (e-mail: stsljh@mail. generalized discriminant analysis (GDA) method by extending
sysu.edu.cn). the KFD method to multiple classes. Their method assumes
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org. that the kernel matrix K is nonsingular with the applications
Digital Object Identifier 10.1109/TSMCB.2007.895328 on low-dimensional patterns. Experimental results on Iris and
1083-4419/$25.00 © 2007 IEEE
848 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 37, NO. 4, AUGUST 2007

1 Σ Σi .xi − mΣ .xi − mΣT


C N
Seed data show that the GDA outperforms linear techniques.
Motivated by these success, kernel-based LDA methods have St = . (3)
j j
also been applied in face recognition in order to solve the pose N
i=1 j=1
and illumination problems. In 2002, Yang applied KFD in face
recognition and compared its performance with KPCA, SVM, The S3 problem occurs when the dimensionality of the
and other linear techniques [10]. Experimental results demon- sample is larger than the number of training samples and the
strate that kernel-based methods achieve lower error rates. within-class scatter matrix becomes singular. To solve the S3
Lu et al. proposed the kernel direct LDA (KDDA) method [11], problem, a typical method is to make use of PCA for dimension
which combined the idea of GDA and direct LDA (D-LDA). reduction before applying LDA. The classical example is the
In 2004, Zheng et al. proposed the modified GDA method by Fisherfaces [5]. A potential problem of the Fisherfaces is that
removing the degeneracy of the GDA method [12]. the PCA step may discard important discriminative information
This brief review shows that the kernel-based LDA method by removing the null-space. In view of this limitation, the
is a good and feasible approach to solve the illumination and subspace approach is proposed to solve the S3 problem, such
pose problems in the face recognition technology. However, as [14] and [15]. The null-space method [13], [16]–[27] is a
the kernel-based LDA method suffers from two limitations. typical method in the subspace approach. The basic idea of the
First, the performance of the kernel-based LDA method is null-space approach is to reduce the dimension by keeping the
sensitive to the selection of a kernel function and its pa- null-space of the within-class matrix and/or range (i.e., com-
rameters, which dramatically affects the generalization per- pliment of the null-space) of the between-class scatter matrix
formance of kernel-based LDA methods. Second, like the such that the Fisher Index is the maximum. D-LDA [16] first
LDA-based method, the kernel-based LDA method also suffers diagonalizes the between-class scatter matrix Sb and discards
from the small sample size (S3) problem, which occurs when the null-space of Sb, whereas the method of Chen et al. [17]
the dimensionality of the sample is larger than the number first diagonalizes the within-class scatter matrix Sw and keeps
of training samples. In turn, the within-class scatter matrix the null-space of S w. When Sw is large, which is normally the
becomes singular. case in face recognition, the method of Chen et al. will suffer
In view of these limitations in existing kernel-based from the complexity problem. The D-LDA method is compu-
LDA methods, we propose an eigenvalue-stability-bounded tationally efficient but suffers from the performance limitation.
margin maximization (ESBMM) algorithm to automatically This is because discarding the null-space of Sb would indirectly
tune the multiple parameters of the Gaussian radial basis lead to Sw losing its null-space. Huang et al. [19] proposed
function (RBF) kernel for the kernel subspace LDA (KSLDA) to replace Sw with a total-class scatter matrix St and remove
method, which is developed based on our previously developed the null-space of S t. Huang et al. proposed a subspace LDA
subspace LDA [13] method. method [13] that keeps the advantages of both the method of
The rest of this paper is organized as follows. We will give Chen et al. and the D-LDA method while at the same time
a brief review on the S3 problem and existing methods in overcomes their limitations. Liu and Wechsler [28] proposed
determining the kernel parameters in Section II. In Section III, the enhanced Fisher linear discrimination model to improve
our proposed ESBMM algorithm and the KSLDA method are the generalization performance by preserving a proper balance
presented. The experimental results are discussed in Section IV. between the need that the selected eigenvalues account for most
Finally, conclusions of this paper are given in Section V. of the spectral energy of the raw data and the requirement that
the eigenvalues of the within-class scatter matrix are not very
small. Wilkes et al. [26] proposed a discriminative analysis
II. R EVIEW ON E XISTING M ETHODS
in which the training sample xi is projected onto the null-
This section is divided into two parts. The first part briefly de- space of Sw(= AAT ) to get the common vector by eigen-
scribes the S3 problem in LDA and existing methods for solving analysis of the smaller matrix AT A. Then, the null-space of
the S3 problem. A review on existing methods for automatically Scom = is discarded by using the smaller matrix
AcomAT co
m
tuning a kernel parameter is given in the second part. ATco Acom. Another simple approach, the RDA method [15],
m
solves the S3 problem by slightly modifying Sw to Sw + εI,
where ε is a very small positive number such that Sw + εI is
A. S3 Problem strictly positive definite.

Considering a C-class problem, the jth class contains Nj


samples. Let N be the total number of training samples, the B. Review on Choosing Kernel Parameters
jth sample of class i be xij, mi be the mean of the ith class, and
m be the mean of all samples. Sw, Sb, and St can be formulated Research on automatically tuning multiple parameters for
the kernel-based learning method on SVM started in 1999.
as follows:
However, little work can be found on the kernel-based LDA
C N T method for face on
recognition. It is important to point out that
1 ΣΣ
i
. Σ. Σ the formulations SVM and the kernel-based LDA method are
Sw = xij − mi xij − mi (1)
N i=1
i=1 j=1

C
Sb = 1 N (m m)(m m)T (2)
N Σ
− −i
i i
848 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 37, NO. 4, AUGUST 2007

different
because of the
different
objective
functions.
Therefore,
directly
applying the
methods in SVM
to the kernel-
based LDA
method is not
feasible.
In SVM, the
typical
approach for
tuning a kernel
para- meter is
to use the
leave-one-out
(LOO)
estimator or
the
HUANG et al.: CHOOSING PARAMETERS OF KSLDA FOR RECOGNITION OF FACE IMAGES 849

cross-validation estimator. However, this is time consuming. and applying the subspace LDA method in the feature space.
Currently, a more computationally efficient strategy is to con- Details of each step are discussed in the following.
struct an upper bound or approximate the true generalization 1) Nonlinear Mapping From the Input Space to the Feature
d
error without an LOO procedure. Space: Let φ : R→ Fdf ,x →φ(x), be a nonlinear mapping
In 1999, Jaakkola and Haussler present a cross-validation from the input space R to a high-dimensional feature space
theoretical bound [29]. No experimental results are provided in F. Considering a C-class problem in the feature space, the set
their article. In 2002, Chapelle et al. [30] employed a gradient- of all classes is represented by C = {C1, C 2 , . .. , CC }, and
descent-based method to minimize the LOO estimation of the the jth class Cj contains Nj samples, j = 1, 2, . . . , C. Let N
generalization error of SVM over the parameters. Each opti- C
be
Σ
mized parameter is selected individually. Their results show the total number of training samples, i.e., N = j=1 N j . Let
that an accurate estimate of the error is not required, and a sim- ix
j be the jth sample in class i, and mi and m be the mean
ple estimate performs well. Thus, the LOO or cross-validation of the ith class and all samples, respectively. The within-class
procedure can be avoided. Mika [31] presented two techniques: scatter matrix S w, between-class scatter matrix Sb, and total-
stability bounds and algorithmic luckiness. He mentioned that class scatter matrix St can be formulated in the feature space F
these could be used as a basis for deriving the generalization as follows:
error for the kernel-based LDA method, but no details are given
Sw = 1 Σ Σ .φ .xij Σ − m Σ .φ .xij Σ − m Σ T
C Ni
in his Ph.D. dissertation. Schittkowski [32] optimized the kernel
N i i
parameters for SVM using a two-level approach. They split the i=1 j=1
training data into two sets: one set for formulating the SVM Σ
C Σ
Ni . . . Σ Σ Σ. 1 . . Σ Σ ΣT
and the other one for minimizing the generalization error. Lee = √1 φ xi − mi √ φ xi − mi
j
j
and Lin [33] studied the relation between the LOO rate and the i=1 j=1
N N
T
stopping criteria for SVM from the numerical analysis point of =Φ Φ (4)
view and proposed loose stopping criteria for the decomposi- w w

tion method in SVM. In 2003, Keerthi and Lin [34] analyzed the
behavior of SVM when the kernel parameter takes a very small where Φw is a df × N matrix and is defined as
or a very large value. They developed an efficient heuristic
Φ = ΣΦ̃i,j

method of searching for the kernel parameter with small gen- i=1,...,C, j=1,...,Ni
eralization errors. Zhang et al. [35] recently proposed to deter-
w Σ
mine the kernel parameters by optimizing an objective function = Φ̃1,1
w
, Φ̃w1,2 , . . . , Φ̃w1,N1 , Φ̃w2,1 , Φ̃w2,2 , . . . , Φ̃
w
2,N2
,...,
for a kernel minimum distance classifier. Moreover, some re-
search articles also discussed and proposed methods for mea- Σ (5)
suring the similarity between two kernel functions [36], [37]. Φ̃w
C,1 , Φ̃w
C,2 , . . . , Φ̃w
C,NC
df ×N

with Φ̃i,j = (1/ N )(φ(xi ) − mi ) being a df × 1 column vec-
w j
III. O UR P ROPOSED M ETHOD tor. Then, we have
C
This section consists of two parts. The first part introduces 1 Σ
a newly developed kernel-based LDA method, i.e., the KSLDA Sb = Ni(mi − m)(mi − m)T
N
method. The KSLDA method is derived by applying the kernel i=1
trick on our previously developed subspace LDA method [13]. C T
.. Ni
The ESBMM algorithm is proposed in the second part. The =Σ (m Σ .. Σ
i i
ESBMM algorithm automatically chooses multiple parameters N N
Ni
of the Gaussian RBF kernel function for the KSLDA method.
i= − m) (m − m)
1 (6)
With the use of the ESBMM algorithm, the proposed KSLDA
face recognition method could further improve the generaliza- = ΦbΦbT
tion capability. We would like to point out that the ESBMM where Φb is a df × C matrix and is defined as
Σ Σ
algorithm is generic and, with minor modifications, can be
Φb = Φ̃i
applied to all kernel-based LDA algorithms. Details of each part b
i=1,...,C
are discussed in the following. Σ
= Φ̃b , Φ̃b2 , . . . , Φ̃bC
1
(7)
Σ df ×C
A. KSLDA Algorithm

˜
with Φbi = (√ Ni/N )1/2(mi m) being a df 1 column vec-
The KSLDA method is developed based on the subspace tor. Finally, we have − ×
LDA method [13], which solves the S3 problem and has been
C N
St = 1 Σ Σ . . ji Σ Σ. . Σ T Σ
demonstrated to give a better performance than existing LDA- i
based methods such as FisherFace [5], direct LDA [6], and
N φ x − m φ j xi − m
Chen et al. LDA [17]. The earlier version of the KSLDA method i=1 j=1
Σ
C ΣNi . . . Σ ΣΣ. 1. . Σ ΣΣT
√1 φ xi − m
has been reported in [38]. The basic idea of our proposed
= √ φ xi − m
KSLDA method is to apply the subspace LDA method in the j
HUANG et al.: CHOOSING PARAMETERS OF KSLDA FOR RECOGNITION OF FACE IMAGES 849

j
kernel feature space, and it consists of two steps, namely, i=1 j=1
N N
nonlinear mapping from the input space to the feature space = ΦtΦtT (8)
850 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 37, NO. 4, AUGUST 2007

where Φt is a df × N matrix and is defined as matrix, and ΛNi is an Ni × Ni matrix with all terms equal to
Σ Σ 1/Ni. [See Appendix III for the derivation of (12).]
Φt = Φ̃i,j By singular value decomposition, there exist orthonormal
Σ t i=1,...,C, matrices U, V ∈ RN×N and a diagonal matrix Λ = diag(σ1,
1,1 1,2 1,N 2,1 2,2 2,N2
= Φ̃ t ,j=1,...,N
Φ̃t , . . i. , Φ̃t 1 , Φ̃t , Φ̃t , . . . , Φ̃
t . . . , σr, 0 ,. . ., 0) ∈ RN×N (σ1 ≥ σ2 ≥ · · · ≥ σr > 0) such
that Swt = U ΛV T .
,...,
Σ
Φ̃C,1 , Φ̃ C,2
, . . . , Φ̃ C,NC (9) Since Sw = Φ wΦ wT , by Theorem 1, we get
t t t
√ df ×N . 2 2 Σ
i,j
with Φ̃ t = (1/ N )(φ(xij) − m) being a df × 1 column (ΦtV )T Sw(Φ tV ) = diag σ 1,..., σ ,r0 , . . . , 0 .
N ×N
vector. Here, we also define
Let V j = [vr+1, vr+2,..., vN ]. Then, the sub-null-space of
1
Ni
1
C Ni Sw, Y , is given by Y = ΦtV j. It can be seen that Y satisfies
mi = Σ . ij Σ i
Y T Sw Y = 0(N −r)×(N −r).
Ni φ x N Σ Σ . j
Σ
j= i=1 j=1 φx. b) Discarding the null-space of Sb: After determining the
1 m=
For classes C and C , an N × N dot-product matrix Ki is sub-null-space of Sw, the projection is then determined outside
i l i l l the null-space of Sb. Thus, the second step is to discard the null-
defined as space of Sb to ensure that the numerator of the Fisher Index will
not be zero.
. Σ Define Sˆb = Y T SbY , and then Sˆb = ( TYΦ )(b T Φ T
Kli = Kijlk k=1,...,N
(10) Y ) .b Let
Z = Y T Φb = (V j)T ΦTt Φb, which is a matrix of (N − r) × C
l
j=1,...,N i

where K ijlk= k(xi j, xlk) = φ(xij ) · φ(x dimensions. Utilizing the kernel function, matrix Z can be
k ). Then, for all C
l
expressed in terms of the kernel matrix K as follows:
classes, we can define an N × N kernel matrix K as
. Σ
K = K ij lk . (11) Z = (V j)T · (K · Λ NC − K · I NC
i=1,...,C,j=1,...,Ni

− 1NN · K · ∆NC + 1NN · K · ΓNC ) (13)


l=1,...,C,k=1,...,Nk

2) Applying the Subspace LDA Algorithm on the Feature


Space: After the mapping, all feature vectors are transformed where Λ NC = diag[ΛN1 , ΛN2 , . . . , ΛNc ] is an N× C block
from a lower dimensional input space into a higher dimensional diagonal matrix, and ΛNi is an Ni × 1 matrix with all terms
equal to 1/N√Ni. INC is an N C×matrix with each el-
feature space. If the input feature dimension is smaller than the √ 2
number of training samples, the GDA can be applied. However, ement of the ith column equal to Ni/N , and 1NN is an
in face recognition, very often, we do not have sufficient NN × matrix with all terms equal to one. ∆ NC = diag[∆ N1 ,
training samples such that the S3 problem occurs. To solve the ∆N2 , . . . , ∆Nc ] is an N ×C block diagonal matrix, and ∆Ni is
2 √
S3 ill-posed problem, the subspace LDA [13] method is applied an Ni 1 × matrix with all terms equal to 1/N Ni. Γ NC is an
on the feature space. The formulation consists of three steps, N C × matrix with each element of the ith column equal to
namely, determining the sub-null-space of Sw, discarding the 3
(Ni/N )1/2. [See Appendix IV for the derivation of (13).]
null-space of Sb, and feature extraction for testing data. Before If the norm of one row in matrix Z is very small (for example,
discussing the method in details, we need the following two less than 1e −6), then discard this row in matrix Z. Denote
theorems. the number of the discarded row as rd; rj = r + rd. Accord-
Theorem 1: Assume matrices Φ, Ψ ∈ Rd×n(d ≥ n), S1 =
ingly, discard the corresponding column in (matrix V j. rRewrite
ΦT Ψ ∈ Rn×n, and S = ΦΦT ∈ Rd×d. Then, there exist a j as Zj ∈ R N−r )×(N −r ) and
r
thejj modified( matrix Z and V
V ∈ RN× N−r ). By singular value decomposition, there exist
r
matrix d×n 2 Λ = diag(
Q ∈R and a diagonal matrix σ1,..., orthonormal matrices U b ∈ R(N −r )×(N r
−r ) and V ∈ RC×C
r

σrT, 0,... , 0) 2∈ Rn×n, σ1 ≥ σ2 ≥ · · · ≥ σr > 0, such that such that Z j = U Λ V T , where Λ = . b Σ


Σ b
and Σ =
Q S2Q = Λ . (The proof is given in Appendix I.) bbb b 0 (N −r r )×C b
Theorem 2: Assume S = ΦΦT , Φ ∈ Rd×n(d ≥ n), and diag(τ 1 ,..., τm, 0 , . . . , 0)C×C , τ1 ≥ τ2 ≥ · · · ≥ τm > 0. Re-
(
write Ub = (u 1 ,..., um, u m + 1 ,..., uN−rr ) ∈ R N−r )×(N −r
r r
rank(Φ) = r. Then, the singular value decomposition Φ =
.Σ Σ ),
U 0 d×n V T , where U = (u 1 ,..., ur, u r + 1 ,..., ud) ∈ Rd×d where ui is the ith column of the orthonormal matrix Ub.
n×n Denote A = (u , . . . , u ) .
and V ∈ R
r
are orthonormal matrices, Σ = diag(σ 1 ,..., 1 m (N −r )×m
T ˆ
σr , 0,... , 0) ∈T Rn×n, σ1 ≥ 2σ2 ≥ By Theorem 2, we have A SbA = D2m, and then
2 σr > 0, U1 = (u1,
2 ··· ≥
..., ur), and U SU1 = diag(σ , σ ,..., σ ). (The proof is
1 1 2 r 2
given in Appendix II.) (Y A)T Sb(Y A) = D m.
a) Determining the sub-null-space of Sw: The objective
of this step is to determine the sub-null-space of S w. Let Swt = Let
T Φ R N×N . Utilizing the kernel function, we express ma-
Φw t∈
W = YA = ΦtV jjA. (14)
trix Swt in terms of the kernel matrix K as follows:
1 where INN is an N × N matrix with all terms equal to 1/N ,
Swt = (K − K ·IN N − ΛN N ·K + ΛN N ·K ·IN N ) (12)
N Λ NN = diag[ΛN1 , ΛN2 , . . . , ΛNc ] is an N × N block diagonal
850 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 37, NO. 4, AUGUST 2007

Then, W is a df × m matrix, which yields WT SwW = 0m×m


2
m and WT SbW = D .
Thereby, W is the LDA projection matrix in which the
Fisher criterion function J (W ) = tr(W T SbW )/tr(WT S wW )
reaches the maximum.
HUANG et al.: CHOOSING PARAMETERS OF KSLDA FOR RECOGNITION OF FACE IMAGES 851

B. ESBMM Algorithm a function. If there exist absolute constants Ci,i = 1,...,M ,


such that
Basically, any kernel functions can be used in the KSLDA
method. In this paper, we adopt the popular Gaussian RBF sup |f (X 1 ,..., XM ) − f (X 1 ,..., Xi−1, Xij ,
kernel, which is defined as X1 ,...,XM .Xir ∈Z

2 X i + 1 ,..., XM ) |≤ ci, 1 ≤ i≤ M (16)


. Σ
Σ (xi − yi ) wherein we call f stable.
k(x, y) = exp − i
2 2θ
i
. (15)
To analyze the eigenvalue stability, we need the following
It is because, in general, the Gaussian RBF kernel function two theorems.
gives a better performance than other types of kernels. The Theorem 3: LetX = { x 1 ,. .., x}M⊂ R N be our training
Gaussian RBF kernel function consists of scale factors θi, sample, which is a training set data. We define set operation
which are the kernel parameters that we are going to estimate. X ∪ x to be the addition of x to X and X \ x to be the deletion
The number of parameters can be equal to the dimension of of x from X . Then, we have
the input data. Existing kernel-based LDA methods consider M −1 Σ
Cov X ∪x = Cov X + M (m − x)(mX Σ
θ = θ1 = θ2 = ··· = θn and manually select an appropriate M
value for θ. M 2 −1 X − x)T
In estimating the kernel parameters, they can neither be very M (17)
large (e.g., θ → ∞) nor very small (e.g., θ → 0), as both situ- M −1 )T
Cov = Cov + ( )(
Σ
ations will give a poor generalization performance on kernel- X \xi Σ X mX −xi mX −xi
based methods. The determination of the kernel parameter is (M −1) 2 .
required and is crucial to the generalization performance of M −2 (18)
kernel-based methods. The straightforward strategy is to search
the whole parameter space and find the best value(s). However, Exchanging one element results in an update of the covariance
this procedure is very time consuming and intractable when matrix, i.e.,
the number of parameters exceeds two [30]. Here, we propose M
the ESBMM algorithm to automatically tune multiple kernel Cov(X \x )∪x = CovX + (mX − xi)(mX − xi)T
( 1) 2
M−
i
parameters for the proposed KSLDA method. Instead of min-
1
imizing the estimation of the generalization error, we propose + (m − x)(m − x)T . (19)
X \xi X \xi
to maximize an objective function that explicitly depends on M
kernel parameters while maintaining the eigenvalue stability
(Please refer to [31] for the proof.)
of the KSLDA method. The objective function adopted here Σ
is derived from the maximum margin criterion [39] and is Theorem 4: Suppose B = A + kj=1τjcjcTj , where A ∈
given by F (W ) = tr(W T SbW−W T S wW ), where W is the N ×N is symmetric, c N has unit l -norm, and τ .
R j∈ R 2 j ∈R
projection matrix determined by the LDA algorithm. Then, there exist mij ≥ 0, i = 1 , . . . , N,j = 1 , . . . , k, with
Σ
i mij = 1 for all j, such that
The ESBMM algorithm is divided into two parts, namely,
eigenvalue stability analysis and objective function maximiza-
k
tion. Details are discussed in the following. Σ
1) Eigenvalue Stability Analysis: It is known that high λi ( B ) = λ i ( A ) + mijτj ∀i = 1 , . . . , N. (20)
stability is a desirable property for a learning algorithm [40]. j=1
Rogers and Wagner [41] showed that the variance of the
LOO error can be upper bounded by hypothesis stability. The (Please refer to [31] for the proof.)
stability measures the approximation of the learned function Now, we have all necessary background to state the eigen-
when a training sample is replaced by another sample outside value stability of Sb and Sw. We have then developed a new
the training set. In 2002, Bousquet and Elisseeff [40] derived theorem as follows.
exponential upper bounds on the generalization error based on Theorem 5: Assume thatX = { x 1 , .. ., x} N⊂ RN is gen-
notions of stability and applied them to SVM for regression and erated according to some unknown but fixed distribution P and
classification. To design a robust system that is not sensitive that ≥x − E(X) ≥ ≤R < ∞for all x ∼P . Let Sw and Sb be
the within-class scatter matrix and between-class scatter matrix
to the noise, they use sensitivity analysis to determine how from X , and let Sw(X \xp0 )∪xp1 and Sb(X \xp0 )∪xp1 be the within-
much the variations of the input will influence the output of
class scatter matrix and between-class scatter matrix estimated
a system and derived the generalization error bounds based on
from Xwith the training sample xp0 in class p replaced by xp1
empirical and LOO errors. Therefore, the stability is employed
in class p. Then, for any 1 , . . . , N and i = 1 , . . . , N ,
to measure the generalization performance of the algorithm. we have
Here, to measure the stability of the KSLDA algorithm, we Σ.
propose
between to
theanalyze the eigenvalue
between-class scatter stability
matrix Son the difference Σ
N − λ j(Sb −Sw )
.

. Σ
and the within-
I⊂{ }
HUANG et al.: CHOOSING PARAMETERS OF KSLDA FOR RECOGNITION OF FACE IMAGES 851

b λj S −Sw(X\xp
0 )∪xp1
Σ
class scatter matrix Sw. .j=1 b(X\xp0 )∪xp1
.
The stability is defined as follows [31].
Definition (stability). Let X 1,...,X M be independent ran- 8nR
≤ 2 . (21)
dom variables taking values in a set Z, and let f : Z M → R be (C − 1)(n − 1)
852 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 37, NO. 4, AUGUST 2007

The proof is given in Appendix V. From Theorem 5, we can of F with respect to kernel parameters θ. In the following, we
2
see that the term 8R /(C− 1)(n −1) measures the change of will give the formulation for gradient computing.
the eigenvalue after replacing one training sample by another To compute the gradient of F with respect to θ, we partial
2
outside the training set. If the term 8R /(C − 1)(n −1) is differentiate the two matrices, P and Q, with respect to θ. In
small, it means that under distribution P , the eigenvalues of order to clearly show the formulation, we first formulate a gen-
the covariance matrix are less sensitive to the training data eral term (∂/∂θ s )K(X i ,X l ). Let Nθ be the number of kernel
j
set. In other words, we can say that the kernel-based learning parameters to be used and Nkθ be the feature dimension.
algorithm is less sensitive on the training data set. In turn, the First, we denote the partial ≤ differentiation of the kernel
algorithm is generalized well from the training data set to the matrix K with respect to θs as ∂Kθs , which is an N× N matrix
testing data set. From the earlier discussion, we know that R, expressed as follows:
which is the minimum radius of the hypersphere containing all
the training data, is crucial to the generalization capability of Σ i,j Σ

the KSLDA algorithm. In maximizing the objective function, the ∂Kθs = ∂Kl,k . (26)
separability between different classes will increase. In turn, the ∂θs i=1,...,C,j=1,...,Ni
l=1,...,C,k=1,...,Nk
radius R will also increase. This means that the eigenvalue sta-
bility and the generalization capability of the learning algorithm Each element in matrix ∂Kθs is the differentiation of the
decrease. Therefore, the objective of the ESBMM algorithm is corresponding element in the kernel matrix K with respect to
to choose proper kernel parameters that maximize the objective θs and is formulated as
function while guaranteeing a good 0
generalization capability by
not exceeding the threshold of R . i,j i l
2) Objective Function Maximization: We propose to adopt . Σ
∂Kl,k ∂K Xj, Xk
the maximum margin criterion [39] as the objective function ∂θs ≡ ∂θs
as follows: Nu . Σ2
i,u l,u
Nθ Σ xj,t − xk,t
F (W, θ) = tr(W T S W − W T S W ) ∂ t=
(22) Σ 1
b w = ∂θ exp − 2θ2
u
s u=1
where W is the projection matrix, and θ = (θ 1 ,..., ) is the Σ
kernel parameters used in the kernel function. As there is no
θd
N
. i,u l,u
need to calculate the inverse of Sw, the objective function u 2
becomes efficient and stable. The basic idea is to determine Nθ Σ xj,t − xk,t
t=
the proper kernel parameters by maximizing the objective Σ 1 2θ2
= exp −u=1 u
function F with respect to the parameter set θ. Our method is
gradient descent based, and an iterative procedure is required
s 2
to update the kernel parameters along the direction of maxi-
. xj,t − xk,t Σ
mizing the objective function. Based on the derivation of the Σ
N
i,s l,s
. −Σ
KSLDA method, the projection matrix W is defined as W = × − t= 2 −2θ3s
Φ tV jjA in (14). Denote tt = V jjA, and tt is determined by our 1
proposed KSLDA method. Then, the objective function F can
Ns
Σ.
be reformulated as follows: 2
i,s
j,t k,tΣ
l,s
F = tr(W T SbW − W TSwW ) t=1 x −x
=
= tr tt Φt Φ Φb Φ tt − tt Φt Φ Φw Φ tt θs
3
. T T T
T T w T t
Σ
= tr(ttT PP T tt − ttT QQT tt)
b t (23)
T i,j
where P = Φ t Φb = [P l ]N×C is an N × C matrix, and Q =
i,u
Σ2
l,u
t w Σ
N
.
ΦT Φ = [Qi,j l,k]N ×N is an N × N matrix. Each element in Σu x − x . (27)
matrix P is an explicit function in terms of the kernel function. × exp −u=1
Nθ j,t k,t
From (13) and (12), matrix P and matrix Q can be expressed in t=
1
terms of the kernel matrix K as follows: 2θu
Then, we derive ∂P/∂θs as follows:2
P = K ·ΛN C − K ·IN C − 1N N ·K ·∆N C + 1N N ·K ·ΓN C
(24) ∂P ∂
= (K · Λ NC − K · INC
1 ∂θs ∂θs
Q = N (K − K ·INN − ΛNN ·K + ΛNN ·K ·INN )T . (25)
− 1NN · K · ∆NC + 1NN · K · ΓNC )
Hence, the objective function F can be described as an explicit = ∂Kθs · ΛNC − ∂Kθs · INC
function of kernel parameters. When initial values of kernel
0
parameters θ are given, to maximize the objective function, we − 1NN · ∂Kθs · ∆NC + 1NN · ∂Kθs · ΓNC (28)
need to find the search direction to update the kernel parameters
0
θ . To get the search direction, we need to compute the gradient where ∂P/∂θ s is an N × C matrix.
HUANG et al.: CHOOSING PARAMETERS OF KSLDA FOR RECOGNITION OF FACE IMAGES 853

We then derive ∂Q/∂θs as follows:


∂Q 1 ∂
= (K − K ·I −Λ ·K + Λ ·K · I )T
NN NN NN NN
∂θs N ∂θs
1
= (∂K Fig. 1. Images of two persons from the FERET database.
θs − ∂Kθs · I NN − Λ NN · ∂Kθ s
N 0
The threshold R in our experiments is set to 5e − 3. For
T
+ λNN · ∂Kθs · INN ) . (29) the initial value of the kernel parameters, we select the mean
distance of all the training vectors as follows:
Then, the gradient of F with respect to θ can be formulated Σ
as follows: 0 0 0 1 · N ≥xi − xj ≥2 = θ 0
θ = θ = ··· = θ = ( 1))
. Σ ( · −
tr(tt T P P T tt − ttT QQT tt)

∂θs = ∂θ.. Σ
∂F ∂ 1 2 Nθ N N (31)
∂P ∂P T 1 i,j=
= tr s ttT · · PT · tt + ttT · P · · tt 0 0
and the step size ρ is set to θ /100 in our experiments reported
. ∂θs ∂θs ΣΣ in Section IV.
∂Q ∂QT
− ttT · · QT · tt + ttT · Q · · tt .
∂θs ∂θs IV. E XPERIMENTAL R ESULTS
(30) In this section, the FERET [42] database is selected to
evaluate the performance of our KSLDA method with the
Now, the gradient of F with respect to θ is formulated in ESBMM algorithm. After that, an in-depth investigation on the
terms of matrix K and matrix ∂Kθs . We can use the objective generalization performance of our proposed method and other
function F as a model selection criterion and will make use of existing methods along pose and illumination dimensions is
this to optimize the kernel parameters in the next section. performed using the YaleB and CMU PIE databases.

C. Procedure of ESBMM Algorithm


A. Face Image Databases
Based on the analysis above, the detailed procedure of our
In the FERET database, we select 250 people, with four
proposed kernel parameter optimization algorithm ESBMM is
frontal-view images from each individual. Face image vari-
described as follows.
0 0
ations in these 1000 images include illumination, facial ex-
Step 1) Given threshold R , step size ρ , initialize iteration pression, partial occlusion, and aging [42]. All images are
counter k = 0, and the initial values for the kernel aligned by the centers of the eyes and the mouth and then
0 0 0 0
parameters θ = (θ 1, θ2 , . . . , θNθ ), where Nθ is the normalized with a resolution of 92 × 112. The pixel value of
number of kernel parameters to be used in the RBF each image will be normalized between 0 and 1. Images from
kernel function. two individuals are shown in Fig. 1. The Yale Group B face
Step 2) Using the EDLDA method, find the projection matrix database contains 5850 source images of 10 subjects each cap-
Step 3) Construct the objective function tured under 585 viewing conditions (9 poses × 65 illumination
conditions). In our experiments, we use images under
Step 4) Update the parameter set θ by a gradient step such 45 illumination conditions. These 4050 images (9 poses ×
that the objective function is maximized as follows. 45 illumination conditions) are divided into four subsets ac-
• Compute the gradient Dk toθ get the search cording to the angle that the light source direction makes with
direction along the direction of maximizing the camera axis, namely, subset 1 (0◦–25◦, 7 images per pose),
the objective function Dk = subset 2 (up to 12◦, 12 images per pose), subset 3 (up to 50 ◦,
12 images per pose), and subset 4 (up to 77◦, 14 images per
pose) [43]. All frontal-pose images are aligned by the centers
of the eyes and the mouth, and the other images are aligned by
the center points of the faces. Then, all images are normalized
(∂F/∂θ , ∂F/∂θ , . . . , ∂F/∂θ ). θ
with the same resolution 92 × 112. Some images from one
1 2 n
Compute the bound R using R = max ≥yi − individual are shown in Fig. 2. For each subset, we show images
of nine poses under one illumination condition. The first row
E(Y )≥, i = 1, 2,... ,N , where N is the num-
0
are images of nine poses under one illumination condition from
ber of training samples, and yi = W T φ(xi), subset 1. The second row are images of nine poses under one
i = 1, 2,... ,N , is the projected sample vector illumination condition from subset 2, etc.
0
using kernel-based LDA methods. If R > R , In the CMU PIE face database, there are 68 people, and
then go to Step 5). each person has 13 pose variations that range from the right
• Compute the unit vector D̂kθ of the gradient profile image to the left profile image and 43 different lighting
vector Dkθ as D̂kθ = Dkθ / ≥Dkθ ≥. conditions, 21 flashes with ambient light on or off. In our
0 experiments, for each person, we select 56 images including
• Let θ = θ+ ρ D̂kθ and k = k + 1. Go to Step 2).
13 poses with neutral expression and 43 different lighting
Step 5) Terminate and output results
854 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 37, NO. 4, AUGUST 2007

Fig. 2. Images of one person from the YaleB database.

Fig. 4. Tuning a single kernel parameter for KSLDA.

1) Case of a Single Kernel Parameter: In this part, we use a


single parameter θ for all θi in the kernel function. Since each
individual has four images, two images of each individual are
randomly selected for training, and the other two images are
for testing. The results in each iteration are recorded and shown
in Fig. 4. Fig. 4(a) shows the variations of F in (23) against
the number of iterations, whereas Fig. 4(b) and (c) shows the
trend of bound R and rank-1 accuracy. Fig. 4 shows that when
the number of iterations increases, the value of the objective
function F increases as well. This implies that the training
sample points in different classes become separated from each
other. On the other hand, the bound R value is also increasing
at the same time. This means that the generalization capability
Fig. 3. Images of one person from the CMU PIE face database.
is gradually degrading. At the 43rd iteration, the bound R
exceeds the predefined threshold, and the ESBMM algorithm
conditions in the frontal view. For all frontal-view images, we is terminated. The blue dotted line indicates the terminating
apply alignment based on two eye center and nose center points, point, and the corresponding values of F , bound R, and rank-
and no alignment is applied on the other images with a pose. 1 accuracy are displayed in the textboxes. Fig. 4 shows that
Several images of one person are shown in Fig. 3. the rank-1 accuracy is 82.46% at the termination point, which
All the segmented images in these three databases are is the best rank-1 accuracy value. The computational time for
rescaled to the resolution of 92×112. To reduce the compu- applying the ESBMM algorithm for determining the kernel
tational complexity, the Daubichies-4 (Db4) wavelet transform parameter is around 1 min per iteration (in a P4 1.6-GHz
is applied on all images, and the LL band with a resolution of computer). The required number of iterations is data dependent.
30 × 25 is kept as the final input image. In this case, around 43 min is required. Moreover, it is important
to point out that the ESBMM algorithm is used to determine
B. Results on the FERET Face Database the kernel parameter in the training stage and is performed only
one time. Once the parameter(s) is estimated, the computational
This section demonstrates the procedure of determining the time in the testing (recognition) stage is the same as in other
kernel parameters and evaluates the performance of our KSLDA kernel-based LDA methods.
method with the ESBMM algorithm. The performances on both After determining the kernel parameter, we can evaluate the
single kernel parameter and multiple kernel parameters are performance of our proposed algorithm and compare with exist-
presented. A performance comparison between our proposed ing PCA- and LDA-based methods. It is known that the results
method with existing methods using the FERET face database will be different with different training and testing images. To
is also given. Existing methods are divided into two categories, avoid this error, the experiments are repeated ten times, with
namely, PCA-based methods and LDA-based methods. PCA- randomly picked two samples for training and the other two
based methods include the eigenface [3] and KPCA [7] meth- samples for testing. Then, the average accuracy is calculated
ods. LDA-based methods include the FisherFace [5], D-LDA and reported. Table I shows the means and standard variations
[16], Huang et al. [19], RDA [15], subspace LDA [13], GDA [9], of the rank-1 to rank-3 accuracies of the proposed method and
and KDDA [11] methods. other existing methods.
HUANG et al.: CHOOSING PARAMETERS OF KSLDA FOR RECOGNITION OF FACE IMAGES 855

TABLE I
PERFORMANCE USING A SINGLE PARAMETER

Fig. 5. Recognition accuracy and computational load using different numbers


of kernel parameters.

The experimental settings for different methods are described


2
as follows. For KPCA, 2θ = 1e4. KDDA consists of two
parameters, namely, the kernel parameter for the Gaussian RBF
function θ and the regularization parameter η for the ill-posed
2
within-class scatter matrix. That is, 2θ = 5e5 and η = 1e − 5.
2
For GDA, 2θ is set to 4e4. In selecting these parameters, we re-
peatedly performed experiments within a certain range of para-
meters and then manually selected the parameters with the best Fig. 6. Example for a construct image vector block by block using four kernel
performance within that range. This means that those are the parameters.
optimal parameters within the range using the FERET data set.
The results in Table I show that, in general, LDA-based the following experiment, we divide images into different block
methods have a better performance than PCA-based methods. sizes. They are 1 1,×2 1, 2 5,
× 3 5, 5 ×
5, 6 5, × × ×
For LDA-based methods, kernel-based methods, in general, give 10 × 5, 15 ×5, 6 ×25, 30 ×25, and then the number
a better performance than linear-based LDA methods. Among of parameters are 1, 2, 10, 15, 25, 30, 50, 75, 150, and 750
all these methods, the proposed KSLDA method gives the best respectively. Fig. 5 shows the accuracy and computational
performance, in which the rank-1 accuracy is 82.40%.The time against the number of parameters. Results show that the
improvement of the proposed method is not very significant accuracy is increasing when the number of kernel parameter is
because of the following two reasons. First, the parameters used increasing from 1 to 30 and then flat afterward. At the same
in different methods are optimal in the predefined range. This time, the computational load is also increasing. When the num-
means that the proposed method can automatically select the ber of kernel parameters is larger than 50, the computational
optimal parameters. Second, only a single parameter is used in load increases dramatically. Therefore, 30 parameters are used
this experiment. In Section IV-B2, we will report the experi- in this paper. This means that an image will be divided into
mental results with multiple parameters, and the improvement 30 blocks, and all pixels in one block will share the same
becomes obvious. parameter.
2) Case of Multiple Parameters: After inspecting the case To efficiently compute the gradient using (27) in
of single parameter, this section discusses the use of multiple Section III-B, we construct a 1-D vector by concatenating
parameters for the Gaussian RBF function. The experimental each row of the input image vector. Fig. 6 shows an example
setting is the same as in the case of single parameter. that illustrates the procedure of constructing a 100 × 1 image
Generally speaking, the more the kernel parameters used in vector from the original image of 10 × 10 pixels.
the kernel function, the higher the flexibility of the KSLDA The iterative process of tuning multiple parameters is shown
method to handle the mapped feature distribution. In turn, a in Fig. 7. The patterns are very much similar with the case of
better performance can be achieved. However, at the same time, single parameter. Using the same procedure discussed in the
it will increase the risk of overfitting and the computational previous section. The iteration terminates at the 39th iteration,
load. In practice, when the number of parameters used in the and the corresponding kernel parameters are selected for our
kernel function exceeds a certain value, the improvement will KSLDA method. The blue dotted line indicates the terminating
become smaller, or even negative, while the computational load point, and the corresponding values of F , bound R, and rank-
is still increasing. In view of that, an experiment is conducted 1 accuracy are displayed in blue edge textboxes. The rank-
to determine the number of parameters that should be used. In 1 accuracy is 84.48% at the terminating point, whereas it is
856 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 37, NO. 4, AUGUST 2007

TABLE II
PERFORMANCE COMPARISON USING SINGLE
AND MULTIPLE PARAMETERS

TABLE III
FIXED POSE VARIATION AND EVALUATION OF THE GENERALIZATION
PERFORMANCE ALONG THE ILLUMINATION DIMENSION;
ALL IMAGES OF SET 1 FOR TRAINING AND THOSE
OF SETS 2–4 FOR TESTING

Fig. 7. Procedure of tuning multiple kernel parameters for KSLDA.

our KSLDA method using multiple kernel parameters has a


better performance than using a single parameter. The rank-1
accuracy increases from 82.40% to 84.44%. By using multiple
kernel parameters in the Gaussian RBF function, our KSLDA
method will provide more flexibility on handling the distribu-
tion of mapped feature data points, and this will improve the
classification performance. For the rest of the experiments, the
results of the KSLDA method are all acquired with 30 kernel
parameters.

C. Results on the YaleB Face Database


This section investigates the generalization performance of
the KSLDA method with the ESBMM algorithm being applied
along illumination and pose dimensions. We do not list the
standard variations when there is only a pose or illumination
variation in the sample images because most of the algorithms
investigated in this section have good and stable performances.
Fig. 8. Kernel parameters for different blocks after the tuning procedure. When the variations are along both the pose and illumination
dimensions, we show both the mean and standard variations of
85.26% at the next point, which is the best rank-1 accuracy the results.
value. After that point, the rank-1 accuracy begins to drop 1) Fixed Pose With Illumination Variations: In this part,
gradually. we fix the pose variations and investigate the generalization
The sketch map of different parameter values for each block performance of the proposed method along the illumination
found is shown in Fig. 8. The upper-left corner is a face image. dimension. For each pose, all seven images in subset 1 are
The lower-left corner is the sketch map of different parameter selected for training and images from subsets 2–4 are used for
values. The image on the right illustrates the corresponding testing. The average results on nine poses are recorded, and the
relationship between face image pixels and kernel parameters. average rank-1 accuracies of the different methods are listed
Pixels in each block in the sketch map share the same kernel in Table III.
parameter value. The darker the color is, the smaller the value It can be seen from the results on subset 2 in Table III that
of kernel parameter will be. To compare the performance, the both linear and nonlinear methods give equally good results
experimental setting for the case of multiple parameters is the when the illumination directions among the testing samples
same as that in the case of single parameter. The experiments slightly deviate from those among the training samples. When
are running ten times, and the mean of the rank-1 to rank- the illumination variations become large, e.g., illumination
3 accuracies are listed in Table II. As shown in Table II, subset 3, the performance of the linear methods begins to
HUANG et al.: CHOOSING PARAMETERS OF KSLDA FOR RECOGNITION OF FACE IMAGES 857

TABLE IV TABLE V
FIXED POSE VARIATION AND EVALUATION OF THE GENERALIZATION FIXED ILLUMINATION AND EVALUATION OF THE GENERALIZATION
PERFORMANCE ALONG THE ILLUMINATION DIMENSION; PERFORMANCE OF THE KSLDA METHOD ALONG THE POSE DIMENSION; ALL
RANDOMLY SELECTED TWO IMAGES FROM EACH SUBSET FOR TRAINING AND IMAGES OF POSE 1 FOR TRAINING AND THOSE OF POSES 2–45 FOR
THE OTHERS FOR TESTING TESTING

TABLE VI
degrade dramatically, except for the subspace LDA method. PERFORMANCE WITH BOTH THE POSE AND ILLUMINATION VARIATIONS
When the illumination directions among the testing samples
further deviated from those of the training samples, e.g., subset
4, both linear and nonlinear method cannot give satisfactory
results. The results in Table III show the following.
1) The generalization performance of all linear and nonlin-
ear methods will degrade when the illumination direc-
tions among the training images are significantly different
from those in the testing images. This implies that both
linear and nonlinear methods cannot give satisfactory
results when there are no representative samples for
training.
2) Generally speaking, nonlinear methods outperform linear
methods.
3) LDA-based methods give a relatively better performance
compared with PCA-based methods.
4) The proposed method always gives the best (or among the
best) performance in all cases.
The above experiment shows that both linear and nonlinear
methods require representative samples for training. Therefore, condition 1. For all cases, the proposed KSLDA method gives a
in the next experiment, for each pose, we randomly select two better performance than the other methods, regardless of which
images from each subset for training (2 images × 4 subset = illumination subsets. Furthermore. in general, kernel-based
8 images for training per individual), and all the other images LDA methods have a relatively better performance than linear-
from the four subsets (45−8 = 37 images) are selected for based LDA methods, except that the subspace LDA method
testing. For each pose, the experiment is repeated ten times, performs as well as kernel-based LDA methods. Results in
and the average rank-1 to rank-3 accuracies are recorded. The Table V show that kernel-based LDA methods have a better
average results on all poses are listed in Table IV. In Table IV, generalization performance along pose dimensions than do
it can be seen that both the kernel-based and linear-based LDA linear-based LDA methods.
methods give very good performance with more representative 3) Both Pose and Illumination Variations: Finally, we
samples for training. would like to perform experiments on both pose and illumina-
2) Fixed Illumination With Pose Variations: Now, we fix tion variations and to evaluate the generalization performance
the illumination variations and change the poses. Images of of all methods along both the pose and illumination dimensions.
nine poses from illumination condition 1 are used for train- The experimental settings are as described follows. For each
ing, whereas the testing will be performed on the remaining pose, we select two images from each illumination subset
44 illumination conditions. As the page width is limited, the (out of four subsets). This is to say that we will randomly
results under these 44 illumination conditions are grouped × 4 subsets
select 720 images (10 persons 9 poses × ×
according to four subsets, as shown in Table V. As images 2 images) for training, and the remaining 3330 (10 persons
under illumination condition 1 are used for training, the results × 9 poses × 37 images) images will be for testing. The
of subset 1 becomes the average accuracy among illumina- experiments are repeated ten times, and the average rank-1 to
tion condition 2 to illumination condition 7. The results of rank-3 accuracies are recorded and shown in Table VI. It can
subsets 2–4 are the average accuracies under the illumination be seen in Table VI that when both the pose and illumination
conditions of each subset, respectively. It can be seen that variations are considered, the performance of kernel-based LDA
the performance of each method is degrading with the illu- methods become obvious compared with previous cases. All
mination condition going extremely different from the training illumination, kernel-based LDA methods give a better performance than do linear-based LDA methods. That is to say, the nonlinear generalization capability of kernel-based LDA methods is better than that of linear-based LDA methods when complex variations exist in the training and testing samples. The accuracy of the KSLDA method is 91.95%. Compared with the best linear-based LDA method, i.e., subspace LDA, the proposed KSLDA method gives nearly 20% improvement; compared with the best existing kernel-based LDA method, GDA, it gives about 3% improvement.

D. Results on the CMU PIE Face Database

This section reports the results of the proposed method on the CMU PIE database. We randomly select 14 images from each person for training (14 × 68 = 952 images), while the remaining images are used for testing (42 × 68 = 2856 images). The experiments are repeated ten times, and the average rank-1 to rank-3 accuracies are recorded in Table VII. The recognition accuracy of our method increases from 78.50% at rank 1 to 81.51% at rank 3. Compared with the other LDA-based methods, the results show that our proposed method gives the best performance.

TABLE VII
PERFORMANCE ON THE CMU PIE DATABASE

V. CONCLUSION

To further improve the generalization capability of the kernel-based LDA method, the ESBMM algorithm has been proposed to automatically tune the kernel parameters for the kernel-based LDA method. To demonstrate the feasibility of the ESBMM algorithm, a KSLDA method is developed and adopted as the kernel-based LDA method for the ESBMM algorithm. The novelties of each part are listed as follows.

1) To further improve the generalization capability of the KSLDA method, we propose to use eigenvalue stability to measure the generalization capability of a kernel-based learning algorithm. The ESBMM algorithm is designed and developed to automatically tune multiple kernel parameters for the KSLDA method by maximizing the maximum margin criterion while preserving the eigenvalue stability.
   a) Eigenvalue stability analysis is performed on the between-class scatter matrix Sb and the within-class scatter matrix Sw. A new eigenvalue stability measurement for the algorithm is proposed.
   b) By tuning the kernel parameters, the proposed KSLDA method gives more flexibility in adjusting the distribution of the kernel-mapped feature data points. Consequently, we can further improve the generalization capability of the KSLDA method by selecting proper kernel parameters. This also provides a new direction for other kernel-based LDA methods to improve their generalization capability.
   c) For linear-based LDA methods, there is no convenient way to adjust the distribution of the feature data points.
2) A new KSLDA method is proposed to handle illumination and pose variations.
   a) The proposed KSLDA method is a nonlinear technique generalized from the subspace LDA method, and it does not suffer from the S3 problem.
   b) Compared with existing kernel-based LDA and linear-based LDA methods, the proposed KSLDA method has stronger generalization capability when complex variations exist in the training and testing face images.

Three publicly available face databases, namely, the FERET, YaleB, and CMU PIE face databases, have been selected to evaluate the proposed method and other existing face recognition methods. When the image variations are not very large, the proposed method gives results comparable to the current state-of-the-art methods. When the image variations are large, such as combined illumination and pose variations, the proposed method clearly outperforms the existing methods.

APPENDIX I
PROOF OF THEOREM 1

Proof: Assume the singular value decomposition $S_1 = U\Lambda V^T$, where $U, V \in \mathbb{R}^{n\times n}$ are orthogonal matrices and $\Lambda = \mathrm{diag}(\sigma_1,\ldots,\sigma_r,0,\ldots,0) \in \mathbb{R}^{n\times n}$ with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$. Then, we have $S_1^T S_1 = \Psi^T\Phi\Phi^T\Psi = V\Lambda U^T U\Lambda V^T = V\Lambda^2 V^T$, that is, $\Psi^T S_2 \Psi = V\Lambda^2 V^T$. Therefore, $(\Psi V)^T S_2 (\Psi V) = \Lambda^2$. Let $Q = \Psi V$. Then, $Q^T S_2 Q = \Lambda^2$. □

APPENDIX II
PROOF OF THEOREM 2

Proof: We have $S_2 = \Phi\Phi^T = U\binom{\Sigma}{0}V^T\,V\,(\Sigma^T\mid 0)\,U^T = U\,\mathrm{diag}(\sigma_1^2,\ldots,\sigma_r^2,0,\ldots,0)_{d\times d}\,U^T$. Rewrite $U = (u_1,\ldots,u_r,u_{r+1},\ldots,u_d) \in \mathbb{R}^{d\times d}$ and let $U_1 = (u_1,\ldots,u_r) \in \mathbb{R}^{d\times r}$. Then, $U_1^T S_2 U_1 = \mathrm{diag}(\sigma_1^2,\sigma_2^2,\ldots,\sigma_r^2)$. □
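Both appendix results above reduce to the statement that the singular vectors of the mapped data matrix diagonalize the corresponding outer-product scatter matrix. The short NumPy sketch below is a sanity check of that identity in the form used in the proof of Theorem 2; the matrix sizes and the random data are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative sizes only: d-dimensional features, n samples, r = n leading directions.
d, n, r = 8, 5, 5
rng = np.random.default_rng(0)
Phi = rng.standard_normal((d, n))      # plays the role of the mapped data matrix Phi
S2 = Phi @ Phi.T                       # S2 = Phi Phi^T as in Appendix II

# SVD of Phi: Phi = U diag(sigma) V^T
U, sigma, Vt = np.linalg.svd(Phi, full_matrices=True)

# Theorem-2-style check: U1^T S2 U1 = diag(sigma_1^2, ..., sigma_r^2)
U1 = U[:, :r]
lhs = U1.T @ S2 @ U1
rhs = np.diag(sigma[:r] ** 2)
print(np.allclose(lhs, rhs))           # True: the leading left singular vectors diagonalize S2
```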
APPENDIX III
DERIVATION OF MATRIX S_wt

Each entry of $S_{wt} = \Phi_w^T \Phi_t$ is a dot product of a column of $\Phi_w$ and a column of $\Phi_t$, i.e.,

$$ S_{wt} = \Phi_w^T \Phi_t = \Big[\big(\tilde{\Phi}^w_{i,j}\big)^T \tilde{\Phi}^t_{l,k}\Big]_{(i,j),(l,k)} $$

where

$$ \big(\tilde{\Phi}^w_{i,j}\big)^T \tilde{\Phi}^t_{l,k} = \left[\tfrac{1}{\sqrt{N}}\big(\phi(x^i_j) - m_i\big)\right]^T \left[\tfrac{1}{\sqrt{N}}\big(\phi(x^l_k) - m\big)\right] = \frac{1}{N}\Big[\phi(x^i_j)^T\phi(x^l_k) - \phi(x^i_j)^T m - m_i^T\phi(x^l_k) + m_i^T m\Big]. $$

By formulating the dot product in terms of the kernel function, we get

$$ \big(\tilde{\Phi}^w_{i,j}\big)^T \tilde{\Phi}^t_{l,k} = \frac{1}{N} K^{ij}_{lk} - \frac{1}{N^2}\sum_{p=1}^{C}\sum_{q=1}^{N_p} K^{ij}_{pq} - \frac{1}{N N_i}\sum_{s=1}^{N_i} K^{is}_{lk} + \frac{1}{N^2 N_i}\sum_{s=1}^{N_i}\sum_{p=1}^{C}\sum_{q=1}^{N_p} K^{is}_{pq}. \quad (32) $$

Then, we get

$$ S_{wt} = \Phi_w^T \Phi_t = \Big[\big(\tilde{\Phi}^w_{i,j}\big)^T \tilde{\Phi}^t_{l,k}\Big] = \frac{1}{N}\big(K - K\cdot I_{NN} - \Lambda_{NN}\cdot K + \Lambda_{NN}\cdot K\cdot I_{NN}\big). \quad (33) $$

APPENDIX IV
DERIVATION OF MATRIX Z

$$ \Phi_t^T \Phi_b = \Big[\big(\tilde{\Phi}^t_{i,j}\big)^T \tilde{\Phi}^l_b\Big]_{(i,j),\,l}. \quad (34) $$

By formulating the dot product in terms of the kernel function, we get

$$ \big(\tilde{\Phi}^t_{i,j}\big)^T \tilde{\Phi}^l_b = \left[\tfrac{1}{\sqrt{N}}\big(\phi(x^i_j) - m\big)\right]^T \left[\sqrt{\tfrac{N_l}{N}}\,(m_l - m)\right] = \frac{\sqrt{N_l}}{N}\Big[\phi(x^i_j)^T m_l - \phi(x^i_j)^T m - m^T m_l + m^T m\Big] $$

and, substituting $m_l = \frac{1}{N_l}\sum_{s=1}^{N_l}\phi(x^l_s)$ and $m = \frac{1}{N}\sum_{p=1}^{C}\sum_{q=1}^{N_p}\phi(x^p_q)$,

$$ \big(\tilde{\Phi}^t_{i,j}\big)^T \tilde{\Phi}^l_b = \frac{1}{N\sqrt{N_l}}\sum_{s=1}^{N_l} K^{ij}_{ls} - \frac{\sqrt{N_l}}{N^2}\sum_{s=1}^{C}\sum_{t=1}^{N_s} K^{ij}_{st} - \frac{1}{N^2\sqrt{N_l}}\sum_{s=1}^{C}\sum_{t=1}^{N_s}\sum_{q=1}^{N_l} K^{st}_{lq} + \frac{\sqrt{N_l}}{N^3}\sum_{s=1}^{C}\sum_{t=1}^{N_s}\sum_{p=1}^{C}\sum_{q=1}^{N_p} K^{st}_{pq}. \quad (35) $$

Therefore, we have

$$ Z = \Phi_t^T \Phi_b = K\cdot\Lambda_{NC} - K\cdot I_{NC} - 1_{NN}\cdot K\cdot\Delta_{NC} + 1_{NN}\cdot K\cdot\Gamma_{NC}. \quad (36) $$
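The kernel-only expansions in Appendices III and IV can be verified numerically whenever the feature map is explicit. The sketch below is a minimal check of the centered dot product behind (32), assuming a linear kernel $k(x,y)=x^T y$ so that $\phi(x)=x$; the $1/\sqrt{N}$ scaling of the centered vectors is omitted for clarity, and the class sizes and data are toy values, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
Ni_list, d = [4, 5, 6], 7                         # three classes of unequal size (illustrative)
X = [rng.standard_normal((Ni, d)) for Ni in Ni_list]   # X[i][j] is sample x_j^i
N = sum(Ni_list)

k = lambda x, y: x @ y                            # linear kernel => phi(x) = x (assumption)
m = sum(Xi.sum(axis=0) for Xi in X) / N           # global mean in feature space
means = [Xi.mean(axis=0) for Xi in X]             # class means m_i

def centered_dot_kernel(i, j, l, kk):
    """(phi(x_j^i) - m_i)^T (phi(x_k^l) - m), computed from kernel evaluations only."""
    Ni = len(X[i])
    term1 = k(X[i][j], X[l][kk])
    term2 = sum(k(X[i][j], xq) for Xp in X for xq in Xp) / N
    term3 = sum(k(xs, X[l][kk]) for xs in X[i]) / Ni
    term4 = sum(k(xs, xq) for xs in X[i] for Xp in X for xq in Xp) / (N * Ni)
    return term1 - term2 - term3 + term4

i, j, l, kk = 1, 2, 0, 3
direct = (X[i][j] - means[i]) @ (X[l][kk] - m)    # same quantity from explicit features
print(np.isclose(centered_dot_kernel(i, j, l, kk), direct))   # True
```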
APPENDIX V
PROOF OF THEOREM 5

Proof: Assume each class contains n training samples. Then, we can simplify the between-class and within-class scatter matrices as

$$ S_b = \frac{1}{C}\sum_{i=1}^{C}(m_i - m)(m_i - m)^T = \frac{C-1}{C}\,\mathrm{Cov}_M \quad (37) $$

where $\mathrm{Cov}_M = \frac{1}{C-1}\sum_{i=1}^{C}(m_i - m)(m_i - m)^T$ is the covariance matrix on the set $M = \{m_1, m_2, \ldots, m_C\}$. Then, we have

$$ S_w = \frac{1}{N}\sum_{i=1}^{C}\sum_{j=1}^{n}(x_{ij} - m_i)(x_{ij} - m_i)^T = \frac{n-1}{N}\sum_{i=1}^{C}\mathrm{Cov}_i \quad (38) $$

where $\mathrm{Cov}_i = \frac{1}{n-1}\sum_{j=1}^{n}(x_{ij} - m_i)(x_{ij} - m_i)^T$ is the covariance matrix of class i.

Suppose we replace a training sample $x_{p0}$ in class p with another sample $x_{p1}$ in class p, which replaces the class mean $m_p$ with $m_p^j$. Therefore, we have

$$ \mathrm{Cov}_{M(M\backslash m_p)\cup m_p^j} = \mathrm{Cov}_M - \frac{C}{(C-1)^2}(m - m_p)(m - m_p)^T + \frac{1}{C}\big(m^{(M\backslash m_p)} - m_p^j\big)\big(m^{(M\backslash m_p)} - m_p^j\big)^T. \quad (39) $$

After replacing $x_{p0}$ with $x_{p1}$, we denote $S_b^{M(M\backslash m_p)\cup m_p^j}$ as $S_b^j$. Then, by (37),

$$ S_b^j = S_b - \frac{1}{C-1}(m - m_p)(m - m_p)^T + \frac{C-1}{C^2}\big(m^{(M\backslash m_p)} - m_p^j\big)\big(m^{(M\backslash m_p)} - m_p^j\big)^T. \quad (40) $$

Let $O_{i1} = (m - m_p)/\|m - m_p\|$ and $O_{i2} = (m^{(M\backslash m_p)} - m_p^j)/\|m^{(M\backslash m_p)} - m_p^j\|$, where we assume $m - m_p \ne 0$ and $m^{(M\backslash m_p)} - m_p^j \ne 0$. Then, we have

$$ S_b^j = S_b - \frac{1}{C-1}\|m - m_p\|^2\, O_{i1}O_{i1}^T + \frac{C-1}{C^2}\big\|m^{(M\backslash m_p)} - m_p^j\big\|^2\, O_{i2}O_{i2}^T. \quad (41) $$

For $S_w$, replacing $x_{p0}$ with $x_{p1}$ only changes the covariance matrix of class p, i.e., for $i = 1, 2, \ldots, C$ and $i \ne p$, we have

$$ \mathrm{Cov}_{i(X\backslash x_{p0})} = \mathrm{Cov}_i, \qquad \mathrm{Cov}_{i(X\cup x_{p1})} = \mathrm{Cov}_i, \qquad \mathrm{Cov}_{i(X\backslash x_{p0})\cup x_{p1}} = \mathrm{Cov}_i. \quad (42) $$

However, for class p, we have

$$ \mathrm{Cov}_{p(X\backslash x_{p0})\cup x_{p1}} = \mathrm{Cov}_p - \frac{n}{(n-1)^2}(m_p - x_{p0})(m_p - x_{p0})^T + \frac{1}{n}\big(m_{p(X\backslash x_{p0})} - x_{p1}\big)\big(m_{p(X\backslash x_{p0})} - x_{p1}\big)^T. \quad (43) $$

After replacing $x_{p0}$ with $x_{p1}$, we denote $S_w^{X(X\backslash x_{p0})\cup x_{p1}}$ as $S_w^j$. Then, by (38),

$$ S_w^j = \frac{n-1}{N}\sum_{i=1}^{C}\mathrm{Cov}_{i(X\backslash x_{p0})\cup x_{p1}} = S_w - \frac{n}{N(n-1)}(m_p - x_{p0})(m_p - x_{p0})^T + \frac{n-1}{N\cdot n}\big(m_{p(X\backslash x_{p0})} - x_{p1}\big)\big(m_{p(X\backslash x_{p0})} - x_{p1}\big)^T. \quad (44) $$

Let $\Delta_{i1} = (m_p - x_{p0})/\|m_p - x_{p0}\|$ and $\Delta_{i2} = (m_{p(X\backslash x_{p0})} - x_{p1})/\|m_{p(X\backslash x_{p0})} - x_{p1}\|$. Then, we have

$$ S_w^j = S_w - \frac{n}{N(n-1)}\|m_p - x_{p0}\|^2\, \Delta_{i1}\Delta_{i1}^T + \frac{n-1}{N\cdot n}\big\|m_{p(X\backslash x_{p0})} - x_{p1}\big\|^2\, \Delta_{i2}\Delta_{i2}^T. \quad (45) $$

Therefore, we have

$$ (S_b^j - S_w^j) - (S_b - S_w) = (S_b^j - S_b) - (S_w^j - S_w) = -\frac{1}{C-1}\|m - m_p\|^2\, O_{i1}O_{i1}^T + \frac{C-1}{C^2}\big\|m^{(M\backslash m_p)} - m_p^j\big\|^2\, O_{i2}O_{i2}^T + \frac{n}{N(n-1)}\|m_p - x_{p0}\|^2\, \Delta_{i1}\Delta_{i1}^T - \frac{n-1}{N\cdot n}\big\|m_{p(X\backslash x_{p0})} - x_{p1}\big\|^2\, \Delta_{i2}\Delta_{i2}^T. \quad (46) $$

Using Theorem 4 and writing $S_b^r = S_b^j$ and $S_w^r = S_w^j$ (to avoid clashing with the eigenvalue index j), we have

$$ \lambda_j(S_b^r - S_w^r) = \lambda_j(S_b - S_w) - \frac{1}{C-1}\|m - m_p\|^2\, m_{j1} + \frac{C-1}{C^2}\big\|m^{(M\backslash m_p)} - m_p^j\big\|^2\, m_{j2} + \frac{n}{N(n-1)}\|m_p - x_{p0}\|^2\, m_{j3} - \frac{n-1}{N\cdot n}\big\|m_{p(X\backslash x_{p0})} - x_{p1}\big\|^2\, m_{j4} \quad (47) $$

where $\sum_j m_{ji} = 1$ for $i = 1, 2, 3, 4$. Then

$$ \left|\sum_{j=1}^{N}\big[\lambda_j(S_b^r - S_w^r) - \lambda_j(S_b - S_w)\big]\right| = \left|\sum_{j=1}^{N}\Big[-\frac{1}{C-1}\|m - m_p\|^2\, m_{j1} + \frac{C-1}{C^2}\big\|m^{(M\backslash m_p)} - m_p^j\big\|^2\, m_{j2} + \frac{n}{N(n-1)}\|m_p - x_{p0}\|^2\, m_{j3} - \frac{n-1}{N\cdot n}\big\|m_{p(X\backslash x_{p0})} - x_{p1}\big\|^2\, m_{j4}\Big]\right|. \quad (48) $$

Using the triangle inequality and $\sum_j m_{ji} = 1$, we obtain

$$ \left|\sum_{j=1}^{N}\big[\lambda_j(S_b^r - S_w^r) - \lambda_j(S_b - S_w)\big]\right| \le \frac{1}{C-1}\|m - m_p\|^2 + \frac{C-1}{C^2}\big\|m^{(M\backslash m_p)} - m_p^j\big\|^2 + \frac{n}{N(n-1)}\|m_p - x_{p0}\|^2 + \frac{n-1}{N\cdot n}\big\|m_{p(X\backslash x_{p0})} - x_{p1}\big\|^2. \quad (49) $$
Since we assume that all training data lie in a hypersphere with radius R, we know that $\|m - m_p\|^2 \le 4R^2$, $\|m^{(M\backslash m_p)} - m_p^j\|^2 \le 4R^2$, $\|m_p - x_{p0}\|^2 \le 4R^2$, and $\|m_{p(X\backslash x_{p0})} - x_{p1}\|^2 \le 4R^2$. Then, using $N = Cn$, we have

$$ \left|\sum_{j=1}^{N}\big[\lambda_j(S_b^r - S_w^r) - \lambda_j(S_b - S_w)\big]\right| \le 4R^2\left[\frac{1}{C-1} + \frac{C-1}{C^2} + \frac{n}{N(n-1)} + \frac{n-1}{N\cdot n}\right] \le 4R^2\left[\frac{2}{C-1} + \frac{2}{(C-1)(n-1)}\right] = \frac{8nR^2}{(C-1)(n-1)}. \quad (50) $$

□
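The bound (50) can be illustrated on toy data by replacing a single training sample and recomputing the eigenvalues of $S_b - S_w$. The sketch below does exactly that; the Gaussian toy data, the use of the sum of absolute eigenvalue changes, and the radius estimate taken from the origin are illustrative assumptions rather than the paper's experimental protocol.

```python
import numpy as np

rng = np.random.default_rng(2)
C, n, d = 5, 10, 6                      # C classes, n samples per class, d dimensions (illustrative)
N = C * n
X = rng.standard_normal((C, n, d))

def scatter_diff(X):
    """Return S_b - S_w for data X[c, j, :] with equal class sizes, per (37)-(38)."""
    m = X.reshape(-1, d).mean(axis=0)                    # global mean
    means = X.mean(axis=1)                               # class means m_i
    Sb = sum(np.outer(mi - m, mi - m) for mi in means) / C
    Sw = sum(np.outer(x - means[c], x - means[c])
             for c in range(C) for x in X[c]) / N
    return Sb - Sw

A = scatter_diff(X)
Xr = X.copy()
Xr[0, 0] = rng.standard_normal(d)       # replace one training sample in class 0

change = np.abs(np.linalg.eigvalsh(scatter_diff(Xr)) - np.linalg.eigvalsh(A)).sum()
R = np.linalg.norm(np.vstack([X.reshape(-1, d), Xr.reshape(-1, d)]), axis=1).max()
bound = 8 * n * R**2 / ((C - 1) * (n - 1))
print(change <= bound)                  # expected True for this toy example
```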
ACKNOWLEDGMENT

The authors would like to thank the U.S. Army Research Laboratory for the FERET database and Yale University and Carnegie Mellon University for providing the YaleB and CMU PIE databases. Furthermore, the authors would like to thank Dr. Lu for providing the Matlab code of KDDA and V. Franc for providing the Statistical Pattern Recognition Toolbox, which includes the Matlab code of KPCA and GDA used in this paper.
Jian Huang received the B.Sc. and M.Sc. degrees in applied mathematics from Sun Yat-Sen (Zhongshan) University, Guangzhou, China, in 1999 and 2002, respectively, and the Ph.D. degree from Hong Kong Baptist University, Kowloon, in 2006.
He is currently with the School of Information Science and Technology, Sun Yat-Sen (Zhongshan) University, and Hong Kong Baptist University. His research interests include pattern recognition, face recognition, image processing, linear discriminant analysis algorithms, and kernel methods.

Pong C. Yuen (S'92–M'93) received the B.Sc. degree in electronic engineering (first class honors) from the City Polytechnic of Hong Kong, Kowloon, in 1989 and the Ph.D. degree in electrical and electronic engineering from the University of Hong Kong, Hong Kong, in 1993.
In 1993, he joined the Department of Computer Science, Hong Kong Baptist University, Kowloon, as an Assistant Professor, where he is currently a Professor. He was with the Laboratory of Imaging Science and Engineering, Department of Electrical Engineering. In 1998, he spent a six-month sabbatical leave in the University of Maryland Institute for Advanced Computer Studies (UMIACS), University of Maryland, College Park, where he was with the Computer Vision Laboratory, Center for Automation Research. From June 2005 to January 2006, he was a Visiting Professor in the GRAVIR laboratory (GRAphics, VIsion and Robotics), Institut National de Recherche en Informatique et en Automatique (INRIA), Rhone Alpes, France, where he was with the PRIMA Group. He was the Director of the Croucher Advanced Study Institute on biometric authentication in 2004 and on biometric security and privacy in 2007. He is currently an Editorial Board Member of Pattern Recognition. His current research interests include human face processing and recognition, online and offline signature recognition, tongue image analysis for traditional Chinese medicine, and context modeling and learning.
Dr. Yuen received a University Fellowship to visit the University of Sydney, Sydney, Australia, in 1996. He has been actively involved in many international conferences as an Organizing Committee and/or Technical Program Committee Member. He was a Track Cochair of the International Conference on Pattern Recognition 2006.

Wen-Sheng Chen received the B.Sc. and Ph.D. degrees in mathematics from Sun Yat-Sen (Zhongshan) University, Guangzhou, China, in 1989 and 1998, respectively.
He is currently a Professor in the Institute of Intelligent Computing Science, College of Mathematics and Computational Science, Shenzhen University, Shenzhen, China. His current research interests include pattern recognition, kernel methods, and wavelet analysis and its applications.
Dr. Chen is a member of the Chinese Mathematical Society.

Jian Huang Lai received the M.Sc. degree in applied mathematics and the Ph.D. degree in mathematics from Sun Yat-Sen (Zhongshan) University, Guangzhou, China, in 1989 and 1999, respectively.
He is currently a Professor in the School of Information Science and Technology, Sun Yat-Sen (Zhongshan) University. He is also a Board Member of the Image and Graphics Association China. His current research interests include image processing, pattern recognition, computer vision, and wavelet analysis. He has published more than 50 scientific articles in these areas.
