
Proceedings of the 8th World Congress on Intelligent Control and Automation
July 6-9, 2010, Jinan, China


Multi-Modal Face Recognition

Haihong Shen (1,2), Liqun Ma (1,3), Qishan Zhang (1)
(1) School of Electronic and Information Engineering, Beijing University of Aeronautics and Astronautics
(2) China University of Geosciences
(3) China Astronaut Research and Training Center
E-mail: haihong shen@163.com
Abstract—In this paper, we explore the capability of multi-modal face recognition through a comparative study of 6 fusion methods at the score level, which fall into 2 kinds: (1) simple fusion without data training, such as Sum, Product, Max and Min; (2) complex fusion including a predefined data training stage, such as Linear Discriminant Analysis (LDA) and Support Vector Machine (SVM). Our experiments are based on the CASIA 3D Face Database and cover two modes: verification and classification. Major conclusions are: (1) the 2D modality achieves performance similar to the 3D modality, and a fusion scheme can substantially improve recognition performance; (2) the Product rule gives the best recognition performance among the simple fusion methods without a training stage; (3) there is no guarantee that the complicated fusion methods will achieve better recognition performance than the simple ones, so it is important to select the fusion model best suited to the task.

Index Terms—multi-modal face recognition, comparative study, score level, fusion strategies
I. INTRODUCTION
Most research in the face recognition area has focused on intensity or color images of faces [1]. With the rapid development of 3D acquisition systems, 2D and 3D data can now be obtained simultaneously. Multi-modal face recognition has therefore attracted growing interest, and a great deal of research effort has been devoted to this area.
Various multi-modal face recognition methods have been proposed in the last decade [2]. Beumier et al. [3] extract central/lateral profiles in both the intensity and depth modalities; they build four classifiers in total and perform recognition by combining the scores with the Fisher rule. Wang et al. [4] adopt Gabor features in the intensity modality and point-signature features in the depth modality, and perform multi-modal recognition with an SVM arranged in a decision directed acyclic graph (DDAG). Lu et al. [5] adopt LDA in the intensity modality and ICP in the depth modality, and achieve multi-modal recognition through a hierarchical matching design. Tsalakanidou et al. [6] use EHMMs on both the color and depth modalities and combine the results with the weighted sum rule. Chang et al. [7] use PCA on both the intensity and depth modalities and fuse them with the weighted sum rule. However, all of the above methods [3] [4] [6] [7] consider only a limited set of fusion methods, and most of the strategies have no training stage (simple fusion rules), which cannot exploit the full potential of a multi-modal face recognition system.
In this paper we make a comparative study of 6 fusion methods at the score level to evaluate the recognition capability of the multi-modal strategy. Some of the adopted fusion methods, such as SVM, need a data training stage to guarantee their performance; our comparison can therefore further improve the performance of a multi-modal face recognition system. The flowchart of our multi-modal system is shown in Fig. 1. In each modality, we adopt Uniform Local Binary Patterns (ULBP) for face representation and the Nearest Neighbor (NN) classifier for recognition [8] [9]. For score normalization, we use the min-max rule to map the matching scores of each modality onto the same scale [10]. For the fusion strategies, we compare the performance of 6 different fusion methods, Sum, Product, Max, Min, LDA and SVM, evaluating their abilities in both the verification mode and the classification mode.
The rest of this paper is organized as follows: In Section 2,
we describe the face representation and recognition in both
intensity and depth modalities. In Section 3, we describe
the score normalization method and the fusion strategies.
Experimental results and discussions are shown in Section
4. Finally, the paper is concluded in Section 5.
II. MULTI-MODAL FACE RECOGNITION
A. Uniform Local Binary Patterns
As shown in Eqn. (1), the Local Binary Pattern (LBP) is a function that maps a local neighborhood pattern to a single value:

LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c) \, 2^p    (1)

where P is the number of neighbors around the central pixel, R is the distance of the neighbors from the central pixel, g_p and g_c are the gray values of the p-th neighbor and the central pixel, and s(x) is the unit step function (1 if x >= 0, 0 otherwise) [8].
Fig. 1. The flowchart of the multi-modal face recognition system, in which the fusion is implemented at the score stage.

Certain local binary patterns are fundamental properties of textures and account for the majority of the patterns present in observed textures. These fundamental patterns are called Uniform Local Binary Patterns (ULBP): in the circular structure, they contain few spatial transitions and can be viewed as texture micro-structures, such as spots and edges.
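To make Eqn. (1) and the uniformity test concrete, here is a minimal sketch of LBP_{8,1} coding. It approximates the circular sampling of [8] with a square 3x3 neighborhood, and the function names (`lbp_code`, `is_uniform`) are ours, not the authors' code.

```python
import numpy as np

def lbp_code(patch):
    """LBP_{8,1} code of a 3x3 patch: threshold the 8 neighbors
    against the center pixel and pack the bits (Eqn. 1)."""
    center = patch[1, 1]
    # Neighbors enumerated clockwise from the top-left corner.
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    return sum(int(g >= center) << p for p, g in enumerate(neighbors))

def is_uniform(code, bits=8):
    """A pattern is uniform if its circular bit string has at most
    two 0/1 transitions."""
    s = [(code >> i) & 1 for i in range(bits)]
    return sum(s[i] != s[(i + 1) % bits] for i in range(bits)) <= 2

# Usage: code every interior pixel of a toy image.
img = np.random.randint(0, 256, (64, 64))
codes = [lbp_code(img[r - 1:r + 2, c - 1:c + 2])
         for r in range(1, 63) for c in range(1, 63)]
print("fraction of uniform patterns:",
      np.mean([is_uniform(c) for c in codes]))
```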
B. Representation and Recognition
For the 2D modality, appearance-based features have been successfully adopted in face recognition [9]. For the 3D modality, because features computed in 3D coordinates, such as curvatures, are sensitive to data noise, more and more researchers have recently adopted appearance-based features as well. Therefore, in both the intensity and depth modalities we adopt ULBP for face representation, and the NN classifier is used to compute the matching scores.
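The paper does not spell out the matching step, so the following sketch fills it in under stated assumptions: one global 59-bin ULBP histogram per image (the block-wise histograms of [9] are omitted for brevity) and a chi-square nearest-neighbor score, a common choice with LBP histograms. Both choices are ours, not confirmed by the source.

```python
import numpy as np

def transitions(code, bits=8):
    s = [(code >> i) & 1 for i in range(bits)]
    return sum(s[i] != s[(i + 1) % bits] for i in range(bits))

# The 58 uniform 8-bit patterns (at most two circular transitions).
UNIFORM = [c for c in range(256) if transitions(c) <= 2]
INDEX = {c: i for i, c in enumerate(UNIFORM)}

def ulbp_histogram(codes):
    """59-bin histogram: one bin per uniform pattern plus a shared
    bin for all non-uniform patterns, L1-normalized."""
    hist = np.zeros(len(UNIFORM) + 1)
    for c in codes:
        hist[INDEX.get(int(c), len(UNIFORM))] += 1
    return hist / max(hist.sum(), 1)

def nn_score(probe, gallery):
    """NN matching: negated chi-square distance to each gallery
    histogram, so a larger score means a better match."""
    eps = 1e-10
    d = [np.sum((probe - g) ** 2 / (probe + g + eps)) for g in gallery]
    return -min(d), int(np.argmin(d))

# Usage with random codes standing in for real face images:
probe = ulbp_histogram(np.random.randint(0, 256, 4096))
gallery = [ulbp_histogram(np.random.randint(0, 256, 4096)) for _ in range(3)]
score, identity = nn_score(probe, gallery)
```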
III. FUSION AT THE SCORE STAGE
A. Score Normalization
Because the recognition scores from different modalities are on different scales, it is necessary to map them onto the same scale first. To suppress the influence of data noise, we use a modified min-max rule for score normalization [10]. Given a set of matching scores {S_k}, k = 1, 2, ..., n, the normalized scores are given by

S_k^n = (S_k - min) / (max - min)    (2)

where min is the minimum value estimated from {S_k}, corresponding to the pair of images with the most similar appearance. However, max is not the true maximum but the value larger than 95% of the scores in {S_k}. In this way we discard the noisy samples that produce very large matching scores, and the accuracy of the normalization is guaranteed. A training stage is therefore needed to estimate these values (min and max) for score normalization.
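A minimal sketch of the modified min-max rule of Eqn. (2): min is the training minimum, while max is taken as the 95th percentile so that a few noisy, very large scores do not stretch the scale. Clipping the normalized scores to [0, 1] is our assumption; the paper only says the outliers are discarded.

```python
import numpy as np

def fit_minmax(train_scores, percentile=95.0):
    """Estimate normalization parameters on a training set: 'min' is
    the smallest score, 'max' the 95th percentile (Eqn. 2, after [10])."""
    return np.min(train_scores), np.percentile(train_scores, percentile)

def normalize(scores, lo, hi):
    """Map scores to a common scale; values beyond the robust 'max'
    are clipped (our assumption) so outliers cannot dominate fusion."""
    return np.clip((np.asarray(scores, dtype=float) - lo) / (hi - lo), 0.0, 1.0)

# Usage: fit on one modality's training scores, apply to new scores.
train = np.concatenate([np.random.rand(100), [25.0]])  # 25.0 is a noisy outlier
lo, hi = fit_minmax(train)
print(normalize([0.1, 0.5, 25.0], lo, hi))
```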
B. Fusion Strategies
In this paper we select 6 fusion strategies to improve multi-modal face recognition performance, which can be divided into two types: simple fusion methods without a training stage, such as Sum, Product, Max and Min, and complex fusion methods that need a pre-prepared data training stage, such as LDA and SVM. For convenience of description, suppose X_i^j is the i-th sample of class j and there are C classes of samples. The matching scores from each modality for a sample are represented as a feature vector X_i^j = [s_1, s_2, ..., s_N], where N is the number of modalities and s_n is the output of each modality. The fused score is denoted F. The adopted fusion rules are described in detail as follows:
1) Sum Rule:

F = \sum_{i=1}^{N} s_i    (3)

2) Product Rule:

F = \prod_{i=1}^{N} s_i    (4)

3) Max Rule:

F = \max_{i=1}^{N} \{s_i\}    (5)

4) Min Rule:

F = \min_{i=1}^{N} \{s_i\}    (6)
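These four rules are plain reductions over the normalized score vector [s_1, ..., s_N]; a short illustrative sketch (the helper name `fuse` is ours):

```python
import numpy as np

def fuse(scores, rule="product"):
    """Combine the normalized per-modality scores of one sample
    according to Eqns. (3)-(6)."""
    s = np.asarray(scores, dtype=float)
    return {"sum": s.sum(), "product": s.prod(),
            "max": s.max(), "min": s.min()}[rule]

# Usage: a 2D score and a 3D score for one probe/gallery comparison.
for rule in ("sum", "product", "max", "min"):
    print(rule, fuse([0.82, 0.76], rule))
```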
5) Fisher Rule (LDA): The Fisher rule can be summarized as follows. Let the between-class scatter matrix be defined as

S_B = \sum_{i=1}^{c} N_i (u_i - u)(u_i - u)^T    (7)

and the within-class scatter matrix be defined as

S_W = \sum_{i=1}^{c} \sum_{x_k \in X_i} (x_k - u_i)(x_k - u_i)^T    (8)

where u_i is the mean image of class X_i, u is the overall mean, and N_i is the number of images in class X_i. The projection matrix W is then obtained as in Eqn. (9):

W_{opt} = \arg\max_{W} \frac{|W^T S_B W|}{|W^T S_W W|}    (9)

and the matching score is obtained as

F = W \cdot X    (10)

See [11] for a detailed treatment of the Fisher rule.
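One way to apply the Fisher rule to score fusion, sketched below with scikit-learn's LinearDiscriminantAnalysis: genuine and impostor score vectors form the two classes, and the fused score is the projection F = W · X of Eqn. (10). The two-class formulation and the toy scores are our assumptions, consistent with LDA being reported only in verification mode (Tables I and II).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Training score vectors [s_2D, s_3D]: genuine pairs (label 1) and
# impostor pairs (label 0). Toy numbers, for illustration only.
X_train = np.array([[0.90, 0.80], [0.85, 0.90], [0.80, 0.85],   # genuine
                    [0.30, 0.40], [0.20, 0.35], [0.40, 0.30]])  # impostor
y_train = np.array([1, 1, 1, 0, 0, 0])

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Fused score F = W . X (Eqn. 10): projection onto the learned
# Fisher direction, exposed through the fitted coefficients.
W = lda.coef_.ravel()
X_probe = np.array([[0.70, 0.75], [0.35, 0.30]])
print(X_probe @ W)   # larger projection -> more likely genuine
```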
6) Support Vector Machine (SVM): Define the labeled data set as (x_i, y_i), i = 1, ..., n, x_i ∈ R^d, y_i ∈ {+1, -1}, and suppose there exists a hyperplane that separates the positive samples from the negative ones. The points lying on the hyperplane satisfy Eqn. (11):

w \cdot x + b = 0    (11)

where w is the normal to the hyperplane and |b|/||w|| is the perpendicular distance from the hyperplane to the origin. Let d_+ (d_-) be the shortest distance from the hyperplane to the nearest positive (negative) sample; their sum defines the margin. The SVM algorithm simply looks for the separating hyperplane with the largest margin. See [12] for a detailed treatment of the SVM.
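Analogously, an SVM fused score can be the signed distance of the score vector to the max-margin hyperplane of Eqn. (11). A sketch with scikit-learn's SVC follows; the linear kernel and the toy data are our assumptions, as the paper does not name a kernel.

```python
import numpy as np
from sklearn.svm import SVC

# Same two-class genuine/impostor setup as in the LDA sketch.
X_train = np.array([[0.90, 0.80], [0.85, 0.90], [0.80, 0.85],
                    [0.30, 0.40], [0.20, 0.35], [0.40, 0.30]])
y_train = np.array([1, 1, 1, 0, 0, 0])

svm = SVC(kernel="linear").fit(X_train, y_train)

# Signed distance to the hyperplane w.x + b = 0 (Eqn. 11) acts as
# the fused score: positive means the genuine side of the margin.
X_probe = np.array([[0.70, 0.75], [0.35, 0.30]])
print(svm.decision_function(X_probe))
```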
IV. EXPERIMENTAL RESULTS AND DISCUSSION

Fig. 2. Some examples from the CASIA 3D Face Database. The first row shows images of the intensity modality; the second row shows images of the depth modality.

Our experiments are based on the CASIA 3D Face Database, which includes 123 individuals in total. Each individual exhibits variations in illumination, expression and pose, which are the main problems for both the intensity and depth modalities; the database is therefore challenging for multi-modal face recognition. In our experiments, we adopt 10 images per individual. For illumination variation there are two kinds of illumination, a frontal spot light and a right spot light. For expression variation there are 5 neutral expressions and 5 different expressions (smile, laugh, anger, surprise, eyes closed). Some examples from the CASIA 3D Face Database are shown in Fig. 2.
The experiments are arranged as follows: for each individual, 1 image (neutral expression) is stored in the gallery set; 2 images (neutral expressions) are used as the training set, which serves to train the score normalization rule and the fusion rules requiring a training stage, such as LDA and SVM; 3 images (2 neutral and 1 smile) form the small-expression probe set (PS1); and the 4 remaining images (1 laugh, 1 anger, 1 surprise and 1 eyes closed) form the large-expression probe set (PS2).
A. Experiments on PS1
In PS1, the conditions of the test images are similar to those of the training samples, without large illumination or expression variations. To verify the effectiveness of the fusion strategy, we first compute the recognition performance of the single intensity and depth modalities, and then combine the two modalities using the different fusion rules. Recognition performance is measured in two modes: in verification mode the Equal Error Rate (EER) is used, and in classification mode the Correct Recognition Rate (CRR) is used. Both are shown in Table I.
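For reference, the EER reported below can be obtained by sweeping a threshold over fused genuine/impostor scores until the false accept rate meets the false reject rate. The sketch below is illustrative only, not the evaluation code used for the tables.

```python
import numpy as np

def eer(genuine, impostor):
    """Equal Error Rate: find the threshold where the false accept
    rate (FAR) and false reject rate (FRR) are closest, and report
    their average there."""
    best = (np.inf, None)
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)   # impostors wrongly accepted
        frr = np.mean(genuine < t)     # genuine pairs wrongly rejected
        if abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]

# Usage with synthetic fused scores:
genuine = np.random.beta(8, 2, 1000)    # scores clustered near 1
impostor = np.random.beta(2, 8, 1000)   # scores clustered near 0
print(f"EER = {eer(genuine, impostor):.2%}")
```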
TABLE I
EER AND CRR PERFORMANCE OF EXPERIMENTS ON PS1

Methods | 2D     | 3D     | Sum    | Product | Max    | Min    | LDA   | SVM
EER     | 3.90%  | 3.59%  | 1.35%  | 1.03%   | 3.50%  | 2.14%  | 1.21% | 0.83%
CRR     | 94.85% | 96.75% | 98.92% | 99.73%  | 95.39% | 98.64% | N/A   | N/A
B. Experiments on PS2
In PS2, the conditions of the test images are more challenging than in PS1: both the intensity and depth modalities include illumination and expression variations. As for PS1, we first compute the recognition performance of the single intensity and depth modalities, and then combine the two modalities using the different fusion rules. Recognition performance is again measured in both modes, EER for verification and CRR for classification, as shown in Table II.
TABLE II
EER AND CRR PERFORMANCE OF EXPERIMENTS ON PS2

Methods | 2D     | 3D     | Sum    | Product | Max    | Min    | LDA   | SVM
EER     | 11.48% | 11.05% | 5.82%  | 5.62%   | 10.43% | 7.78%  | 5.90% | 5.44%
CRR     | 75.61% | 86.79% | 93.90% | 95.12%  | 79.47% | 90.45% | N/A   | N/A

C. Discussion
From the experimental results, we can make the following observations:
1) The 2D and 3D modalities have similar recognition performance when considered individually.
2) Fusion of the two modalities can substantially improve the performance.
3) Expression variations degrade the recognition performance in both the intensity and depth modalities.
4) Complicated fusion rules that need a training stage, such as LDA, do not always give better recognition performance than simple ones, such as the Product rule.
Although the 2D and 3D modalities both represent appearance information of the individual, they come from different data sources: the 2D modality carries intensity information and the 3D modality carries depth information. In our experiments, on both PS1 and PS2, the two modalities achieve similar performance. This means that it is difficult to tell which modality is better for the recognition task, and that they provide similar score distributions for the fusion strategies at the score level.
From Table I and Table II we can see that the fusion strategies substantially improve the recognition performance over either the intensity or the depth modality considered individually. Even with the simple Sum rule, taking PS1 as an example, the EER improves from 3.90% and 3.59% to 1.35%. This means that the two modalities complement each other, and that fusing them at the score level can boost the recognition performance of a real face recognition system in both verification and classification modes.
Because expression variations directly influence the shape of the face and hence the 3D depth information, it is generally accepted that the 3D modality is more sensitive to expression variation. However, in our experiments, as shown in Table I and Table II, the performance degrades to a similar degree in both modalities. The reasons can be summarized as follows. First, the intensity information represents the light reflected from the face, which is also tightly related to the facial geometric shape. Second, we only use the regions of the 3D face that are robust to expression variations, so the effect of the expressions is relatively weakened.
Finally, comparing the simple fusion rules without a training stage, the Product rule achieves the best performance on both PS1 and PS2, in both verification and classification modes. Among the complex fusion rules, SVM achieves the best verification performance on both PS1 and PS2. The performance of LDA, however, is somewhat disappointing: its EER is slightly higher than that of the Product rule in both experiments, which illustrates the importance of selecting a suitable fusion model to achieve better performance. Simple fusion rules, such as Sum and Product, can achieve robustness and effectiveness simultaneously.
V. CONCLUSION
In this paper, we have made a comparative study of 6 fusion rules at the score level to improve the performance of a multi-modal face recognition system. From our experiments we find that: (1) for the face recognition task, intensity data and depth data achieve similar performance, and they complement each other to compose a more efficient multi-modal face recognition system with substantially better performance; (2) expression variations degrade the performance of the intensity and depth modalities to a similar degree; (3) complicated fusion models do not always give better fusion performance, so it is necessary to select a suitable fusion model according to the task; (4) SVM is a good choice of fusion strategy in verification mode.
REFERENCES
[1] W. Zhao, R. Chellappa, and A. Rosenfeld. Face recognition: a literature survey. ACM Computing Surveys, 35:399-458, 2003.
[2] K. Bowyer, K. Chang, and P. Flynn. A survey of approaches and challenges in 3D and multi-modal 3D + 2D face recognition. Computer Vision and Image Understanding, (1):1-15, January 2006.
[3] C. Beumier and M. Acheroy. Face verification from 3D and grey level clues. Pattern Recognition Letters, pages 1321-1329, October 2001.
[4] Y. Wang, C. Chua, and Y. Ho. Facial feature detection and face recognition from 2D and 3D images. Pattern Recognition Letters, pages 1191-1202, August 2002.
[5] X. Lu and A. K. Jain. Integrating range and texture information for 3D face recognition. Proceedings of the Seventh IEEE Workshops on Application of Computer Vision, pages 156-163, 2005.
[6] F. Tsalakanidou, S. Malassiotis, and M. G. Strintzis. Face localization and authentication using color and depth images. IEEE Transactions on Image Processing, pages 152-168, February 2005.
[7] K. I. Chang, K. W. Bowyer, and P. J. Flynn. An evaluation of multi-modal 2D+3D face biometrics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4):619-624, 2005.
[8] T. Ojala, M. Pietikäinen, and T. Mäenpää. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):971-987, 2002.
[9] T. Ahonen, A. Hadid, and M. Pietikäinen. Face recognition with local binary patterns. European Conference on Computer Vision, pages 469-481, 2004.
[10] A. K. Jain, K. Nandakumar, and A. Ross. Score normalization in multimodal biometric systems. Pattern Recognition, (12):2270-2285, December 2005.
[11] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, Second Edition. John Wiley & Sons, Inc., 2001.
[12] C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121-167, 1998.