
A Comparative Evaluation of Fusion Strategies for a Multimodal Biometric
System using Palmprint and Speech Signal

Mahesh P.K.¹, M.N. ShanmukhaSwamy²
Department of Electronics and Communication,
J.S.S. Research Foundation, Mysore University,
S.J.C.E., Mysore-6
mahesh24pk@gmail.com

Abstract
Person authentication using a single (unimodal) biometric is degraded by features that change over time. Unimodal biometric technology has reached a bottleneck, and attention is increasingly turning to multimodal biometrics. A multimodal biometric system consolidates the evidence presented by multiple biometric sources and typically achieves better recognition performance than a system based on a single biometric modality. This paper proposes an authentication method for a multimodal biometric identification system using two traits, the speech signal and the palmprint. Integrating two biometric traits increases the robustness of person authentication. The paper also evaluates different score-level fusion techniques and reports the results of a variety of fusion experiments using palmprint and speech-signal data from 120 individuals. Four score-level fusion techniques were implemented and evaluated. They differed in effectiveness, in the types of training data required, and in the complexity of modeling the genuine and impostor distributions. Multimodal fusion is highly effective: fusing one palmprint and one speech signal resulted in a 60-85% reduction in the false reject rate at a constant false accept rate of 0.01.

Keywords: Biometrics, multimodal, speech signal,
palmprint, fusion, matching score.

1. Introduction
Despite considerable advances in recent years, there are
still serious challenges in obtaining reliable
authentication through unimodal biometric systems.
These are due to a variety of reasons. For instance, there
are problems with enrolment due to the non-universal
nature of relevant biometric traits. Equally troublesome is
biometric spoofing. Moreover, the environmental noise
effects on the data acquisition process can lead to
deficient accuracy which may disable systems, virtually
from inception [1]. Speaker verification, for instance,
degrades rapidly in noisy environments. Some of the
limitations imposed by unimodal biometrics systems can
be overcome by using multiple biometric modalities.
Multiple evidence provision through multimodal
biometric data acquisition may focus on multiple samples
of a single biometric trait, designated as multi-sample
biometrics. It may also focus on samples of multiple
biometric types. This is termed multimodal biometrics.
Higher accuracy and greater resistance to spoofing are
basic advantages of multimodal biometrics over
unimodal biometrics.
The fusion of the complementary information in
multimodal biometric data has been a research area of
considerable interest, as it plays a critical role in
overcoming certain important limitations of unimodal
systems. The efforts in this area are mainly focused on
fusing the information obtained from a variety of
independent modalities. For instance, a popular approach
is to combine palmprint and speech modalities to achieve
a more reliable recognition of individuals. Through such
an approach, separate information from different
modalities is used to provide complementary evidence
about the identity of the users. In such scenarios, fusion is
normally at the score level. This is because the individual
modalities provide different raw data types, and involve
different classification methods for discrimination. To
date, a number of score-level fusion techniques have
been developed for this task [2].
The proposed paper shows that integration of speech
signal and palmprint biometrics can achieve higher
performance that may not be possible using a single
biometric indicator alone. 2D Gabor filter with Hamming
distance and Mel Frequency Cepstral Coefficients
(MFCC) with Gaussian Mixture Model (GMM) are used
for feature vector fusion context for palmprint and speech
signal respectively. This paper investigates the
effectiveness of various fusion approaches based on two
biometrics (palmprint and speech signal).
The rest of this paper is organized as follows.
Section 2 presents the system structure, which is used to
increase recognition quality. Section 3 presents feature
extraction using MFCC and the 2D Gabor filter. Section 4
describes range-normalization, and Section 5 presents the
fusion techniques used to combine the individual traits at
the matching-score level. The experimental results are
given in Section 6, and conclusions in the last section.

2. System Structure
The multimodal biometric system is developed using two
traits, the speech signal and the palmprint, as shown in
Figure 1. The palmprint is represented with a 2D Gabor
filter and its matching score is calculated using the
Hamming distance, while the speech signal is represented
with MFCC features and matched using a GMM. The module
for each individual trait returns a matching score after
comparing the database and query feature vectors. The
final score is generated by fusing the two scores with a
fusion technique and is then passed to the decision module.


Figure 1. Block diagram of speech signal and palmprint multimodal
biometric system.

3. Feature Extraction using MFCC and
Gabor filter

3.1. Feature extraction using MFCC
Feature extraction is the first component of an automatic
speaker recognition system [3]. This phase transforms the
speech signal into a set of feature vectors, also called
parameters. The aim of this transformation is to obtain a
new representation which is more compact,


Figure 2. Components of a speaker recognition system

less redundant, and more suitable for statistical modeling
and distance computation. Most speech parameterizations
used in speaker recognition systems rely on a cepstral
representation of the speech signal [4].
The Mel-frequency cepstral coefficients (MFCC) are
motivated by studies of the human peripheral auditory
system. First, the speech signal x(n) is divided into Q
short time windows, which are converted into the spectral
domain by a Discrete Fourier Transform (DFT). The
magnitude spectrum of each time window is then
smoothed by a bank of triangular bandpass filters (Figure 3)
that emulate the critical-band processing of the human ear.


Figure 3. Mel filter bank

Each bandpass filter H(k, m) computes a weighted average
of its subband, which is then logarithmically compressed:

X'(m) = \ln\left( \sum_{k=0}^{N-1} |X(k)| \, H(k, m) \right)        (1)

where X(k) is the DFT of a time window of the signal x(n)
of length N; the index k, k = 0, ..., N-1, corresponds to
the frequency f_k = k f_s / N, with f_s the sampling
frequency; the index m, m = 1, ..., M with M << N, is the
filter number; and the filters H(k, m) are triangular
filters defined by the center frequencies f_c(m)
(Sigurdsson et al., 2006). The log-compressed filter
outputs X'(m) are then decorrelated using the Discrete
Cosine Transform (DCT):

c(l) = \sum_{m=1}^{M} X'(m) \cos\left( \frac{\pi l}{M} \left( m - \frac{1}{2} \right) \right)        (2)

where c(l) is the l-th MFCC of the considered time window.

Figure 4. Extraction of MFCC (pre-emphasis and windowing,
DFT, Mel-frequency filter bank, log compression, DCT).

There are several analytic formulae for the Mel scale used
to compute the center frequencies f_c(m). In this study we
use the following common mapping:

B(f) = 2595 \log_{10}\left( 1 + \frac{f}{700} \right)        (3)

3.2. The Gaussian Mixture Model
In this study, a Gaussian Mixture Model approach
proposed in [5] is used where speakers are modeled as a
mixture of Gaussian densities. The use of this model is
motivated by the interpretation that the Gaussian
components represent some general speaker-dependent
spectral shapes and the capability of Gaussian mixtures to
model arbitrary densities.
The Gaussian Mixture Model is a linear combination of M
Gaussian mixture densities, given by

p(\vec{x} \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(\vec{x})        (4)

where \vec{x} is a D-dimensional random vector; b_i(\vec{x}),
i = 1, ..., M, are the component densities; and p_i,
i = 1, ..., M, are the mixture weights. Each component
density is a D-dimensional Gaussian function of the form

b_i(\vec{x}) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (\vec{x} - \vec{\mu}_i)^T \Sigma_i^{-1} (\vec{x} - \vec{\mu}_i) \right\}        (5)


where \vec{\mu}_i denotes the mean vector and \Sigma_i
denotes the covariance matrix. The mixture weights satisfy
the law of total probability, \sum_{i=1}^{M} p_i = 1. The
major advantage of this representation of speaker models is
its mathematical tractability: the complete Gaussian
mixture density is represented by only the mean vectors,
covariance matrices, and mixture weights of all component
densities.
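A minimal NumPy sketch of Eqs. (4)-(5), evaluating a Gaussian mixture density directly (function names are ours; a real system would fit the weights, means, and covariances with EM, as in [5]):

```python
import numpy as np

def gaussian_density(x, mu, cov):
    # Eq. (5): D-dimensional Gaussian component density b_i(x).
    D = len(mu)
    diff = x - mu
    norm = 1.0 / ((2.0 * np.pi) ** (D / 2.0) * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def gmm_density(x, weights, means, covs):
    # Eq. (4): p(x | lambda) = sum_i p_i b_i(x), with sum_i p_i = 1.
    return sum(p * gaussian_density(x, mu, cov)
               for p, mu, cov in zip(weights, means, covs))
```

For speaker verification, the log of this density would be accumulated over all frames of an utterance to produce a matching score.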

3.3. Feature Extraction and Coding (Gabor Filter)
We use the 2D Gabor phase coding scheme of [6] for
palmprint representation. The circular Gabor filter is an
effective tool for texture analysis and has the following
general form:

G(x, y, \theta, u, \sigma) = \frac{1}{2\pi\sigma^2} \exp\left\{ -\frac{x^2 + y^2}{2\sigma^2} \right\} \exp\left\{ 2\pi i \left( ux\cos\theta + uy\sin\theta \right) \right\}        (6)

where i = \sqrt{-1}, u is the frequency of the sinusoidal
wave, \theta controls the orientation of the function, and
\sigma is the standard deviation of the Gaussian envelope.
To make it more robust against brightness, the discrete
Gabor filter G(x, y, \theta, u, \sigma) is tuned to zero DC
(direct current) by applying the following formula:

\tilde{G}(x, y, \theta, u, \sigma) = G(x, y, \theta, u, \sigma) - \frac{\sum_{i=-n}^{n} \sum_{j=-n}^{n} G(i, j, \theta, u, \sigma)}{(2n+1)^2}        (7)

where (2n+1)^2 is the size of the filter. In fact, the
imaginary part of the Gabor filter automatically has zero
DC because of its odd symmetry. The adjusted Gabor filter
is used to filter the preprocessed images.
It should be pointed out that the success of 2D Gabor
phase coding depends on the selection of the Gabor filter
parameters \theta, \sigma, and u. In our system, we applied
a tuning process to optimize the selection of these three
parameters. As a result, one Gabor filter with optimized
parameters \theta = \pi/4, u = 0.0916, and \sigma = 5.6179
is used to generate a feature vector with 2,048 dimensions.
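Eqs. (6)-(7) can be sketched as follows; the grid layout and function names are our assumptions for the sketch:

```python
import numpy as np

def circular_gabor(n, u, theta, sigma):
    # Eq. (6): complex circular Gabor filter on a (2n+1) x (2n+1) grid.
    y, x = np.mgrid[-n:n + 1, -n:n + 1]
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
    carrier = np.exp(2j * np.pi * u * (x * np.cos(theta) + y * np.sin(theta)))
    return envelope * carrier

def zero_dc_gabor(n, u, theta, sigma):
    # Eq. (7): subtract the filter's mean so it has zero DC response
    # (the imaginary part is already zero-DC by odd symmetry).
    G = circular_gabor(n, u, theta, sigma)
    return G - G.mean()
```

Filtering a preprocessed 128x128 palmprint with this kernel and quantizing the phase of the response yields the real/imaginary bit planes matched in Section 3.4.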

3.4. Hamming Distance
Given two data sets, a matching algorithm determines the
degree of similarity between them. To describe the
matching process clearly, we use feature matrices with
real and imaginary parts. The normalized Hamming distance
used in [6] is adopted as the similarity measure for
palmprint matching. Let P and Q be two palmprint feature
matrices. The normalized Hamming distance can be described
as

D_0 = \frac{ \sum_{i=1}^{N} \sum_{j=1}^{N} P_M(i,j) \wedge Q_M(i,j) \wedge \left( P_R(i,j) \oplus Q_R(i,j) \right) \; + \; \sum_{i=1}^{N} \sum_{j=1}^{N} P_M(i,j) \wedge Q_M(i,j) \wedge \left( P_I(i,j) \oplus Q_I(i,j) \right) }{ 2 \sum_{i=1}^{N} \sum_{j=1}^{N} P_M(i,j) \wedge Q_M(i,j) }        (8)

where P_R (Q_R), P_I (Q_I), and P_M (Q_M) are the real
part, the imaginary part, and the mask of P (Q),
respectively. The result of the Boolean operator \oplus is
zero if and only if the two bits P_R(i, j) and Q_R(i, j)
are equal; the symbol \wedge represents the AND operator;
and the size of the feature matrices is N x N. Note that
D_0 lies between 0 and 1; for the best matching, the
Hamming distance should be zero. Because of imperfect
preprocessing, we need to translate one of the features
vertically and horizontally and match again. The ranges of
the vertical and horizontal translations are defined from
-2 to 2. The minimum D_0 value obtained from the
translated matchings is considered to be the final
matching score.
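The matching of Eq. (8), including the -2..2 translation search, can be sketched as follows (the cyclic `np.roll` shift is our simplification of the paper's translation, and the function names are ours):

```python
import numpy as np

def normalized_hamming(P_R, P_I, P_M, Q_R, Q_I, Q_M):
    # Eq. (8): fraction of valid (jointly unmasked) bits that differ
    # between the real and imaginary phase codes of two palmprints.
    valid = P_M & Q_M
    differing = (valid & (P_R ^ Q_R)).sum() + (valid & (P_I ^ Q_I)).sum()
    return differing / (2.0 * valid.sum())

def best_match(P, Q, shift=2):
    # Translate Q by -shift..shift both vertically and horizontally
    # and keep the minimum D0 as the final matching score.
    P_R, P_I, P_M = P
    best = 1.0
    for dy in range(-shift, shift + 1):
        for dx in range(-shift, shift + 1):
            Q_R, Q_I, Q_M = (np.roll(np.roll(a, dy, 0), dx, 1) for a in Q)
            best = min(best, normalized_hamming(P_R, P_I, P_M, Q_R, Q_I, Q_M))
    return best
```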

4. Range-Normalization
Range-normalization is also known as score normalization
[7, 8, 9]. It brings raw scores from different matchers
into the same range. Range-normalization is a necessary
step in any fusion system, as fusing the scores without it
would de-emphasize the contribution of the matcher with
the lower range of scores. There are various well-known
range-normalization techniques (e.g., Min-Max, Z-score,
Tanh, Median-MAD, Double-sigmoid). Min-Max and Z-score
have, in most cases, been shown to be among the most
effective and widely used methods for this purpose [8, 10].
In our approach, Z-score normalization is used.

4.1. Z-score Normalization (ZS)
Z-score normalization converts the scores to a
distribution with a mean of 0 and a standard deviation of
1. It retains the original distribution of the scores;
however, the numerical range after Z-score normalization
is not fixed. Z-score normalization is given as

n' = \frac{n - \mu}{\sigma}        (9)

where n is any raw score, and \mu and \sigma are the mean
and standard deviation of the matcher-specific scores,
computed on some development data.
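Eq. (9) in code, with the mean and standard deviation estimated on development scores (function name is ours):

```python
import numpy as np

def zscore_normalize(scores, dev_scores):
    # Eq. (9): n' = (n - mu) / sigma, with mu and sigma estimated on
    # development data from the same matcher.
    mu = np.mean(dev_scores)
    sigma = np.std(dev_scores)
    return (np.asarray(scores) - mu) / sigma
```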

5. Fusion
Multibiometric fusion refers to the fusion of multiple
biometric indicators. Such systems seek to improve the
speed and reliability (accuracy) of a biometric system by
integrating matching scores obtained from multiple
biometric sources.

5.1. Matcher Weighting using False Acceptance Rate
and False Rejection Rate (MW FAR/FRR)
This fusion technique also applies only when two matchers
are combined. Here the performance of the individual
matchers determines the weights, so that smaller error
rates result in larger weights. The performance of the
system is measured by the False Acceptance Rate (FAR) and
the False Rejection Rate (FRR). These two types of errors
are computed at different thresholds; the threshold that
minimizes the absolute difference between FAR and FRR on
the development set is then used. The weights for the
respective matchers are computed as follows:

w_1 = \frac{1 - (FAR_1 + FRR_1)}{2 - (FAR_1 + FRR_1) - (FAR_2 + FRR_2)}        (10)

and

w_2 = \frac{1 - (FAR_2 + FRR_2)}{2 - (FAR_1 + FRR_1) - (FAR_2 + FRR_2)}        (11)


where FAR_1, FRR_1, and w_1 are the false acceptance rate,
false rejection rate, and weight for one matcher, and
FAR_2 and FRR_2 are the false acceptance rate and false
rejection rate for the other matcher, with weight w_2.
Note that each weight (obtained on some development data)
lies in the interval [0, 1], with the constraint
w_1 + w_2 = 1. The fused score using the different
matchers is given as

f = w_1 x_1 + w_2 x_2        (12)

where x_m is the normalized score of matcher m and f is
the fused score.
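Eqs. (10)-(12) can be sketched directly (function names are ours):

```python
def far_frr_weights(far1, frr1, far2, frr2):
    # Eqs. (10)-(11): smaller total error -> larger weight; w1 + w2 = 1.
    denom = 2.0 - (far1 + frr1) - (far2 + frr2)
    w1 = (1.0 - (far1 + frr1)) / denom
    w2 = (1.0 - (far2 + frr2)) / denom
    return w1, w2

def fuse_two(x1, x2, w1, w2):
    # Eq. (12): weighted sum of the two normalized matcher scores.
    return w1 * x1 + w2 * x2
```

For example, a matcher with FAR = FRR = 0.01 receives a larger weight than one with FAR = FRR = 0.05.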

5.2. Matcher Weighting based on Equal Error Rate
(MW - EER)
The matcher weights in this case depend on the Equal
Error Rates (EER) of the matchers intended for fusion.
These EERs are computed on the given development data. The
EER of matcher m is denoted E_m, m = 1, 2, ..., M, and the
weight w_m associated with matcher m is computed as

w_m = \frac{1 / E_m}{\sum_{m=1}^{M} 1 / E_m}        (13)

Note that 0 \le w_m \le 1. The weights are inversely
proportional to the corresponding errors of the individual
matchers: the weights for less accurate matchers are lower
than those for more accurate matchers. The fused score is
calculated as

f = \sum_{m=1}^{M} w_m x_m        (14)

where f is the fused score, x_m is the normalized match
score from the m-th matcher, and w_m is the corresponding
weight.
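A sketch of Eqs. (13)-(14) (function names are ours):

```python
def eer_weights(eers):
    # Eq. (13): w_m proportional to 1/E_m, normalized to sum to one.
    inverses = [1.0 / e for e in eers]
    total = sum(inverses)
    return [v / total for v in inverses]

def fuse(scores, weights):
    # Eq. (14): f = sum_m w_m x_m over the normalized matcher scores.
    return sum(w * x for w, x in zip(weights, scores))
```

For instance, matchers with EERs of 1% and 3% receive weights of 0.75 and 0.25, respectively.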

5.3. Logistic Regression (LR)
Another simple classification method that can be used for
a two-class problem (clients / impostors) is based on the
principles of logistic regression [11, 12-14]. The
logistic regression method classifies the data using two
functions, the logistic regression function and the logit
transformation, as follows:

E(Y \mid x) = \frac{e^{g(x)}}{1 + e^{g(x)}}        (15)

where E(Y|x) is the conditional probability of the binary
output variable Y given the M-dimensional input vector
x = (x_1, x_2, ..., x_M), and g(x) is defined as

g(x) = w_0 + w_1 x_1 + ... + w_M x_M        (16)

where w_m is the weight for the m-th modality. Because
each w_m with m \ne 0 multiplies one of the M modalities,
it can be interpreted as the level of importance of that
modality in the fusion process: a high w_m indicates an
important modality, whilst a low w_m indicates a modality
that does not contribute a great deal. The parameters of
the above equation (w_0, w_1, ..., w_M) can be calculated
with the maximum-likelihood approach. The outcome is then
compared with an optimal threshold calculated on the
development data.
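Eqs. (15)-(16) in code; the maximum-likelihood training of the weights is omitted, and the function names are ours:

```python
import numpy as np

def logistic(g):
    # Eq. (15): E(Y|x) = e^g / (1 + e^g), written in the numerically
    # equivalent form 1 / (1 + e^(-g)).
    return 1.0 / (1.0 + np.exp(-g))

def lr_fuse(x, w):
    # Eq. (16): g(x) = w0 + w1*x1 + ... + wM*xM, then the logistic link.
    g = w[0] + np.dot(w[1:], x)
    return logistic(g)

def accept(x, w, threshold=0.5):
    # Compare the posterior against a threshold chosen on development data.
    return lr_fuse(x, w) >= threshold
```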

5.4. Quadratic Discriminant Analysis (QDA)
This technique is similar to Fisher Linear Discriminant
(FLD) analysis but forms the boundary between the two
classes using a quadratic equation given as [15]

h(x) = x^T A x + B^T x + C        (17)

For training data from two different classes distributed
as N(\mu_i, S_i), i \in \{1, 2\}, the transformation
parameters A and B can be obtained on the development
data as

A = \frac{1}{2} \left( S_1^{-1} - S_2^{-1} \right)        (18)

B = S_2^{-1} \mu_2 - S_1^{-1} \mu_1        (19)

C is a constant that depends on the mean vectors and
covariance matrices and is computed as follows:

C = \frac{1}{2} \left( \mu_1^T S_1^{-1} \mu_1 - \mu_2^T S_2^{-1} \mu_2 \right) + \frac{1}{2} \ln \frac{|S_1|}{|S_2|}        (20)
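A sketch of the quadratic boundary of Eq. (17); the sign convention used here (A, B, and C taken from the class-2 vs class-1 log-likelihood ratio, so positive scores favor class 2) is our assumption, and the function names are ours:

```python
import numpy as np

def qda_params(mu1, S1, mu2, S2):
    # Quadratic-boundary parameters from per-class means and covariances,
    # derived as the log-likelihood ratio ln p2(x) - ln p1(x).
    S1_inv, S2_inv = np.linalg.inv(S1), np.linalg.inv(S2)
    A = 0.5 * (S1_inv - S2_inv)
    B = S2_inv @ mu2 - S1_inv @ mu1
    C = (0.5 * (mu1 @ S1_inv @ mu1 - mu2 @ S2_inv @ mu2)
         + 0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2)))
    return A, B, C

def qda_score(x, A, B, C):
    # Eq. (17): h(x) = x^T A x + B^T x + C; the sign of h(x) decides the class.
    return x @ A @ x + B @ x + C
```

With equal covariances the quadratic term vanishes and the boundary reduces to the linear (FLD) case.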


6. Experimental results
We evaluate the proposed multimodal system on a data set
comprising 720 pairs of palmprint images and speech
signals from 120 subjects. The training database contains
speech signals and palmprint images for each subject: each
subject has 6 palm images and 6 different words taken at
different time intervals, which are stored in the
database. Before extracting palmprint features, we
normalize the palmprint images to 128x128 pixels.
The multimodal system has been designed at the
matching-score level. In the first experiment, the
individual systems were developed and tested for FAR, FRR,
and accuracy. In the last experiment, the two traits were
combined using the different fusion techniques and
compared. The results are found to be very encouraging and
promising for research in this field. The overall accuracy
of the system is more than 99%, with an EER of less than
1.21%. Table 1 shows accuracy in terms of EER for the
different fusion techniques.


Figure 5. Accuracy vs. threshold curves for four different fusion
techniques.

TABLE 1
Verification results in terms of EER (%), based on score-level fusion.

Fusion candidates: Palmprint (Gabor features, Hamming distance
classifier) + Speech (MFCC features, GMM classifier)

  MW-FAR/FRR: 0.85    MW-EER: 0.64    LR: 0.47    QDA: 1.21

Figure 6. Comparison of modalities measured at FRR = 10^-2.

7. Conclusion
Biometric systems are widely used to overcome the
limitations of traditional methods of authentication, but
a unimodal biometric system fails when the biometric data
for its particular trait is unreliable or unavailable.
Thus, the individual scores of the two traits (speech
signal and palmprint) are combined at the matching-score
level to develop a multimodal biometric system. The
performance table shows that the multimodal system
performs better than unimodal biometrics, with an accuracy
of more than 97%.

8. References
[1] U. M. Bubeck, "Multibiometric authentication: An
overview of recent developments," in Term Project
CS574 Spring. San Diego State University, 2003.
[2] C. Sanderson and K. K. Paliwal, "Identity
Verification using Speech and Face Information,"
Digital Signal Processing, vol. 14, pp.449-480, 2004.
[3] G. Feng, K. Dong, D. Hu, and D. Zhang, "When Faces
Are Combined with Palmprints: A Novel Biometric Fusion
Strategy," in Proceedings of ICBA, pp. 701-707, 2004.
[4] G. Feng, K. Dong, D. Hu, and D. Zhang, "When Faces
Are Combined with Palmprints: A Novel Biometric Fusion
Strategy," ICBA, pp. 701-707, 2004.
[5] D. A. Reynolds, "Experimental Evaluation of Features
for Robust Speaker Identification," IEEE Transactions
on Speech and Audio Processing, vol. 2, pp. 639-643, 1994.
[6] J. G. Daugman, "High Confidence Visual Recognition
of Persons by a Test of Statistical Independence,"
IEEE Trans. Pattern Analysis and Machine Intelligence,
vol. 15, no. 11, pp. 1148-1161, Nov. 1993.
[7] A. K. Jain, K. Nandakumar, and A. Ross, "Score
Normalisation in Multimodal Biometric Systems,"
Pattern Recognition, vol. 38, pp. 2270-2285, 2005.
[8] R. Snelick, U. Uludag, A. Mink, M. Indovina, and A.
K. Jain, "Large Scale Evaluation of Multimodal
Biometric Authentication Using State-of-the-Art
Systems," IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 27, pp. 450-455, 2005.
[9] N. Poh and S. Bengio, "A Study of the Effects of
Score Normalisation Prior to Fusion in Biometric
Authentication Tasks," IDIAP Research Report No.
IDIAPRR 04-69, 2004.
[10] K. Nandakumar, "Integration of Multiple Cues in
Biometric Systems," M.S. Thesis, Michigan State
University, 2005.
[11] P. Verlinde, P. Druyts, G. Chollet, and M. Acheroy,
"A multi-level data fusion approach for gradually
upgrading the performances of identity verification
systems," Sensor Fusion: Architectures, Algorithms
and Application III, vol. 3719, pp. 14-25, 1999.
[12] B. D. Ripley, Pattern Recognition and Neural
Networks. U.K: Cambridge University 1996.
[13] D. W. Hosmer and S. Lemeshow, Applied Logistic
Regression. John Wiley & Sons, 1989.
[14] Y. So, "A Tutorial on Logistic Regression," SAS
Institute Inc, 1995.
[15] B. Flury, Common Principal Components and Related
Multivariate Models. USA: John Wiley and Sons,
1988.
