Professional Documents
Culture Documents
Abstract— In a Speaker Recognition (SR) system, feature extraction is one of the crucial steps, where the particular speaker-related information is extracted. The state-of-the-art algorithm for this purpose is the Mel Frequency Cepstral Coefficient (MFCC), together with its complementary feature, the Inverted Mel Frequency Cepstral Coefficient (IMFCC). MFCC is based on the mel scale, and IMFCC is based on the inverted mel (imel) scale. In this paper, another complementary set of features is proposed, also based on the mel and imel scales, in which the filtering operation makes the features different from MFCC and IMFCC: the filter banks are placed linearly on the nonlinear scale, which distinguishes the proposed features from the state-of-the-art feature extraction techniques. We call these two features mMFCC and mIMFCC. mMFCC is based on the mel scale, whereas mIMFCC is based on the imel scale. mMFCC is compared with MFCC, and mIMFCC with IMFCC. The results have been verified on two standard databases, YOHO and POLYCOST, using the Gaussian Mixture Model (GMM) as the speaker modeling paradigm.

Keywords— MFCC, IMFCC, mMFCC, mIMFCC, Feature extraction, Fusion, Triangular filter, GMM.

I. INTRODUCTION

MFCC is based on the human auditory system, aiming at an artificial implementation of the ear physiology. Its complementary part, IMFCC [4], also exists; the idea is to capture the information which could otherwise be missed by MFCC. MFCC captures speaker-specific information lying in the lower frequency part more accurately than in the higher frequency part, whereas IMFCC captures speaker-specific information in the higher frequency part more accurately than in the lower frequency part. This is due to the non-linear distribution of the filterbanks on the linear frequency scale.

Despite the fact that MFCC is based on the human auditory system, it is not the best feature extraction technique for speaker recognition; it still provides scope for improvement. In this paper, that improved feature is discussed.

978-1-4799-8792-4/15/$31.00 ©2015 IEEE

II. SPEAKER RECOGNITION SYSTEM

In this section, a basic speaker recognition system [5] based on GMM [6,7] is described. The overall block diagram is given in Fig. 1.

[Fig. 1: Block diagram of the speaker recognition system (train/test data flow)]
and separate the voice samples which contain mostly speaker-related information. These voice portions are passed through the pre-emphasis filter, which emphasizes the higher frequencies of the spectrum. In the next step, the speech signal is divided into 20 ms frames to make the signal stationary: speech is normally a quasi-periodic signal, non-stationary in nature, but after the framing operation each frame can be treated as stationary. Here, a 50% overlap with the previous samples is used, and each frame is multiplied by a Hamming window to reduce spectral leakage in the spectrum. After that, each frame is sent for feature extraction, and using these features a model of the particular speaker is generated. The details of the feature extraction are described as follows.

Feature Extraction:

Feature extraction is a mapping process from a higher dimension to a lower dimension. When the input data to an algorithm is too large to be processed, it is transformed into a reduced set of feature vectors. A feature vector is an n-dimensional vector of numerical features.

In the case of speaker recognition, these features represent the speaker's characteristics. If they are carefully extracted, they will give relevant information to perform the desired task.

MFCC is an acronym for Mel Frequency Cepstral Coefficient. It is a feature motivated by the human auditory system. The frequency perceived by us, i.e., the pitch, is different from what is emitted by the source. The mel is the unit of pitch; it is logarithmic in nature. The mel scale relates perceived frequency, i.e., pitch, to its actual measured frequency.

Let s(n) represent the speech frame, where N is the number of samples (here, N = 160). The DFT S(k) of each frame is calculated, and the energy of each frame is computed by the following equation:

    E(k) = |S(k)|^2,  k = 0, 1, ..., N-1        (1)

The energy spectrum is then passed through the filterbank, where Q is the number of filterbanks. The relation between the mel scale [1] and the linear frequency scale is given by

    f_mel = 2595 · log10(1 + f/700)        (4)

IMFCC is an acronym for Inverted Mel Frequency Cepstral Coefficient. It has complementary characteristics to MFCC. The relation between the imel scale and the linear frequency scale is given by

    f_imel = 2195.2860 − 2595 · log10(1 + (4031.25 − f)/700)        (5)

The human auditory system can be modeled by a set of filters which are uniformly spaced on the mel scale. Since there is a non-linear relationship between the mel scale and the linear frequency scale, when the filters are mapped back to the linear frequency scale using the MFCC curve, the equation being

    f = 700 · (10^(f_mel/2595) − 1)        (6)

the filters are non-uniformly spaced on the linear frequency scale. The filterbank density is high in the low-frequency region and low in the high-frequency region. Frequency bins are equally spaced on the linear frequency scale, so the number of bins present in each filter on the linear frequency scale is not the same. If we map the filters to the linear frequency scale using the IMFCC curve, the equation being

    f = 4031.25 − 700 · (10^((2195.2860 − f_imel)/2595) − 1)        (7)
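The front-end steps above (20 ms frames at 8 kHz with 50% overlap, Hamming windowing, the magnitude-squared DFT of eq (1), and the scale mappings of eqs (4)-(7)) can be sketched as follows. The function names are illustrative, not from the paper; the constants follow the equations in the text.

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale, eq (4)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse mel mapping, eq (6)."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def hz_to_imel(f):
    """Inverted mel (imel) scale, eq (5), for an 8 kHz sampling rate."""
    return 2195.2860 - 2595.0 * np.log10(
        1.0 + (4031.25 - np.asarray(f, dtype=float)) / 700.0)

def imel_to_hz(m):
    """Inverse imel mapping, eq (7)."""
    return 4031.25 - 700.0 * (
        10.0 ** ((2195.2860 - np.asarray(m, dtype=float)) / 2595.0) - 1.0)

def frame_power_spectrum(signal, frame_len=160, hop=80):
    """Split the signal into frames (20 ms at 8 kHz, 50% overlap),
    apply a Hamming window, and return |DFT|^2 per frame, as in eq (1)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2
```

Round-tripping a frequency through either scale recovers it, which is a quick sanity check on the constants in eqs (4)-(7).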
2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI) 1053
the filterbank density is low in the high-frequency region and high in the low-frequency region. Again, the number of bins present in each filter on the linear frequency scale is not the same.

Modeling:

By modeling, we extract and represent the desired information from the spectral sequence. The distribution property of a speaker is captured. These distributions have parameters extracted from feature vectors which contain the speaker characteristics. We use here the Gaussian Mixture Model [8]. Its parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm. These parameters are the mean vectors, covariance matrices and mixture weights of all component densities. The parameters, which collectively constitute the speaker model, are represented by the notation λ = {w_i, μ_i, Σ_i}, where i = 1 to M. A Gaussian mixture model is a weighted sum of M component Gaussian densities, as given by the equation

    p(x|λ) = ∑_{i=1}^{M} w_i · g(x|μ_i, Σ_i)        (8)

    g(x|μ_i, Σ_i) = 1/((2π)^(D/2) |Σ_i|^(1/2)) · exp(−(1/2)(x − μ_i)' Σ_i^(−1) (x − μ_i))        (9)

D is the dimension of the feature space, and x is a D-dimensional continuous-valued feature vector. The w_i are the mixture weights, satisfying the stochastic constraint

    ∑_{i=1}^{M} w_i = 1.

The g(x|μ_i, Σ_i) are the component Gaussian densities.

Pattern Matching:

In the testing phase, a sequence of T test feature vectors X = {x_1, ..., x_T} is scored against a speaker model λ by the log-likelihood

    log p(X|λ) = ∑_{t=1}^{T} log p(x_t|λ)        (10)

The frequency bins can be mapped to the mel scale and the imel scale using the MFCC and IMFCC curves (Fig. 1), depending upon the technique used. Earlier, twenty-two points were mapped from one scale to the other; now, 128 points will be mapped from one scale to the other. The frequency bins will then no longer be equally spaced on the mel and imel scales. Thus, each filter will contain a different number of frequency bins.

[Fig. 3: Frequency bins and mMFCC filter bank; frequency vs. pitch]
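The GMM formulation of eqs (8)-(9), EM training, and the scoring of eq (10) can be sketched as below. This is a simplified illustration, not the authors' exact configuration: it assumes diagonal covariances (a common choice for GMM speaker models), a random initialization from the data, and a fixed iteration count.

```python
import numpy as np

def _log_densities(X, mu, var):
    """Per-sample, per-component log of the diagonal Gaussian g of eq (9)."""
    return -0.5 * (np.log(2 * np.pi * var).sum(axis=1)[None, :]
                   + (((X[:, None, :] - mu[None]) ** 2) / var[None]).sum(axis=2))

def fit_gmm(X, M=8, n_iter=50, seed=0):
    """EM for a diagonal-covariance GMM; returns the model lambda of eq (8):
    weights w (M,), means mu (M, D), variances var (M, D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, M, replace=False)]        # random init from the data
    var = np.tile(X.var(axis=0) + 1e-6, (M, 1))
    w = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # E-step: responsibilities r_{ti} = w_i g(x_t) / sum_j w_j g(x_t)
        log_p = _log_densities(X, mu, var) + np.log(w)[None, :]
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        r = np.exp(log_p - log_norm)                # (N, M)
        # M-step: re-estimate weights, means, variances
        Nk = r.sum(axis=0) + 1e-12
        w = Nk / N                                  # satisfies sum_i w_i = 1
        mu = (r.T @ X) / Nk[:, None]
        var = (r.T @ (X ** 2)) / Nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

def log_likelihood(X, w, mu, var):
    """Score a test sequence against the model, eq (10)."""
    log_p = _log_densities(X, mu, var) + np.log(w)[None, :]
    return np.logaddexp.reduce(log_p, axis=1).sum()
```

In the testing phase, eq (10) is evaluated against every enrolled speaker's model, and the claimed identity is the model with the highest score.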
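The bin-mapping idea behind the proposed features described above — equally spaced DFT bins land non-uniformly on the mel axis, so filters placed uniformly (linearly) on that axis cover different numbers of bins — can be illustrated with a short sketch. The 128-point setting follows the text; the filter count of 20 is an illustrative assumption.

```python
import numpy as np

def bins_per_uniform_mel_filter(n_bins=128, fs=8000, n_filters=20):
    """Map equally spaced DFT bin centre frequencies onto the mel scale
    (eq (4)) and count how many bins fall inside each of n_filters bands
    placed uniformly on the mel axis."""
    freqs = np.linspace(0, fs / 2, n_bins)            # equally spaced in Hz
    mels = 2595.0 * np.log10(1.0 + freqs / 700.0)     # non-uniform in mel
    edges = np.linspace(mels[0], mels[-1], n_filters + 1)
    counts = np.histogram(mels, bins=edges)[0]
    return counts

counts = bins_per_uniform_mel_filter()
# Low-frequency filters contain few bins, high-frequency filters many:
print(counts)
```

The uneven counts are exactly the "each filter will contain a different number of frequency bins" behaviour; repeating the exercise with the imel mapping of eq (5) reverses the trend, which is what makes mMFCC and mIMFCC complementary.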
For mMFCC, the higher filters have highly dense frequency bins; on the other hand, for mIMFCC, the lower filters have dense frequency bins. This makes the features complementary to each other. Therefore, the proposed features can be combined as a complementary-information feature set for fusion work.

The advantage of complementary information is found effective in the context of classifier fusion, where the errors of the classifiers under combination must be mutually uncorrelated in order to get better performance than a single-classifier-based system.

The complementary information extracted from two features [11] can be used to get better results. The combination of two or more classifiers performs better if they are supplied with information that is complementary in nature. So, after pattern matching, we fuse the scores of MFCC-IMFCC and of mMFCC-mIMFCC. This type of fusion is known as score-level fusion [6].

YOHO Database:

The YOHO voice verification corpus was collected over a 3-month period in an office environment. A high-quality telephone handset (Shure XTH-383) was used to collect the speech; however, the speech signal was not passed through the telephone channel. The vocabulary employed is called the combination-lock phrase: the words spoken consist of two-digit numbers in sets of three. There are 138 speakers (106 males and 32 females). For each speaker, there are 4 enrollment sessions of 24 utterances each and 10 verification sessions of 4 utterances each.

POLYCOST Database [12]:

The database was collected through the European telephone network and was recorded through an ISDN card on an XTL SUN platform with an 8 kHz sampling rate.

The database was divided into groups of twenty speakers, both for the training and the testing phase. There were seven groups in total: the first six groups have twenty speakers each, but the last group in the YOHO database has eighteen speakers, while there are eleven speakers in the POLYCOST database. Results were calculated using the cross-validation technique, a model validation technique which gives an insight into how the model will generalize to an independent dataset. It holds out a dataset to test the model during the training phase in order to limit problems like overfitting, which generally occurs when a model is very complex, such as having too many parameters relative to the number of observations.

Experimental Platform:

The results are shown in TABLE I and TABLE II. From TABLE I, it can be inferred that, for MFCC and mMFCC, the POC for each model order is higher for mMFCC than for MFCC, except for model order 8. Also, comparing IMFCC and mIMFCC, it can be observed that, except for model order 4, the POC for mIMFCC is better than that of IMFCC. So, mMFCC has outperformed MFCC and mIMFCC has outperformed IMFCC.

According to TABLE II, it can be seen that, for MFCC and mMFCC, the POC for model orders 8 and 16 is higher for mMFCC than for MFCC. For IMFCC and mIMFCC, the POC for mIMFCC is greater than that of IMFCC for all model orders. Hence, mIMFCC shows better results than IMFCC.
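The score-level fusion described above, combining the per-speaker log-likelihood scores of two complementary feature streams, is sketched below. The weighted-sum rule with min-max normalisation is a common choice and an assumption here; the paper does not spell out its normalisation.

```python
import numpy as np

def fuse_scores(scores_a, scores_b, alpha=0.5):
    """Score-level fusion of per-speaker scores from two feature streams
    (e.g. MFCC and IMFCC, or mMFCC and mIMFCC): min-max normalise each
    stream, combine as a weighted sum, and pick the best-scoring speaker."""
    def norm(s):
        s = np.asarray(s, dtype=float)
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else np.zeros_like(s)
    fused = alpha * norm(scores_a) + (1.0 - alpha) * norm(scores_b)
    return fused, int(np.argmax(fused))

# Illustrative scores for three enrolled speakers from two streams:
fused, best = fuse_scores([-120.0, -80.0, -150.0], [-100.0, -90.0, -140.0])
```

With alpha = 0.5 both streams contribute equally; tuning alpha on held-out data is a common refinement.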
TABLE I. YOHO DATABASE

TABLE IV. FUSION IN POLYCOST DATABASE
Speaker Recognition Techniques in Telephony, pp. 59-69, Vigo, Spain, November 1996.