
A Modified MFCC Feature Extraction Technique For

Robust Speaker Recognition


Diksha Sharma, Israj Ali
School of Electronics Engineering
KIIT University, Bhubaneswar, 751024
dikshas049@gmail.com, israj.research@gmail.com

Abstract: In a Speaker Recognition (SR) system, feature extraction is one of the crucial steps, where the particular speaker-related information is extracted. The state-of-the-art algorithm for this purpose is the Mel Frequency Cepstral Coefficient (MFCC), together with its complementary feature, the Inverted Mel Frequency Cepstral Coefficient (IMFCC). MFCC is based on the mel scale and IMFCC is based on the inverted mel (imel) scale. In this paper, another complementary set of features is proposed, also based on the mel-imel scales; the filtering operation makes this set of features different from MFCC and IMFCC. In the proposed features, the filter banks are placed linearly on the nonlinear scale, which makes them different from the state-of-the-art feature extraction techniques. We call these two features mMFCC and mIMFCC. mMFCC is based on the mel scale, whereas mIMFCC is based on the imel scale. mMFCC is compared with MFCC and mIMFCC is compared with IMFCC. The results have been verified on two standard databases, YOHO and POLYCOST, using the Gaussian Mixture Model (GMM) as the speaker modeling paradigm.

Keywords: MFCC, IMFCC, mMFCC, mIMFCC, Feature extraction, Fusion, Triangular filter, GMM.

I. INTRODUCTION

Feature extraction plays a significant role in the speaker recognition process. It is a method of reducing the dimension of the data while retaining the discriminative information: it retains the necessary information of the speech signal while rejecting redundant and unwanted information. The careful choice of features will give relevant information from the input data to perform the desired task. Various techniques used for feature extraction are MFCC [1,2,3], Real Cepstral Coefficients (RCC), Linear Prediction Coding (LPC), Linear Predictive Cepstral Coefficients (LPCC) and Perceptual Linear Predictive Cepstral Coefficients (PLPC). The state-of-the-art algorithm for this purpose is MFCC. It is one of the most popular and commonly used techniques for feature extraction in most applications of speech signal processing. It was introduced by Davis and Mermelstein in the 1980s and has been the state of the art ever since. MFCC was first proposed for speech recognition, to identify monosyllabic words in continuously spoken sentences, and not for speaker recognition. It is based on the human auditory system, aiming at an artificial implementation of the ear physiology. Its complementary part, IMFCC [4], also exists; the idea is to capture information which would otherwise be missed by MFCC. MFCC captures speaker-specific information lying in the lower frequency part more accurately than in the higher frequency part, whereas IMFCC captures speaker-specific information in the higher frequency part more accurately than in the lower frequency part. This is due to the non-linear distribution of the filterbanks on the linear frequency scale.

Despite the fact that MFCC is based on the human auditory system, it is not the best feature extraction technique for speaker recognition; it still provides scope for improvement. In this paper, that improved feature is discussed.

II. SPEAKER RECOGNITION SYSTEM

In this section, a basic speaker recognition system [5] based on GMM [6,7] is described. The overall block diagram is given in Fig. 1.

[Fig. 1: training path - speech sessions SES-1 to SES-N -> VAD (silence removal) -> pre-processing -> feature extraction -> speaker model; testing path - test data -> VAD -> pre-processing -> feature extraction -> pattern matching -> identified speaker]

Fig. 1 BLOCK DIAGRAM OF SPEAKER RECOGNITION SYSTEM

In this system, first the voice samples are recorded using a microphone. Sometimes, the recording is also performed over a telephone channel.

Once the voice samples are recorded, it is required to remove the silence portions from the speech signal and separate the voice samples, which contain mostly speaker-related information. These voice portions are passed through the pre-emphasis filter, which emphasizes the higher frequencies of the spectrum. In the next step, the speech signal is framed into 20 ms frames to make the signal stationary: speech is normally a quasi-periodic signal and non-stationary in nature, but after the framing operation each frame can be treated as stationary. Here, 50% overlapping with the previous samples is done, and each frame is multiplied by a Hamming window to reduce the spectral leakage in the spectrum.
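As a rough illustration of this pre-processing front end, the sketch below (Python/NumPy rather than the authors' MATLAB code) applies a first-order pre-emphasis filter, cuts the signal into 20 ms frames with 50% overlap and multiplies each frame by a Hamming window. The pre-emphasis coefficient of 0.97 is an assumption; the paper only states that the higher frequencies are emphasized and that 20 ms frames (160 samples at 8 kHz) with 50% overlap are used.

```python
import numpy as np

def preprocess(signal, fs=8000, frame_ms=20, overlap=0.5, pre_emph=0.97):
    """Pre-emphasize, frame (with overlap) and Hamming-window a speech signal."""
    # First-order pre-emphasis: y[n] = x[n] - a * x[n-1] (coefficient assumed).
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])

    frame_len = int(fs * frame_ms / 1000)        # 160 samples at 8 kHz
    hop = int(frame_len * (1.0 - overlap))       # 80 samples for 50% overlap
    n_frames = 1 + (len(emphasized) - frame_len) // hop

    window = np.hamming(frame_len)
    frames = np.empty((n_frames, frame_len))
    for t in range(n_frames):
        frames[t] = emphasized[t * hop: t * hop + frame_len] * window
    return frames                                # shape: (n_frames, frame_len)
```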
After that, each frame is sent for feature extraction, and using these features a model is generated for the particular speaker. The details of the feature extraction are described as follows.
Feature Extraction:

Feature extraction is a mapping process from a higher dimension to a lower dimension. The input data to an algorithm are too large to be processed directly, so they are transformed into a reduced set of feature vectors. A feature vector is an n-dimensional vector of numerical features. In the case of speaker recognition, these features represent the speaker's characteristics; if they are carefully extracted, they will give relevant information to perform the desired task.

MFCC is an acronym for Mel Frequency Cepstral Coefficient. It is a feature motivated by the human auditory system. The frequency perceived by us, i.e., the pitch, is different from what is emitted by the source. The mel is the unit of pitch, and it is logarithmic in nature. The mel scale relates the perceived frequency, i.e., the pitch, to its actual measured frequency.

Let x(n) represent a speech frame, where N is the number of samples (here, N = 160). The DFT of each frame is calculated, and the energy of each frame is computed by the following equation

E(k) = \left| \sum_{n=0}^{N-1} x(n) \, e^{-j 2\pi k n / N} \right|^2    (1)

where k is the frequency index, n is the time index and N is the total number of samples.
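Equation (1) amounts to taking a DFT of each windowed frame and keeping the squared magnitude. A minimal sketch, assuming the frames produced by the pre-processing step above and a 256-point FFT (the FFT size is our assumption, chosen so that the 128 positive-frequency bins mentioned later in the paper are available), is:

```python
import numpy as np

def power_spectrum(frames, n_fft=256):
    """Eq. (1): squared magnitude of the DFT of each windowed frame."""
    # rfft keeps the non-negative frequency bins k = 0 .. n_fft/2.
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    return np.abs(spectrum) ** 2        # shape: (n_frames, n_fft // 2 + 1)
```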

These DFT coefficients are passed through 20 triangular filterbanks. The output of each filter is calculated by a weighted summation over the frequency bins, as in the equation given below

E(i) = \sum_{k} \psi_i(k) \, E(k)    (2)

where i is the index of the filter and \psi_i(k) is called the kernel matrix.

After this, the logarithm is taken and the DCT is performed. Due to the overlapped filterbanks, the filter energies are correlated, so the DCT is applied to make them orthogonal. The final cepstral coefficients are calculated by the formula below

C_m = \sum_{i=1}^{Q} \log(E(i)) \cos\left( \frac{\pi m (i - 0.5)}{Q} \right)    (3)

where Q is the number of filterbanks.

The relation between the mel scale [1] and the linear frequency scale is given by

f_{mel} = 2595 \log_{10}\left( 1 + \frac{f}{700} \right)    (4)

IMFCC is an acronym for Inverted Mel Frequency Cepstral Coefficient. It has complementary characteristics to MFCC. The relation between the imel scale and the linear frequency scale is given by

f_{imel} = 2195.286 - 2595 \log_{10}\left( 1 + \frac{4031.25 - f}{700} \right)    (5)

The human auditory system can be modeled by a set of filters which are uniformly spaced on the mel scale. Since there is a non-linear relationship between the mel scale and the linear frequency scale, when the filters are mapped back to the linear frequency scale using the MFCC curve, the equation being

f = 700 \left[ 10^{f_{mel}/2595} - 1 \right]    (6)

the filters are non-uniformly spaced on the linear frequency scale. The filterbank density is high in the low frequency region and low in the high frequency region. The frequency bins are equally spaced on the linear frequency scale, so the number of bins present in each filter on the linear frequency scale is not the same. If we map the filters to the linear frequency scale using the IMFCC curve, the equation being

f = 4031.25 - 700 \left[ 10^{(2195.286 - f_{imel})/2595} - 1 \right]    (7)

the filters are again non-uniformly spaced on the linear frequency scale. The filterbank density is low in the high frequency region and high in the low frequency region, and the number of bins present in each filter on the linear frequency scale is again not the same.

Fig. 2 (a) Mel-Imel curves, (b) MFCC filter bank, (c) IMFCC filter bank.
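To make Eqs. (2)-(7) concrete, the following sketch implements the mel and imel warpings, builds a 20-filter triangular filterbank whose edges are uniformly spaced on the chosen warped scale, and applies the log and DCT of Eqs. (2)-(3). It is a minimal NumPy illustration rather than the authors' code; the 256-point FFT, the 13 retained cepstra and the constants 2195.286 and 4031.25 in the imel formulas are assumptions taken from the usual MFCC/IMFCC formulations.

```python
import numpy as np

FS, NFFT, NFILT, NCEPS = 8000, 256, 20, 13      # NCEPS (cepstra kept) is assumed

def hz_to_mel(f):                 # Eq. (4)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):                 # Eq. (6)
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def hz_to_imel(f, fmax=4031.25):  # Eq. (5), constants assumed
    return 2195.286 - 2595.0 * np.log10(1.0 + (fmax - f) / 700.0)

def imel_to_hz(im, fmax=4031.25): # Eq. (7)
    return fmax - 700.0 * (10.0 ** ((2195.286 - im) / 2595.0) - 1.0)

def triangular_filterbank(to_scale, from_scale, fs=FS, nfft=NFFT, nfilt=NFILT):
    """Q overlapping triangular filters, uniformly spaced on the warped scale."""
    edges = np.linspace(to_scale(0.0), to_scale(fs / 2.0), nfilt + 2)  # Q + 2 edges
    bins = np.floor((nfft + 1) * from_scale(edges) / fs).astype(int)
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for i in range(nfilt):
        lo, cen, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, cen):
            fbank[i, k] = (k - lo) / max(cen - lo, 1)   # rising edge
        for k in range(cen, hi):
            fbank[i, k] = (hi - k) / max(hi - cen, 1)   # falling edge
    return fbank                                   # the kernel matrix psi_i(k) of Eq. (2)

def cepstra(power_spec, fbank, nceps=NCEPS):
    """Eqs. (2)-(3): filterbank energies, log, then DCT to decorrelate them."""
    fb_energy = power_spec @ fbank.T               # E(i) of Eq. (2)
    log_e = np.log(np.maximum(fb_energy, 1e-12))
    q = fbank.shape[0]
    m = np.arange(nceps)[:, None]
    i = np.arange(1, q + 1)[None, :]
    dct_basis = np.cos(np.pi * m * (i - 0.5) / q)  # Eq. (3)
    return log_e @ dct_basis.T                     # shape: (n_frames, nceps)

mel_fbank = triangular_filterbank(hz_to_mel, mel_to_hz)      # MFCC filterbank
imel_fbank = triangular_filterbank(hz_to_imel, imel_to_hz)   # IMFCC filterbank
```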

Modeling:

By modeling, we extract and represent the desired information from the spectral sequence. The distribution property of a speaker is captured. These distributions have parameters, estimated from the feature vectors, which contain the speaker characteristics. We use here the Gaussian Mixture Model [8]. Its parameters are estimated from the training data using the iterative Expectation-Maximization (EM) algorithm. These parameters are the mean vectors, covariance matrices and mixture weights of all component densities. The parameters, which collectively constitute the speaker model, are represented by the notation \lambda = \{ w_i, \mu_i, \Sigma_i \}, where i = 1 to M. A Gaussian mixture model is a weighted sum of M component Gaussian densities, as given by the equation

p(x|\lambda) = \sum_{i=1}^{M} w_i \, g(x|\mu_i, \Sigma_i)    (8)

g(x|\mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)    (9)

D is the dimension of the feature space and x is a D-dimensional continuous-valued feature vector. The w_i are the mixture weights, satisfying the stochastic constraint \sum_{i=1}^{M} w_i = 1, and the g(x|\mu_i, \Sigma_i) are the component Gaussian densities.

Pattern Matching:

For speaker identification, the log-likelihood is calculated for all speaker models, and the owner of the model having the highest score is identified as the speaker. The Expectation-Maximization (EM) algorithm is used to train the models on the feature vectors; the parameters of the model are updated in each iteration, and here ten iterations are used. The log-likelihood [9] of a speaker model \lambda_s is given by

\log p(X|\lambda_s) = \sum_{t=1}^{T} \log p(x_t|\lambda_s)    (10)

where X = \{ x_1, \ldots, x_T \} are the feature vectors for an utterance with T frames.
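A compact way to realise Eqs. (8)-(10) is to fit one GMM per enrolled speaker and to score every test utterance against each model. The sketch below uses scikit-learn's GaussianMixture as a stand-in for the authors' MATLAB implementation; the diagonal covariance matrices and the fixed random seed are our assumptions, while the ten EM iterations follow the text.

```python
from sklearn.mixture import GaussianMixture

def train_speaker_models(train_features, n_mix=16, n_iter=10):
    """Fit one GMM (Eqs. 8-9) per speaker from its training feature vectors."""
    models = {}
    for speaker, X in train_features.items():        # X: (n_frames, D) cepstra
        gmm = GaussianMixture(n_components=n_mix, covariance_type='diag',
                              max_iter=n_iter, random_state=0)
        models[speaker] = gmm.fit(X)
    return models

def identify(models, X_test):
    """Eq. (10): sum of per-frame log-likelihoods; the best-scoring model wins."""
    scores = {spk: gmm.score_samples(X_test).sum() for spk, gmm in models.items()}
    return max(scores, key=scores.get), scores
```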
III. PROPOSED METHOD

In the pre-existing feature extraction techniques, MFCC and IMFCC, twenty filterbanks with fifty percent overlap and equal spacing are placed on the mel scale and the imel scale respectively. There are twenty-two vertices for the twenty overlapped triangles, and those twenty-two points are mapped back to the linear frequency scale using the MFCC or IMFCC curve, depending on the feature extraction technique used.

Another approach for extracting the features can be obtained by modifying the pre-existing MFCC and IMFCC feature extraction techniques. Instead of mapping the filterbank to the linear frequency scale, the frequency bins can be mapped to the mel scale and the imel scale using the MFCC and IMFCC curves (Fig. 2), depending upon the technique used. Earlier, twenty-two points were mapped from one scale to the other; now, 128 points are mapped from one scale to the other. The frequency bins will then no longer be equally spaced on the mel and imel scales, and thus each filter will contain a different number of frequency bins. A sketch of this construction is given below.

Fig. 3 FREQUENCY BINS & mMFCC FILTER BANK, FREQUENCY VS. PITCH

Fig. 4 FREQUENCY BINS & mIMFCC FILTER BANK, FREQUENCY VS. PITCH
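The paper gives no pseudocode for the modified filterbank, so the sketch below only reflects our reading of this section: the FFT bin frequencies are warped onto the mel (or imel) axis first, and the twenty overlapping triangles are then evaluated directly on that warped axis rather than on the linear frequency axis, so that the triangular weights vary linearly in mel/imel and each filter ends up holding a different number of (now unequally spaced) bins. All names and constants here are ours, not the authors'.

```python
import numpy as np

def warped_filterbank(warp, fs=8000, nfft=256, nfilt=20):
    """Triangular filters evaluated on a warped (mel or imel) axis.

    `warp` maps Hz to the non-linear scale, e.g. the hz_to_mel / hz_to_imel
    functions from the earlier sketch (or, inline,
    warp = lambda f: 2595.0 * np.log10(1.0 + f / 700.0) for the mel case).
    This is one possible reading of the proposed mMFCC/mIMFCC construction.
    """
    bin_hz = np.arange(nfft // 2 + 1) * fs / nfft        # equally spaced in Hz
    bin_warp = warp(bin_hz)                              # unequally spaced on mel/imel
    edges = np.linspace(warp(0.0), warp(fs / 2.0), nfilt + 2)  # uniform on warped axis

    fbank = np.zeros((nfilt, bin_hz.size))
    for i in range(nfilt):
        lo, cen, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_warp >= lo) & (bin_warp < cen)
        falling = (bin_warp >= cen) & (bin_warp <= hi)
        fbank[i, rising] = (bin_warp[rising] - lo) / (cen - lo)
        fbank[i, falling] = (hi - bin_warp[falling]) / (hi - cen)
    return fbank        # plug into the same log + DCT step as for MFCC/IMFCC
```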
Fusion Methodologies [10]:

MFCC captures information present in the low frequency region better than that in the high frequency region. The situation is the opposite in the case of IMFCC: it captures information in the high frequency region better than that in the low frequency region. This is due to the non-linear distribution of the filter banks in both feature extraction techniques. Hence, they are complementary to each other. In our proposed features, we use the complementary mel and imel scales for mMFCC and mIMFCC respectively. In mMFCC, the higher filters have highly dense frequency bins; on the other hand, in mIMFCC the lower filters have dense frequency bins. This makes the features complementary to each other. Therefore, the proposed features can be combined as a complementary information feature set for fusion work.

The advantage of complementary information has been found effective in the context of classifier fusion, where the errors of the classifiers under combination should be mutually uncorrelated in order to obtain better performance than a single-classifier system. The complementary information extracted from two features [11] can be used to get better results: the combination of two or more classifiers performs better if they are supplied with information that is complementary in nature. So, after pattern matching, we fuse the scores of MFCC-IMFCC and of mMFCC-mIMFCC. This type of fusion is known as score-level fusion [6].

[Fig. 5: train data and test data -> pre-processing -> Feature 1 / Feature 2 -> GMM -> pattern matching -> score calculation -> fusion with weights w and 1-w -> output]

Fig. 5 BLOCK DIAGRAM FOR FUSION
IV. DATABASE & EXPERIMENTAL PLATFORM

To evaluate the performance of our proposed features, we use two different kinds of databases. One is YOHO [6], which was recorded with a high quality microphone, and the second one is POLYCOST [8], which was recorded through a telephone channel. The details of these databases are given below.

YOHO Database:

The YOHO voice verification corpus was collected over a 3-month period in an office environment. A high quality telephone handset (Shure XTH-383) was used to collect the speech; however, the speech signal was not passed through the telephone channel. The vocabulary employed is called the combination-lock phrase: the words that are spoken consist of two-digit numbers in sets of three. There are 138 speakers (106 males and 32 females). For each speaker, there are 4 enrollment sessions of 24 utterances each and 10 verification sessions of 4 utterances each.

POLYCOST Database [12]:

The POLYCOST speech database was recorded during January-March 1996. The database was collected through the European telephone network and was recorded through an ISDN card on an XTL SUN platform with an 8 kHz sampling rate.

The databases were divided into groups of twenty speakers, both for the training and the testing phase. There were seven groups in total. The first six groups have twenty speakers each, but the last group in the YOHO database has eighteen speakers, while the last group of the POLYCOST database has eleven speakers. Results were calculated using the cross-validation technique. It is a model validation technique which gives an insight into how the model will generalize to an independent dataset. It defines a dataset to test the model in the training phase in order to limit problems like overfitting, which generally occurs when a model is very complex, such as having too many parameters relative to the number of observations.

Experimental Platform:

The overall experimental work was carried out on an HP ProBook 6460b with 4.00 GB of RAM and a 2.6 GHz Intel(R) Core(TM) i5-2540M processor. The algorithms were developed in the MATLAB environment; the version of MATLAB used here is 7.8.0.347 (Release 2009a).

V. RESULTS & DISCUSSION

Results were calculated using the cross-validation technique; the detailed process is explained here. For both the training and the testing phase, feature vectors were evaluated using the different feature extraction techniques. For each speaker in each group, separate models were developed from MFCC, IMFCC, mMFCC and mIMFCC. Then, pattern matching of each test sample belonging to a group was done against the model of each speaker of that group, and in this way the score was calculated. From that score, the overall performance was judged by taking the mean of the POC of each feature over the different speaker groups. The percentage of correctness (POC) is given by the following equation:

POC = (number of correctly identified test utterances / total number of test utterances) x 100
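The per-group evaluation described here reduces to identifying every test utterance against the models of its own group and counting the correct decisions. A small sketch (function and variable names are ours), reusing the speaker models and the identify step from the earlier GMM sketch, is:

```python
def percentage_of_correctness(models, test_utterances):
    """POC = (correctly identified test utterances / total test utterances) * 100.

    `models` maps speaker -> trained GMM for one group, and `test_utterances`
    is a list of (true_speaker, feature_matrix) pairs from the same group.
    The group-wise POC values are then averaged over all groups, as in the text.
    """
    correct = 0
    for true_speaker, X in test_utterances:
        best, _ = identify(models, X)      # identify() from the GMM sketch above
        if best == true_speaker:
            correct += 1
    return 100.0 * correct / len(test_utterances)
```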

Results are shown in TABLE I and TABLE II. From TABLE I it can be inferred that, comparing MFCC and mMFCC, the POC for each model order is higher for mMFCC than for MFCC, except for model order 8. Also, comparing IMFCC and mIMFCC, it can be observed that, except for model order 4, the POC of mIMFCC is better than that of IMFCC. So mMFCC has outperformed MFCC, and mIMFCC has outperformed IMFCC.

According to TABLE II, it can be seen that, for MFCC and mMFCC, the POC for model orders 8 and 16 is higher for mMFCC than for MFCC. For IMFCC and mIMFCC, the POC of mIMFCC is greater than that of IMFCC for all model orders. Hence, mIMFCC shows better results than IMFCC.

TABLE I. YOHO DATABASE

              Model Order
Feature       2       4       8       16
MFCC          89.46   94.31   97.00   97.80
mMFCC         89.57   94.39   96.97   97.82
IMFCC         91.54   95.38   97.01   97.83
mIMFCC        91.96   94.96   97.37   97.91

TABLE II. POLYCOST DATABASE

              Model Order
Feature       2       4       8       16
MFCC          84.33   89.59   92.28   94.91
mMFCC         84.33   89.53   92.35   95.11
IMFCC         80.69   86.81   90.02   90.35
mIMFCC        82.04   88.18   90.30   91.74

For fusion, the weight factor was taken as w = 0.5. In each speaker group, taking any two complementary features, their respective scores were weighted by w and (1 - w), as in Fig. 5, and then added. In this way, score fusion using all the combinations was done for each group of speakers. From the fused score, the POC was calculated, and at the end the mean POC was computed for each combination. The results are shown in TABLE III and TABLE IV.
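Score-level fusion with w = 0.5, as used for TABLE III and TABLE IV, simply forms a weighted sum of the two complementary log-likelihood scores before the final decision. A minimal sketch (names are ours) is given below; whether the raw log-likelihoods are normalised before fusion is not stated in the paper, so no normalisation is applied here.

```python
def fuse_scores(scores_a, scores_b, w=0.5):
    """Weighted score-level fusion: w * score_a + (1 - w) * score_b per speaker."""
    return {spk: w * scores_a[spk] + (1.0 - w) * scores_b[spk] for spk in scores_a}

def identify_fused(models_a, models_b, X_a, X_b, w=0.5):
    """Identify a test utterance from the fused scores of two feature streams.

    models_a / models_b are speaker GMMs trained on two complementary features
    (e.g. mMFCC and mIMFCC); X_a / X_b are the corresponding feature matrices
    extracted from the same test utterance.
    """
    _, scores_a = identify(models_a, X_a)     # identify() from the GMM sketch
    _, scores_b = identify(models_b, X_b)
    fused = fuse_scores(scores_a, scores_b, w)
    return max(fused, key=fused.get)
```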
From TABLE III, the results of the other feature combinations are compared with the baseline combination MFCC-IMFCC. For the combinations MFCC-mMFCC and mIMFCC-IMFCC, the POC is lower than that of the baseline combination for every model order. Taking MFCC-mIMFCC and mMFCC-IMFCC into account, we can see that for model orders 2 and 4 the POC is greater than the baseline, while for model orders 8 and 16 it is lower than the baseline. Taking mMFCC-mIMFCC, it can be observed that for each model order the POC of the baseline combination is lower than that of mMFCC-mIMFCC.

As per TABLE IV, the results of the other feature combinations are again compared with the baseline combination MFCC-IMFCC. For the combination MFCC-mMFCC, the POC is lower than the baseline for all model orders except 16. For mIMFCC-IMFCC, the POC is lower for each model order. Observing the combinations MFCC-mIMFCC and mMFCC-IMFCC, we can see that, except for model order 2, the POC is greater than that of the baseline. For mMFCC-mIMFCC, for each model order except model order 2, the POC is higher than that of MFCC-IMFCC.

TABLE III. FUSION IN YOHO DATABASE

              Model Order
Feature           2       4       8       16
MFCC-IMFCC        93.45   96.90   98.46   98.90
MFCC-mMFCC        89.55   94.37   97.02   97.79
mIMFCC-IMFCC      92.10   95.68   97.43   98.02
MFCC-mIMFCC       94.07   97.36   98.45   98.88
mMFCC-IMFCC       94.20   97.17   98.43   98.86
mMFCC-mIMFCC      94.04   97.34   98.47   98.94

TABLE IV. FUSION IN POLYCOST DATABASE

              Model Order
Feature           2       4       8       16
MFCC-IMFCC        87.35   91.61   93.67   94.54
MFCC-mMFCC        84.33   89.59   92.42   95.04
mIMFCC-IMFCC      82.17   87.56   90.06   91.21
MFCC-mIMFCC       87.22   92.26   94.79   95.24
mMFCC-IMFCC       86.86   91.89   94.28   95.01
mMFCC-mIMFCC      87.07   92.26   94.58   95.44

VI. CONCLUSION

There has been a considerable amount of development in the field of speaker recognition. Due to the reduction in recognition rate for various reasons, such as recording conditions and speaker-generated variability, the techniques in use still need to be improved.

In this paper, a new method of feature extraction, obtained by modifying the baseline features MFCC and IMFCC, has been discussed. The new features, termed mMFCC and mIMFCC, have shown better recognition accuracy than the baseline features. This makes the system more robust.

References

[1] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357-366, Aug. 1980.
[2] S. Furui, "Comparison of speaker recognition methods using statistical features and dynamic features," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 29, no. 3, pp. 342-350, Jun. 1981.
[3] R. Vergin, D. O'Shaughnessy and A. Farhat, "Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition," IEEE Trans. on ASSP, vol. 7, no. 5, pp. 525-532, Sept. 1999.
[4] S. Chakroborty, A. Roy and G. Saha, "Improved closed set text-independent speaker identification by combining MFCC with evidence from flipped filter banks," International Journal of Signal Processing, vol. 4, no. 2, pp. 114-122, 2007.
[5] M. Faundez-Zanuy and E. Monte-Moreno, "State-of-the-art in speaker recognition," IEEE Aerospace and Electronic Systems Magazine, vol. 20, no. 5, pp. 7-12, Mar. 2005.
[6] D. A. Reynolds and R. C. Rose, "An integrated speech-background model for robust speaker identification," in Proc. 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-92), vol. 2, pp. 185-188, Mar. 1992.
[7] D. A. Reynolds, "A Gaussian mixture modeling approach to text-independent speaker identification," Ph.D. thesis, Georgia Institute of Technology, Atlanta, GA, USA, Sept. 1992.
[8] A. Papoulis and S. U. Pillai, Probability, Random Variables and Stochastic Processes, Tata McGraw-Hill, fourth edition, Chap. 4, pp. 72-122, 2002.
[9] J. Kittler, M. Hatef, R. Duin and J. Matas, "On combining classifiers," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, pp. 226-239, 1998.
[10] K. Sri Rama Murty and B. Yegnanarayana, "Combining evidence from residual phase and MFCC features for speaker recognition," IEEE Signal Processing Letters, vol. 13, no. 1, pp. 52-55, Jan. 2006.
[11] H. Melin and J. Lindberg, "Guidelines for experiments on the POLYCOST database," in Proceedings of a COST 250 Workshop on Application of Speaker Recognition Techniques in Telephony, pp. 59-69, Vigo, Spain, Nov. 1996.
[12] T. Matsui and S. Furui, "Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs," in Proc. 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-92), vol. 2, pp. 157-160, Mar. 1992.
[13] F. Zheng, G. Zhang and Z. Song, "Comparison of different implementations of MFCC," Journal of Computer Science & Technology, vol. 16, no. 6, pp. 582-589, Sept. 2001.

