
G. Sidorov et al. (Eds.): MICAI 2010, Part II, LNAI 6438, pp. 116–125, 2010.
© Springer-Verlag Berlin Heidelberg 2010


Text-Independent Speaker Identification Using VQ-HMM Model Based Multiple Classifier System

Ali Zulfiqar¹, Aslam Muhammad², A.M. Martinez-Enriquez³, and G. Escalada-Imaz⁴

¹ Department of CS & IT, University of Gujrat, Pakistan
zulfiqar.butt@uog.edu.pk
² Department of CS & E, U.E.T., Lahore, Pakistan
maslam@uet.edu.pk
³ Department of Computer Science, CINVESTAV-IPN, Mexico
ammartin@cinvestav.mx
⁴ Artificial Intelligence Research Institute, CSIC, Barcelona, Spain
gonzalo@iiia.csic.es
Abstract. No single feature extraction and modeling technique for voice/speech is suitable for every type of environment. In many real-life applications it is not feasible to use every type of feature extraction and modeling technique to design a single classifier for speaker identification tasks, because doing so would make the system complex. Instead of exploring more techniques or complicating the system, it is more reasonable to develop classifiers from existing techniques and then combine them, using different combination techniques, to enhance the performance of the system. This paper therefore describes the design and implementation of a VQ-HMM based Multiple Classifier System using different combination techniques. The results show that the system developed using the confusion matrix significantly improves the identification rate.

Keywords: Speaker identification, classifier combination, HMM, VQ, MFCC, LPC.
1 Introduction

Speaker identification (SI) is the process of identifying an unknown speaker by comparing his or her voice with the registered speakers' voices stored in a database. SI can be text-dependent or text-independent [1]. A text-independent SI system is not limited to recognizing speakers on the basis of the same sentences stored in the database, while a text-dependent SI system can only recognize speakers who utter the same sentence every time [2]. SI can be further divided into closed-set SI and open-set SI [3]. In closed-set speaker identification, the unknown speech signal comes from one of the registered speakers. Open-set speaker identification decides whether an unknown signal comes from the set of registered speakers or from an unregistered speaker. A closed-set text-independent speaker identification (CISI) system must be able to capture particular voice features even in a noisy environment.
Two widely used feature extraction techniques, Mel-Frequency Cepstral Coefficients (MFCC) [4] and Linear Prediction Coefficients (LPC) [5], and two modeling techniques, the Hidden Markov Model (HMM) [6] and Vector Quantization (VQ) [7], are used to construct different classifiers. The classifiers differ from one another in the feature extraction and modeling techniques used. MFCC simulates the behavior of the human ear and uses the Mel frequency scale. LPC features represent the main vocal tract resonance properties in the acoustic spectrum and make it possible to distinguish one speaker from another, because each speaker is characterized by his or her own formant structure. HMM, based on the Markov chain mathematical model, is a doubly stochastic process that recognizes speakers very well in both text-dependent and text-independent SI systems. VQ is implemented through the LBG algorithm, which reduces and compresses the feature vectors into a small number of highly representative vectors.
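As an illustration of the codebook construction, the following is a minimal NumPy sketch of the LBG splitting procedure; it assumes `features` is an (N, D) array of MFCC or LPC vectors and a power-of-two codebook size, and it is a sketch under those assumptions, not the exact implementation used in the experiments.

```python
import numpy as np

def lbg_codebook(features, size=16, eps=0.01, iters=20):
    """Grow a VQ codebook by repeated splitting and k-means refinement."""
    codebook = features.mean(axis=0, keepdims=True)  # start from the global mean
    while len(codebook) < size:
        # Split every centroid into a slightly perturbed pair.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(iters):
            # Assign each feature vector to its nearest centroid.
            dist = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            nearest = dist.argmin(axis=1)
            # Move each centroid to the mean of its assigned vectors.
            for k in range(len(codebook)):
                members = features[nearest == k]
                if len(members) > 0:
                    codebook[k] = members.mean(axis=0)
    return codebook
```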
Speaker identification by a single decision-making scheme is always risky, because no single type of feature is suitable for all environments. Thus, this paper describes a Multiple Classifier System (MCS) for CISI which reduces errors and wrong identifications. The basic idea is to analyze the results obtained by different classifiers and then integrate the classifiers so that their reliability is enhanced by a proper combination technique. The principal objective of this work is to obtain a better identification rate for the MCS-based CISI by using various combination techniques to compound the outputs of the individual classifiers.
The paper proceeds as follows: Section 2 describes the steps followed by all three developed classifiers for SI. Section 3 explains the different combination techniques used in the MCS to coalesce the normalized measurement-level outputs of the single classifiers into a joint decision. Section 4 presents the results obtained from system testing and experimentation. Finally, we conclude our work in Section 5.
2 Single Classifier Speaker Identification System

Three different classifiers are designed and implemented: MFCC based on Vector Quantization (VQ) (classifier K1), MFCC based on HMM (classifier K2), and LPC based on VQ (classifier K3) (see Figure 1). All three classifiers are able to perform closed-set text-independent speaker identification. Each classifier of any SI task includes the following steps [8], [9], [10]:

• Digital speech data acquisition. Acoustic events like phonemes occur in frames of 10 ms to 100 ms [11]. Therefore, every speech signal is digitized into frames, where the duration of each frame is 23 ms for the sampling frequency 11025 Hz and 16 ms for the sampling frequency 8000 Hz.
• Feature extraction. Feature extraction is the process by which the speech signal is converted into some type of parametric representation for further analysis and processing. This process is crucial to the high performance of a CISI system: appropriate features should be extracted from the speech, otherwise the identification rate suffers significantly. LPC and MFCC based feature vectors are extracted from the speech of each registered speaker.
• Acoustic model. It is not possible to use all the extracted feature vectors to determine the identity of a speaker. Therefore, two modeling techniques, HMM and VQ, are used to construct the acoustic model of each registered speaker.
• Pattern matching. For the identification of an unknown registered speaker, its feature vectors are compared with each speaker's acoustic model present in the speaker database.
• Identification decision. When the feature vectors of an unknown speaker are compared with the acoustic model of each registered speaker, the decision is made by computing the distortion in the case of the VQ modeling technique: the speaker whose acoustic model has the minimum distortion with respect to the unknown speech signal is recognized as the true speaker. For HMM, decisions are made using the maximum likelihood criterion (a sketch of both decision rules follows this list).
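The following minimal sketch illustrates both decision rules; it assumes one codebook per registered speaker for VQ (as produced above) and HMMs exposing an hmmlearn-style `score()` log-likelihood method, so the names and interfaces are illustrative only.

```python
import numpy as np

def identify_vq(test_features, codebooks):
    # Average nearest-centroid (squared-error) distortion against each
    # speaker's codebook; the speaker with minimum distortion wins.
    distortions = []
    for cb in codebooks:
        dist = ((test_features[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        distortions.append(dist.min(axis=1).mean())
    return int(np.argmin(distortions))

def identify_hmm(test_features, models):
    # Maximum likelihood criterion: the speaker whose HMM assigns the
    # highest log-likelihood to the test features wins.
    return int(np.argmax([m.score(test_features) for m in models]))
```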



Fig. 1. The three different classifiers (K1: MFCC + VQ, K2: MFCC + HMM, K3: LPC + VQ)

In order to construct the three individual classifiers K1, K2, and K3 and check their performance, we carried out the following experimentation. K1 is obtained by combining MFCC with VQ, and K2 and K3 by combining MFCC with HMM and LPC with VQ, respectively (see Figure 1). Additionally, to make the system robust against noise, a consistent noise is added during the recording of each sentence. A database of more than 700 voice samples, recorded in two different sessions with a gap of two to three weeks, is used to evaluate the performance of the system. It contains utterances of 44 speakers, 30 males and 14 females. Each speaker recorded 6 different sentences at the sampling frequencies 8000 Hz and 11025 Hz using the PRAAT software (http://www.fon.hum.uva.nl/praat/download_win.html). The list of these sentences is:
Sentence 1: Decimal digits from zero to nine.
Sentence 2: All people smile in the same language.
Sentence 3: Betty bought bitter butter. But the butter was so bitter that she bought new butter to make the bitter butter better.
Sentence 4: A random text from a selected topic, recorded for 3–5 sec.
Sentence 5: The speaker's roll number or employee ID.
Sentence 6: A random text from a selected topic, recorded for 1–2 sec.
The identification rates of all three classifiers at the sampling frequencies 8000 Hz and 11025 Hz are depicted in Figure 2 and Figure 3, respectively, when the classifiers are trained and tested using the following sentences:
Training Sentence: Betty bought bitter butter. But the butter was so bitter that she
bought new butter to make the bitter butter better.
Testing Sentence: All people smile in the same language.

Fig. 2. Identification rate of the classifiers at 8000 Hz (K1: 90.91%, K2: 61.36%, K3: 86.36%)
Fig. 3. Identification rate of the classifiers at 11025 Hz (K1: 97.73%, K2: 68.18%, K3: 93.18%)

The identification rate of these classifiers is computed using the following relation:

$$\text{Identification Rate} = \frac{\text{Truly Identified Speakers}}{\text{Total No. of Speakers}} \times 100\%$$
The above experiments show that the higher sampling frequency yields a better identification rate than the lower one, and that classifier K1 (the MFCC based VQ classifier) is better than the other classifiers at both sampling frequencies. We now use the results at the sampling frequency 8000 Hz to construct the Multiple Classifier System (MCS), because in that case even the best classifier has an identification rate of only 90.91%, so there is considerable room for improvement.
3 Multiple Classifier Speaker Identification System (MCSIS)
MCSIS uses the outputs of all three classifiers to make a joint decision about the identity of the speaker. MCSIS architectures can be categorized according to the level of classifier output which they use during the combination [12], [13]. There are three different levels of classifier output:
• Abstract Level. Each classifier outputs the identity of one speaker only; this level carries the least information.
• Rank Level. Classifiers provide a set of speaker identities ranked in descending order of likelihood.
• Measurement Level. It conveys the greatest amount of information about each particular speaker, who may or may not be the correct one.
Different combination techniques [14], namely the Sum Rule, Product Rule, Min Rule, and Max Rule, are used in this work to combine the normalized measurement-level outputs of the classifiers into a joint decision about the identity of the speaker.
3.1 Normalized Measurement Level Output of Classifiers
The measurement level provides quite useful information about every speaker, as shown in Table 1 and Table 2. The first row of each table presents the measurement-level output of an MFCC based and an LPC based classifier, respectively, for speaker S1 when it is compared with 8 speakers. A major problem with measurement-level combination is the incomparability of the classifier outputs [15]. As can be observed from the first row of both tables, the outputs of classifiers using different feature vectors differ in range and are therefore incomparable. Consequently, before combining the outputs of the classifiers it is necessary to treat them: the measurement-level output of each classifier is normalized to probabilities by dividing each element of the row by the sum of all the elements of that row.
Table 1. Measurement-level output for a classifier using MFCC features

             S1      S2      S3      S4      S5      S6      S7      S8
S1           1.549   1.922   2.096   2.216   2.098   2.377   2.192   2.308
Normalized   0.092   0.115   0.125   0.132   0.125   0.142   0.131   0.138

Table 2. Measurement-level output for a classifier using LPC features

             S1      S2      S3      S4      S5      S6      S7      S8
S1           0.585   0.692   1.429   0.742   0.754   1.310   1.395   1.694
Normalized   0.069   0.082   0.170   0.088   0.090   0.134   0.166   0.201
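The normalized row of Table 1 can be reproduced with a few lines of NumPy (an illustrative sketch only):

```python
import numpy as np

row = np.array([1.549, 1.922, 2.096, 2.216, 2.098, 2.377, 2.192, 2.308])
normalized = row / row.sum()   # divide each element by the row sum
print(normalized.round(3))     # [0.092 0.115 0.125 0.132 0.125 0.142 0.131 0.138]
```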

After normalization, the outputs of all classifiers lie in the interval [0, 1]. We now look for a suitable combination technique for MCSIS. The techniques discussed in the following sections provide a better identification rate than the individual classifiers.
3.2 Sum Rule (Linear Combination)
Linear combination is the simplest technique for an MCS. For each speaker, the sum of the outputs of all classifiers is calculated. The decision about the true speaker depends on the maximum value obtained after combination [13], [16], [17]. Suppose that there are 3 classifiers K1, K2, and K3, and five speakers (S1, S2, S3, S4, and S5). The outputs of these classifiers are represented by O1, O2, and O3. These output vectors are given as

$$O_1 = [a_1\; a_2\; a_3\; a_4\; a_5]^T, \quad O_2 = [b_1\; b_2\; b_3\; b_4\; b_5]^T, \quad O_3 = [c_1\; c_2\; c_3\; c_4\; c_5]^T$$
where $a_j$, $b_j$, $c_j$ with $j = 1, 2, 3, 4, 5$ are positive real numbers. These output vectors are combined into an output matrix

$$O = \begin{bmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \\ a_3 & b_3 & c_3 \\ a_4 & b_4 & c_4 \\ a_5 & b_5 & c_5 \end{bmatrix}$$

The Sum rule is defined as

$$O_{sum} = \sum_{i=1}^{k} O_i, \qquad i = 1, 2, \ldots, k$$

where $O_i$ is the $i$-th column of the output matrix. After combination by the Sum rule, the output matrix becomes

$$O_{sum} = \begin{bmatrix} a_1 + b_1 + c_1 \\ a_2 + b_2 + c_2 \\ a_3 + b_3 + c_3 \\ a_4 + b_4 + c_4 \\ a_5 + b_5 + c_5 \end{bmatrix}$$

When the value in the first row is larger than the other values, the result of the MCS is speaker 1 (S1). Similarly, if the value in the second row is the largest, then the result of the MCS is speaker 2 (S2), and so on.
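A minimal sketch of the Sum rule, assuming `o1`, `o2`, and `o3` are the normalized output vectors of the three classifiers:

```python
import numpy as np

def sum_rule(o1, o2, o3):
    O = np.column_stack([o1, o2, o3])   # rows: speakers, columns: classifiers
    scores = O.sum(axis=1)              # one combined score per speaker
    return int(np.argmax(scores))       # the row with the largest sum wins
```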
3.3 Product Rule (Logarithmic Combination)
The Product rule, also called logarithmic combination, is another simple rule for a classifier combination system. It works in the same manner as linear combination, but instead of summing, the outputs for each speaker from all classifiers are multiplied [12], [16]. The Product rule is defined as

$$O_{prod} = \prod_{i=1}^{k} O_i, \qquad i = 1, 2, 3, \ldots, k$$

When the output of any classifier for a particular speaker is zero, this value is replaced by a very small positive real number. After combining the output vectors of all classifiers, the output matrix becomes

$$O_{prod} = \begin{bmatrix} a_1 b_1 c_1 \\ a_2 b_2 c_2 \\ a_3 b_3 c_3 \\ a_4 b_4 c_4 \\ a_5 b_5 c_5 \end{bmatrix}$$
The decision criterion is the same as for the Sum rule; a sketch follows.
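A minimal sketch of the Product rule, assuming `O` is the speakers-by-classifiers matrix of normalized outputs; the zero-replacement constant is an assumed value:

```python
import numpy as np

def product_rule(O, eps=1e-12):
    # Replace zero outputs by a very small positive number, as described above.
    O = np.where(O == 0.0, eps, O)
    return int(np.argmax(O.prod(axis=1)))   # largest product wins
```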
3.4 Min Rule
The Min rule combination method measures the likelihood of a given speaker by finding the minimum normalized measurement-level output for that speaker. The final decision identifying a speaker is then made by determining the maximum of these minima [12], [13].

Example 1: Consider an output matrix obtained by combining the output vectors of the three classifiers. Each column of the matrix represents the output of one classifier for the five speakers.

$$O = \begin{bmatrix} 0.0 & 0.3 & 0.2 \\ 0.4 & 0.3 & 0.2 \\ 0.6 & 0.5 & 0.4 \\ 0.0 & 0.0 & 0.1 \\ 0.2 & 0.1 & 0.3 \end{bmatrix}$$
Each element of $O_{min}$ is the minimum value selected from the corresponding row of the output matrix $O$, where each row holds the classifier outputs for one particular speaker. The final decision is the maximum value of the vector $O_{min}$, which is 0.4; this value shows that the true speaker is speaker number 3.

$$O_{min} = [0.0\; 0.2\; 0.4\; 0.0\; 0.1]^T$$
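Example 1 can be reproduced in a few lines (illustrative sketch):

```python
import numpy as np

O = np.array([[0.0, 0.3, 0.2],
              [0.4, 0.3, 0.2],
              [0.6, 0.5, 0.4],
              [0.0, 0.0, 0.1],
              [0.2, 0.1, 0.3]])
o_min = O.min(axis=1)              # [0.0 0.2 0.4 0.0 0.1]
print(int(np.argmax(o_min)) + 1)   # 3 -> speaker 3
```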


3.5 Max Rule
In the Max rule, the combined output for a class is the maximum of the output values provided by the different classifiers for the corresponding speaker [12], [16]. For a better explanation, consider the following example.

Example 2: Assume that we have three classifiers and five speakers. Their output matrix is given below:

$$O = \begin{bmatrix} 0.0 & 0.3 & 0.2 \\ 0.4 & 0.3 & 0.2 \\ 0.6 & 0.5 & 0.4 \\ 0.0 & 0.0 & 0.1 \\ 0.2 & 0.1 & 0.3 \end{bmatrix}$$
The combined output vector is obtained by selecting the maximum value from each row of the output matrix. The resulting vector is

$$O_{max} = [0.3\; 0.4\; 0.6\; 0.1\; 0.3]^T$$

The maximum value in the vector $O_{max}$ is 0.6, which corresponds to speaker number 3, so the joint decision of all the classifiers is speaker 3.
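The same example under the Max rule (illustrative sketch):

```python
import numpy as np

O = np.array([[0.0, 0.3, 0.2], [0.4, 0.3, 0.2], [0.6, 0.5, 0.4],
              [0.0, 0.0, 0.1], [0.2, 0.1, 0.3]])
o_max = O.max(axis=1)              # [0.3 0.4 0.6 0.1 0.3]
print(int(np.argmax(o_max)) + 1)   # 3 -> speaker 3
```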
3.6 Confusion Matrix
The confusion matrix is a handy tool for evaluating the performance of a classifier. It contains information about both truly identified and misclassified speakers [15], [18]. Each column of this matrix represents the true speaker. Assume that 50 voice samples of speaker 3 are tested by the identification system. If all these voice samples are truly identified, then the value in the 3rd row, 3rd column will be 50, with zeros elsewhere in that column. On the other hand, if the values in the 1st, 2nd, 3rd, 4th, and 5th rows of the 4th column are 3, 1, 0, 41, and 5 respectively, then speaker 4 is misclassified as speaker 1 three times, misclassified as speaker 2 once, truly identified 41 times, and misclassified as speaker 5 five times by the system. A confusion matrix is shown in Figure 4.
                      True Speakers
                   1    2    3    4    5
Identified    1   50    0    0    3    0
Speakers      2    0   47    0    1    0
              3    0    2   50    0    1
              4    0    0    0   41    0
              5    0    1    0    5   49
Fig. 4. A Confusion Matrix
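A minimal sketch of how such a matrix can be accumulated, assuming integer speaker indices; the orientation (rows = identified speaker, columns = true speaker) follows Fig. 4:

```python
import numpy as np

def confusion_matrix(true_ids, predicted_ids, n_speakers):
    # Rows: identified speaker, columns: true speaker (as in Fig. 4).
    cm = np.zeros((n_speakers, n_speakers), dtype=int)
    for t, p in zip(true_ids, predicted_ids):
        cm[p, t] += 1
    return cm

# Per-speaker identification rate: diagonal entries over column sums, e.g.
# rates = cm.diagonal() / cm.sum(axis=0)
```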
4 Results
The identification rates of MCSIS after applying the Sum, Product, Min, and Max rules to the outputs of the individual classifiers are presented in Figure 5. The Max rule as a combination rule in the MCS shows an increase of 4.54% in identification rate over the best individual classifier, whose rate was 90.91%.
Fig. 5. Identification rates of the combination techniques (Sum Rule: 88.64%, Product Rule: 81.82%, Min Rule: 70.45%, Max Rule: 95.45%)
Some combination techniques show identification rates even poorer than those of the individual classifiers. A comparison between the identification rates of the best individual classifier K1, the best combination technique (the Max rule), and the confusion matrix is depicted in Figure 6.
Fig. 6. Comparison of the confusion matrix technique with the Max rule and the best individual classifier (Individual Classifier: 90.91%, Max Rule: 95.45%, Confusion Matrix: 97.72%)
5 Conclusion and Future Work
The MFCC based VQ classifier, the LPC based VQ classifier, and the MFCC based HMM classifier are combined into a Multiple Classifier System (MCS). The normalized measurement-level outputs of the classifiers are combined using the Min rule, Max rule, Product rule, and Sum rule. The Max rule demonstrated good results compared to the other combination techniques, improving the identification rate by 4.54% over the best individual classifier. When the classifiers are combined using the confusion matrix, however, the proposed multiple classifier text-independent system shows an improvement of 6.81% over the best individual classifier and of 2.27% over the Max rule. The experiments show that the confusion matrix based MCS produces excellent results compared to each individual classifier; these results are also better than those of the various combination techniques, i.e. the Sum, Product, Min, and Max rules.

In the speaker identification case studied, our proposed MCS for the CISI system gives the same importance to the result obtained by each classifier. In order to enhance the decision process, the output of a classifier could be weighted when its performance is better than that of the other classifiers in the tested environment. This weighting scheme is the subject of our future validation, on which we continue running tests.
References
[1] Furui, S.: Recent Advances in Speaker Recognition. Pattern Recognition Letters 18(9), 859–872 (1997)
[2] Chen, K., Wang, L., Chi, H.: Methods of Combining Multiple Classifiers with Different Features and Their Application to Text-independent Speaker Identification. International Journal of Pattern Recognition and Artificial Intelligence 11(3), 417–445 (1997)
[3] Reynolds, D.A.: An Overview of Automatic Speaker Recognition Technology. In: Proc. IEEE ICASSP, vol. 4, pp. 4072–4075 (2002)
[4] Godino-Llorente, J.I., Gómez-Vilda, P., Sáenz-Lechón, N., Velasco, M.B., Cruz-Roldán, F., Ballester, M.A.F.: Discriminative Methods for the Detection of Voice Disorders. In: An ISCA Tutorial and Research Workshop on Non-Linear Speech Processing, The COST-277 Workshop (2005)
[5] Xugang, L., Jianwu, D.: An Investigation of Dependencies between Frequency Components and Speaker Characteristics for Text-independent Speaker Identification. Speech Communication 50(4), 312–322 (2007)
[6] Huang, X.D., Ariki, Y., Jack, M.A.: Hidden Markov Models for Speech Recognition. Edinburgh University Press, Edinburgh (1990)
[7] Linde, Y., Buzo, A., Gray, R.M.: An Algorithm for Vector Quantizer Design. IEEE Transactions on Communications 28, 84–95 (1980)
[8] Higgins, J.E., Damper, R.I., Harris, C.J.: A Multi-Spectral Data Fusion Approach to Speaker Recognition. In: Fusion 1999, 2nd International Conference on Information Fusion, Sunnyvale, CA, pp. 1136–1143 (1999)
[9] Premakanthan, P., Mikhael, W.B.: Speaker Verification/Recognition and the Importance of Selective Feature Extraction: Review. In: Proc. of 44th IEEE MWSCAS 2001, vol. 1, pp. 57–61 (2001)
[10] Razak, Z., Ibrahim, N.J., Idna Idris, M.Y., et al.: Quranic Verse Recitation Recognition Module for Support in J-QAF Learning: A Review. International Journal of Computer Science and Network Security (IJCSNS) 8(8), 207–216 (2008)
[11] Becchetti, C., Ricotti, L.P.: Speech Recognition: Theory and C++ Implementation. John Wiley & Sons, Chichester (1999)
[12] Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998)
[13] Kuncheva, L.I., Bezdek, J.C., Duin, R.P.W.: Decision Templates for Multiple Classifier Fusion: An Experimental Comparison. Pattern Recognition 34(2), 299–314 (2001)
[14] Shakhnarovich, G., Darrell, T.: On Probabilistic Combination of Face and Gait Cues for Identification. In: Proc. 5th IEEE Int'l Conf. on Automatic Face and Gesture Recognition, pp. 169–174 (2002)
[15] Ho, T.K., Hull, J.J., Srihari, S.N.: Decision Combination in Multiple Classifier Systems. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(1), 66–75 (1994)
[16] Tumer, K., Ghosh, J.: Linear and Order Statistics Combiners for Pattern Classification. In: Sharkey, A. (ed.) Combining Artificial Neural Networks, pp. 127–162. Springer, Heidelberg (1999)
[17] Chen, K., Chi, H.: A Method of Combining Multiple Probabilistic Classifiers through Soft Competition on Different Feature Sets. Neurocomputing 20(1-3), 227–252 (1998)
[18] Kuncheva, L.I., Jain, L.C.: Designing Classifier Fusion Systems by Genetic Algorithms. IEEE Trans. on Evolutionary Computation 4(4), 327–336 (2000)
