
Graduate Institute of Computer Science and Information Engineering

College of Electrical Engineering and Computer Science

National Taiwan University


Master Thesis

Phone Recognition using Structural Support Vector Machine

Meng Chao-Hong

Advisor: Lee Lin-Shan, Ph.D.

June, 2009

Phone Recognition using Structural Support Vector Machine

(Student ID: R96922007)


Abstract

Speech Recognition is conventionally formulated through the Bayes Theorem, which decomposes the problem into an Acoustic Model and a Language Model. The acoustic model is usually a Hidden Markov Model whose parameters are estimated by Maximum Likelihood Estimation, and Discriminative Training methods have been proposed to improve on that criterion.

This thesis applies the Structural Support Vector Machine to phone recognition. On the TIMIT corpus the proposed approach improves Phone Accuracy by about 1% over a Tandem System baseline.

Contents

1 Introduction (1.1, 1.2, 1.3, 1.4, 1.5)
2 Hidden Markov Models, Multi-Layered Perceptrons, and the Tandem System
  2.1 Hidden Markov Model (2.1.1, 2.1.2); 2.2 Multi-Layered Perceptron (2.2.1, 2.2.2); 2.3 Tandem System; 2.4
3 Support Vector Machine
  3.1; 3.2 Primal Form; 3.3 Lagrange Duality; 3.4; 3.5; 3.6; 3.7 Sequential Minimal Optimization; 3.8
4 Structural Support Vector Machine
  4.1; 4.2 Primal Form and Dual Form; 4.3 Cutting Plane Method; 4.4; 4.5
5 Experiments (TIMIT Phone Recognition)
  5.1 (5.1.1-5.1.6); 5.2 (5.2.1); 5.3
6 Conclusion and Future Work (6.1, 6.2)
References


List of Figures: 1.1, 2.1-2.4, 3.1-3.4, 4.1-4.7, 5.1-5.8


List of Tables: 5.1-5.8 (5.1 and 5.2 describe the TIMIT corpus; 5.4 lists the 48 training phones)

1.1

(Rabiner) [1]

(Hidden Markov Model; HMM)
1970

[2]

(Bayes Theorem)

(Acoustic Model)
(Language Model)


(Maximum Likelihood
Estimation; MLE)
(Transcription)

(Posterior Probability)

(Competing Word Sequence)
(Testing Set)

(Likelihood)

1980

IBM
(Bahl) [3]
(Maximum Mutual Information; MMI)

1

(Minimum Classification Error; MCE) [4]



(Minimum Phone Error; MPE) [5]
(Minimum
Phone Frame Error; MPFE) [6]
(Minimum Divergence;
MD) [7]

(Bayes Risk)
(Object
Function)




Y X

P (Y |X)

P (Y )
P (X|Y )


(Bayesian School)

P (Y |X)

(Statistical School)


(
Y )
Y
P (Y |X)

P (Y |X)

(Structural Support Vector


Machine; SVM-struct)
P (Y |X)

1.1:

1.2

[8]
(Radio
Rex)
(
1.1)
1920

(Rex)

500

(Hz)
500
Rex

1940

1940

(Bell Lab)

10

(Patterns)

(Correlation Coefficient)

3

97%
99%

(Phoneme Recognizer)

(Phoneme Transition Probability)

1960


(Feature Extraction Algorithms)

(Fast Fourier Transform; FFT)


(Cepstral)

(Linear Prediction Coding; LPC)

(Speech Coding)

(Warping)

(Speaking Rate)

(Dynamic Programming)



(Match)

1966

(Baum)
[9]

(Carnegie Mellon University)

(Baker)

IBM
(Watson Lab)

(Frederick Jelinek)
(Shannon)
(Information Theory)

IBM
Baker

(Decoding Algorithm)
Baker

(Viterbi)
IBM
(Stack)

20

Resource Management, Wall Street Journal Air Traffic Information System, Broadcast
News, CALLHOME





x

(Observation)
(Feature Vector
Sequence)
X
Y
y
\hat{y} = \arg\max_{y \in \mathcal{Y}} P(y|x)    (1.1)

y
Y P (y|x)
x
y

P (Y |X)

P (Y |X)

(Bayes Theorem)

\hat{y} = \arg\max_{y \in \mathcal{Y}} P(y|x) = \arg\max_{y \in \mathcal{Y}} \frac{P(x|y)\,P(y)}{P(x)}    (1.2)

P (x)

5


\hat{y} = \arg\max_{y \in \mathcal{Y}} P(y|x) = \arg\max_{y \in \mathcal{Y}} P(x|y)\,P(y)    (1.3)

P (X|Y )
P (Y )

P (X|Y )
X

P (Y )
Y
(Acoustic Model)
(Language Model)


P (Y |X)
Y

(Free-Phone Decoding)

1.3

(Component)

(

[10]
[11])

(
[4]
[5])

P (Y |X)

(Hidden Conditional Random Field; HCRF) [12,13]

(Multi-Layered Perceptron; MLP)
[14]

(Structural Support Vector Machine)
6

[15]

(Sequence Tagging) [16]
(
)
[17,18]

[19]


[20]

1.4

TIMIT


(Mel-Frequency Cepstral Coefficient; MFCC)

(Perceptual Linear Prediction; PLP)

(Tandem)

(Posterior Probability)

1%

1.5



(Optimization)

(Tandem)

(Continuous Hidden Markov Model; Continuous-HMM)



(Discrete Hidden Markov
Model; Discrete-HMM)

(Multi-Layered Perceptron; MLP)

2.1

(Hidden Markov Model, HMM)

(Sequence Tagging)
(Sequence Segmentation)

(Part-of-speech Tagging)
(Statistical Language Modeling)

(Natural Language Processing; NLP)

(Discrete)

(Feature Vector Space)
(Finitely Countable)

(State)
(Emission Probability)

(Probability Mass Function)

(Discrete Hidden Markov Model)

(Uncountable)

(Probability Density Function)

2.1.1

(Generative Model)


Algorithm 1 (generating an observation sequence from an HMM)
1: Let the state set be M := {1, 2, . . . , M}
2: Draw the initial state (e.g., from a Uniform Distribution over the states)
3: A is the transition probability matrix of an order-l model
4: The observation alphabet may be finitely countable, infinitely countable, or uncountable
5: for t = 1, . . . , T do
6:   draw state s_t according to the transition probabilities given the previous state(s)
7:   draw observation X_t from the emission distribution of state s_t
8: end for

(
)
1

Figure 2.1: an example HMM with parameters A, B, π, drawn as a finite state machine (FSM); the edges carry the transition probabilities.

2.1
B

l = 1)


Xt t
st t

l
l

l
(l-th-order
Hidden Markov Model; l-th-order HMM)



(First-order Hidden Markov Model; First-order HMM)

11

= {1, 2, . . . , M }
(Alphabet Set)

M 1
M
OW (Observation Alphabet Set)

(Finitely Countable)
(Infinitely Countable)

(Uncountable)

A = {aij }
A
l

(Transition Probability)

A
(Matrix)
aij i

j

a_{ij} = P(s_t = j \mid s_{t-1} = i)    (2.1)

B = {bj (x)}
B

j
(Observation)
x
(Emission
Probability)

b_j(x) = P(X_t = x \mid s_t = j)    (2.2)

b_j(o_k) = P(X_t = o_k \mid s_t = j)    (2.3)

ok k
B
: (

)(
)


(Probability Distribution Family)

12

(Multivariate Gaussian Mixture Density Function)

b_j(x) = \sum_{k=1}^{C} c_{jk}\,\mathcal{N}(x; \mu_{jk}, \Sigma_{jk}) = \sum_{k=1}^{C} c_{jk}\, b_{jk}(x)    (2.4)

where b_{jk}(x) = \mathcal{N}(x; \mu_{jk}, \Sigma_{jk}), \mu_{jk} is the mean and \Sigma_{jk} the covariance of the k-th component of state j, and the mixture weights c_{jk} sum to one:

\sum_{k=1}^{C} c_{jk} = 1    (2.5)

π = {π_i}: π_i is the initial probability of state i, for i = 1, . . . , M. These parameters satisfy

a_{ij} \ge 0,\; b_i(x) \ge 0,\; \pi_i \ge 0, \quad \forall i, j, k    (2.6)

\sum_{j=1}^{M} a_{ij} = 1    (2.7)

\int_{x \in O_W} b_i(x)\,dx = 1    (2.8)

\sum_{i=1}^{M} \pi_i = 1    (2.9)



Figure 2.2: a further HMM example; the numbers on the edges (0.6, 1.0, 0.9, 0.5, 0.4, 0.5, 0.1) are probabilities.


\lambda = (A, B, \pi)    (2.10)

where A, B, and π are as defined above.
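To make the parameterization λ = (A, B, π) and the sampling procedure of Algorithm 1 concrete, the following is a minimal sketch (not from the thesis) of drawing a sequence from a first-order discrete HMM in Python; numpy and the names sample_hmm, A, B, pi are assumptions for illustration only.

import numpy as np

def sample_hmm(A, B, pi, T, rng=None):
    # A: (M, M) transition matrix, B: (M, K) emission probability mass
    # functions (one row per state), pi: (M,) initial distribution.
    rng = np.random.default_rng() if rng is None else rng
    M, K = B.shape
    states, obs = [], []
    s = rng.choice(M, p=pi)                  # initial state ~ pi
    for _ in range(T):
        x = rng.choice(K, p=B[s])            # emit an observation from b_s
        states.append(s)
        obs.append(x)
        s = rng.choice(M, p=A[s])            # move to the next state ~ a_{s,*}
    return np.array(states), np.array(obs)

# A toy model with M = 2 states and K = 3 observation symbols.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(sample_hmm(A, B, pi, T=10))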

2.1.2

(Training Set)Z
1
N

Z = \{X^{(n)}\}_{n=1}^{N}    (2.11)
(n)

(Feature Vector Sequence)


(Xi )

n
i

(Maximum Likelihood Estimation)

N (Likelihood)
:
\prod_{n=1}^{N} P(X^{(n)}; \lambda)    (2.12)

(Log-Likelihood)
:
l(\lambda) = \sum_{n=1}^{N} \log P(X^{(n)}; \lambda)    (2.13)

(Fit)

(Hidden Variable)

(Expectation Maximization)

:
l(\lambda) = \sum_{n=1}^{N} \log P(X^{(n)}; \lambda)
 = \sum_{n=1}^{N} \log \Big( \sum_{S^{(n)}} P(X^{(n)}, S^{(n)}; \lambda) \Big)
 = \sum_{n=1}^{N} \log \Big( \sum_{S^{(n)}} \sum_{K^{(n)}} P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda) \Big)    (2.14)

Qn (S (n) , K (n) )

S (n) (State Sequence)
K (n)

\sum_{S^{(n)}} \sum_{K^{(n)}} Q_n(S^{(n)}, K^{(n)}) = 1    (2.15)

Q_n(S^{(n)}, K^{(n)}) \ge 0    (2.16)

Using Jensen's Inequality, a Lower Bound of the log-likelihood is obtained:

l(\lambda) = \sum_{n=1}^{N} \log \Big( \sum_{S^{(n)}} \sum_{K^{(n)}} Q_n(S^{(n)}, K^{(n)}) \frac{P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda)}{Q_n(S^{(n)}, K^{(n)})} \Big)
 \ge \sum_{n=1}^{N} \sum_{S^{(n)}} \sum_{K^{(n)}} Q_n(S^{(n)}, K^{(n)}) \log \frac{P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda)}{Q_n(S^{(n)}, K^{(n)})}    (2.17)
 = \sum_{n=1}^{N} \sum_{S^{(n)}} \sum_{K^{(n)}} \Big[ Q_n(S^{(n)}, K^{(n)}) \log P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda) - Q_n(S^{(n)}, K^{(n)}) \log Q_n(S^{(n)}, K^{(n)}) \Big]


Equality in (2.17) holds when the ratio is a constant,

\frac{P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda)}{Q_n(S^{(n)}, K^{(n)})} = c    (2.18)

P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda) \propto Q_n(S^{(n)}, K^{(n)})    (2.19)

and since \sum_{S^{(n)}} \sum_{K^{(n)}} Q_n(S^{(n)}, K^{(n)}) = 1,

Q_n(S^{(n)}, K^{(n)}) = \frac{P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda)}{\sum_{S^{(n)}} \sum_{K^{(n)}} P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda)}
 = \frac{P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda)}{P(X^{(n)}; \lambda)}    (2.20)
 = P(S^{(n)}, K^{(n)} \mid X^{(n)}; \lambda)

so Q_n(S^{(n)}, K^{(n)}) = P(S^{(n)}, K^{(n)} \mid X^{(n)}; \lambda).

Substituting Q_n back into (2.17), the second term

-\sum_{n=1}^{N} \sum_{S^{(n)}} \sum_{K^{(n)}} Q_n(S^{(n)}, K^{(n)}) \log Q_n(S^{(n)}, K^{(n)})    (2.21)

does not depend on the model parameters λ once Q_n is fixed, so maximizing the lower bound amounts to maximizing

\sum_{n=1}^{N} \sum_{S^{(n)}} \sum_{K^{(n)}} Q_n(S^{(n)}, K^{(n)}) \log P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda)    (2.22)

that is,

\hat{\lambda} = \arg\max_{\lambda} \sum_{n=1}^{N} \sum_{S^{(n)}} \sum_{K^{(n)}} Q_n(S^{(n)}, K^{(n)}) \log P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda)    (2.23)


The complete-data likelihood factorizes as

P(X^{(n)}, S^{(n)} \mid \lambda) = \prod_{t=1}^{T} a_{s_{t-1} s_t}\, b_{s_t}(X_t^{(n)}), \qquad
P(X^{(n)}, S^{(n)}, K^{(n)} \mid \lambda) = \prod_{t=1}^{T} a_{s_{t-1} s_t}\, c_{s_t k_t}\, b_{s_t k_t}(X_t^{(n)})    (2.24)

so that

\log P(X^{(n)}, S^{(n)}, K^{(n)} \mid \lambda) = \sum_{t=1}^{T} \log a_{s_{t-1} s_t} + \sum_{t=1}^{T} \log b_{s_t k_t}(x_t^{(n)}) + \sum_{t=1}^{T} \log c_{s_t k_t}    (2.25)

Substituting (2.24)-(2.25) into the objective from (2.17):

\sum_{n=1}^{N} \sum_{S^{(n)}} \sum_{K^{(n)}} P(S^{(n)}, K^{(n)} \mid X^{(n)}; \lambda) \log P(X^{(n)}, S^{(n)}, K^{(n)} \mid \lambda)
 = \sum_{n=1}^{N} \sum_{i=1}^{M} \sum_{j=1}^{M} \sum_{t=1}^{T} P(s_{t-1}=i, s_t=j \mid X^{(n)}; \lambda) \log a_{ij}
 + \sum_{n=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{C} \sum_{t=1}^{T} P(s_t=j, k_t=k \mid X^{(n)}; \lambda) \log b_{jk}(x_t^{(n)})
 + \sum_{n=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{C} \sum_{t=1}^{T} P(s_t=j, k_t=k \mid X^{(n)}; \lambda) \log c_{jk}    (2.26)



Each of the three terms in (2.26) has the form

F(x) = \sum_i y_i \log x_i    (2.27)

subject to \sum_i x_i = 1. Using a Lagrange Multiplier, the maximizer is

x_i = \frac{y_i}{\sum_i y_i}    (2.28)

Applying this to (2.26) gives the re-estimation formulas

\hat{a}_{ij} = \frac{\sum_{n=1}^{N} \sum_{t=1}^{T} \xi_t^{(n)}(i,j)}{\sum_{j=1}^{M} \sum_{n=1}^{N} \sum_{t=1}^{T} \xi_t^{(n)}(i,j)}    (2.29)

\hat{c}_{jk} = \frac{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_t^{(n)}(j,k)}{\sum_{k=1}^{C} \sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_t^{(n)}(j,k)}    (2.30)

where

\gamma_t^{(n)}(j,k) = P(s_t = j, k_t = k \mid X^{(n)}; \lambda)    (2.31)

\xi_t^{(n)}(i,j) = P(s_{t-1} = i, s_t = j \mid X^{(n)}; \lambda)    (2.32)

For the Gaussian parameters {\mu_{jk}, \Sigma_{jk}}, differentiating with the Chain Rule and setting the gradients to zero gives

\hat{\mu}_{jk} = \frac{\sum_{t=1}^{T} P(s_t=j, k_t=k \mid X^{(n)}; \lambda)\, x_t}{\sum_{t=1}^{T} P(s_t=j, k_t=k \mid X^{(n)}; \lambda)} = \frac{\sum_{t=1}^{T} \gamma_t^{(n)}(j,k)\, x_t}{\sum_{t=1}^{T} \gamma_t^{(n)}(j,k)}    (2.33)

\hat{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} P(s_t=j, k_t=k \mid X^{(n)}; \lambda)\,(x_t - \hat{\mu}_{jk})(x_t - \hat{\mu}_{jk})^T}{\sum_{t=1}^{T} P(s_t=j, k_t=k \mid X^{(n)}; \lambda)} = \frac{\sum_{t=1}^{T} \gamma_t^{(n)}(j,k)\,(x_t - \hat{\mu}_{jk})(x_t - \hat{\mu}_{jk})^T}{\sum_{t=1}^{T} \gamma_t^{(n)}(j,k)}    (2.34)

Algorithm 2 (Forward Algorithm)
1: for i = 1, . . . , M do
2:   α_1(i) = π_i b_i(X_1)
3: end for
4: for t = 2, . . . , T do
5:   for j = 1, . . . , M do
6:     α_t(j) = [ Σ_{i=1}^{M} α_{t-1}(i) a_{ij} ] b_j(X_t)
7:   end for
8: end for
9: P(X; λ) = Σ_{i=1}^{M} α_T(i)
To compute \gamma_t^{(n)}(j,k) and \xi_t^{(n)}(i,j), the Baum-Welch Algorithm uses the forward and backward recursions of Algorithms 2 and 3. The forward variable of Algorithm 2 is

\alpha_t(i) = P(X_1^t, s_t = i; \lambda)    (2.35)

and the backward variable of Algorithm 3 is

\beta_t(i) = P(X_{t+1}^T \mid s_t = i; \lambda)    (2.36)

from which

\xi_t^{(n)}(i,j) = \frac{\sum_{k=1}^{C} \alpha_{t-1}(i)\, a_{ij}\, c_{jk}\, b_{jk}(x_t)\, \beta_t(j)}{\sum_{i=1}^{M} \alpha_T(i)}    (2.37)

\gamma_t^{(n)}(j,k) = \frac{\sum_{i=1}^{M} \alpha_{t-1}(i)\, a_{ij}\, c_{jk}\, b_{jk}(x_t)\, \beta_t(j)}{\sum_{i=1}^{M} \alpha_T(i)}    (2.38)

With \gamma_t^{(n)}(j,k) and \xi_t^{(n)}(i,j) available, the re-estimation formulas (2.29)-(2.34) can be evaluated.

Algorithm 3 (Backward Algorithm)
1: β_T(i) = 1/M
2: for t = T-1, . . . , 1 do
3:   for i = 1, . . . , M do
4:     β_t(i) = Σ_{j=1}^{M} a_{ij} b_j(X_{t+1}) β_{t+1}(j)
5:   end for
6: end for
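As a concrete companion to Algorithms 2 and 3 and to the posteriors in (2.35)-(2.38), the following is a small Python/numpy sketch (illustrative, not the thesis code) of the forward and backward recursions for a discrete HMM; it uses a plain emission matrix B rather than Gaussian mixtures, and initializes β_T(i) = 1/M as Algorithm 3 does.

import numpy as np

def forward(A, B, pi, x):
    # Algorithm 2: alpha[t, i] accumulates P(X_1..X_t, s_t = i).
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    return alpha                      # P(X; lambda) = alpha[-1].sum()

def backward(A, B, x, M):
    # Algorithm 3: beta[t, i] accumulates P(X_{t+1}..X_T | s_t = i).
    T = len(x)
    beta = np.zeros((T, M))
    beta[-1] = 1.0 / M                # beta_T(i) = 1/M, as in the thesis
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta

def gamma(alpha, beta):
    # State posteriors P(s_t = i | X), the single-component analogue of (2.38).
    g = alpha * beta
    return g / g.sum(axis=1, keepdims=True)

A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
x  = np.array([0, 1, 2, 2, 0])
a, b = forward(A, B, pi, x), backward(A, B, x, M=2)
print(gamma(a, b))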

2.2 Multi-Layered Perceptron

(Neuroscience)

2.3a





F(x) = \theta\Big( \sum_{i} w_i x_i \Big)    (2.39)

Figure 2.3: a multi-layered perceptron with an input layer (Input #1-#4), a hidden layer, and an output layer.
x

(Transformation)
wi
(Weight)
2.3b


(Neural Network)

(Acyclic)

(Multi-Layered Perceptron)

(Feed-Forward Neural Network)



2.3
(Classification)

(Regression)

(Binary Classification)

(Univariate Regression)

(Threshold)



21

2.2.1

The network approximates a target function f : R^d → R by a function g built from L layers of neurons. Layer l (1 ≤ l ≤ L) has d^{(l)} neurons and receives the d^{(l-1)} outputs of layer l-1, with d^{(0)} = d and d^{(L)} = 1. Let x_i^{(l-1)} (1 ≤ i ≤ d^{(l-1)}) denote the outputs of layer l-1, x_j^{(l)} (1 ≤ j ≤ d^{(l)}) the outputs of layer l, and w_{ij}^{(l)} (1 ≤ l ≤ L, 0 ≤ i ≤ d^{(l-1)}, 1 ≤ j ≤ d^{(l)}) the weights into layer l. Neuron j of layer l computes

x_j^{(l)} = \theta\Big( \sum_{i=0}^{d^{(l-1)}} w_{ij}^{(l)} x_i^{(l-1)} \Big)    (2.40)

where the activation is \theta(s) = \tanh(s) = \frac{e^s - e^{-s}}{e^s + e^{-s}}. Writing the weighted sum as

s_j^{(l)} = \sum_{i=0}^{d^{(l-1)}} w_{ij}^{(l)} x_i^{(l-1)}    (2.41)

we have

x_j^{(l)} = \theta(s_j^{(l)})    (2.42)

To evaluate g(x), set the layer-0 outputs x_1^{(0)}, . . . , x_{d^{(0)}}^{(0)} to the input x, propagate through the L layers, and take the single output of layer L as x_1^{(L)} = g(x).
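The forward computation of (2.40)-(2.42) can be written compactly; the sketch below (illustrative names, assuming numpy) stacks the layer weights w^{(l)} into matrices, prepends the constant x_0 = 1 term implied by the index i = 0, and applies tanh at every layer.

import numpy as np

def mlp_forward(weights, x):
    # weights[l] has shape (d^(l-1) + 1, d^(l)); row 0 holds the weights
    # w_{0j}^{(l)} that multiply the constant input x_0^{(l-1)} = 1.
    out = np.asarray(x, dtype=float)
    for W in weights:
        s = np.concatenate(([1.0], out)) @ W   # s_j^{(l)} in (2.41)
        out = np.tanh(s)                       # x_j^{(l)} = theta(s_j^{(l)}), (2.42)
    return out[0]                              # d^(L) = 1: a single output g(x)

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4 + 1, 3)), rng.normal(size=(3 + 1, 1))]
print(mlp_forward(weights, [0.2, -0.1, 0.5, 0.9]))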

2.2.2

Given a training set Z = {(x_n, y_n)}_{n=1}^{N}, define the error function

E_n(w) = (g(x_n) - y_n)^2    (2.43)

Training is then an optimization problem, typically solved by Gradient Descent; when a single example n is used per update the method is Stochastic Gradient Descent. Each weight w_{ij}^{(l)} (0 ≤ i ≤ d^{(l-1)}, 1 ≤ j ≤ d^{(l)}, 1 ≤ l ≤ L) is updated as

w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta\, \frac{\partial E_n}{\partial w_{ij}^{(l)}}    (2.44)

which requires the partial derivatives \partial E_n / \partial w_{ij}^{(l)}.

Algorithm 4 (Stochastic Gradient Descent Algorithm)
1: w^{(0)} ← 0
2: for t = 1, 2, . . . , T do
3:   w^{(t)} ← w^{(t-1)} - η ∇E(w^{(t-1)})
4: end for

The partial derivatives are computed with the Backpropagation Algorithm. Since w_{ij}^{(l)} affects E_n only through s_j^{(l)} (and x_j^{(l)} = θ(s_j^{(l)})), the Chain Rule gives

\frac{\partial E_n}{\partial w_{ij}^{(l)}} = \frac{\partial E_n}{\partial s_j^{(l)}} \cdot \frac{\partial s_j^{(l)}}{\partial w_{ij}^{(l)}} = \delta_j^{(l)}\, x_i^{(l-1)}    (2.45)

where \delta_j^{(l)} \equiv \partial E_n / \partial s_j^{(l)}.

For the output layer l = L, E_n = (x_1^{(L)} - y)^2, so

\delta_1^{(L)} = \frac{\partial E_n}{\partial x_1^{(L)}} \cdot \frac{\partial x_1^{(L)}}{\partial s_1^{(L)}} = 2\,(x_1^{(L)} - y)\, \theta'(s_1^{(L)})    (2.46)

For an earlier layer, the δ's of layer l-1 are obtained from those of layer l:

\delta_i^{(l-1)} = \frac{\partial E_n}{\partial s_i^{(l-1)}} = \sum_{j=1}^{d^{(l)}} \frac{\partial E_n}{\partial s_j^{(l)}} \cdot \frac{\partial s_j^{(l)}}{\partial x_i^{(l-1)}} \cdot \frac{\partial x_i^{(l-1)}}{\partial s_i^{(l-1)}} = \sum_{j=1}^{d^{(l)}} \delta_j^{(l)}\, w_{ij}^{(l)}\, \theta'(s_i^{(l-1)})    (2.47)

Figure 2.4.

Since \theta'(s) = 1 - \theta^2(s) and \theta(s_i^{(l-1)}) = x_i^{(l-1)},

\delta_i^{(l-1)} = \big(1 - (x_i^{(l-1)})^2\big) \sum_{j=1}^{d^{(l)}} w_{ij}^{(l)}\, \delta_j^{(l)}    (2.48)
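A minimal sketch of the backward pass implied by (2.45)-(2.48), under the same assumptions and names as the forward-pass sketch above: the output-layer δ follows (2.46), it is propagated backwards with (2.48), and each gradient is δ_j^{(l)} x_i^{(l-1)} as in (2.45).

import numpy as np

def mlp_backprop(weights, x, y):
    # Return the gradients dE_n/dW^{(l)} for E_n = (g(x) - y)^2.
    # Forward pass, keeping every layer's (bias-extended) output.
    outs = [np.concatenate(([1.0], np.asarray(x, dtype=float)))]
    for W in weights:
        outs.append(np.concatenate(([1.0], np.tanh(outs[-1] @ W))))
    g = outs[-1][1]                                   # x_1^{(L)} = g(x)
    # Output-layer delta, eq. (2.46): 2 (g - y) * theta'(s), with theta' = 1 - x^2.
    delta = np.array([2.0 * (g - y) * (1.0 - g ** 2)])
    grads = []
    for l in range(len(weights) - 1, -1, -1):
        grads.insert(0, np.outer(outs[l], delta))     # eq. (2.45)
        if l > 0:
            x_prev = outs[l][1:]                      # drop the constant term
            # eq. (2.48): delta^{(l-1)} = (1 - x^2) * (W[1:, :] @ delta)
            delta = (1.0 - x_prev ** 2) * (weights[l][1:, :] @ delta)
    return grads

rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(4, 1))]
print([gr.shape for gr in mlp_backprop(weights, [0.2, -0.1, 0.5, 0.9], y=1.0)])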

2.3

(Tandem)

(Tandem) [10]
2.3



(Principal Component Analysis; PCA)








1000

(Subword)

(Hybrid System)

(Hermansky)
[10]
30
50

(Skew)

(Log)



[10]



(Testing Set)
(Evaluation)
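The Tandem processing chain described above (an MLP producing per-frame phone posteriors, a log to reduce the skew of the posterior distribution, PCA decorrelation and dimension reduction, and a conventional HMM trained on the transformed features) can be summarized in a short sketch; the function names and the mlp_posteriors model below are placeholders, not components taken from the thesis.

import numpy as np

def tandem_features(frames, mlp_posteriors, pca_mean, pca_basis):
    # frames: (T, d) acoustic features (e.g., MFCC with context windows).
    # mlp_posteriors(frames) -> (T, n_phones) per-frame phone posteriors.
    # Returns (T, k) features to be modeled by a conventional HMM/GMM system.
    post = mlp_posteriors(frames)                # MLP posterior estimates
    logp = np.log(post + 1e-10)                  # log to undo the skew
    return (logp - pca_mean) @ pca_basis         # PCA projection to k dims

# Toy stand-ins so the sketch runs end to end.
rng = np.random.default_rng(0)
fake_mlp = lambda f: np.full((len(f), 48), 1.0 / 48)   # uniform "posteriors"
feats = tandem_features(rng.normal(size=(100, 39)), fake_mlp,
                        pca_mean=np.zeros(48), pca_basis=rng.normal(size=(48, 37)))
print(feats.shape)   # (100, 37)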


2.4

27

(Support Vector Machine; SVM)



[21,22]

3.1

(Linear Classifier)
(Binary Classification Problem)
(
3.1)
B
C

B
(Separating Line)
(Geometric
Distance)
C
A
B

(Support Vector Machine; SVM)

(Margin)



3.1
(w, b)
(
w1 x1 + b = 0
hw, xi + b
h, i
)

(w , b )

(w , b )
hw , xi + b  0
(w , b )
hw , xi + b  0

(w , b )
hw , xi + b
0
0
0
0

0
)

28

y
A

3.1:
f
sign(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ -1 & \text{otherwise} \end{cases}    (3.1)
sign(f (x))
n
(Separating Hyperplane)

(Maximum Margin Classifier)

3.2

(Primal Form)


(Hyperplane)

(Separable Case)

29

Z = \{(x_i, y_i)\}_{i=1}^{N}    (3.2)

where x_i \in R^n and y_i \in \{-1, 1\}.

max Thickness(w, b)
w,b

(3.3)

Thickness
w
b


i = ( distance of xi to hw, xi i + b = 0)

(3.4)





hw, xi i + b
hw,
x
i
+
b
i
= yi
i =
kwk
kwk

(3.5)

Thickness(w, b) = min i
i

(3.6)


max min i ,
i
w,b


hw, xi i + b
s.t. i :i = yi
kwk

(3.7)

kwk = 1

kwk
(
kwk =

30

1
(functional distance))

max min i
i

w,b

s.t. i :i = yn (hw, xi i + b),

(3.8)

kwk = 1
= mini i
i , i

max
w,b

s.t. i : i ,

(3.9)

kwk = 1
i hw, xi i + b

max
w,b

s.t. i :yi (hw, xi i + b) ,

(3.10)

kwk = 1
kwk = 1
kwk
kwk =
1
(non-convex)

= kwk

(
kwk

1
yi (hw, xi i + b)
kwk
)

w,b:w6=0 kwk
max

(3.11)

s.t. i :yi (hw, xi i + b)


kwk = 1

(Object Function)

(w , b )
c
(cw , cb )

31


= 1

kwk

max
w,b

s.t. i :yi (hw, xi i + b) ,

(3.12)

= 1

max
w,b

1
kwk

(3.13)

s.t. i :yi (hw, xi i + b) 1


1
kwk
kwk
kwk

kwk2
21
1

\min_{w,b}\ \frac{1}{2}\|w\|^2    (3.14)
s.t. \forall i:\ y_i(\langle w, x_i \rangle + b) \ge 1



(Primal Form)


(Quadratic Programming)



(Lagrange Duality)
(Dual
Form)

32

3.3

(Lagrange Duality)

(Lagrange
Multiplier)
l

\min_{w}\ f(w)    (3.15)
s.t. \forall i \in \{1, \ldots, k\}:\ g_i(w) \le 0,
     \forall j \in \{1, \ldots, l\}:\ h_j(w) = 0
(Lagrange Function)

L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)    (3.16)

i i (Lagrange Multiplier)

0

L
=0
wi
L
=0
i
L
=0
i

(3.17)
(3.18)
(3.19)

(Stationary Point)(w,)



(Primal Form)
P (w)
P (w) = max L(w, , )
,:i 0

(3.20)

P
min P (w)
w

33

(3.21)

(Primal Form)

w
i

gi (w) > 0
hi (w) 6= 0
i i
i i
P (w) = max L(w, , )
,,i 0

= max

,,i 0

f (w) +

k
X

i gi (w) +

i=1

l
X

!
i hi (w)

(3.22)

i=1

=
w
f

\theta_P(w) = \begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ \infty & \text{otherwise} \end{cases}    (3.23)

P f


min P (w) = min max L(w, , )
w

,:i 0

(3.24)


p = min P (w)
w

(3.25)

(Dual Form)

34


D (, ) = min L(w, , )
w

(3.26)

P w

max D (, ) = max min L(w, , )

,:i 0

,:i 0

(3.27)



d = max D (w)
,:i 0

(3.28)


d^* = \max_{\alpha, \beta: \alpha_i \ge 0} \min_{w} L(w, \alpha, \beta) \;\le\; \min_{w} \max_{\alpha, \beta: \alpha_i \ge 0} L(w, \alpha, \beta) = p^*    (3.29)

f gi (Convex)

hi (Affine)
gi (Feasible)(
w

i : gi (w) < 0) d p w w
p = d = L(w , , )
w , , (Karush-Kuhn-Tucker Condition; KKT Condition)

\frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) = 0, \quad 1 \le i \le n    (3.30)

\frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) = 0, \quad 1 \le i \le l    (3.31)

\alpha_i^*\, g_i(w^*) = 0, \quad 1 \le i \le k    (3.32)

g_i(w^*) \le 0, \quad 1 \le i \le k    (3.33)

\alpha_i^* \ge 0, \quad 1 \le i \le k    (3.34)

35


w , ,

(3.32)

(KKT Dual Complementarity Condition)
(
i > 0
gi (w ) = 0

i > 0
(Active)

)

(Support Vectors)

3.4


1
min kwk2
w,b 2

(3.35)

i : yi (hw, xi i + b 1
gi (w)

gi (w) = yi (hw, xi i + b) + 1 0

(3.36)

X
1
L(w, b, ) = kwk2
i (yi (hw, xi i + b) 1)
2
i=1

(3.37)

i i
min max L(w, b, )
w

,:i 0

36

(3.38)


max min L(w, b, )

,:i 0

(3.39)

L(w, b, )
w
b


w L(w, b, ) = w

N
X

L(w, b, ) =
b

i yi xi = 0

i=1
N
X

i y (i) = 0

(3.40)

(3.41)

i=1

(3.40)

w=

N
X

i yi xi

(3.42)

i=1

(3.42)
(3.37)

L(w, b, ) =

N
X
i=1

N
N
X
1X
i
y i y j i j xi xj b
i yi
2 i,j=1
i=1

(3.43)

(3.41)
0
L(w, b, ) =

N
X

i=1

N
1X
y i y j i j xi xj
2 i,j=1

(3.44)

w
b
L

\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle    (3.45)
s.t. \alpha_i \ge 0\ \forall i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0

hi

37

3.2:
3.7


(3.42)
w w
b
b^* = -\frac{\max_{i: y_i = -1} w^{*T} x_i + \min_{i: y_i = 1} w^{*T} x_i}{2}    (3.46)


y_i(\langle w, x_i \rangle + b) = 1 (Functional Margin)

1) i > 0
3.2

i 0
yi (hw, xi i ) = 1

(Support Vectors)
i > 0

x
x
hw, xi + b

38

0
0
hw, xi + b

\langle w, x \rangle + b = \Big\langle \sum_{i=1}^{N} \alpha_i y_i x_i,\ x \Big\rangle + b    (3.47)
                        = \sum_{i=1}^{N} \alpha_i y_i \langle x_i, x \rangle + b    (3.48)
i > 0

hxi , xi
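Equations (3.47)-(3.48) mean that classifying a new point only needs inner products between it and the training points with α_i > 0 (the support vectors). A minimal sketch, assuming the α's and b have already been found (the names svm_decision and the toy values are illustrative):

import numpy as np

def svm_decision(x, X, y, alpha, b, kernel=np.dot):
    # f(x) = sum_i alpha_i y_i K(x_i, x) + b, restricted to support vectors.
    sv = alpha > 1e-8
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha[sv], y[sv], X[sv])) + b

X = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 0.5]])
y = np.array([1, -1, 1])
alpha = np.array([0.5, 0.5, 0.0])     # the third point is not a support vector
print(np.sign(svm_decision(np.array([0.5, 0.8]), X, y, alpha, b=0.0)))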

3.5

hw, xi + b =

N
X

i yi hxi , xi + b

(3.48)

i=1

x

x2 x3
hx, zi
h(x), (z)i


h(x), (z)i


K(x, z) = \langle \phi(x), \phi(z) \rangle    (3.49)

For example, with x, z \in R^n,

K(x, z) = (x^T z)^2    (3.50)

K(x, z) = \Big( \sum_{i=1}^{n} x_i z_i \Big)\Big( \sum_{j=1}^{n} x_j z_j \Big) = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j z_i z_j = \sum_{i,j=1}^{n} (x_i x_j)(z_i z_j)    (3.51)

so for n = 3 this kernel corresponds to the feature map

\phi(x) = [x_1 x_1,\, x_1 x_2,\, x_1 x_3,\, x_2 x_1,\, x_2 x_2,\, x_2 x_3,\, x_3 x_1,\, x_3 x_2,\, x_3 x_3]^T    (3.52)

Computing \phi(x) explicitly costs O(n^2), while computing K(x, z) directly costs only O(n).
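The identity behind (3.50)-(3.52) is easy to check numerically: for n = 3, the explicit O(n^2) feature map φ gives the same value as the O(n) kernel evaluation. A small sketch (Python/numpy assumed):

import numpy as np

def phi(x):
    # Explicit feature map of (3.52): all pairwise products x_i x_j.
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
k_direct  = (x @ z) ** 2            # K(x, z) = (x^T z)^2, cost O(n)
k_feature = phi(x) @ phi(z)         # <phi(x), phi(z)>, cost O(n^2)
print(k_direct, k_feature)          # both 20.25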

3.6

(Separable)


(Outlier)
3.3
L1 (L1 Regularization)
(Primal Form)

(Slack Variable)i
\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i    (3.53)
s.t. \forall i:\ y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i,\quad \xi_i \ge 0

3.3:
(Functional Margin)
1

1 i Ci C

1


N

X
X
X
1
i
i [yi (hw, xi + b) 1 + i ]
ri i (3.54)
L(w, b, , , r) = hw, wi + C
2
i=1
i=1
i=1
i ri

\alpha_i = 0 \;\Rightarrow\; y_i(\langle w, x_i \rangle + b) \ge 1    (3.55)

\alpha_i = C \;\Rightarrow\; y_i(\langle w, x_i \rangle + b) \le 1    (3.56)

0 < \alpha_i < C \;\Rightarrow\; y_i(\langle w, x_i \rangle + b) = 1    (3.57)

41

L1
With L1 regularization the dual becomes

\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle    (3.58)
s.t. 0 \le \alpha_i \le C\ \forall i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0

L1 i 0

0 i C

3.7

(Sequential Minimal Optimization)

(Sequential Minimal Optimization; SMO)


(Platt)
1998

[23]


(Quadratic
Programming)





(Gradient Descent)

42


max W () =

N
X
i=1

N
1X
yi yj i j hxi , xj i
2 i,j=1

(3.59)

s.t. k, 0 k C
N
X

k yk = 0

k=1

i i j

N
X

k y k = 0

k=1

(3.60)

i y i =

j yj

j6=i

i

5
tol

Algorithm 5
1: repeat
2:   choose a pair (α_i, α_j) by a heuristic
3:   re-optimize W(α) with respect to α_i and α_j, holding all other α's fixed
4: until convergence (the KKT conditions hold within tol)

i j
W ()

i j
(3.60)

i yi + j yj =

k y k

(3.61)

k6=i,j


i y i + j y j =
43

(3.62)

1
C
H

i yi + j yj =

3.4:


3.4
i j
[0, C] [0, C]
i yi + j yj =
j
L
H
(L j H)
(i , j )

L
H

If yi 6= yj , L = max(0, j i ), H = min(C, C + j i )
If yi = yj , L = max(0, i + j C), H = min(C, i + j )
i + j =
i j
i = ( j yj )yi

(3.63)

W (i , j , )
j

W (i , j , ) = W (( j yj yi ), j , ) = aj2 + bj + c

44

(3.64)

Maximizing W(\alpha_i, \alpha_j, \ldots) over the box [0, C] \times [0, C] gives the unconstrained update

\alpha_j \leftarrow \alpha_j - \frac{y_j (E_i - E_j)}{\eta}    (3.65)

where

E_k = f(x_k) - y_k    (3.66)

\eta = 2\langle x_i, x_j \rangle - \langle x_i, x_i \rangle - \langle x_j, x_j \rangle    (3.67)

The updated \alpha_j is then clipped back to the feasible segment:

\alpha_j \leftarrow \begin{cases} H, & \text{if } \alpha_j > H \\ \alpha_j, & \text{if } L \le \alpha_j \le H \\ L, & \text{if } \alpha_j < L \end{cases}    (3.68)

(L and H are as shown in Figure 3.4.) Finally \alpha_i is updated so that the linear constraint stays satisfied:

\alpha_i \leftarrow \alpha_i + y_i y_j (\alpha_j^{(old)} - \alpha_j)    (3.69)

i
i j

b
\alpha_i = 0 \;\Rightarrow\; y_i(\langle w, x_i \rangle + b) \ge 1    (3.70)

\alpha_i = C \;\Rightarrow\; y_i(\langle w, x_i \rangle + b) \le 1    (3.71)

0 < \alpha_i < C \;\Rightarrow\; y_i(\langle w, x_i \rangle + b) = 1    (3.72)

If 0 < \alpha_i < C, the threshold b_1 that makes the condition hold for example i is

b_1 = b - E_i - y_i(\alpha_i - \alpha_i^{(old)})\langle x_i, x_i \rangle - y_j(\alpha_j - \alpha_j^{(old)})\langle x_i, x_j \rangle    (3.73)

If 0 < \alpha_j < C, the analogous threshold for example j is

b_2 = b - E_j - y_i(\alpha_i - \alpha_i^{(old)})\langle x_i, x_j \rangle - y_j(\alpha_j - \alpha_j^{(old)})\langle x_j, x_j \rangle    (3.74)

When both \alpha_i and \alpha_j lie at a bound (\alpha_i = 0 or C, and \alpha_j = 0 or C), any value between b_1 and b_2 is valid, so b is set to

b = \begin{cases} b_1, & \text{if } 0 < \alpha_i < C \\ b_2, & \text{if } 0 < \alpha_j < C \\ \frac{b_1 + b_2}{2}, & \text{otherwise} \end{cases}    (3.75)

3.7
(

i, j

[23]

Require: C, tol, max_passes, Z = {(x_i, y_i)}_{i=1}^{N}
Ensure: α, b
 1: α_i = 0 for all i, b = 0
 2: passes = 0
 3: while passes < max_passes do
 4:   num_changed_alphas = 0
 5:   for i = 1, 2, . . . , N do
 6:     E_i = f(x_i) - y_i
 7:     if (y_i E_i < -tol and α_i < C) or (y_i E_i > tol and α_i > 0) then
 8:       select some j ≠ i
 9:       E_j = f(x_j) - y_j
10:       compute L and H
11:       if L = H then continue with the next i
12:       compute η by (3.67); if η ≥ 0 then continue with the next i
13:       update α_j by (3.65) and clip it by (3.68)
14:       if |α_j - α_j^(old)| < 10^-5 then continue with the next i
15:       update α_i by (3.69)
16:       compute b_1 and b_2 by (3.73)-(3.74) and set b by (3.75)
17:       num_changed_alphas = num_changed_alphas + 1
18:     end if
19:   end for
20:   if num_changed_alphas = 0 then
21:     passes = passes + 1
22:   else
23:     passes = 0
24:   end if
25: end while

47

3.8

48


4.1




(Multiclass Classification)

n

(

n
2

n
2

(
(One Versus One)) k

(
n
)
n

(
(One Versus All))


(Statistical School)

(Parsing)
(

(Parse Tree))
(Bayesian School)


(Structural Support Vector Machine; SVM-struct) [15]


f x X y Y

y
{1, 1}
{1, 2, . . . k}

k
(Sequence)
(String)

49

(Tree)
(Graph)
)

f x
(

)
y(Parse Tree)

Z = \{(x_i, y_i)\}_{i=1}^{n}    (4.1)

xi y i

4.2

(Primal Form)
(Dual Form)





w
x
w

(
4.1
w1 w9 9

w
w4
)
x X
y Y
x
(y

)
y

w
hw, xi + b
y

(x, y)
(
(Conditional
Random Fields; CRF)
(Feature Function)
x
y

50

Figure 4.1: a multiclass problem with nine linear discriminants \langle w_1, x \rangle + b_1, . . . , \langle w_9, x \rangle + b_9.


)

hw, (x, y)i
w
w


(Discriminant
Function)F : X Y R x
y

x
y
(x, y)

F y
x

4.2
4.3
(
F (x, y; w)
x
y
x

x
y
y

x
y
w

4.2
4.3
)

w

(xi , yi )
F (xi , yi ; w)  F (xi , y; w), y 6= yi xi yi
y
w

51

4.2: w

4.3: w

f(x; w) = \arg\max_{y \in \mathcal{Y}} F(x, y; w)    (4.2)

F(x, y; w) = \langle w, \Psi(x, y) \rangle    (4.3)
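The decision rule (4.2)-(4.3) only requires evaluating ⟨w, Ψ(x, y)⟩ for candidate outputs; when Y is small enough to enumerate it is a one-line arg max. A sketch with an illustrative Psi (the function names and toy values are not from the thesis):

import numpy as np

def predict(x, w, candidates, Psi):
    # f(x; w) = argmax_y <w, Psi(x, y)>, eq. (4.2)-(4.3).
    return max(candidates, key=lambda y: w @ Psi(x, y))

# Toy multiclass example: Psi(x, y) places x in the block belonging to class y.
def Psi(x, y, n_classes=3):
    out = np.zeros(n_classes * len(x))
    out[y * len(x):(y + 1) * len(x)] = x
    return out

w = np.array([1.0, 0.0, 0.0, 1.0, -1.0, -1.0])   # 3 classes, 2-dim x
print(predict(np.array([0.2, 0.9]), w, candidates=[0, 1, 2], Psi=Psi))

For sequence outputs, Y is exponentially large and the arg max is instead computed with dynamic programming (the Viterbi algorithm), as discussed in Chapter 5.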

(

(No Free Lunch Theorem)

)

F (xi , yi ; w)  F (xi , y; w), y 6= yi


(Maximum Margin)

52


\gamma_i = F(x_i, y_i; w) - \max_{y \in \mathcal{Y} \setminus y_i} F(x_i, y; w)    (4.4)

\gamma_i = \langle w, \Psi(x_i, y_i) \rangle - \max_{y \in \mathcal{Y} \setminus y_i} \{ \langle w, \Psi(x_i, y) \rangle \}    (4.5)

\max_{y \in \mathcal{Y} \setminus y_i} \{ \langle w, \Psi(x_i, y) \rangle \} < \langle w, \Psi(x_i, y_i) \rangle    (4.6)

\forall i,\ \forall y \in \mathcal{Y} \setminus y_i:\ \langle w, \Psi(x_i, y_i) \rangle - \langle w, \Psi(x_i, y) \rangle > 0    (4.7)

(Writing \delta\Psi_i(y) \equiv \Psi(x_i, y_i) - \Psi(x_i, y), so that \langle w, \delta\Psi_i(y) \rangle = \langle w, \Psi(x_i, y_i) \rangle - \langle w, \Psi(x_i, y) \rangle.)

\min_{w}\ \frac{1}{2}\|w\|^2    (4.8)
s.t. \forall i,\ \forall y \in \mathcal{Y}\setminus y_i:\ \langle w, \delta\Psi_i(y) \rangle \ge 1


(Separable)

\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i    (4.9)
s.t. \forall i:\ \xi_i \ge 0,\quad \forall i,\ \forall y \in \mathcal{Y}\setminus y_i:\ \langle w, \delta\Psi_i(y) \rangle \ge 1 - \xi_i
53

0-1

(Zero-One Loss Function)






: Y Y R
(y, y)

y
y P (x, y)

(x, y)

R_P^{\Delta}(f) = \int_{\mathcal{X} \times \mathcal{Y}} \Delta(y, f(x))\, dP(x, y)    (4.10)

over X × Y with joint distribution P(x, y), and the Empirical Risk on a sample S is

R_S^{\Delta}(f) = \frac{1}{n}\sum_{i=1}^{n} \Delta(y_i, f(x_i))    (4.11)


y 6= yi (yi , y)
(yi , y)

(Slack Variable)


\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i    (4.12)
s.t. \forall i:\ \xi_i \ge 0,\quad \forall i,\ \forall y \in \mathcal{Y}\setminus y_i:\ \langle w, \delta\Psi_i(y) \rangle \ge 1 - \xi_i

Replacing the fixed margin of 1 by the loss \Delta(y_i, y) gives

\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i    (4.13)
s.t. \forall i:\ \xi_i \ge 0,\quad \forall i,\ \forall y \in \mathcal{Y}\setminus y_i:\ \langle w, \delta\Psi_i(y) \rangle \ge \Delta(y_i, y) - \xi_i


n|Y|
|Y|
(Exponential
Order)
(
)




\max_{\alpha}\ \sum_{i,\, y \ne y_i} \alpha_{iy} - \frac{1}{2} \sum_{i,\, y \ne y_i} \sum_{j,\, \bar{y} \ne y_j} \alpha_{iy}\, \alpha_{j\bar{y}}\, \langle \delta\Psi_i(y), \delta\Psi_j(\bar{y}) \rangle    (4.14)
s.t. \forall i,\ \forall y \in \mathcal{Y}\setminus y_i:\ \alpha_{iy} \ge 0
(Lagrange Multiplier)

iy (xi , yi )
y

4.3

(Cutting Plane Method)

(Approximation)

(Cutting Plane Method)

(Optimization)

(Linear Inequality)

55

(Feasible Set)
(Object Function)


(Integer Linear Programming)


max  x_1 + 5 x_2
s.t. x_1 + 10 x_2 \le 20
     x_1 \le 2    (4.15)
     x_1, x_2 \ge 0
     x_1, x_2 \in \mathbb{Z}
(Naive)

(Linear Programming Relaxation)

max  x_1 + 5 x_2
s.t. x_1 + 10 x_2 \le 20    (4.16)
     x_1 \le 2
     x_1, x_2 \ge 0
(2, 1.8)
4.4
(2, 1.8)

(2,2)
(
)

(2, 1)
7

4.4
(0,2)
10

7
3

x1 + 2x2 4

56

(0,2)
x1 + 10x2 = 20

(2,0)

4.4:

(
4.5)
max  x_1 + 5 x_2
s.t. x_1 + 10 x_2 \le 20
     x_1 \le 2    (4.17)
     x_1 + 2 x_2 \le 4
     x_1, x_2 \ge 0

x1 + 2x2 4
(0,2)


(
)

1. (Feasible Integer Set)



2. (Feasible Set)

57

(0,2)
x1 + 10x2 = 20

(2,0)

x1 + 2x2 = 4
x

4.5:


(Convex Optimization)

4.6
(

)
(
)

(
)

(Active Constraint)

4.6

[24]

(x, y)
k(x, y) (x, y)k, y, y Y

58

4.6:

[24]


6
6
(

) 6


(

)




[24]

6

4.7(

)

(
4.7
)

59

(Most-Violated Constraint)
(
4.7

)
y (

y y H(y)

)
i (
)

i 
(

)
y (
4.7

)
(

4.7
)

Algorithm 6 SVM-struct (cutting plane training)
1: S_i ← ∅ for all i = 1, . . . , n
2: repeat
3:   for i = 1, 2, . . . , n do
4:     H(y) ≡ (1 - ⟨δΨ_i(y), w⟩) Δ(y_i, y)
5:     ŷ = arg max_{y∈Y} H(y)
6:     ξ_i = max{0, max_{y∈S_i} H(y)}
7:     if H(ŷ) > ξ_i + ε then
8:       S_i ← S_i ∪ {ŷ}
9:       re-optimize w (solve the dual) over the union S = ∪_i S_i
10:    end if
11:  end for
12: until no S_i changes during a full pass
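The working-set logic of Algorithm 6 can be sketched independently of the QP solver. In the Python sketch below, find_most_violated (loss-augmented inference), H (the violation score of line 4), and solve_qp are placeholders for the thesis's components; the code only illustrates the control flow.

def train_svm_struct(samples, find_most_violated, H, solve_qp, eps, max_iter=100):
    # samples: list of (x_i, y_i); S[i] is the working set of constraints for i.
    S = [set() for _ in samples]
    w = None
    for _ in range(max_iter):
        changed = False
        for i, (x, y) in enumerate(samples):
            y_hat = find_most_violated(w, x, y)                  # line 5
            xi = max([0.0] + [H(w, x, y, yb) for yb in S[i]])    # line 6
            if H(w, x, y, y_hat) > xi + eps:                     # line 7
                S[i].add(y_hat)                                  # line 8
                w = solve_qp(samples, S)                         # line 9
                changed = True
        if not changed:                                          # line 12
            return w, S
    return w, S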

60

4.7:

61

4.4


(Large Margin Principle)
(Conditional Random Field)

(Discriminative Model)

(Discriminative
Training)


[25]

(Bayes Risk)

(Bayesian School)

(Statistical School)
(Logistic
Regression)
(Linear Discriminant Analysis)
(Support
Vector Machine)








62


(Large Margin HMM) [26]



(


)





(Approximation)


TI Digit
(
)

4.5


(x, y)

63

(Parsing)

(Information Retrieval)

64

TIMIT

5.1
5.1.1
TIMIT
(The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus
TIMIT)
TIMIT
(Defense Advanced
Research Projects Agency; DARPA)

(Texas
Instruments; TI)
(Massachusetts Institute of Technology; MIT)

(Stanford Research Institute; SRI)

(TI +
MIT = TIMIT)

TIMIT
6300

(Dialect)
630
10

5.1 10

2
(SA):

Dialect region        Male        Female      Total
dr1 (New England)     31 (63%)    18 (37%)    49 (8%)
dr2 (Northern)        71 (70%)    31 (30%)    102 (16%)
dr3 (North Midland)   79 (77%)    23 (23%)    102 (16%)
dr4 (South Midland)   69 (69%)    31 (31%)    100 (16%)
dr5 (Southern)        62 (63%)    36 (37%)    98 (16%)
dr6 (New York City)   30 (65%)    16 (35%)    46 (7%)
dr7 (Western)         74 (74%)    26 (26%)    100 (16%)
dr8 (Army Brat)       22 (67%)    11 (33%)    33 (5%)
Total                 438 (70%)   192 (30%)   630 (100%)

Table 5.1: speakers in the TIMIT corpus by dialect region and gender.

5
(Phonetically Balanced)
(SX):
450

3
(SI):
(Brown Corpus)
1890
1890

SA

[27]
SA

SA
3696
(462
)

192
(
TIMIT
24

5
SX
3
SI
)

TIMIT
64
(
5.2)
(Phone
Recognition)
[27]
(Glottal Stops)

48
(
5.4)
48

66

Table 5.2: the TIMIT phone symbols with example words: b (bee), d (day), g (gay), p (pea), t (tea), k (key), dx (muddy, dirty), q (bat), jh (joke), ch (choke), s (sea), sh (she), z (zone), zh (azure), f (fin), th (thin), v (van), dh (then), m (mom), n (noon), ng (sing), em (bottom), en (button), eng (washington), nx (winner), l (lay), r (ray), w (way), y (yacht), hh (hay), hv (ahead), el (bottle), iy (beet), ih (bit), eh (bet), ey (bait), ae (bat), aa (bott), aw (bout), ay (bite), ah (but), ao (bought), oy (boy), ow (boat), uh (book), uw (boot), ux (toot), er (bird), ax (about), ix (debit), axr (butter), ax-h (suspect), pau (pause), epi (epenthetic silence), h#, #h (silence), the unvoiced closures pcl, tcl, kcl, qcl, and the voiced closures vcl, bcl, dcl, gcl.

Table 5.3: phone groups not distinguished during scoring: {sil, cl, vcl, epi}, {el, l}, {en, n}, {sh, zh}, {ao, aa}, {ih, ix}, {ah, ax}.

67

48
(
5.3)

A
B
39

5.1.2

(Mel-Frequency Cepstrum Coefficient; MFCC) [28]


(Perceptual Linear Prediction; PLP) [29]
,

5.1a

(10ms
25ms
)
13

(Time Derivatives)
39
13
40

(Filter Banks) (
64 Hz
8k Hz)


(Cepstral Mean Subtraction; CMS) [30]
5.1b

(Time Domain)

(Pre-emphasis)
(Frequency Domain)


(Autoregressive
Coefficient)
13

39

68

Table 5.4: the 48 phones used for training, with the TIMIT symbols folded into each: iy (beat), ih (bit), eh (bet), ae (bat), ix (roses), ax (the), ah (butt), uw/ux (boot), uh (book), ao (bought), aa (cot), ey (bait), ay (bite), oy (boy), aw (bough), ow (boat), er/axr (bird), el (bottle), l (led), r (red), w (wet), y (yet), hh/hv (hay), m/em (mom), n/nx (non), en (button), ng/eng (sing), b (bob), d (dad), g (gag), p (pop), t (tot), k (kick), dx (butter), ch (church), jh (judge), dh (they), z (zoo), zh (measure), v (very), f (fief), th (thief), s (sis), sh (shoe), cl (unvoiced closure: pcl, tcl, kcl, qcl), vcl (voiced closure: bcl, dcl, gcl), epi (epenthetic closure), sil (silence: h#, #h, pau).

Figure 5.1: the feature extraction pipelines: (a) MFCC, (b) PLP.

5.1.3


(HTK) [31];
(QuickNet) [32]; SV M struct [33];

5.1.4


(Hidden Markov Model; HMM)

(Tandem System)

TIMIT
(Free Phone Decoding) (

)

(Phone Accuracy)


5.1.1
TIMIT
64
48

71

(
)
48

(Left-to-Right Continuous Mono-phone HMM)






20
(Expectation Maximization;
EM)
(
)

:
1 2 4 8 16 32
20

5.1.2


(HTK)


48



72

(Hidden Layer)
1000
351


39
39 9 = 351
48
48




(Principal Component Analysis; PCA)

(Eigen Value)
95%
37

5.1.5
(Framework)

(x, y)

(x, y)

aij bj (ot )

(Conditional Random Fields; CRF)

73

(Feature Function)

p(y, x) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)
 = \frac{1}{Z} \exp\Big( \sum_{t} \sum_{i,j \in S} \lambda_{ij}\, \delta(y_t = i)\,\delta(y_{t-1} = j) + \sum_{t} \sum_{i \in S} \sum_{o \in O_W} \mu_{oi}\, \delta(y_t = i)\,\delta(x_t = o) \Big)
 = \frac{1}{Z} \exp\Big( \sum_{t} \sum_{k} \lambda_k\, f_k(y_t, y_{t-1}, x_t) \Big)    (5.1)
x
y
(

(Observable)
)
\lambda_{ij} = \log a_{ji} = \log p(y_t = i \mid y_{t-1} = j), \qquad \mu_{oi} = \log b_i(o) = \log p(x_t = o \mid y_t = i), \qquad Z = 1
f
(x, y)

(Tensor Product)

: RD RK RDK , (a b)i+(j1)D ai bj

(5.2)

c (y) ((y1 , y), (y2 , y), . . . , (yK , y)) {0, 1}K


The joint feature map Ψ(x, y) is

\Psi(x, y) = \begin{pmatrix} \sum_{t=1}^{T} \Phi(x_t) \otimes \Lambda^c(y^t) \\ \eta \sum_{t=1}^{T-1} \Lambda^c(y^t) \otimes \Lambda^c(y^{(t+1)}) \end{pmatrix}    (5.3)-(5.4)

where \Phi(x_t) is the acoustic feature vector (MFCC or PLP) at frame t; the first block \sum_{t=1}^{T} \Phi(x_t) \otimes \Lambda^c(y^t) accumulates emission statistics and the second block \sum_{t=1}^{T-1} \Lambda^c(y^t) \otimes \Lambda^c(y^{(t+1)}) counts label transitions.



74

Figure 5.2: illustration of Ψ(x, y).

As an example, suppose there are 3 labels {A, B, C} (as in Figure 5.2). Ψ(x, y) then collects a Transition Count matrix and an Emission Count matrix. With A at index 0, B at index 1, C at index 2, η = 1, and Φ(x_t) = x_t, one obtains for instance

A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad
B = \begin{pmatrix} 3.2 & 5 \\ 3.7 & 6.4 \\ 0 & 0 \end{pmatrix}    (5.5)

and flattening A and B gives

\Psi(x, y) = \begin{pmatrix} 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 3.2 & 5 & 3.7 & 6.4 & 0 & 0 \end{pmatrix}^T    (5.6)

Ψ(x, y) has k dimensions, and the weight vector w has the same k dimensions.
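The construction in (5.3)-(5.6), emission sums Φ(x_t) ⊗ Λc(y_t) plus transition counts, is straightforward to compute. A small Python/numpy sketch (illustrative names, not the thesis code) that reproduces the 15-dimensional example above:

import numpy as np

def joint_feature(x, y, n_labels, eta=1.0):
    # Psi(x, y) of (5.3)-(5.4): stacked transition counts and emission sums.
    T, d = x.shape
    trans = np.zeros((n_labels, n_labels))
    emit  = np.zeros((n_labels, d))
    for t in range(T):
        emit[y[t]] += x[t]                    # Phi(x_t) (x) Lambda^c(y_t)
        if t + 1 < T:
            trans[y[t], y[t + 1]] += 1.0      # Lambda^c(y_t) (x) Lambda^c(y_{t+1})
    return np.concatenate([eta * trans.ravel(), emit.ravel()])

# Labels A=0, B=1, C=2; four frames of 2-dimensional features.
x = np.array([[1.2, 2.0], [1.7, 3.0], [2.0, 3.4], [2.0, 3.0]])
y = np.array([0, 1, 1, 0])
print(joint_feature(x, y, n_labels=3))   # the 15-dim vector of eq. (5.6)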

5.3

75

5.3:

(Viterbi
Algorithm)
(Most Violated Constraint)

(Active Constraint)

(y, y)

(y, y) =

T
X

(y t , yt )

(5.7)

t=1

(Upper Bound)



76


64
48
(
)

5.1.6

(Phone Accuracy)


H = N - D - S    (5.8)

Corr = \frac{H}{N} \times 100\%    (5.9)

Acc = \frac{H - I}{N} \times 100\%    (5.10)

where H is the number of hits, D the number of deletions, S the number of substitutions, I the number of insertions, and N the number of reference phones (or words, for word recognition). D, S, and I are obtained by aligning the recognized sequence against the reference with the Edit Distance, computed by Algorithm 7.

5.2
5.2.1
(
5.5
5.4)

(Phone Accuray)
5.5

77

Algorithm 7 (Edit Distance)
 1: for i = 0, 1, . . . , m do
 2:   d[i, 0] ← i
 3: end for
 4: for j = 0, 1, . . . , n do
 5:   d[0, j] ← j
 6: end for
 7: for i = 1, 2, . . . , m do
 8:   for j = 1, 2, . . . , n do
 9:     if a[i] = b[j] then
10:       cost ← 0
11:     else
12:       cost ← 1
13:     end if
14:     d[i, j] ← min(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + cost)
15:   end for
16: end for
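Algorithm 7 and the measures (5.8)-(5.10) fit in a few lines of Python; the sketch below additionally backtraces the alignment so that H, D, S, and I can be read off it (illustrative code, not the thesis's HTK-based scoring).

def align_counts(ref, hyp):
    # Levenshtein alignment of Algorithm 7, returning (H, D, S, I).
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    H = D = S = I = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            H += ref[i - 1] == hyp[j - 1]
            S += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            D, i = D + 1, i - 1
        else:
            I, j = I + 1, j - 1
    return H, D, S, I

ref = "sil b iy sil".split()
hyp = "sil b b iy".split()
H, D, S, I = align_counts(ref, hyp)
N = len(ref)
print(100.0 * H / N, 100.0 * (H - I) / N)   # Corr (5.9) and Acc (5.10)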

78

MFCC
PLP

MLP-MFCC
MLP-PLP

(PCA)

PCA-37-MLP-MFCC
PCA-37-MLP-PLP
PCA

Flat Start
Init By Label

TIMIT

(Flat
Start)

(Forced Alignment)

(Init By Label)

Init By Label
Flat Start


62%, 63%
68%, 69%
PCA
1%

70%

79

Figure 5.4.

                      MFCC     PLP      MLP-MFCC  MLP-PLP  PCA-37-MLP-MFCC  PCA-37-MLP-PLP
HMM (Flat Start)      62.47%   62.69%   68.77%    69.25%   70.19%           70.26%
HMM (Init By Label)   63.26%   62.91%   69.26%    69.50%   70.30%           70.42%

Table 5.5: Phone Accuracy of the HMM baselines with the six feature sets.

80

MFCC
PLP

PLP

MFCC

5.5

(Slack Variable)


MLP-MFCC
MLP-PLP
PCA-37-MLP-MFCC
PCA37-MLP-PLP

5.6
5.5

(x, y)

(First-Order Markov)

MFCC
PLP

51%
MFCC
PLP

63%

11%


50%
MFCC
PLP




57%

71.7%

70.42%
1%

81

Figure 5.5.

        MFCC     PLP      MLP-MFCC  MLP-PLP  PCA-37-MLP-MFCC  PCA-37-MLP-PLP
        38.87%   39.08%   57.40%    57.57%   56.71%           56.79%
10      46.55%   44.81%   67.74%    67.65%   67.13%           67.20%
100     49.74%   49.65%   70.86%    70.89%   70.18%           70.19%
1000    51.29%   51.22%   71.71%    71.75%   71.29%           71.32%

Table 5.6: Phone Accuracy of the structural SVM with first-order features.

82

PCA

PCA
PCA


5.7
5.6


5.6


(Second-Order
Markov)
(x, y)

5.8
5.7

MFCC
PLP


10
70.07%

100
1000
5.8

83

Figure 5.6.

        MFCC     PLP      MLP-MFCC  MLP-PLP  PCA-37-MLP-MFCC  PCA-37-MLP-PLP
        38.77%   37.03%   57.54%    57.33%   56.35%           56.07%
10      46.13%   44.71%   67.73%    67.48%   67.19%           66.98%

Table 5.7: Phone Accuracy of the structural SVM.

84

Figure 5.7.

        MFCC     PLP      MLP-MFCC  MLP-PLP  PCA-37-MLP-MFCC  PCA-37-MLP-PLP
        39.03%   39.19%   64.43%    64.25%   63.65%           63.75%
10      46.38%   44.61%   69.94%    70.07%   69.84%           69.91%

Table 5.8: Phone Accuracy of the structural SVM with second-order features.

85

5.8:

5.3

TIMIT


86


6.1




1%






(
)

(

)

6.2



(Conditional Random Field; CRF)


(Discriminative Training)
(

87

88


[1] Lawrence R. Rabiner, A tutorial on hidden Markov models and selected applications
in speech recognition, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
1990.
[2] J. K. Baker, The dragon system - an overview, in IEEE Trans. Acoust. Speech
Signal Process, 1975, pp. 2429.
[3] L. Bahl, P. Brown, P. de Souza, and R. Mercer, Maximum mutual information
estimation of hidden Markov model parameters for speech recognition, in ICASSP
1986, 1986.
[4] B.-H. Juang, W. Chou, and C.-H. Lee, Minimum classification error rate methods for
speech recognition, in IEEE Transactions on Speech and Audio Processing, 1997.
[5] D. Povey and P.C. Woodland, Minimum phone error and i-smoothing for improved
discriminative training, in ICASSP 2002, 2002.
[6] J. Zheng and A. Stolcke, Improved discriminative training using phone lattices,
in Interspeech 2005, 2005.
[7] J. Du, P. Liu, F. K. Soong, J.-L. Zhou, and R.-H. Wang, Minimum divergence based
discriminative training, in Interspeech 2006, 2006.
[8] Daniel Jurafsky and James H. Martin, Speech and Language Processing, Pearson
Education Taiwan Ltd., 2005.

89

[9] Leonard E. Baum and Ted Petrie, Statistical inference for probabilistic functions
of finite state markov chains, The Annals of Mathematical Statistics, vol. 37, no. 6,
pp. 15541563, 1966.
[10] Hynek Hermansky, Daniel P. W. Ellis, and Sangita Sharma, Tandem connectionist feature extraction for conventional HMM systems, in Proc. ICASSP, 2000,
pp. 1635-1638.
[11] Eric Fosler-Lussier and Jeremy Morris, Crandem systems: Conditional random
field acoustic models for hidden Markov models, in Proc. ICASSP, 2008, pp. 4049-4052.
[12] Asela Gunawardana, Milind Mahajan, Alex Acero, and John C. Platt, Hidden
conditional random fields for phone classification, in in Interspeech, 2005, pp.
11171120.
[13] Yun-Hsuan Sung, Constantinos Boulis, Christopher Manning, and Dan Jurafsky,
Regularization, adaptation, and non-independent features improve hidden conditional random fields for phone classification, in IEEE ASRU 2007, 2007, pp. 639
642.
[14] J. Morris and E. Fosler-Lussier, Discriminative phonetic recognition with conditional random fields, in HLT-NAACL Workshop on Computationally Hard Problems
and Joint Inference in Speech and Language Processing, 2006.
[15] Tsochantaridis Ioannis, Hofmann Thomas, Joachims Thorsten, and Altun Yasemin,
Support vector machine learning for interdependent and structured output spaces,
in ICML 04, New York, NY, USA, 2004, p. 104, ACM.
90

[16] Yasemin Altun, Ioannis Tsochantaridis, and Thomas Hofmann, Hidden markov
support vector machines, 2003.
[17] Thorsten Joachims, A support vector method for multivariate performance measures, in ICML 05: Proceedings of the 22nd international conference on Machine
learning, New York, NY, USA, 2005, pp. 377384, ACM.
[18] Yue Yisong, Finley Thomas, Radlinski Filip, and Joachims Thorsten, A support
vector method for optimizing average precision, in SIGIR 07: Proceedings of the
30th annual international ACM SIGIR conference on Research and development in
information retrieval, New York, NY, USA, 2007, pp. 271278, ACM.
[19] Thorsten Joachims, Training linear svms in linear time, in KDD 06: Proceedings
of the 12th ACM SIGKDD international conference on Knowledge discovery and
data mining, New York, NY, USA, 2006, pp. 217226, ACM.
[20] S. Sathiya Keerthi and S. Sundararajan, CRF versus SVM-struct for sequence labeling, Tech. Rep., Yahoo Research Technical Report, 2007.
[21] Vladimir N. Vapnik, The nature of statistical learning theory, Springer-Verlag New
York, Inc., New York, NY, USA, 1995.
[22] Christopher J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, vol. 2, pp. 121167, 1998.
[23] John Platt Microsoft and John C. Platt, Sequential minimal optimization: A fast
algorithm for training support vector machines, Tech. Rep., Advances in Kernel
Methods - Support Vector Learning, 1998.

91

[24] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun, Large margin methods for structured and interdependent output variables,
J. Mach. Learn. Res., vol. 6, pp. 14531484, 2005.
[25] Tom Minka, Discriminative models, not discriminitave training, Tech. Rep., Microsoft Research, 2005.
[26] X. Li, H. Jiang, and C. Liu, Large margin HMMs for speech recognition, in IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP 05), 2005, pp.
513-516.
[27] K. Lee and H. Hon, Speaker-independent phone recognition using hidden Markov
models, in IEEE Transactions on Acoustics, Speech and Signal Processing, 1989,
pp. 1641-1648.
[28] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing, Pearson Education Taiwan Ltd., 2005.
[29] H. Hermansky, B. Hanson, and H. Wakita, Perceptually based linear predictive
analysis of speech, Apr 1985, vol. 10, pp. 509512.
[30] S. Furui, Cepstral analysis technique for automatic speaker verification, Acoustics,
Speech and Signal Processing, IEEE Transactions on, vol. 29, no. 2, pp. 254272,
Apr 1981.
[31] Machine Intelligence Laboratory Cambridge University Engineering Dept. (CUED),
Htk, http://htk.eng.cam.ac.uk.

92

[32] Speech Group International Computer Science Institue, Quicknet, http://


www.icsi.berkeley.edu/Speech/qn.html.
[33] Thorsten Joachims, Svm-hmm, http://www.cs.cornell.edu/People/
tj/svm_light/svm_hmm.html.

93
