
Graduate Institute of Computer Science and Information Engineering

College of Electrical Engineering and Computer Science

National Taiwan University


Master Thesis

Phone Recognition using Structural Support Vector Machine

Meng Chao-Hong

Advisor: Lee Lin-Shan, Ph.D.

June, 2009

Phone Recognition using Structural Support Vector Machine

(Student ID: R96922007)


Abstract

Speech Recognition is conventionally formulated through the Bayes Theorem, which decomposes the problem into an Acoustic Model and a Language Model. The acoustic model is usually a Hidden Markov Model whose parameters are estimated by Maximum Likelihood Estimation, and Discriminative Training methods have been proposed to improve on that criterion.

This thesis applies the Structural Support Vector Machine to phone recognition. On the TIMIT corpus the proposed approach improves Phone Accuracy by about 1% over a Tandem System baseline.

Contents

1 Introduction (1.1, 1.2, 1.3, 1.4, 1.5)
2 Hidden Markov Models, Multi-Layered Perceptrons, and the Tandem System
  2.1 Hidden Markov Model (2.1.1, 2.1.2); 2.2 Multi-Layered Perceptron (2.2.1, 2.2.2); 2.3 Tandem System; 2.4
3 Support Vector Machine
  3.1; 3.2 Primal Form; 3.3 Lagrange Duality; 3.4; 3.5; 3.6; 3.7 Sequential Minimal Optimization; 3.8
4 Structural Support Vector Machine
  4.1; 4.2 Primal Form and Dual Form; 4.3 Cutting Plane Method; 4.4; 4.5
5 Experiments (TIMIT Phone Recognition)
  5.1 (5.1.1-5.1.6); 5.2 (5.2.1); 5.3
6 Conclusion and Future Work (6.1, 6.2)
References


List of Figures: 1.1, 2.1-2.4, 3.1-3.4, 4.1-4.7, 5.1-5.8


List of Tables: 5.1-5.8 (5.1 and 5.2 describe the TIMIT corpus; 5.4 lists the 48 training phones)

1.1

(Rabiner) [1]

(Hidden Markov Model; HMM)
1970

[2]

(Bayes Theorem)

(Acoustic Model)
(Language Model)


(Maximum Likelihood
Estimation; MLE)
(Transcription)

(Posterior Probability)

(Competing Word Sequence)
(Testing Set)

(Likelihood)

1980

IBM
(Bahl) [3]
(Maximum Mutual Information; MMI)

1

(Minimum Classification Error; MCE) [4]



(Minimum Phone Error; MPE) [5]
(Minimum
Phone Frame Error; MPFE) [6]
(Minimum Divergence;
MD) [7]

(Bayes Risk)
(Object
Function)




Y X

P (Y |X)

P (Y )
P (X|Y )


(Bayesian School)

P (Y |X)

(Statistical School)


(
Y )
Y
P (Y |X)

P (Y |X)

(Structural Support Vector


Machine; SVM-struct)
P (Y |X)

1.1:

1.2

[8]
(Radio
Rex)
(
1.1)
1920

(Rex)

500

(Hz)
500
Rex

1940

1940

(Bell Lab)

10

(Patterns)

(Correlation Coefficient)

3

97%
99%

(Phoneme Recognizer)

(Phoneme Transition Probability)

1960


(Feature Extraction Algorithms)

(Fast Fourier Transform; FFT)


(Cepstral)

(Linear Prediction Coding; LPC)

(Speech Coding)

(Warping)

(Speaking Rate)

(Dynamic Programming)



(Match)

1966

(Baum)
[9]

(Carnegie Mellon University)

(Baker)

IBM
(Watson Lab)

(Frederick Jelinek)
(Shannon)
(Information Theory)

IBM
Baker

(Decoding Algorithm)
Baker

(Viterbi)
IBM
(Stack)

20

Resource Management, Wall Street Journal Air Traffic Information System, Broadcast
News, CALLHOME





x

(Observation)
(Feature Vector
Sequence)
X
Y
y
\hat{y} = \arg\max_{y \in \mathcal{Y}} P(y|x)    (1.1)

y
Y P (y|x)
x
y

P (Y |X)

P (Y |X)

(Bayes Theorem)

\hat{y} = \arg\max_{y \in \mathcal{Y}} P(y|x) = \arg\max_{y \in \mathcal{Y}} \frac{P(x|y)\,P(y)}{P(x)}    (1.2)

P (x)

5


\hat{y} = \arg\max_{y \in \mathcal{Y}} P(y|x) = \arg\max_{y \in \mathcal{Y}} P(x|y)\,P(y)    (1.3)

P (X|Y )
P (Y )

P (X|Y )
X

P (Y )
Y
(Acoustic Model)
(Language Model)


P (Y |X)
Y

(Free-Phone Decoding)

1.3

(Component)

(

[10]
[11])

(
[4]
[5])

P (Y |X)

(Hidden Conditional Random Field; HCRF) [12,13]

(Multi-Layered Perceptron; MLP)
[14]

(Structural Support Vector Machine)
6

[15]

(Sequence Tagging) [16]
(
)
[17,18]

[19]


[20]

1.4

TIMIT


(Mel-Frequency Cepstral Coefficient; MFCC)

(Perceptual Linear Prediction; PLP)

(Tandem)

(Posterior Probability)

1%

1.5



(Optimization)

(Tandem)

(Continuous Hidden Markov Model; Continuous-HMM)



(Discrete Hidden Markov
Model; Discrete-HMM)

(Multi-Layered Perceptron; MLP)

2.1

(Hidden Markov Model, HMM)

(Sequence Tagging)
(Sequence Segmentation)

(Part-of-speech Tagging)
(Statistical Language Modeling)

(Natural Language Processing; NLP)

(Discrete)

(Feature Vector Space)
(Finitely Countable)

(State)
(Emission Probability)

(Probability Mass Function)

(Discrete Hidden Markov Model)

(Uncountable)

(Probability Density Function)

2.1.1

(Generative Model)


Algorithm 1 (generating an observation sequence from an HMM)
1: Let the state set be M := {1, 2, . . . , M}
2: Draw the initial state (e.g., from a Uniform Distribution over the states)
3: A is the transition probability matrix of an order-l model
4: The observation alphabet may be finitely countable, infinitely countable, or uncountable
5: for t = 1, . . . , T do
6:   draw state s_t according to the transition probabilities given the previous state(s)
7:   draw observation X_t from the emission distribution of state s_t
8: end for

(
)
1

Figure 2.1: an example HMM with parameters A, B, π, drawn as a finite state machine (FSM); the edges carry the transition probabilities.

2.1
B

l = 1)


Xt t
st t

l
l

l
(l-th-order
Hidden Markov Model; l-th-order HMM)



(First-order Hidden Markov Model; First-order HMM)

11

= {1, 2, . . . , M }
(Alphabet Set)

M 1
M
OW (Observation Alphabet Set)

(Finitely Countable)
(Infinitely Countable)

(Uncountable)

A = {aij }
A
l

(Transition Probability)

A
(Matrix)
aij i

j

a_{ij} = P(s_t = j \mid s_{t-1} = i)    (2.1)

B = {bj (x)}
B

j
(Observation)
x
(Emission
Probability)

b_j(x) = P(X_t = x \mid s_t = j)    (2.2)

b_j(o_k) = P(X_t = o_k \mid s_t = j)    (2.3)

ok k
B
: (

)(
)


(Probability Distribution Family)

12

(Multivariate Gaussian Mixture Density Function)

b_j(x) = \sum_{k=1}^{C} c_{jk}\,\mathcal{N}(x; \mu_{jk}, \Sigma_{jk}) = \sum_{k=1}^{C} c_{jk}\, b_{jk}(x)    (2.4)

where b_{jk}(x) = \mathcal{N}(x; \mu_{jk}, \Sigma_{jk}), \mu_{jk} is the mean and \Sigma_{jk} the covariance of the k-th component of state j, and the mixture weights c_{jk} sum to one:

\sum_{k=1}^{C} c_{jk} = 1    (2.5)

π = {π_i}: π_i is the initial probability of state i, for i = 1, . . . , M. These parameters satisfy

a_{ij} \ge 0,\; b_i(x) \ge 0,\; \pi_i \ge 0, \quad \forall i, j, k    (2.6)

\sum_{j=1}^{M} a_{ij} = 1    (2.7)

\int_{x \in O_W} b_i(x)\,dx = 1    (2.8)

\sum_{i=1}^{M} \pi_i = 1    (2.9)



Figure 2.2: a further HMM example; the numbers on the edges (0.6, 1.0, 0.9, 0.5, 0.4, 0.5, 0.1) are probabilities.


\lambda = (A, B, \pi)    (2.10)

where A, B, and π are as defined above.
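To make the parameterization λ = (A, B, π) and the sampling procedure of Algorithm 1 concrete, the following is a minimal sketch (not from the thesis) of drawing a sequence from a first-order discrete HMM in Python; numpy and the names sample_hmm, A, B, pi are assumptions for illustration only.

import numpy as np

def sample_hmm(A, B, pi, T, rng=None):
    # A: (M, M) transition matrix, B: (M, K) emission probability mass
    # functions (one row per state), pi: (M,) initial distribution.
    rng = np.random.default_rng() if rng is None else rng
    M, K = B.shape
    states, obs = [], []
    s = rng.choice(M, p=pi)                  # initial state ~ pi
    for _ in range(T):
        x = rng.choice(K, p=B[s])            # emit an observation from b_s
        states.append(s)
        obs.append(x)
        s = rng.choice(M, p=A[s])            # move to the next state ~ a_{s,*}
    return np.array(states), np.array(obs)

# A toy model with M = 2 states and K = 3 observation symbols.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(sample_hmm(A, B, pi, T=10))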

2.1.2

(Training Set)Z
1
N

Z = \{X^{(n)}\}_{n=1}^{N}    (2.11)
(n)

(Feature Vector Sequence)


(Xi )

n
i

(Maximum Likelihood Estimation)

N (Likelihood)
:
\prod_{n=1}^{N} P(X^{(n)}; \lambda)    (2.12)

(Log-Likelihood)
:
l(\lambda) = \sum_{n=1}^{N} \log P(X^{(n)}; \lambda)    (2.13)

(Fit)

(Hidden Variable)

(Expectation Maximization)

:
l(\lambda) = \sum_{n=1}^{N} \log P(X^{(n)}; \lambda)
 = \sum_{n=1}^{N} \log \Big( \sum_{S^{(n)}} P(X^{(n)}, S^{(n)}; \lambda) \Big)
 = \sum_{n=1}^{N} \log \Big( \sum_{S^{(n)}} \sum_{K^{(n)}} P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda) \Big)    (2.14)

Qn (S (n) , K (n) )

S (n) (State Sequence)
K (n)

\sum_{S^{(n)}} \sum_{K^{(n)}} Q_n(S^{(n)}, K^{(n)}) = 1    (2.15)

Q_n(S^{(n)}, K^{(n)}) \ge 0    (2.16)

Using Jensen's Inequality, a Lower Bound of the log-likelihood is obtained:

l(\lambda) = \sum_{n=1}^{N} \log \Big( \sum_{S^{(n)}} \sum_{K^{(n)}} Q_n(S^{(n)}, K^{(n)}) \frac{P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda)}{Q_n(S^{(n)}, K^{(n)})} \Big)
 \ge \sum_{n=1}^{N} \sum_{S^{(n)}} \sum_{K^{(n)}} Q_n(S^{(n)}, K^{(n)}) \log \frac{P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda)}{Q_n(S^{(n)}, K^{(n)})}    (2.17)
 = \sum_{n=1}^{N} \sum_{S^{(n)}} \sum_{K^{(n)}} \Big[ Q_n(S^{(n)}, K^{(n)}) \log P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda) - Q_n(S^{(n)}, K^{(n)}) \log Q_n(S^{(n)}, K^{(n)}) \Big]


Equality in (2.17) holds when the ratio is a constant,

\frac{P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda)}{Q_n(S^{(n)}, K^{(n)})} = c    (2.18)

P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda) \propto Q_n(S^{(n)}, K^{(n)})    (2.19)

and since \sum_{S^{(n)}} \sum_{K^{(n)}} Q_n(S^{(n)}, K^{(n)}) = 1,

Q_n(S^{(n)}, K^{(n)}) = \frac{P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda)}{\sum_{S^{(n)}} \sum_{K^{(n)}} P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda)}
 = \frac{P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda)}{P(X^{(n)}; \lambda)}    (2.20)
 = P(S^{(n)}, K^{(n)} \mid X^{(n)}; \lambda)

so Q_n(S^{(n)}, K^{(n)}) = P(S^{(n)}, K^{(n)} \mid X^{(n)}; \lambda).

Substituting Q_n back into (2.17), the second term

-\sum_{n=1}^{N} \sum_{S^{(n)}} \sum_{K^{(n)}} Q_n(S^{(n)}, K^{(n)}) \log Q_n(S^{(n)}, K^{(n)})    (2.21)

does not depend on the model parameters λ once Q_n is fixed, so maximizing the lower bound amounts to maximizing

\sum_{n=1}^{N} \sum_{S^{(n)}} \sum_{K^{(n)}} Q_n(S^{(n)}, K^{(n)}) \log P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda)    (2.22)

that is,

\hat{\lambda} = \arg\max_{\lambda} \sum_{n=1}^{N} \sum_{S^{(n)}} \sum_{K^{(n)}} Q_n(S^{(n)}, K^{(n)}) \log P(X^{(n)}, S^{(n)}, K^{(n)}; \lambda)    (2.23)


The complete-data likelihood factorizes as

P(X^{(n)}, S^{(n)} \mid \lambda) = \prod_{t=1}^{T} a_{s_{t-1} s_t}\, b_{s_t}(X_t^{(n)}), \qquad
P(X^{(n)}, S^{(n)}, K^{(n)} \mid \lambda) = \prod_{t=1}^{T} a_{s_{t-1} s_t}\, c_{s_t k_t}\, b_{s_t k_t}(X_t^{(n)})    (2.24)

so that

\log P(X^{(n)}, S^{(n)}, K^{(n)} \mid \lambda) = \sum_{t=1}^{T} \log a_{s_{t-1} s_t} + \sum_{t=1}^{T} \log b_{s_t k_t}(x_t^{(n)}) + \sum_{t=1}^{T} \log c_{s_t k_t}    (2.25)

Substituting (2.24)-(2.25) into the objective from (2.17):

\sum_{n=1}^{N} \sum_{S^{(n)}} \sum_{K^{(n)}} P(S^{(n)}, K^{(n)} \mid X^{(n)}; \lambda) \log P(X^{(n)}, S^{(n)}, K^{(n)} \mid \lambda)
 = \sum_{n=1}^{N} \sum_{i=1}^{M} \sum_{j=1}^{M} \sum_{t=1}^{T} P(s_{t-1}=i, s_t=j \mid X^{(n)}; \lambda) \log a_{ij}
 + \sum_{n=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{C} \sum_{t=1}^{T} P(s_t=j, k_t=k \mid X^{(n)}; \lambda) \log b_{jk}(x_t^{(n)})
 + \sum_{n=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{C} \sum_{t=1}^{T} P(s_t=j, k_t=k \mid X^{(n)}; \lambda) \log c_{jk}    (2.26)



Each of the three terms in (2.26) has the form

F(x) = \sum_i y_i \log x_i    (2.27)

subject to \sum_i x_i = 1. Using a Lagrange Multiplier, the maximizer is

x_i = \frac{y_i}{\sum_i y_i}    (2.28)

Applying this to (2.26) gives the re-estimation formulas

\hat{a}_{ij} = \frac{\sum_{n=1}^{N} \sum_{t=1}^{T} \xi_t^{(n)}(i,j)}{\sum_{j=1}^{M} \sum_{n=1}^{N} \sum_{t=1}^{T} \xi_t^{(n)}(i,j)}    (2.29)

\hat{c}_{jk} = \frac{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_t^{(n)}(j,k)}{\sum_{k=1}^{C} \sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_t^{(n)}(j,k)}    (2.30)

where

\gamma_t^{(n)}(j,k) = P(s_t = j, k_t = k \mid X^{(n)}; \lambda)    (2.31)

\xi_t^{(n)}(i,j) = P(s_{t-1} = i, s_t = j \mid X^{(n)}; \lambda)    (2.32)

For the Gaussian parameters {\mu_{jk}, \Sigma_{jk}}, differentiating with the Chain Rule and setting the gradients to zero gives

\hat{\mu}_{jk} = \frac{\sum_{t=1}^{T} P(s_t=j, k_t=k \mid X^{(n)}; \lambda)\, x_t}{\sum_{t=1}^{T} P(s_t=j, k_t=k \mid X^{(n)}; \lambda)} = \frac{\sum_{t=1}^{T} \gamma_t^{(n)}(j,k)\, x_t}{\sum_{t=1}^{T} \gamma_t^{(n)}(j,k)}    (2.33)

\hat{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} P(s_t=j, k_t=k \mid X^{(n)}; \lambda)\,(x_t - \hat{\mu}_{jk})(x_t - \hat{\mu}_{jk})^T}{\sum_{t=1}^{T} P(s_t=j, k_t=k \mid X^{(n)}; \lambda)} = \frac{\sum_{t=1}^{T} \gamma_t^{(n)}(j,k)\,(x_t - \hat{\mu}_{jk})(x_t - \hat{\mu}_{jk})^T}{\sum_{t=1}^{T} \gamma_t^{(n)}(j,k)}    (2.34)

Algorithm 2 (Forward Algorithm)
1: for i = 1, . . . , M do
2:   α_1(i) = π_i b_i(X_1)
3: end for
4: for t = 2, . . . , T do
5:   for j = 1, . . . , M do
6:     α_t(j) = [ Σ_{i=1}^{M} α_{t-1}(i) a_{ij} ] b_j(X_t)
7:   end for
8: end for
9: P(X; λ) = Σ_{i=1}^{M} α_T(i)
To compute \gamma_t^{(n)}(j,k) and \xi_t^{(n)}(i,j), the Baum-Welch Algorithm uses the forward and backward recursions of Algorithms 2 and 3. The forward variable of Algorithm 2 is

\alpha_t(i) = P(X_1^t, s_t = i; \lambda)    (2.35)

and the backward variable of Algorithm 3 is

\beta_t(i) = P(X_{t+1}^T \mid s_t = i; \lambda)    (2.36)

from which

\xi_t^{(n)}(i,j) = \frac{\sum_{k=1}^{C} \alpha_{t-1}(i)\, a_{ij}\, c_{jk}\, b_{jk}(x_t)\, \beta_t(j)}{\sum_{i=1}^{M} \alpha_T(i)}    (2.37)

\gamma_t^{(n)}(j,k) = \frac{\sum_{i=1}^{M} \alpha_{t-1}(i)\, a_{ij}\, c_{jk}\, b_{jk}(x_t)\, \beta_t(j)}{\sum_{i=1}^{M} \alpha_T(i)}    (2.38)

With \gamma_t^{(n)}(j,k) and \xi_t^{(n)}(i,j) available, the re-estimation formulas (2.29)-(2.34) can be evaluated.

Algorithm 3 (Backward Algorithm)
1: β_T(i) = 1/M
2: for t = T-1, . . . , 1 do
3:   for i = 1, . . . , M do
4:     β_t(i) = Σ_{j=1}^{M} a_{ij} b_j(X_{t+1}) β_{t+1}(j)
5:   end for
6: end for
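As a concrete companion to Algorithms 2 and 3 and to the posteriors in (2.35)-(2.38), the following is a small Python/numpy sketch (illustrative, not the thesis code) of the forward and backward recursions for a discrete HMM; it uses a plain emission matrix B rather than Gaussian mixtures, and initializes β_T(i) = 1/M as Algorithm 3 does.

import numpy as np

def forward(A, B, pi, x):
    # Algorithm 2: alpha[t, i] accumulates P(X_1..X_t, s_t = i).
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    return alpha                      # P(X; lambda) = alpha[-1].sum()

def backward(A, B, x, M):
    # Algorithm 3: beta[t, i] accumulates P(X_{t+1}..X_T | s_t = i).
    T = len(x)
    beta = np.zeros((T, M))
    beta[-1] = 1.0 / M                # beta_T(i) = 1/M, as in the thesis
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta

def gamma(alpha, beta):
    # State posteriors P(s_t = i | X), the single-component analogue of (2.38).
    g = alpha * beta
    return g / g.sum(axis=1, keepdims=True)

A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
x  = np.array([0, 1, 2, 2, 0])
a, b = forward(A, B, pi, x), backward(A, B, x, M=2)
print(gamma(a, b))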

2.2 Multi-Layered Perceptron

(Neuroscience)

2.3a





F(x) = \theta\Big( \sum_{i} w_i x_i \Big)    (2.39)

Figure 2.3: a multi-layered perceptron with an input layer (Input #1-#4), a hidden layer, and an output layer.
x

(Transformation)
wi
(Weight)
2.3b


(Neural Network)

(Acyclic)

(Multi-Layered Perceptron)

(Feed-Forward Neural Network)



2.3
(Classification)

(Regression)

(Binary Classification)

(Univariate Regression)

(Threshold)



21

2.2.1

The network approximates a target function f : R^d → R by a function g built from L layers of neurons. Layer l (1 ≤ l ≤ L) has d^{(l)} neurons and receives the d^{(l-1)} outputs of layer l-1, with d^{(0)} = d and d^{(L)} = 1. Let x_i^{(l-1)} (1 ≤ i ≤ d^{(l-1)}) denote the outputs of layer l-1, x_j^{(l)} (1 ≤ j ≤ d^{(l)}) the outputs of layer l, and w_{ij}^{(l)} (1 ≤ l ≤ L, 0 ≤ i ≤ d^{(l-1)}, 1 ≤ j ≤ d^{(l)}) the weights into layer l. Neuron j of layer l computes

x_j^{(l)} = \theta\Big( \sum_{i=0}^{d^{(l-1)}} w_{ij}^{(l)} x_i^{(l-1)} \Big)    (2.40)

where the activation is \theta(s) = \tanh(s) = \frac{e^s - e^{-s}}{e^s + e^{-s}}. Writing the weighted sum as

s_j^{(l)} = \sum_{i=0}^{d^{(l-1)}} w_{ij}^{(l)} x_i^{(l-1)}    (2.41)

we have

x_j^{(l)} = \theta(s_j^{(l)})    (2.42)

To evaluate g(x), set the layer-0 outputs x_1^{(0)}, . . . , x_{d^{(0)}}^{(0)} to the input x, propagate through the L layers, and take the single output of layer L as x_1^{(L)} = g(x).
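The forward computation of (2.40)-(2.42) can be written compactly; the sketch below (illustrative names, assuming numpy) stacks the layer weights w^{(l)} into matrices, prepends the constant x_0 = 1 term implied by the index i = 0, and applies tanh at every layer.

import numpy as np

def mlp_forward(weights, x):
    # weights[l] has shape (d^(l-1) + 1, d^(l)); row 0 holds the weights
    # w_{0j}^{(l)} that multiply the constant input x_0^{(l-1)} = 1.
    out = np.asarray(x, dtype=float)
    for W in weights:
        s = np.concatenate(([1.0], out)) @ W   # s_j^{(l)} in (2.41)
        out = np.tanh(s)                       # x_j^{(l)} = theta(s_j^{(l)}), (2.42)
    return out[0]                              # d^(L) = 1: a single output g(x)

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4 + 1, 3)), rng.normal(size=(3 + 1, 1))]
print(mlp_forward(weights, [0.2, -0.1, 0.5, 0.9]))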

2.2.2

Given a training set Z = {(x_n, y_n)}_{n=1}^{N}, define the error function

E_n(w) = (g(x_n) - y_n)^2    (2.43)

Training is then an optimization problem, typically solved by Gradient Descent; when a single example n is used per update the method is Stochastic Gradient Descent. Each weight w_{ij}^{(l)} (0 ≤ i ≤ d^{(l-1)}, 1 ≤ j ≤ d^{(l)}, 1 ≤ l ≤ L) is updated as

w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta\, \frac{\partial E_n}{\partial w_{ij}^{(l)}}    (2.44)

which requires the partial derivatives \partial E_n / \partial w_{ij}^{(l)}.

Algorithm 4 (Stochastic Gradient Descent Algorithm)
1: w^{(0)} ← 0
2: for t = 1, 2, . . . , T do
3:   w^{(t)} ← w^{(t-1)} - η ∇E(w^{(t-1)})
4: end for

The partial derivatives are computed with the Backpropagation Algorithm. Since w_{ij}^{(l)} affects E_n only through s_j^{(l)} (and x_j^{(l)} = θ(s_j^{(l)})), the Chain Rule gives

\frac{\partial E_n}{\partial w_{ij}^{(l)}} = \frac{\partial E_n}{\partial s_j^{(l)}} \cdot \frac{\partial s_j^{(l)}}{\partial w_{ij}^{(l)}} = \delta_j^{(l)}\, x_i^{(l-1)}    (2.45)

where \delta_j^{(l)} \equiv \partial E_n / \partial s_j^{(l)}.

For the output layer l = L, E_n = (x_1^{(L)} - y)^2, so

\delta_1^{(L)} = \frac{\partial E_n}{\partial x_1^{(L)}} \cdot \frac{\partial x_1^{(L)}}{\partial s_1^{(L)}} = 2\,(x_1^{(L)} - y)\, \theta'(s_1^{(L)})    (2.46)

For an earlier layer, the δ's of layer l-1 are obtained from those of layer l:

\delta_i^{(l-1)} = \frac{\partial E_n}{\partial s_i^{(l-1)}} = \sum_{j=1}^{d^{(l)}} \frac{\partial E_n}{\partial s_j^{(l)}} \cdot \frac{\partial s_j^{(l)}}{\partial x_i^{(l-1)}} \cdot \frac{\partial x_i^{(l-1)}}{\partial s_i^{(l-1)}} = \sum_{j=1}^{d^{(l)}} \delta_j^{(l)}\, w_{ij}^{(l)}\, \theta'(s_i^{(l-1)})    (2.47)

Figure 2.4.

Since \theta'(s) = 1 - \theta^2(s) and \theta(s_i^{(l-1)}) = x_i^{(l-1)},

\delta_i^{(l-1)} = \big(1 - (x_i^{(l-1)})^2\big) \sum_{j=1}^{d^{(l)}} w_{ij}^{(l)}\, \delta_j^{(l)}    (2.48)
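A minimal sketch of the backward pass implied by (2.45)-(2.48), under the same assumptions and names as the forward-pass sketch above: the output-layer δ follows (2.46), it is propagated backwards with (2.48), and each gradient is δ_j^{(l)} x_i^{(l-1)} as in (2.45).

import numpy as np

def mlp_backprop(weights, x, y):
    # Return the gradients dE_n/dW^{(l)} for E_n = (g(x) - y)^2.
    # Forward pass, keeping every layer's (bias-extended) output.
    outs = [np.concatenate(([1.0], np.asarray(x, dtype=float)))]
    for W in weights:
        outs.append(np.concatenate(([1.0], np.tanh(outs[-1] @ W))))
    g = outs[-1][1]                                   # x_1^{(L)} = g(x)
    # Output-layer delta, eq. (2.46): 2 (g - y) * theta'(s), with theta' = 1 - x^2.
    delta = np.array([2.0 * (g - y) * (1.0 - g ** 2)])
    grads = []
    for l in range(len(weights) - 1, -1, -1):
        grads.insert(0, np.outer(outs[l], delta))     # eq. (2.45)
        if l > 0:
            x_prev = outs[l][1:]                      # drop the constant term
            # eq. (2.48): delta^{(l-1)} = (1 - x^2) * (W[1:, :] @ delta)
            delta = (1.0 - x_prev ** 2) * (weights[l][1:, :] @ delta)
    return grads

rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(4, 1))]
print([gr.shape for gr in mlp_backprop(weights, [0.2, -0.1, 0.5, 0.9], y=1.0)])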

2.3

(Tandem)

(Tandem) [10]
2.3



(Principal Component Analysis; PCA)








1000

(Subword)

(Hybrid System)

(Hermansky)
[10]
30
50

(Skew)

(Log)



[10]



(Testing Set)
(Evaluation)
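The Tandem processing chain described above (an MLP producing per-frame phone posteriors, a log to reduce the skew of the posterior distribution, PCA decorrelation and dimension reduction, and a conventional HMM trained on the transformed features) can be summarized in a short sketch; the function names and the mlp_posteriors model below are placeholders, not components taken from the thesis.

import numpy as np

def tandem_features(frames, mlp_posteriors, pca_mean, pca_basis):
    # frames: (T, d) acoustic features (e.g., MFCC with context windows).
    # mlp_posteriors(frames) -> (T, n_phones) per-frame phone posteriors.
    # Returns (T, k) features to be modeled by a conventional HMM/GMM system.
    post = mlp_posteriors(frames)                # MLP posterior estimates
    logp = np.log(post + 1e-10)                  # log to undo the skew
    return (logp - pca_mean) @ pca_basis         # PCA projection to k dims

# Toy stand-ins so the sketch runs end to end.
rng = np.random.default_rng(0)
fake_mlp = lambda f: np.full((len(f), 48), 1.0 / 48)   # uniform "posteriors"
feats = tandem_features(rng.normal(size=(100, 39)), fake_mlp,
                        pca_mean=np.zeros(48), pca_basis=rng.normal(size=(48, 37)))
print(feats.shape)   # (100, 37)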


2.4

27

(Support Vector Machine; SVM)



[21,22]

3.1

(Linear Classifier)
(Binary Classification Problem)
(
3.1)
B
C

B
(Separating Line)
(Geometric
Distance)
C
A
B

(Support Vector Machine; SVM)

(Margin)



3.1
(w, b)
(
w1 x1 + b = 0
hw, xi + b
h, i
)

(w , b )

(w , b )
hw , xi + b  0
(w , b )
hw , xi + b  0

(w , b )
hw , xi + b
0
0
0
0

0
)

28

y
A

3.1:
f
sign(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ -1 & \text{otherwise} \end{cases}    (3.1)
sign(f (x))
n
(Separating Hyperplane)

(Maximum Margin Classifier)

3.2

(Primal Form)


(Hyperplane)

(Separable Case)

29

Z = \{(x_i, y_i)\}_{i=1}^{N}    (3.2)

where x_i \in R^n and y_i \in \{-1, 1\}.

max Thickness(w, b)
w,b

(3.3)

Thickness
w
b


i = ( distance of xi to hw, xi i + b = 0)

(3.4)





hw, xi i + b
hw,
x
i
+
b
i
= yi
i =
kwk
kwk

(3.5)

Thickness(w, b) = min i
i

(3.6)


max min i ,
i
w,b


hw, xi i + b
s.t. i :i = yi
kwk

(3.7)

kwk = 1

kwk
(
kwk =

30

1
(functional distance))

max min i
i

w,b

s.t. i :i = yn (hw, xi i + b),

(3.8)

kwk = 1
= mini i
i , i

max
w,b

s.t. i : i ,

(3.9)

kwk = 1
i hw, xi i + b

max
w,b

s.t. i :yi (hw, xi i + b) ,

(3.10)

kwk = 1
kwk = 1
kwk
kwk =
1
(non-convex)

= kwk

(
kwk

1
yi (hw, xi i + b)
kwk
)

w,b:w6=0 kwk
max

(3.11)

s.t. i :yi (hw, xi i + b)


kwk = 1

(Object Function)

(w , b )
c
(cw , cb )

31


= 1

kwk

max
w,b

s.t. i :yi (hw, xi i + b) ,

(3.12)

= 1

max
w,b

1
kwk

(3.13)

s.t. i :yi (hw, xi i + b) 1


1
kwk
kwk
kwk

kwk2
21
1

\min_{w,b}\ \frac{1}{2}\|w\|^2    (3.14)
s.t. \forall i:\ y_i(\langle w, x_i \rangle + b) \ge 1



(Primal Form)


(Quadratic Programming)



(Lagrange Duality)
(Dual
Form)

32

3.3

(Lagrange Duality)

(Lagrange
Multiplier)
l

\min_{w}\ f(w)    (3.15)
s.t. \forall i \in \{1, \ldots, k\}:\ g_i(w) \le 0,
     \forall j \in \{1, \ldots, l\}:\ h_j(w) = 0
(Lagrange Function)

L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)    (3.16)

i i (Lagrange Multiplier)

0

L
=0
wi
L
=0
i
L
=0
i

(3.17)
(3.18)
(3.19)

(Stationary Point)(w,)



(Primal Form)
P (w)
P (w) = max L(w, , )
,:i 0

(3.20)

P
min P (w)
w

33

(3.21)

(Primal Form)

w
i

gi (w) > 0
hi (w) 6= 0
i i
i i
P (w) = max L(w, , )
,,i 0

= max

,,i 0

f (w) +

k
X

i gi (w) +

i=1

l
X

!
i hi (w)

(3.22)

i=1

=
w
f

\theta_P(w) = \begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ \infty & \text{otherwise} \end{cases}    (3.23)

P f


min P (w) = min max L(w, , )
w

,:i 0

(3.24)


p = min P (w)
w

(3.25)

(Dual Form)

34


D (, ) = min L(w, , )
w

(3.26)

P w

max D (, ) = max min L(w, , )

,:i 0

,:i 0

(3.27)



d = max D (w)
,:i 0

(3.28)


d^* = \max_{\alpha, \beta: \alpha_i \ge 0} \min_{w} L(w, \alpha, \beta) \;\le\; \min_{w} \max_{\alpha, \beta: \alpha_i \ge 0} L(w, \alpha, \beta) = p^*    (3.29)

f gi (Convex)

hi (Affine)
gi (Feasible)(
w

i : gi (w) < 0) d p w w
p = d = L(w , , )
w , , (Karush-Kuhn-Tucker Condition; KKT Condition)

\frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) = 0, \quad 1 \le i \le n    (3.30)

\frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) = 0, \quad 1 \le i \le l    (3.31)

\alpha_i^*\, g_i(w^*) = 0, \quad 1 \le i \le k    (3.32)

g_i(w^*) \le 0, \quad 1 \le i \le k    (3.33)

\alpha_i^* \ge 0, \quad 1 \le i \le k    (3.34)

35


w , ,

(3.32)

(KKT Dual Complementarity Condition)
(
i > 0
gi (w ) = 0

i > 0
(Active)

)

(Support Vectors)

3.4


1
min kwk2
w,b 2

(3.35)

i : yi (hw, xi i + b 1
gi (w)

gi (w) = yi (hw, xi i + b) + 1 0

(3.36)

X
1
L(w, b, ) = kwk2
i (yi (hw, xi i + b) 1)
2
i=1

(3.37)

i i
min max L(w, b, )
w

,:i 0

36

(3.38)


max min L(w, b, )

,:i 0

(3.39)

L(w, b, )
w
b


w L(w, b, ) = w

N
X

L(w, b, ) =
b

i yi xi = 0

i=1
N
X

i y (i) = 0

(3.40)

(3.41)

i=1

(3.40)

w=

N
X

i yi xi

(3.42)

i=1

(3.42)
(3.37)

L(w, b, ) =

N
X
i=1

N
N
X
1X
i
y i y j i j xi xj b
i yi
2 i,j=1
i=1

(3.43)

(3.41)
0
L(w, b, ) =

N
X

i=1

N
1X
y i y j i j xi xj
2 i,j=1

(3.44)

w
b
L

\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle    (3.45)
s.t. \alpha_i \ge 0\ \forall i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0

hi

37

3.2:
3.7


(3.42)
w w
b
b^* = -\frac{\max_{i: y_i = -1} w^{*T} x_i + \min_{i: y_i = 1} w^{*T} x_i}{2}    (3.46)


y_i(\langle w, x_i \rangle + b) = 1 (Functional Margin)

1) i > 0
3.2

i 0
yi (hw, xi i ) = 1

(Support Vectors)
i > 0

x
x
hw, xi + b

38

0
0
hw, xi + b

\langle w, x \rangle + b = \Big\langle \sum_{i=1}^{N} \alpha_i y_i x_i,\ x \Big\rangle + b    (3.47)
                        = \sum_{i=1}^{N} \alpha_i y_i \langle x_i, x \rangle + b    (3.48)
i > 0

hxi , xi
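Equations (3.47)-(3.48) mean that classifying a new point only needs inner products between it and the training points with α_i > 0 (the support vectors). A minimal sketch, assuming the α's and b have already been found (the names svm_decision and the toy values are illustrative):

import numpy as np

def svm_decision(x, X, y, alpha, b, kernel=np.dot):
    # f(x) = sum_i alpha_i y_i K(x_i, x) + b, restricted to support vectors.
    sv = alpha > 1e-8
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha[sv], y[sv], X[sv])) + b

X = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 0.5]])
y = np.array([1, -1, 1])
alpha = np.array([0.5, 0.5, 0.0])     # the third point is not a support vector
print(np.sign(svm_decision(np.array([0.5, 0.8]), X, y, alpha, b=0.0)))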

3.5

hw, xi + b =

N
X

i yi hxi , xi + b

(3.48)

i=1

x

x2 x3
hx, zi
h(x), (z)i


h(x), (z)i


K(x, z) = \langle \phi(x), \phi(z) \rangle    (3.49)

For example, with x, z \in R^n,

K(x, z) = (x^T z)^2    (3.50)

K(x, z) = \Big( \sum_{i=1}^{n} x_i z_i \Big)\Big( \sum_{j=1}^{n} x_j z_j \Big) = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j z_i z_j = \sum_{i,j=1}^{n} (x_i x_j)(z_i z_j)    (3.51)

so for n = 3 this kernel corresponds to the feature map

\phi(x) = [x_1 x_1,\, x_1 x_2,\, x_1 x_3,\, x_2 x_1,\, x_2 x_2,\, x_2 x_3,\, x_3 x_1,\, x_3 x_2,\, x_3 x_3]^T    (3.52)

Computing \phi(x) explicitly costs O(n^2), while computing K(x, z) directly costs only O(n).
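The identity behind (3.50)-(3.52) is easy to check numerically: for n = 3, the explicit O(n^2) feature map φ gives the same value as the O(n) kernel evaluation. A small sketch (Python/numpy assumed):

import numpy as np

def phi(x):
    # Explicit feature map of (3.52): all pairwise products x_i x_j.
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
k_direct  = (x @ z) ** 2            # K(x, z) = (x^T z)^2, cost O(n)
k_feature = phi(x) @ phi(z)         # <phi(x), phi(z)>, cost O(n^2)
print(k_direct, k_feature)          # both 20.25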

3.6

(Separable)


(Outlier)
3.3
L1 (L1 Regularization)
(Primal Form)

(Slack Variable)i
\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i    (3.53)
s.t. \forall i:\ y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i,\quad \xi_i \ge 0

3.3:
(Functional Margin)
1

1 i Ci C

1


N

X
X
X
1
i
i [yi (hw, xi + b) 1 + i ]
ri i (3.54)
L(w, b, , , r) = hw, wi + C
2
i=1
i=1
i=1
i ri

\alpha_i = 0 \;\Rightarrow\; y_i(\langle w, x_i \rangle + b) \ge 1    (3.55)

\alpha_i = C \;\Rightarrow\; y_i(\langle w, x_i \rangle + b) \le 1    (3.56)

0 < \alpha_i < C \;\Rightarrow\; y_i(\langle w, x_i \rangle + b) = 1    (3.57)

41

L1
With L1 regularization the dual becomes

\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle    (3.58)
s.t. 0 \le \alpha_i \le C\ \forall i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0

L1 i 0

0 i C

3.7

(Sequential Minimal Optimization)

(Sequential Minimal Optimization; SMO)


(Platt)
1998

[23]


(Quadratic
Programming)





(Gradient Descent)

42


max W () =

N
X
i=1

N
1X
yi yj i j hxi , xj i
2 i,j=1

(3.59)

s.t. k, 0 k C
N
X

k yk = 0

k=1

i i j

N
X

k y k = 0

k=1

(3.60)

i y i =

j yj

j6=i

i

5
tol

Algorithm 5
1: repeat
2:   choose a pair (α_i, α_j) by a heuristic
3:   re-optimize W(α) with respect to α_i and α_j, holding all other α's fixed
4: until convergence (the KKT conditions hold within tol)

i j
W ()

i j
(3.60)

i yi + j yj =

k y k

(3.61)

k6=i,j


i y i + j y j =
43

(3.62)

1
C
H

i yi + j yj =

3.4:


3.4
i j
[0, C] [0, C]
i yi + j yj =
j
L
H
(L j H)
(i , j )

L
H

If yi 6= yj , L = max(0, j i ), H = min(C, C + j i )
If yi = yj , L = max(0, i + j C), H = min(C, i + j )
i + j =
i j
i = ( j yj )yi

(3.63)

W (i , j , )
j

W (i , j , ) = W (( j yj yi ), j , ) = aj2 + bj + c

44

(3.64)

Maximizing W(\alpha_i, \alpha_j, \ldots) over the box [0, C] \times [0, C] gives the unconstrained update

\alpha_j \leftarrow \alpha_j - \frac{y_j (E_i - E_j)}{\eta}    (3.65)

where

E_k = f(x_k) - y_k    (3.66)

\eta = 2\langle x_i, x_j \rangle - \langle x_i, x_i \rangle - \langle x_j, x_j \rangle    (3.67)

The updated \alpha_j is then clipped back to the feasible segment:

\alpha_j \leftarrow \begin{cases} H, & \text{if } \alpha_j > H \\ \alpha_j, & \text{if } L \le \alpha_j \le H \\ L, & \text{if } \alpha_j < L \end{cases}    (3.68)

(L and H are as shown in Figure 3.4.) Finally \alpha_i is updated so that the linear constraint stays satisfied:

\alpha_i \leftarrow \alpha_i + y_i y_j (\alpha_j^{(old)} - \alpha_j)    (3.69)

i
i j

b
\alpha_i = 0 \;\Rightarrow\; y_i(\langle w, x_i \rangle + b) \ge 1    (3.70)

\alpha_i = C \;\Rightarrow\; y_i(\langle w, x_i \rangle + b) \le 1    (3.71)

0 < \alpha_i < C \;\Rightarrow\; y_i(\langle w, x_i \rangle + b) = 1    (3.72)

If 0 < \alpha_i < C, the threshold b_1 that makes the condition hold for example i is

b_1 = b - E_i - y_i(\alpha_i - \alpha_i^{(old)})\langle x_i, x_i \rangle - y_j(\alpha_j - \alpha_j^{(old)})\langle x_i, x_j \rangle    (3.73)

If 0 < \alpha_j < C, the analogous threshold for example j is

b_2 = b - E_j - y_i(\alpha_i - \alpha_i^{(old)})\langle x_i, x_j \rangle - y_j(\alpha_j - \alpha_j^{(old)})\langle x_j, x_j \rangle    (3.74)

When both \alpha_i and \alpha_j lie at a bound (\alpha_i = 0 or C, and \alpha_j = 0 or C), any value between b_1 and b_2 is valid, so b is set to

b = \begin{cases} b_1, & \text{if } 0 < \alpha_i < C \\ b_2, & \text{if } 0 < \alpha_j < C \\ \frac{b_1 + b_2}{2}, & \text{otherwise} \end{cases}    (3.75)

3.7
(

i, j

[23]

Require: C, tol, max_passes, Z = {(x_i, y_i)}_{i=1}^{N}
Ensure: α, b
 1: α_i = 0 for all i, b = 0
 2: passes = 0
 3: while passes < max_passes do
 4:   num_changed_alphas = 0
 5:   for i = 1, 2, . . . , N do
 6:     E_i = f(x_i) - y_i
 7:     if (y_i E_i < -tol and α_i < C) or (y_i E_i > tol and α_i > 0) then
 8:       select some j ≠ i
 9:       E_j = f(x_j) - y_j
10:       compute L and H
11:       if L = H then continue with the next i
12:       compute η by (3.67); if η ≥ 0 then continue with the next i
13:       update α_j by (3.65) and clip it by (3.68)
14:       if |α_j - α_j^(old)| < 10^-5 then continue with the next i
15:       update α_i by (3.69)
16:       compute b_1 and b_2 by (3.73)-(3.74) and set b by (3.75)
17:       num_changed_alphas = num_changed_alphas + 1
18:     end if
19:   end for
20:   if num_changed_alphas = 0 then
21:     passes = passes + 1
22:   else
23:     passes = 0
24:   end if
25: end while

47

3.8

48


4.1




(Multiclass Classification)

n

(

n
2

n
2

(
(One Versus One)) k

(
n
)
n

(
(One Versus All))


(Statistical School)

(Parsing)
(

(Parse Tree))
(Bayesian School)


(Structural Support Vector Machine; SVM-struct) [15]


f x X y Y

y
{1, 1}
{1, 2, . . . k}

k
(Sequence)
(String)

49

(Tree)
(Graph)
)

f x
(

)
y(Parse Tree)

Z = \{(x_i, y_i)\}_{i=1}^{n}    (4.1)

xi y i

4.2

(Primal Form)
(Dual Form)





w
x
w

(
4.1
w1 w9 9

w
w4
)
x X
y Y
x
(y

)
y

w
hw, xi + b
y

(x, y)
(
(Conditional
Random Fields; CRF)
(Feature Function)
x
y

50

Figure 4.1: a multiclass problem with nine linear discriminants \langle w_1, x \rangle + b_1, . . . , \langle w_9, x \rangle + b_9.


)

hw, (x, y)i
w
w


(Discriminant
Function)F : X Y R x
y

x
y
(x, y)

F y
x

4.2
4.3
(
F (x, y; w)
x
y
x

x
y
y

x
y
w

4.2
4.3
)

w

(xi , yi )
F (xi , yi ; w)  F (xi , y; w), y 6= yi xi yi
y
w

51

4.2: w

4.3: w

f(x; w) = \arg\max_{y \in \mathcal{Y}} F(x, y; w)    (4.2)

F(x, y; w) = \langle w, \Psi(x, y) \rangle    (4.3)
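The decision rule (4.2)-(4.3) only requires evaluating ⟨w, Ψ(x, y)⟩ for candidate outputs; when Y is small enough to enumerate it is a one-line arg max. A sketch with an illustrative Psi (the function names and toy values are not from the thesis):

import numpy as np

def predict(x, w, candidates, Psi):
    # f(x; w) = argmax_y <w, Psi(x, y)>, eq. (4.2)-(4.3).
    return max(candidates, key=lambda y: w @ Psi(x, y))

# Toy multiclass example: Psi(x, y) places x in the block belonging to class y.
def Psi(x, y, n_classes=3):
    out = np.zeros(n_classes * len(x))
    out[y * len(x):(y + 1) * len(x)] = x
    return out

w = np.array([1.0, 0.0, 0.0, 1.0, -1.0, -1.0])   # 3 classes, 2-dim x
print(predict(np.array([0.2, 0.9]), w, candidates=[0, 1, 2], Psi=Psi))

For sequence outputs, Y is exponentially large and the arg max is instead computed with dynamic programming (the Viterbi algorithm), as discussed in Chapter 5.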

(

(No Free Lunch Theorem)

)

F (xi , yi ; w)  F (xi , y; w), y 6= yi


(Maximum Margin)

52


\gamma_i = F(x_i, y_i; w) - \max_{y \in \mathcal{Y} \setminus y_i} F(x_i, y; w)    (4.4)

\gamma_i = \langle w, \Psi(x_i, y_i) \rangle - \max_{y \in \mathcal{Y} \setminus y_i} \{ \langle w, \Psi(x_i, y) \rangle \}    (4.5)

\max_{y \in \mathcal{Y} \setminus y_i} \{ \langle w, \Psi(x_i, y) \rangle \} < \langle w, \Psi(x_i, y_i) \rangle    (4.6)

\forall i,\ \forall y \in \mathcal{Y} \setminus y_i:\ \langle w, \Psi(x_i, y_i) \rangle - \langle w, \Psi(x_i, y) \rangle > 0    (4.7)

(Writing \delta\Psi_i(y) \equiv \Psi(x_i, y_i) - \Psi(x_i, y), so that \langle w, \delta\Psi_i(y) \rangle = \langle w, \Psi(x_i, y_i) \rangle - \langle w, \Psi(x_i, y) \rangle.)

\min_{w}\ \frac{1}{2}\|w\|^2    (4.8)
s.t. \forall i,\ \forall y \in \mathcal{Y}\setminus y_i:\ \langle w, \delta\Psi_i(y) \rangle \ge 1


(Separable)

\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i    (4.9)
s.t. \forall i:\ \xi_i \ge 0,\quad \forall i,\ \forall y \in \mathcal{Y}\setminus y_i:\ \langle w, \delta\Psi_i(y) \rangle \ge 1 - \xi_i
53

0-1

(Zero-One Loss Function)






: Y Y R
(y, y)

y
y P (x, y)

(x, y)

R_P^{\Delta}(f) = \int_{\mathcal{X} \times \mathcal{Y}} \Delta(y, f(x))\, dP(x, y)    (4.10)

over X × Y with joint distribution P(x, y), and the Empirical Risk on a sample S is

R_S^{\Delta}(f) = \frac{1}{n}\sum_{i=1}^{n} \Delta(y_i, f(x_i))    (4.11)


y 6= yi (yi , y)
(yi , y)

(Slack Variable)


\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i    (4.12)
s.t. \forall i:\ \xi_i \ge 0,\quad \forall i,\ \forall y \in \mathcal{Y}\setminus y_i:\ \langle w, \delta\Psi_i(y) \rangle \ge 1 - \xi_i

Replacing the fixed margin of 1 by the loss \Delta(y_i, y) gives

\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i    (4.13)
s.t. \forall i:\ \xi_i \ge 0,\quad \forall i,\ \forall y \in \mathcal{Y}\setminus y_i:\ \langle w, \delta\Psi_i(y) \rangle \ge \Delta(y_i, y) - \xi_i


n|Y|
|Y|
(Exponential
Order)
(
)




\max_{\alpha}\ \sum_{i,\, y \ne y_i} \alpha_{iy} - \frac{1}{2} \sum_{i,\, y \ne y_i} \sum_{j,\, \bar{y} \ne y_j} \alpha_{iy}\, \alpha_{j\bar{y}}\, \langle \delta\Psi_i(y), \delta\Psi_j(\bar{y}) \rangle    (4.14)
s.t. \forall i,\ \forall y \in \mathcal{Y}\setminus y_i:\ \alpha_{iy} \ge 0
(Lagrange Multiplier)

iy (xi , yi )
y

4.3

(Cutting Plane Method)

(Approximation)

(Cutting Plane Method)

(Optimization)

(Linear Inequality)

55

(Feasible Set)
(Object Function)


(Integer Linear Programming)


max  x_1 + 5 x_2
s.t. x_1 + 10 x_2 \le 20
     x_1 \le 2    (4.15)
     x_1, x_2 \ge 0
     x_1, x_2 \in \mathbb{Z}
(Naive)

(Linear Programming Relaxation)

max  x_1 + 5 x_2
s.t. x_1 + 10 x_2 \le 20    (4.16)
     x_1 \le 2
     x_1, x_2 \ge 0
(2, 1.8)
4.4
(2, 1.8)

(2,2)
(
)

(2, 1)
7

4.4
(0,2)
10

7
3

x1 + 2x2 4

56

(0,2)
x1 + 10x2 = 20

(2,0)

4.4:

(
4.5)
max  x_1 + 5 x_2
s.t. x_1 + 10 x_2 \le 20
     x_1 \le 2    (4.17)
     x_1 + 2 x_2 \le 4
     x_1, x_2 \ge 0

x1 + 2x2 4
(0,2)


(
)

1. (Feasible Integer Set)



2. (Feasible Set)

57

(0,2)
x1 + 10x2 = 20

(2,0)

x1 + 2x2 = 4
x

4.5:


(Convex Optimization)

4.6
(

)
(
)

(
)

(Active Constraint)

4.6

[24]

(x, y)
k(x, y) (x, y)k, y, y Y

58

4.6:

[24]


6
6
(

) 6


(

)




[24]

6

4.7(

)

(
4.7
)

59

(Most-Violated Constraint)
(
4.7

)
y (

y y H(y)

)
i (
)

i 
(

)
y (
4.7

)
(

4.7
)

Algorithm 6 SVM-struct (cutting plane training)
1: S_i ← ∅ for all i = 1, . . . , n
2: repeat
3:   for i = 1, 2, . . . , n do
4:     H(y) ≡ (1 - ⟨δΨ_i(y), w⟩) Δ(y_i, y)
5:     ŷ = arg max_{y∈Y} H(y)
6:     ξ_i = max{0, max_{y∈S_i} H(y)}
7:     if H(ŷ) > ξ_i + ε then
8:       S_i ← S_i ∪ {ŷ}
9:       re-optimize w (solve the dual) over the union S = ∪_i S_i
10:    end if
11:  end for
12: until no S_i changes during a full pass
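The working-set logic of Algorithm 6 can be sketched independently of the QP solver. In the Python sketch below, find_most_violated (loss-augmented inference), H (the violation score of line 4), and solve_qp are placeholders for the thesis's components; the code only illustrates the control flow.

def train_svm_struct(samples, find_most_violated, H, solve_qp, eps, max_iter=100):
    # samples: list of (x_i, y_i); S[i] is the working set of constraints for i.
    S = [set() for _ in samples]
    w = None
    for _ in range(max_iter):
        changed = False
        for i, (x, y) in enumerate(samples):
            y_hat = find_most_violated(w, x, y)                  # line 5
            xi = max([0.0] + [H(w, x, y, yb) for yb in S[i]])    # line 6
            if H(w, x, y, y_hat) > xi + eps:                     # line 7
                S[i].add(y_hat)                                  # line 8
                w = solve_qp(samples, S)                         # line 9
                changed = True
        if not changed:                                          # line 12
            return w, S
    return w, S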

60

4.7:

61

4.4


(Large Margin Principle)
(Conditional Random Field)

(Discriminative Model)

(Discriminative
Training)


[25]

(Bayes Risk)

(Bayesian School)

(Statistical School)
(Logistic
Regression)
(Linear Discriminant Analysis)
(Support
Vector Machine)








62


(Large Margin HMM) [26]



(


)





(Approximation)


TI Digit
(
)

4.5


(x, y)

63

(Parsing)

(Information Retrieval)

64

TIMIT

5.1
5.1.1
TIMIT
(The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus
TIMIT)
TIMIT
(Defense Advanced
Research Projects Agency; DARPA)

(Texas
Instruments; TI)
(Massachusetts Institute of Technology; MIT)

(Stanford Research Institute; SRI)

(TI +
MIT = TIMIT)

TIMIT
6300

(Dialect)
630
10

5.1 10

2
(SA):

Dialect region        Male        Female      Total
dr1 (New England)     31 (63%)    18 (37%)    49 (8%)
dr2 (Northern)        71 (70%)    31 (30%)    102 (16%)
dr3 (North Midland)   79 (77%)    23 (23%)    102 (16%)
dr4 (South Midland)   69 (69%)    31 (31%)    100 (16%)
dr5 (Southern)        62 (63%)    36 (37%)    98 (16%)
dr6 (New York City)   30 (65%)    16 (35%)    46 (7%)
dr7 (Western)         74 (74%)    26 (26%)    100 (16%)
dr8 (Army Brat)       22 (67%)    11 (33%)    33 (5%)
Total                 438 (70%)   192 (30%)   630 (100%)

Table 5.1: speakers in the TIMIT corpus by dialect region and gender.

5
(Phonetically Balanced)
(SX):
450

3
(SI):
(Brown Corpus)
1890
1890

SA

[27]
SA

SA
3696
(462
)

192
(
TIMIT
24

5
SX
3
SI
)

TIMIT
64
(
5.2)
(Phone
Recognition)
[27]
(Glottal Stops)

48
(
5.4)
48

66

Table 5.2: the TIMIT phone symbols with example words: b (bee), d (day), g (gay), p (pea), t (tea), k (key), dx (muddy, dirty), q (bat), jh (joke), ch (choke), s (sea), sh (she), z (zone), zh (azure), f (fin), th (thin), v (van), dh (then), m (mom), n (noon), ng (sing), em (bottom), en (button), eng (washington), nx (winner), l (lay), r (ray), w (way), y (yacht), hh (hay), hv (ahead), el (bottle), iy (beet), ih (bit), eh (bet), ey (bait), ae (bat), aa (bott), aw (bout), ay (bite), ah (but), ao (bought), oy (boy), ow (boat), uh (book), uw (boot), ux (toot), er (bird), ax (about), ix (debit), axr (butter), ax-h (suspect), pau (pause), epi (epenthetic silence), h#, #h (silence), the unvoiced closures pcl, tcl, kcl, qcl, and the voiced closures vcl, bcl, dcl, gcl.

Table 5.3: phone groups not distinguished during scoring: {sil, cl, vcl, epi}, {el, l}, {en, n}, {sh, zh}, {ao, aa}, {ih, ix}, {ah, ax}.

67

48
(
5.3)

A
B
39

5.1.2

(Mel-Frequency Cepstrum Coefficient; MFCC) [28]


(Perceptual Linear Prediction; PLP) [29]
,

5.1a

(10ms
25ms
)
13

(Time Derivatives)
39
13
40

(Filter Banks) (
64 Hz
8k Hz)


(Cepstral Mean Subtraction; CMS) [30]
5.1b

(Time Domain)

(Pre-emphasis)
(Frequency Domain)


(Autoregressive
Coefficient)
13

39

68

Table 5.4: the 48 phones used for training, with the TIMIT symbols folded into each: iy (beat), ih (bit), eh (bet), ae (bat), ix (roses), ax (the), ah (butt), uw/ux (boot), uh (book), ao (bought), aa (cot), ey (bait), ay (bite), oy (boy), aw (bough), ow (boat), er/axr (bird), el (bottle), l (led), r (red), w (wet), y (yet), hh/hv (hay), m/em (mom), n/nx (non), en (button), ng/eng (sing), b (bob), d (dad), g (gag), p (pop), t (tot), k (kick), dx (butter), ch (church), jh (judge), dh (they), z (zoo), zh (measure), v (very), f (fief), th (thief), s (sis), sh (shoe), cl (unvoiced closure: pcl, tcl, kcl, qcl), vcl (voiced closure: bcl, dcl, gcl), epi (epenthetic closure), sil (silence: h#, #h, pau).

Figure 5.1: the feature extraction pipelines: (a) MFCC, (b) PLP.

5.1.3


(HTK) [31];
(QuickNet) [32]; SV M struct [33];

5.1.4


(Hidden Markov Model; HMM)

(Tandem System)

TIMIT
(Free Phone Decoding) (

)

(Phone Accuracy)


5.1.1
TIMIT
64
48

71

(
)
48

(Left-to-Right Continuous Mono-phone HMM)






20
(Expectation Maximization;
EM)
(
)

:
1 2 4 8 16 32
20

5.1.2


(HTK)


48



72

(Hidden Layer)
1000
351


39
39 9 = 351
48
48




(Principal Component Analysis; PCA)

(Eigen Value)
95%
37

5.1.5
(Framework)

(x, y)

(x, y)

aij bj (ot )

(Conditional Random Fields; CRF)

73

(Feature Function)

p(y, x) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)
 = \frac{1}{Z} \exp\Big( \sum_{t} \sum_{i,j \in S} \lambda_{ij}\, \delta(y_t = i)\,\delta(y_{t-1} = j) + \sum_{t} \sum_{i \in S} \sum_{o \in O_W} \mu_{oi}\, \delta(y_t = i)\,\delta(x_t = o) \Big)
 = \frac{1}{Z} \exp\Big( \sum_{t} \sum_{k} \lambda_k\, f_k(y_t, y_{t-1}, x_t) \Big)    (5.1)
x
y
(

(Observable)
)
\lambda_{ij} = \log a_{ji} = \log p(y_t = i \mid y_{t-1} = j), \qquad \mu_{oi} = \log b_i(o) = \log p(x_t = o \mid y_t = i), \qquad Z = 1
f
(x, y)

(Tensor Product)

: RD RK RDK , (a b)i+(j1)D ai bj

(5.2)

c (y) ((y1 , y), (y2 , y), . . . , (yK , y)) {0, 1}K


The joint feature map Ψ(x, y) is

\Psi(x, y) = \begin{pmatrix} \sum_{t=1}^{T} \Phi(x_t) \otimes \Lambda^c(y^t) \\ \eta \sum_{t=1}^{T-1} \Lambda^c(y^t) \otimes \Lambda^c(y^{(t+1)}) \end{pmatrix}    (5.3)-(5.4)

where \Phi(x_t) is the acoustic feature vector (MFCC or PLP) at frame t; the first block \sum_{t=1}^{T} \Phi(x_t) \otimes \Lambda^c(y^t) accumulates emission statistics and the second block \sum_{t=1}^{T-1} \Lambda^c(y^t) \otimes \Lambda^c(y^{(t+1)}) counts label transitions.



74

Figure 5.2: illustration of Ψ(x, y).

As an example, suppose there are 3 labels {A, B, C} (as in Figure 5.2). Ψ(x, y) then collects a Transition Count matrix and an Emission Count matrix. With A at index 0, B at index 1, C at index 2, η = 1, and Φ(x_t) = x_t, one obtains for instance

A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad
B = \begin{pmatrix} 3.2 & 5 \\ 3.7 & 6.4 \\ 0 & 0 \end{pmatrix}    (5.5)

and flattening A and B gives

\Psi(x, y) = \begin{pmatrix} 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 3.2 & 5 & 3.7 & 6.4 & 0 & 0 \end{pmatrix}^T    (5.6)

Ψ(x, y) has k dimensions, and the weight vector w has the same k dimensions.
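The construction in (5.3)-(5.6), emission sums Φ(x_t) ⊗ Λc(y_t) plus transition counts, is straightforward to compute. A small Python/numpy sketch (illustrative names, not the thesis code) that reproduces the 15-dimensional example above:

import numpy as np

def joint_feature(x, y, n_labels, eta=1.0):
    # Psi(x, y) of (5.3)-(5.4): stacked transition counts and emission sums.
    T, d = x.shape
    trans = np.zeros((n_labels, n_labels))
    emit  = np.zeros((n_labels, d))
    for t in range(T):
        emit[y[t]] += x[t]                    # Phi(x_t) (x) Lambda^c(y_t)
        if t + 1 < T:
            trans[y[t], y[t + 1]] += 1.0      # Lambda^c(y_t) (x) Lambda^c(y_{t+1})
    return np.concatenate([eta * trans.ravel(), emit.ravel()])

# Labels A=0, B=1, C=2; four frames of 2-dimensional features.
x = np.array([[1.2, 2.0], [1.7, 3.0], [2.0, 3.4], [2.0, 3.0]])
y = np.array([0, 1, 1, 0])
print(joint_feature(x, y, n_labels=3))   # the 15-dim vector of eq. (5.6)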

5.3

75

5.3:

(Viterbi
Algorithm)
(Most Violated Constraint)

(Active Constraint)

(y, y)

(y, y) =

T
X

(y t , yt )

(5.7)

t=1

(Upper Bound)



76


64
48
(
)

5.1.6

(Phone Accuracy)


H = N - D - S    (5.8)

Corr = \frac{H}{N} \times 100\%    (5.9)

Acc = \frac{H - I}{N} \times 100\%    (5.10)

where H is the number of hits, D the number of deletions, S the number of substitutions, I the number of insertions, and N the number of reference phones (or words, for word recognition). D, S, and I are obtained by aligning the recognized sequence against the reference with the Edit Distance, computed by Algorithm 7.

5.2
5.2.1
(
5.5
5.4)

(Phone Accuray)
5.5

77

Algorithm 7 (Edit Distance)
 1: for i = 0, 1, . . . , m do
 2:   d[i, 0] ← i
 3: end for
 4: for j = 0, 1, . . . , n do
 5:   d[0, j] ← j
 6: end for
 7: for i = 1, 2, . . . , m do
 8:   for j = 1, 2, . . . , n do
 9:     if a[i] = b[j] then
10:       cost ← 0
11:     else
12:       cost ← 1
13:     end if
14:     d[i, j] ← min(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + cost)
15:   end for
16: end for
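Algorithm 7 and the measures (5.8)-(5.10) fit in a few lines of Python; the sketch below additionally backtraces the alignment so that H, D, S, and I can be read off it (illustrative code, not the thesis's HTK-based scoring).

def align_counts(ref, hyp):
    # Levenshtein alignment of Algorithm 7, returning (H, D, S, I).
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    H = D = S = I = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            H += ref[i - 1] == hyp[j - 1]
            S += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            D, i = D + 1, i - 1
        else:
            I, j = I + 1, j - 1
    return H, D, S, I

ref = "sil b iy sil".split()
hyp = "sil b b iy".split()
H, D, S, I = align_counts(ref, hyp)
N = len(ref)
print(100.0 * H / N, 100.0 * (H - I) / N)   # Corr (5.9) and Acc (5.10)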

78

MFCC
PLP

MLP-MFCC
MLP-PLP

(PCA)

PCA-37-MLP-MFCC
PCA-37-MLP-PLP
PCA

Flat Start
Init By Label

TIMIT

(Flat
Start)

(Forced Alignment)

(Init By Label)

Init By Label
Flat Start


62%, 63%
68%, 69%
PCA
1%

70%

79

Figure 5.4.

                      MFCC     PLP      MLP-MFCC  MLP-PLP  PCA-37-MLP-MFCC  PCA-37-MLP-PLP
HMM (Flat Start)      62.47%   62.69%   68.77%    69.25%   70.19%           70.26%
HMM (Init By Label)   63.26%   62.91%   69.26%    69.50%   70.30%           70.42%

Table 5.5: Phone Accuracy of the HMM baselines with the six feature sets.

80

MFCC
PLP

PLP

MFCC

5.5

(Slack Variable)


MLP-MFCC
MLP-PLP
PCA-37-MLP-MFCC
PCA37-MLP-PLP

5.6
5.5

(x, y)

(First-Order Markov)

MFCC
PLP

51%
MFCC
PLP

63%

11%


50%
MFCC
PLP




57%

71.7%

70.42%
1%

81

Figure 5.5.

        MFCC     PLP      MLP-MFCC  MLP-PLP  PCA-37-MLP-MFCC  PCA-37-MLP-PLP
        38.87%   39.08%   57.40%    57.57%   56.71%           56.79%
10      46.55%   44.81%   67.74%    67.65%   67.13%           67.20%
100     49.74%   49.65%   70.86%    70.89%   70.18%           70.19%
1000    51.29%   51.22%   71.71%    71.75%   71.29%           71.32%

Table 5.6: Phone Accuracy of the structural SVM with first-order features.

82

PCA

PCA
PCA


5.7
5.6


5.6


(Second-Order
Markov)
(x, y)

5.8
5.7

MFCC
PLP


10
70.07%

100
1000
5.8

83

Figure 5.6.

        MFCC     PLP      MLP-MFCC  MLP-PLP  PCA-37-MLP-MFCC  PCA-37-MLP-PLP
        38.77%   37.03%   57.54%    57.33%   56.35%           56.07%
10      46.13%   44.71%   67.73%    67.48%   67.19%           66.98%

Table 5.7: Phone Accuracy of the structural SVM.

84

Figure 5.7.

        MFCC     PLP      MLP-MFCC  MLP-PLP  PCA-37-MLP-MFCC  PCA-37-MLP-PLP
        39.03%   39.19%   64.43%    64.25%   63.65%           63.75%
10      46.38%   44.61%   69.94%    70.07%   69.84%           69.91%

Table 5.8: Phone Accuracy of the structural SVM with second-order features.

85

5.8:

5.3

TIMIT


86


6.1




1%






(
)

(

)

6.2



(Conditional Random Field; CRF)


(Discriminative Training)
(

87

88


[1] Lawrence R. Rabiner, A tutorial on hidden Markov models and selected applications
in speech recognition, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
1990.
[2] J. K. Baker, The dragon system - an overview, in IEEE Trans. Acoust. Speech
Signal Process, 1975, pp. 2429.
[3] L. Bahl, P. Brown, P. de Souza, and R. Mercer, Maximum mutual information
estimation of hidden Markov model parameters for speech recognition, in ICASSP
1986, 1986.
[4] B.-H. Juang, W. Chou, and C.-H. Lee, Minimum classification error rate methods for
speech recognition, in IEEE Transactions on Speech and Audio Processing, 1997.
[5] D. Povey and P.C. Woodland, Minimum phone error and i-smoothing for improved
discriminative training, in ICASSP 2002, 2002.
[6] J. Zheng and A. Stolcke, Improved discriminative training using phone lattices,
in Interspeech 2005, 2005.
[7] J. Du, P. Liu, F. K. Soong, J.-L. Zhou, and R.-H. Wang, Minimum divergence based
discriminative training, in Interspeech 2006, 2006.
[8] Daniel Jurafsky and James H. Martin, Speech and Language Processing, Pearson
Education Taiwan Ltd., 2005.

89

[9] Leonard E. Baum and Ted Petrie, Statistical inference for probabilistic functions
of finite state markov chains, The Annals of Mathematical Statistics, vol. 37, no. 6,
pp. 15541563, 1966.
[10] Hynek Hermansky, Daniel P. W. Ellis, and Sangita Sharma, Tandem connectionist feature extraction for conventional HMM systems, in Proc. ICASSP, 2000,
pp. 1635-1638.
[11] Eric Fosler-Lussier and Jeremy Morris, Crandem systems: Conditional random
field acoustic models for hidden Markov models, in Proc. ICASSP, 2008, pp. 4049-4052.
[12] Asela Gunawardana, Milind Mahajan, Alex Acero, and John C. Platt, Hidden
conditional random fields for phone classification, in in Interspeech, 2005, pp.
11171120.
[13] Yun-Hsuan Sung, Constantinos Boulis, Christopher Manning, and Dan Jurafsky,
Regularization, adaptation, and non-independent features improve hidden conditional random fields for phone classification, in IEEE ASRU 2007, 2007, pp. 639
642.
[14] J. Morris and E. Fosler-Lussier, Discriminative phonetic recognition with conditional random fields, in HLT-NAACL Workshop on Computationally Hard Problems
and Joint Inference in Speech and Language Processing, 2006.
[15] Tsochantaridis Ioannis, Hofmann Thomas, Joachims Thorsten, and Altun Yasemin,
Support vector machine learning for interdependent and structured output spaces,
in ICML 04, New York, NY, USA, 2004, p. 104, ACM.
90

[16] Yasemin Altun, Ioannis Tsochantaridis, and Thomas Hofmann, Hidden markov
support vector machines, 2003.
[17] Thorsten Joachims, A support vector method for multivariate performance measures, in ICML 05: Proceedings of the 22nd international conference on Machine
learning, New York, NY, USA, 2005, pp. 377384, ACM.
[18] Yue Yisong, Finley Thomas, Radlinski Filip, and Joachims Thorsten, A support
vector method for optimizing average precision, in SIGIR 07: Proceedings of the
30th annual international ACM SIGIR conference on Research and development in
information retrieval, New York, NY, USA, 2007, pp. 271278, ACM.
[19] Thorsten Joachims, Training linear svms in linear time, in KDD 06: Proceedings
of the 12th ACM SIGKDD international conference on Knowledge discovery and
data mining, New York, NY, USA, 2006, pp. 217226, ACM.
[20] S. Sathiya Keerthi and S. Sundararajan, CRF versus SVM-struct for sequence labeling, Tech. Rep., Yahoo Research Technical Report, 2007.
[21] Vladimir N. Vapnik, The nature of statistical learning theory, Springer-Verlag New
York, Inc., New York, NY, USA, 1995.
[22] Christopher J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, vol. 2, pp. 121167, 1998.
[23] John Platt Microsoft and John C. Platt, Sequential minimal optimization: A fast
algorithm for training support vector machines, Tech. Rep., Advances in Kernel
Methods - Support Vector Learning, 1998.

91

[24] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun, Large margin methods for structured and interdependent output variables,
J. Mach. Learn. Res., vol. 6, pp. 14531484, 2005.
[25] Tom Minka, Discriminative models, not discriminitave training, Tech. Rep., Microsoft Research, 2005.
[26] X. Li, H. Jiang, and C. Liu, Large margin HMMs for speech recognition, in IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP 05), 2005, pp.
513-516.
[27] K. Lee and H. Hon, Speaker-independent phone recognition using hidden Markov
models, in IEEE Transactions on Acoustics, Speech and Signal Processing, 1989,
pp. 1641-1648.
[28] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing, Pearson Education Taiwan Ltd., 2005.
[29] H. Hermansky, B. Hanson, and H. Wakita, Perceptually based linear predictive
analysis of speech, Apr 1985, vol. 10, pp. 509512.
[30] S. Furui, Cepstral analysis technique for automatic speaker verification, Acoustics,
Speech and Signal Processing, IEEE Transactions on, vol. 29, no. 2, pp. 254272,
Apr 1981.
[31] Machine Intelligence Laboratory Cambridge University Engineering Dept. (CUED),
Htk, http://htk.eng.cam.ac.uk.

92

[32] Speech Group International Computer Science Institue, Quicknet, http://


www.icsi.berkeley.edu/Speech/qn.html.
[33] Thorsten Joachims, Svm-hmm, http://www.cs.cornell.edu/People/
tj/svm_light/svm_hmm.html.

93
