
Hidden Markov Models (HMMs) are powerful, flexible methods for representing and classifying
data with trends over time, and have been a key component of speech recognition systems for
many years.

I found it very difficult to find a good example (with code!) of a simple speech recognition
system, so I decided to create this post. Though this implementation won't win any awards for
"Best Speech Recognizer", I hope it will provide some insight into how HMMs can be used for
speech recognition and other tasks.

In this post, I will define what Hidden Markov Models are, show how to implement one form
(Gaussian Mixture Model HMM, GMM-HMM) using numpy + scipy, and show how to use this algorithm
for single speaker speech recognition. For a more "production grade" HMM implementation, see
hmmlearn, which holds the HMM implementations that were formerly a part of sklearn.

Data
To demonstrate this algorithm, we need a dataset to operate on. I have chosen to use the
sample dataset from this Google Code project by Hakon Sandsmark. I also used this code as a
reference when creating my own implementation of a Gaussian Mixture Model HMM (GMM-HMM). This
helped in testing my implementation, and gave a frame of reference for performance.

Other available datasets are largely multispeaker, but the simple spectral peak features used
in this example do not work in the multispeaker regime (different speakers have different
frequency content for the same word, let alone male/female speech differences). Future work
will cover more advanced feature extraction techniques for audio, and extend these examples
to multispeaker recognition.
In [1]:
%matplotlib inline
from utils import progress_bar_downloader
import os
# Hosting files on my dropbox since downloading from google code is painful
# Original project hosting is here: https://code.google.com/p/hmm-speechrecognition/downloads/list
# Audio is included in the zip file
link = 'https://dl.dropboxusercontent.com/u/15378192/audio.tar.gz'
dlname = 'audio.tar.gz'
if not os.path.exists('./%s' % dlname):
    progress_bar_downloader(link, dlname)
    os.system('tar xzf %s' % dlname)
else:
    print('%s already downloaded!' % dlname)

audio.tar.gz already downloaded!


In [2]:
fpaths = []
labels = []
spoken = []
# Each subdirectory of 'audio' is named after the word spoken in its files
for f in os.listdir('audio'):
    for w in os.listdir('audio/' + f):
        fpaths.append('audio/' + f + '/' + w)
        labels.append(f)
        if f not in spoken:
            spoken.append(f)
print('Words spoken:', spoken)

Words spoken: ['kiwi', 'apple', 'banana', 'orange', 'pineapple', 'peach', 'lime']
This data has a total of 7 different spoken words, and each was spoken 15 different times,
giving a grand total of 105 files. Next, the files will be extracted into a single data
matrix (zero padding files to uniform length), and a label vector with the correct label for
each data file is created.
In [3]:
# Files can be heard in Linux using the following commands from the command line
# cat kiwi07.wav | aplay -f S16_LE -t wav -r 8000
# Files are signed 16 bit raw, sample rate 8000
from scipy.io import wavfile
import numpy as np
# Zero-filled buffer, one row per file (assumes no file exceeds 32000 samples)
data = np.zeros((len(fpaths), 32000))
maxsize = -1
for n, fpath in enumerate(fpaths):
    _, d = wavfile.read(fpath)
    data[n, :d.shape[0]] = d
    if d.shape[0] > maxsize:
        maxsize = d.shape[0]
# Trim the padding down to the length of the longest file
data = data[:, :maxsize]
# Each sample file is one row in data, and has one entry in labels
print('Number of files total:', data.shape[0])
all_labels = np.zeros(data.shape[0])
for n, word in enumerate(set(labels)):
    all_labels[np.array([i for i, lab in enumerate(labels) if lab == word])] = n
print('Labels and label indices', all_labels)

Number of files total: 105
Labels and label indices [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.
  4.  4.  4.  4.  4.  4.  4.  4.  4.  4.  4.  4.  4.  4.  4.
  3.  3.  3.  3.  3.  3.  3.  3.  3.  3.  3.  3.  3.  3.  3.
  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.
  6.  6.  6.  6.  6.  6.  6.  6.  6.  6.  6.  6.  6.  6.  6.]

Science Fiction (Double Feature)


Once the data has been downloaded and turned into an input matrix, the next step is to
extract features from the raw data, as is done in many other machine learning pipelines.

Most "consumer grade" speech recognition systems use advanced processing to extract a variety
of features that describe the sound over both frequency and time, and until recently "custom
features" were one of the keys to making a great recognition system. The current state of the
art (to my knowledge, at least) has recently moved to using deep neural networks for feature
extraction, which I hope to show in a future post. For now, we will stick to very simple
features, in order to show a "simplest working example".

In this example, simple frequency peak detection is used, rather than the bevy of expert
features typically used in a modern speech recognition pipeline (MFCCs, or more recently, a
pretrained multilayer neural network). This has a direct effect on performance, but allows
for a complete implementation that fits in a single post :)
In [4]:
import scipy
import numpy as np
def stft(x, fftsize=64, overlap_pct=.5):
    # Modified from http://stackoverflow.com/questions/2459295/stft-and-istft-in-python
    hop = int(fftsize * (1 - overlap_pct))
    # np.hanning stands in for scipy.hanning, which newer SciPy releases no longer provide
    w = np.hanning(fftsize + 1)[:-1]
    raw = np.array([np.fft.rfft(w * x[i:i + fftsize]) for i in range(0, len(x) - fftsize, hop)])
    # Integer division keeps the slice bound an int; this drops the last (Nyquist) bin
    return raw[:, :(fftsize // 2)]

In order to find peaks in frequency, a technique called the Short Time Fourier Transform
(STFT) is used. The idea is quite simple: the FFT is applied over chunks of the input data,
resulting in a 2D FFT "image", usually called the spectrogram. Setting the FFT size allows us
to control the amount of frequency resolution available, while overlapping these windows
allows us to control the time resolution at the cost of increasing the data size.
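To put rough numbers on this trade-off, here is a small arithmetic sketch. It assumes the 8000 Hz sample rate mentioned in the file comments earlier and the default `fftsize=64`, `overlap_pct=.5` of the `stft` function; the helper name `stft_resolution` is made up for this illustration.

```python
def stft_resolution(sample_rate, fftsize, overlap_pct):
    """Return (frequency bin spacing in Hz, hop in samples, hop in ms)."""
    hop = int(fftsize * (1 - overlap_pct))
    bin_hz = sample_rate / float(fftsize)
    hop_ms = 1000.0 * hop / sample_rate
    return bin_hz, hop, hop_ms

# Settings assumed from this post: 8000 Hz audio, 64-point FFT, 50% overlap
print(stft_resolution(8000, 64, 0.5))    # (125.0, 32, 4.0)
# A larger FFT gives finer frequency bins but coarser time steps
print(stft_resolution(8000, 256, 0.5))   # (31.25, 128, 16.0)
# More overlap halves the time step, so twice as many frames are produced
print(stft_resolution(8000, 64, 0.75))   # (125.0, 16, 2.0)
```

With the default settings, each frequency bin spans 125 Hz and a new frame starts every 4 ms.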
Briefly, if X is a vector of length 20, we wish to create a 2D array, STFT_X. If the FFT size
is 10, and the overlap is 0.5 (5 samples), this means (in pseudocode, with inclusive index
ranges):
STFT_X[0, :] = FFT(X[0:9])
STFT_X[1, :] = FFT(X[5:14])
STFT_X[2, :] = FFT(X[10:19])

We then have 3 FFT frames which have been extracted from the input sample X. For our feature
extraction, we will next find peaks in each row of STFT_X.
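The framing in the pseudocode can be checked with a few lines of NumPy. This is a minimal sketch rather than the `stft` function from this post: the helper `frame_starts` is hypothetical, no window function is applied, and the loop bound `n_samples - fftsize + 1` is an assumption chosen so that the last full frame (starting at sample 10) is included, as in the pseudocode.

```python
import numpy as np

def frame_starts(n_samples, fftsize, overlap_pct):
    """Start index of every full frame (hypothetical helper for illustration)."""
    hop = int(fftsize * (1 - overlap_pct))
    return list(range(0, n_samples - fftsize + 1, hop))

X = np.arange(20, dtype=float)
starts = frame_starts(len(X), fftsize=10, overlap_pct=0.5)
# Python slices exclude the end index, so X[0:10] covers samples 0 through 9
STFT_X = np.array([np.fft.rfft(X[s:s + 10]) for s in starts])
print(starts)        # [0, 5, 10]
print(STFT_X.shape)  # (3, 6)
```

Each row holds 6 complex values because `np.fft.rfft` of a 10 sample real input returns 10 // 2 + 1 frequency bins.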

The STFT is usually a crucial element of most DSP pipelines, and highly efficient routines
are available to compute it (see FFTW; NumPy ships its own FFT routines in numpy.fft). Though
I have implemented my own STFT here, it is also possible to use matplotlib's specgram
function instead.

Next, peak detection is applied to each FFT frame of every data file. In a previous blog
post, I described the use of wavelets for peak detection. Here, we will use a moving window
to search for peaks instead. The primary steps of this algorithm are as follows:
1. Create a data window of length X. In this example X = 9, though any window size can be
used.
2. Split this window into 3 sections: left, center and right. For the 9 sample window, this
will be LLLCCCRRR.
3. Apply some function f (mean, median, max, min, etc.) over each section of the window.
4. If the maximum value in the center section is greater than the results for the left and
right sections, continue to the next check. Otherwise GOTO 6.
5. If the maximum value of the center section (CCC) is in the very center of the window, you
have found a peak! Mark it and continue. Otherwise, go to the next step.
6. Shift the input data by one sample, and repeat the process. (data[0:9] -> data[1:10])
7. Once all data has been processed, you should have some peaks detected. Sort them in
descending order by amplitude, then output the top N peaks. In this case, N = 6.
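
The steps above can be sketched as a plain Python loop. This is a rough standalone illustration under one reading of steps 4 and 5, not the implementation used in the rest of this post; the function name `find_peaks_moving_window` and its parameters are hypothetical, and all three sections are assumed to have the same size.

```python
import numpy as np

def find_peaks_moving_window(x, n_peaks=6, section=3, f=np.mean):
    """Return up to n_peaks (height, index) pairs, tallest first."""
    x = np.asarray(x, dtype=float)
    win = 3 * section                          # step 1: window of 9 samples by default
    found = []
    for start in range(len(x) - win + 1):      # step 6: shift by one sample each time
        w = x[start:start + win]
        left = w[:section]                     # step 2: LLL
        center = w[section:2 * section]        #         CCC
        right = w[2 * section:]                #         RRR
        # steps 3-4: the center maximum must exceed f(left) and f(right)
        if center.max() <= max(f(left), f(right)):
            continue
        # step 5: that maximum must sit in the middle of the center section
        if np.argmax(center) == section // 2:
            idx = start + section + section // 2
            found.append((float(x[idx]), idx))
    found.sort(reverse=True)                   # step 7: sort by descending height
    return found[:n_peaks]

signal = np.zeros(30)
signal[10] = 5.0
signal[20] = 9.0
print(find_peaks_moving_window(signal))  # [(9.0, 20), (5.0, 10)]
```

Because the window has to fit entirely inside the data, peaks within four samples of either end are never reported with the default section size.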
An implementation of this algorithm is shown below.
In [5]:
import matplotlib.pyplot as plt
plt.plot(data[0, :], color='steelblue')
plt.title('Timeseries example for %s'%labels[0])
plt.xlim(0, 3500)
plt.xlabel('Time (samples)')
plt.ylabel('Amplitude (signed 16 bit)')
plt.figure()
log_freq = 20 * np.log(np.abs(stft(data[0, :])))
print(log_freq.shape)
plt.imshow(log_freq, cmap='gray', interpolation=None)
plt.xlabel('Freq (bin)')
plt.ylabel('Time (overlapped frames)')
plt.ylim(log_freq.shape[1])
plt.title('PSD of %s example'%labels[0])

(216, 32)
-c:9: RuntimeWarning: divide by zero encountered in log
Out[5]:
<matplotlib.text.Text at 0x3eb6890>
