
A Speech Endpoint Detection Method Based on Wavelet Coefficient Variance

and Sub-Band Amplitude Variance

Xueying Zhang, Zhefeng Zhao, Gaofeng Zhao


College of Information Engineering, Taiyuan University of Technology,
Taiyuan, Shanxi, 030024, P.R.China
zhangxy@tyut.edu.cn

Abstract

Speech endpoint detection is a key technology for speech recognition. This paper proposes two endpoint detection methods: an algorithm based on the wavelet coefficient variance and an algorithm based on the sub-band average amplitude variance. Noisy speech is decomposed by the wavelet transform to investigate the statistical characteristics of the wavelet coefficients and the sub-band amplitudes, and their variances are extracted as features for endpoint detection. The first method's adaptability is better than the second method's, but its complexity is higher. A synthesized speech endpoint detection algorithm consisting of the two methods is therefore proposed: it selects the suitable method according to the noise type, and thus increases system efficiency. Simulations were made under different signal-to-noise ratios, and the results show that the method segments noisy speech efficiently even at a low signal-to-noise ratio.

1. Introduction

Speech endpoint detection aims to correctly discern the speech signal from the background noise, and is a basic issue in speech signal processing [1]. In both military and civil fields, speech endpoint detection has a variety of applications. The difficulty in endpoint detection is that respiration noise and environmental interference make the speech endpoints fuzzy. This paper focuses on detecting speech endpoints by using the wavelet's multi-resolution property to decompose the speech signal into several layers and calculating each layer's wavelet coefficient variance and sub-band amplitude variance. The wavelet coefficient variance method's adaptability is better than that of the sub-band amplitude variance method, but its complexity is higher. We therefore propose a synthesized speech endpoint detection algorithm consisting of the two methods: it selects a method according to the noise type, and thus decreases system complexity. Simulations were made under different SNRs, and the results show that the synthesized method segments noisy speech efficiently even at a low SNR.

2. The wavelet transform and application

The wavelet transform is an effective tool for analyzing and handling non-stationary signals. It is a time-scale analysis method with a multi-resolution property in signal processing; therefore it can effectively extract information from the original signal [2]. The wavelet transform represents the signal f(t) as a weighted sum of a series of functions, formed by dilating and shifting a base function \psi(t). If the scale is a and the time shift is \tau, the wavelet transform is given by equation (1):

    WT_f(a, \tau) = \frac{1}{\sqrt{a}} \int f(t)\, \psi^{*}\!\left(\frac{t - \tau}{a}\right) dt    (1)

The main feature of wavelet analysis is its ability to analyze the local features of a signal. By using the wavelet transform, we can easily find the times at which the signal changes abruptly, and discover many features that other methods fail to detect.
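The layered decomposition described above can be illustrated with a short pure-Python sketch. The paper applies a five-layer db4 transform; the Haar wavelet is used below only because its filters are the simplest orthogonal pair, so this shows the multi-resolution layering rather than the paper's exact filter bank.

```python
import math

def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform.
    Returns (approximation, detail) coefficient lists; assumes an
    even-length input."""
    s = 1.0 / math.sqrt(2.0)
    half = len(signal) // 2
    approx = [(signal[2 * i] + signal[2 * i + 1]) * s for i in range(half)]
    detail = [(signal[2 * i] - signal[2 * i + 1]) * s for i in range(half)]
    return approx, detail

def wavelet_decompose(signal, levels):
    """Multi-resolution decomposition: the detail sub-band of each
    layer, followed by the final approximation (coarsest last)."""
    subbands = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        subbands.append(detail)
    subbands.append(approx)
    return subbands

frame = [math.sin(0.3 * n) for n in range(64)]
bands = wavelet_decompose(frame, 3)
print([len(b) for b in bands])  # sub-band length halves per layer: [32, 16, 8, 8]
```

Because the Haar transform is orthonormal, the total energy of the frame is preserved across the sub-bands, which is what makes per-layer coefficient statistics meaningful.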

Proceedings of the First International Conference on Innovative Computing, Information and Control (ICICIC'06)
0-7695-2616-0/06 $20.00 2006
wavelet coefficients of every sub-band have the same statistical property. Therefore, we can perform endpoint detection using the variance of the wavelet coefficients.

Suppose there is a discrete speech signal f[n]. After the wavelet transformation, its wavelet coefficients are f_k with variance (\sigma_f)^2, as in equation (2):

    (\sigma_f)^2 = \frac{1}{N} \sum_{k \in N} \left( f_k - E(f_k) \right)^2    (2)

where N stands for the total number of wavelet coefficients and k is the index of a wavelet coefficient. According to the property of the 1/f process, after the wavelet transformation of the original signal the wavelet coefficients can be viewed as random variables with zero mean, so equation (2) becomes:

    (\sigma_f)^2 = \frac{1}{N} \sum_{k \in N} f_k^2    (3)

The wavelet coefficients of noise, unvoiced and clean speech signals are extracted as prior knowledge, as shown in equation (4):

    (\sigma_n)^2 = \frac{1}{N} \sum_{k \in N} n_k^2, \quad
    (\sigma_q)^2 = \frac{1}{N} \sum_{k \in N} q_k^2, \quad
    (\sigma_c)^2 = \frac{1}{N} \sum_{k \in N} c_k^2    (4)

where (\sigma_n)^2, (\sigma_q)^2 and (\sigma_c)^2 stand for the wavelet coefficient variances of noise, unvoiced speech and clean speech, respectively, and N stands for the total number of wavelet coefficients.

3.2 Bayes classification model

The Bayes classification model is a typical mathematical classification model based on statistics. Bayes' theorem, one of the most important equations in Bayesian theory, is also the foundation of Bayesian learning methods: it artfully combines the prior probability and the posterior probability, using the prior information and the sample data to determine the posterior probability.

Suppose U = {A_1, A_2, ..., A_n, C} is a finite discrete random variable set, where A_1, A_2, ..., A_n are attribute variables, the class variable C takes values in {c_1, c_2, ..., c_l}, and a_i is the value of variable A_i. The probability of the example x_i = {a_1, a_2, ..., a_n} belonging to c_j can be deduced by Bayes' theorem, as shown in equation (5):

    P(c_j \mid a_1, a_2, \dots, a_n)
      = \frac{P(a_1, a_2, \dots, a_n \mid c_j)\, P(c_j)}{P(a_1, a_2, \dots, a_n)}
      = \alpha\, P(c_j)\, P(a_1, a_2, \dots, a_n \mid c_j)    (5)

where \alpha is the normalization factor, P(c_j) is the prior probability of class c_j, and P(c_j | a_1, a_2, ..., a_n) is the posterior probability of class c_j. The prior probability is independent of the sample data, but the posterior probability reflects the influence of the sample data on class c_j. From formula (5) we can compute the probability of the sample x_i belonging to class c_j. On this basis, a Bayes classifier is built as in Table 1.

Table 1. Description of the three-class Bayes classification

    Class      ID    Input       Probability distribution
    noise      V0    {s_k^m}     N(0, (\sigma_n^m)^2)
    speech     V1    {s_k^m}     N(0, (\sigma_c^m)^2)
    unvoiced   V2    {s_k^m}     N(0, (\sigma_q^m)^2)

3.3 Property classification

In endpoint detection, we statistically classify the variances of equation (4) with the Bayes classification algorithm. V0 stands for the pre-extracted noise variance, V1 stands for the pre-extracted clean speech variance and V2 stands for the unvoiced variance. According to the Bayes classification principles, the class-conditional probabilities are given by equations (6)-(8):

    P(\{s_k^m\} \mid V_0) = \prod_{m \in M} \prod_{k \in N(m)} p(s_k^m \mid V_0)
      = \prod_{m \in M,\, k \in N(m)} \frac{1}{\sqrt{2\pi}\, \sigma_n^m}
        \exp\!\left( -\frac{(s_k^m)^2}{2 (\sigma_n^m)^2} \right)    (6)

    P(\{s_k^m\} \mid V_1) = \prod_{m \in M} \prod_{k \in N(m)} p(s_k^m \mid V_1)
      = \prod_{m \in M,\, k \in N(m)} \frac{1}{\sqrt{2\pi}\, \sigma_c^m}
        \exp\!\left( -\frac{(s_k^m)^2}{2 (\sigma_c^m)^2} \right)    (7)

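The variance feature of equations (2)-(4) and the Gaussian likelihood comparison of section 3.3 can be sketched in pure Python as follows. This is a minimal illustration, not the paper's implementation: log-likelihoods replace the raw products of equations (6)-(8) to avoid floating-point underflow, and the per-layer class variances are made up for the example (in the paper they are pre-extracted from known noise, speech and unvoiced signals).

```python
import math

def subband_variance(coeffs):
    """Zero-mean variance of one layer's coefficients (equation (3));
    the 1/f-process assumption lets the mean term of (2) be dropped."""
    return sum(c * c for c in coeffs) / len(coeffs)

def log_likelihood(subbands, class_vars):
    """Log of the zero-mean Gaussian likelihood of equations (6)-(8).
    class_vars[m] is the pre-extracted variance of layer m for one
    class (noise, clean speech, or unvoiced)."""
    ll = 0.0
    for coeffs, var in zip(subbands, class_vars):
        for c in coeffs:
            ll += -0.5 * math.log(2.0 * math.pi * var) - (c * c) / (2.0 * var)
    return ll

def classify_frame(subbands, noise_vars, speech_vars, unvoiced_vars):
    """Decision rule of section 3.3: the frame counts as speech when
    either the speech or the unvoiced hypothesis beats the noise one."""
    p_noise = log_likelihood(subbands, noise_vars)
    if p_noise < log_likelihood(subbands, speech_vars):
        return "speech"
    if p_noise < log_likelihood(subbands, unvoiced_vars):
        return "speech"
    return "noise"

# Two-layer toy frames with illustrative class variances.
noise_vars = [0.0004, 0.0004]
speech_vars = [1.0, 1.0]
unvoiced_vars = [0.05, 0.05]
quiet = [[0.01, -0.02, 0.015], [0.01, -0.01]]
loud = [[0.9, -1.1, 0.8], [1.0, -0.7]]
print(classify_frame(quiet, noise_vars, speech_vars, unvoiced_vars))  # noise
print(classify_frame(loud, noise_vars, speech_vars, unvoiced_vars))   # speech
```

Working in the log domain changes nothing about which hypothesis wins, since the logarithm is monotonic.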
    P(\{s_k^m\} \mid V_2) = \prod_{m \in M} \prod_{k \in N(m)} p(s_k^m \mid V_2)
      = \prod_{m \in M,\, k \in N(m)} \frac{1}{\sqrt{2\pi}\, \sigma_q^m}
        \exp\!\left( -\frac{(s_k^m)^2}{2 (\sigma_q^m)^2} \right)    (8)

where M is the total number of wavelet layers, m is the index of a wavelet layer, N(m) is the total number of wavelet coefficients in the mth layer, and s_k^m is the kth wavelet coefficient in the mth layer. P({s_k^m} | V0) is the probability of s(t) being noise, P({s_k^m} | V1) is the probability of s(t) being clean speech, and P({s_k^m} | V2) is the probability of s(t) being unvoiced.

If P({s_k^m} | V0) < P({s_k^m} | V1), we deduce that the signal is speech; otherwise we compare P({s_k^m} | V0) with P({s_k^m} | V2). If P({s_k^m} | V0) < P({s_k^m} | V2), we judge that the signal is speech; otherwise it is noise.

4. The endpoint detection based on sub-band average amplitude variance

Gaussian white noise is a stationary random process and exists widely in the natural world, so it is very important to distinguish it from speech. Gaussian white noise has a flat power spectral density over a wide frequency range, so its power is distributed uniformly over every frequency band. In contrast, speech signal power is distributed mainly in the low-frequency part, and its power distribution undulates strongly across the whole frequency range. Thus we can judge whether a segment is speech or noise according to its power distribution in each frequency band. This is the endpoint detection method based on the sub-band average amplitude variance, using the wavelet analysis tool. Its steps are as follows.

(1) Compute the average amplitude E_i^m of the mth layer wavelet coefficients, as in formula (9):

    E_i^m = \frac{1}{N(m)} \sum_{k \in N(m)} |s_k^m|    (9)

where m is the index of a wavelet layer, k is the index of a wavelet coefficient, N(m) is the total number of wavelet coefficients in the mth layer, s_k^m is the kth wavelet coefficient in the mth layer, and i is the frame index.

(2) Compute the mean \bar{E}_i of the E_i^m, as in formula (10):

    \bar{E}_i = \frac{1}{M} \sum_{m=1}^{M} E_i^m    (10)

where M is the total number of wavelet layers. Then compute the variance (\sigma_i)^2 of the sub-band average amplitudes, as in formula (11):

    (\sigma_i)^2 = \frac{1}{M} \sum_{m=1}^{M} \left( E_i^m - \bar{E}_i \right)^2    (11)

(\sigma_i)^2 is selected as the feature parameter to reflect the difference among the average energies of different frames.

(3) Set a threshold T equal to two times the average of the preceding three frames' (\sigma_i)^2. If (\sigma_i)^2 > T, the frame is marked as speech; otherwise it is marked as noise.

5. The synthesized implementation of endpoint detection

The wavelet coefficient variance method's adaptability is better than that of the sub-band amplitude variance method, but its complexity is higher. The latter is simple and fast, but it is only suitable for white noise. The two methods have their respective merits and demerits [4]. To make full use of their merits, a synthesized speech endpoint detection algorithm consisting of the above two methods is proposed; it selects the suitable method according to the noise type. The synthesized algorithm is as follows.

(1) Sample the input speech signal, and then divide the speech samples into frames, denoted R_i (0 < i <= D), where D is the total number of frames.

(2) As the db4 wavelet shows good orthogonality, apply a five-layer db4 wavelet transformation to the ith frame R_i; the resulting wavelet coefficients are s_k^m.

(3) Take the preceding five frames of the input signal and compute their sub-band energy. If the sub-band energy is very small, we consider the system to be in an approximately clean or stationary-noise environment. In that case, use equations (9), (10) and (11) to compute (\sigma_i)^2. If (\sigma_i)^2 > T, the frame is speech; otherwise it is marked as noise.
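The sub-band amplitude variance feature of equations (9)-(11), together with the threshold rule of step (3) in section 4, can be sketched as follows. The toy frames are illustrative only: each inner list plays the role of one wavelet layer's coefficients.

```python
def subband_amplitude_variance(subbands):
    """Equations (9)-(11): average absolute amplitude of each wavelet
    layer, the mean of those layer amplitudes, and their variance."""
    amps = [sum(abs(c) for c in coeffs) / len(coeffs) for coeffs in subbands]  # (9)
    mean_amp = sum(amps) / len(amps)                                           # (10)
    return sum((a - mean_amp) ** 2 for a in amps) / len(amps)                  # (11)

def is_speech(frame_subbands, recent_variances):
    """Step (3): the threshold T is twice the average of the
    preceding three frames' variances."""
    last3 = recent_variances[-3:]
    t = 2.0 * sum(last3) / len(last3)
    return subband_amplitude_variance(frame_subbands) > t

# White-noise-like frame: energy spread evenly across layers, so the
# layer amplitudes are equal and the variance is zero. Speech-like
# frame: energy concentrated in one layer, so the variance is large.
flat = [[0.1] * 8, [0.1] * 8, [0.1] * 8]
voiced = [[1.0] * 8, [0.2] * 8, [0.1] * 8]
history = [0.001, 0.001, 0.001]
print(is_speech(flat, history))    # False
print(is_speech(voiced, history))  # True
```

This mirrors why the method suits white noise: a flat power distribution yields near-zero variance across sub-bands, while speech's low-frequency concentration yields a large one.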
Conversely, if the sub-band energy is large, the wavelet coefficient variance method is used. Compute equations (6) and (7); if P({s_k^m} | V0) < P({s_k^m} | V1), the signal is speech. Otherwise use equation (8): if P({s_k^m} | V0) < P({s_k^m} | V2), we judge the signal to be speech; otherwise it is noise.

(4) If i > D, the algorithm ends; otherwise return to step (2).

(5) After all frames are marked, post-processing starts. We define the minimum speech span as 8 frames and the minimum noise span as 4 frames; spans shorter than the defined period are discarded.

Table 2. The endpoint detection results of 160 speech sentences at different SNRs (%)

    method   clean   15dB   10dB   0dB
    EZCR     97.9    96.6   75.6   64.0
    SBAV     98.5    83.3   80.1   72.1
    WCV      97.2    96.7   90.4   85.6
    SI       97.6    97.7   90.8   86.7
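The framing and post-processing of the synthesized algorithm (steps (1) and (5)) can be sketched as follows. The frame length and overlap come from the experimental setup; the handling of short noise gaps at the edges versus the interior is our reading of the paper's brief description, which only says that short spans are discarded.

```python
def split_frames(samples, frame_len=220, overlap=0.5):
    """Step (1): overlapping frames (the experiments use 220-sample
    frames with 50% overlap between neighbours)."""
    hop = int(frame_len * (1.0 - overlap))
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def postprocess(marks, min_speech=8, min_noise=4):
    """Step (5): speech runs shorter than min_speech frames become
    noise, and noise gaps shorter than min_noise frames that are
    enclosed by speech are bridged."""
    out = list(marks)
    runs, i = [], 0
    while i < len(out):                      # collect (start, end, value) runs
        j = i
        while j < len(out) and out[j] == out[i]:
            j += 1
        runs.append((i, j, out[i]))
        i = j
    for idx, (a, b, v) in enumerate(runs):
        if v == 1 and b - a < min_speech:    # discard short speech burst
            out[a:b] = [0] * (b - a)
        elif v == 0 and b - a < min_noise and 0 < idx < len(runs) - 1:
            out[a:b] = [1] * (b - a)         # bridge short interior gap
    return out

print(len(split_frames([0.0] * 1000)))  # 8 frames of 220 samples, hop 110
marks = [0] * 5 + [1] * 3 + [0] * 5 + [1] * 10 + [0] * 2 + [1] * 9 + [0] * 5
print(postprocess(marks) == [0] * 13 + [1] * 21 + [0] * 5)  # True
```

In the example, the isolated 3-frame speech burst is discarded and the 2-frame noise gap inside a long speech region is bridged, exactly the smoothing that the minimum-span rule is meant to provide.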
6. The experimental results and conclusions

The experiments were carried out under different noise conditions. First, the speech signal is sampled at 11.025 kHz and quantized into 16-bit data, mixed with different levels of white noise, and then colored noise is added at random. In all experiments, the speech signals are divided into frames of 220 samples each, with 50% overlap between neighboring frames. Each speech file is marked manually to distinguish the speech endpoints from the noisy background, and these marks are used to measure the accuracy of each endpoint detection method.

We use the Energy and Zero-Crossing Rate (EZCR), Sub-Band Amplitude Variance (SBAV), Wavelet Coefficient Variance (WCV) and Synthesized Implementation (SI) methods simultaneously to detect the speech endpoints. The endpoint detection results of 160 speech sentences using the four methods are shown in Table 2. From the table we can see that as the noise interference increases, the detection accuracy of the EZCR method decreases rapidly. The SBAV method is suited to signals with white noise, not colored noise; its advantage is that it is simple and fast. The WCV method detects much better than the EZCR and SBAV methods in both white and colored noise environments; its disadvantage is its algorithmic complexity. The SI method makes full use of the merits of the SBAV and WCV methods, selecting one of them by judging the noise type. Thus it obtains the best detection results without wasting system resources, and can meet the demands of endpoint detection in practical applications such as speech enhancement under strong noise interference and speech recognition.

7. Acknowledgements

The project is sponsored by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry of China ([2004] No.176), the Natural Science Foundation of China (No.60472094), the Shanxi Province Natural Science Foundation (No.20051039), and the Shanxi Province Scientific Research Foundation for University Young Scholars ([2004] No.13). The authors gratefully acknowledge them.

8. References

[1] A.M. Nassar, N.S. Kader and A.M. Refat, "Endpoints detection for noisy speech using a wavelet based algorithm", EUROSPEECH'99, Budapest, pp. 903-906.

[2] J.B. Xu, C.S. Ran, "The application of adaptive wavelet transformation in speech signal processing", Computer Engineering and Science, vol. 26, no. 7, 2004.

[3] F. Wang, F. Zheng, "Speech detection in non-stationary noise based on 1/f process", Journal of Computer Science & Technology, vol. 17, no. 1, pp. 327-330, 2002.

[4] B. Wu, X.L. Ren, "A Noise Model Based Method for Speech/Noise Discrimination", Journal of Shanghai Jiaotong University, no. 9, Sep. 2004.

