
4th International Conference on Electrical Engineering (ICEE 2015)
IGEE, Boumerdes, December 13th-15th, 2015

Glottal Source Estimation Based on Bivariate Empirical Mode Decomposition
Mina KEMIHA
Radiation Physics and Applications Laboratory
Jijel University, Algeria
kemihamina@yahoo.fr

Abdellah KACHA
Radiation Physics and Applications Laboratory
Jijel University, Algeria
kacha_a@yahoo.com

Abstract: The Bivariate Empirical Mode Decomposition (BEMD) is an extension of the Empirical Mode Decomposition (EMD) algorithm. In its classical formulation, the EMD can only be applied to real-valued time series. In this paper, the BEMD algorithm is proposed as an alternative to estimate the glottal source from the speech signal. The bivariate empirical mode decomposition decomposes the complex log spectrum, whose real part represents the magnitude and whose imaginary part represents the phase, and guarantees an equal number of real and imaginary oscillatory modes named intrinsic mode functions. An adaptive procedure based on the IMF variances is then used to estimate the magnitude of the glottal source by selecting the intrinsic mode functions that constitute the magnitude of the glottal source in the log-spectral domain. The proposed method is tested on synthetic speech signals and compared to the true model of the glottal source for different lengths of the weighting window.

Keywords: bivariate empirical mode decomposition; intrinsic mode function; magnitude decomposition; phase decomposition; glottal source estimation

I. INTRODUCTION
The separation of the speech signal into its vocal tract and
glottal source contributions is an important topic in speech
processing. Once the two components are isolated, they can be
modeled independently. The glottal source characterization can
be used in many areas of speech processing such as speaker
recognition [1], analysis of voice disorders [2], speech
recognition [3] and speech synthesis [4]. These reasons justify
the need to develop algorithms able to estimate the glottal
source robustly and reliably.
Although vocal tract modeling techniques are fairly well established, this is not the case for the representation of the glottal source. Some works have addressed the problem of estimating the glottal source directly from the speech waveform. Most approaches are based on a parametric model of the vocal tract; inverse filtering is then used to eliminate the effect of the vocal tract and obtain an estimate of the glottal source signal. In [5], the discrete
all pole model has been used to model the vocal tract. The
iterative adaptive inverse filtering method described in [6]
isolates the source by estimating iteratively both components
due to the vocal tract and the source signal. In [7], the
estimated glottal signal is refined over several glottal cycles. A nonparametric technique based on the zeros of the Z-transform (ZZT) and the complex cepstrum has been suggested in [8][9]. This approach is based on the observation that speech is a mixed-phase signal including a causal component and an anti-causal component, where the anti-causal component corresponds to the opening phase of the glottis and the causal component includes both the closure of the glottis and the contribution of the vocal tract.
Recently, a method of decomposition of the signal, called
empirical mode decomposition (EMD), has been introduced to
analyze data from non-stationary and/or non-linear processes
[10]. The major advantage of the empirical mode
decomposition is that the basis functions are obtained from the signal itself and not fixed a priori as in conventional analysis methods (Fourier transform, wavelet transform, etc.). In [11-13], the empirical mode decomposition has been proposed to
decompose the logarithm of the magnitude spectrum of the
speech signal into three components which are the harmonic
component, the frequency response of the vocal tract, and
noise. These components were subsequently used to define an
acoustic index called harmonic-to-noise ratio. In [14] the
empirical mode decomposition is proposed as an alternative to
estimate the glottal source from the speech signal.
In its classical formulation, the EMD can only be applied to real-valued time series. The first bivariate extension of EMD was proposed in [15]; it employed the concept of the analytic signal and subsequently applied standard EMD to analyze bivariate data. However, this method cannot guarantee an equal number of real and imaginary IMFs, which limits its applicability. An extension of EMD which operates fully in the bivariate domain was first proposed in [16]; it is termed rotation-invariant EMD (RI-EMD). The extrema of a bivariate signal are chosen to be the points where the angle of the derivative of the bivariate signal becomes zero, that is, based on the change in the phase of the signal. The signal envelopes are produced by component-wise spline interpolation, and the local maxima and minima are then averaged to obtain the local mean of the bivariate signal.
The RI-EMD algorithm effectively uses only the extrema of the imaginary part of the bivariate signal, which results in envelopes based on only two projected directions. An algorithm which gives more accurate values of the local mean is the bivariate EMD (BEMD) [17], where the envelopes corresponding to multiple directions in the bivariate plane are generated and then averaged to obtain the local mean. The set of direction vectors for the projections is chosen as equidistant points along the unit circle. The zero-mean rotating components embedded in the input bivariate signal then become bivariate-valued IMFs. The RI-EMD and BEMD algorithms are equivalent for K = 4 direction vectors.
In this article, the bivariate empirical mode decomposition
is proposed as an alternative to estimate the glottal source from
the speech signal. The proposed estimation method operates in
the log-spectral domain. The effectiveness of the proposed
approach is evaluated on synthetic speech by comparing the estimated glottal source to the true glottal source signal. The remainder of the paper is organized as follows. The bivariate empirical mode decomposition algorithm is presented in Section II. The approach for glottal source estimation based on bivariate empirical mode decomposition is presented in Section III, and results based on synthetic signals are presented in Section IV. Finally, conclusions are given in Section V.
II. BIVARIATE EMPIRICAL MODE DECOMPOSITION

Empirical mode decomposition is an adaptive method for signal decomposition. The EMD decomposes a signal into several oscillating components characterized by rapid to slow oscillations. These oscillating components are called intrinsic mode functions (IMFs). Since the EMD relies heavily on the notion of oscillation, defined from the local extrema of the signal, it is naturally confined to the analysis of scalar signals: the concept of a local extremum, and therefore of oscillation, does not exist for vector signals. In the case of two-component, or bivariate, signals, the notion of rotation can be considered instead; it has become the standard bivariate extension of the concept of oscillation. Thus, the principle underlying all the bivariate extensions proposed to date is to replace the usual notion of oscillation in the EMD by that of rotation [17].
In the bivariate EMD approach, any bivariate signal can be described as the sum of a (vector) component rotating quickly around zero and another rotating more slowly. The bivariate extensions retain the recursive principle of EMD but use new sifting operators modeled on the original ones: the new operators subtract from the signal the "central axis" of its "envelope", which is now a three-dimensional tube that surrounds the signal, and this axis is computed from what might be called the "side frames" of the tube. Each tube is associated with a particular direction; it can be considered as the envelope tube and should at every point be tangent to the signal, which implies that each of the side frames interpolates a set of points of the signal. Given a set of such frames associated with a set of directions well distributed over [0, 2π], it remains to define the "central axis" of the tube. If it is defined independently at each instant, the problem amounts to defining the center of a closed contour in the plane from N points, which in the case of N = 4 corresponds to points associated with the up, down, left and right directions. The bivariate extension of Rilling et al. [17] is defined by the same algorithm as the basic EMD, but with a new sifting process. The different steps of the algorithm can be summarized as follows:
1. Initialize the residue: j = 1, r_0(t) = x(t).
2. Extract the j-th IMF:
   2.1. For k = 1, ..., K do:
        2.1.1. Project the bivariate-valued signal r_{j-1}(t) on the direction φ_k = 2kπ/K: p_k(t) = Re[e^{-iφ_k} r_{j-1}(t)].
        2.1.2. Extract the locations {t_i^k} of the maxima of p_k(t).
        2.1.3. Interpolate the set {(t_i^k, r_{j-1}(t_i^k))} to obtain the envelope curve e_k(t) in the direction φ_k.
   2.2. Compute the mean m(t) of all K envelope curves.
   2.3. Subtract the mean to obtain the detail d(t) = r_{j-1}(t) - m(t).
   2.4. Iterate on the detail d(t) by repeating steps 2.1 to 2.3 until the stopping criterion, based on the standard deviation between two consecutive details, falls below a predefined threshold, which yields IMF_j(t).
3. Update the residue: r_j(t) = r_{j-1}(t) - IMF_j(t), then set j ← j + 1.
4. Iterate on the residue by repeating steps 2 and 3 until the number of extrema of r_j(t) is less than 2.

The signal reconstruction process is given by (1), which involves the IMFs obtained via the EMD and the residual:

x(t) = Σ_{j=1}^{N} IMF_j(t) + r_N(t)    (1)
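For readers who prefer a computational view, the following Python sketch illustrates one sifting step of the procedure above: the complex signal is projected on K equally spaced directions, the maxima of each projection are spline-interpolated into directional envelopes, and their mean is subtracted. It is a minimal sketch assuming NumPy and SciPy are available; the function name, the default K = 8, the toy signal and the simplified boundary handling are choices of this illustration, not the authors' implementation.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def bemd_sift_once(z, t, K=8):
    """One sifting step of bivariate EMD: project the complex signal on K
    equally spaced directions, interpolate the maxima of each projection to
    get directional envelope curves, and subtract their mean (boundary
    effects are ignored in this sketch)."""
    envelopes = []
    for k in range(1, K + 1):
        phi = 2.0 * np.pi * k / K
        p = np.real(np.exp(-1j * phi) * z)        # projection on direction phi_k
        idx = argrelextrema(p, np.greater)[0]     # locations of the maxima
        if idx.size < 4:
            continue                              # too few maxima for a spline envelope
        env = (CubicSpline(t[idx], z[idx].real)(t)
               + 1j * CubicSpline(t[idx], z[idx].imag)(t))
        envelopes.append(env)
    m = np.mean(envelopes, axis=0)                # local mean of the envelope curves
    return z - m                                  # candidate detail d(t)

# Toy example: a fast rotation superimposed on a slow one
t = np.linspace(0.0, 1.0, 2000)
z = np.exp(2j * np.pi * 50 * t) + 3.0 * np.exp(2j * np.pi * 5 * t)
d = bemd_sift_once(z, t)
# Repeating the sifting on d until the stopping criterion is met gives IMF_1;
# subtracting it from z and iterating yields the remaining IMFs and the residue.
```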

III. GLOTTAL SOURCE ESTIMATION

According to the source-filter model of speech production, the speech signal can be considered as the result of the convolution between the excitation of the vocal tract (glottal excitation) and its impulse response [18]:

x(t) = e(t) * v(t)    (2)

where x(t) is the speech signal, v(t) is the impulse response of the vocal tract model, e(t) is the excitation signal which originates at the vocal cords, and * denotes convolution. The spectrum of the windowed speech frame can be expressed as

X_w(f) = |X_w(f)| e^{j∠X_w(f)}    (3)

where f denotes the frequency and ∠X_w(f) denotes the phase of X_w(f). Taking the complex logarithm of both sides of equation (3) gives

log X_w(f) = log|X_w(f)| + j∠X_w(f)    (4)
As can be seen in equation (4), the complex log spectrum is composed of two components, or two signals: the real part, which represents the log magnitude spectrum of the windowed speech frame, and the imaginary part, which represents the phase. The bivariate EMD allows decomposing the complex log spectrum into its magnitude and phase components with an equal number of oscillatory modes named intrinsic mode functions (IMFs).
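As a small illustration of this decomposition of the spectrum into a real and an imaginary signal, the sketch below forms the complex log spectrum of a Hamming-weighted frame. It assumes a NumPy environment and uses an arbitrary synthetic frame; none of these variable names come from the paper.

```python
import numpy as np

fs = 20000                                  # sampling frequency (Hz), as used in Section IV
t = np.arange(1024) / fs
frame = np.sin(2 * np.pi * 128 * t)         # placeholder voiced frame, f0 = 128 Hz
xw = frame * np.hamming(len(frame))         # weighted (windowed) speech frame

Xw = np.fft.rfft(xw)                        # spectrum of the windowed frame, eq. (3)
log_Xw = np.log(np.abs(Xw) + 1e-12) + 1j * np.angle(Xw)   # complex log spectrum, eq. (4)

log_magnitude = log_Xw.real                 # log|Xw(f)|: the "real part" fed to the BEMD
phase = log_Xw.imag                         # angle of Xw(f): the "imaginary part" fed to the BEMD
```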

Figure 1 illustrates an example of the BEMD of the logarithm of the spectrum X_w(f).

Figure 1. Bivariate empirical mode decomposition of the log spectrum and its corresponding IMF components.

Taking the Fourier transform magnitude of equation (2) gives

|X_w(f)| = |E_w(f)| |V(f)|    (6)

where |X_w(f)| and |E_w(f)| are the short-time magnitude spectra of the windowed speech frame and of the windowed excitation signal, respectively, and |V(f)| is the frequency response of the vocal tract. The logarithm changes the multiplicative components into additive components:

log|X_w(f)| = log|E_w(f)| + log|V(f)|    (7)

It is observed that the logarithm of the magnitude spectrum of the weighted speech frame is the sum of two spectral components: the logarithm of the magnitude spectrum of the weighted excitation, log|E_w(f)|, and the spectral envelope log|V(f)| [11]. The logarithm of the amplitude spectrum of the voiced speech signal can be considered as composed of a slowly varying (with respect to frequency) part representing the contour due to the contribution of the vocal tract, and a series of harmonics characterized by a periodic structure. The empirical mode decomposition algorithm provides an effective tool to separate these two components of the amplitude spectrum. Indeed, the empirical mode decomposition algorithm acts as a filter bank [19], so that the decomposition of the logarithm of the amplitude spectrum results in several IMFs which can be clustered into two categories (classes), where each class is associated with a part of the magnitude spectrum.

It was shown that, for speech signals, the variance of the IMFs decreases significantly after the fourth IMF as the IMF order increases [20]. It was found experimentally that, for the speech signal, the statistics of the IMFs are characterized by an energy peak at a high-order IMF. This property is used to select the optimal index that separates the harmonic component from the spectral envelope. The different steps of the method for separating the harmonic component and the spectral envelope are illustrated in figure 2 and may be summarized as follows [20]:
1. Decompose the logarithm of the magnitude spectrum of the weighted speech frame via the EMD algorithm (figure 2(a)).
2. Calculate the variance V of each IMF (figure 2(b)).
3. Identify the index m_p > 4 of the maximum of the variance V.
4. Identify the index m_t of the minimum of the variance V.
5. Calculate m_b as m_b = m_p - m_t.
6. Determine the index M = m_p + m_b.

Figure 2. Illustration of the separation of the harmonic component and the spectral envelope of a synthetic /a/ via empirical mode decomposition. (a) Log magnitude spectrum and IMF components. (b) IMF variances (m_t = 2, m_p = 4). (c) Estimated glottal source magnitude (amplitude in dB versus frequency in Hz).

If j < M, IMF_j belongs to the harmonic component. If j ≥ M, IMF_j belongs to the spectral envelope. The logarithm of the amplitude spectrum of the glottal source (figure 2(c)) is estimated as

log|E_w(f)| = Σ_{j=1}^{M-1} IMF_j(f)    (8)
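A compact sketch of the six selection steps and of equation (8) is given below. It assumes that the real parts of the IMFs of the log magnitude spectrum are already available as rows of a NumPy array; the function name, the min_order parameter and the restriction of the minimum search to orders below the variance peak are illustrative choices of this sketch, not details taken from the paper.

```python
import numpy as np

def estimate_glottal_log_magnitude(imfs, min_order=4):
    """Select the harmonic-class IMFs of log|Xw(f)| and sum them (equation (8)).
    imfs: array of shape (n_imfs, n_bins); row j-1 holds IMF_j."""
    v = imfs.var(axis=1)                          # step 2: variance of each IMF
    # step 3: 1-based index m_p of the variance maximum among higher-order IMFs
    mp = int(np.argmax(v[min_order - 1:])) + min_order
    # step 4: 1-based index m_t of the variance minimum (searched below the peak here)
    mt = int(np.argmin(v[:mp - 1])) + 1
    mb = mp - mt                                  # step 5: m_b = m_p - m_t
    M = mp + mb                                   # step 6: M = m_p + m_b
    log_Ew = imfs[:M - 1].sum(axis=0)             # equation (8): sum of IMF_1 ... IMF_{M-1}
    return log_Ew, M
```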

The glottal source phase is estimated by adding the phases of the IMFs belonging to the harmonic component. The logarithm of the glottal source spectrum is then estimated as

log E_w(f) = log|E_w(f)| + j∠E_w(f)    (9)

The bivariate EMD glottal source estimation method is illustrated in figure 3.

Figure 3. Glottal source estimation method based on bivariate empirical mode decomposition (block diagram: speech signal, FFT, bivariate EMD, IMFs, clustering method, low-pass filter, estimated glottal source).

IV. RESULTS AND DISCUSSION

The speech signal is non-stationary and is characterized by two essential parameters: the pitch and the formant frequencies. The pitch is a crucial parameter for the analysis and synthesis of speech; it allows determining whether a segment is voiced or unvoiced. The formant frequencies are the resonant frequencies of the vocal tract. In this work, the test signal is a 1-second synthetic vowel /a/ with a fundamental frequency f0 = 128 Hz, generated by the source-filter model of speech production. The sampling frequency of the speech signals used in the experiment is 20 kHz. The source-filter model consists of a source that generates a periodic pulse train, which models the glottal air flow, and a vocal tract modeled as an all-pole filter characterized by three poles [21][18] corresponding to the formant frequencies 800 Hz, 1200 Hz and 2870 Hz with bandwidths of 90 Hz, 110 Hz and 170 Hz, respectively. Lip radiation is modeled by a first-order differentiator R(z) = 1 - z^{-1}. The speech signal is divided into k non-overlapping frames using a Hamming window and the glottal source is estimated for each frame. To analyze the performance of the proposed method, the estimation of the glottal source is performed for different window sizes. The results are presented in figure 4. The bivariate empirical mode decomposition allows obtaining accurate estimates regardless of the window length.
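The synthetic test signal can be reproduced in outline with the Python sketch below. It follows the parameters stated above (f0 = 128 Hz, fs = 20 kHz, formants 800/1200/2870 Hz with bandwidths 90/110/170 Hz, lip radiation 1 - z^{-1}); the bandwidth-to-pole-radius mapping r = exp(-πB/fs) is a modelling assumption of this illustration rather than a detail given in the paper.

```python
import numpy as np
from scipy.signal import lfilter

fs, f0, dur = 20000, 128, 1.0                    # sampling rate (Hz), pitch (Hz), duration (s)
n = int(fs * dur)

# Glottal excitation: periodic pulse train modelling the glottal air flow
e = np.zeros(n)
e[::int(round(fs / f0))] = 1.0

# Vocal tract: all-pole filter built from three resonances (formants)
formants = [800.0, 1200.0, 2870.0]               # Hz
bandwidths = [90.0, 110.0, 170.0]                # Hz
a = np.array([1.0])
for F, B in zip(formants, bandwidths):
    r = np.exp(-np.pi * B / fs)                  # pole radius from the bandwidth
    theta = 2.0 * np.pi * F / fs                 # pole angle from the formant frequency
    a = np.convolve(a, [1.0, -2.0 * r * np.cos(theta), r ** 2])
x = lfilter([1.0], a, e)                         # vocal tract filtering

# Lip radiation modelled as a first-order differentiator R(z) = 1 - z^-1
s = lfilter([1.0, -1.0], [1.0], x)

# Non-overlapping Hamming-weighted analysis frames
frame_len = 1024
frames = [s[i:i + frame_len] * np.hamming(frame_len)
          for i in range(0, n - frame_len + 1, frame_len)]
```

Each such frame would then be processed as described in Section III to obtain one glottal source estimate per frame.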

Figure 4. Comparison between the true glottal source (dotted line) and the estimated glottal source based on bivariate EMD (solid line) for a frame length of 1024.

V. CONCLUSION

In this paper, the bivariate empirical mode decomposition algorithm has been proposed as an alternative to estimate the glottal source from the speech signal. The performance of the proposed method has been compared to the true model of the glottal source for different lengths of the weighting window. The proposed method is simple and systematic. The results show that the proposed method provides an accurate estimate of the glottal source for both short and long frames.

REFERENCES
[1] M. Plumpe, T. Quatieri, D. Reynolds. (1999). Modeling of the glottal flow derivative waveform with application to speaker identification. IEEE Trans. on Speech and Audio Processing, 7: 569-586.
[2] E. Moore, M. Clements, J. Peifer, L. Weisser. (2003). Investigating the role of glottal features in classifying clinical depression. Proc. of the 25th International Conference of the IEEE Engineering in Medicine and Biology Society, 3: 2849-2852.
[3] D. Yamada, N. Kitaoka, S. Nakagawa. (2002). Speech recognition using features based on glottal sound source. Trans. of the Institute of Electrical Engineers of Japan, 122(12): 2028-2034.
[4] T. Drugman, B. Bozkurt, T. Dutoit. (2009). Complex cepstrum-based decomposition of speech for glottal source estimation. Interspeech 2009: 116-119.
[5] P. Alku, E. Vilkman. (1994). Estimation of the glottal pulseform based on discrete all-pole modeling. Third International Conference on Spoken Language Processing: 1619-1622.
[6] P. Alku, J. Svec, E. Vilkman, F. Sram. (1992). Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech Communication, 11(2-3): 109-118.
[7] D. Brookes, D. Chan. (1994). Speaker characteristics from a glottal airflow model using glottal inverse filtering. Proc. Institute of Acoustics, 15: 501-508.
[8] B. Bozkurt, B. Doval, C. d'Alessandro, T. Dutoit. (2005). Zeros of Z-transform representation with application to source-filter separation in speech. IEEE Signal Processing Letters, 12(4).
[9] B. Doval, C. d'Alessandro, N. Henrich. (2003). The voice source as a causal/anticausal linear filter. Proceedings ISCA ITRW VOQUAL'03: 15-19.
[10] N. E. Huang et al. (1998). The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proc. R. Soc. London Ser. A, 454: 903-995.
[11] A. Kacha, F. Grenez, J. Schoentgen. (2012). Assessment of disordered voices using empirical mode decomposition in the log-spectral domain. Interspeech 2012.
[12] A. Kacha, F. Grenez, J. Schoentgen. (2013). Empirical mode decomposition-based spectral acoustic cues for disordered voices analysis. Interspeech 2013.
[13] A. Kacha, F. Grenez, J. Schoentgen. (2015). Multiband vocal dysperiodicities analysis using empirical mode decomposition in the log-spectral domain. Biomedical Signal Processing and Control, 17.
[14] M. Kemiha, A. Kacha, M. Boudjerda. (2014). Estimation de la source glottique par décomposition modale empirique. XXXe édition des Journées d'Étude sur la Parole (JEP'14), 23-27 June 2014, France.
[15] T. Tanaka, D. P. Mandic. (2006). Complex empirical mode decomposition. IEEE Signal Processing Letters, 14(2): 101-104.
[16] M. U. Altaf, T. Gautama, T. Tanaka, D. P. Mandic. (2007). Rotation invariant complex empirical mode decomposition. Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Honolulu, HI: 1009-1012.
[17] G. Rilling, P. Flandrin, P. Goncalves, J. M. Lilly. (2007). Bivariate empirical mode decomposition. IEEE Signal Processing Letters, 14: 936-939.
[18] J. H. Deller, J. G. Proakis, J. H. L. Hansen. (1993). Discrete-Time Processing of Speech Signals. Prentice-Hall.
[19] P. Flandrin, G. Rilling, P. Goncalves. (2004). Empirical mode decomposition as a filter bank. IEEE Signal Processing Letters, 11(2): 112-114.
[20] N. Chatlani, J. Soraghan. (2012). EMD-based filtering (EMDF) of low-frequency noise for speech enhancement. IEEE Trans. on Audio, Speech, and Language Processing, 20(4): 1158-1166.
[21] L. R. Rabiner. (1968). Digital-formant synthesizer for speech-synthesis studies. J. Acoust. Soc. Amer., 43(4): 822-828.
