
PITCH DETERMINATION AND SPEECH SEGMENTATION

USING THE DISCRETE WAVELET TRANSFORM


Christopher Wendt and Athina P. Petropulu
Electrical and Computer Engineering Department
Drexel University, Philadelphia PA 19104
e-mail: chris@cbis.ece.drexel.edu

ABSTRACT

Pitch determination and speech segmentation are two important parts of speech recognition and speech processing in general. This paper proposes a time-based event detection method for finding the pitch period of a speech signal. Based on the discrete wavelet transform, it detects voiced speech, which is local in frequency, and determines the pitch period. This method is computationally inexpensive, and through simulations and real speech experiments we show that it is both accurate and robust to noise.

1. INTRODUCTION

Pitch is a common parameter utilized in many types of speech and signal processing. It is defined as the perceived fundamental frequency of a signal and is used in many applications, including determination of intonation and emotional characteristics of speech and speaker identification, and it plays an integral part in many speech and signal compression schemes [2]. Speech segmentation is the determination between voiced and unvoiced speech. Voiced speech is produced using the vocal cords, is commonly modeled as a filtered train of impulses, and lies in a localized frequency range. Unvoiced speech is created by forcing air through a constriction along the vocal tract and is usually modeled by filtered white noise [2]. Pitch determination is the determination of the continuous pitch period during the voiced segments of speech. Therefore, by performing pitch determination, speech can be segmented into its voiced and unvoiced parts.

Pitch detection algorithms can be classified into two separate categories: spectral-domain based and time-domain based period detection. Spectral pitch detectors, such as the Cepstrum [3], Maximum Likelihood [4], and Autocorrelation [5] methods, estimate the pitch period of a signal directly from windowed segments of speech, applying a Fourier-type analysis to determine a pitch average. A time-based pitch detector, however, estimates the pitch period by determining the glottal closure instant (GCI) and measuring the time period between each event. Thus the speech signal is processed period-by-period [6,7].

The wavelet transform is a multiresolution, multi-scale analysis which has been shown to be very well suited for speech processing because of its similarity to how the human ear processes sound [1]. Using the wavelet transform, a speech signal can be analyzed at a specific scale corresponding to the range of human speech.

In [8], a wavelet-based pitch determination algorithm is proposed, based on Mallat's work on images [9]. Mallat showed that when analyzing images, the use of wavelet functions with derivative characteristics produces maximums in the wavelet transform across many coincident scales along sharp edges. When a sharp change of intensity occurs in an image, peaks occur in corresponding positions throughout many scales. In [8], Kadambe and Boudreaux-Bartels used the assumption that when a GCI occurs in a speech waveform, maximums also occur in the adjacent scales of the wavelet transform. For reasons that will be explained in this paper, and through extensive experimentation, we found that finding corresponding maximums was not generally reliable. We propose a similar approach which improves reliability and further simplifies computation. In contrast with [8], which chooses maximums if they occur in two adjacent wavelet coefficient scales, we chose to utilize a single derivative filtering function defined to contain a specific bandwidth of voiced speech. This wavelet function, when convolved with a speech signal, produces a filtered signal containing well-defined local maximums where GCIs occur in the speech signal. This method provides a dramatic simplification in processing, utilizing only convolution and requiring only one set of coefficients to analyze, and it is robust to noise.
0-7803-3073-0/96/$5.00 © 1996 IEEE
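The period-by-period idea described in the introduction (filter the speech so that each GCI produces a clear maximum, then measure the spacing between maximums) can be sketched as follows. This is a minimal illustration, not the paper's exact filter: the synthetic signal, the first-difference filter standing in for the derivative wavelet, and the helper names are all assumptions made for the example.

```python
import math

def local_maxima(x, threshold):
    """Indices of samples above threshold that exceed both neighbours."""
    return [i for i in range(1, len(x) - 1)
            if x[i] > threshold and x[i] > x[i - 1] and x[i] >= x[i + 1]]

def pitch_from_peaks(peaks, fs):
    """Average pitch (Hz) from the spacing between consecutive peaks."""
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    return fs / (sum(gaps) / len(gaps))

# Synthetic "voiced" signal: one sharp excitation every 80 samples
# (100 Hz pitch at fs = 8000 Hz), smoothed to imitate a glottal pulse.
fs, period = 8000, 80
excitation = [1.0 if n % period == 0 else 0.0 for n in range(800)]
speech = [sum(excitation[n - k] * math.exp(-k / 8.0)
              for k in range(min(n + 1, 40))) for n in range(800)]

# First-difference filter: a crude stand-in for the derivative wavelet,
# producing a maximum at each abrupt (GCI-like) change.
filtered = [speech[n] - speech[n - 1] for n in range(1, len(speech))]

peaks = local_maxima(filtered, threshold=0.5)
print(round(pitch_from_peaks(peaks, fs)))  # prints 100
```

The same peak-spacing measurement applied period-by-period, rather than averaged, yields the running pitch contour used for segmentation.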

2. THE WAVELET TRANSFORM

The continuous wavelet transform of a signal is defined as

X_DW(k, n) = a^(-k/2) ∫ x(τ) ψ(a^(-k)(nT - τ)) dτ,    (1)

where x(τ) is the signal and ψ(a^(-k)(nT - τ)) is the dilated and translated wavelet function. X_DW(k, n) are the wavelet transform coefficients, which represent the wavelet transform at each scale k. If k and n are limited to integer values, eq. (1) becomes the discrete wavelet transform. Therefore, given a bandpass wavelet function ψ(t) and a finite discrete sampled signal x(t), the discrete wavelet transform acts as a constant-Q filter bank and splits the signal into bandpass components. This is useful in the determination of characteristics of the signal which are local in frequency.
The frequency range of voiced speech for men and women lies in the range 30-500 Hz. Unvoiced speech is usually modeled as filtered white noise consisting of mostly high-frequency components. The wavelet transform applies very naturally here because of its ability to separate voiced and unvoiced speech into different scales. Therefore, using the scales containing voiced speech information, analysis can be performed for segmentation and pitch determination.
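To illustrate the constant-Q splitting, the approximate frequency band covered by each dyadic scale can be tabulated. The band edges Fs/2^(k+1) to Fs/2^k are an idealized octave-band assumption (real wavelet filters overlap), and the helper `band` is hypothetical, not from the paper:

```python
# Each dyadic scale k halves the band of the previous one (constant-Q):
# scale k spans roughly (Fs / 2^(k+1), Fs / 2^k) under the ideal-octave
# assumption.
def band(fs, k):
    return fs / 2 ** (k + 1), fs / 2 ** k

fs = 8192  # sampling rate used for the recordings in this paper
for k in range(1, 9):
    lo, hi = band(fs, k)
    tag = "overlaps voiced range" if lo < 500 and hi > 30 else ""
    print(f"scale {k}: {lo:7.1f} - {hi:7.1f} Hz  {tag}")
```

At Fs = 8192 Hz, scales 4 and deeper fall at or below 512 Hz, which is why only a few scales need to be examined for voiced speech.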

3. THE PROPOSED METHOD

The choice of the mother wavelet, ψ(t), is important, because it defines the characteristics of the wavelet transform and its coefficients. A voiced speech segment is often modeled as a filtered impulse train, where the period between each impulse represents a pitch period. During each period of voiced speech the glottis is excited and a GCI occurs. In a speech signal this phenomenon corresponds to a zero-crossing in the waveform. In order to detect a zero-crossing, or GCI, a derivative function can be used: if a speech signal is filtered by a derivative function, a maximum will occur at each zero-crossing. Thus, the time period between each maximum represents the pitch period of the signal at that moment.

In order to construct a filtering function, we propose the use of a wavelet with the derivative properties described by Mallat in [9] that also combines the bandwidth properties of the wavelet transform at different scales. Let ψ(t) be the mother wavelet with derivative properties. The functions

ψ_k(t) = 2^(k/2) ψ(2^k t)    (2)

φ_k(t) = 2^(k/2) φ(2^k t)    (3)

represent the wavelet and scaling functions, respectively, at each scale k, where φ(t) is a lowpass function and is the conjugate mirror filter of ψ(t), which is a highpass function. Since the range of voiced speech is between 30 and 500 Hz, the final filtering function constructed should have a similar bandwidth. Therefore, we determine the lowpass scaling function for the scale k_a corresponding to approximately 30 Hz, i.e.,

(4)

where F_s is the sampling rate of the speech signal, and the highpass wavelet function for the scale k_b corresponding to approximately 500 Hz, i.e.,

(5)

The filtering function, p(t), is obtained as

p(t) = φ_{k_a}(t) * ψ_{k_b}(t),    (6)

where * is linear convolution and k_a and k_b were given in (4) and (5).

4. COMPARISON OF WAVELET METHODS

The method we have proposed utilizes a single filtering function chosen to have both derivative characteristics and a bandwidth defined by voiced speech. The method proposed in [8], while using a derivative function as a mother wavelet, differs in that it uses multiple scales in its analysis. Consecutive scale coefficients are searched for maximums occurring at or around the same positions. Although the differences between the methods are slight, they become very important when working with real speech signals.

It is well known that real voiced speech is not perfectly periodic and includes many nonlinearities, e.g., glides from one phoneme to another and physical and emotional effects on vocal quality. Therefore, while we found that the method proposed in [8] produced almost perfect results for synthesized speech signals, it failed when tested with real speech recordings. Fig. 1 shows a segment of voiced speech, /o/, sampled at 8.192 kHz. In Fig. 2, the segment convolved with p(t) (see (6)) is shown, and Fig. 3 shows two coinciding scales used for the method described in [8]. Figs. 4, 5, and 6 are as above for a different voiced speech segment, /a/. The wavelet used in both Figs. 3 and 6 is the cubic spline, which was determined to give the most accurate results in [8]. In Figs. 2 and 5, we can clearly see that determination of local maximums, and therefore pitch, can be easily performed. However, using the method proposed in [8], inspection of the wavelet coefficients shows that different phonemes produce unpredictability in the alignment of local maximums between scales. Another concern is the correct location of the GCI itself. While the filtered speech signal gives clear peaks which can easily be determined, even the lower scale in Figs. 3 and 6 shows inaccuracies in the maximums detected.
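The single-filter construction of (6) can be imitated with Haar-style components: a lowpass boxcar convolved with a highpass +1/-1 "derivative" shape, applied to a signal with an abrupt GCI-like jump. This is a sketch under stated assumptions: the filter lengths (8 and 16) are illustrative, not the paper's k_a and k_b scales, and the discrete boxcar and step shapes only mimic the Haar scaling and wavelet functions.

```python
def convolve(a, b):
    """Plain linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# Haar-style components (lengths are illustrative assumptions):
def haar_lowpass(length):
    return [1.0 / length] * length          # boxcar, unit DC gain

def haar_highpass(length):
    half = length // 2
    return [1.0 / half] * half + [-1.0 / half] * half  # zero DC gain

# Eq. (6) analogue: p = lowpass * highpass (linear convolution).
p = convolve(haar_lowpass(8), haar_highpass(16))

# Filtering a signal with an abrupt, GCI-like jump at n = 40 yields a
# well-defined local maximum just after the jump.
signal = [0.0] * 40 + [1.0] * 40
response = convolve(signal, p)
peak = max(range(len(response)), key=response.__getitem__)
print(peak)  # index of the maximum, shortly after the jump at n = 40
```

Because the highpass component has zero DC gain, p is bandpass, which is the property the paper relies on for noise rejection.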

5. RESULTS AND DISCUSSION

Through experimentation with different wavelets and both synthesized and real speech signals, we determined that the Haar wavelet performed best in segmentation and pitch determination. Fig. 7 shows the Haar wavelet lowpass and highpass functions at scales 3 and 6, respectively, and the resulting filtering function given by (6). Fig. 8(a) shows a phrase spoken by an American male. The phrase is "Going to the zoo" and it was sampled at 8192 Hz. Fig. 8(b) shows the pitch determined after searching the filtered speech signal. The same phrase was also used to test the effect of noise. Zero-mean Gaussian noise was added to the speech signal and the pitch detection methods were performed. Fig. 8(c) shows the results for the noisy signal with a signal-to-noise ratio of approximately 0 dB. Due to the bandpass character of the filtering function, it is expected that high-frequency noise will be rejected.

6. CONCLUSIONS

We have described a time-based speech segmentation and pitch determination method based on the discrete wavelet transform. Through comparisons with other methods, we have shown that simplifying the wavelet analysis to a single filtering function lowers computational complexity while improving speech segmentation and pitch determination performance on real speech signals.

REFERENCES

[1] P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice Hall, Englewood Cliffs, NJ, 1993.
[2] J. R. Deller Jr., J. G. Proakis, J. H. L. Hansen, Discrete-Time Processing of Speech Signals, Macmillan, New York, 1993.
[3] A. M. Noll, "Cepstrum Pitch Determination," J. Acoust. Soc. Amer., vol. 41, no. 2, pp. 293-309, 1967.
[4] J. D. Wise, J. R. Caprio, and T. W. Parks, "Maximum likelihood pitch estimation," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 418-423, 1976.
[5] M. M. Sondhi, "New methods of pitch extraction," IEEE Trans. Audio Electroacoust., vol. AU-16, pp. 262-266, June 1968.
[6] H. W. Strube, "Determination of the instant of glottal closure from the speech wave," J. Acoust. Soc. Amer., vol. 56, no. 5, pp. 1625-1629, 1974.
[7] Y. M. Cheng, D. O'Shaughnessy, "Automatic and Reliable Estimation of Glottal Closure Instant and Period," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, no. 12, pp. 1805-1815, 1989.
[8] S. Kadambe, G. Faye Boudreaux-Bartels, "Application of the Wavelet Transform for Pitch Detection of Speech Signals," IEEE Trans. on Info. Theory, vol. 38, no. 2, pp. 917-924, March 1992.
[9] S. G. Mallat, S. Zhong, "Characterization of signals from multiscale edges," IEEE Trans. on Patt. Analy. and Mach. Intell., vol. 14, pp. 710-732, July 1992.
[10] N. J. Fliege, Multirate Digital Signal Processing, John Wiley & Sons, New York, 1994.

Figure 1. A segment of the phoneme /o/, Fs=8192 Hz.
Figure 2. Speech phoneme filtered with the proposed method. Lines indicate local maximums.
Figure 3. Scales 4 and 5 of the wavelet coefficients using the cubic spline wavelet and the method described in [8].
Figure 4. A segment of the phoneme /a/, Fs=8192 Hz.
Figure 5. Filtered speech phoneme.
Figure 6. Scales 4 and 5 of the method described in [8].
Figure 7. a) Lowpass scaling function, k=3. b) Highpass wavelet function, k=6. c) Derivative wavelet filter.
Figure 8. a) Speech signal, "Going to the zoo". b) Pitch determined using filtered speech. c) Pitch determined using filtered speech, SNR=0 dB.