
PITCH DETERMINATION AND SPEECH SEGMENTATION

USING THE DISCRETE WAVELET TRANSFORM


Christopher Wendt and Athina P. Petropulu
Electrical and Computer Engineering Department
Drexel University, Philadelphia PA 19104
e-mail: chris@cbis.ece.drexel.edu

ABSTRACT

Pitch determination and speech segmentation are two important parts of speech recognition and speech processing in general. This paper proposes a time-based event detection method for finding the pitch period of a speech signal. Based on the discrete wavelet transform, it detects voiced speech, which is local in frequency, and determines the pitch period. This method is computationally inexpensive, and through simulations and real speech experiments we show that it is both accurate and robust to noise.

1. INTRODUCTION

Pitch is a common parameter utilized in many types of speech and signal processing. It is defined as the perceived fundamental frequency of a signal and is used in many applications, including determination of intonation and emotional characteristics of speech and speaker identification, and it plays an integral part in many speech and signal compression schemes [2]. Speech segmentation is the determination between voiced and unvoiced speech. Voiced speech is produced using the vocal cords, is commonly modeled as a filtered train of impulses, and lies in a localized frequency range. Unvoiced speech is created by forcing air through a constriction along the vocal tract and is usually modeled by filtered white noise [2]. Pitch determination is the determination of the continuous pitch period during the voiced segments of speech. Therefore, by performing pitch determination, speech can be segmented into its voiced and unvoiced parts.

Pitch detection algorithms can be classified into two separate categories: spectral-domain based and time-domain based period detection. Spectral pitch detectors, such as the Cepstrum [3], Maximum Likelihood [4], and Autocorrelation [5] methods, estimate the pitch period of a signal directly from windowed segments of speech, applying a Fourier-type analysis to determine a pitch average. A time-based pitch detector, however, estimates the pitch period by determining the glottal closure instant (GCI) and measuring the time period between each event. Thus the speech signal is processed period-by-period [6,7].

The wavelet transform is a multiresolution, multi-scale analysis which has been shown to be very well suited for speech processing because of its similarity to how the human ear processes sound [1]. Using the wavelet transform, a speech signal can be analyzed at a specific scale corresponding to the range of human speech.

In [8], a wavelet-based pitch determination algorithm is proposed, based on Mallat's work on images [9]. Mallat showed that when analyzing images, the use of wavelet functions with derivative characteristics produces maximums in the wavelet transform across many coincident scales along sharp edges. When a sharp change of intensity occurs in an image, peaks occur in corresponding positions throughout many scales. In [8], Kadambe and Boudreaux-Bartels used the assumption that when a GCI occurs in a speech waveform, maximums also occur in the adjacent scales of the wavelet transform. For reasons that will be explained in this paper, and through extensive experimentation, we found that finding corresponding maximums was not generally reliable. We propose a similar approach which improves reliability and further simplifies computation. In contrast with [8], which chooses maximums if they occur in two adjacent wavelet coefficient scales, we chose to utilize a single derivative filtering function defined to contain a specific bandwidth of voiced speech. This wavelet function, when convolved with a speech signal, produces a filtered signal containing well-defined local maximums where GCIs occur in the speech signal. This method provides a dramatic simplification in processing, utilizing only convolution and requiring only one set of coefficients to analyze, and it is robust to noise.
0-7803-3073-0/96/$5.00 © 1996 IEEE
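The period-by-period idea described in the introduction (filter the speech so that each GCI produces a clear maximum, then measure the spacing between maximums) can be sketched as follows. This is a minimal illustration, not the paper's exact filter: the synthetic signal, the first-difference filter standing in for the derivative wavelet, and the helper names are all assumptions made for the example.

```python
import math

def local_maxima(x, threshold):
    """Indices of samples above threshold that exceed both neighbours."""
    return [i for i in range(1, len(x) - 1)
            if x[i] > threshold and x[i] > x[i - 1] and x[i] >= x[i + 1]]

def pitch_from_peaks(peaks, fs):
    """Average pitch (Hz) from the spacing between consecutive peaks."""
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    return fs / (sum(gaps) / len(gaps))

# Synthetic "voiced" signal: one sharp excitation every 80 samples
# (100 Hz pitch at fs = 8000 Hz), smoothed to imitate a glottal pulse.
fs, period = 8000, 80
excitation = [1.0 if n % period == 0 else 0.0 for n in range(800)]
speech = [sum(excitation[n - k] * math.exp(-k / 8.0)
              for k in range(min(n + 1, 40))) for n in range(800)]

# First-difference filter: a crude stand-in for the derivative wavelet,
# producing a maximum at each abrupt (GCI-like) change.
filtered = [speech[n] - speech[n - 1] for n in range(1, len(speech))]

peaks = local_maxima(filtered, threshold=0.5)
print(round(pitch_from_peaks(peaks, fs)))  # prints 100
```

The same peak-spacing measurement applied period-by-period, rather than averaged, yields the running pitch contour used for segmentation.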

2. THE WAVELET TRANSFORM

The continuous wavelet transform of a signal is defined as

X_DW(k, n) = a^(-k/2) ∫ x(τ) ψ(a^(-k)(nT - τ)) dτ,    (1)

where x(τ) is the signal and ψ(a^(-k)(nT - τ)) is the dilated and translated wavelet function. X_DW(k, n) are the wavelet transform coefficients, which represent the wavelet transform at each scale k. If k and n are limited to integer values, eq. (1) becomes the discrete wavelet transform. Therefore, given a bandpass wavelet function ψ(t) and a finite discrete sampled signal x(t), the discrete wavelet transform acts as a constant-Q filter bank and splits the signal into bandpass components. This is useful in the determination of characteristics of the signal which are local in frequency.
The frequency range of voiced speech for men and women lies in the range 30-500 Hz. Unvoiced speech is usually modeled as filtered white noise consisting of mostly high-frequency components. The wavelet transform applies very naturally here because of its ability to separate voiced and unvoiced speech into different scales. Therefore, using the scales containing voiced speech information, analysis can be performed for segmentation and pitch determination.
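To illustrate the constant-Q splitting, the approximate frequency band covered by each dyadic scale can be tabulated. The band edges Fs/2^(k+1) to Fs/2^k are an idealized octave-band assumption (real wavelet filters overlap), and the helper `band` is hypothetical, not from the paper:

```python
# Each dyadic scale k halves the band of the previous one (constant-Q):
# scale k spans roughly (Fs / 2^(k+1), Fs / 2^k) under the ideal-octave
# assumption.
def band(fs, k):
    return fs / 2 ** (k + 1), fs / 2 ** k

fs = 8192  # sampling rate used for the recordings in this paper
for k in range(1, 9):
    lo, hi = band(fs, k)
    tag = "overlaps voiced range" if lo < 500 and hi > 30 else ""
    print(f"scale {k}: {lo:7.1f} - {hi:7.1f} Hz  {tag}")
```

At Fs = 8192 Hz, scales 4 and deeper fall at or below 512 Hz, which is why only a few scales need to be examined for voiced speech.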

3. THE PROPOSED METHOD

The choice of the mother wavelet, ψ(t), is important, because it defines the characteristics of the wavelet transform and its coefficients. A voiced speech segment is often modeled as a filtered impulse train, where the period between each impulse represents a pitch period. During each period of voiced speech the glottis is excited and a GCI occurs. In a speech signal this phenomenon corresponds to a zero-crossing in the waveform. In order to detect a zero-crossing, or GCI, a derivative function can be used: if a speech signal is filtered by a derivative function, a maximum will occur at each zero-crossing. Thus, the time period between each maximum represents the pitch period of the signal at that moment.

In order to construct a filtering function, we propose the use of a wavelet with the derivative properties described by Mallat in [9] that also combines the bandwidth properties of the wavelet transform at different scales. Let ψ(t) be the mother wavelet with derivative properties. The functions

ψ_k(t) = 2^(k/2) ψ(2^k t)    (2)

φ_k(t) = 2^(k/2) φ(2^k t)    (3)

represent the wavelet and scaling functions, respectively, at each scale k, where φ(t) is a lowpass function and is the conjugate mirror filter of ψ(t), which is a highpass function. Since the range of voiced speech is between 30 and 500 Hz, the final filtering function constructed should have a similar bandwidth. Therefore, we determine the lowpass scaling function for the scale k_a corresponding to approximately 30 Hz, i.e.,

(4)

where F_s is the sampling rate of the speech signal, and the highpass wavelet function for the scale k_b corresponding to approximately 500 Hz, i.e.,

(5)

The filtering function, p(t), is obtained as

p(t) = φ_{k_a}(t) * ψ_{k_b}(t),    (6)

where * is linear convolution and k_a and k_b were given in (4) and (5).

4. COMPARISON OF WAVELET METHODS

The method we have proposed utilizes a single filtering function chosen to have both derivative characteristics and a bandwidth defined by voiced speech. The method proposed in [8], while using a derivative function as a mother wavelet, differs in that it uses multiple scales in its analysis. Consecutive scale coefficients are searched for maximums occurring at or around the same positions. Although the differences between the methods are slight, they become very important when working with real speech signals.

It is well known that real voiced speech is not perfectly periodic and includes many nonlinearities, e.g., glides from one phoneme to another and physical and emotional effects on vocal quality. Therefore, while we found that the method proposed in [8] produced almost perfect results for synthesized speech signals, it failed when tested with real speech recordings. Fig. 1 shows a segment of voiced speech, /o/, sampled at 8.192 kHz. In Fig. 2, the segment convolved with p(t) (see (6)) is shown, and Fig. 3 shows two coinciding scales used for the method described in [8]. Figs. 4, 5, and 6 are as above for a different voiced speech segment, /a/. The wavelet used in both Figs. 3 and 6 is the cubic spline, which was determined to give the most accurate results in [8]. In Figs. 2 and 5, we can clearly see that determination of local maximums, and therefore pitch, can be easily performed. However, using the method proposed in [8], inspection of the wavelet coefficients shows that different phonemes produce unpredictability in the alignment of local maximums between scales. Another concern is the correct location of the GCI itself. While the filtered speech signal gives clear peaks which can easily be determined, even the lower scale in Figs. 3 and 6 shows inaccuracies in the maximums detected.
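The single-filter construction of (6) can be imitated with Haar-style components: a lowpass boxcar convolved with a highpass +1/-1 "derivative" shape, applied to a signal with an abrupt GCI-like jump. This is a sketch under stated assumptions: the filter lengths (8 and 16) are illustrative, not the paper's k_a and k_b scales, and the discrete boxcar and step shapes only mimic the Haar scaling and wavelet functions.

```python
def convolve(a, b):
    """Plain linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# Haar-style components (lengths are illustrative assumptions):
def haar_lowpass(length):
    return [1.0 / length] * length          # boxcar, unit DC gain

def haar_highpass(length):
    half = length // 2
    return [1.0 / half] * half + [-1.0 / half] * half  # zero DC gain

# Eq. (6) analogue: p = lowpass * highpass (linear convolution).
p = convolve(haar_lowpass(8), haar_highpass(16))

# Filtering a signal with an abrupt, GCI-like jump at n = 40 yields a
# well-defined local maximum just after the jump.
signal = [0.0] * 40 + [1.0] * 40
response = convolve(signal, p)
peak = max(range(len(response)), key=response.__getitem__)
print(peak)  # index of the maximum, shortly after the jump at n = 40
```

Because the highpass component has zero DC gain, p is bandpass, which is the property the paper relies on for noise rejection.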

5. RESULTS AND DISCUSSION

Through experimentation with different wavelets and both synthesized and real speech signals, we determined that the Haar wavelet performed best in segmentation and pitch determination. Fig. 7 shows the Haar wavelet lowpass and highpass functions at scales 3 and 6, respectively, and the resulting filtering function given by (6). Fig. 8(a) shows a phrase spoken by an American male. The phrase is "Going to the zoo" and it was sampled at 8192 Hz. Fig. 8(b) shows the pitch determined after searching the filtered speech signal. The same phrase was also used to test the effect of noise. Zero-mean Gaussian noise was added to the speech signal and the pitch detection methods were performed. Fig. 8(c) shows the results for the noisy signal with a signal-to-noise ratio of approximately 0 dB. Due to the bandpass character of the filtering function, it is expected that high-frequency noise will be rejected.

6. CONCLUSIONS

We have described a time-based speech segmentation and pitch determination method based on the discrete wavelet transform. Through comparisons with other methods, we have shown that simplifying the wavelet analysis to a single filtering function lowers computational complexity while improving speech segmentation and pitch determination performance on real speech signals.

REFERENCES

[1] P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice Hall, Englewood Cliffs, NJ, 1993.
[2] J. R. Deller Jr., J. G. Proakis, J. H. L. Hansen, Discrete-Time Processing of Speech Signals, Macmillan, New York, 1993.
[3] A. M. Noll, "Cepstrum Pitch Determination," J. Acoust. Soc. Amer., vol. 41, no. 2, pp. 293-309, 1967.
[4] J. D. Wise, J. R. Caprio, and T. W. Parks, "Maximum likelihood pitch estimation," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 418-423, 1976.
[5] M. M. Sondhi, "New methods of pitch extraction," IEEE Trans. Audio Electroacoust., vol. AU-16, pp. 262-266, June 1968.
[6] H. W. Strube, "Determination of the instant of glottal closure from the speech wave," J. Acoust. Soc. Amer., vol. 56, no. 5, pp. 1625-1629, 1974.
[7] Y. M. Cheng, D. O'Shaughnessy, "Automatic and Reliable Estimation of Glottal Closure Instant and Period," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, no. 12, pp. 1805-1815, 1989.
[8] S. Kadambe, G. Faye Boudreaux-Bartels, "Application of the Wavelet Transform for Pitch Detection of Speech Signals," IEEE Trans. on Info. Theory, vol. 38, no. 2, pp. 917-924, March 1992.
[9] S. G. Mallat, S. Zhong, "Characterization of signals from multiscale edges," IEEE Trans. on Patt. Analy. and Mach. Intell., vol. 14, pp. 710-732, July 1992.
[10] N. J. Fliege, Multirate Digital Signal Processing, John Wiley & Sons, New York, 1994.

Figure 1. A segment of the phoneme /o/, Fs=8192 Hz.
Figure 2. Speech phoneme filtered with the proposed method. Lines indicate local maximums.
Figure 3. Scales 4 and 5 of the wavelet coefficients using the cubic spline wavelet and the method described in [8].
Figure 4. A segment of the phoneme /a/, Fs=8192 Hz.
Figure 5. Filtered speech phoneme.
Figure 6. Scales 4 and 5 of the method described in [8].
Figure 7. a) Lowpass scaling function, k=3. b) Highpass wavelet function, k=6. c) Derivative wavelet filter.
Figure 8. a) Speech signal, "Going to the zoo". b) Pitch determined using filtered speech. c) Pitch determined using filtered speech, SNR=0 dB.